Next, to compare the influence of different image-wise methods on the whole model, we fixed the patch-wise phase to use Google's Inception-V3 and started training from scratch. The experimental results show that the SVM method achieves better results than the majority voting method.
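For reference, the majority voting baseline can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: the function name, the order of the class list, and the tie-breaking rule (ties go to the class listed last) are assumptions introduced here. The SVM alternative, which learns the image label from patch-level outputs, is not shown.

```python
from collections import Counter

# Class names as used in this paper; the ordering is an assumption
# used only to break ties in favour of the class listed last.
CLASSES = ["normal", "benign", "in situ", "invasive"]


def majority_vote(patch_labels):
    """Return the image-level label as the most frequent patch-level label."""
    if not patch_labels:
        raise ValueError("need at least one patch prediction")
    unknown = set(patch_labels) - set(CLASSES)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    counts = Counter(patch_labels)
    best = max(counts.values())
    # All classes that reach the highest count, kept in CLASSES order.
    tied = [c for c in CLASSES if counts.get(c, 0) == best]
    # Hypothetical tie-breaking rule: pick the last tied class.
    return tied[-1]


if __name__ == "__main__":
    patches = ["benign", "benign", "normal", "in situ", "benign"]
    print(majority_vote(patches))
```

Because every patch contributes one equal vote, this rule ignores how confident each patch prediction is, which is one plausible reason a learned image-wise classifier can do better.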
Then, we compared the influence of pretraining the patch-wise CNN model on the overall result. We fixed the image-wise phase to the SVM method and, in the patch-wise phase, compared Google's Inception-V3 model pretrained on ImageNet with the same model trained from scratch. As seen from the experimental results, the pretrained model achieved better results. Because our dataset is relatively small compared to natural image datasets, pretraining provides a better initialization and helps the model converge. This result has been shown many times in other areas of medical imaging.
Furthermore, we verified the overall effectiveness of our proposed algorithm. In the patch-wise phase, we used Google's Inception-V3 with fine-tuning, and in the image-wise phase we used a BLSTM with 4 layers. This model achieved the best average accuracy of 90.5% on the test set. Finally, we used richer multilevel features in the patch-wise phase, and the average classification accuracy was further improved by 0.8%.
5.4. Confusion matrix and AUC
Fig. 6 presents the confusion matrix of the predictions on the test set, obtained with a model trained on 400 images. As in the experimental setup of Section 5.3, we randomly selected 400 images for training and tested on another 100 images from our released dataset, for comparability with the previous method. The confusion matrix in Fig. 6 shows that the normal, benign, in situ and invasive categories all obtain high classification accuracy. Specifically, the in situ and invasive categories obtain classification accuracies of 95% and 97%, respectively. In comparison, however, the classification accuracies of the normal and benign categories are only 86% and 87%. More significantly, 10% of normal images are misclassified as benign and 10% of benign images are misclassified as normal. The same phenomenon can be seen in the Receiver Operating Characteristic (ROC) curves and Area Under the Curve (AUC) values. Fig. 7 shows a mean AUC value of 89.25%, corresponding to 85%, 86%, 92% and 94% for the four classes based on receiver operating characteristic analysis.
Fig. 6. Four-class confusion matrix using a dataset containing 400 images. The diagonal elements represent the normalized ratio for which the predicted label is equal to the true label, while off-diagonal elements are those that are mislabeled.
Fig. 7. Receiver Operating Characteristic (ROC) curves and Area Under the Curve (AUC) values for normal tissue, benign carcinoma, in situ carcinoma, and invasive carcinoma. The experimental results were obtained by using our proposed method on the same dataset used in Fig. 6.
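To make the two quantities in this subsection concrete, the sketch below row-normalizes a confusion matrix, so that each diagonal entry is the fraction of a true class that was predicted correctly, and averages the four per-class AUC values quoted in the text. The function name and the toy labels are hypothetical; only the four AUC values are taken from the text.

```python
def normalized_confusion_matrix(y_true, y_pred, n_classes):
    """Row-normalized confusion matrix: entry [i][j] is the fraction of
    samples with true class i that were predicted as class j."""
    counts = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        counts[t][p] += 1
    normalized = []
    for row in counts:
        total = sum(row)
        # A class with no samples yields a row of zeros instead of dividing by 0.
        normalized.append([c / total if total else 0.0 for c in row])
    return normalized


# Per-class AUC values (in percent) as quoted in the text for the four classes.
per_class_auc = [85.0, 86.0, 92.0, 94.0]
mean_auc = sum(per_class_auc) / len(per_class_auc)

if __name__ == "__main__":
    # Toy labels (hypothetical), two classes only, to show the normalization.
    y_true = [0, 0, 1, 1]
    y_pred = [0, 1, 1, 1]
    for row in normalized_confusion_matrix(y_true, y_pred, 2):
        print(row)
    print(f"mean AUC = {mean_auc:.2f}%")
```

The unweighted mean of 85%, 86%, 92% and 94% is 89.25%, which matches the reported mean AUC.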
In general, the two experimental results shown in Figs. 6 and 7 indicate that the classification results for benign and normal are lower than those for in situ carcinoma and invasive carcinoma. The reason is that the subclasses of benign and normal are not only diverse but also closely related to the age of the patient. Therefore, with a limited number of pathological images, it is difficult to cover enough features of benign and normal images, and the final classification result for these classes is relatively low. To alleviate this problem, we deliberately collected different subclasses of benign pathological images spanning different age groups for our dataset. Because the majority of patients who go to the hospital for pathological examination have abnormal findings, there are very few normal clinical records. Therefore, we focused only on the benign categories. We show the advantages of our dataset in the following experiment section.
5.5. Sensitivity comparison between different datasets
To illustrate the advantages of our proposed dataset, especially the diversity of benign pathological images, we performed experiments on different datasets using the same method. Fig. 8 compares the average image-wise sensitivity of the ‘Google's Inception-V3 + SVM’ method on the Bioimaging2015 dataset and on our dataset. The figure shows that with the larger dataset the sensitivity of every class is improved; in particular, the classification sensitivity of benign images is significantly improved from 68.7% to 85.1%. Many previous works have described the problem that the classification sensitivity of benign images is relatively low. For example, Araújo et al. report an image-wise sensitivity of only 66.7% for benign images, compared with 77.8%, 77.8% and 88.9% for normal, in situ and invasive images, respectively. This is because the characteristics of benign images are not salient and benign images can be subdivided into many subcategories. Moreover, their characteristics show greater diversity with age.
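Sensitivity here is the fraction of images of a given class that are correctly assigned to that class, TP / (TP + FN). The sketch below computes it from counts. The counts are hypothetical and chosen only for illustration: the percentages quoted above happen to coincide with ratios over nine images per class (6/9, 7/9, 8/9), but the text does not state the per-class test-set size.

```python
def sensitivity(true_positives, false_negatives):
    """Sensitivity (recall) of one class: TP / (TP + FN)."""
    total = true_positives + false_negatives
    if total == 0:
        raise ValueError("class has no samples")
    return true_positives / total


# Hypothetical (TP, FN) counts per class, assuming nine test images per class.
# These are illustrative values, not data taken from the paper.
hypothetical_counts = {
    "normal": (7, 2),
    "benign": (6, 3),
    "in situ": (7, 2),
    "invasive": (8, 1),
}

if __name__ == "__main__":
    for name, (tp, fn) in hypothetical_counts.items():
        print(f"{name}: {100 * sensitivity(tp, fn):.1f}%")
```

With so few images per class, a single misclassified image shifts sensitivity by about 11 percentage points under this assumption, which is one reason a larger and more diverse benign set matters for a stable estimate.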