
Variable Selection Models To Indicate Tumour Malignancy

  • Writer: Shahzaib Ali
  • Jun 23, 2021
  • 3 min read

Updated: Aug 18, 2021

Lasso Regression Model

The plot (Figure 2.1, left) shows the cross-validation of the lasso regression model. The x-axis is log(λ) and the y-axis is the area under the curve (AUC), a measure of classification performance. The numbers along the top give the number of non-zero coefficients at each λ. The two dashed lines mark λ.min, the value giving the highest AUC, and λ.1se, the largest λ whose AUC is within one standard error of that at λ.min. The red dots show the mean cross-validated AUC at each λ, with error bars indicating its variability across folds. In this analysis, the λ.min model was selected for the whole dataset to determine the prioritized coefficients. λ.min was determined to be 0.001199285; reruns in R give values close to, but not identical to, this one because the cross-validation folds are assigned randomly. The AUC at λ.min is 0.9978. λ.1se is 0.017809, with an AUC of 0.994.
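The workflow above was run in R (glmnet's cv.glmnet). As a rough Python stand-in, the sketch below cross-validates an L1-penalized logistic regression on the same Wisconsin breast cancer data, selecting the penalty by AUC; parameter names (`Cs`, `scoring`) are sklearn's, not the post's, and the selected value will not match the λ.min reported above.

```python
# Hedged sketch: cross-validated lasso (L1) logistic regression selected by AUC,
# a Python analogue of the R cv.glmnet workflow described in the post.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegressionCV

X, y = load_breast_cancer(return_X_y=True)   # the WDBC data used in the post
X = StandardScaler().fit_transform(X)        # standardize before penalizing

# Try 20 penalty strengths; pick the one maximizing 5-fold cross-validated AUC
lasso_cv = LogisticRegressionCV(
    Cs=20, penalty="l1", solver="liblinear",
    scoring="roc_auc", cv=5, max_iter=5000,
)
lasso_cv.fit(X, y)

best_C = lasso_cv.C_[0]                      # sklearn's C is 1/lambda
n_nonzero = int(np.sum(lasso_cv.coef_ != 0))
print(f"selected lambda ~ {1 / best_C:.5f}, non-zero coefficients: {n_nonzero}")
```

Because the folds are random, repeated runs select slightly different penalties, which mirrors the rerun variability noted above.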


The coefficients revealed at the lambda.min value were standardized and plotted in order of importance (Figure 2.1, right). Importance was computed using the variable importance function in R. The top 5 variables revealed by the lasso regression model are: smoothness_se, concave_pts_se, fractal_dim_worst, concavity_se and fractal_dim_mean. This suggests that the most influential attributes for predicting tumour malignancy are the fractal dimension, concavity and smoothness of the tumours. As fractal dimension increases, the likelihood of malignancy increases, and the same holds for concavity, as verified in the original work (Street et al. 1993). Street et al. (1993) also found that as concavity increases, more indentations are present and the surface is rough rather than smooth.
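Ranking standardized coefficients by absolute magnitude, as the variable importance function does for a lasso fit, can be sketched in Python as below. The fixed penalty `C=1.0` is an arbitrary stand-in for the selected lambda.min, and sklearn's feature names differ from the post's (e.g. "worst concave points" rather than concave_pts_worst), so the ranking is illustrative only.

```python
# Hedged sketch: rank standardized lasso coefficients by absolute value,
# analogous to computing variable importance from the lambda.min fit.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

data = load_breast_cancer()
X = StandardScaler().fit_transform(data.data)   # standardized, so |coef| is comparable
y = data.target

# Fixed L1 penalty as a stand-in for the cross-validated lambda.min
fit = LogisticRegression(penalty="l1", C=1.0, solver="liblinear",
                         max_iter=5000).fit(X, y)

coefs = fit.coef_.ravel()
order = np.argsort(-np.abs(coefs))              # largest |coefficient| first
top5 = [(data.feature_names[i], round(float(coefs[i]), 3)) for i in order[:5]]
print(top5)
```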


Random Forest

Figure 2.2 (left) shows the mean decrease in accuracy: how much the model's accuracy would decrease if a given variable were excluded. Concave_pts_worst, perimeter_worst, radius_worst, area_worst and texture_worst are the top 5 variables determined by the random forest model. In this model, 1000 trees were fitted, and of the variables mentioned, “perimeter_worst”, “concave_pts_worst”, “radius_worst” and “area_worst” are also supported by the mean decrease in Gini (Figure 2.2, right), which indicates node purity. The relative importance of each variable when splitting nodes can thus be read from the plots in Figure 2.2. The out-of-bag (OOB) error rate is lowest at 340 trees, with a value of 0.0352; the error fluctuates slightly before levelling off. The classification error rate for this model was 0.042 (4.2%). The confusion matrix indicated 15 benign samples classified as malignant and 9 malignant samples classified as benign.
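The random forest fit above (R's randomForest) can be sketched in Python as follows. sklearn's `feature_importances_` is the mean decrease in impurity, the analogue of the Gini importance in Figure 2.2 (right), and `oob_score_` gives the out-of-bag accuracy; exact values will differ from the post's R run.

```python
# Hedged sketch: 1000-tree random forest with OOB error and Gini-style importances,
# a Python analogue of the R randomForest fit described above.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
rf = RandomForestClassifier(n_estimators=1000, oob_score=True, random_state=0)
rf.fit(data.data, data.target)

oob_error = 1 - rf.oob_score_                    # out-of-bag classification error
print(f"OOB error: {oob_error:.4f}")

# Mean decrease in impurity = the Gini importance analogue
top5 = list(data.feature_names[np.argsort(-rf.feature_importances_)[:5]])
print("top 5 by Gini importance:", top5)
```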




The variables of importance differ markedly between the ordinary lasso and bootstrapped lasso models. The important variables determined by the bootstrapped lasso model are much closer to those of the random forest model than the ordinary lasso's are. The top 5 variables from the ordinary lasso were among the least important according to the random forest model. By contrast, of the bootstrapped lasso's top 5, concave_pts_worst also appears in the random forest's top 5, and concavity_worst is assigned a substantial level of importance, though not to the same degree as the variables mentioned above. If lasso regression were the method of choice for prioritizing variables in clinical experiments, bootstrapping should be performed. Random forest is often considered one of the most accurate machine learning algorithms available, and since the bootstrapped lasso model performed comparably, it may be viewed in a positive light.
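One common way to bootstrap the lasso is to refit it on resampled data and count how often each variable receives a non-zero coefficient. The post does not give its bootstrap settings, so the sketch below uses an arbitrary 50 resamples and a fixed penalty purely for illustration.

```python
# Illustrative sketch of bootstrapped lasso selection: refit the lasso on bootstrap
# resamples and record each variable's selection frequency. B=50 and C=0.5 are
# assumptions, not the post's settings.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

data = load_breast_cancer()
X = StandardScaler().fit_transform(data.data)
y = data.target
rng = np.random.default_rng(0)

B = 50
counts = np.zeros(X.shape[1])
for _ in range(B):
    idx = rng.integers(0, len(y), len(y))        # sample rows with replacement
    fit = LogisticRegression(penalty="l1", C=0.5, solver="liblinear",
                             max_iter=5000).fit(X[idx], y[idx])
    counts += (fit.coef_.ravel() != 0)           # was the variable selected?

freq = counts / B                                # per-variable selection frequency
top5 = list(data.feature_names[np.argsort(-freq)[:5]])
print("most stably selected:", top5)
```

Variables selected in most resamples are the stable ones; comparing this ranking against the random forest importances is the comparison made in the paragraph above.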









 
 
 
