wbcd <- read.csv("wisc_bc_data.csv", stringsAsFactors = FALSE)
wbcd <- wbcd[-1] # drop the id column; it carries no useful information for prediction
table(wbcd$diagnosis) # count the benign (B) and malignant (M) diagnoses
B M
357 212
wbcd$diagnosis <- factor(wbcd$diagnosis, levels = c("B", "M"), labels = c("Benign", "Malignant")) # recode as a factor with readable labels
prop.table(table(wbcd$diagnosis)) * 100 # class proportions as percentages
Benign Malignant
62.74165 37.25835
Normalizing features:
# the summary below shows that the features are on very inconsistent scales, so we must normalize them
summary(wbcd[c("radius_mean", "area_mean", "smoothness_mean")])
  radius_mean       area_mean      smoothness_mean
 Min.   : 6.981   Min.   : 143.5   Min.   :0.05263
 1st Qu.:11.700   1st Qu.: 420.3   1st Qu.:0.08637
 Median :13.370   Median : 551.1   Median :0.09587
 Mean   :14.127   Mean   : 654.9   Mean   :0.09636
 3rd Qu.:15.780   3rd Qu.: 782.7   3rd Qu.:0.10530
 Max.   :28.110   Max.   :2501.0   Max.   :0.16340
# simple function to normalize data passed in; output is in range of [0.0,1.0]
normalize <- function(x){
return ((x - min(x))/(max(x) - min(x)))
}
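As a quick sanity check on the min-max formula, the sketch below redefines the same normalize() and applies it to two toy vectors on very different scales. The vectors are made up for illustration and are not taken from the dataset; both should land on the same 0 to 1 range.

```r
# Same min-max rescaling as above: (x - min) / (max - min)
normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# Toy vectors (illustrative only) on very different scales
small <- c(1, 2, 3, 4, 5)
large <- c(10, 20, 30, 40, 50)

small_n <- normalize(small)
large_n <- normalize(large)

print(small_n)  # 0.00 0.25 0.50 0.75 1.00
print(large_n)  # 0.00 0.25 0.50 0.75 1.00
```

Note that a constant column would make the denominator zero and return NaN, so this version assumes every feature takes at least two distinct values.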
# normalize wbcd and store it in wbcd_n
wbcd_n <- as.data.frame(lapply(wbcd[2:31], normalize))
# confirm that the features have been normalized
summary(wbcd_n[c("area_mean", "texture_mean", "smoothness_mean")])
   area_mean       texture_mean    smoothness_mean
 Min.   :0.0000   Min.   :0.0000   Min.   :0.0000
 1st Qu.:0.1174   1st Qu.:0.2185   1st Qu.:0.3046
 Median :0.1729   Median :0.3088   Median :0.3904
 Mean   :0.2169   Mean   :0.3240   Mean   :0.3948
 3rd Qu.:0.2711   3rd Qu.:0.4089   3rd Qu.:0.4755
 Max.   :1.0000   Max.   :1.0000   Max.   :1.0000
Split the normalized dataset into training and test sets:
wbcd_train <- wbcd_n[1:469, ]
wbcd_test <- wbcd_n[470:569, ]
wbcd_train_labels <- wbcd[1:469, 1]
wbcd_test_labels <- wbcd[470:569, 1]
#install.packages("class")
library(class)
wbcd_test_pred <- knn(train = wbcd_train, test = wbcd_test, cl = wbcd_train_labels, k = 21) # k = 21, an odd number close to sqrt(469), which is about 21.7
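One common way to read the choice of k above is as a rule of thumb: start near the square root of the training set size and prefer an odd value so a two-class vote cannot tie. A minimal sketch of that heuristic follows; `suggest_k` is a hypothetical helper name, not a function from the class package.

```r
# Hypothetical helper: largest odd integer not exceeding sqrt(n_train)
suggest_k <- function(n_train) {
  k <- floor(sqrt(n_train))
  if (k %% 2 == 0) k <- k - 1  # step down to an odd value to avoid tied votes
  max(k, 1)                    # never go below a single neighbor
}

print(sqrt(469))     # about 21.66
k <- suggest_k(469)
print(k)             # 21
```

This is only a starting point; the value that works best depends on the data.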
library(gmodels)
CrossTable(x = wbcd_test_labels, y = wbcd_test_pred, prop.chisq = FALSE)
   Cell Contents
|-------------------------|
|                       N |
|           N / Row Total |
|           N / Col Total |
|         N / Table Total |
|-------------------------|

Total Observations in Table:  100

                 | wbcd_test_pred
wbcd_test_labels |    Benign | Malignant | Row Total |
-----------------|-----------|-----------|-----------|
          Benign |        61 |         0 |        61 |
                 |     1.000 |     0.000 |     0.610 |
                 |     0.968 |     0.000 |           |
                 |     0.610 |     0.000 |           |
-----------------|-----------|-----------|-----------|
       Malignant |         2 |        37 |        39 |
                 |     0.051 |     0.949 |     0.390 |
                 |     0.032 |     1.000 |           |
                 |     0.020 |     0.370 |           |
-----------------|-----------|-----------|-----------|
    Column Total |        63 |        37 |       100 |
                 |     0.630 |     0.370 |           |
-----------------|-----------|-----------|-----------|
==> As we can see, our k-NN model correctly predicts all Benign test cases (61/61 = 100%) and 37 of the 39 Malignant test cases (37/39 = 94.9%). However, it incorrectly predicts the remaining 2 Malignant cases (2/39 = 5.1%) as Benign.
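The percentages quoted above can be re-derived from the four cell counts alone. The sketch below copies the counts from the CrossTable output into a plain matrix (the object names are illustrative) and recomputes the row proportions and the overall accuracy.

```r
# Counts copied from the CrossTable output (rows = actual, columns = predicted)
conf_mat <- matrix(c(61,  0,
                      2, 37),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(actual = c("Benign", "Malignant"),
                                   predicted = c("Benign", "Malignant")))

# Row proportions reproduce the "N / Row Total" line of each cell
row_props <- prop.table(conf_mat, margin = 1)
print(round(row_props, 3))

# Overall accuracy: correct predictions on the diagonal over all test cases
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
print(accuracy)  # 0.98
```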
wbcd_z <- as.data.frame(scale(wbcd[-1])) # standardize every feature column using z-scores
summary(wbcd_z[c("area_mean", "texture_mean", "smoothness_mean")])
   area_mean        texture_mean     smoothness_mean
 Min.   :-1.4532   Min.   :-2.2273   Min.   :-3.10935
 1st Qu.:-0.6666   1st Qu.:-0.7253   1st Qu.:-0.71034
 Median :-0.2949   Median :-0.1045   Median :-0.03486
 Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.00000
 3rd Qu.: 0.3632   3rd Qu.: 0.5837   3rd Qu.: 0.63564
 Max.   : 5.2459   Max.   : 4.6478   Max.   : 4.76672
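With its default arguments, scale() centers each column on its mean and divides by its sample standard deviation, which is why every mean above is 0. A small sketch on a made-up data frame (the values are illustrative, not from wisc_bc_data.csv) showing that this matches the manual formula:

```r
# Toy data frame (illustrative values only)
toy <- data.frame(a = c(2, 4, 6, 8), b = c(100, 150, 400, 350))

# Manual z-score: subtract the column mean, divide by the sample standard deviation
z_manual <- as.data.frame(lapply(toy, function(x) (x - mean(x)) / sd(x)))

# scale() does the same by default (center = TRUE, scale = TRUE)
z_scaled <- as.data.frame(scale(toy))

# Largest absolute difference between the two versions (zero up to rounding)
max_diff <- max(abs(unlist(z_manual) - unlist(z_scaled)))
print(max_diff)

# After standardization each column has mean 0 and standard deviation 1
print(round(colMeans(z_scaled), 10))
print(sapply(z_scaled, sd))
```

Unlike min-max normalization, z-scores have no fixed minimum or maximum, so extreme values stay extreme, as the maxima above 4 in the summary show.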
wbcd_train <- wbcd_z[1:469, ]
wbcd_test <- wbcd_z[470:569, ]
wbcd_train_labels <- wbcd[1:469, 1]
wbcd_test_labels <- wbcd[470:569, 1]
wbcd_test_pred <- knn(train = wbcd_train, test = wbcd_test, cl = wbcd_train_labels, k = 21) # k = 21, an odd number close to sqrt(469), which is about 21.7
CrossTable(x = wbcd_test_labels, y = wbcd_test_pred, prop.chisq = FALSE)
   Cell Contents
|-------------------------|
|                       N |
|           N / Row Total |
|           N / Col Total |
|         N / Table Total |
|-------------------------|

Total Observations in Table:  100

                 | wbcd_test_pred
wbcd_test_labels |    Benign | Malignant | Row Total |
-----------------|-----------|-----------|-----------|
          Benign |        61 |         0 |        61 |
                 |     1.000 |     0.000 |     0.610 |
                 |     0.924 |     0.000 |           |
                 |     0.610 |     0.000 |           |
-----------------|-----------|-----------|-----------|
       Malignant |         5 |        34 |        39 |
                 |     0.128 |     0.872 |     0.390 |
                 |     0.076 |     1.000 |           |
                 |     0.050 |     0.340 |           |
-----------------|-----------|-----------|-----------|
    Column Total |        66 |        34 |       100 |
                 |     0.660 |     0.340 |           |
-----------------|-----------|-----------|-----------|
==> Unfortunately, in this case z-score standardization results in a slightly worse model: 5 Malignant cases (5/39 = 12.8%) are predicted as Benign, compared with 2 under min-max normalization. In the real world, misclassifying a Malignant tumor as Benign can lead patients to believe they are healthy even though they need serious treatment.
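To put the two runs side by side, the sketch below recomputes accuracy and the false-negative rate from the counts in the two CrossTable outputs, treating Malignant as the positive class. `summarize_run` is a hypothetical helper written for this comparison, not part of gmodels or class.

```r
# Hypothetical helper: summary rates from the four confusion-matrix counts,
# with Malignant as the positive class
summarize_run <- function(tn, fp, fn, tp) {
  c(accuracy = (tn + tp) / (tn + fp + fn + tp),
    false_negative_rate = fn / (fn + tp))
}

# Counts copied from the two CrossTable outputs above
min_max <- summarize_run(tn = 61, fp = 0, fn = 2, tp = 37)
z_score <- summarize_run(tn = 61, fp = 0, fn = 5, tp = 34)

comparison <- round(rbind(min_max, z_score), 3)
print(comparison)
```

On this 100-case test set the difference is only three patients, so the comparison says which rescaling did better on this particular split, not which is better in general.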