Diagnosing Breast Cancer Using Various Predictive Models

Author: Cat Tran

Language: Python

Packages used: numpy, pandas, matplotlib, sklearn, pydotplus, IPython.display

Previously, I showed how we can use R and its class package to predict breast cancer from extracted breast-mass features. The data set was donated to, and is publicly available at, the UCI Machine Learning Repository. We used the k-NN algorithm as our predictive model and got a decent result from such a simple model.

In this example, I will try my luck with various predictive models in Python to see if we can improve our predictive ability on the same data set. The models we will experiment with today include:

  • Decision Tree
  • Random Forest
  • Support Vector Machine

* About the data set:

The data set contains 32 columns. The first column is the id number of each instance; since the id number is generated only so that researchers can keep track of the instances, it gives us no insight into the data. The second column is the diagnosis of each instance, which is either M for malignant or B for benign.

Columns 3 to 32 contain the values for 10 features; each feature has three computed statistics: the mean, the standard error, and the "worst" value. The values represent physical characteristics of the cell nuclei images.

A more detailed description of the data set can be found here.

I. Reading and Preprocessing the Data Set:

First, we need to read in the data set:

In [14]:
import numpy as np
import pandas as pd

data_frame = pd.read_csv("./breast_cancer_data.csv")
data_frame.head(10)
Out[14]:
id diagnosis radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean concavity_mean concave points_mean ... texture_worst perimeter_worst area_worst smoothness_worst compactness_worst concavity_worst concave points_worst symmetry_worst fractal_dimension_worst Unnamed: 32
0 842302 M 17.99 10.38 122.80 1001.0 0.11840 0.27760 0.30010 0.14710 ... 17.33 184.60 2019.0 0.1622 0.6656 0.7119 0.2654 0.4601 0.11890 NaN
1 842517 M 20.57 17.77 132.90 1326.0 0.08474 0.07864 0.08690 0.07017 ... 23.41 158.80 1956.0 0.1238 0.1866 0.2416 0.1860 0.2750 0.08902 NaN
2 84300903 M 19.69 21.25 130.00 1203.0 0.10960 0.15990 0.19740 0.12790 ... 25.53 152.50 1709.0 0.1444 0.4245 0.4504 0.2430 0.3613 0.08758 NaN
3 84348301 M 11.42 20.38 77.58 386.1 0.14250 0.28390 0.24140 0.10520 ... 26.50 98.87 567.7 0.2098 0.8663 0.6869 0.2575 0.6638 0.17300 NaN
4 84358402 M 20.29 14.34 135.10 1297.0 0.10030 0.13280 0.19800 0.10430 ... 16.67 152.20 1575.0 0.1374 0.2050 0.4000 0.1625 0.2364 0.07678 NaN
5 843786 M 12.45 15.70 82.57 477.1 0.12780 0.17000 0.15780 0.08089 ... 23.75 103.40 741.6 0.1791 0.5249 0.5355 0.1741 0.3985 0.12440 NaN
6 844359 M 18.25 19.98 119.60 1040.0 0.09463 0.10900 0.11270 0.07400 ... 27.66 153.20 1606.0 0.1442 0.2576 0.3784 0.1932 0.3063 0.08368 NaN
7 84458202 M 13.71 20.83 90.20 577.9 0.11890 0.16450 0.09366 0.05985 ... 28.14 110.60 897.0 0.1654 0.3682 0.2678 0.1556 0.3196 0.11510 NaN
8 844981 M 13.00 21.82 87.50 519.8 0.12730 0.19320 0.18590 0.09353 ... 30.73 106.20 739.3 0.1703 0.5401 0.5390 0.2060 0.4378 0.10720 NaN
9 84501001 M 12.46 24.04 83.97 475.9 0.11860 0.23960 0.22730 0.08543 ... 40.68 97.65 711.4 0.1853 1.0580 1.1050 0.2210 0.4366 0.20750 NaN

10 rows × 33 columns

Now we need to drop the id column and the Unnamed column at the end, because these two columns contribute nothing to our predictive ability:

In [15]:
data_frame.drop('id', axis=1, inplace=True)
data_frame.drop('Unnamed: 32', axis=1, inplace=True)
data_frame.head(10)
Out[15]:
diagnosis radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean concavity_mean concave points_mean symmetry_mean ... radius_worst texture_worst perimeter_worst area_worst smoothness_worst compactness_worst concavity_worst concave points_worst symmetry_worst fractal_dimension_worst
0 M 17.99 10.38 122.80 1001.0 0.11840 0.27760 0.30010 0.14710 0.2419 ... 25.38 17.33 184.60 2019.0 0.1622 0.6656 0.7119 0.2654 0.4601 0.11890
1 M 20.57 17.77 132.90 1326.0 0.08474 0.07864 0.08690 0.07017 0.1812 ... 24.99 23.41 158.80 1956.0 0.1238 0.1866 0.2416 0.1860 0.2750 0.08902
2 M 19.69 21.25 130.00 1203.0 0.10960 0.15990 0.19740 0.12790 0.2069 ... 23.57 25.53 152.50 1709.0 0.1444 0.4245 0.4504 0.2430 0.3613 0.08758
3 M 11.42 20.38 77.58 386.1 0.14250 0.28390 0.24140 0.10520 0.2597 ... 14.91 26.50 98.87 567.7 0.2098 0.8663 0.6869 0.2575 0.6638 0.17300
4 M 20.29 14.34 135.10 1297.0 0.10030 0.13280 0.19800 0.10430 0.1809 ... 22.54 16.67 152.20 1575.0 0.1374 0.2050 0.4000 0.1625 0.2364 0.07678
5 M 12.45 15.70 82.57 477.1 0.12780 0.17000 0.15780 0.08089 0.2087 ... 15.47 23.75 103.40 741.6 0.1791 0.5249 0.5355 0.1741 0.3985 0.12440
6 M 18.25 19.98 119.60 1040.0 0.09463 0.10900 0.11270 0.07400 0.1794 ... 22.88 27.66 153.20 1606.0 0.1442 0.2576 0.3784 0.1932 0.3063 0.08368
7 M 13.71 20.83 90.20 577.9 0.11890 0.16450 0.09366 0.05985 0.2196 ... 17.06 28.14 110.60 897.0 0.1654 0.3682 0.2678 0.1556 0.3196 0.11510
8 M 13.00 21.82 87.50 519.8 0.12730 0.19320 0.18590 0.09353 0.2350 ... 15.49 30.73 106.20 739.3 0.1703 0.5401 0.5390 0.2060 0.4378 0.10720
9 M 12.46 24.04 83.97 475.9 0.11860 0.23960 0.22730 0.08543 0.2030 ... 15.09 40.68 97.65 711.4 0.1853 1.0580 1.1050 0.2210 0.4366 0.20750

10 rows × 31 columns

Let's encode the diagnosis as numerical values: 1 for malignant and 0 for benign:

In [16]:
data_frame.diagnosis.unique()  # inspect the label values before mapping
data_frame['diagnosis'] = data_frame['diagnosis'].map({'M':1, 'B':0})
data_frame.head(10)
Out[16]:
diagnosis radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean concavity_mean concave points_mean symmetry_mean ... radius_worst texture_worst perimeter_worst area_worst smoothness_worst compactness_worst concavity_worst concave points_worst symmetry_worst fractal_dimension_worst
0 1 17.99 10.38 122.80 1001.0 0.11840 0.27760 0.30010 0.14710 0.2419 ... 25.38 17.33 184.60 2019.0 0.1622 0.6656 0.7119 0.2654 0.4601 0.11890
1 1 20.57 17.77 132.90 1326.0 0.08474 0.07864 0.08690 0.07017 0.1812 ... 24.99 23.41 158.80 1956.0 0.1238 0.1866 0.2416 0.1860 0.2750 0.08902
2 1 19.69 21.25 130.00 1203.0 0.10960 0.15990 0.19740 0.12790 0.2069 ... 23.57 25.53 152.50 1709.0 0.1444 0.4245 0.4504 0.2430 0.3613 0.08758
3 1 11.42 20.38 77.58 386.1 0.14250 0.28390 0.24140 0.10520 0.2597 ... 14.91 26.50 98.87 567.7 0.2098 0.8663 0.6869 0.2575 0.6638 0.17300
4 1 20.29 14.34 135.10 1297.0 0.10030 0.13280 0.19800 0.10430 0.1809 ... 22.54 16.67 152.20 1575.0 0.1374 0.2050 0.4000 0.1625 0.2364 0.07678
5 1 12.45 15.70 82.57 477.1 0.12780 0.17000 0.15780 0.08089 0.2087 ... 15.47 23.75 103.40 741.6 0.1791 0.5249 0.5355 0.1741 0.3985 0.12440
6 1 18.25 19.98 119.60 1040.0 0.09463 0.10900 0.11270 0.07400 0.1794 ... 22.88 27.66 153.20 1606.0 0.1442 0.2576 0.3784 0.1932 0.3063 0.08368
7 1 13.71 20.83 90.20 577.9 0.11890 0.16450 0.09366 0.05985 0.2196 ... 17.06 28.14 110.60 897.0 0.1654 0.3682 0.2678 0.1556 0.3196 0.11510
8 1 13.00 21.82 87.50 519.8 0.12730 0.19320 0.18590 0.09353 0.2350 ... 15.49 30.73 106.20 739.3 0.1703 0.5401 0.5390 0.2060 0.4378 0.10720
9 1 12.46 24.04 83.97 475.9 0.11860 0.23960 0.22730 0.08543 0.2030 ... 15.09 40.68 97.65 711.4 0.1853 1.0580 1.1050 0.2210 0.4366 0.20750

10 rows × 31 columns

Let's take a look at a summary of the data set:

In [17]:
data_frame.describe()
Out[17]:
diagnosis radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean concavity_mean concave points_mean symmetry_mean ... radius_worst texture_worst perimeter_worst area_worst smoothness_worst compactness_worst concavity_worst concave points_worst symmetry_worst fractal_dimension_worst
count 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 ... 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000
mean 0.372583 14.127292 19.289649 91.969033 654.889104 0.096360 0.104341 0.088799 0.048919 0.181162 ... 16.269190 25.677223 107.261213 880.583128 0.132369 0.254265 0.272188 0.114606 0.290076 0.083946
std 0.483918 3.524049 4.301036 24.298981 351.914129 0.014064 0.052813 0.079720 0.038803 0.027414 ... 4.833242 6.146258 33.602542 569.356993 0.022832 0.157336 0.208624 0.065732 0.061867 0.018061
min 0.000000 6.981000 9.710000 43.790000 143.500000 0.052630 0.019380 0.000000 0.000000 0.106000 ... 7.930000 12.020000 50.410000 185.200000 0.071170 0.027290 0.000000 0.000000 0.156500 0.055040
25% 0.000000 11.700000 16.170000 75.170000 420.300000 0.086370 0.064920 0.029560 0.020310 0.161900 ... 13.010000 21.080000 84.110000 515.300000 0.116600 0.147200 0.114500 0.064930 0.250400 0.071460
50% 0.000000 13.370000 18.840000 86.240000 551.100000 0.095870 0.092630 0.061540 0.033500 0.179200 ... 14.970000 25.410000 97.660000 686.500000 0.131300 0.211900 0.226700 0.099930 0.282200 0.080040
75% 1.000000 15.780000 21.800000 104.100000 782.700000 0.105300 0.130400 0.130700 0.074000 0.195700 ... 18.790000 29.720000 125.400000 1084.000000 0.146000 0.339100 0.382900 0.161400 0.317900 0.092080
max 1.000000 28.110000 39.280000 188.500000 2501.000000 0.163400 0.345400 0.426800 0.201200 0.304000 ... 36.040000 49.540000 251.200000 4254.000000 0.222600 1.058000 1.252000 0.291000 0.663800 0.207500

8 rows × 31 columns

Plot the histogram of the diagnosis distribution:

In [19]:
import matplotlib.pyplot as plt

plt.hist(data_frame['diagnosis'])
plt.title('Diagnosis Distribution')
plt.show()

==> As we can see, there are significantly more instances with a benign diagnosis than a malignant one.
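
For the exact counts behind the histogram, we can tally the encoded labels directly; in this data set the split is 357 benign to 212 malignant:

# Count the instances per encoded diagnosis (0 = benign, 1 = malignant)
print(data_frame['diagnosis'].value_counts())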

II. Train and Evaluate Classifiers:

II.a. Split data into train and test set

Before we go ahead and train our models, we need to split the processed data set into training and test sets. We will randomly assign 70% of the instances to the training set; the remaining 30% go into the test set:

In [65]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(data_frame.loc[:,'radius_mean':'fractal_dimension_worst'],
                                                    data_frame['diagnosis'],
                                                    test_size=0.3, # 30% is testing set
                                                    random_state=42)

print("X_train's count = ",  len(X_train), "\n",X_train.head(3))
print("X_test's count = ",  len(X_test), "\n", X_test.head(3))
print("y_train's count = ",  len(y_train), "\n", y_train.head(5))
print("y_test's count = ",  len(y_test), "\n", y_test.head(5))
X_train's count =  398 
      radius_mean  texture_mean  perimeter_mean  area_mean  smoothness_mean  \
149        13.74         17.91           88.12      585.0          0.07944   
124        13.37         16.39           86.10      553.5          0.07115   
421        14.69         13.98           98.22      656.1          0.10310   

     compactness_mean  concavity_mean  concave points_mean  symmetry_mean  \
149           0.06376         0.02881              0.01329         0.1473   
124           0.07325         0.08092              0.02800         0.1422   
421           0.18360         0.14500              0.06300         0.2086   

     fractal_dimension_mean           ...             radius_worst  \
149                 0.05580           ...                    15.34   
124                 0.05823           ...                    14.26   
421                 0.07406           ...                    16.46   

     texture_worst  perimeter_worst  area_worst  smoothness_worst  \
149          22.46            97.19       725.9           0.09711   
124          22.75            91.99       632.1           0.10250   
421          18.34           114.10       809.2           0.13120   

     compactness_worst  concavity_worst  concave points_worst  symmetry_worst  \
149             0.1824           0.1564               0.06019          0.2350   
124             0.2531           0.3308               0.08978          0.2048   
421             0.3635           0.3219               0.11080          0.2827   

     fractal_dimension_worst  
149                  0.07014  
124                  0.07628  
421                  0.09208  

[3 rows x 30 columns]
X_test's count =  171 
      radius_mean  texture_mean  perimeter_mean  area_mean  smoothness_mean  \
204        12.47         18.60           81.09      481.9          0.09965   
70         18.94         21.31          123.60     1130.0          0.09009   
131        15.46         19.48          101.70      748.9          0.10920   

     compactness_mean  concavity_mean  concave points_mean  symmetry_mean  \
204            0.1058         0.08005              0.03821         0.1925   
70             0.1029         0.10800              0.07951         0.1582   
131            0.1223         0.14660              0.08087         0.1931   

     fractal_dimension_mean           ...             radius_worst  \
204                 0.06373           ...                    14.97   
70                  0.05461           ...                    24.86   
131                 0.05796           ...                    19.26   

     texture_worst  perimeter_worst  area_worst  smoothness_worst  \
204          24.64            96.05       677.9            0.1426   
70           26.58           165.90      1866.0            0.1193   
131          26.00           124.90      1156.0            0.1546   

     compactness_worst  concavity_worst  concave points_worst  symmetry_worst  \
204             0.2378           0.2671                0.1015          0.3014   
70              0.2336           0.2687                0.1789          0.2551   
131             0.2394           0.3791                0.1514          0.2837   

     fractal_dimension_worst  
204                  0.08750  
70                   0.06589  
131                  0.08019  

[3 rows x 30 columns]
y_train's count =  398 
 149    0
124    0
421    0
195    0
545    0
Name: diagnosis, dtype: int64
y_test's count =  171 
 204    0
70     1
131    1
431    0
540    0
Name: diagnosis, dtype: int64
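
Side note: because benign cases outnumber malignant ones, you may want the split to preserve the class ratio. Here is a sketch of an optional stratified variant (not used in the runs below; the _s suffix is my own naming):

# Stratified variant: keeps the benign/malignant proportion the same in both splits
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    data_frame.loc[:, 'radius_mean':'fractal_dimension_worst'],
    data_frame['diagnosis'],
    test_size=0.3,
    random_state=42,
    stratify=data_frame['diagnosis'])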

II.b Train and evaluate classification models

Now that we have the training and test sets split into features and class labels, we can go ahead and build our classifier models.

  • As said before, we will begin with the Decision Tree model

In [71]:
from sklearn import tree

decision_tree = tree.DecisionTreeClassifier()
decision_tree.fit(X_train, y_train)
Out[71]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')

Now that we have trained our decision tree, let's print it out to get a better understanding of how the model makes decisions. With the help of export_graphviz, we can render a very nice-looking decision tree. If you haven't installed the necessary support libraries for export_graphviz, run the commands below to install them:

conda install -c conda-forge pydotplus
conda install -c conda-forge graphviz

In [94]:
from IPython.display import Image
import pydotplus

dot_data = tree.export_graphviz(decision_tree, out_file=None, class_names=['0','1'], feature_names=list(X_test.columns.values) ,filled=True, rounded=True)
decision_tree_graph = pydotplus.graph_from_dot_data(dot_data)
Image(decision_tree_graph.create_png())
Out[94]:

Let's see how well our Decision Tree performs by comparing its predicted class labels against the true class labels in y_test:

In [115]:
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

y_predicted = decision_tree.predict(X_test)
print(confusion_matrix(y_test, y_predicted, labels=[0, 1]))
print(classification_report(y_test, y_predicted))
print("Model accuracy: ", accuracy_score(y_test, y_predicted))
[[100   8]
 [  2  61]]
             precision    recall  f1-score   support

          0       0.98      0.93      0.95       108
          1       0.88      0.97      0.92        63

avg / total       0.94      0.94      0.94       171

Model accuracy:  0.941520467836

To understand the classification report, we need to understand the difference between the "precision", "recall", "f1-score", and "support" scores:

  • The precision is the ratio tp / (tp + fp)
  • The recall is the ratio tp / (tp + fn)
  • The f1-score is calculated as 2 * (precision * recall) / (precision + recall)
  • The support is the number of occurrences of each class in y_true
with tp as the count of true positives, fp as false positives, and fn as false negatives
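
To make these numbers concrete, here is a quick sanity check (a minimal sketch, using the class-1 counts read off the confusion matrix above) that recomputes the scores by hand:

tp, fp, fn = 61, 8, 2  # class 1 (malignant): 61 true positives, 8 false positives, 2 false negatives
precision = tp / (tp + fp)                          # 61/69, about 0.88
recall = tp / (tp + fn)                             # 61/63, about 0.97
f1 = 2 * precision * recall / (precision + recall)  # about 0.92, matching the report
print(precision, recall, f1)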

==> As we can see, with just a simple Decision Tree model we achieve quite high precision and recall. The accuracy turned out to be about 94.2%, which is quite good for such a simple model!

  • The second classifier model we want to try our luck with is Random Forest

In [131]:
from sklearn.ensemble import RandomForestClassifier

random_forest = RandomForestClassifier(n_estimators=10) # we start with having 10 trees in the Random Forest
random_forest.fit(X_train, y_train)
Out[131]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)

We will define a helper function for the model evaluation process. From now on, calling this helper function will make our evaluations more streamlined:

In [132]:
def evaluate_model_performance(classifier, X_test, y_test, labels):
    y_predicted = classifier.predict(X_test)
    print(confusion_matrix(y_test, y_predicted, labels=labels))
    print(classification_report(y_test, y_predicted))
    print("Model accuracy: ", accuracy_score(y_test, y_predicted))

Let's use our newly defined function to evaluate the Random Forest:

In [133]:
evaluate_model_performance(random_forest, X_test=X_test, y_test= y_test, labels=[0,1])
[[107   1]
 [  5  58]]
             precision    recall  f1-score   support

          0       0.96      0.99      0.97       108
          1       0.98      0.92      0.95        63

avg / total       0.97      0.96      0.96       171

Model accuracy:  0.964912280702

==> With just 10 trees in our random forest, we were able to improve on the Decision Tree's performance by about 2%. That might not look like much of an improvement, but keep in mind that when we started with the Decision Tree, we already had 94% classification accuracy.
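
As a bonus, a trained Random Forest can tell us which features it leans on the most. A minimal sketch using the random_forest fitted above (the exact ranking will vary from run to run, since no random_state was set):

# Rank features by the forest's impurity-based importance scores
importances = pd.Series(random_forest.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head(5))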

* What if..?

Now, you (or maybe just me?) might have some burning questions: what if we use different numbers of trees in our random forest? How would that affect performance?

To answer those questions, there is only one way to find out. Build more forests!

We will write a loop that iterates over different numbers of estimators (trees) in a random forest and prints out the performance of each forest:

In [141]:
for i in range(1,11):
    random_forest_i = RandomForestClassifier(n_estimators=i*5)
    random_forest_i.fit(X_train, y_train)
    print("Forest with %d trees: " %(5*i))
    evaluate_model_performance(random_forest_i, X_test= X_test, y_test = y_test, labels=[0,1])
Forest with 5 trees: 
[[105   3]
 [  2  61]]
             precision    recall  f1-score   support

          0       0.98      0.97      0.98       108
          1       0.95      0.97      0.96        63

avg / total       0.97      0.97      0.97       171

Model accuracy:  0.970760233918
Forest with 10 trees: 
[[106   2]
 [  3  60]]
             precision    recall  f1-score   support

          0       0.97      0.98      0.98       108
          1       0.97      0.95      0.96        63

avg / total       0.97      0.97      0.97       171

Model accuracy:  0.970760233918
Forest with 15 trees: 
[[106   2]
 [  5  58]]
             precision    recall  f1-score   support

          0       0.95      0.98      0.97       108
          1       0.97      0.92      0.94        63

avg / total       0.96      0.96      0.96       171

Model accuracy:  0.959064327485
Forest with 20 trees: 
[[106   2]
 [  4  59]]
             precision    recall  f1-score   support

          0       0.96      0.98      0.97       108
          1       0.97      0.94      0.95        63

avg / total       0.96      0.96      0.96       171

Model accuracy:  0.964912280702
Forest with 25 trees: 
[[107   1]
 [  4  59]]
             precision    recall  f1-score   support

          0       0.96      0.99      0.98       108
          1       0.98      0.94      0.96        63

avg / total       0.97      0.97      0.97       171

Model accuracy:  0.970760233918
Forest with 30 trees: 
[[106   2]
 [  3  60]]
             precision    recall  f1-score   support

          0       0.97      0.98      0.98       108
          1       0.97      0.95      0.96        63

avg / total       0.97      0.97      0.97       171

Model accuracy:  0.970760233918
Forest with 35 trees: 
[[106   2]
 [  4  59]]
             precision    recall  f1-score   support

          0       0.96      0.98      0.97       108
          1       0.97      0.94      0.95        63

avg / total       0.96      0.96      0.96       171

Model accuracy:  0.964912280702
Forest with 40 trees: 
[[107   1]
 [  4  59]]
             precision    recall  f1-score   support

          0       0.96      0.99      0.98       108
          1       0.98      0.94      0.96        63

avg / total       0.97      0.97      0.97       171

Model accuracy:  0.970760233918
Forest with 45 trees: 
[[107   1]
 [  3  60]]
             precision    recall  f1-score   support

          0       0.97      0.99      0.98       108
          1       0.98      0.95      0.97        63

avg / total       0.98      0.98      0.98       171

Model accuracy:  0.976608187135
Forest with 50 trees: 
[[106   2]
 [  4  59]]
             precision    recall  f1-score   support

          0       0.96      0.98      0.97       108
          1       0.97      0.94      0.95        63

avg / total       0.96      0.96      0.96       171

Model accuracy:  0.964912280702

==> As we can see, increasing the number of trees in a Random Forest does not necessarily increase the accuracy score by a significant amount. Even with just 5 trees, we still get very good classification accuracy. The accuracy scores for forests with higher numbers of trees (a.k.a. estimators) fluctuate between 96% and 97%; this is because the algorithm is called a RANDOM Forest classifier: the features considered by each Decision Tree in the forest are picked at random!
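
Incidentally, if you want the scores to stop fluctuating between runs, you can pin the randomness. A minimal sketch (random_state=42 is an arbitrary choice of mine):

# Fixing random_state makes each forest reproducible, so any difference in
# accuracy reflects the number of trees rather than run-to-run randomness.
for n in [5, 25, 50]:
    forest = RandomForestClassifier(n_estimators=n, random_state=42)
    forest.fit(X_train, y_train)
    print(n, "trees -> accuracy:", accuracy_score(y_test, forest.predict(X_test)))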

  • After having some fun with Decision Tree and Random Forest, let's turn our attention to the Support Vector Machine

The reason we are interested in the Support Vector Machine (SVM) is its advantages:

  • Quite effective for data sets with a high number of features (our data set has 30 features)
  • Uses only a subset of the training data points (the support vectors), and is thus very memory efficient
  • Can be used with different kernels (including linear, polynomial, rbf, and sigmoid) to control variance and bias

Another thing to keep in mind is that, unlike a Decision Tree, which considers only one feature each time it makes a decision, an SVM takes multiple features into account at once. It is therefore generally a good idea to scale the data set so that the ranges of the features are consistent with each other; this prevents features with significantly larger value ranges from overwhelming features with smaller ranges. So we need to normalize our data set:

In [198]:
def normalize_data_frame(data_frame):
    """
    Normalize the passed-in data frame (subtract the mean, divide by the range)
    and return the normalized data frame.
    """
    # Note: for simplicity each frame is normalized with its own statistics here;
    # strictly speaking, the scaling parameters should be learned from the
    # training set only and then applied to the test set.
    data_frame_normalized = (data_frame - data_frame.mean())/(data_frame.max() - data_frame.min())
    return data_frame_normalized

X_test_normalized = normalize_data_frame(X_test)
X_train_normalized = normalize_data_frame(X_train)
print("***Before normalization:***\n", X_test.head(5))
print("\n***After normalization:***\n",X_test_normalized.head(5))
print("\n***Min and Max values:***\n" ,X_test_normalized.max(), X_test_normalized.min())
***Before normalization:***
      radius_mean  texture_mean  perimeter_mean  area_mean  smoothness_mean  \
204        12.47         18.60           81.09      481.9          0.09965   
70         18.94         21.31          123.60     1130.0          0.09009   
131        15.46         19.48          101.70      748.9          0.10920   
431        12.40         17.68           81.47      467.8          0.10540   
540        11.54         14.44           74.65      402.9          0.09984   

     compactness_mean  concavity_mean  concave points_mean  symmetry_mean  \
204            0.1058         0.08005              0.03821         0.1925   
70             0.1029         0.10800              0.07951         0.1582   
131            0.1223         0.14660              0.08087         0.1931   
431            0.1316         0.07741              0.02799         0.1811   
540            0.1120         0.06737              0.02594         0.1818   

     fractal_dimension_mean           ...             radius_worst  \
204                 0.06373           ...                    14.97   
70                  0.05461           ...                    24.86   
131                 0.05796           ...                    19.26   
431                 0.07102           ...                    12.88   
540                 0.06782           ...                    12.26   

     texture_worst  perimeter_worst  area_worst  smoothness_worst  \
204          24.64            96.05       677.9            0.1426   
70           26.58           165.90      1866.0            0.1193   
131          26.00           124.90      1156.0            0.1546   
431          22.91            89.61       515.8            0.1450   
540          19.68            78.78       457.8            0.1345   

     compactness_worst  concavity_worst  concave points_worst  symmetry_worst  \
204             0.2378           0.2671               0.10150          0.3014   
70              0.2336           0.2687               0.17890          0.2551   
131             0.2394           0.3791               0.15140          0.2837   
431             0.2629           0.2403               0.07370          0.2556   
540             0.2118           0.1797               0.06918          0.2329   

     fractal_dimension_worst  
204                  0.08750  
70                   0.06589  
131                  0.08019  
431                  0.09359  
540                  0.08134  

[5 rows x 30 columns]

***After normalization:***
      radius_mean  texture_mean  perimeter_mean  area_mean  smoothness_mean  \
204    -0.084640     -0.047895       -0.079950  -0.093184         0.027965   
70      0.270095      0.082771        0.252914   0.280469        -0.085211   
131     0.079295     -0.005464        0.081432   0.060751         0.141023   
431    -0.088478     -0.092253       -0.076974  -0.101313         0.096037   
540    -0.135629     -0.248473       -0.130376  -0.138730         0.030215   

     compactness_mean  concavity_mean  concave points_mean  symmetry_mean  \
204         -0.002971       -0.024496            -0.064092       0.050658   
70          -0.011978        0.041053             0.159756      -0.135149   
131          0.048278        0.131578             0.167127       0.053908   
431          0.077164       -0.030687            -0.119485      -0.011097   
540          0.016286       -0.054233            -0.130597      -0.007305   

     fractal_dimension_mean           ...             radius_worst  \
204                0.019299           ...                -0.051180   
70                -0.264019           ...                 0.351507   
131               -0.159949           ...                 0.123494   
431                0.245767           ...                -0.136278   
540                0.146357           ...                -0.161522   

     texture_worst  perimeter_worst  area_worst  smoothness_worst  \
204      -0.041442        -0.067072   -0.061519          0.058396   
70        0.014514         0.359910    0.304411         -0.095470   
131      -0.002215         0.109283    0.085734          0.137641   
431      -0.091341        -0.106439   -0.111445          0.074245   
540      -0.184505        -0.172641   -0.129308          0.004906   

     compactness_worst  concavity_worst  concave points_worst  symmetry_worst  \
204          -0.027669        -0.010926             -0.051176        0.018563   
70           -0.031744        -0.009648              0.214803       -0.100952   
131          -0.026117         0.078531              0.120301       -0.027126   
431          -0.003317        -0.032332             -0.146709       -0.099661   
540          -0.052895        -0.080734             -0.162242       -0.158257   

     fractal_dimension_worst  
204                 0.017492  
70                 -0.127688  
131                -0.031618  
431                 0.058405  
540                -0.023892  

[5 rows x 30 columns]

***Min and Max values:***
 radius_mean                0.614412
texture_mean               0.555770
perimeter_mean             0.627983
area_mean                  0.711717
smoothness_mean            0.471318
compactness_mean           0.741221
concavity_mean             0.787770
concave points_mean        0.728807
symmetry_mean              0.582077
fractal_dimension_mean     0.600225
radius_se                  0.753522
texture_se                 0.804335
perimeter_se               0.776567
area_se                    0.835367
smoothness_se              0.630213
compactness_se             0.792587
concavity_se               0.897819
concave points_se          0.653998
symmetry_se                0.835777
fractal_dimension_se       0.880539
radius_worst               0.662175
texture_worst              0.608111
perimeter_worst            0.653938
area_worst                 0.786732
smoothness_worst           0.586693
compactness_worst          0.768093
concavity_worst            0.775735
concave points_worst       0.600026
symmetry_worst             0.644789
fractal_dimension_worst    0.823672
dtype: float64 radius_mean               -0.385588
texture_mean              -0.444230
perimeter_mean            -0.372017
area_mean                 -0.288283
smoothness_mean           -0.528682
compactness_mean          -0.258779
concavity_mean            -0.212230
concave points_mean       -0.271193
symmetry_mean             -0.417923
fractal_dimension_mean    -0.399775
radius_se                 -0.246478
texture_se                -0.195665
perimeter_se              -0.223433
area_se                   -0.164633
smoothness_se             -0.369787
compactness_se            -0.207413
concavity_se              -0.102181
concave points_se         -0.346002
symmetry_se               -0.164223
fractal_dimension_se      -0.119461
radius_worst              -0.337825
texture_worst             -0.391889
perimeter_worst           -0.346062
area_worst                -0.213268
smoothness_worst          -0.413307
compactness_worst         -0.231907
concavity_worst           -0.224265
concave points_worst      -0.399974
symmetry_worst            -0.355211
fractal_dimension_worst   -0.176328
dtype: float64

Now that we have normalized our data, we can go ahead and train our Support Vector Machine:

In [174]:
from sklearn.svm import SVC

support_vector_machine = SVC()
support_vector_machine
Out[174]:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
In [175]:
support_vector_machine.fit(X_train_normalized, y_train)
Out[175]:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
In [180]:
evaluate_model_performance(classifier = support_vector_machine,
                           X_test=X_test_normalized,
                           y_test=y_test, labels=[0,1])
[[108   0]
 [  7  56]]
             precision    recall  f1-score   support

          0       0.94      1.00      0.97       108
          1       1.00      0.89      0.94        63

avg / total       0.96      0.96      0.96       171

Model accuracy:  0.959064327485

==> Our Support Vector Machine achieved performance similar to our previous Random Forest model. The SVM correctly identified 108 + 56 = 164 out of 171 test instances, for 96% accuracy!

* What if..?

Now, I know some of you with a more curious mind have been wanting to ask: "What happens if we use the un-normalized data to train and evaluate our SVM model? How is it going to differ from the SVC trained on normalized data?" I got you, fam!

We will train and evaluate the model using the original, un-normalized X_train and X_test to see how this affects the SVC's performance:

In [187]:
support_vector_machine_tempt = SVC()
support_vector_machine_tempt.fit(X_train, y_train)
evaluate_model_performance(classifier=support_vector_machine_tempt,
                           X_test=X_test,
                           y_test=y_test, labels=[0,1])
[[108   0]
 [ 63   0]]
             precision    recall  f1-score   support

          0       0.63      1.00      0.77       108
          1       0.00      0.00      0.00        63

avg / total       0.40      0.63      0.49       171

Model accuracy:  0.631578947368
/home/tony-stark/miniconda3/lib/python3.5/site-packages/sklearn/metrics/classification.py:1113: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)

==> As expected, training the SVM on un-normalized data results in poor performance. Our SVM reaches only 63% accuracy, and worse, it fails to identify a single true positive. It simply classifies every instance as "Benign", whether it is actually "Malignant" or not! As far as this data set is concerned, that kind of mistake would cost human lives: the model tells 63 patients they are fine and dandy even though they have breast cancer!
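
As an aside, scikit-learn ships a standard tool for this scaling step. Here is a sketch of an alternative to our hand-rolled normalization (not used elsewhere in this post): a StandardScaler inside a Pipeline learns its scaling parameters from the training data only and then re-applies them to the test data.

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# The pipeline fits the scaler on X_train only, then scales X_test consistently
svm_pipeline = make_pipeline(StandardScaler(), SVC())
svm_pipeline.fit(X_train, y_train)
evaluate_model_performance(classifier=svm_pipeline, X_test=X_test,
                           y_test=y_test, labels=[0, 1])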

* What if..?

The default SVC uses the Gaussian RBF kernel; however, there are other kernels we can use, including 'sigmoid', 'linear', and 'poly'. What happens to the SVM's performance if we use a kernel other than the default 'rbf'?

In [197]:
for i in ['linear', 'poly', 'sigmoid']:
    support_vector_machine_2 = SVC(kernel=i)
    support_vector_machine_2.fit(X_train_normalized, y_train)
    print("\n***Kernel used: ", i)
    evaluate_model_performance(classifier=support_vector_machine_2,
                               X_test= X_test_normalized,
                               y_test=y_test, labels=[0,1])
***Kernel used:  linear
[[108   0]
 [  3  60]]
             precision    recall  f1-score   support

          0       0.97      1.00      0.99       108
          1       1.00      0.95      0.98        63

avg / total       0.98      0.98      0.98       171

Model accuracy:  0.982456140351

***Kernel used:  poly
[[108   0]
 [ 63   0]]
             precision    recall  f1-score   support

          0       0.63      1.00      0.77       108
          1       0.00      0.00      0.00        63

avg / total       0.40      0.63      0.49       171

Model accuracy:  0.631578947368

***Kernel used:  sigmoid
[[108   0]
 [  8  55]]
             precision    recall  f1-score   support

          0       0.93      1.00      0.96       108
          1       1.00      0.87      0.93        63

avg / total       0.96      0.95      0.95       171

Model accuracy:  0.953216374269
/home/tony-stark/miniconda3/lib/python3.5/site-packages/sklearn/metrics/classification.py:1113: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)

==> Surprisingly, the 'linear' kernel offers even slightly better performance than the default 'rbf' kernel! Meanwhile, the 'sigmoid' kernel offers similar performance, and the 'poly' kernel performs the worst of all four. The fact that the 'linear' kernel outperforms the other, more complex kernels shows that sometimes the simpler kernel is the right kernel to use for a Support Vector Machine!
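
Rather than comparing kernels by hand, we could also let scikit-learn search for us. Here is a sketch (the parameter grid below is an arbitrary illustration of mine, not a tuned choice): GridSearchCV cross-validates every kernel/C combination on the training set and reports the best one.

from sklearn.model_selection import GridSearchCV

# 5-fold cross-validation over every kernel/C combination in the grid
param_grid = {'kernel': ['linear', 'rbf', 'sigmoid', 'poly'],
              'C': [0.1, 1, 10]}
grid_search = GridSearchCV(SVC(), param_grid, cv=5)
grid_search.fit(X_train_normalized, y_train)
print("Best parameters: ", grid_search.best_params_)
print("Best cross-validation accuracy: ", grid_search.best_score_)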

III. Conclusion:

After experimenting with different classification models on the Breast Cancer data set, there are a few noteworthy points I want to make:

  • Decision Tree can be a very powerful classification model when the data set contains numerical features and categorical target classes. In addition, one of the advantages of a Decision Tree is that its decision-making process is easy to explain, since we can print out the tree using the appropriate Python packages. If the task requires a model that can be visually explained to other people, the Decision Tree is a very promising candidate!
  • The performance of a Decision Tree model can be further improved with a Random Forest. Recall that a Random Forest is basically a group of Decision Trees built upon randomly selected features. The trees in a Random Forest can complement each other and boost the performance of a single Decision Tree!
  • The Support Vector Machine is another powerful method for data sets such as this one. One thing we need to be careful about when using an SVM is making sure the data set is processed appropriately; in this case, since the value ranges of the features are wide-spread, we need to normalize the data before we can train the SVM. Otherwise, the SVM will not perform as well as expected!
  • Different kernels used for the SVM yield different results. Depending on the data set, a more complex kernel might be a better choice. However, that is not the case for our data set; as it turned out, the simple 'linear' kernel offers the best performance compared to the other, more complex kernels. It is a good idea to try different kernels to see which one best fits the data set at hand!