Author: Cat Tran
Language: Python
Packages used: numpy, pandas, matplotlib, sklearn, pydotplus, IPython.display
Previously, I showed how we can use R and its class package to predict breast cancer based on extracted breast mass features. The data set was donated to, and is publicly available at, the UCI Machine Learning Repository. We used the k-NN algorithm as our predictive model and got a decent result with such a simple model.
In this example, I will try my luck with various predictive models in Python to see if we can improve our predictive ability on the same data set. The models that we will be experimenting with today are Decision Tree, Random Forest, and Support Vector Machine (SVM).
The data set contains 32 columns. The first column is the id number of each instance; since the id is generated only so that researchers can keep track of the data, it gives us no insight into the data set. The second column is the diagnosis of each instance, which is either M for malignant or B for benign.
Columns 3 to 32 contain the values of 10 features; each feature has 3 computed statistics: mean, standard error, and worst. The values represent the physical characteristics of the cell nuclei in the images.
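To make the column layout concrete, here is a minimal sketch of how the feature columns can be grouped by statistic. It assumes the columns follow a "<feature>_<statistic>" naming pattern, as radius_mean and fractal_dimension_worst used later in this post suggest; the "texture" feature and the "se" suffix are assumptions about the CSV, and only 3 of the 10 features are listed.

```python
# Hypothetical subset of feature names; the real file has 10 features.
features = ["radius", "texture", "fractal_dimension"]
# Assumed suffixes for mean, standard error, and worst.
statistics = ["mean", "se", "worst"]

# Build column names such as "radius_mean" or "texture_worst".
columns = [f"{feature}_{stat}" for stat in statistics for feature in features]

# Group the columns by the statistic encoded in their suffix.
grouped = {
    stat: [name for name in columns if name.endswith("_" + stat)]
    for stat in statistics
}

for stat, names in grouped.items():
    print(stat, "->", names)
```

With all 10 features listed, the same grouping would yield 3 groups of 10 columns, which accounts for columns 3 to 32.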
A more detailed description about the data set can be found here
First, we need to read in the data set:
import numpy as np
import pandas as pd
data_frame = pd.read_csv("./breast_cancer_data.csv")
data_frame.head(10)
Now we need to drop the id column and the Unnamed: 32 column at the end, because these two columns contribute nothing to our predictions:
data_frame.drop('id', axis=1, inplace=True)
data_frame.drop('Unnamed: 32', axis=1, inplace=True)
data_frame.head(10)
Let's encode the diagnosis into numerical values. 1 for malignant and 0 for benign:
data_frame.diagnosis.unique()
data_frame['diagnosis'] = data_frame['diagnosis'].map({'M':1, 'B':0})
data_frame.head(10)
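One thing worth checking after an encoding step like this is whether every label was actually mapped. The sketch below uses a toy Series (not the real diagnosis column) with one deliberately invalid label to show how unmapped values surface as NaN.

```python
import pandas as pd

# Toy stand-in for the diagnosis column; the last label is deliberately invalid.
diagnosis = pd.Series(["M", "B", "B", "M", "?"])

# Same mapping as in the post: 1 for malignant, 0 for benign.
encoded = diagnosis.map({"M": 1, "B": 0})

# Series.map returns NaN for any value missing from the dictionary,
# so counting NaNs reveals labels that were not encoded.
unmapped = int(encoded.isna().sum())
print("Unmapped labels:", unmapped)
```

On the real column, a count of zero would confirm that M and B are the only labels present.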
Let's take a look at the summary of this data set:
data_frame.describe()
Plot the histogram of the diagnosis distribution:
import matplotlib.pyplot as plt
plt.hist(data_frame['diagnosis'])
plt.title('Diagnosis Distribution')
plt.show()
==> As we can see, there are significantly more instances with a benign diagnosis than with a malignant diagnosis.
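The histogram shows the imbalance visually; the exact counts and proportions can be read off with value_counts. This is a small sketch on toy labels, not the real data, so the numbers printed here are illustrative only.

```python
import pandas as pd

# Toy encoded labels (1 = malignant, 0 = benign); not the real data.
labels = pd.Series([0, 0, 0, 1, 0, 1, 0, 1])

# Absolute number of instances per class.
counts = labels.value_counts()
# Fraction of the data set that falls into each class.
shares = labels.value_counts(normalize=True)

print(counts)
print(shares)
```

Applied to data_frame['diagnosis'], the same two calls would quantify how much larger the benign class is.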
Before we go ahead and train our models, we need to split our processed data set into training and testing sets. We will randomly assign 70% of the data set to the training set; the remaining 30% goes into the testing set:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    data_frame.loc[:, 'radius_mean':'fractal_dimension_worst'],
    data_frame['diagnosis'],
    test_size=0.3,  # 30% is testing set
    random_state=42)
print("X_train's count = ", len(X_train), "\n",X_train.head(3))
print("X_test's count = ", len(X_test), "\n", X_test.head(3))
print("y_train's count = ", len(y_train), "\n", y_train.head(5))
print("y_test's count = ", len(y_test), "\n", y_test.head(5))
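Because the classes are imbalanced, a purely random split can leave the two sets with slightly different class proportions. As an optional variation (not what this post does), train_test_split accepts a stratify argument that keeps the proportions the same in both sets. The sketch below uses synthetic labels, so the array contents are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 samples, 63 of class 0 and 37 of class 1.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 63 + [1] * 37)

# stratify=y asks for the same class proportions in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

print("Share of class 1 overall: ", y.mean())
print("Share of class 1 in train:", y_tr.mean())
print("Share of class 1 in test: ", y_te.mean())
```

The three printed shares should be close to each other, which is the point of stratifying.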
Now that we have training and testing sets split into features and class labels, we can go ahead and build our classifier models.
from sklearn import tree
decision_tree = tree.DecisionTreeClassifier()
decision_tree.fit(X_train, y_train)
Now that we have trained our decision tree, let's print it out to get a better understanding of how our model makes decisions. With the help of export_graphviz, we can draw a very nice-looking decision tree. If you haven't installed the necessary support libraries for export_graphviz, run the commands below to install them:
conda install -c conda-forge pydotplus
conda install -c conda-forge graphviz
from IPython.display import Image
import pydotplus
dot_data = tree.export_graphviz(decision_tree, out_file=None, class_names=['0','1'], feature_names=list(X_test.columns.values) ,filled=True, rounded=True)
decision_tree_graph = pydotplus.graph_from_dot_data(dot_data)
Image(decision_tree_graph.create_png())
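If Graphviz is not available, a plain-text view of the tree is a possible fallback. The sketch below uses tree.export_text, which is available in recent scikit-learn releases, on the copy of the Wisconsin breast cancer data bundled with scikit-learn. Two assumptions to keep in mind: the bundled copy encodes the classes the other way round (0 = malignant, 1 = benign), and max_depth=2 is chosen here only to keep the output short.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier, export_text

# Bundled data set; note its target encoding is 0 = malignant, 1 = benign.
data = load_breast_cancer()

# A deliberately shallow tree so the printed rules stay readable.
small_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
small_tree.fit(data.data, data.target)

# Render the learned split rules as indented text.
rules = export_text(small_tree, feature_names=list(data.feature_names))
print(rules)
```

Each indented line is one split condition, and each leaf ends with the predicted class.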
Let's see how well our Decision Tree performs by comparing its predicted class labels against the real class labels in y_test:
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
y_predicted = decision_tree.predict(X_test)
print(confusion_matrix(y_test, y_predicted, labels=[0, 1]))
print(classification_report(y_test, y_predicted))
print("Model accuracy: ", accuracy_score(y_test, y_predicted))
To understand the classification report, we need to understand the difference between "precision", "recall", "f1-score", and "support":
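As a rough illustration of the four quantities, the sketch below computes them by hand for the positive (malignant) class. The counts are hypothetical and are not the results of the model in this post.

```python
# Hypothetical confusion-matrix counts for the positive (malignant) class.
tp = 50   # malignant, predicted malignant
fp = 5    # benign, predicted malignant
fn = 10   # malignant, predicted benign
tn = 100  # benign, predicted benign

# Precision: of everything predicted malignant, how much really was malignant.
precision = tp / (tp + fp)
# Recall: of everything that really was malignant, how much we caught.
recall = tp / (tp + fn)
# F1-score: harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
# Support: number of true instances of the class in the test set.
support = tp + fn
# Accuracy, for comparison: share of all predictions that were correct.
accuracy = (tp + tn) / (tp + fp + fn + tn)

print("precision = %.3f, recall = %.3f, f1 = %.3f, support = %d"
      % (precision, recall, f1, support))
print("accuracy  = %.3f" % accuracy)
```

classification_report prints one such row per class, so the row for class 0 uses the benign class as the "positive" one.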
==> As we can see, with just a simple Decision Tree model, we achieve quite high precision and recall. The accuracy turned out to be about 94.2%, quite good for a simple Decision Tree model!
from sklearn.ensemble import RandomForestClassifier
random_forest = RandomForestClassifier(n_estimators=10) # we start with having 10 trees in the Random Forest
random_forest.fit(X_train, y_train)
We will define a helper function for the model evaluation process. From now on, calling this helper function will make our evaluation process more streamlined:
def evaluate_model_performance(classifier, X_test, y_test, labels):
    # Predict on the test features, then print the confusion matrix,
    # the per-class report, and the overall accuracy
    y_predicted = classifier.predict(X_test)
    print(confusion_matrix(y_test, y_predicted, labels=labels))
    print(classification_report(y_test, y_predicted))
    print("Model accuracy: ", accuracy_score(y_test, y_predicted))
Let's use our newly defined function to evaluate Random Forest
evaluate_model_performance(random_forest, X_test=X_test, y_test= y_test, labels=[0,1])
==> With just 10 trees in our random forest, we were able to improve on our Decision Tree's performance by about 2%. It might not look like much of an improvement, but keep in mind that the Decision Tree already had a classification accuracy of about 94%.
* What if..?
Now, you (or maybe just me?) might have some burning questions inside: what if we use different numbers of trees in our random forest? How would it affect the performance?
There is only one way to answer those questions: build more forests!
We are gonna write a loop that iterates over different numbers of estimators (trees) in a random forest and prints out the performance of each forest:
for i in range(1, 11):
    # Forests with 5, 10, ..., 50 trees
    random_forest_i = RandomForestClassifier(n_estimators=i*5)
    random_forest_i.fit(X_train, y_train)
    print("Forest with %d trees: " % (5*i))
    evaluate_model_performance(random_forest_i, X_test=X_test, y_test=y_test, labels=[0,1])
==> As we can see, increasing the number of trees in the Random Forest does not necessarily increase the accuracy score by a significant amount. Even with just 5 trees, we still get very good classification accuracy. The accuracy scores for forests with more trees (aka estimators) fluctuate between 96% and 97%. This is because the algorithm is, after all, called RANDOM Forest: each tree is trained on a random bootstrap sample of the training data, and the features considered at each split are picked at random, so two runs rarely produce exactly the same forest!
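To separate the effect of the number of trees from plain run-to-run randomness, one option is to fix the forest's random_state. The sketch below is an illustration under assumptions: it uses the copy of the data bundled with scikit-learn rather than the CSV from this post, and the helper name forest_accuracy is made up for this example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Bundled copy of the data set, split 70/30 as in the post.
X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

def forest_accuracy(seed):
    """Train a 10-tree forest with the given seed and return its test accuracy."""
    forest = RandomForestClassifier(n_estimators=10, random_state=seed)
    forest.fit(X_tr, y_tr)
    return forest.score(X_te, y_te)

# Same data, same number of trees, different seeds: the accuracy can vary.
scores = [forest_accuracy(seed) for seed in range(5)]
print("Accuracy per seed:", scores)

# Repeating a seed reproduces the same forest and therefore the same score.
repeat = forest_accuracy(0)
print("Seed 0 again:     ", repeat)
```

Any spread across the seeds comes from randomness alone, so differences of that size between forests of different sizes should not be over-interpreted.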
The reason we are interested in the Support Vector Machine (SVM) is because of its advantages:
Another thing we need to keep in mind is that, unlike a Decision Tree, which considers only one feature each time it makes a decision, an SVM takes multiple features into account at once. Thus it is generally a good idea to scale our data set so that the ranges of the features are consistent with each other; this prevents features with significantly larger ranges from dominating features with smaller ranges. We need to normalize our data set:
def normalize_data_frame(data_frame):
    """
    This function normalizes the passed-in data frame and returns the normalized data frame
    """
    # Mean normalization: center each column on its mean, then divide by its range
    data_frame_normalized = (data_frame - data_frame.mean())/(data_frame.max() - data_frame.min())
    return data_frame_normalized
X_test_normalized = normalize_data_frame(X_test)
X_train_normalized = normalize_data_frame(X_train)
print("***Before normalization:***\n", X_test.head(5))
print("\n***After normalization:***\n",X_test_normalized.head(5))
print("\n***Min and Max values:***\n" ,X_test_normalized.max(), X_test_normalized.min())
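Note that the code above normalizes the training and testing sets separately, each with its own mean and range. A common alternative, shown here only as a sketch and not as what this post does, is to compute the statistics on the training set and reuse them for the testing set, so that both sets are scaled identically. The function names and the toy values are assumptions for illustration.

```python
import pandas as pd

def fit_normalizer(train_frame):
    """Return the column means and ranges of the training data only."""
    center = train_frame.mean()
    spread = train_frame.max() - train_frame.min()
    return center, spread

def apply_normalizer(frame, center, spread):
    """Mean-normalize any frame with previously computed statistics."""
    return (frame - center) / spread

# Toy data with two features on very different scales.
train = pd.DataFrame({"radius": [10.0, 14.0, 18.0], "area": [300.0, 600.0, 1200.0]})
test = pd.DataFrame({"radius": [12.0, 20.0], "area": [450.0, 1500.0]})

center, spread = fit_normalizer(train)
train_scaled = apply_normalizer(train, center, spread)
test_scaled = apply_normalizer(test, center, spread)

print(train_scaled)
print(test_scaled)
```

With this approach a test value can fall outside the range seen in training (here the second test radius does), which is expected and harmless.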
Now that we have our data normalized, we can go ahead and train our Support Vector Machine
from sklearn.svm import SVC
support_vector_machine = SVC()
support_vector_machine
support_vector_machine.fit(X_train_normalized, y_train)
evaluate_model_performance(classifier = support_vector_machine,
X_test=X_test_normalized,
y_test=y_test, labels=[0,1])
==> Our Support Vector Machine achieved performance similar to our previous Random Forest model. The SVM correctly identified 108+56 = 164 of the 171 test instances, a 96% accuracy!
* What if..?
Now, I know some of you with a more curious mind want to ask: "What happens if we use the un-normalized data to train and evaluate our SVM model? How is it gonna be different compared to the SVC trained with normalized data?". I got you, fam!
We will train and evaluate our model using the original, un-normalized X_train and X_test to see how it affects our SVC's performance:
support_vector_machine_tempt = SVC()
support_vector_machine_tempt.fit(X_train, y_train)
evaluate_model_performance(classifier=support_vector_machine_tempt,
X_test=X_test,
y_test=y_test, labels=[0,1])
==> As expected, training the SVM on un-normalized data results in bad performance. Our SVM has only a 63% accuracy, not to mention that it failed to identify a single True Positive instance. Basically, it classifies all instances as "Benign" whether they are actually "Malignant" or not! As far as this data set is concerned, this kind of mistake would cost human lives, because the model predicts that 63 of the patients are fine and dandy even though they have breast cancer!
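One way to see why the RBF kernel struggles here is to compute a kernel value by hand. The sketch below assumes two made-up samples with one large-scale and one small-scale feature, and a gamma of 1 / n_features, which is what gamma="auto" gives. As far as we know, scikit-learn releases before 0.22 used that setting by default, while newer releases default to gamma="scale", so the un-normalized result above may look different on a current installation.

```python
import numpy as np

def rbf_kernel_value(a, b, gamma):
    """RBF kernel: exp(-gamma * squared Euclidean distance)."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

gamma = 0.5  # 1 / n_features for two features (assumed setting)

# Hypothetical raw samples: [area-like feature, smoothness-like feature].
a_raw = np.array([1000.0, 0.10])
b_raw = np.array([500.0, 0.09])

# Hypothetical values for the same samples after normalization.
a_scaled = np.array([0.4, 0.2])
b_scaled = np.array([-0.1, 0.1])

raw_similarity = rbf_kernel_value(a_raw, b_raw, gamma)
scaled_similarity = rbf_kernel_value(a_scaled, b_scaled, gamma)

print("Similarity on raw features:   ", raw_similarity)
print("Similarity on scaled features:", scaled_similarity)
```

On the raw features the large-scale feature alone drives the similarity to essentially zero, so every sample looks unrelated to every other one and the model has little to learn from beyond the majority class.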
* What if..?
The default SVM uses the Gaussian RBF kernel; however, there are other kernels that we can use, including 'sigmoid', 'linear', and 'poly' (polynomial). What happens to the SVM's performance if we use kernels other than the default 'rbf'?
for i in ['linear', 'poly', 'sigmoid']:
    # Train one SVM per kernel on the normalized training data
    support_vector_machine_2 = SVC(kernel=i)
    support_vector_machine_2.fit(X_train_normalized, y_train)
    print("\n***Kernel used: ", i)
    evaluate_model_performance(classifier=support_vector_machine_2,
                               X_test=X_test_normalized,
                               y_test=y_test, labels=[0,1])
==> Surprisingly, the 'linear' kernel offers slightly better performance than the default 'rbf' kernel! Meanwhile, the 'sigmoid' kernel offers similar performance, and the 'poly' kernel offers the worst performance of all 4 kernels. The fact that the 'linear' kernel performs best compared to the more complex kernels implies that sometimes a simpler kernel is the right kernel to use for a Support Vector Machine!
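To compare kernels side by side rather than reading four separate reports, the accuracies can be collected in a dictionary. The sketch below is a variation under assumptions, not a rerun of the experiment above: it uses the data bundled with scikit-learn, and it swaps the mean normalization from this post for a StandardScaler inside a pipeline, so the numbers will not match the ones reported here.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Bundled copy of the data set, split 70/30 as in the post.
X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

kernel_accuracy = {}
for kernel in ["rbf", "linear", "poly", "sigmoid"]:
    # The pipeline fits the scaler on the training data only,
    # then applies the same scaling when scoring the test data.
    model = make_pipeline(StandardScaler(), SVC(kernel=kernel))
    model.fit(X_tr, y_tr)
    kernel_accuracy[kernel] = model.score(X_te, y_te)

# Print the kernels from most to least accurate.
for kernel, accuracy in sorted(kernel_accuracy.items(),
                               key=lambda item: item[1], reverse=True):
    print("%-8s %.3f" % (kernel, accuracy))
```

A single train/test split gives only one estimate per kernel, so small gaps between kernels should be read with caution.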
After experimenting with different classification models on the Breast Cancer data set, there are a few noteworthy points that I want to make: