This is a demonstration of the power of the Support Vector Machine (SVM) in recognizing optical characters. We will implement an Optical Character Recognition (OCR) system to identify optical characters. Imagine a scanned document saved onto a computer: software that implements OCR can “read” the scanned image of the document, split it into a matrix in which each cell contains the image of a single character on the document, and, ultimately, put the identified characters together to create a digital copy of the scanned document.

The data set that we will be using was donated to the UCI Machine Learning Repository. It is called the Letter Recognition Data Set and is publicly available at https://archive.ics.uci.edu/ml/datasets/Letter+Recognition. To get a better sense of what to expect, let’s take a look at the description and context of the data set.

This data set contains 20000 samples. Each sample has 16 attributes, which contain information about a character image, and 1 output, which is the character label of the sample. Each sample represents one of the 26 uppercase letters of the English alphabet in one of various fonts, and each font was randomly distorted to simulate real-world applications, in which letters are not perfectly straight when scanned. We are NOT dealing with the original optical image of each character; rather, each character image has been processed and converted into 16 numerical attributes on which we will build our classifier model.

Note: This is not an example of how to process an optical image into numerical attributes; rather, it is assumed that the images of the scanned characters have already been processed, and our task is to build a support vector machine classifier to classify each sample based on its numerical attributes. A summary of the data set and its column names is listed below:
1. lettr : capital letter (26 values from A to Z)
2. x-box : horizontal position of box (integer)
3. y-box : vertical position of box (integer)
4. width : width of box (integer)
5. high  : height of box (integer)
6. onpix : total # on pixels (integer)
7. x-bar : mean x of on pixels in box (integer)
8. y-bar : mean y of on pixels in box (integer)
9. x2bar : mean x variance (integer)
10. y2bar: mean y variance (integer)
11. xybar: mean x y correlation (integer)
12. x2ybr: mean of x * x * y (integer)
13. xy2br: mean of x * y * y (integer)
14. x-ege: mean edge count left to right (integer)
15. xegvy: correlation of x-ege with y (integer)
16. y-ege: mean edge count bottom to top (integer)
17. yegvx: correlation of y-ege with x (integer)

Step 1: Load the data set

First we need to download the data set into a local directory from the link: https://archive.ics.uci.edu/ml/datasets/Letter+Recognition. Once we have the data file, we need to load it into R (note: a header row was added to the original data file to make it easier to see what each column represents):

letters <- read.csv("letter-recognition.data", header = TRUE) # File contains a header row
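
Alternatively, if we were working directly with the raw UCI file, which has no header row, the 17 column names from the list above could be assigned manually. A minimal sketch; note that stringsAsFactors = TRUE ensures lettr is read in as a factor on R 4.0 and later, where that is no longer the default:

# read the raw file (no header row) and assign the column names listed above
col_names <- c("lettr", "x.box", "y.box", "width", "high", "onpix",
               "x.bar", "y.bar", "x2bar", "y2bar", "xybar", "x2ybr",
               "xy2br", "x.ege", "xegvy", "y.ege", "yegvx")
letters <- read.csv("letter-recognition.data", header = FALSE,
                    col.names = col_names, stringsAsFactors = TRUE)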

Step 2: Explore and Prepare the data set

str(letters)
## 'data.frame':    20000 obs. of  17 variables:
##  $ lettr: Factor w/ 26 levels "A","B","C","D",..: 20 9 4 14 7 19 2 1 10 13 ...
##  $ x.box: int  2 5 4 7 2 4 4 1 2 11 ...
##  $ y.box: int  8 12 11 11 1 11 2 1 2 15 ...
##  $ width: int  3 3 6 6 3 5 5 3 4 13 ...
##  $ high : int  5 7 8 6 1 8 4 2 4 9 ...
##  $ onpix: int  1 2 6 3 1 3 4 1 2 7 ...
##  $ x.bar: int  8 10 10 5 8 8 8 8 10 13 ...
##  $ y.bar: int  13 5 6 9 6 8 7 2 6 2 ...
##  $ x2bar: int  0 5 2 4 6 6 6 2 2 6 ...
##  $ y2bar: int  6 4 6 6 6 9 6 2 6 2 ...
##  $ xybar: int  6 13 10 4 6 5 7 8 12 12 ...
##  $ x2ybr: int  10 3 3 4 5 6 6 2 4 1 ...
##  $ xy2br: int  8 9 7 10 9 6 6 8 8 9 ...
##  $ x.ege: int  0 2 3 6 1 0 2 1 1 8 ...
##  $ xegvy: int  8 8 7 10 7 8 8 6 6 1 ...
##  $ y.ege: int  0 4 3 2 5 9 7 2 1 1 ...
##  $ yegvx: int  8 10 9 8 10 7 10 7 7 8 ...

As we can see by calling str() on the data set, it has exactly 20000 samples and 17 columns. The first column is the character label of each sample and has 26 “levels”, which represent the 26 letters of the alphabet; the rest of the columns are the statistical values of each sample image.
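
It is also worth confirming that the 26 classes are roughly balanced; according to the data set’s documentation, each letter has somewhere in the neighborhood of 700 to 800 samples. A quick check:

table(letters$lettr) # number of samples per letter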

Since Support Vector Machines only work on numerical data and our data has already been converted into integer values, little processing needs to be done. When dealing with a data set that contains numerical values with inconsistent ranges, it is usually a good idea to scale the data before using it to train a support vector machine model. However, the R package that we will be using automatically scales the data for us, so again, little processing is needed with this particular data set!
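
For illustration only, a manual standardization could look like the sketch below; we will not actually use letters_scaled, since ksvm() scales the data itself via its scaled argument (TRUE by default):

# center each attribute to mean 0 and rescale to standard deviation 1
letters_scaled <- letters
letters_scaled[ , 2:17] <- scale(letters[ , 2:17])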

Let’s split the data set into training (80%) and testing (20%) sets:

letters_train <- letters[1:16000, ] # first 80% of the samples are training data set
letters_test <- letters[16001:20000, ] # last 20% of sample are testing data set
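
This sequential split works because the rows of this particular file are already in random order. If they were not, a random split would be safer; a hypothetical sketch:

set.seed(123) # make the random split reproducible
train_idx <- sample(nrow(letters), 0.8 * nrow(letters))
letters_train <- letters[train_idx, ] # random 80% for training
letters_test <- letters[-train_idx, ] # remaining 20% for testing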

Now that we have the training and testing data sets, let’s go ahead and build our SVM model!

Step 3: Train SVM model on the data set

The package that we will be using to train our SVM model is called kernlab; more information about this package can be found at https://cran.r-project.org/web/packages/kernlab/index.html. First we need to install and load the package into R:

#install.packages('kernlab')
library(kernlab)

Once we have the package loaded, let’s initialize and train our SVM model. We will try our hand with a simple linear kernel first:

letter_classifier <- ksvm(lettr ~ ., data = letters_train, kernel = "vanilladot")
##  Setting default kernel parameters
letter_classifier
## Support Vector Machine object of class "ksvm" 
## 
## SV type: C-svc  (classification) 
##  parameter : cost C = 1 
## 
## Linear (vanilla) kernel function. 
## 
## Number of Support Vectors : 7037 
## 
## Objective Function Value : -14.1746 -20.0072 -23.5628 -6.2009 -7.5524 -32.7694 -49.9786 -18.1824 -62.1111 -32.7284 -16.2209 -32.2837 -28.9777 -51.2195 -13.276 -35.6217 -30.8612 -16.5256 -14.6811 -32.7475 -30.3219 -7.7956 -11.8138 -32.3463 -13.1262 -9.2692 -153.1654 -52.9678 -76.7744 -119.2067 -165.4437 -54.6237 -41.9809 -67.2688 -25.1959 -27.6371 -26.4102 -35.5583 -41.2597 -122.164 -187.9178 -222.0856 -21.4765 -10.3752 -56.3684 -12.2277 -49.4899 -9.3372 -19.2092 -11.1776 -100.2186 -29.1397 -238.0516 -77.1985 -8.3339 -4.5308 -139.8534 -80.8854 -20.3642 -13.0245 -82.5151 -14.5032 -26.7509 -18.5713 -23.9511 -27.3034 -53.2731 -11.4773 -5.12 -13.9504 -4.4982 -3.5755 -8.4914 -40.9716 -49.8182 -190.0269 -43.8594 -44.8667 -45.2596 -13.5561 -17.7664 -87.4105 -107.1056 -37.0245 -30.7133 -112.3218 -32.9619 -27.2971 -35.5836 -17.8586 -5.1391 -43.4094 -7.7843 -16.6785 -58.5103 -159.9936 -49.0782 -37.8426 -32.8002 -74.5249 -133.3423 -11.1638 -5.3575 -12.438 -30.9907 -141.6924 -54.2953 -179.0114 -99.8896 -10.288 -15.1553 -3.7815 -67.6123 -7.696 -88.9304 -47.6448 -94.3718 -70.2733 -71.5057 -21.7854 -12.7657 -7.4383 -23.502 -13.1055 -239.9708 -30.4193 -25.2113 -136.2795 -140.9565 -9.8122 -34.4584 -6.3039 -60.8421 -66.5793 -27.2816 -214.3225 -34.7796 -16.7631 -135.7821 -160.6279 -45.2949 -25.1023 -144.9059 -82.2352 -327.7154 -142.0613 -158.8821 -32.2181 -32.8887 -52.9641 -25.4937 -47.9936 -6.8991 -9.7293 -36.436 -70.3907 -187.7611 -46.9371 -89.8103 -143.4213 -624.3645 -119.2204 -145.4435 -327.7748 -33.3255 -64.0607 -145.4831 -116.5903 -36.2977 -66.3762 -44.8248 -7.5088 -217.9246 -12.9699 -30.504 -2.0369 -6.126 -14.4448 -21.6337 -57.3084 -20.6915 -184.3625 -20.1052 -4.1484 -4.5344 -0.828 -121.4411 -7.9486 -58.5604 -21.4878 -13.5476 -5.646 -15.629 -28.9576 -20.5959 -76.7111 -27.0119 -94.7101 -15.1713 -10.0222 -7.6394 -1.5784 -87.6952 -6.2239 -99.3711 -101.0906 -45.6639 -24.0725 -61.7702 -24.1583 -52.2368 -234.3264 -39.9749 -48.8556 -34.1464 -20.9664 -11.4525 -123.0277 -6.4903 -5.1865 -8.8016 -9.4618 -21.7742 -24.2361 -123.3984 -31.4404 -88.3901 -30.0924 -13.8198 -9.2701 -3.0823 -87.9624 -6.3845 -13.968 -65.0702 -105.523 -13.7403 -13.7625 -50.4223 -2.933 -8.4289 -80.3381 -36.4147 -112.7485 -4.1711 -7.8989 -1.2676 -90.8037 -21.4919 -7.2235 -47.9557 -3.383 -20.433 -64.6138 -45.5781 -56.1309 -6.1345 -18.6307 -2.374 -72.2553 -111.1885 -106.7664 -23.1323 -19.3765 -54.9819 -34.2953 -64.4756 -20.4115 -6.689 -4.378 -59.141 -34.2468 -58.1509 -33.8665 -10.6902 -53.1387 -13.7478 -20.1987 -55.0923 -3.8058 -60.0382 -235.4841 -12.6837 -11.7407 -17.3058 -9.7167 -65.8498 -17.1051 -42.8131 -53.1054 -25.0437 -15.302 -44.0749 -16.9582 -62.9773 -5.204 -5.2963 -86.1704 -3.7209 -6.3445 -1.1264 -122.5771 -23.9041 -355.0145 -31.1013 -32.619 -4.9664 -84.1048 -134.5957 -72.8371 -23.9002 -35.3077 -11.7119 -22.2889 -1.8598 -59.2174 -8.8994 -150.742 -1.8533 -1.9711 -9.9676 -0.5207 -26.9229 -30.429 -5.6289 
## Training error : 0.130062
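
The summary above also shows that the model was trained with the default cost parameter C = 1, which controls how heavily misclassified training points are penalized. If we wanted to experiment, it could be set explicitly; a hypothetical variation with an illustrative, untuned value:

# a larger cost penalizes training errors more heavily (and may overfit)
letter_classifier_C5 <- ksvm(lettr ~ ., data = letters_train,
                             kernel = "vanilladot", C = 5)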

Step 4: Evaluate SVM model performance

Once we have our SVM model trained, we can evaluate its performance on the test data set. The predict() function can be used to predict the class labels of the test samples. After we have the predicted class labels, we can create a cross table to evaluate the accuracy of the model:

letter_predictions <- predict(letter_classifier, letters_test[ , 2:17])
table(letter_predictions, letters_test$lettr)
##                   
## letter_predictions   A   B   C   D   E   F   G   H   I   J   K   L   M   N
##                  A 144   0   0   0   0   0   0   0   0   1   0   0   1   2
##                  B   0 121   0   5   2   0   1   2   0   0   1   0   1   0
##                  C   0   0 120   0   4   0  10   2   2   0   1   3   0   0
##                  D   2   2   0 156   0   1   3  10   4   3   4   3   0   5
##                  E   0   0   5   0 127   3   1   1   0   0   3   4   0   0
##                  F   0   0   0   0   0 138   2   2   6   0   0   0   0   0
##                  G   1   1   2   1   9   2 123   2   0   0   1   2   1   0
##                  H   0   0   0   1   0   1   0 102   0   2   3   2   3   4
##                  I   0   1   0   0   0   1   0   0 141   8   0   0   0   0
##                  J   0   1   0   0   0   1   0   2   5 128   0   0   0   0
##                  K   1   1   9   0   0   0   2   5   0   0 118   0   0   2
##                  L   0   0   0   0   2   0   1   1   0   0   0 133   0   0
##                  M   0   0   1   1   0   0   1   1   0   0   0   0 135   4
##                  N   0   0   0   0   0   1   0   1   0   0   0   0   0 145
##                  O   1   0   2   1   0   0   1   2   0   1   0   0   0   1
##                  P   0   0   0   1   0   2   1   0   0   0   0   0   0   0
##                  Q   0   0   0   0   0   0   8   2   0   0   0   3   0   0
##                  R   0   7   0   0   1   0   3   8   0   0  13   0   0   1
##                  S   1   1   0   0   1   0   3   0   1   1   0   1   0   0
##                  T   0   0   0   0   3   2   0   0   0   0   1   0   0   0
##                  U   1   0   3   1   0   0   0   2   0   0   0   0   0   0
##                  V   0   0   0   0   0   1   3   4   0   0   0   0   1   2
##                  W   0   0   0   0   0   0   1   0   0   0   0   0   2   0
##                  X   0   1   0   0   2   0   0   1   3   0   1   6   0   0
##                  Y   3   0   0   0   0   0   0   1   0   0   0   0   0   0
##                  Z   2   0   0   0   1   0   0   0   3   4   0   0   0   0
##                   
## letter_predictions   O   P   Q   R   S   T   U   V   W   X   Y   Z
##                  A   2   0   5   0   1   1   1   0   1   0   0   1
##                  B   0   2   2   3   5   0   0   2   0   1   0   0
##                  C   2   0   0   0   0   0   0   0   0   0   0   0
##                  D   5   3   1   4   0   0   0   0   0   3   3   1
##                  E   0   0   2   0  10   0   0   0   0   2   0   3
##                  F   0  16   0   0   3   0   0   1   0   1   2   0
##                  G   1   2   8   2   4   3   0   0   0   1   0   0
##                  H  20   0   2   3   0   3   0   2   0   0   1   0
##                  I   0   1   0   0   3   0   0   0   0   5   1   1
##                  J   1   1   3   0   2   0   0   0   0   1   0   6
##                  K   0   1   0   7   0   1   3   0   0   5   0   0
##                  L   0   0   1   0   5   0   0   0   0   0   0   1
##                  M   0   0   0   0   0   0   3   0   8   0   0   0
##                  N   0   0   0   3   0   0   1   0   2   0   0   0
##                  O  99   3   3   0   0   0   3   0   0   0   0   0
##                  P   2 130   0   0   0   0   0   0   0   0   1   0
##                  Q   3   1 124   0   5   0   0   0   0   0   2   0
##                  R   1   1   0 138   0   1   0   1   0   0   0   0
##                  S   0   0  14   0 101   3   0   0   0   2   0  10
##                  T   0   0   0   0   3 133   1   0   0   0   2   2
##                  U   1   0   0   0   0   0 152   0   0   1   1   0
##                  V   1   0   3   1   0   0   0 126   1   0   4   0
##                  W   0   0   0   0   0   0   4   4 127   0   0   0
##                  X   1   0   0   0   1   0   0   0   0 137   1   1
##                  Y   0   7   0   0   0   3   0   0   0   0 127   0
##                  Z   0   0   0   0  18   3   0   0   0   0   0 132

With the cross table, it’s much easier to see how our SVM model performs on the test data set. The rows of the table indicate the predicted labels; the columns indicate the true labels. Thus, the values on the diagonal are the numbers of predictions that exactly match the true labels. For example, the model correctly classified 120 ‘C’ samples, 156 ‘D’ samples, and 132 ‘Z’ samples. Anything that falls off the diagonal is a misclassified sample. For example, there are 5 ‘D’ samples misclassified as ‘B’, 2 ‘I’ samples misclassified as ‘C’, and 3 ‘T’ samples misclassified as ‘Z’.
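
The cross table can also be stored and indexed directly to look up any particular cell; for example, the number of true ‘D’ samples predicted as ‘B’:

confusion <- table(letter_predictions, letters_test$lettr)
confusion["B", "D"] # true 'D' samples predicted as 'B' (5, per the table above)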

A simple summary can also be created to see how many samples were classified correctly:

matches <- letter_predictions == letters_test$lettr
table(matches)
## matches
## FALSE  TRUE 
##   643  3357
prop.table(table(matches))
## matches
##   FALSE    TRUE 
## 0.16075 0.83925

==> Our model correctly classified 3357 out of 4000 samples, an accuracy of about 83.93%, and misclassified 643 out of 4000 samples, a failure rate of about 16.08%. So even with a simple linear kernel, our SVM is already doing fairly well!
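
As a quick sanity check, the overall accuracy can also be computed in one line, and the stored cross table gives a per-letter breakdown; a minimal sketch, reusing the confusion table created above:

mean(letter_predictions == letters_test$lettr) # overall accuracy, about 0.839
diag(confusion) / colSums(confusion) # fraction of each true letter classified correctly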

The kernlab package also implements several other kernels, such as the Gaussian RBF, Laplacian, hyperbolic tangent, and spline kernels. Even though our previous model performed fairly well with just a simple linear kernel, we may still be able to improve its performance by choosing a more complex kernel. Let’s try our hand at a different kernel!

Step 5: Improve SVM model performance

We will use more complex kernels to see how much we can improve our SVM model’s performance. For convenience, we will also define a function to make the evaluation process more streamlined. We will call the function evaluateSVM(), and it accepts the SVM classifier and the test data set as arguments:

# define a helper function to evaluate an SVM classifier on a test data set
evaluateSVM <- function(classifier, test_data) {
  predictions <- predict(classifier, test_data)  # predicted class labels
  print(table(predictions, test_data[ , 1]))     # cross table: predicted vs. true labels
  matches <- predictions == test_data[ , 1]      # TRUE where prediction matches true label
  print(table(matches))                          # counts of correct/incorrect predictions
  prop.table(table(matches))                     # proportions of correct/incorrect predictions
}

Now we will try our luck with three different kernels: Gaussian RBF, Laplacian, and polynomial.

# first we try a Gaussian RBF kernel
letter_RBFClassifier <- ksvm(lettr ~ ., data = letters_train, kernel = "rbfdot")
evaluateSVM(letter_RBFClassifier, test_data = letters_test)
##             
## predictions    A   B   C   D   E   F   G   H   I   J   K   L   M   N   O
##            A 151   0   0   0   0   0   0   0   0   0   0   0   0   0   0
##            B   0 128   0   3   0   1   0   2   0   0   0   1   2   1   0
##            C   0   0 132   0   3   0   1   0   2   0   0   1   0   0   0
##            D   1   1   0 161   0   0   2   8   2   3   1   0   0   1   1
##            E   0   0   3   0 137   2   0   0   0   1   0   4   0   0   0
##            F   0   0   0   0   0 148   0   0   3   0   0   0   0   0   0
##            G   0   0   2   0   8   0 154   2   0   0   0   2   2   0   2
##            H   0   1   0   1   0   0   2 125   0   1   2   1   1   3   0
##            I   0   0   0   0   0   0   0   0 151   3   0   0   0   0   0
##            J   0   0   0   0   0   0   0   0   3 136   0   0   0   0   0
##            K   0   0   1   0   0   0   0   5   0   0 132   0   0   1   0
##            L   0   0   0   0   0   0   1   0   0   0   0 141   0   0   0
##            M   0   0   0   0   0   0   1   1   0   0   0   0 138   1   0
##            N   0   0   0   0   0   2   0   0   0   0   0   0   0 150   0
##            O   0   0   2   0   0   0   0   0   0   1   0   0   0   5 129
##            P   0   0   0   0   0   0   0   0   1   0   0   0   0   0   0
##            Q   0   0   0   0   0   0   0   1   0   0   0   0   0   0   3
##            R   0   3   1   1   0   0   2   5   0   0   9   1   0   3   2
##            S   0   2   0   0   0   0   0   0   1   2   0   2   0   0   0
##            T   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
##            U   0   0   1   1   0   0   0   1   0   0   0   0   0   0   0
##            V   0   0   0   0   0   0   0   0   0   0   0   0   1   1   0
##            W   0   0   0   0   0   0   1   0   0   0   0   0   0   0   2
##            X   0   1   0   0   1   0   0   0   0   0   2   4   0   0   0
##            Y   4   0   0   0   0   0   0   1   0   0   0   0   0   0   0
##            Z   0   0   0   0   3   0   0   0   2   1   0   0   0   0   0
##             
## predictions    P   Q   R   S   T   U   V   W   X   Y   Z
##            A   0   3   0   0   1   0   0   0   0   0   0
##            B   2   1   3   3   0   0   4   1   1   0   0
##            C   0   0   0   0   0   0   0   0   0   0   0
##            D   3   1   3   0   2   0   0   0   2   3   0
##            E   1   0   0   2   1   0   0   0   0   0   2
##            F  11   0   0   1   0   0   1   0   0   0   0
##            G   1   0   0   0   2   0   0   0   0   0   0
##            H   1   1   0   0   2   0   0   0   0   0   0
##            I   0   0   0   0   0   0   0   0   1   0   0
##            J   0   0   0   0   0   0   0   0   0   0   3
##            K   0   0   3   0   0   0   0   0   2   0   0
##            L   0   0   0   1   0   0   0   0   0   0   0
##            M   0   0   0   0   0   1   0   2   0   0   0
##            N   0   0   2   0   0   0   0   1   0   0   0
##            O   2   4   0   0   0   1   0   0   0   0   0
##            P 141   0   0   0   0   0   0   0   0   0   0
##            Q   3 158   0   0   0   0   0   0   0   0   0
##            R   1   0 150   0   1   0   0   0   0   0   0
##            S   0   0   0 152   0   0   0   0   0   0   2
##            T   0   0   0   0 140   0   0   0   0   1   0
##            U   0   0   0   0   0 161   0   0   0   1   0
##            V   0   0   0   0   0   2 131   0   0   1   0
##            W   0   0   0   0   0   3   0 135   0   0   0
##            X   0   0   0   1   1   0   0   0 153   1   1
##            Y   2   0   0   0   1   0   0   0   0 138   0
##            Z   0   0   0   1   0   0   0   0   0   0 150
## matches
## FALSE  TRUE 
##   278  3722
## matches
##  FALSE   TRUE 
## 0.0695 0.9305
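
Note that for the RBF kernel, kernlab estimates the kernel width sigma automatically (kpar = "automatic" is the default). It could also be set by hand; a hypothetical variation with an illustrative, untuned value:

letter_RBFClassifier2 <- ksvm(lettr ~ ., data = letters_train,
                              kernel = "rbfdot", kpar = list(sigma = 0.05))
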
letter_LaplacianClassifier <- ksvm(lettr ~ ., data = letters_train, kernel = "laplacedot")
evaluateSVM(letter_LaplacianClassifier, test_data = letters_test)
##             
## predictions    A   B   C   D   E   F   G   H   I   J   K   L   M   N   O
##            A 146   0   0   0   0   0   0   0   0   1   0   0   0   0   0
##            B   0 128   0   3   1   3   1   4   0   1   2   1   2   0   0
##            C   0   0 120   0   1   0   1   0   2   0   1   2   0   0   0
##            D   1   1   0 159   0   1   5  11   5   4   5   0   0   2   1
##            E   0   0   8   0 134   2   0   0   0   1   0   7   0   0   0
##            F   0   0   0   0   0 143   2   1   3   0   0   0   0   0   0
##            G   0   0   4   0   8   1 144   4   0   0   0   2   2   0   2
##            H   0   0   0   1   0   0   1 110   0   1   1   1   1   2   1
##            I   0   0   0   0   0   0   0   0 143   3   0   0   0   0   0
##            J   0   0   0   0   0   0   0   1   3 130   0   0   0   0   0
##            K   0   0   4   0   0   0   0   3   0   0 121   0   0   0   0
##            L   0   0   0   0   0   0   0   0   0   0   0 135   0   0   0
##            M   0   0   1   1   0   0   1   2   0   0   0   0 137   3   0
##            N   0   0   0   0   0   2   0   0   0   0   0   0   1 147   0
##            O   0   0   2   0   0   0   1   2   0   1   0   0   0   5 128
##            P   0   0   0   1   0   0   0   0   1   0   0   0   0   0   0
##            Q   0   0   0   1   0   0   1   2   0   1   1   2   0   0   2
##            R   1   4   0   1   1   0   3   6   0   0  11   1   0   4   2
##            S   2   2   0   0   1   0   1   0   3   2   0   1   0   0   0
##            T   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
##            U   1   0   2   0   0   0   0   2   0   0   1   0   0   0   0
##            V   0   0   1   0   0   0   1   1   0   0   0   0   0   1   0
##            W   1   0   0   0   0   1   1   0   0   0   0   0   1   1   3
##            X   0   1   0   0   1   0   1   0   5   1   3   5   0   0   0
##            Y   4   0   0   0   0   0   0   2   0   0   0   0   0   1   0
##            Z   0   0   0   0   5   0   0   0   0   2   0   0   0   0   0
##             
## predictions    P   Q   R   S   T   U   V   W   X   Y   Z
##            A   0   3   0   0   1   0   0   0   0   0   0
##            B   2   2   7   5   0   0   5   2   2   0   0
##            C   0   0   0   1   0   0   0   0   0   0   0
##            D   3   1   3   0   1   1   0   0   2   3   2
##            E   0   1   0   8   0   0   0   0   0   0   1
##            F  11   0   0   2   1   0   0   0   0   0   1
##            G   4   0   0   0   4   1   0   0   0   0   0
##            H   1   1   0   0   2   0   0   0   0   0   0
##            I   0   0   0   0   0   0   0   0   0   0   0
##            J   0   0   0   0   0   0   0   0   0   0   3
##            K   1   0   4   0   0   2   0   0   3   0   0
##            L   0   0   0   1   0   0   0   0   0   0   1
##            M   0   0   0   0   0   4   0   7   0   1   0
##            N   0   0   2   0   0   0   0   1   0   0   0
##            O   1   9   0   0   0   3   0   0   0   0   0
##            P 134   0   0   0   0   0   0   0   0   1   0
##            Q   2 151   0   0   0   0   0   0   0   1   1
##            R   1   0 145   0   3   0   1   0   0   0   0
##            S   0   0   0 141   2   0   0   0   0   0   6
##            T   0   0   0   2 135   0   0   0   0   3   1
##            U   0   0   0   0   0 152   0   0   2   0   0
##            V   0   0   0   0   0   0 125   0   0   3   0
##            W   0   0   0   0   0   5   5 129   0   1   0
##            X   0   0   0   0   1   0   0   0 150   1   1
##            Y   8   0   0   0   1   0   0   0   0 131   0
##            Z   0   0   0   1   0   0   0   0   0   0 141
## matches
## FALSE  TRUE 
##   441  3559
## matches
##   FALSE    TRUE 
## 0.11025 0.88975
letter_PolynomialClassifier <- ksvm(lettr ~ ., data = letters_train, kernel = "polydot")
##  Setting default kernel parameters
evaluateSVM(letter_PolynomialClassifier, test_data = letters_test)
##             
## predictions    A   B   C   D   E   F   G   H   I   J   K   L   M   N   O
##            A 144   0   0   0   0   0   0   0   0   1   0   0   1   2   2
##            B   0 121   0   5   2   0   1   2   0   0   1   0   1   0   0
##            C   0   0 120   0   4   0  10   2   2   0   1   3   0   0   2
##            D   2   2   0 156   0   1   3  10   4   3   4   3   0   5   5
##            E   0   0   5   0 127   3   1   1   0   0   3   4   0   0   0
##            F   0   0   0   0   0 138   2   2   6   0   0   0   0   0   0
##            G   1   1   2   1   9   2 123   2   0   0   1   2   1   0   1
##            H   0   0   0   1   0   1   0 102   0   2   3   2   3   4  20
##            I   0   1   0   0   0   1   0   0 141   8   0   0   0   0   0
##            J   0   1   0   0   0   1   0   2   5 128   0   0   0   0   1
##            K   1   1   9   0   0   0   2   5   0   0 118   0   0   2   0
##            L   0   0   0   0   2   0   1   1   0   0   0 133   0   0   0
##            M   0   0   1   1   0   0   1   1   0   0   0   0 135   4   0
##            N   0   0   0   0   0   1   0   1   0   0   0   0   0 145   0
##            O   1   0   2   1   0   0   1   2   0   1   0   0   0   1  99
##            P   0   0   0   1   0   2   1   0   0   0   0   0   0   0   2
##            Q   0   0   0   0   0   0   8   2   0   0   0   3   0   0   3
##            R   0   7   0   0   1   0   3   8   0   0  13   0   0   1   1
##            S   1   1   0   0   1   0   3   0   1   1   0   1   0   0   0
##            T   0   0   0   0   3   2   0   0   0   0   1   0   0   0   0
##            U   1   0   3   1   0   0   0   2   0   0   0   0   0   0   1
##            V   0   0   0   0   0   1   3   4   0   0   0   0   1   2   1
##            W   0   0   0   0   0   0   1   0   0   0   0   0   2   0   0
##            X   0   1   0   0   2   0   0   1   3   0   1   6   0   0   1
##            Y   3   0   0   0   0   0   0   1   0   0   0   0   0   0   0
##            Z   2   0   0   0   1   0   0   0   3   4   0   0   0   0   0
##             
## predictions    P   Q   R   S   T   U   V   W   X   Y   Z
##            A   0   5   0   1   1   1   0   1   0   0   1
##            B   2   2   3   5   0   0   2   0   1   0   0
##            C   0   0   0   0   0   0   0   0   0   0   0
##            D   3   1   4   0   0   0   0   0   3   3   1
##            E   0   2   0  10   0   0   0   0   2   0   3
##            F  16   0   0   3   0   0   1   0   1   2   0
##            G   2   8   2   4   3   0   0   0   1   0   0
##            H   0   2   3   0   3   0   2   0   0   1   0
##            I   1   0   0   3   0   0   0   0   5   1   1
##            J   1   3   0   2   0   0   0   0   1   0   6
##            K   1   0   7   0   1   3   0   0   5   0   0
##            L   0   1   0   5   0   0   0   0   0   0   1
##            M   0   0   0   0   0   3   0   8   0   0   0
##            N   0   0   3   0   0   1   0   2   0   0   0
##            O   3   3   0   0   0   3   0   0   0   0   0
##            P 130   0   0   0   0   0   0   0   0   1   0
##            Q   1 124   0   5   0   0   0   0   0   2   0
##            R   1   0 138   0   1   0   1   0   0   0   0
##            S   0  14   0 101   3   0   0   0   2   0  10
##            T   0   0   0   3 133   1   0   0   0   2   2
##            U   0   0   0   0   0 152   0   0   1   1   0
##            V   0   3   1   0   0   0 126   1   0   4   0
##            W   0   0   0   0   0   4   4 127   0   0   0
##            X   0   0   0   1   0   0   0   0 137   1   1
##            Y   7   0   0   0   3   0   0   0   0 127   0
##            Z   0   0   0  18   3   0   0   0   0   0 132
## matches
## FALSE  TRUE 
##   643  3357
## matches
##   FALSE    TRUE 
## 0.16075 0.83925
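
Interestingly, the polynomial kernel’s results are identical to the linear kernel’s. This is most likely because kernlab’s polydot defaults to degree 1, which is equivalent to a linear kernel. A higher degree could be tried; a hypothetical sketch with an illustrative, untuned value:

# polynomial kernel of degree 2 (the default degree of 1 is effectively linear)
letter_Poly2Classifier <- ksvm(lettr ~ ., data = letters_train,
                               kernel = "polydot", kpar = list(degree = 2))
evaluateSVM(letter_Poly2Classifier, test_data = letters_test)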

Conclusion:

So what do we get from this? First, the Support Vector Machine can be a really powerful machine learning technique that works wonders on numerical data sets. Even with a simple linear kernel, our model achieved about 84% accuracy; it was able to identify the letters of the alphabet based solely on the 16 statistical values extracted from their original images. This means our SVM model never actually got to “see” the image of each character; rather, it only got to “see” the “measurements” of the images.

Second, we were also able to significantly improve our SVM model’s performance by choosing more complex kernels. While the Gaussian RBF kernel provided a sizable improvement in accuracy (about 9 percentage points), the Laplacian kernel only provided a boost of about 5 percentage points, and the polynomial kernel did not improve the model’s performance at all (see the note above on its default degree). With that being said, for this particular data set, the Gaussian RBF kernel seems to provide the best performance!
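
As a possible next step, kernlab’s built-in k-fold cross-validation could be used to compare kernels more robustly than a single train/test split; a minimal sketch:

# 5-fold cross-validation error for the RBF model
letter_cvClassifier <- ksvm(lettr ~ ., data = letters_train,
                            kernel = "rbfdot", cross = 5)
cross(letter_cvClassifier) # cross-validation error estimate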