This is a small implementation of a filter for spam SMS messages. The dataset is the publicly available SMS Spam Collection, which can be downloaded at http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/ . It contains a set of SMS messages in which junk messages have been labeled “spam” and legitimate messages “ham”. The task is to train a Naive Bayes classifier to filter out the “spam” messages.

Step 1: Reading in the dataset

# NOTE: the dataset has been modified slightly so that it can be easier to work with
sms_raw <- read.csv('sms_spam.csv', stringsAsFactors = FALSE)

Step 2: Exploring and preparing data

str(sms_raw) # structure of dataset
'data.frame':   5559 obs. of  2 variables:
 $ type: chr  "ham" "ham" "ham" "spam" ...
 $ text: chr  "Hope you are having a good week. Just checking in" "K..give back my thanks." "Am also doing in cbe only. But have to pay." "complimentary 4 STAR Ibiza Holiday or £10,000 cash needs your URGENT collection. 09066364349 NOW from Landline not to lose out"| __truncated__ ...
sms_raw$type <- factor(sms_raw$type) # convert $type into factor variable
str(sms_raw) # structure of dataset
'data.frame':   5559 obs. of  2 variables:
 $ type: Factor w/ 2 levels "ham","spam": 1 1 1 2 2 1 1 1 2 1 ...
 $ text: chr  "Hope you are having a good week. Just checking in" "K..give back my thanks." "Am also doing in cbe only. But have to pay." "complimentary 4 STAR Ibiza Holiday or £10,000 cash needs your URGENT collection. 09066364349 NOW from Landline not to lose out"| __truncated__ ...
table(sms_raw$type)

 ham spam 
4812  747 
#install.packages("tm", dependencies = TRUE)
#install.packages("SnowballC") # this package helps with stemming words; for example, it reduces "installed", "installing", and "installs" to a common stem
library(tm)
Loading required package: NLP
library(SnowballC)
sms_corpus <- VCorpus(VectorSource(sms_raw$text)) # create a corpus for the text
print(sms_corpus)
<<VCorpus>>
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 5559
inspect(sms_corpus[1:2])
<<VCorpus>>
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 2

[[1]]
<<PlainTextDocument>>
Metadata:  7
Content:  chars: 49

[[2]]
<<PlainTextDocument>>
Metadata:  7
Content:  chars: 23
lapply(sms_corpus[1:2], as.character) # print out first 2 sms text
$`1`
[1] "Hope you are having a good week. Just checking in"

$`2`
[1] "K..give back my thanks."
# We begin cleaning and standardizing the text messages in our corpus
# +transform all raw text to lowercase
sms_corpus_clean <- tm_map(sms_corpus, content_transformer(tolower))
lapply(sms_corpus[1:2], as.character) # print out first 2 sms text
$`1`
[1] "Hope you are having a good week. Just checking in"

$`2`
[1] "K..give back my thanks."
lapply(sms_corpus_clean[1:2], as.character) # print out first 2 sms text
$`1`
[1] "hope you are having a good week. just checking in"

$`2`
[1] "k..give back my thanks."
# + remove numbers from our corpus texts
sms_corpus_clean <- tm_map(sms_corpus_clean, removeNumbers)
# + remove stopwords since they reveal little information about a text
sms_corpus_clean <- tm_map(sms_corpus_clean, removeWords, stopwords("english"))
# + remove punctuation
sms_corpus_clean <- tm_map(sms_corpus_clean, removePunctuation)
# + stem the words in each sms text to their root
sms_corpus_clean <- tm_map(sms_corpus_clean, stemDocument)
# + now remove whitespaces left behind from the cleaning process above
sms_corpus_clean <- tm_map(sms_corpus_clean, stripWhitespace)
lapply(sms_corpus[1:10], as.character)
$`1`
[1] "Hope you are having a good week. Just checking in"

$`2`
[1] "K..give back my thanks."

$`3`
[1] "Am also doing in cbe only. But have to pay."

$`4`
[1] "complimentary 4 STAR Ibiza Holiday or £10,000 cash needs your URGENT collection. 09066364349 NOW from Landline not to lose out! Box434SK38WP150PPM18+"

$`5`
[1] "okmail: Dear Dave this is your final notice to collect your 4* Tenerife Holiday or #5000 CASH award! Call 09061743806 from landline. TCs SAE Box326 CW25WX 150ppm"

$`6`
[1] "Aiya we discuss later lar... Pick u up at 4 is it?"

$`7`
[1] "Are you this much buzy"

$`8`
[1] "Please ask mummy to call father"

$`9`
[1] "Marvel Mobile Play the official Ultimate Spider-man game (£4.50) on ur mobile right now. Text SPIDER to 83338 for the game & we ll send u a FREE 8Ball wallpaper"

$`10`
[1] "fyi I'm at usf now, swing by the room whenever"
lapply(sms_corpus_clean[1:10], as.character)
$`1`
[1] "hope good week just check"

$`2`
[1] "kgive back thank"

$`3`
[1] " also cbe pay"

$`4`
[1] "complimentari star ibiza holiday â cash need urgent collect now landlin lose boxskwpppm"

$`5`
[1] "okmail dear dave final notic collect tenerif holiday cash award call landlin tcs sae box cwwx ppm"

$`6`
[1] "aiya discuss later lar pick u "

$`7`
[1] " much buzi"

$`8`
[1] "pleas ask mummi call father"

$`9`
[1] "marvel mobil play offici ultim spiderman game â ur mobil right now text spider game ll send u free ball wallpap"

$`10`
[1] "fyi usf now swing room whenev"
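Notice that message 2 became "kgive back thank": removePunctuation deletes punctuation outright, so "K..give" collapses into a single token. One possible workaround, sketched here with base R only, is to replace punctuation with a space instead of deleting it. The helper name replace_punctuation is our own invention and is not part of tm; the pipeline above did not use it.

```r
# Hypothetical helper (not part of tm): swap each run of punctuation
# for a single space instead of deleting it.
replace_punctuation <- function(x) {
  gsub("[[:punct:]]+", " ", x)
}

# Deleting punctuation glues "k" and "give" together ...
deleted <- gsub("[[:punct:]]+", "", "k..give back my thanks.")
# ... while replacing it with a space keeps them as separate words.
replaced <- replace_punctuation("k..give back my thanks.")

print(deleted)
print(replaced)
```

If we wanted to try this on the corpus, the helper could be wrapped with content_transformer() and passed to tm_map() before the stripWhitespace step, which would then clean up the extra spaces.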
We have to create a document-term matrix (DTM), which is basically a table that has the text messages as rows and the words in the corpus as columns.
sms_dtm <- DocumentTermMatrix(sms_corpus_clean)
inspect(sms_dtm[1:30,815:825]) # take a peek at the new DTM
<<DocumentTermMatrix (documents: 30, terms: 11)>>
Non-/sparse entries: 4/326
Sparsity           : 99%
Maximal term length: 11
Weighting          : term frequency (tf)

    Terms
Docs california call callback callcost callcoz calld calldrov caller callertun callfreefon callin
  1           0    0        0        0       0     0        0      0         0           0      0
  2           0    0        0        0       0     0        0      0         0           0      0
  3           0    0        0        0       0     0        0      0         0           0      0
  4           0    0        0        0       0     0        0      0         0           0      0
  5           0    1        0        0       0     0        0      0         0           0      0
  6           0    0        0        0       0     0        0      0         0           0      0
  7           0    0        0        0       0     0        0      0         0           0      0
  8           0    1        0        0       0     0        0      0         0           0      0
  9           0    0        0        0       0     0        0      0         0           0      0
  10          0    0        0        0       0     0        0      0         0           0      0
  11          0    0        0        0       0     0        0      0         0           0      0
  12          0    0        0        0       0     0        0      0         0           0      0
  13          0    1        0        0       0     0        0      0         0           0      0
  14          0    0        0        0       0     0        0      0         0           0      0
  15          0    0        0        0       0     0        0      0         0           0      0
  16          0    0        0        0       0     0        0      0         0           0      0
  17          0    0        0        0       0     0        0      0         0           0      0
  18          0    0        0        0       0     0        0      0         0           0      0
  19          0    0        0        0       0     0        0      0         0           0      0
  20          0    0        0        0       0     0        0      0         0           0      0
  21          0    0        0        0       0     0        0      0         0           0      0
  22          0    0        0        0       0     0        0      0         0           0      0
  23          0    0        0        0       0     0        0      0         0           0      0
  24          0    0        0        0       0     0        0      0         0           0      0
  25          0    0        0        0       0     0        0      0         0           0      0
  26          0    0        0        0       0     0        0      0         0           0      0
  27          0    1        0        0       0     0        0      0         0           0      0
  28          0    0        0        0       0     0        0      0         0           0      0
  29          0    0        0        0       0     0        0      0         0           0      0
  30          0    0        0        0       0     0        0      0         0           0      0
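To make the idea of a document-term matrix concrete, here is a minimal base R sketch that builds one by hand for three made-up messages. It only illustrates the rows-by-words layout; it is not how tm builds its sparse matrix internally.

```r
# Three toy "cleaned" messages (made up for illustration)
docs <- c("hope good week", "call now free", "free call")

# Split each message into words and collect the sorted vocabulary
tokens <- strsplit(docs, " ", fixed = TRUE)
vocab <- sort(unique(unlist(tokens)))

# Count every vocabulary word in every message.
# vapply returns words x messages, so we transpose to messages x words.
toy_dtm <- t(vapply(tokens,
                    function(tok) as.integer(table(factor(tok, levels = vocab))),
                    integer(length(vocab))))
dimnames(toy_dtm) <- list(Docs = as.character(seq_along(docs)), Terms = vocab)

print(toy_dtm)
```

Most cells are zero even in this tiny example, which is why the real DTM above reports a sparsity of 99%.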
# split our DocumentTermMatrix into test and training dataset
sms_dtm_train <- sms_dtm[1:4169, ]
sms_dtm_test <- sms_dtm[4170:5559, ]
# also extract the matching labels from the raw data frame
sms_train_labels <- sms_raw[1:4169, ]$type
sms_test_labels <- sms_raw[4170:5559, ]$type
prop.table(table(sms_train_labels)) # show percentage of ham and spam in training dataset
sms_train_labels
      ham      spam 
0.8647158 0.1352842 
prop.table(table(sms_test_labels)) # show percentage of ham and spam in test dataset
sms_test_labels
      ham      spam 
0.8683453 0.1316547 
A word cloud is a visualization technique that displays words in a cloud, either in random order or ordered by frequency; words with higher frequency appear larger than those with lower frequency.
#install.packages("wordcloud")
library(wordcloud)
Loading required package: RColorBrewer
wordcloud(sms_corpus_clean, min.freq = 50, scale = c(4,.4), random.order = FALSE)

It’s also very helpful to compare the word clouds of spam and ham messages.

spam <- subset(sms_raw, type == "spam")
ham <- subset(sms_raw, type == "ham")
wordcloud(spam$text, min.freq = 20, scale = c(4,0.3), random.order = FALSE)

wordcloud(ham$text, min.freq = 40, scale = c(4,0.1), random.order = FALSE)

As of right now, we have over 6500 features, one for each word that appears in the DTM. However, a word that appears fewer than 5 times across the training messages (5/5559 is about 0.1% of the messages) is not a strong indicator; these are the words we should exclude to simplify the DTM.
sms_fre_words <- findFreqTerms(sms_dtm_train, 5) # find terms that appear at least 5 times
str(sms_fre_words) # let's see how many frequent words we have
 chr [1:1139] "â‚“""| __truncated__ "abiola" "abl" "abt" "accept" "access" "account" "across" "act" "activ" ...

So we have reduced the number of features from over 6500 down to 1139 by keeping only words with a frequency of 5 or higher. Now we use this character vector to select the matching columns of our DTM.

sms_dtm_freq_train <- sms_dtm_train[ , sms_fre_words]
sms_dtm_freq_test <- sms_dtm_test[ , sms_fre_words]
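The same filtering idea can be sketched on a tiny dense matrix with base R. This assumes the threshold is compared against each word's total count across all messages; the matrix values and word names are made up.

```r
# Toy count matrix: rows = messages, columns = words.
# matrix() fills column by column, so call = 3,0,2; rare = 1,0,0; free = 2,4,1.
counts <- matrix(c(3, 0, 2,
                   1, 0, 0,
                   2, 4, 1),
                 nrow = 3,
                 dimnames = list(NULL, c("call", "rare", "free")))

# Keep only the words whose total count is at least 5
frequent <- colnames(counts)[colSums(counts) >= 5]
filtered <- counts[, frequent, drop = FALSE]

print(frequent)
print(filtered)
```

Indexing the columns by name, as we do with sms_fre_words, also guarantees that the training and test matrices end up with exactly the same columns in the same order.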

Since this Naive Bayes classifier works on categorical data and our training and test datasets contain numeric counts, we have to convert them into categorical variables by setting each value to “Yes” or “No” depending on whether the word appears in the message.

convert_counts <- function(x){
  # "Yes" if the word appears in the message at least once, otherwise "No"
  ifelse(x > 0, "Yes", "No")
}
sms_train <- apply(sms_dtm_freq_train, MARGIN = 2, convert_counts)
sms_test <- apply(sms_dtm_freq_test, MARGIN = 2, convert_counts)
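To see what this conversion does, here is a small base R sketch on a made-up 2 x 2 count matrix. MARGIN = 2 tells apply to process the matrix one column (one word) at a time.

```r
# Same idea as convert_counts above, restated so this sketch runs on its own
convert_counts <- function(x) {
  ifelse(x > 0, "Yes", "No")
}

# Toy counts (made up): message 1 contains "free" once, message 2 contains "call" twice
toy <- matrix(c(0, 2,
                1, 0),
              nrow = 2,
              dimnames = list(c("1", "2"), c("call", "free")))

toy_yes_no <- apply(toy, MARGIN = 2, convert_counts)
print(toy_yes_no)
```

Note that a count of 2 and a count of 1 both become "Yes": the classifier only learns from whether a word is present, not from how often it occurs.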

Step 3: Training Naive Bayes Classifier

This Naive Bayes classifier estimates the probability that a text message is spam or ham based on the presence or absence of words in the text. The classifier used comes from the ‘e1071’ package, developed at the Vienna University of Technology.

#install.packages('e1071')
library(e1071)
sms_classifier <- naiveBayes(sms_train, sms_train_labels)
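For reference, the textbook form of the rule that a Naive Bayes classifier applies can be written as follows. The symbols are our own notation: c is the class (ham or spam) and x_i is the "Yes"/"No" value of the i-th word in a message. This is the standard formulation, not something taken from the e1071 documentation.

```latex
% Posterior for class c given the word indicators x_1, ..., x_n,
% assuming the words are conditionally independent given the class
P(c \mid x_1, \dots, x_n) \;\propto\; P(c) \prod_{i=1}^{n} P(x_i \mid c),
\qquad c \in \{\text{ham}, \text{spam}\}

% The predicted label is the class with the larger posterior
\hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(x_i \mid c)
```

Here P(c) is estimated from the share of ham and spam in the training labels, and each P(x_i | c) from how often word i is present or absent within that class.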

Step 4: Evaluating the performance of our Naive Bayes classifier

# predict on the test dataset using the trained Naive Bayes classifier
sms_test_pred <- predict(sms_classifier, sms_test)
# evaluate the model's performance with CrossTable
library(gmodels)
CrossTable(sms_test_pred, sms_test_labels, prop.chisq = FALSE, prop.t = FALSE, dnn = c("Predicted", "Actual"))

 
   Cell Contents
|-------------------------|
|                       N |
|           N / Row Total |
|           N / Col Total |
|-------------------------|

 
Total Observations in Table:  1390 

 
             | Actual 
   Predicted |       ham |      spam | Row Total | 
-------------|-----------|-----------|-----------|
         ham |      1201 |        30 |      1231 | 
             |     0.976 |     0.024 |     0.886 | 
             |     0.995 |     0.164 |           | 
-------------|-----------|-----------|-----------|
        spam |         6 |       153 |       159 | 
             |     0.038 |     0.962 |     0.114 | 
             |     0.005 |     0.836 |           | 
-------------|-----------|-----------|-----------|
Column Total |      1207 |       183 |      1390 | 
             |     0.868 |     0.132 |           | 
-------------|-----------|-----------|-----------|

 

==> As we can see, of the messages our Naive Bayes classifier labeled “spam”, 96.2% (153/159) were actually spam, and of the messages it labeled “ham”, 97.6% (1201/1231) were actually ham. Looking at the actual classes, it caught 83.6% of the spam (153/183) and kept 99.5% of the ham (1201/1207). Overall it misclassified 36/1390 = 2.59% of the test messages. Not a bad record for our little spam text classifier :-)
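The summary figures can be recomputed from the four cell counts of the table with a few lines of base R. The counts are copied from the CrossTable output; the names precision and recall are standard terms that we introduce here for the row and column proportions of the spam class.

```r
# Confusion matrix from the CrossTable output: rows = predicted, columns = actual
conf <- matrix(c(1201, 6,
                 30, 153),
               nrow = 2,
               dimnames = list(Predicted = c("ham", "spam"),
                               Actual = c("ham", "spam")))

# Share of all test messages that were classified correctly
accuracy <- sum(diag(conf)) / sum(conf)
# Of the messages predicted as spam, how many really were spam (row proportion)
precision <- conf["spam", "spam"] / sum(conf["spam", ])
# Of the actual spam messages, how many were caught (column proportion)
recall <- conf["spam", "spam"] / sum(conf[, "spam"])

print(round(c(accuracy = accuracy, precision = precision, recall = recall), 3))
```

Keeping the row and column proportions apart matters for a spam filter: a high row proportion means few legitimate messages get flagged, while the column proportion tells us how much spam still slips through.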

Step 5: Improving our Naive Bayes classifier performance

# using laplace estimator to improve accuracy
sms_classifier2 <- naiveBayes(sms_train, sms_train_labels, laplace = 1)
sms_test_pred2 <- predict(sms_classifier2, sms_test)
CrossTable(sms_test_pred2, sms_test_labels, prop.chisq = FALSE, prop.t = FALSE, prop.r = FALSE, dnn = c("Predicted", "Actual"))

 
   Cell Contents
|-------------------------|
|                       N |
|           N / Col Total |
|-------------------------|

 
Total Observations in Table:  1390 

 
             | Actual 
   Predicted |       ham |      spam | Row Total | 
-------------|-----------|-----------|-----------|
         ham |      1202 |        28 |      1230 | 
             |     0.996 |     0.153 |           | 
-------------|-----------|-----------|-----------|
        spam |         5 |       155 |       160 | 
             |     0.004 |     0.847 |           | 
-------------|-----------|-----------|-----------|
Column Total |      1207 |       183 |      1390 | 
             |     0.868 |     0.132 |           | 
-------------|-----------|-----------|-----------|

 

==> By setting the Laplace estimator to 1, we were able to improve the accuracy of our Naive Bayes classifier. Specifically, the number of spam messages incorrectly identified as “ham” went down from 30 to 28, and the number of ham messages incorrectly identified as “spam” went down from 6 to 5.
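Why would the Laplace estimator help at all? Without it, a word that never occurs in one class gets a probability of zero for that class, and because Naive Bayes multiplies the word probabilities together, that single zero rules the class out no matter what the other words say. The sketch below shows additive smoothing in its usual textbook form; the function name and the numbers are made up, and e1071's internal computation may differ in detail.

```r
# Smoothed estimate of P(word state | class); laplace = 0 gives the raw proportion.
# n_levels is 2 here because each feature is either "Yes" or "No".
smoothed_prob <- function(count, class_total, laplace = 0, n_levels = 2) {
  (count + laplace) / (class_total + laplace * n_levels)
}

# A word seen in 0 of 100 ham messages (numbers made up for illustration)
raw <- smoothed_prob(0, 100)
smoothed <- smoothed_prob(0, 100, laplace = 1)

print(c(raw = raw, smoothed = smoothed))
```

With laplace = 1 the estimate becomes 1/102 instead of 0, so an unseen word makes a class less likely without eliminating it.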

Let’s try increasing the Laplace estimator to see how it changes the accuracy, starting with a Laplace estimator of 2 and then trying 3 and 4.

sms_classifier3 <- naiveBayes(sms_train, sms_train_labels, laplace = 2)
sms_test_pred3 <- predict(sms_classifier3, sms_test)
CrossTable(sms_test_pred3, sms_test_labels, prop.chisq = FALSE, prop.t = FALSE, prop.r = FALSE, dnn = c("Predicted", "Actual"))

 
   Cell Contents
|-------------------------|
|                       N |
|           N / Col Total |
|-------------------------|

 
Total Observations in Table:  1390 

 
             | Actual 
   Predicted |       ham |      spam | Row Total | 
-------------|-----------|-----------|-----------|
         ham |      1204 |        34 |      1238 | 
             |     0.998 |     0.186 |           | 
-------------|-----------|-----------|-----------|
        spam |         3 |       149 |       152 | 
             |     0.002 |     0.814 |           | 
-------------|-----------|-----------|-----------|
Column Total |      1207 |       183 |      1390 | 
             |     0.868 |     0.132 |           | 
-------------|-----------|-----------|-----------|

 
sms_classifier4 <- naiveBayes(sms_train, sms_train_labels, laplace = 3)
sms_test_pred4 <- predict(sms_classifier4, sms_test)
CrossTable(sms_test_pred4, sms_test_labels, prop.chisq = FALSE, prop.t = FALSE, prop.r = FALSE, dnn = c("Predicted", "Actual"))

 
   Cell Contents
|-------------------------|
|                       N |
|           N / Col Total |
|-------------------------|

 
Total Observations in Table:  1390 

 
             | Actual 
   Predicted |       ham |      spam | Row Total | 
-------------|-----------|-----------|-----------|
         ham |      1202 |        40 |      1242 | 
             |     0.996 |     0.219 |           | 
-------------|-----------|-----------|-----------|
        spam |         5 |       143 |       148 | 
             |     0.004 |     0.781 |           | 
-------------|-----------|-----------|-----------|
Column Total |      1207 |       183 |      1390 | 
             |     0.868 |     0.132 |           | 
-------------|-----------|-----------|-----------|

 
sms_classifier5 <- naiveBayes(sms_train, sms_train_labels, laplace = 4)
sms_test_pred5 <- predict(sms_classifier5, sms_test)
CrossTable(sms_test_pred5, sms_test_labels, prop.chisq = FALSE, prop.t = FALSE, prop.r = FALSE, dnn = c("Predicted", "Actual"))

 
   Cell Contents
|-------------------------|
|                       N |
|           N / Col Total |
|-------------------------|

 
Total Observations in Table:  1390 

 
             | Actual 
   Predicted |       ham |      spam | Row Total | 
-------------|-----------|-----------|-----------|
         ham |      1202 |        46 |      1248 | 
             |     0.996 |     0.251 |           | 
-------------|-----------|-----------|-----------|
        spam |         5 |       137 |       142 | 
             |     0.004 |     0.749 |           | 
-------------|-----------|-----------|-----------|
Column Total |      1207 |       183 |      1390 | 
             |     0.868 |     0.132 |           | 
-------------|-----------|-----------|-----------|

 

==> It seems that with Laplace estimator = 1, our Naive Bayes classifier is the most accurate (33 errors out of 1390). We also observed that increasing the Laplace estimator beyond 1 increases the number of spam messages incorrectly identified as “ham” (34, 40, and 46 for values of 2, 3, and 4). This shows that it is important to pick the right Laplace estimator so that our Naive Bayes classifier does not filter out spam too aggressively (which leads to ham misclassified as “spam”) or too passively (which causes spam misclassified as “ham”)!
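To compare the five runs side by side, we can collect the off-diagonal counts from the CrossTable outputs into one small data frame. The counts are copied from the tables in this write-up; laplace = 0 stands for the first model, which was trained without setting the argument.

```r
# Misclassification counts copied from the five CrossTable outputs
results <- data.frame(
  laplace = 0:4,
  spam_as_ham = c(30, 28, 34, 40, 46),  # actual spam predicted as ham
  ham_as_spam = c(6, 5, 3, 5, 5)        # actual ham predicted as spam
)

# Total errors and error rate on the 1390 test messages
results$total_errors <- results$spam_as_ham + results$ham_as_spam
results$error_rate <- results$total_errors / 1390

# Laplace value with the fewest total errors
best <- results$laplace[which.min(results$total_errors)]

print(results)
print(best)
```

Total errors alone hide the trade-off: laplace = 2 flags the fewest legitimate messages (3) but lets more spam through (34), so the best value depends on which kind of mistake we care about more.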

Reference:

Machine Learning with R by Brett Lantz.

