Author: Cat Tran
University of North Texas
Date: March 2017
Language used: R
Packages used: arules
Technique used: Market Basket Analysis (aka Association Rules)
Association rule learning is the core behind many online recommendation engines, such as those used by online stores for movies, music, games, and fashion retail. For example, if a site's user database shows that visitors tend to watch animated shows, then perhaps more kids' shows should appear on their recommendation lists.
Many grocery companies also use this technique to improve their inventory allocation and maximize customer spending in their stores. For example, if the transactions database shows a trend that shoppers tend to buy coffee and creamer together, inventory managers might place coffee and creamer in the same aisle to maximize customers' convenience and thus increase their spending.
In this tutorial, we will apply the Market Basket Analysis technique to grocery transaction data to identify consumers' shopping behaviors.
The groceries data set that we will be using was collected from a real grocery store. It contains 9835 transactions gathered over the span of one month, or roughly 327 transactions per day (9835 / 30 ≈ 327). This groceries data set is also bundled in the arules R package.
We will be using the arules package, so we first need to install and load it into memory.
#install.packages('arules')
library(arules)
Now we need to load the dataset file, which is in CSV format. NOTE: the approach for loading transaction data is different from loading a normal data frame with a predetermined number of features. Each row in a transactions dataset contains a different number of items, so we cannot load transaction data into a normal data frame. When we use the normal read.csv(...) function, R assumes the dataset has a fixed number of features (columns), which it infers from the beginning of the file. In our case the first row contains 4 items, so R would assume the remaining rows also have only 4 columns and would cut off the extra items in rows with more than 4.
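To see the problem concretely, here is a minimal base-R sketch using a made-up two-line dataset: parsing each line ourselves preserves ragged rows, which is essentially the job that read.transactions(...) does for us.

```r
# Two ragged transaction lines: the first has 2 items, the second has 4.
csv_text <- c("milk,bread", "milk,bread,butter,jam")

# A fixed-column reader like read.csv() infers a column count from the
# opening rows (2 columns here), so longer rows would not fit the frame.
# Splitting each line ourselves keeps every item:
baskets <- strsplit(csv_text, ",")

lengths(baskets)   # each basket keeps its own size: 2 and 4
baskets[[2]]       # all four items of the second transaction survive
```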
We will instead use a similar function implemented in the arules package, which allows us to load the Groceries dataset into a sparse matrix.
groceries <- read.transactions("../Datasets/groceries.csv", sep = ",")
summary(groceries)
transactions as itemMatrix in sparse format with
9835 rows (elements/itemsets/transactions) and
169 columns (items) and a density of 0.02609146
most frequent items:
whole milk other vegetables rolls/buns soda yogurt (Other)
2513 1903 1809 1715 1372 34055
element (itemset/transaction) length distribution:
sizes
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
2159 1643 1299 1005 855 645 545 438 350 246 182 117 78 77 55 46 29 14 14 9 11 4 6 1
26 27 28 29 32
1 1 1 3 1
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.000 2.000 3.000 4.409 6.000 32.000
includes extended item information - examples:
labels
1 abrasive cleaner
2 artif. sweetener
3 baby cosmetics
As we can see, the sparse matrix that represents the transactions dataset has 169 features (columns); each feature is an item that appears in the dataset, so there are 169 possible items. Each row contains binary values indicating whether or not each item (represented by the column names) appears in that transaction. For example, the second transaction in the dataset has three items: tropical fruit, yogurt, and coffee. The second row of the sparse matrix therefore has the value 1 in the tropical fruit, yogurt, and coffee columns, and 0 in all other columns.
A few insights can also be extracted from the summary above. For example, 'whole milk' appears in 2513 transactions and 'other vegetables' in 1903. There are 2159 transactions in which only 1 item was bought, and 1 transaction in which 32 items were bought. The smallest basket thus has 1 item, the largest has 32, and the average basket contains about 4.4 items.
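As a sanity check, the reported mean of 4.409 items per transaction can be recomputed in base R from the size distribution in the summary (the sizes and counts below are copied from that output):

```r
# Basket sizes and their counts, copied from the summary(groceries) output
sizes  <- c(1:24, 26:29, 32)
counts <- c(2159, 1643, 1299, 1005, 855, 645, 545, 438, 350, 246, 182, 117,
            78, 77, 55, 46, 29, 14, 14, 9, 11, 4, 6, 1, 1, 1, 1, 3, 1)

n_transactions <- sum(counts)           # 9835 transactions in total
n_items <- sum(sizes * counts)          # total number of items bought: 43367
round(n_items / n_transactions, 3)      # mean basket size: 4.409
```

The same total also reproduces the reported density: 43367 item occurrences spread over a 9835 x 169 matrix gives 43367 / (9835 * 169) ≈ 0.0261.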
Let's take a peek at the data. We can use the inspect(...) function from the arules package to look at the contents of the sparse matrix.
inspect(groceries[1:7])
items
[1] {citrus fruit,margarine,ready soups,semi-finished bread}
[2] {coffee,tropical fruit,yogurt}
[3] {whole milk}
[4] {cream cheese,meat spreads,pip fruit,yogurt}
[5] {condensed milk,long life bakery product,other vegetables,whole milk}
[6] {abrasive cleaner,butter,rice,whole milk,yogurt}
[7] {rolls/buns}
We can also extract the frequencies of items, which are represented by the columns:
itemFrequency(groceries[,1:10])
abrasive cleaner artif. sweetener baby cosmetics baby food bags baking powder bathroom cleaner
0.0035587189 0.0032536858 0.0006100661 0.0001016777 0.0004067107 0.0176919166 0.0027452974
beef berries beverages
0.0524656838 0.0332486019 0.0260294865
==> this output shows the frequencies of the first 10 items (ordered alphabetically) in the sparse matrix. As we can see, items such as "beef", "berries", and "beverages" are bought much more frequently than items such as "abrasive cleaner", "baby food", "bags", and so on.
Plotting the items with their frequencies also gives us a visual sense of how often items are bought, and the arules package provides a convenient method to do so. Since the number of items is usually very large, depending on the dataset, plotting all of them in one plot is not a good idea; instead, we should only plot the items whose frequency exceeds a threshold of our choosing.
# only items with frequency higher than the chosen support values will be plotted
itemFrequencyPlot(groceries, support=0.05)
We can also limit the number of the items to be plotted by specifying the parameter ‘topN’:
itemFrequencyPlot(groceries, topN=20)
What can we say from this plot? Whole milk is bought the most frequently, which makes sense if we think about how often we buy milk compared to other items.
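We can reproduce these selections numerically. Using the raw counts of the five most frequent items from summary(groceries) shown earlier, a little base R mimics both the support threshold and the topN filter:

```r
# Counts of the five most frequent items, copied from summary(groceries)
counts <- c(`whole milk` = 2513, `other vegetables` = 1903,
            `rolls/buns` = 1809, soda = 1715, yogurt = 1372)
freqs <- counts / 9835                  # convert raw counts to support values

# Items above a chosen support threshold, as in itemFrequencyPlot(support = 0.15)
names(freqs[freqs > 0.15])

# The top 3 items by frequency, as in itemFrequencyPlot(topN = 3)
head(sort(freqs, decreasing = TRUE), 3)
```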
In addition, we can draw an image of the sparse matrix using the image() function implemented in the arules package.
image(groceries[1:100])
image(sample(groceries, 100))
Each black dot in the image above represents the appearance of an item in a transaction. We can observe runs of consecutive transactions in which a common item was bought. The more dots an item's vertical line contains, the more often that item was bought.
Once we are satisfied with our general understanding of the data, we can go ahead and train a model to identify associations between grocery items. We will do this with the help of the Apriori algorithm implemented in the arules package.
The general usage of the apriori(...) function is as follows: apriori(data, parameter = NULL, appearance = NULL, control = NULL). The parameter argument can be used to set the minimum support, the minimum confidence, the maximum number of items, and the maximal time for checking subsets. Let's go with the default parameters first: support = 0.1, confidence = 0.8, a maximum of 10 items, and a maximal subset-checking time of 5 seconds:
apriori(groceries)
Apriori
Parameter specification:
confidence minval smax arem aval originalSupport maxtime support minlen maxlen target ext
0.8 0.1 1 none FALSE TRUE 5 0.1 1 10 rules FALSE
Algorithmic control:
filter tree heap memopt load sort verbose
0.1 TRUE TRUE FALSE TRUE 2 TRUE
Absolute minimum support count: 983
set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
sorting and recoding items ... [8 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 done [0.00s].
writing ... [0 rule(s)] done [0.00s].
creating S4 object ... done [0.00s].
set of 0 rules
==> We can see that with the default minimum support level (frequency) of 0.1, no rules were learned, since there is no itemset with a frequency of 0.1 or higher. However, our effort has not been wasted, since we now know in which direction to adjust: we need to lower our minimum support value.
groceriesRules <- apriori(groceries, parameter = list(support=0.006, confidence=0.25, minlen=2))
Apriori
Parameter specification:
confidence minval smax arem aval originalSupport maxtime support minlen maxlen target ext
0.25 0.1 1 none FALSE TRUE 5 0.006 2 10 rules FALSE
Algorithmic control:
filter tree heap memopt load sort verbose
0.1 TRUE TRUE FALSE TRUE 2 TRUE
Absolute minimum support count: 59
set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
sorting and recoding items ... [109 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 3 4 done [0.00s].
writing ... [463 rule(s)] done [0.00s].
creating S4 object ... done [0.00s].
groceriesRules
set of 463 rules
==> As we can see, by adjusting parameters such as the minimum support, the confidence level, and the number of items, we have dramatically increased the number of association rules discovered by the algorithm. To judge whether these rules are actually useful, we need to look deeper into them.
Let’s take a look at the summary of the rules object created:
summary(groceriesRules)
set of 463 rules
rule length distribution (lhs + rhs):sizes
2 3 4
150 297 16
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.000 2.000 3.000 2.711 3.000 4.000
summary of quality measures:
support confidence lift
Min. :0.006101 Min. :0.2500 Min. :0.9932
1st Qu.:0.007117 1st Qu.:0.2971 1st Qu.:1.6229
Median :0.008744 Median :0.3554 Median :1.9332
Mean :0.011539 Mean :0.3786 Mean :2.0351
3rd Qu.:0.012303 3rd Qu.:0.4495 3rd Qu.:2.3565
Max. :0.074835 Max. :0.6600 Max. :3.9565
mining info:
data ntransactions support confidence
groceries 9835 0.006 0.25
What can we say about this summary? There are 150 rules with 2 associated items, 297 rules with 3 items, and 16 rules with 4 items. Statistics for the support, confidence, and lift levels are also shown. The most interesting of these three is lift. Lift is the ratio of a rule's confidence to the support of its right-hand side: lift(X -> Y) = confidence(X -> Y) / support(Y). For example, if milk appears in 0.1 of all transactions, but appears in 0.5 of the transactions that contain bread, then support(milk) = 0.1, confidence(bread -> milk) = 0.5, and lift(bread -> milk) = 0.5 / 0.1 = 5. The higher the lift of a rule, the stronger the association between its items!
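To make these definitions concrete, here is a small base-R sketch that computes support, confidence, and lift by hand on a toy set of five transactions (the items and numbers are made up for illustration):

```r
# Toy transactions: rows are baskets, columns indicate item presence
tx <- matrix(c(1, 1,   # bread + milk
               1, 1,   # bread + milk
               1, 0,   # bread only
               0, 1,   # milk only
               0, 0),  # neither
             ncol = 2, byrow = TRUE,
             dimnames = list(NULL, c("bread", "milk")))

n <- nrow(tx)
support_bread <- sum(tx[, "bread"]) / n                 # 3/5 = 0.6
support_milk  <- sum(tx[, "milk"]) / n                  # 3/5 = 0.6
support_both  <- sum(tx[, "bread"] & tx[, "milk"]) / n  # 2/5 = 0.4

# confidence(bread -> milk) = support(bread & milk) / support(bread)
confidence <- support_both / support_bread              # 0.4 / 0.6 ≈ 0.667

# lift(bread -> milk) = confidence(bread -> milk) / support(milk)
lift <- confidence / support_milk                       # ≈ 1.11
```

A lift near 1 means the two items co-occur about as often as chance would predict; the values approaching 4 among our Groceries rules indicate much stronger associations.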
We can also look at a few specific rules in the groceriesRules object using the inspect(...) function.
inspect(groceriesRules[1:3])
lhs rhs support confidence lift
[1] {pot plants} => {whole milk} 0.006914082 0.4000000 1.565460
[2] {pasta} => {whole milk} 0.006100661 0.4054054 1.586614
[3] {herbs} => {root vegetables} 0.007015760 0.4312500 3.956477
The first rule can be read as "if shoppers buy pot plants, then they also buy whole milk", with a support level of 0.007, a confidence level of 0.4, and a lift factor of 1.57. Despite the fairly high lift, from our experience there isn't any obvious reason why shoppers would want to buy pot plants and whole milk together (unless they want to water the plants with whole milk!). However, if we consider the rule "if shoppers buy pasta, they also buy whole milk", with support, confidence, and lift similar to the first rule, this rule actually makes more sense: Mac and Cheese is a popular recipe, and it can be made with pasta and whole milk.
We can also sort the rules using the sort(...) function, specifying which statistic we want to sort the rules by:
inspect(sort(groceriesRules, by = "lift")[1:10])
lhs rhs support confidence lift
[1] {herbs} => {root vegetables} 0.007015760 0.4312500 3.956477
[2] {berries} => {whipped/sour cream} 0.009049314 0.2721713 3.796886
[3] {other vegetables,tropical fruit,whole milk} => {root vegetables} 0.007015760 0.4107143 3.768074
[4] {beef,other vegetables} => {root vegetables} 0.007930859 0.4020619 3.688692
[5] {other vegetables,tropical fruit} => {pip fruit} 0.009456024 0.2634561 3.482649
[6] {beef,whole milk} => {root vegetables} 0.008032537 0.3779904 3.467851
[7] {other vegetables,pip fruit} => {tropical fruit} 0.009456024 0.3618677 3.448613
[8] {pip fruit,yogurt} => {tropical fruit} 0.006405694 0.3559322 3.392048
[9] {citrus fruit,other vegetables} => {root vegetables} 0.010371124 0.3591549 3.295045
[10] {other vegetables,whole milk,yogurt} => {tropical fruit} 0.007625826 0.3424658 3.263712
Here we have sorted the rules by descending lift factor. The rule with the highest lift turned out to be {herbs => root vegetables}; this could mean that customers who buy herbs are about 4 times more likely than average to buy root vegetables too. Why would you buy herbs and root vegetables together? Maybe to make some soup? Similarly, customers are about 3.8 times more likely to buy berries when they are buying whipped/sour cream. Have you ever eaten berries with whipped cream? If you have, then you probably know why these two items tend to appear in the same transactions.
We can also filter rules using the subset(...) function. Let's say we are interested in all the rules that concern "whole milk"; we can list them easily.
milkRules <- subset(groceriesRules, subset = (lhs %in% "whole milk" | rhs %in% "whole milk") & lift > 3)
inspect(milkRules)
lhs rhs support confidence lift
[1] {beef,whole milk} => {root vegetables} 0.008032537 0.3779904 3.467851
[2] {root vegetables,tropical fruit,whole milk} => {other vegetables} 0.007015760 0.5847458 3.022057
[3] {other vegetables,tropical fruit,whole milk} => {root vegetables} 0.007015760 0.4107143 3.768074
[4] {other vegetables,tropical fruit,whole milk} => {yogurt} 0.007625826 0.4464286 3.200164
[5] {other vegetables,whole milk,yogurt} => {tropical fruit} 0.007625826 0.3424658 3.263712
[6] {other vegetables,whole milk,yogurt} => {root vegetables} 0.007829181 0.3515982 3.225716
[7] {other vegetables,rolls/buns,whole milk} => {root vegetables} 0.006202339 0.3465909 3.179778
What we see here are all the rules that contain "whole milk" on either the left-hand side or the right-hand side, with the constraint that the lift factor must be at least 3.0. We could do the same with a lift threshold of 2.0, but that would result in a really long list, considering how popular "whole milk" is in our transactions dataset. This is left as an exercise for the reader.
It's good to know that you can also save the rules learned by the algorithm to a local file, making it easier to share your results with others. We can do this by calling the write(...) function:
write(groceriesRules, file = "groceriesRules.csv", sep=",", quote = TRUE, row.names = FALSE)
# we can also convert the rules object into a data frame in case we want to process the data further:
groceriesRulesDF <- as(groceriesRules, "data.frame")
We have tried our hand at the Market Basket Analysis technique with the arules package. Market Basket Analysis is the foundation behind many recommendation engines, such as movie, music, and clothing recommendation systems. We also gained a basic understanding of support, confidence, and lift. In addition, with the help of the arules package, we were able to gain some insight into the shopping patterns of consumers, such as the most frequently bought items and the itemsets that are likely to be bought together. Even though we have only worked with the Groceries transactions dataset, this technique is not limited to the grocery setting; it can and should also be employed wherever we need to find patterns and association rules between items in categorical datasets.
Reference: Machine Learning with R, by Brett Lantz