This is a small script that helps identify poisonous mushrooms using rule-based learning. The data set is available at https://archive.ics.uci.edu/ml/datasets/Mushroom . It contains 8124 mushroom samples from 23 different species, and each sample has 22 features. The data set has been modified from the original so that it is easier to work with in R. Each sample in the modified data set is classified as either “edible” or “poisonous”. The objective is to build a rule-based classifier that predicts whether a given mushroom is poisonous.

Step 1: Reading in data

# This dataset contains only character vectors and no numeric values. Rule-based learning works better with categorical data, so setting stringsAsFactors = TRUE makes sense.
mushrooms <- read.csv("mushrooms.csv", stringsAsFactors = TRUE)
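To see what stringsAsFactors actually changes, here is a minimal sketch using a small made-up CSV string (the three rows are illustrative and are not taken from mushrooms.csv). Since R 4.0.0 the default is stringsAsFactors = FALSE, so the argument has to be set explicitly to get factors.

```r
# Hypothetical three-row CSV, used only to illustrate the effect of stringsAsFactors
csv_text <- "type,odor\npoisonous,pungent\nedible,almond\nedible,anise"

as_chr <- read.csv(text = csv_text, stringsAsFactors = FALSE)
as_fct <- read.csv(text = csv_text, stringsAsFactors = TRUE)

class(as_chr$odor)   # character column
class(as_fct$odor)   # factor column
levels(as_fct$type)  # factor levels are sorted alphabetically
```

With factors, each column carries its set of levels, which is the form the rule learners used later expect for categorical data.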

Step 2: Exploring and Preparing data

str(mushrooms)
'data.frame':   8124 obs. of  23 variables:
 $ type                    : Factor w/ 2 levels "edible","poisonous": 2 1 1 2 1 1 1 1 2 1 ...
 $ cap_shape               : Factor w/ 6 levels "bell","conical",..: 3 3 1 3 3 3 1 1 3 1 ...
 $ cap_surface             : Factor w/ 4 levels "fibrous","grooves",..: 4 4 4 3 4 3 4 3 3 4 ...
 $ cap_color               : Factor w/ 10 levels "brown","buff",..: 1 10 9 9 4 10 9 9 9 10 ...
 $ bruises                 : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 2 2 2 2 ...
 $ odor                    : Factor w/ 9 levels "almond","anise",..: 8 1 2 8 7 1 1 2 8 1 ...
 $ gill_attachment         : Factor w/ 2 levels "attached","free": 2 2 2 2 2 2 2 2 2 2 ...
 $ gill_spacing            : Factor w/ 2 levels "close","crowded": 1 1 1 1 2 1 1 1 1 1 ...
 $ gill_size               : Factor w/ 2 levels "broad","narrow": 2 1 1 2 1 1 1 1 2 1 ...
 $ gill_color              : Factor w/ 12 levels "black","brown",..: 1 1 2 2 1 2 5 2 8 5 ...
 $ stalk_shape             : Factor w/ 2 levels "enlarging","tapering": 1 1 1 1 2 1 1 1 1 1 ...
 $ stalk_root              : Factor w/ 5 levels "bulbous","club",..: 3 2 2 3 3 2 2 2 3 2 ...
 $ stalk_surface_above_ring: Factor w/ 4 levels "fibrous","scaly",..: 4 4 4 4 4 4 4 4 4 4 ...
 $ stalk_surface_below_ring: Factor w/ 4 levels "fibrous","scaly",..: 4 4 4 4 4 4 4 4 4 4 ...
 $ stalk_color_above_ring  : Factor w/ 9 levels "brown","buff",..: 8 8 8 8 8 8 8 8 8 8 ...
 $ stalk_color_below_ring  : Factor w/ 9 levels "brown","buff",..: 8 8 8 8 8 8 8 8 8 8 ...
 $ veil_type               : Factor w/ 1 level "partial": 1 1 1 1 1 1 1 1 1 1 ...
 $ veil_color              : Factor w/ 4 levels "brown","orange",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ ring_number             : Factor w/ 3 levels "none","one","two": 2 2 2 2 2 2 2 2 2 2 ...
 $ ring_type               : Factor w/ 5 levels "evanescent","flaring",..: 5 5 5 5 1 5 5 5 5 5 ...
 $ spore_print_color       : Factor w/ 9 levels "black","brown",..: 1 2 2 1 2 1 1 2 1 1 ...
 $ population              : Factor w/ 6 levels "abundant","clustered",..: 4 3 3 4 1 3 3 4 5 4 ...
 $ habitat                 : Factor w/ 7 levels "grasses","leaves",..: 5 1 3 5 1 1 3 3 1 3 ...
summary(mushrooms$veil_type)
partial 
   8124 

The veil_type variable contains only one value, “partial”, so it will not contribute to the learning process. We should remove it from the data set:

mushrooms$veil_type <- NULL
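Instead of spotting such a column by eye in the str() output, constant columns can also be found programmatically. The sketch below uses a toy data frame with made-up values, and the variable names n_distinct and constant_cols are just illustrative choices.

```r
# Toy data frame (hypothetical values) with one constant column
toy <- data.frame(
  type      = factor(c("edible", "poisonous", "edible")),
  veil_type = factor(c("partial", "partial", "partial")),
  odor      = factor(c("none", "foul", "almond"))
)

# Count distinct values per column and flag the columns that have only one
n_distinct <- vapply(toy, function(col) length(unique(col)), integer(1))
constant_cols <- names(n_distinct)[n_distinct == 1]
constant_cols

# Drop the flagged columns
toy <- toy[, !(names(toy) %in% constant_cols), drop = FALSE]
names(toy)
```

The same two steps should work on the mushrooms data frame, where they would be expected to flag veil_type.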

Step 3: Training the model

The rule classifier that we will be using is contained in the “RWeka” package, so we need to install the “RWeka” package and load it into memory:

# install.packages('RWeka')
library(RWeka)
# we are going to use the OneR rule classifier
mushrooms1R <- OneR(type ~ ., data = mushrooms)
# printing the model shows the rule learned by this classifier
mushrooms1R
odor:
    almond  -> edible
    anise   -> edible
    creosote    -> poisonous
    fishy   -> poisonous
    foul    -> poisonous
    musty   -> poisonous
    none    -> edible
    pungent -> poisonous
    spicy   -> poisonous
(8004/8124 instances correct)
==> The OneR classifier correctly identified 8004/8124 instances even though it used only one feature, “odor” in this case. This is quite impressive for such a simple rule.
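To make the idea behind a one-rule classifier concrete, here is a rough base R sketch: for every feature, let each level predict its majority class, count the resulting errors, and keep the feature with the fewest errors. This is not the RWeka implementation (which also deals with ties, numeric attributes and missing values); the toy data and the helper name one_rule_errors are made up for illustration.

```r
# Toy data (hypothetical values, not from mushrooms.csv)
toy <- data.frame(
  type      = c("edible", "edible", "poisonous", "poisonous", "edible", "poisonous"),
  odor      = c("none", "almond", "foul", "foul", "none", "none"),
  gill_size = c("broad", "broad", "narrow", "broad", "narrow", "broad"),
  stringsAsFactors = TRUE
)

# Errors made when every level of `feature` predicts its majority class
one_rule_errors <- function(feature, data, target) {
  counts <- table(data[[feature]], data[[target]])
  as.numeric(sum(counts) - sum(apply(counts, 1, max)))
}

# Score every candidate feature and keep the one with the fewest errors
features <- setdiff(names(toy), "type")
errors <- vapply(features, one_rule_errors, numeric(1), data = toy, target = "type")
best <- names(which.min(errors))

# The rule itself: the majority class for each level of the best feature
counts <- table(toy[[best]], toy$type)
rule <- colnames(counts)[apply(counts, 1, which.max)]
names(rule) <- rownames(counts)

errors
best
rule
```

On this toy data the odor feature makes fewer errors than gill_size, so it is the one that gets picked, which mirrors what OneR did on the real data.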

Step 4: Evaluating our rule classifier

# let's take a look at the summary of the classifier
summary(mushrooms1R)

=== Summary ===

Correctly Classified Instances        8004               98.5229 %
Incorrectly Classified Instances       120                1.4771 %
Kappa statistic                          0.9704
Mean absolute error                      0.0148
Root mean squared error                  0.1215
Relative absolute error                  2.958  %
Root relative squared error             24.323  %
Total Number of Instances             8124     

=== Confusion Matrix ===

    a    b   <-- classified as
 4208    0 |    a = edible
  120 3796 |    b = poisonous

==> The rule classifier correctly identified 8004 out of 8124 mushroom samples, which is about 98.5% accuracy. However, it classified 120 mushroom samples as “edible” that are actually “poisonous”. Even though 98.5% accuracy might seem impressive, those false negatives could cost lives: people who rely entirely on this model could get poisoned if they happen to pick one of the poisonous mushrooms that the model classified as “edible”.
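The headline numbers in the summary can be recomputed by hand from the confusion matrix, which helps in reading it: rows are the actual class and columns are the predicted class. The sketch below types the matrix in manually; the variable names are illustrative, and the kappa formula is the standard Cohen's kappa, which we assume is what the summary reports.

```r
# Confusion matrix from the summary above: rows = actual class, columns = predicted class
cm <- matrix(c(4208,    0,
                120, 3796),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual    = c("edible", "poisonous"),
                             predicted = c("edible", "poisonous")))

n <- sum(cm)
accuracy <- sum(diag(cm)) / n

# Share of truly poisonous mushrooms that were labelled edible
miss_rate <- cm["poisonous", "edible"] / sum(cm["poisonous", ])

# Cohen's kappa: agreement beyond what the row and column totals would give by chance
expected <- sum(rowSums(cm) * colSums(cm)) / n^2
kappa_stat <- (accuracy - expected) / (1 - expected)

round(c(accuracy = accuracy, miss_rate = miss_rate, kappa = kappa_stat), 4)
```

The accuracy and kappa should agree with the 98.5229 % and 0.9704 printed in the summary. The miss rate is the more relevant figure here: about 3% of the poisonous samples (120 of 3916) end up labelled as edible.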

Step 5: Improving rule classifier’s performance

We already saw that a rule-based classifier can be really effective even when it uses only one feature to generate its rule. But in real-world situations where human lives are involved, 98.5% accuracy is not going to cut it. We need to improve the model’s performance so that the error rate (especially the false negative rate) is as small as possible.

To do that, we should employ more than one feature to generate rules. We are going to use the RIPPER rule learning algorithm, which is available in the “RWeka” package as JRip:

library(RWeka)
# the class to be predicted is mushrooms$type; '.' on the right-hand side means all predictors (features) will be used for classification
mushrooms_JRip <- JRip(mushrooms$type ~ ., data = mushrooms)
mushrooms_JRip
JRIP rules:
===========

(odor = foul) => mushrooms$type=poisonous (2160.0/0.0)
(gill_size = narrow) and (gill_color = buff) => mushrooms$type=poisonous (1152.0/0.0)
(gill_size = narrow) and (odor = pungent) => mushrooms$type=poisonous (256.0/0.0)
(odor = creosote) => mushrooms$type=poisonous (192.0/0.0)
(spore_print_color = green) => mushrooms$type=poisonous (72.0/0.0)
(stalk_surface_below_ring = scaly) and (stalk_surface_above_ring = silky) => mushrooms$type=poisonous (68.0/0.0)
(habitat = leaves) and (cap_color = white) => mushrooms$type=poisonous (8.0/0.0)
(stalk_color_above_ring = yellow) => mushrooms$type=poisonous (8.0/0.0)
 => mushrooms$type=edible (4208.0/0.0)

Number of Rules : 9
summary(mushrooms_JRip)

=== Summary ===

Correctly Classified Instances        8124              100      %
Incorrectly Classified Instances         0                0      %
Kappa statistic                          1     
Mean absolute error                      0     
Root mean squared error                  0     
Relative absolute error                  0      %
Root relative squared error              0      %
Total Number of Instances             8124     

=== Confusion Matrix ===

    a    b   <-- classified as
 4208    0 |    a = edible
    0 3916 |    b = poisonous

==> Looking at the output and summary of the RIPPER classifier, we can see that this algorithm achieves 100% accuracy in identifying poisonous and edible mushrooms in this data set. The rules learned by this classifier could be used by the public to pick out edible mushrooms, assuming that all known mushroom species are represented in this data set.

==> The rules learned by the RIPPER algorithm can be interpreted in plain English as follows:
+ if odor = foul, then mushroom is poisonous
+ if gill_size = narrow and gill_color = buff, then poisonous
+ if gill_size = narrow and odor = pungent, then poisonous
+ if odor = creosote, then poisonous
+ if spore_print_color = green, then poisonous
+ if stalk_surface_below_ring = scaly and stalk_surface_above_ring = silky, then poisonous
+ if habitat = leaves and cap_color = white, then poisonous
+ if stalk_color_above_ring = yellow, then poisonous
+ anything else that does not fit any of the rules above is classified as “edible”
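The rules listed above can also be written down as a plain R function, which makes it easy to apply them without the fitted model. This is only a sketch: classify_mushroom is a hypothetical helper name, the conditions are transcribed by hand from the JRip output, and the two sample rows are made up.

```r
# Rules transcribed from the JRip output; anything not matched falls through to "edible"
classify_mushroom <- function(m) {
  poisonous <-
    m$odor == "foul" |
    (m$gill_size == "narrow" & m$gill_color == "buff") |
    (m$gill_size == "narrow" & m$odor == "pungent") |
    m$odor == "creosote" |
    m$spore_print_color == "green" |
    (m$stalk_surface_below_ring == "scaly" & m$stalk_surface_above_ring == "silky") |
    (m$habitat == "leaves" & m$cap_color == "white") |
    m$stalk_color_above_ring == "yellow"
  ifelse(poisonous, "poisonous", "edible")
}

# Two made-up samples to exercise the function
samples <- data.frame(
  odor                     = c("none", "pungent"),
  gill_size                = c("broad", "narrow"),
  gill_color               = c("brown", "black"),
  spore_print_color        = c("brown", "black"),
  stalk_surface_below_ring = c("smooth", "smooth"),
  stalk_surface_above_ring = c("smooth", "smooth"),
  habitat                  = c("woods", "urban"),
  cap_color                = c("brown", "white"),
  stalk_color_above_ring   = c("white", "white")
)

classify_mushroom(samples)
```

Because every explicit rule predicts “poisonous” and only the default predicts “edible”, the order of the conditions does not matter here and they can be combined with a simple OR.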

Reference:

Machine Learning with R by Brett Lantz
