This is a short script that helps identify poisonous mushrooms using rule-based learning. The data set is available at https://archive.ics.uci.edu/ml/datasets/Mushroom . It contains 8124 mushroom samples from 23 different species, and each sample has 22 features. The data set has been modified from the original so that it is easier to work with in R. Each sample in the modified data set is classified as either “edible” or “poisonous”. The objective is to build a rule-based classifier that predicts whether a given mushroom is poisonous.
# This data set contains only character vectors and no numeric values. Rule-based learning works better with categorical data, so setting stringsAsFactors = TRUE makes sense
mushrooms <- read.csv("mushrooms.csv", stringsAsFactors = TRUE)
str(mushrooms)
'data.frame': 8124 obs. of 23 variables:
$ type : Factor w/ 2 levels "edible","poisonous": 2 1 1 2 1 1 1 1 2 1 ...
$ cap_shape : Factor w/ 6 levels "bell","conical",..: 3 3 1 3 3 3 1 1 3 1 ...
$ cap_surface : Factor w/ 4 levels "fibrous","grooves",..: 4 4 4 3 4 3 4 3 3 4 ...
$ cap_color : Factor w/ 10 levels "brown","buff",..: 1 10 9 9 4 10 9 9 9 10 ...
$ bruises : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 2 2 2 2 ...
$ odor : Factor w/ 9 levels "almond","anise",..: 8 1 2 8 7 1 1 2 8 1 ...
$ gill_attachment : Factor w/ 2 levels "attached","free": 2 2 2 2 2 2 2 2 2 2 ...
$ gill_spacing : Factor w/ 2 levels "close","crowded": 1 1 1 1 2 1 1 1 1 1 ...
$ gill_size : Factor w/ 2 levels "broad","narrow": 2 1 1 2 1 1 1 1 2 1 ...
$ gill_color : Factor w/ 12 levels "black","brown",..: 1 1 2 2 1 2 5 2 8 5 ...
$ stalk_shape : Factor w/ 2 levels "enlarging","tapering": 1 1 1 1 2 1 1 1 1 1 ...
$ stalk_root : Factor w/ 5 levels "bulbous","club",..: 3 2 2 3 3 2 2 2 3 2 ...
$ stalk_surface_above_ring: Factor w/ 4 levels "fibrous","scaly",..: 4 4 4 4 4 4 4 4 4 4 ...
$ stalk_surface_below_ring: Factor w/ 4 levels "fibrous","scaly",..: 4 4 4 4 4 4 4 4 4 4 ...
$ stalk_color_above_ring : Factor w/ 9 levels "brown","buff",..: 8 8 8 8 8 8 8 8 8 8 ...
$ stalk_color_below_ring : Factor w/ 9 levels "brown","buff",..: 8 8 8 8 8 8 8 8 8 8 ...
$ veil_type : Factor w/ 1 level "partial": 1 1 1 1 1 1 1 1 1 1 ...
$ veil_color : Factor w/ 4 levels "brown","orange",..: 3 3 3 3 3 3 3 3 3 3 ...
$ ring_number : Factor w/ 3 levels "none","one","two": 2 2 2 2 2 2 2 2 2 2 ...
$ ring_type : Factor w/ 5 levels "evanescent","flaring",..: 5 5 5 5 1 5 5 5 5 5 ...
$ spore_print_color : Factor w/ 9 levels "black","brown",..: 1 2 2 1 2 1 1 2 1 1 ...
$ population : Factor w/ 6 levels "abundant","clustered",..: 4 3 3 4 1 3 3 4 5 4 ...
$ habitat : Factor w/ 7 levels "grasses","leaves",..: 5 1 3 5 1 1 3 3 1 3 ...
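To see what stringsAsFactors changes, here is a minimal sketch that reads the same small CSV twice. The inline rows are invented for illustration and only stand in for mushrooms.csv.

```r
# Inline CSV standing in for mushrooms.csv (hypothetical rows)
csv_text <- "type,odor\nedible,none\npoisonous,foul\nedible,almond"

# Same text, read once as plain strings and once as factors
as_chr <- read.csv(text = csv_text, stringsAsFactors = FALSE)
as_fct <- read.csv(text = csv_text, stringsAsFactors = TRUE)

class(as_chr$odor)    # character vector
class(as_fct$odor)    # factor
levels(as_fct$type)   # the distinct class labels become factor levels
```

With factors, each column carries its own set of levels, which is the form that str() reports above.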
summary(mushrooms$veil_type)
partial
8124
The veil_type variable contains only one value, “partial”, so it will not contribute to the learning process. We should remove it from the data set:
mushrooms$veil_type <- NULL
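Instead of spotting constant columns by eye, the same check could be automated. The sketch below uses a tiny invented data frame (not the real data) and flags every column with fewer than two distinct values.

```r
# Toy data frame standing in for mushrooms (hypothetical values, not the real data)
toy <- data.frame(
  type      = factor(c("edible", "poisonous", "edible")),
  odor      = factor(c("none", "foul", "almond")),
  veil_type = factor(c("partial", "partial", "partial"))
)

# A column is constant when it has fewer than two distinct values
is_constant <- vapply(toy, function(col) length(unique(col)) < 2, logical(1))
constant_cols <- names(toy)[is_constant]

# Drop the constant columns and keep everything else
toy_clean <- toy[, !is_constant, drop = FALSE]
```

Here constant_cols holds the names that would be removed, so the result can be reviewed before dropping anything.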
The rule classifiers that we will be using are contained in the “RWeka” package, so we need to install “RWeka” and load it into memory:
# install.packages('RWeka')
library(RWeka)
# we're going to use the OneR rule classifier
mushrooms1R <- OneR(type ~ ., data = mushrooms)
# we can inspect the rule learned by this classifier
mushrooms1R
odor:
almond -> edible
anise -> edible
creosote -> poisonous
fishy -> poisonous
foul -> poisonous
musty -> poisonous
none -> edible
pungent -> poisonous
spicy -> poisonous
(8004/8124 instances correct)
==> The OneR classifier correctly identified 8004 of 8124 instances even though it used only one feature, “odor” in this case. This is quite impressive for such a simple rule.
# let's take a look at the summary of the classifier
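To make the idea behind a one-rule learner concrete, here is a rough base R sketch. It is not RWeka's implementation, the helper one_rule_errors() is hypothetical, and the toy data is invented. For each feature it predicts the majority class of every feature value, counts the mistakes, and keeps the feature with the fewest.

```r
# Hypothetical toy data; values are invented for illustration only
toy <- data.frame(
  type    = factor(c("edible", "edible", "poisonous", "poisonous", "edible", "poisonous")),
  odor    = factor(c("none", "almond", "foul", "foul", "none", "none")),
  bruises = factor(c("yes", "no", "yes", "no", "no", "yes"))
)

# For one feature: predict the majority class of each feature value
# and count how many rows that rule gets wrong
one_rule_errors <- function(feature, target) {
  counts <- table(feature, target)          # rows: feature values, columns: classes
  sum(counts) - sum(apply(counts, 1, max))  # rows that fall outside the majority class
}

features <- setdiff(names(toy), "type")
errors <- vapply(features, function(f) one_rule_errors(toy[[f]], toy$type), numeric(1))

# Keep the single feature whose rule makes the fewest mistakes
best_feature <- names(which.min(errors))
```

On this toy data the odor rule makes one mistake and the bruises rule makes two, so odor is selected.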
summary(mushrooms1R)
=== Summary ===
Correctly Classified Instances 8004 98.5229 %
Incorrectly Classified Instances 120 1.4771 %
Kappa statistic 0.9704
Mean absolute error 0.0148
Root mean squared error 0.1215
Relative absolute error 2.958 %
Root relative squared error 24.323 %
Total Number of Instances 8124
=== Confusion Matrix ===
a b <-- classified as
4208 0 | a = edible
120 3796 | b = poisonous
==> The rule classifier correctly identified 8004 of the 8124 mushroom samples, which is about 98.5% accuracy. However, it classified 120 samples as “edible” that are actually “poisonous”. Even though 98.5% accuracy might seem impressive, those false negatives could cost lives: anyone who relies entirely on this model to pick mushrooms could be poisoned by a poisonous mushroom that the model labels “edible”.
We have seen that a rule-based classifier can be effective even when it uses only one feature to generate its rule. But in real-world situations where human lives are involved, 98.5% accuracy is not good enough. We need to improve the model’s performance and push the error rate (especially the false negative rate) as low as possible.
To do that, we should use more than one feature to generate rules. We are going to use the RIPPER rule learning algorithm, available as JRip in the “RWeka” package:
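The figures in the summary can be recomputed from the confusion matrix. The sketch below copies the four counts from the OneR output and assumes “poisonous” is the positive class, so a false negative is a poisonous mushroom labelled edible.

```r
# Confusion matrix copied from the OneR summary (rows: actual, columns: predicted)
conf <- matrix(c(4208,    0,
                  120, 3796),
               nrow = 2, byrow = TRUE,
               dimnames = list(actual    = c("edible", "poisonous"),
                               predicted = c("edible", "poisonous")))

n <- sum(conf)
accuracy <- sum(diag(conf)) / n   # about 0.985

# Share of truly poisonous mushrooms that the rule calls edible (about 3.1%)
false_negative_rate <- conf["poisonous", "edible"] / sum(conf["poisonous", ])

# Cohen's kappa: agreement beyond what the row and column totals give by chance
expected <- sum(rowSums(conf) * colSums(conf)) / n^2
kappa <- (accuracy - expected) / (1 - expected)
```

Looking at the false negative rate separately from overall accuracy makes the safety problem visible: roughly 3 in every 100 poisonous samples slip through.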
library(RWeka)
mushrooms_JRip <- JRip(mushrooms$type ~ ., data = mushrooms) # the class to predict is mushrooms$type; the '.' on the right-hand side means all features are used as predictors
mushrooms_JRip
JRIP rules:
===========
(odor = foul) => mushrooms$type=poisonous (2160.0/0.0)
(gill_size = narrow) and (gill_color = buff) => mushrooms$type=poisonous (1152.0/0.0)
(gill_size = narrow) and (odor = pungent) => mushrooms$type=poisonous (256.0/0.0)
(odor = creosote) => mushrooms$type=poisonous (192.0/0.0)
(spore_print_color = green) => mushrooms$type=poisonous (72.0/0.0)
(stalk_surface_below_ring = scaly) and (stalk_surface_above_ring = silky) => mushrooms$type=poisonous (68.0/0.0)
(habitat = leaves) and (cap_color = white) => mushrooms$type=poisonous (8.0/0.0)
(stalk_color_above_ring = yellow) => mushrooms$type=poisonous (8.0/0.0)
=> mushrooms$type=edible (4208.0/0.0)
Number of Rules : 9
summary(mushrooms_JRip)
=== Summary ===
Correctly Classified Instances 8124 100 %
Incorrectly Classified Instances 0 0 %
Kappa statistic 1
Mean absolute error 0
Root mean squared error 0
Relative absolute error 0 %
Root relative squared error 0 %
Total Number of Instances 8124
=== Confusion Matrix ===
a b <-- classified as
4208 0 | a = edible
0 3916 | b = poisonous
==> Looking at the output and summary of the RIPPER classifier, we can see that it classifies every mushroom in this data set correctly, which is 100% accuracy for both poisonous and edible mushrooms. The rules learned by this classifier could be used by the public to pick out edible mushrooms, assuming that every known mushroom species is represented in this data set.
==> The rules learned by the RIPPER algorithm can be read in plain English as follows:
+ if odor = foul, then the mushroom is poisonous
+ if gill_size = narrow and gill_color = buff, then poisonous
+ if gill_size = narrow and odor = pungent, then poisonous
+ if odor = creosote, then poisonous
+ if spore_print_color = green, then poisonous
+ if stalk_surface_below_ring = scaly and stalk_surface_above_ring = silky, then poisonous
+ if habitat = leaves and cap_color = white, then poisonous
+ if stalk_color_above_ring = yellow, then poisonous
+ anything that does not match any of the rules above is classified as “edible”
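The rule list above can also be written as a small base R function. This is only an illustrative sketch: classify_mushroom() is a hypothetical helper, not part of RWeka, and the two samples are invented. Because every rule before the default predicts “poisonous”, checking them with a logical OR gives the same answer as walking the list in order.

```r
# Hand-coded version of the RIPPER rules listed above.
# classify_mushroom() is a hypothetical helper, not an RWeka function.
classify_mushroom <- function(m) {
  poisonous <-
    m$odor == "foul" |
    (m$gill_size == "narrow" & m$gill_color == "buff") |
    (m$gill_size == "narrow" & m$odor == "pungent") |
    m$odor == "creosote" |
    m$spore_print_color == "green" |
    (m$stalk_surface_below_ring == "scaly" & m$stalk_surface_above_ring == "silky") |
    (m$habitat == "leaves" & m$cap_color == "white") |
    m$stalk_color_above_ring == "yellow"
  # Default rule: anything that matched nothing above is labelled edible
  ifelse(poisonous, "poisonous", "edible")
}

# Two invented samples: the first matches no rule, the second has a foul odor
samples <- data.frame(
  odor                     = c("none", "foul"),
  gill_size                = c("broad", "broad"),
  gill_color               = c("brown", "brown"),
  spore_print_color        = c("black", "black"),
  stalk_surface_below_ring = c("smooth", "smooth"),
  stalk_surface_above_ring = c("smooth", "smooth"),
  habitat                  = c("woods", "woods"),
  cap_color                = c("brown", "brown"),
  stalk_color_above_ring   = c("white", "white"),
  stringsAsFactors = FALSE
)

predictions <- classify_mushroom(samples)
```

Writing the rules out this way makes it easy to check them by hand against individual samples.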
Reference: Machine Learning with R by Brett Lantz