* Introduction
* Installation and Startup
* Cover Type Dataset
* Multinomial Model
* Binomial Model
* Adding extra features
* Multinomial Model Revisited
## Introduction

This tutorial shows how an H2O [GLM](http://en.wikipedia.org/wiki/Generalized_linear_model) model can be used for binary and multi-class classification. It covers usage of H2O from R; a Python version will be available as well in a separate document. This file is available in plain R, R markdown and regular markdown formats, and the plots are available as PDF files. All documents are available [on Github](https://github.com/h2oai/h2o-world-2015-training/raw/master/tutorials/glm/).

If run from plain R, execute R in the directory of this script. If run from RStudio, be sure to `setwd()` to the location of this script. `h2o.init()` starts H2O in R's current working directory, and `h2o.importFile()` resolves relative paths from the directory where H2O was started.
More examples and explanations can be found in our [H2O GLM booklet](http://h2o.ai/resources/) and on our [H2O Github Repository](http://github.com/h2oai/h2o-3/).

### H2O R Package
Load the H2O R package:

```{r load_library}
## R installation instructions are at http://h2o.ai/download
library(h2o)
```
### Start H2O

Start up a 1-node H2O server on your local machine, and allow it to use all CPU cores and up to 2GB of memory:

```{r start_h2o}
h2o.init(nthreads = -1, max_mem_size = "2G")
h2o.removeAll() ## clean slate - just in case the cluster was already running
```
## Cover Type Dataset

We will predict forest cover type from cartographic variables only (no remotely sensed data). Let's import the dataset:

```{r import_data}
D = h2o.importFile(path = normalizePath("../data/covtype.full.csv"))
h2o.summary(D)
```
We have 11 numeric and 2 categorical features. The response is `Cover_Type`, which has 7 classes.
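To see how the classes are distributed, we can tabulate the response (a quick check; later we will rely on `class_1` and `class_2` being the two majority classes):

```{r class_counts}
# rows per cover type
h2o.table(D$Cover_Type)
```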
Let's split the data into Train/Test/Validation, with Train having 70% of the rows and Test and Validation 15% each:
```{r split_data}
data = h2o.splitFrame(D, ratios = c(.7, .15), destination_frames = c("train","test","valid"))
names(data) <- c("Train","Test","Valid")
y = "Cover_Type"
x = names(data$Train)
x = x[-which(x==y)]
```
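A quick sanity check on the split sizes (approximate, since `h2o.splitFrame` assigns rows probabilistically):

```{r check_split}
sapply(data, h2o.nrow)
```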
## Multinomial Model

We imported our data, so let's run GLM. As mentioned previously, `Cover_Type` is the response and we use all other columns as predictors. We have a multi-class problem, so we pick `family='multinomial'`. The L-BFGS solver tends to be faster on multinomial problems, so we use it for our first try. The rest can use the default settings.
```{r build_model1}
m1 = h2o.glm(training_frame = data$Train, validation_frame = data$Valid, x = x, y = y, family = 'multinomial', solver = 'L_BFGS')
h2o.confusionMatrix(m1, valid = TRUE)
```
The model predicts only the majority class, so it's not useful at all! Maybe we regularized it too much; let's try again without regularization:
```{r build_model2}
m2 = h2o.glm(training_frame = data$Train, validation_frame = data$Valid, x = x, y = y, family = 'multinomial', solver = 'L_BFGS', lambda = 0)
h2o.confusionMatrix(m2, valid = FALSE) # confusion matrix on the training data
h2o.confusionMatrix(m2, valid = TRUE)  # confusion matrix on the validation data
```
There is no overfitting (train and validation performance are about the same), so regularization is not needed in this case.

This model is actually useful. It got a 28% classification error, down from the 51% obtained by predicting the majority class only.
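The 51% baseline is just one minus the frequency of the most common class; we can verify it ourselves (a sketch assuming `h2o.table` returns its counts in a `Count` column, as it typically does):

```{r majority_baseline}
cnts = as.data.frame(h2o.table(D$Cover_Type))
1 - max(cnts$Count) / sum(cnts$Count)  # error of always predicting the majority class
```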
## Binomial Model

Since multinomial models are difficult and time-consuming, let's try a simpler binary classification. We'll take a subset of the data with only `class_1` and `class_2` (the two majority classes) and build a binomial model deciding between them.
```{r get_and_split_binomial_data}
D_binomial = D[D$Cover_Type %in% c("class_1","class_2"),]
h2o.setLevels(D_binomial$Cover_Type, c("class_1","class_2"))
# split into train/test/validation again
data_binomial = h2o.splitFrame(D_binomial, ratios = c(.7, .15), destination_frames = c("train_b","test_b","valid_b"))
names(data_binomial) <- c("Train","Test","Valid")
```
We can run a binomial model now:

```{r build_binomial_model_1}
m_binomial = h2o.glm(training_frame = data_binomial$Train, validation_frame = data_binomial$Valid, x = x, y = y, family = 'binomial', lambda = 0)
h2o.confusionMatrix(m_binomial, valid = TRUE)
```
The output for a binomial problem is slightly different from the multinomial one. The confusion matrix now has a threshold attached to it.

The model produces probabilities of `class_1` and `class_2`, similarly to the multinomial example earlier. However, this time we have only two classes and we can tune the classification to our needs.
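The raw predictions carry the per-class probabilities, so we can apply a cutoff of our own choosing; a sketch (the probability columns are named after the class levels):

```{r predict_probs, eval=FALSE}
preds = h2o.predict(m_binomial, data_binomial$Test)
head(preds)                    # columns: predict, class_1, class_2
custom = preds$class_2 > 0.3   # a hypothetical custom cutoff
```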
The classification errors in binomial cases have a particular meaning: we call them false positives and false negatives. In reality, each can have a different cost associated with it, so we want to tune our classifier accordingly.

The common way to evaluate a binary classifier's performance is to look at its [ROC curve](https://en.wikipedia.org/wiki/Receiver_operating_characteristic). The ROC curve plots the true positive rate versus the false positive rate. We can plot it from the H2O model output:
```{r build_binomial_model_output_1}
fpr = m_binomial@model$training_metrics@metrics$thresholds_and_metric_scores$fpr
tpr = m_binomial@model$training_metrics@metrics$thresholds_and_metric_scores$tpr
fpr_val = m_binomial@model$validation_metrics@metrics$thresholds_and_metric_scores$fpr
tpr_val = m_binomial@model$validation_metrics@metrics$thresholds_and_metric_scores$tpr
plot(fpr, tpr, type = 'l')
title('ROC Curve')
lines(fpr_val, tpr_val, type = 'l', col = 'red')
legend("bottomright", c("Train", "Validation"), col = c("black","red"), lty = c(1,1), lwd = c(3,3))
```
The area under the ROC curve (AUC) is a common goodness-of-fit metric for binary classifiers. For this example, the results were:

```{r build_binomial_model_output_2}
h2o.auc(m_binomial, valid = FALSE) # on train
h2o.auc(m_binomial, valid = TRUE)  # on validation
```
The default confusion matrix is computed at the threshold that optimizes the [F1 score](https://en.wikipedia.org/wiki/F1_score). We can choose different thresholds; the H2O output shows the optimal threshold for each of several common metrics:

```{r build_binomial_model_output_3}
m_binomial@model$training_metrics@metrics$max_criteria_and_metric_scores
```
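For example, to pull out the accuracy-maximizing threshold ourselves, we can use the same `thresholds_and_metric_scores` slot as above (a sketch; the table has one row per candidate threshold):

```{r best_accuracy_threshold}
scores = as.data.frame(m_binomial@model$validation_metrics@metrics$thresholds_and_metric_scores)
scores[which.max(scores$accuracy), c("threshold","accuracy")]
```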
The model we just built gets a 23% classification error at the F1-optimizing threshold, so there is still room for improvement. Let's add some features:

* There are 11 numerical predictors in the dataset; we will cut each into intervals and add a categorical variable per column
* We can add interaction terms capturing interactions between categorical variables

Let's make a convenience function that cuts a column into intervals, working on all three of our datasets (Train/Validation/Test). We'll use `h2o.hist` on the Train set to determine the interval boundaries (though there are many other ways to do that!), and we'll take only the bins with non-trivial support:
```{r cut_column}
cut_column <- function(data, col) {
  # need explicit lower/upper bounds due to h2o.cut behavior
  # (points < the first break or > the last break are replaced with a missing value)
  min_val = min(data$Train[,col]) - 1
  max_val = max(data$Train[,col]) + 1
  x = h2o.hist(data$Train[, col])
  # use only the breaks with enough support
  breaks = x$breaks[which(x$counts > 1000)]
  # assign level names (one per interval: "min", the interior breaks, "max")
  lvls = c("min", paste("i_", breaks[1:(length(breaks)-1)], sep=""), "max")
  col_cut <- paste(col, "_cut", sep="")
  data$Train[,col_cut] <- h2o.setLevels(h2o.cut(x = data$Train[,col], breaks = c(min_val, breaks, max_val)), lvls)
  # now do the same for test and validation, but using the breaks computed on the training data!
  if(!is.null(data$Test)) {
    min_val = min(data$Test[,col]) - 1
    max_val = max(data$Test[,col]) + 1
    data$Test[,col_cut] <- h2o.setLevels(h2o.cut(x = data$Test[,col], breaks = c(min_val, breaks, max_val)), lvls)
  }
  if(!is.null(data$Valid)) {
    min_val = min(data$Valid[,col]) - 1
    max_val = max(data$Valid[,col]) + 1
    data$Valid[,col_cut] <- h2o.setLevels(h2o.cut(x = data$Valid[,col], breaks = c(min_val, breaks, max_val)), lvls)
  }
  data
}
```
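For instance, binning `Elevation` would add an `Elevation_cut` factor column to each split (a one-off illustration, not run here; the `add_features` function below applies this to all numeric columns):

```{r cut_column_example, eval=FALSE}
data = cut_column(data, 'Elevation')
h2o.levels(data$Train$Elevation_cut)  # inspect the generated interval levels
```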
Now let's make a convenience function that generates interaction terms on all three of our datasets. We'll use `h2o.interaction`:
```{r generate_interactions}
interactions <- function(data, cols, pairwise = TRUE) {
  iii = h2o.interaction(data = data$Train, destination_frame = "itrain", factors = cols, pairwise = pairwise, max_factors = 1000, min_occurrence = 100)
  data$Train <- h2o.cbind(data$Train, iii)
  if(!is.null(data$Test)) {
    iii = h2o.interaction(data = data$Test, destination_frame = "itest", factors = cols, pairwise = pairwise, max_factors = 1000, min_occurrence = 100)
    data$Test <- h2o.cbind(data$Test, iii)
  }
  if(!is.null(data$Valid)) {
    iii = h2o.interaction(data = data$Valid, destination_frame = "ivalid", factors = cols, pairwise = pairwise, max_factors = 1000, min_occurrence = 100)
    data$Valid <- h2o.cbind(data$Valid, iii)
  }
  data
}
```
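Note the `pairwise` flag: with `pairwise = FALSE`, `h2o.interaction` creates a single n-way interaction column instead of one column per pair. A quick illustration (not run):

```{r interactions_example, eval=FALSE}
# all pairwise combinations of the listed factors
data = interactions(data, c("Wilderness_Area","Soil_Type"))
# one 3-way interaction column instead of three pairwise ones
data = interactions(data, c("Hillshade_Noon_cut","Hillshade_9am_cut","Hillshade_3pm_cut"), pairwise = FALSE)
```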
Finally, let's wrap the addition of features into a separate function call, as we will use it again later. We'll add intervals for each numeric column and interactions between each pair of categorical columns.
```{r add_features}
# add features to our cover type example:
# cut all the numerical columns into intervals and add interactions between categorical terms
add_features <- function(data) {
  names(data) <- c("Train","Test","Valid")
  data = cut_column(data, 'Elevation')
  data = cut_column(data, 'Hillshade_Noon')
  data = cut_column(data, 'Hillshade_9am')
  data = cut_column(data, 'Hillshade_3pm')
  data = cut_column(data, 'Horizontal_Distance_To_Hydrology')
  data = cut_column(data, 'Slope')
  data = cut_column(data, 'Horizontal_Distance_To_Roadways')
  data = cut_column(data, 'Aspect')
  # pairwise interactions between all categorical columns
  interaction_cols = c("Elevation_cut","Wilderness_Area","Soil_Type","Hillshade_Noon_cut","Hillshade_9am_cut","Hillshade_3pm_cut","Horizontal_Distance_To_Hydrology_cut","Slope_cut","Horizontal_Distance_To_Roadways_cut","Aspect_cut")
  data = interactions(data, interaction_cols)
  # one n-way interaction between the Hillshade columns
  interaction_cols2 = c("Hillshade_Noon_cut","Hillshade_9am_cut","Hillshade_3pm_cut")
  data = interactions(data, interaction_cols2, pairwise = FALSE)
  data
}
```
Now we generate the new features and add them to the dataset. We'll also need to rebuild our predictor list, as we added more columns:
```{r add_features_binomial}
# Add Features
data_binomial_ext <- add_features(data_binomial)
data_binomial_ext$Train <- h2o.assign(data_binomial_ext$Train, "train_b_ext")
data_binomial_ext$Valid <- h2o.assign(data_binomial_ext$Valid, "valid_b_ext")
data_binomial_ext$Test <- h2o.assign(data_binomial_ext$Test, "test_b_ext")
y = "Cover_Type"
x = names(data_binomial_ext$Train)
x = x[-which(x==y)]
```
Let's build the model! We should add some regularization this time because we added correlated variables, so let's try the default:

```{r build_binomial_ext_1}
m_binomial_1_ext = try(h2o.glm(training_frame = data_binomial_ext$Train, validation_frame = data_binomial_ext$Valid, x = x, y = y, family = 'binomial'))
```
Oops, it doesn't run: we now have more features than the default (IRLSM) solver can handle with 2GB of RAM. Let's try L-BFGS instead.
```{r build_binomial_ext_2}
m_binomial_1_ext = h2o.glm(training_frame = data_binomial_ext$Train, validation_frame = data_binomial_ext$Valid, x = x, y = y, family = 'binomial', solver = 'L_BFGS')
h2o.confusionMatrix(m_binomial_1_ext)
h2o.auc(m_binomial_1_ext, valid = TRUE)
```
Not much better; maybe there is too much regularization? Let's pick a smaller lambda and try again.
```{r build_binomial_ext_3}
m_binomial_2_ext = h2o.glm(training_frame = data_binomial_ext$Train, validation_frame = data_binomial_ext$Valid, x = x, y = y, family = 'binomial', solver = 'L_BFGS', lambda = 1e-4)
h2o.confusionMatrix(m_binomial_2_ext, valid = TRUE)
h2o.auc(m_binomial_2_ext, valid = TRUE)
```
Way better: we got an AUC of 0.91 and a classification error of 0.180838.

We picked our regularization strength arbitrarily, and we used only the l2 penalty even though we added a lot of extra features, some of which may be useless. Maybe we can do better with an l1 penalty. So now we want to run a lambda search to find the optimal penalty strength, with a non-zero l1 penalty to get a sparse solution. We'll use the IRLSM solver this time, as it does much better with lambda search and the l1 penalty. Recall we were not able to use it before; we can use it now because the lambda search will filter out a large portion of the inactive (coefficient == 0) predictors.
```{r build_binomial_ext_4}
m_binomial_3_ext = h2o.glm(training_frame = data_binomial_ext$Train, validation_frame = data_binomial_ext$Valid, x = x, y = y, family = 'binomial', lambda_search = TRUE)
h2o.confusionMatrix(m_binomial_3_ext, valid = TRUE)
h2o.auc(m_binomial_3_ext, valid = TRUE)
```
Better yet: we have a 17% error and we used only 3000 out of 7000 features.
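We can count the active (nonzero) coefficients ourselves; a sketch using the standard coefficients accessor:

```{r count_nonzero}
coefs = h2o.coef(m_binomial_3_ext)
sum(coefs != 0)  # number of predictors the lambda search kept active
```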
OK, our new features improved the binomial model significantly, so let's revisit our former multinomial model and see if they make a difference there (they should!):
```{r build_multinomial_ext_4}
# Multinomial Model 2
# let's revisit the multinomial case with our new features
data_ext <- add_features(data)
data_ext$Train <- h2o.assign(data_ext$Train, "train_m_ext")
data_ext$Valid <- h2o.assign(data_ext$Valid, "valid_m_ext")
data_ext$Test <- h2o.assign(data_ext$Test, "test_m_ext")
y = "Cover_Type"
x = names(data_ext$Train)
x = x[-which(x==y)]
m2 = h2o.glm(training_frame = data_ext$Train, validation_frame = data_ext$Valid, x = x, y = y, family = 'multinomial', solver = 'L_BFGS', lambda = 1e-4)
# 21% err, down from 28%
h2o.confusionMatrix(m2, valid = TRUE)
```
The error improved considerably: 21% instead of 28%.
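As a final check, we can score the held-out Test set we have not touched so far (a sketch using `h2o.performance`):

```{r test_performance}
perf_test = h2o.performance(m2, newdata = data_ext$Test)
h2o.confusionMatrix(perf_test)
```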