---
title: "H2O World GLM Demo"
author: "Tomas Nykodym"
date: "November 7, 2015"
output: html_document
---
* Introduction
* Installation and Startup
* Cover Type Dataset
* Multinomial Model
* Binomial Model
* Adding extra features
* Multinomial Model Revisited
## Introduction
This tutorial shows how an H2O [GLM](http://en.wikipedia.org/wiki/Generalized_linear_model) model can be used for binary and multi-class classification. It covers usage of H2O from R; a Python version of this tutorial will be available as well, in a separate document. This file is available in plain R, R markdown and regular markdown formats, and the plots are available as PDF files. All documents are available [on Github](https://github.com/h2oai/h2o-world-2015-training/raw/master/tutorials/glm/).
If run from plain R, execute R in the directory of this script. If run from RStudio, be sure to `setwd()` to the location of this script. `h2o.init()` starts H2O in R's current working directory, and `h2o.importFile()` looks for files relative to where H2O was started.
More examples and explanations can be found in our [H2O GLM booklet](http://h2o.ai/resources/) and on our [H2O Github Repository](http://github.com/h2oai/h2o-3/).
### H2O R Package
Load the H2O R package:
```{r load_library}
## R installation instructions are at http://h2o.ai/download
library(h2o)
```
### Start H2O
Start up a 1-node H2O server on your local machine, and allow it to use all CPU cores and up to 2GB of memory:
```{r start_h2o}
h2o.init(nthreads=-1, max_mem_size="2G")
h2o.removeAll() ## clean slate - just in case the cluster was already running
```
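To confirm the cluster came up with the resources we asked for, we can print the cluster information. This quick check is an addition to the original demo:
```{r cluster_info}
h2o.clusterInfo() ## prints the H2O version, number of nodes, total memory and cores
```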
## Cover Type Data
Predicting forest cover type from cartographic variables only (no remotely sensed data).
Let's import the dataset:
```{r import_data}
D = h2o.importFile(path = normalizePath("../data/covtype.full.csv"))
summary(D)
```
We have 11 numeric and two categorical features. The response is "Cover_Type" and it has 7 classes.
Let's split the data into Train/Test/Validation, with Train getting 70% of the rows and Test and Validation 15% each:
```{r split_data}
data = h2o.splitFrame(D,ratios=c(.7,.15),destination_frames = c("train","test","valid"))
names(data) <- c("Train","Test","Valid")
y = "Cover_Type"
x = names(data$Train)
x = x[-which(x==y)]
```
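As a quick sanity check (not part of the original demo), we can confirm the split proportions are roughly 70/15/15. Note that h2o.splitFrame assigns rows probabilistically, so the counts are approximate:
```{r check_split}
## row counts of the three splits
nrow(data$Train)
nrow(data$Test)
nrow(data$Valid)
```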
## Multinomial Model
Now that the data is in, let's run GLM. As mentioned previously, Cover_Type is the response and we use all other columns as predictors.
We have a multi-class problem, so we pick family=multinomial. The L-BFGS solver tends to be faster on multinomial problems, so we pick L-BFGS for our first try.
The rest of the parameters can remain at their defaults.
```{r build_model1}
m1 = h2o.glm(training_frame = data$Train, validation_frame = data$Valid, x = x, y = y,family='multinomial',solver='L_BFGS')
h2o.confusionMatrix(m1, valid=TRUE)
```
The model predicts only the majority class, so it's of no use at all! Maybe we regularized it too much; let's try again without regularization:
```{r build_model2}
m2 = h2o.glm(training_frame = data$Train, validation_frame = data$Valid, x = x, y = y,family='multinomial',solver='L_BFGS', lambda = 0)
h2o.confusionMatrix(m2, valid=FALSE) # confusion matrix on the training data
h2o.confusionMatrix(m2, valid=TRUE)  # confusion matrix on the validation data
```
No overfitting (train and validation performance are the same), so regularization is not needed in this case.
This model is actually useful. It got a 28% classification error, down from the 51% obtained by predicting the majority class only.
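To see where the 51% baseline comes from, here is a small sketch (an addition to this demo, assuming h2o.table's usual output with a Count column) that derives the majority-class error from the response distribution:
```{r majority_baseline}
## class counts of the response on the training data
cnts = as.data.frame(h2o.table(data$Train$Cover_Type))
## error of a classifier that always predicts the most frequent class
1 - max(cnts$Count) / sum(cnts$Count)
```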
## Binomial Model
Multinomial models are difficult and time-consuming; let's try a simpler binary classification.
We'll take a subset of the data with only class_1 and class_2 (the two majority classes) and build a binomial model deciding between them.
```{r get_and_split_binomial_data}
D_binomial = D[D$Cover_Type %in% c("class_1","class_2"),]
D_binomial$Cover_Type <- h2o.setLevels(D_binomial$Cover_Type, c("class_1","class_2")) # drop the 5 unused levels
# split to train/test/validation again
data_binomial = h2o.splitFrame(D_binomial,ratios=c(.7,.15),destination_frames = c("train_b","test_b","valid_b"))
names(data_binomial) <- c("Train","Test","Valid")
```
We can run binomial model now:
```{r build_binomial_model_1}
m_binomial = h2o.glm(training_frame = data_binomial$Train, validation_frame = data_binomial$Valid, x = x, y = y, family='binomial',lambda=0)
h2o.confusionMatrix(m_binomial, valid = FALSE) # confusion matrix on the training data
h2o.confusionMatrix(m_binomial, valid = TRUE)  # confusion matrix on the validation data
```
The output for a binomial problem is slightly different from the multinomial one. The confusion matrix now has a threshold attached to it.
The model produces probabilities of class_1 and class_2 similarly to the multinomial example earlier; however, this time we have only two classes and we can tune the classification to our needs. The classification errors in the binomial case have special meanings: we call them false positives and false negatives. In reality, each can have a different cost associated with it, and we want to tune our classifier accordingly. A common way to evaluate a binary classifier's performance is to look at its [ROC curve](https://en.wikipedia.org/wiki/Receiver_operating_characteristic), which plots the true positive rate versus the false positive rate. We can plot it from the H2O model output:
```{r build_binomial_model_output_1}
fpr = m_binomial@model$training_metrics@metrics$thresholds_and_metric_scores$fpr
tpr = m_binomial@model$training_metrics@metrics$thresholds_and_metric_scores$tpr
fpr_val = m_binomial@model$validation_metrics@metrics$thresholds_and_metric_scores$fpr
tpr_val = m_binomial@model$validation_metrics@metrics$thresholds_and_metric_scores$tpr
plot(fpr,tpr, type='l')
title('ROC curve')
lines(fpr_val,tpr_val,type='l',col='red')
legend("bottomright",c("Train", "Validation"),col=c("black","red"),lty=c(1,1),lwd=c(3,3))
```
The area under the ROC curve (AUC) is a common "goodness" metric for binary classifiers. We got:
```{r build_binomial_model_output_2}
h2o.auc(m_binomial, valid=FALSE) # on the training data
h2o.auc(m_binomial, valid=TRUE)  # on the validation data
```
The default confusion matrix is computed at the threshold that maximizes the [F1 score](https://en.wikipedia.org/wiki/F1_score). We can choose different thresholds; the H2O output shows the optimal thresholds for several common metrics.
```{r build_binomial_model_output_3}
m_binomial@model$training_metrics@metrics$max_criteria_and_metric_scores
```
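If F1 is not the metric we care about, we can read off a different operating point. A minimal sketch, assuming the h2o.find_threshold_by_max_metric helper is available in your version of the package:
```{r pick_threshold}
## threshold maximizing accuracy (instead of F1), taken from the validation metrics
perf = h2o.performance(m_binomial, valid = TRUE)
h2o.find_threshold_by_max_metric(perf, "accuracy")
```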
The model we just built gets a 23% classification error at the F1-optimizing threshold, so there is still room for improvement.
Let's add some features:
* There are 11 numerical predictors in the dataset; we will cut them into intervals and add a categorical variable for each
* We can add interaction terms capturing interactions between categorical variables
Let's make a convenience function that cuts a column into intervals, working on all three of our datasets (Train/Validation/Test).
We'll use h2o.hist to determine the interval boundaries on the Train set (there are many other ways to do this!).
We'll take only the bins with non-trivial support:
```{r cut_column}
cut_column <- function(data, col) {
  # need lower/upper bound due to h2o.cut behavior
  # (points < the first break or > the last break are replaced with missing values)
  min_val = min(data$Train[,col]) - 1
  max_val = max(data$Train[,col]) + 1
  x = h2o.hist(data$Train[,col])
  # use only the breaks with enough support
  breaks = x$breaks[which(x$counts > 1000)]
  # assign level names
  # (note: breaks[2:length(breaks)-1] is breaks[(2:length(breaks))-1], i.e. all but the last break)
  lvls = c("min", paste("i_", breaks[2:length(breaks)-1], sep=""), "max")
  col_cut <- paste(col, "_cut", sep="")
  data$Train[,col_cut] <- h2o.setLevels(h2o.cut(x = data$Train[,col], breaks = c(min_val, breaks, max_val)), lvls)
  # now do the same for test and validation, but using the breaks computed on the training data!
  if(!is.null(data$Test)) {
    min_val = min(data$Test[,col]) - 1
    max_val = max(data$Test[,col]) + 1
    data$Test[,col_cut] <- h2o.setLevels(h2o.cut(x = data$Test[,col], breaks = c(min_val, breaks, max_val)), lvls)
  }
  if(!is.null(data$Valid)) {
    min_val = min(data$Valid[,col]) - 1
    max_val = max(data$Valid[,col]) + 1
    data$Valid[,col_cut] <- h2o.setLevels(h2o.cut(x = data$Valid[,col], breaks = c(min_val, breaks, max_val)), lvls)
  }
  data
}
```
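A standalone call would look like the sketch below (illustrative only and not evaluated here, since add_features later applies it to every numeric column for us):
```{r cut_column_usage, eval=FALSE}
## cut Elevation into intervals and inspect the resulting factor levels
data_binomial <- cut_column(data_binomial, 'Elevation')
h2o.levels(data_binomial$Train$Elevation_cut)
```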
Now let's make a convenience function that generates interaction terms on all three of our datasets. We use h2o.interaction:
```{r generate_interactions}
interactions <- function(data, cols, pairwise = TRUE) {
  iii = h2o.interaction(data = data$Train, destination_frame = "itrain", factors = cols, pairwise = pairwise, max_factors = 1000, min_occurrence = 100)
  data$Train <- h2o.cbind(data$Train, iii)
  if(!is.null(data$Test)) {
    iii = h2o.interaction(data = data$Test, destination_frame = "itest", factors = cols, pairwise = pairwise, max_factors = 1000, min_occurrence = 100)
    data$Test <- h2o.cbind(data$Test, iii)
  }
  if(!is.null(data$Valid)) {
    iii = h2o.interaction(data = data$Valid, destination_frame = "ivalid", factors = cols, pairwise = pairwise, max_factors = 1000, min_occurrence = 100)
    data$Valid <- h2o.cbind(data$Valid, iii)
  }
  data
}
```
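Again, an illustrative (not evaluated) sketch of a standalone call, since add_features below calls this for us:
```{r interactions_usage, eval=FALSE}
## pairwise interactions between the two original categorical columns
data_binomial <- interactions(data_binomial, c("Wilderness_Area","Soil_Type"))
```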
Lastly, let's wrap the feature generation into a separate function, as we will use it again later.
We'll add intervals for each numeric column, and we'll add interactions between pairs of categorical columns.
```{r add_features}
# add features to our cover type example
# let's cut all the numerical columns into intervals and add interactions between categorical terms
add_features <- function(data) {
  names(data) <- c("Train","Test","Valid")
  data = cut_column(data, 'Elevation')
  data = cut_column(data, 'Hillshade_Noon')
  data = cut_column(data, 'Hillshade_9am')
  data = cut_column(data, 'Hillshade_3pm')
  data = cut_column(data, 'Horizontal_Distance_To_Hydrology')
  data = cut_column(data, 'Slope')
  data = cut_column(data, 'Horizontal_Distance_To_Roadways')
  data = cut_column(data, 'Aspect')
  # pairwise interactions between all categorical columns
  interaction_cols = c("Elevation_cut","Wilderness_Area","Soil_Type","Hillshade_Noon_cut","Hillshade_9am_cut","Hillshade_3pm_cut","Horizontal_Distance_To_Hydrology_cut","Slope_cut","Horizontal_Distance_To_Roadways_cut","Aspect_cut")
  data = interactions(data, interaction_cols)
  # interactions between the Hillshade columns
  interaction_cols2 = c("Hillshade_Noon_cut","Hillshade_9am_cut","Hillshade_3pm_cut")
  data = interactions(data, interaction_cols2, pairwise = FALSE)
  data
}
```
Now we generate the new features and add them to the dataset. We also need to regenerate the predictor names, as we added more columns:
```{r add_features_binomial}
# Add Features
data_binomial_ext <- add_features(data_binomial)
data_binomial_ext$Train <- h2o.assign(data_binomial_ext$Train,"train_b_ext")
data_binomial_ext$Valid <- h2o.assign(data_binomial_ext$Valid,"valid_b_ext")
data_binomial_ext$Test <- h2o.assign(data_binomial_ext$Test,"test_b_ext")
y = "Cover_Type"
x = names(data_binomial_ext$Train)
x = x[-which(x==y)]
```
Let's build the model. We should add some regularization this time because we added correlated variables; let's try the default settings:
```{r build_binomial_ext_1}
m_binomial_1_ext = try(h2o.glm(training_frame = data_binomial_ext$Train, validation_frame = data_binomial_ext$Valid, x = x, y = y, family='binomial'))
```
Oops, that does not run: we now have more features than the default method can solve with 2GB of RAM. Let's try L-BFGS instead.
```{r build_binomial_ext_2}
m_binomial_1_ext = h2o.glm(training_frame = data_binomial_ext$Train, validation_frame = data_binomial_ext$Valid, x = x, y = y, family='binomial', solver='L_BFGS')
h2o.confusionMatrix(m_binomial_1_ext)
h2o.auc(m_binomial_1_ext,valid=TRUE)
```
Not better; maybe too much regularization? Let's pick a smaller lambda and try again.
```{r build_binomial_ext_3}
m_binomial_2_ext = h2o.glm(training_frame = data_binomial_ext$Train, validation_frame = data_binomial_ext$Valid, x = x, y = y, family='binomial', solver='L_BFGS', lambda=1e-4)
h2o.confusionMatrix(m_binomial_2_ext, valid=TRUE)
h2o.auc(m_binomial_2_ext,valid=TRUE)
```
Way better: we got an AUC of .91 and a classification error of 0.180838.
We picked our regularization strength arbitrarily. Also, we used only the l2 penalty, but we added a lot of extra features, some of which may be useless.
Maybe we can do better with an l1 penalty.
So we want to run a lambda search to find the optimal penalty strength, and we want a non-zero l1 penalty to get a sparse solution.
We'll use the IRLSM solver this time, as it does much better with lambda search and the l1 penalty.
Recall we were not able to use it before. We can use it now because running with lambda search filters out a large portion of the inactive (coefficient == 0) predictors.
```{r build_binomial_ext_4}
m_binomial_3_ext = h2o.glm(training_frame = data_binomial_ext$Train, validation_frame = data_binomial_ext$Valid, x = x, y = y, family='binomial', lambda_search=TRUE)
h2o.confusionMatrix(m_binomial_3_ext, valid=TRUE)
h2o.auc(m_binomial_3_ext,valid=TRUE)
```
Better yet: we have a 17% error and we used only 3000 out of the 7000 features.
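We can verify the sparsity claim by counting the active coefficients ourselves. A quick sketch, assuming h2o.coef returns the full named coefficient vector of the GLM:
```{r count_active}
coefs = h2o.coef(m_binomial_3_ext)
sum(coefs != 0) ## number of active (non-zero) coefficients, including the intercept
length(coefs)   ## total number of coefficients
```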
OK, our new features improved the binomial model significantly; let's revisit our former multinomial model and see if they make a difference there (they should!):
```{r build_multinomial_ext_4}
# Multinomial Model 2
# let's revisit the multinomial case with our new features
data_ext <- add_features(data)
data_ext$Train <- h2o.assign(data_ext$Train,"train_m_ext")
data_ext$Valid <- h2o.assign(data_ext$Valid,"valid_m_ext")
data_ext$Test <- h2o.assign(data_ext$Test,"test_m_ext")
y = "Cover_Type"
x = names(data_ext$Train)
x = x[-which(x==y)]
m2 = h2o.glm(training_frame = data_ext$Train, validation_frame = data_ext$Valid, x = x, y = y,family='multinomial',solver='L_BFGS',lambda=1e-4)
# 21% err down from 28%
h2o.confusionMatrix(m2, valid=TRUE)
```
Improved considerably: 21% error instead of 28%.
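Note that the Test split has been held out the whole time, so as a final step (an addition to the original demo) we can score the extended multinomial model on it for an unbiased error estimate:
```{r final_test_check}
## confusion matrix on the completely held-out test set
h2o.confusionMatrix(h2o.performance(m2, newdata = data_ext$Test))
```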