Commit e97df50

Tomas Nykodym committed: Added Rmd file for glm demo.
1 parent cd02331 commit e97df50

3 files changed
Lines changed: 249 additions & 2 deletions

tutorials/deeplearning/deeplearning.Rmd
Lines changed: 0 additions & 1 deletion

@@ -2,7 +2,6 @@
  * Introduction
  * Installation and Startup
- * Decision Boundaries
  * Cover Type Dataset
  * Exploratory Data Analysis
  * Deep Learning Model
tutorials/glm/glm_h2oworld_demo.R
Lines changed: 1 addition & 1 deletion

@@ -135,7 +135,7 @@ x = names(data_ext$Train)
  x = x[-which(x==y)]
  m2 = h2o.glm(training_frame = data_ext$Train, validation_frame = data_ext$Valid, x = x, y = y,family='multinomial',solver='L_BFGS',lambda=1e-4)
  # 21% err down from 28%
- m2
+ summary(m2)

  ### All done, shutdown H2O
  h2o.shutdown(prompt=FALSE)
Lines changed: 248 additions & 0 deletions

---
title: "H2O World GLM Demo"
author: "Tomas Nykodym"
date: "November 7, 2015"
output: html_document
---

* Introduction
* Installation and Startup
* Cover Type Dataset
* Multinomial Model
* Binomial Model
* Adding extra features
* Multinomial Model Revisited

## Introduction
This tutorial shows how an H2O [GLM](http://en.wikipedia.org/wiki/Generalized_linear_model) model can be used for binary and multi-class classification. It covers usage of H2O from R; a Python version of this tutorial will be available as well in a separate document. This file is available in plain R, R markdown and regular markdown formats, and the plots are available as PDF files. All documents are available [on Github](https://github.com/h2oai/h2o-world-2015-training/raw/master/tutorials/glm/).
If run from plain R, execute R in the directory of this script. If run from RStudio, be sure to setwd() to the location of this script. h2o.init() starts H2O in R's current working directory. h2o.importFile() looks for files from the perspective of where H2O was started.

More examples and explanations can be found in our [H2O GLM booklet](http://h2o.ai/resources/) and on our [H2O Github Repository](http://github.com/h2oai/h2o-3/).

### H2O R Package

Load the H2O R package:

```{r load_library}
## R installation instructions are at http://h2o.ai/download
library(h2o)
```
### Start H2O
Start up a 1-node H2O server on your local machine, and allow it to use all CPU cores and up to 2GB of memory:

```{r start_h2o}
h2o.init(nthreads=-1, max_mem_size="2G")
h2o.removeAll() ## clean slate - just in case the cluster was already running
```
## Cover Type Data
Predicting forest cover type from cartographic variables only (no remotely sensed data).
Let's import the dataset:
```{r import_data}
D = h2o.importFile(path = normalizePath("../data/covtype.full.csv"))
h2o.summary(D)
```
We have 11 numeric and two categorical features. The response is "Cover_Type" and has 7 classes.
Let's split the data into Train/Test/Validation, with Train getting 70% of the rows and Test and Validation 15% each:
```{r split_data}
data = h2o.splitFrame(D,ratios=c(.7,.15),destination_frames = c("train","test","valid"))
names(data) <- c("Train","Test","Valid")
y = "Cover_Type"
x = names(data$Train)
x = x[-which(x==y)]
```
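As an aside, h2o.splitFrame assigns rows by independent random draws, so the realized split is only approximately 70/15/15. A base-R sketch of the same idea on a made-up data.frame (hypothetical, not part of the demo):

```r
# Approximate 70/15/15 split of a plain data.frame, mirroring h2o.splitFrame
set.seed(42)
df <- data.frame(x = rnorm(10000), y = sample(c("a","b"), 10000, replace = TRUE))
r <- runif(nrow(df))                      # one uniform draw per row
splits <- list(Train = df[r < 0.70, ],
               Test  = df[r >= 0.70 & r < 0.85, ],
               Valid = df[r >= 0.85, ])
sapply(splits, nrow)                      # roughly 7000 / 1500 / 1500
```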
## Multinomial Model
Now that the data are in, let's run GLM. As mentioned above, Cover_Type is the response and we use all other columns as predictors.
Since this is a multi-class problem we pick family=multinomial. The L-BFGS solver tends to be faster on multinomial problems, so we pick L-BFGS for our first try.
The rest can remain at the defaults.
```{r build_model1}
m1 = h2o.glm(training_frame = data$Train, validation_frame = data$Valid, x = x, y = y,family='multinomial',solver='L_BFGS')
h2o.confusionMatrix(m1, valid=TRUE)
```
The model predicts only the majority class, so it's of no use at all! Maybe we regularized it too much; let's try again without regularization:
```{r build_model2}
m2 = h2o.glm(training_frame = data$Train, validation_frame = data$Valid, x = x, y = y,family='multinomial',solver='L_BFGS', lambda = 0)
h2o.confusionMatrix(m2, valid=FALSE) # get confusion matrix on the training data
h2o.confusionMatrix(m2, valid=TRUE) # get confusion matrix on the validation data
```
There is no overfitting (train and validation performance are about the same), so regularization is not needed in this case.

This model is actually useful. It got 28% classification error, down from the 51% obtained by predicting the majority class only.
## Binomial Model
Multinomial models are difficult and time-consuming, so let's try a simpler binary classification first.
We'll take a subset of the data with only class_1 and class_2 (the two majority classes) and build a binomial model deciding between them.
```{r get_and_split_binomial_data}
D_binomial = D[D$Cover_Type %in% c("class_1","class_2"),]
h2o.setLevels(D_binomial$Cover_Type,c("class_1","class_2"))
# split to train/test/validation again
data_binomial = h2o.splitFrame(D_binomial,ratios=c(.7,.15),destination_frames = c("train_b","test_b","valid_b"))
names(data_binomial) <- c("Train","Test","Valid")
```
We can run the binomial model now:
```{r build_binomial_model_1}
m_binomial = h2o.glm(training_frame = data_binomial$Train, validation_frame = data_binomial$Valid, x = x, y = y, family='binomial',lambda=0)
h2o.confusionMatrix(m_binomial, valid = FALSE) # training confusion matrix
h2o.confusionMatrix(m_binomial, valid = TRUE)  # validation confusion matrix
```
The output for a binomial problem is slightly different from the multinomial one. The confusion matrix now has a threshold attached to it.

The model produces probabilities of class_1 and class_2 similarly to the multinomial example earlier; however, this time we have only two classes and we can tune the classification to our needs. The classification errors in the binomial case have special meaning: we call them false positives and false negatives. In reality, each can have a different cost associated with it and we want to tune our classifier accordingly. The common way to evaluate a binary classifier's performance is to look at its [ROC curve](https://en.wikipedia.org/wiki/Receiver_operating_characteristic), which plots the true positive rate versus the false positive rate. We can plot it from the h2o model output:
```{r build_binomial_model_output_1}
fpr = m_binomial@model$training_metrics@metrics$thresholds_and_metric_scores$fpr
tpr = m_binomial@model$training_metrics@metrics$thresholds_and_metric_scores$tpr
fpr_val = m_binomial@model$validation_metrics@metrics$thresholds_and_metric_scores$fpr
tpr_val = m_binomial@model$validation_metrics@metrics$thresholds_and_metric_scores$tpr
plot(fpr,tpr, type='l')
title('ROC Curve')
lines(fpr_val,tpr_val,type='l',col='red')
legend("bottomright",c("Train", "Validation"),col=c("black","red"),lty=c(1,1),lwd=c(3,3))
```
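The AUC reported in the next chunk is simply the area under this curve. As a conceptual aside (using made-up ROC points, not the model's actual output), it can be computed with the trapezoidal rule:

```r
# Trapezoidal-rule AUC from (fpr, tpr) points sorted by fpr
auc_trapezoid <- function(fpr, tpr) {
  o <- order(fpr)
  fpr <- fpr[o]; tpr <- tpr[o]
  sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
}
# A perfect classifier's ROC goes straight up, then across: AUC = 1
auc_trapezoid(c(0, 0, 1), c(0, 1, 1))   # 1
# A random classifier follows the diagonal: AUC = 0.5
auc_trapezoid(c(0, 1), c(0, 1))         # 0.5
```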
The area under the ROC curve (AUC) is a common "goodness" metric for binary classifiers. We got:

```{r build_binomial_model_output_2}
h2o.auc(m_binomial,valid=FALSE) # on train
h2o.auc(m_binomial,valid=TRUE) # on validation
```
The default confusion matrix is computed at the threshold that maximizes the [F1 score](https://en.wikipedia.org/wiki/F1_score). We can choose different thresholds; the h2o output shows optimal thresholds for some common metrics.
```{r build_binomial_model_output_3}
m_binomial@model$training_metrics@metrics$max_criteria_and_metric_scores
```
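To make the threshold selection concrete, here is a base-R sketch that scans candidate thresholds for the F1-maximizing one (made-up scores and labels, not the demo's data):

```r
# Scan candidate thresholds and return the F1-maximizing one
best_f1_threshold <- function(scores, labels) {   # labels are 0/1
  ths <- sort(unique(scores))
  f1 <- sapply(ths, function(t) {
    pred <- as.integer(scores >= t)
    tp <- sum(pred == 1 & labels == 1)
    fp <- sum(pred == 1 & labels == 0)
    fn <- sum(pred == 0 & labels == 1)
    if (tp == 0) 0 else 2 * tp / (2 * tp + fp + fn)
  })
  ths[which.max(f1)]
}
scores <- c(0.1, 0.4, 0.35, 0.8, 0.7)
labels <- c(0,   0,   1,    1,   1)
best_f1_threshold(scores, labels)   # 0.35: all three positives caught, one false positive
```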
The model we just built gets 23% classification error at the F1-optimizing threshold, so there is still room for improvement.
Let's add some features:

* There are 11 numerical predictors in the dataset; we will cut them into intervals and add a categorical variable for each
* We can add interaction terms capturing interactions between categorical variables
Let's make a convenience function that cuts a column into intervals, working on all three of our datasets (Train/Validation/Test).
We'll use h2o.hist to determine the interval boundaries (there are many more ways to do that!) on the Train set.
We'll take only the bins with non-trivial support:
```{r cut_column}
cut_column <- function(data, col) {
  # need lower/upper bound due to h2o.cut behavior (points < the first break or > the last break are replaced with missing value)
  min_val = min(data$Train[,col])-1
  max_val = max(data$Train[,col])+1
  x = h2o.hist(data$Train[, col])
  # use only the breaks with enough support
  breaks = x$breaks[which(x$counts > 1000)]
  # assign level names
  lvls = c("min",paste("i_",breaks[2:length(breaks)-1],sep=""),"max")
  col_cut <- paste(col,"_cut",sep="")
  data$Train[,col_cut] <- h2o.setLevels(h2o.cut(x = data$Train[,col],breaks=c(min_val,breaks,max_val)),lvls)
  # now do the same for test and validation, but using the breaks computed on the training data!
  if(!is.null(data$Test)) {
    min_val = min(data$Test[,col])-1
    max_val = max(data$Test[,col])+1
    data$Test[,col_cut] <- h2o.setLevels(h2o.cut(x = data$Test[,col],breaks=c(min_val,breaks,max_val)),lvls)
  }
  if(!is.null(data$Valid)) {
    min_val = min(data$Valid[,col])-1
    max_val = max(data$Valid[,col])+1
    data$Valid[,col_cut] <- h2o.setLevels(h2o.cut(x = data$Valid[,col],breaks=c(min_val,breaks,max_val)),lvls)
  }
  data
}
```
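The key point in cut_column is that the breaks are learned on the training set and reused unchanged on test and validation. A base-R analogue of that idea with cut() on a made-up vector (illustrative only, not the demo's data):

```r
# Breaks learned on training values, then reused on new data
train_vals <- c(2950, 2980, 3100, 3200, 3050, 2900, 3150)
breaks <- unique(quantile(train_vals, probs = c(0, .25, .5, .75, 1)))
# widen the outer bounds so out-of-range values don't become NA, as in cut_column above
breaks[1] <- breaks[1] - 1
breaks[length(breaks)] <- breaks[length(breaks)] + 1
new_vals <- c(2905, 3190)
cut(new_vals, breaks = breaks)   # new values fall into the training-derived intervals
```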
Now let's make a convenience function generating interaction terms on all three of our datasets. We use h2o.interaction:
```{r generate_interactions}
interactions <- function(data, cols, pairwise = TRUE) {
  iii = h2o.interaction(data = data$Train, destination_frame = "itrain",factors = cols,pairwise=pairwise,max_factors=1000,min_occurrence=100)
  data$Train <- h2o.cbind(data$Train,iii)
  if(!is.null(data$Test)) {
    iii = h2o.interaction(data = data$Test, destination_frame = "itest",factors = cols,pairwise=pairwise,max_factors=1000,min_occurrence=100)
    data$Test <- h2o.cbind(data$Test,iii)
  }
  if(!is.null(data$Valid)) {
    iii = h2o.interaction(data = data$Valid, destination_frame = "ivalid",factors = cols,pairwise=pairwise,max_factors=1000,min_occurrence=100)
    data$Valid <- h2o.cbind(data$Valid,iii)
  }
  data
}
```
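An interaction term is just a new categorical column whose levels are combinations of the levels of the input columns. A base-R analogue using interaction() on made-up factors (the column names are hypothetical):

```r
# Combine two categorical columns into one interaction factor
wilderness <- factor(c("area1", "area1", "area2", "area2"))
soil       <- factor(c("type_a", "type_b", "type_a", "type_b"))
interaction(wilderness, soil, sep = "_")   # 4 combined levels, e.g. area1_type_a
```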
Lastly, let's wrap the feature generation into a separate function, as we will use it again later.
We'll add intervals for each numeric column and interactions between each pair of categorical columns.
```{r add_features}
# add features to our cover type example
# let's cut all the numerical columns into intervals and add interactions between categorical terms
add_features <- function(data) {
  names(data) <- c("Train","Test","Valid")
  data = cut_column(data,'Elevation')
  data = cut_column(data,'Hillshade_Noon')
  data = cut_column(data,'Hillshade_9am')
  data = cut_column(data,'Hillshade_3pm')
  data = cut_column(data,'Horizontal_Distance_To_Hydrology')
  data = cut_column(data,'Slope')
  data = cut_column(data,'Horizontal_Distance_To_Roadways')
  data = cut_column(data,'Aspect')
  # pairwise interactions between all categorical columns
  interaction_cols = c("Elevation_cut","Wilderness_Area","Soil_Type","Hillshade_Noon_cut","Hillshade_9am_cut","Hillshade_3pm_cut","Horizontal_Distance_To_Hydrology_cut","Slope_cut","Horizontal_Distance_To_Roadways_cut","Aspect_cut")
  data = interactions(data, interaction_cols)
  # interactions between Hillshade columns
  interaction_cols2 = c("Hillshade_Noon_cut","Hillshade_9am_cut","Hillshade_3pm_cut")
  data = interactions(data, interaction_cols2,pairwise = FALSE)
  data
}
```
Now we generate the new features and add them to the dataset. We also need to regenerate the predictor column names, as we added more columns:
```{r add_features_binomial}
# Add Features
data_binomial_ext <- add_features(data_binomial)
data_binomial_ext$Train <- h2o.assign(data_binomial_ext$Train,"train_b_ext")
data_binomial_ext$Valid <- h2o.assign(data_binomial_ext$Valid,"valid_b_ext")
data_binomial_ext$Test <- h2o.assign(data_binomial_ext$Test,"test_b_ext")
y = "Cover_Type"
x = names(data_binomial_ext$Train)
x = x[-which(x==y)]
```
Let's build the model. We should add some regularization this time because we added correlated variables; let's try the default first:
```{r build_binomial_ext_1}
m_binomial_1_ext = try(h2o.glm(training_frame = data_binomial_ext$Train, validation_frame = data_binomial_ext$Valid, x = x, y = y, family='binomial'))
```
Oops, that does not run: we now have more features than the default solver can handle with 2GB of RAM. Let's try L-BFGS instead.
```{r build_binomial_ext_2}
m_binomial_1_ext = h2o.glm(training_frame = data_binomial_ext$Train, validation_frame = data_binomial_ext$Valid, x = x, y = y, family='binomial', solver='L_BFGS')
h2o.confusionMatrix(m_binomial_1_ext)
h2o.auc(m_binomial_1_ext,valid=TRUE)
```
No better; maybe too much regularization? Let's pick a smaller lambda and try again.
```{r build_binomial_ext_3}
m_binomial_2_ext = h2o.glm(training_frame = data_binomial_ext$Train, validation_frame = data_binomial_ext$Valid, x = x, y = y, family='binomial', solver='L_BFGS', lambda=1e-4)
h2o.confusionMatrix(m_binomial_2_ext, valid=TRUE)
h2o.auc(m_binomial_2_ext,valid=TRUE)
```
Much better: we got an AUC of .91 and a classification error of 0.180838.
We picked our regularization strength arbitrarily, and we used only the l2 penalty even though we added a lot of extra features, some of which may be useless.
Maybe we can do better with an l1 penalty.
So let's run a lambda search to find the optimal penalty strength, with a non-zero l1 penalty to get a sparse solution.
We'll use the IRLSM solver this time, as it does much better with lambda search and the l1 penalty.
Recall we were not able to use it before. We can use it now because lambda search filters out a large portion of the inactive (coefficient == 0) predictors.
```{r build_binomial_ext_4}
m_binomial_3_ext = h2o.glm(training_frame = data_binomial_ext$Train, validation_frame = data_binomial_ext$Valid, x = x, y = y, family='binomial', lambda_search=TRUE)
h2o.confusionMatrix(m_binomial_3_ext, valid=TRUE)
h2o.auc(m_binomial_3_ext,valid=TRUE)
```
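For intuition on why the l1 penalty yields a sparse solution: its proximal update is soft-thresholding, which snaps small coefficients to exactly zero. A conceptual base-R sketch of that operator (an illustration of the math only, not H2O's actual implementation):

```r
# Soft-thresholding: the elementwise update behind l1 (lasso) sparsity
soft_threshold <- function(beta, lambda) sign(beta) * pmax(abs(beta) - lambda, 0)
# small coefficients are zeroed out, large ones are shrunk toward zero
soft_threshold(c(-0.3, 0.05, 0.8), lambda = 0.1)   # -0.2, 0.0, 0.7
```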
Better still: we got 17% error and used only 3000 out of 7000 features.
Our new features improved the binomial model significantly, so let's revisit our former multinomial model and see if they make a difference there (they should!):
```{r build_multinomial_ext_4}
# Multinomial Model 2
# let's revisit the multinomial case with our new features
data_ext <- add_features(data)
data_ext$Train <- h2o.assign(data_ext$Train,"train_m_ext")
data_ext$Valid <- h2o.assign(data_ext$Valid,"valid_m_ext")
data_ext$Test <- h2o.assign(data_ext$Test,"test_m_ext")
y = "Cover_Type"
x = names(data_ext$Train)
x = x[-which(x==y)]
m2 = h2o.glm(training_frame = data_ext$Train, validation_frame = data_ext$Valid, x = x, y = y,family='multinomial',solver='L_BFGS',lambda=1e-4)
# 21% err down from 28%
h2o.confusionMatrix(m2, valid=TRUE)
```
The model improved considerably: 21% error instead of 28%.
