* Introduction
  * Installation and Startup
* Cover Type Dataset
* Multinomial Model
* Binomial Model
  * Adding extra features
* Multinomial Model Revisited


## Introduction
This tutorial shows how an H2O [GLM](http://en.wikipedia.org/wiki/Generalized_linear_model) model can be used for binary and multi-class classification. It covers usage of H2O from R; a Python version of this tutorial will be available in a separate document. This file is available in plain R, R markdown and regular markdown formats, and the plots are available as PDF files. All documents are available [on Github](https://github.com/h2oai/h2o-world-2015-training/raw/master/tutorials/glm/).

If run from plain R, execute R in the directory of this script. If run from RStudio, be sure to `setwd()` to the location of this script. `h2o.init()` starts H2O in R's current working directory, and `h2o.importFile()` resolves file paths relative to where H2O was started.

More examples and explanations can be found in our [H2O GLM booklet](http://h2o.ai/resources/) and in our [H2O Github Repository](http://github.com/h2oai/h2o-3/).

### H2O R Package

Load the H2O R package:

```{r load_library}
## R installation instructions are at http://h2o.ai/download
library(h2o)
```

### Start H2O
Start up a 1-node H2O server on your local machine, and allow it to use all CPU cores and up to 2GB of memory:

```{r start_h2o}
h2o.init(nthreads=-1, max_mem_size="2G")
h2o.removeAll() ## clean slate - just in case the cluster was already running
```
## Cover Type Data
The task is predicting forest cover type from cartographic variables only (no remotely sensed data).
Let's import the dataset:

```{r import_data}
D = h2o.importFile(path = normalizePath("../data/covtype.full.csv"))
h2o.summary(D)
```
We have 11 numeric and two categorical features. The response is `Cover_Type`, which has 7 classes.
Let's split the data into Train/Test/Validation, with Train getting 70% of the rows and Test and Validation 15% each:

```{r split_data}
data = h2o.splitFrame(D,ratios=c(.7,.15),destination_frames = c("train","test","valid"))
names(data) <- c("Train","Test","Valid")
y = "Cover_Type"
x = names(data$Train)
x = x[-which(x==y)]
```
## Multinomial Model

Now that we have imported our data, let's run GLM. As mentioned above, `Cover_Type` is the response and we use all other columns as predictors.
Since this is a multi-class problem, we pick `family=multinomial`. The L-BFGS solver tends to be faster on multinomial problems, so we use it for our first try.
The rest of the settings can stay at their defaults.
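
For intuition: a multinomial GLM fits one coefficient vector per class and turns the per-class linear scores into probabilities with the softmax function. A plain-Python sketch of that last step (illustrative only - the scores below are made up, and this is not H2O's implementation):

```python
import math

def softmax(scores):
    # subtract the max score for numerical stability before exponentiating
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# hypothetical per-class linear predictors for one row and 7 cover types
scores = [0.2, 1.5, -0.3, 0.0, 0.9, -1.1, 0.4]
probs = softmax(scores)
predicted = probs.index(max(probs))  # the most probable class wins
```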

```{r build_model1}
m1 = h2o.glm(training_frame = data$Train, validation_frame = data$Valid, x = x, y = y,family='multinomial',solver='L_BFGS')
h2o.confusionMatrix(m1, valid=TRUE)
```
The model predicts only the majority class, so it's not useful at all! Maybe we regularized it too much; let's try again without regularization:

```{r build_model2}
m2 = h2o.glm(training_frame = data$Train, validation_frame = data$Valid, x = x, y = y,family='multinomial',solver='L_BFGS', lambda = 0)
h2o.confusionMatrix(m2, valid=FALSE) # confusion matrix on the training data
h2o.confusionMatrix(m2, valid=TRUE) # confusion matrix on the validation data
```
There is no overfitting (training and validation performance are about the same), so regularization is not needed in this case.

This model is actually useful. It got to 28% classification error, down from the 51% obtained by predicting the majority class alone.
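
For reference, the 51% baseline is just the error of always predicting the majority class. A plain-Python sketch with toy counts (roughly matching covtype's class balance; the exact proportions come from `h2o.summary(D)`):

```python
from collections import Counter

def majority_baseline_error(labels):
    # error rate of a classifier that always predicts the most common class
    counts = Counter(labels)
    majority = counts.most_common(1)[0][1]
    return 1.0 - majority / len(labels)

# toy distribution, roughly matching covtype's class balance
labels = ["class_2"] * 49 + ["class_1"] * 36 + ["class_3"] * 15
err = majority_baseline_error(labels)
```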

## Binomial Model
Since multinomial models are harder and more time-consuming to fit, let's try a simpler binary classification.
We'll take a subset of the data containing only `class_1` and `class_2` (the two majority classes) and build a binomial model deciding between them.

```{r get_and_split_binomial_data}
D_binomial = D[D$Cover_Type %in% c("class_1","class_2"),]
h2o.setLevels(D_binomial$Cover_Type,c("class_1","class_2"))
# split into train/test/validation again
data_binomial = h2o.splitFrame(D_binomial,ratios=c(.7,.15),destination_frames = c("train_b","test_b","valid_b"))
names(data_binomial) <- c("Train","Test","Valid")
```
We can run a binomial model now:

```{r build_binomial_model_1}
m_binomial = h2o.glm(training_frame = data_binomial$Train, validation_frame = data_binomial$Valid, x = x, y = y, family='binomial',lambda=0)
h2o.confusionMatrix(m_binomial, valid = FALSE) # confusion matrix on the training data
h2o.confusionMatrix(m_binomial, valid = TRUE) # confusion matrix on the validation data
```
The output for a binomial problem is slightly different from the multinomial case. The confusion matrix now has a threshold attached to it.

The model produces probabilities of `class_1` and `class_2`, similarly to the multinomial example earlier. However, this time we have only two classes and we can tune the classification to our needs.

The classification errors in the binomial case have particular names: false positives and false negatives. In practice, each can carry a different cost, so we want to tune our classifier accordingly.
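
One simple way to act on asymmetric costs is to scan candidate thresholds and keep the cheapest one. An illustrative plain-Python sketch (the scores and costs are made up; this is not an H2O API):

```python
def best_threshold(scores_and_labels, fp_cost, fn_cost):
    # pick the classification threshold minimizing total misclassification cost
    best_t, best_cost = None, float("inf")
    for t in sorted({s for s, _ in scores_and_labels}):
        fp = sum(1 for s, y in scores_and_labels if s >= t and y == 0)
        fn = sum(1 for s, y in scores_and_labels if s < t and y == 1)
        cost = fp * fp_cost + fn * fn_cost
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t, best_cost

# toy (score, true label) pairs from a hypothetical classifier
data = [(0.9, 1), (0.8, 1), (0.7, 0), (0.6, 1), (0.4, 0), (0.2, 0)]
# false negatives cost 5x as much as false positives
t, cost = best_threshold(data, fp_cost=1.0, fn_cost=5.0)
```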

The common way to evaluate a binary classifier's performance is to look at its [ROC curve](https://en.wikipedia.org/wiki/Receiver_operating_characteristic), which plots the true positive rate versus the false positive rate. We can plot it from the H2O model output:

```{r build_binomial_model_output_1}
fpr = m_binomial@model$training_metrics@metrics$thresholds_and_metric_scores$fpr
tpr = m_binomial@model$training_metrics@metrics$thresholds_and_metric_scores$tpr
fpr_val = m_binomial@model$validation_metrics@metrics$thresholds_and_metric_scores$fpr
tpr_val = m_binomial@model$validation_metrics@metrics$thresholds_and_metric_scores$tpr
plot(fpr,tpr, type='l')
title('ROC Curve')
lines(fpr_val,tpr_val,type='l',col='red')
legend("bottomright",c("Train", "Validation"),col=c("black","red"),lty=c(1,1),lwd=c(3,3))
```

The area under the ROC curve (AUC) is a common goodness-of-fit metric for binary classifiers. For this example, the results were:

```{r build_binomial_model_output_2}
h2o.auc(m_binomial,valid=FALSE) # on the training data
h2o.auc(m_binomial,valid=TRUE) # on the validation data
```
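
For intuition, the AUC is just the area under the (fpr, tpr) curve and can be approximated from those two vectors with the trapezoid rule. A plain-Python sketch:

```python
def auc_trapezoid(fpr, tpr):
    # trapezoid-rule area under the ROC curve;
    # assumes fpr is sorted ascending and the curve spans (0,0) to (1,1)
    area = 0.0
    for i in range(1, len(fpr)):
        area += (fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2.0
    return area

# a perfect classifier traces (0,0) -> (0,1) -> (1,1)
perfect = auc_trapezoid([0.0, 0.0, 1.0], [0.0, 1.0, 1.0])
# random guessing traces the diagonal
chance = auc_trapezoid([0.0, 1.0], [0.0, 1.0])
```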

The default confusion matrix is computed at the threshold that optimizes the [F1 score](https://en.wikipedia.org/wiki/F1_score). We can choose a different threshold - the H2O output shows the optimal thresholds for several common metrics.

```{r build_binomial_model_output_3}
m_binomial@model$training_metrics@metrics$max_criteria_and_metric_scores
```
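
To see what "F1-optimizing threshold" means concretely (F1 = 2PR/(P+R)), one can scan candidate thresholds and keep the best. A plain-Python sketch with made-up scores (H2O computes this for you):

```python
def f1_best_threshold(scores_and_labels):
    # scan candidate thresholds, return the one maximizing F1 = 2PR/(P+R)
    best_t, best_f1 = None, -1.0
    for t in sorted({s for s, _ in scores_and_labels}):
        tp = sum(1 for s, y in scores_and_labels if s >= t and y == 1)
        fp = sum(1 for s, y in scores_and_labels if s >= t and y == 0)
        fn = sum(1 for s, y in scores_and_labels if s < t and y == 1)
        if tp == 0:
            continue
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1

# toy (score, true label) pairs from a hypothetical classifier
data = [(0.95, 1), (0.8, 1), (0.7, 0), (0.55, 1), (0.3, 0), (0.1, 0)]
t, f1 = f1_best_threshold(data)
```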

The model we just built gets 23% classification error at the F1-optimizing threshold, so there is still room for improvement.
Let's add some features:

* There are 11 numerical predictors in the dataset; we will cut each into intervals and add a categorical variable per column
* We can add interaction terms capturing interactions between categorical variables

Let's make a convenience function that cuts a column into intervals, working on all three of our datasets (Train/Validation/Test).
We'll use `h2o.hist` on the Train set to determine interval boundaries (though there are many other ways to do that!).
We'll keep only the bins with non-trivial support:

```{r cut_column}
cut_column <- function(data, col) {
  # need lower/upper bounds due to h2o.cut behavior (points < the first break or > the last break are replaced with missing values)
  min_val = min(data$Train[,col])-1
  max_val = max(data$Train[,col])+1
  x = h2o.hist(data$Train[, col])
  # use only the breaks with enough support
  breaks = x$breaks[which(x$counts > 1000)]
  # assign level names
  lvls = c("min",paste("i_",breaks[1:(length(breaks)-1)],sep=""),"max")
  col_cut <- paste(col,"_cut",sep="")
  data$Train[,col_cut] <- h2o.setLevels(h2o.cut(x = data$Train[,col],breaks=c(min_val,breaks,max_val)),lvls)
  # now do the same for test and validation, but using the breaks computed on the training data!
  if(!is.null(data$Test)) {
    min_val = min(data$Test[,col])-1
    max_val = max(data$Test[,col])+1
    data$Test[,col_cut] <- h2o.setLevels(h2o.cut(x = data$Test[,col],breaks=c(min_val,breaks,max_val)),lvls)
  }
  if(!is.null(data$Valid)) {
    min_val = min(data$Valid[,col])-1
    max_val = max(data$Valid[,col])+1
    data$Valid[,col_cut] <- h2o.setLevels(h2o.cut(x = data$Valid[,col],breaks=c(min_val,breaks,max_val)),lvls)
  }
  data
}
```
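
Conceptually, the cut maps each value to the interval it falls in, and values outside the outermost breaks become missing - which is why the function above pads the break list with `min_val` and `max_val`. A plain-Python sketch (the break values are hypothetical):

```python
import bisect

def cut(values, breaks, labels):
    # bin each value into (breaks[i], breaks[i+1]]; values at or below the
    # lowest break or above the highest become None (a missing value)
    assert len(labels) == len(breaks) - 1
    out = []
    for v in values:
        if v <= breaks[0] or v > breaks[-1]:
            out.append(None)
        else:
            out.append(labels[bisect.bisect_left(breaks, v) - 1])
    return out

# hypothetical Elevation breaks computed on a training histogram
breaks = [1858, 2500, 3000, 3859]
labels = ["min", "i_2500", "max"]
binned = cut([2000, 2600, 3500, 4000], breaks, labels)
```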
Now let's make a convenience function that generates interaction terms on all three of our datasets. We'll use `h2o.interaction`:

```{r generate_interactions}
interactions <- function(data, cols, pairwise = TRUE) {
  iii = h2o.interaction(data = data$Train, destination_frame = "itrain",factors = cols,pairwise=pairwise,max_factors=1000,min_occurrence=100)
  data$Train <- h2o.cbind(data$Train,iii)
  if(!is.null(data$Test)) {
    iii = h2o.interaction(data = data$Test, destination_frame = "itest",factors = cols,pairwise=pairwise,max_factors=1000,min_occurrence=100)
    data$Test <- h2o.cbind(data$Test,iii)
  }
  if(!is.null(data$Valid)) {
    iii = h2o.interaction(data = data$Valid, destination_frame = "ivalid",factors = cols,pairwise=pairwise,max_factors=1000,min_occurrence=100)
    data$Valid <- h2o.cbind(data$Valid,iii)
  }
  data
}
```
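
Conceptually, an interaction term crosses two factor columns into one combined factor, with rare combinations pruned via the `min_occurrence` cutoff. A plain-Python sketch (made-up column values, a tiny cutoff, and a simple catch-all level - not exactly what H2O does internally):

```python
from collections import Counter

def interact(col_a, col_b, min_occurrence=2):
    # cross two categorical columns into a single combined factor,
    # mapping combinations rarer than min_occurrence to a catch-all level
    combined = [f"{a}_{b}" for a, b in zip(col_a, col_b)]
    counts = Counter(combined)
    return [c if counts[c] >= min_occurrence else "other" for c in combined]

# hypothetical factor columns
wilderness = ["area_1", "area_1", "area_2", "area_1", "area_2"]
soil = ["type_3", "type_3", "type_7", "type_3", "type_1"]
crossed = interact(wilderness, soil)
```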
Finally, let's wrap the addition of these features into a separate function, as we will use it again later.
We'll add intervals for each numeric column and interactions between each pair of categorical columns.

```{r add_features}
# add features to our cover type example
# cut all the numerical columns into intervals and add interactions between categorical terms
add_features <- function(data) {
  names(data) <- c("Train","Test","Valid")
  data = cut_column(data,'Elevation')
  data = cut_column(data,'Hillshade_Noon')
  data = cut_column(data,'Hillshade_9am')
  data = cut_column(data,'Hillshade_3pm')
  data = cut_column(data,'Horizontal_Distance_To_Hydrology')
  data = cut_column(data,'Slope')
  data = cut_column(data,'Horizontal_Distance_To_Roadways')
  data = cut_column(data,'Aspect')
  # pairwise interactions between all categorical columns
  interaction_cols = c("Elevation_cut","Wilderness_Area","Soil_Type","Hillshade_Noon_cut","Hillshade_9am_cut","Hillshade_3pm_cut","Horizontal_Distance_To_Hydrology_cut","Slope_cut","Horizontal_Distance_To_Roadways_cut","Aspect_cut")
  data = interactions(data, interaction_cols)
  # interactions between the Hillshade columns
  interaction_cols2 = c("Hillshade_Noon_cut","Hillshade_9am_cut","Hillshade_3pm_cut")
  data = interactions(data, interaction_cols2,pairwise = FALSE)
  data
}
```
Now we generate the new features and add them to the dataset. We also need to regenerate the predictor column names, as we added more columns:

```{r add_features_binomial}
# Add Features
data_binomial_ext <- add_features(data_binomial)
data_binomial_ext$Train <- h2o.assign(data_binomial_ext$Train,"train_b_ext")
data_binomial_ext$Valid <- h2o.assign(data_binomial_ext$Valid,"valid_b_ext")
data_binomial_ext$Test <- h2o.assign(data_binomial_ext$Test,"test_b_ext")
y = "Cover_Type"
x = names(data_binomial_ext$Train)
x = x[-which(x==y)]
```
Let's build the model! We should add some regularization this time because we added correlated variables, so let's try the default:

```{r build_binomial_ext_1}
m_binomial_1_ext = try(h2o.glm(training_frame = data_binomial_ext$Train, validation_frame = data_binomial_ext$Valid, x = x, y = y, family='binomial'))
```
Oops, that doesn't run - we now have more features than the default solver can handle with 2GB of RAM. Let's try L-BFGS instead.

```{r build_binomial_ext_2}
m_binomial_1_ext = h2o.glm(training_frame = data_binomial_ext$Train, validation_frame = data_binomial_ext$Valid, x = x, y = y, family='binomial', solver='L_BFGS')
h2o.confusionMatrix(m_binomial_1_ext)
h2o.auc(m_binomial_1_ext,valid=TRUE)
```
Not much better - maybe we have too much regularization? Let's pick a smaller lambda and try again.

```{r build_binomial_ext_3}
m_binomial_2_ext = h2o.glm(training_frame = data_binomial_ext$Train, validation_frame = data_binomial_ext$Valid, x = x, y = y, family='binomial', solver='L_BFGS', lambda=1e-4)
h2o.confusionMatrix(m_binomial_2_ext, valid=TRUE)
h2o.auc(m_binomial_2_ext,valid=TRUE)
```
Way better: we got an AUC of 0.91 and a classification error of 0.180838.
We picked our regularization strength arbitrarily, and we used only the l2 penalty even though we added a lot of extra features, some of which may be useless.
Maybe we can do better with an l1 penalty.
So let's run a lambda search to find the optimal penalty strength, with a non-zero l1 penalty to get a sparse solution.
We'll use the IRLSM solver this time, as it does much better with lambda search and the l1 penalty.
Recall that we were not able to use it before. We can use it now because the lambda search filters out a large portion of the inactive (coefficient == 0) predictors.
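
For intuition on why the l1 penalty produces sparsity: the coordinate-wise lasso update soft-thresholds each coefficient, setting it to exactly zero when its signal is weaker than the penalty, and the lambda search walks a decreasing sequence of penalties so coefficients enter the model gradually. A one-dimensional plain-Python sketch:

```python
def soft_threshold(z, lam):
    # one-dimensional lasso update: shrink toward zero, clip at exactly zero
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

# as lambda decreases along the search path, the coefficient "enters" the model
path = [soft_threshold(0.8, lam) for lam in (1.0, 0.5, 0.1, 0.0)]
# a weak signal stays exactly zero under a strong enough penalty
weak = soft_threshold(0.05, 0.1)
```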

```{r build_binomial_ext_4}
m_binomial_3_ext = h2o.glm(training_frame = data_binomial_ext$Train, validation_frame = data_binomial_ext$Valid, x = x, y = y, family='binomial', lambda_search=TRUE)
h2o.confusionMatrix(m_binomial_3_ext, valid=TRUE)
h2o.auc(m_binomial_3_ext,valid=TRUE)
```
Better yet: we get 17% error while using only 3000 out of the 7000 features.
Our new features improved the binomial model significantly, so let's revisit the earlier multinomial model and see whether they make a difference there (they should!):

```{r build_multinomial_ext_4}
# Multinomial Model 2
# let's revisit the multinomial case with our new features
data_ext <- add_features(data)
data_ext$Train <- h2o.assign(data_ext$Train,"train_m_ext")
data_ext$Valid <- h2o.assign(data_ext$Valid,"valid_m_ext")
data_ext$Test <- h2o.assign(data_ext$Test,"test_m_ext")
y = "Cover_Type"
x = names(data_ext$Train)
x = x[-which(x==y)]
m2 = h2o.glm(training_frame = data_ext$Train, validation_frame = data_ext$Valid, x = x, y = y,family='multinomial',solver='L_BFGS',lambda=1e-4)
# 21% err, down from 28%
h2o.confusionMatrix(m2, valid=TRUE)
```
The error improved considerably: 21% instead of 28%.