|
12 | 12 | "cell_type": "markdown", |
13 | 13 | "metadata": {}, |
14 | 14 | "source": [ |
15 | | - "##Task: Predicting forest cover type from cartographic variables only\n", |
| 15 | + "## Task: Predicting forest cover type from cartographic variables only\n", |
16 | 16 | "\n", |
17 | 17 | "The actual forest cover type for a given observation (30 x 30 meter cell) was determined from the US Forest Service (USFS). We are using the UC Irvine Covertype dataset." |
18 | 18 | ] |
|
100 | 100 | "cell_type": "markdown", |
101 | 101 | "metadata": {}, |
102 | 102 | "source": [ |
103 | | - "##H2O GBM and RF\n", |
| 103 | + "## H2O GBM and RF\n", |
104 | 104 | "\n", |
105 | 105 | "While H2O Gradient Boosting Models and H2O Random Forest have many flexible parameters options, they were designed to be just as easy to use as the other supervised training methods in H2O. Early stopping, automatic data standardization and handling of categorical variables and missing values and adaptive learning rates (per weight) reduce the amount of parameters the user has to specify. Often, it's just the number and sizes of hidden layers, the number of epochs and the activation function and maybe some regularization techniques. " |
106 | 106 | ] |
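For a sense of how little needs to be specified in practice, here is a minimal sketch of a GBM with early stopping, assuming an H2O cluster is already running, `train`/`valid` H2OFrames exist, and the response column is named `Cover_Type` (all names and values are illustrative, not the notebook's exact cells):

```python
from h2o.estimators.gbm import H2OGradientBoostingEstimator

# All columns except the (assumed) response are used as predictors.
predictors = [c for c in train.columns if c != "Cover_Type"]

gbm_sketch = H2OGradientBoostingEstimator(
    stopping_rounds=2,        # stop once the validation metric stops improving
    stopping_tolerance=1e-3,
    score_each_iteration=True,
    seed=1234)
gbm_sketch.train(x=predictors, y="Cover_Type",
                 training_frame=train, validation_frame=valid)
```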
|
109 | 109 | "cell_type": "markdown", |
110 | 110 | "metadata": {}, |
111 | 111 | "source": [ |
112 | | - "###Getting started\n", |
| 112 | + "### Getting started\n", |
113 | 113 | "\n", |
114 | 114 | "We begin by importing our data into H2OFrames, which operate similarly in function to pandas DataFrames but exist on the H2O cloud itself. \n", |
115 | 115 | "\n", |
|
124 | 124 | }, |
125 | 125 | "outputs": [], |
126 | 126 | "source": [ |
127 | | - "covtype_df = h2o.import_file(\"../data/covtype.full.csv\")" |
| 127 | + "covtype_df = h2o.import_file(os.path.realpath(\"../data/covtype.full.csv\"))" |
128 | 128 | ] |
129 | 129 | }, |
130 | 130 | { |
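After the import it is worth confirming what landed in the cluster. A quick sketch (the response column name `Cover_Type` is an assumption about this CSV):

```python
# Basic sanity checks on the imported H2OFrame.
covtype_df.describe()              # column types, ranges, missing counts
print(covtype_df.dim)              # [rows, columns]
covtype_df["Cover_Type"].table()   # class counts for the (assumed) response column
```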
|
164 | 164 | "cell_type": "markdown", |
165 | 165 | "metadata": {}, |
166 | 166 | "source": [ |
167 | | - "###The First Random Forest\n", |
| 167 | + "### The First Random Forest\n", |
168 | 168 | "We build our first model with the following parameters\n", |
169 | 169 | "\n", |
170 | 170 | "**model_id:** Not required, but allows us to easily find our model in the [Flow](http://localhost:54321/) interface \n", |
|
194 | 194 | "cell_type": "markdown", |
195 | 195 | "metadata": {}, |
196 | 196 | "source": [ |
197 | | - "###Model Construction\n", |
| 197 | + "### Model Construction\n", |
198 | 198 | "H2O in Python is designed to be very similar in look and feel to to scikit-learn. Models are initialized individually with desired or default parameters and then trained on data. \n", |
199 | 199 | "\n", |
200 | 200 | "**Note that the below example uses model.train() as opposed the traditional model.fit()** \n", |
|
277 | 277 | "cell_type": "markdown", |
278 | 278 | "metadata": {}, |
279 | 279 | "source": [ |
280 | | - "###Now for GBM\n", |
| 280 | + "### Now for GBM\n", |
281 | 281 | "\n", |
282 | 282 | "First we will use all default settings, then make some changes to improve our predictions." |
283 | 283 | ] |
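A defaults-only GBM is just the estimator with (almost) no arguments; a minimal sketch under the same naming assumptions as above:

```python
from h2o.estimators.gbm import H2OGradientBoostingEstimator

# All defaults first; tuning comes in the next round.
gbm_v1 = H2OGradientBoostingEstimator(model_id="gbm_covType_v1", seed=2000000)
gbm_v1.train(x=predictors, y="Cover_Type",
             training_frame=train, validation_frame=valid)
```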
|
341 | 341 | "cell_type": "markdown", |
342 | 342 | "metadata": {}, |
343 | 343 | "source": [ |
344 | | - "###GBM Round 2\n", |
| 344 | + "### GBM Round 2\n", |
345 | 345 | "\n", |
346 | 346 | "Let's do the following:\n", |
347 | 347 | "\n", |
|
375 | 375 | "cell_type": "markdown", |
376 | 376 | "metadata": {}, |
377 | 377 | "source": [ |
378 | | - "###Live Performance Monitoring\n", |
| 378 | + "### Live Performance Monitoring\n", |
379 | 379 | "\n", |
380 | 380 | "While this is running, we can actually look at the model. To do this we simply need a new connection to H2O. \n", |
381 | 381 | "\n", |
|
454 | 454 | "cell_type": "markdown", |
455 | 455 | "metadata": {}, |
456 | 456 | "source": [ |
457 | | - "###Parity\n", |
| 457 | + "### Parity\n", |
458 | 458 | "\n", |
459 | 459 | "Now the GBM is close to the initial random forest.\n", |
460 | 460 | "\n", |
|
471 | 471 | "cell_type": "markdown", |
472 | 472 | "metadata": {}, |
473 | 473 | "source": [ |
474 | | - "###Random Forest #2" |
| 474 | + "### Random Forest #2" |
475 | 475 | ] |
476 | 476 | }, |
477 | 477 | { |
|
508 | 508 | "cell_type": "markdown", |
509 | 509 | "metadata": {}, |
510 | 510 | "source": [ |
511 | | - "###Final Predictions\n", |
| 511 | + "### Final Predictions\n", |
512 | 512 | "\n", |
513 | 513 | "Now that we have our validation accuracy up beyond 95%, we can start considering our test data. \n", |
514 | 514 | "We have withheld an extra test set to ensure that after all the parameter tuning we have repeatedly applied with the validation data, we still have a completely pristine data set upon which to test the predictive capacity of our model." |
|
584 | 584 | "Our final error rates are very similar between validation and test sets. This suggests that we did not overfit the validation set during our experimentation. This concludes our demo of H2O GBM and H2O Random Forests.\n", |
585 | 585 | "\n", |
586 | 586 | "\n", |
587 | | - "###Shut down the cluster\n", |
| 587 | + "### Shut down the cluster\n", |
588 | 588 | "Shut down the cluster now that we are done using it." |
589 | 589 | ] |
590 | 590 | }, |
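Depending on the h2o version, a prompt-free shutdown looks like one of the following:

```python
# Newer h2o Python API:
h2o.cluster().shutdown()

# Older releases expose the same operation at module level:
# h2o.shutdown(prompt=False)
```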
|
603 | 603 | "cell_type": "markdown", |
604 | 604 | "metadata": {}, |
605 | 605 | "source": [ |
606 | | - "###Possible Further Steps\n", |
| 606 | + "### Possible Further Steps\n", |
607 | 607 | "\n", |
608 | 608 | "Model-agnostic gains can be found in improving handling of categorical features. We could experiment with the nbins and nbins_cats settings to control the H2O splitting.The general guidance is to lower the number to increase generalization (avoid overfitting), increase to better fit the distribution. \n", |
609 | 609 | " \n", |
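A sketch of what such an experiment could look like (parameter values are purely illustrative):

```python
from h2o.estimators.gbm import H2OGradientBoostingEstimator

# Try coarser binning to encourage generalization.
gbm_nbins_experiment = H2OGradientBoostingEstimator(
    model_id="gbm_covType_nbins",  # illustrative id
    nbins=16,                      # fewer bins for numeric splits
    nbins_cats=64,                 # fewer bins for categorical splits
    seed=3000000)
gbm_nbins_experiment.train(x=predictors, y="Cover_Type",
                           training_frame=train, validation_frame=valid)
```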
|