- [Sparkling Water 1.5.4](http://h2o-release.s3.amazonaws.com/sparkling-water/rel-1.5/4/index.html) ([USB](../../SparklingWater))
- [SMS dataset](https://raw.githubusercontent.com/h2oai/sparkling-water/master/examples/smalldata/smsData.txt) ([USB](../data/smsData.txt))

## Provided on USB
- [Binaries](../../)
- [SMS dataset](../data/smsData.txt)
- [Slides](SparklingWater.pdf)
- [Scala Script](h2oworld.script.scala)

## Machine Learning Workflow

**Goal**: For a given text message, identify if it is spam or not.

|
20 | 20 | 1. Extract data |
21 | 21 | 2. Transform, tokenize messages |
```
bin/sparkling-shell --conf spark.executor.memory=2G
```

> Note: To avoid flooding output with Spark INFO messages, I recommend editing your `$SPARK_HOME/conf/log4j.properties` and configuring the log level to `WARN`.

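For reference, the change amounts to one line in `$SPARK_HOME/conf/log4j.properties` (if the file does not exist yet, copy Spark's bundled `log4j.properties.template`); the `rootCategory` name below is the one Spark 1.x ships with, so verify it against your version:

```properties
# Log everything to the console at WARN instead of the default INFO
log4j.rootCategory=WARN, console
```
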
2. Open Spark UI: Go to [http://localhost:4040/](http://localhost:4040/) to see the Spark status.

3. Prepare the environment:
```scala
// Input data
val DATAFILE = "../data/smsData.txt"
import water.Key
```

4. Define the representation of a training message:
```scala
// Representation of a training message
case class SMS(target: String, fv: mllib.linalg.Vector)
```

5. Define the data loader and parser:
```scala
def load(dataFile: String): RDD[Array[String]] = {
  // Load the file, split each line on TABs, and drop rows with an empty first field
  sc.textFile(dataFile).map(l => l.split("\t")).filter(r => !r(0).isEmpty)
}
```

6. Define the input messages tokenizer:
```scala
// Tokenizer
// For each sentence in the input RDD, produce the array of strings
// representing the individual interesting words in the sentence
}
```
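The body of the tokenizer is omitted here. A minimal sketch of the idea, where the stopword set, the length cutoff, and the helper name `tokenizeWords` are illustrative assumptions rather than the tutorial's exact code:

```scala
// Illustrative stopword list (assumption; the tutorial's actual list differs)
val ignoredWords = Set("the", "a", "an", "and", "or", "in", "on", "to")

// Normalize one message and keep only the "interesting" words
def tokenizeWords(text: String): Seq[String] =
  text.toLowerCase
      .replaceAll("[^a-z\\s]", " ")   // replace digits and punctuation with spaces
      .split("\\s+")
      .toList
      .filter(w => w.length > 2 && !ignoredWords.contains(w))

// In the tutorial this is applied per message, e.g. dataRDD.map(tokenizeWords)
println(tokenizeWords("Free prize NOW, call the number!!!"))
// prints: List(free, prize, now, call, number)
```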

7. Configure Spark's Tf-IDF model builder:
```scala
def buildIDFModel(tokensRDD: RDD[Seq[String]],
                  minDocFreq: Int = 4,
}
```

> **Wikipedia** defines tf-idf as: "tf–idf, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in information retrieval and text mining. The tf-idf value increases proportionally to the number of times a word appears in the document, but is offset by the frequency of the word in the corpus, which helps to adjust for the fact that some words appear more frequently in general."

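As a quick sanity check of that definition, here is a toy tf-idf computation in plain Scala. The corpus is made up, and the smoothed idf formula `log((N + 1) / (df + 1))` follows the convention used by Spark MLlib's `IDF`:

```scala
// Toy corpus of four tokenized messages (illustrative, not the SMS data)
val corpus = Seq(
  Seq("free", "prize", "now"),
  Seq("free", "call"),
  Seq("meeting", "tomorrow"),
  Seq("lunch", "tomorrow"))

val numDocs = corpus.size
// Document frequency: in how many messages does the term occur?
def df(term: String): Int = corpus.count(_.contains(term))
// Smoothed inverse document frequency (Spark MLlib convention)
def idf(term: String): Double = math.log((numDocs + 1.0) / (df(term) + 1.0))
// Raw term count as tf
def tfIdf(term: String, doc: Seq[String]): Double = doc.count(_ == term) * idf(term)

// "free" appears in 2 of 4 messages: weight ln(5/3), roughly 0.51
println(f"tf-idf of 'free' in doc 0:  ${tfIdf("free", corpus.head)}%.3f")
// "prize" appears in only 1 message, so it is weighted higher: ln(5/2), roughly 0.92
println(f"tf-idf of 'prize' in doc 0: ${tfIdf("prize", corpus.head)}%.3f")
```

Rarer words thus get larger weights, which is exactly why distinctive spam words stand out in the feature vectors.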
8. Configure H2O's DeepLearning model builder:
```scala
def buildDLModel(trainHF: Frame, validHF: Frame,
                 epochs: Int = 10, l1: Double = 0.001, l2: Double = 0.0,
}
```

9. Initialize `H2OContext` and start H2O services on top of Spark:
```scala
// Create SQL support
import org.apache.spark.sql._
h2oContext.openFlow
```

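The middle of that block, where `h2oContext` itself is created, is omitted here. A sketch of the usual Sparkling Water 1.x pattern from that era's documentation (verify the exact incantation against your Sparkling Water version; `sc` is the shell's `SparkContext`):

```scala
import org.apache.spark.h2o._
// Start an H2O cloud embedded in the Spark cluster
val h2oContext = new H2OContext(sc).start()
// Bring implicit conversions between RDDs/DataFrames and H2OFrames into scope
import h2oContext._
```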
> At this point, you can use the H2O UI and see the status of the H2O cloud by typing `getCloud`.

11. Build the final workflow using all the building pieces:
```scala
// Data load