Commit 716f7d3

Author: h2o
Update tutorials/sparkling-water/README.md
1 parent: b4d50ad

1 file changed: 14 additions & 15 deletions

File tree

tutorials/sparkling-water/README.md

````diff
@@ -7,15 +7,15 @@
 - [Sparkling Water 1.5.4](http://h2o-release.s3.amazonaws.com/sparkling-water/rel-1.5/4/index.html) ([USB](../../SparklingWater))
 - [SMS dataset](https://raw.githubusercontent.com/h2oai/sparkling-water/master/examples/smalldata/smsData.txt) ([USB](../data/smsData.txt))
 
-## Provided USB
+## Provided on USB
 - [Binaries](../../)
 - [SMS dataset](../data/smsData.txt)
 - [Slides](SparklingWater.pdf)
 - [Scala Script](h2oworld.script.scala)
 
 ## Machine Learning Workflow
 
-**Goal**: For a given text message identify if it is spam or not.
+**Goal**: For a given text message, identify if it is spam or not.
 
 1. Extract data
 2. Transform, tokenize messages
````
````diff
@@ -33,11 +33,11 @@
 bin/sparkling-shell --conf spark.executor.memory=2G
 ```
 
-> Note: I would recommend to edit your `$SPARK_HOME/conf/log4j.properties` and configure log level to `WARN` to avoid flooding output with Spark INFO messages.
+> Note: To avoid flooding output with Spark INFO messages, I recommend editing your `$SPARK_HOME/conf/log4j.properties` and configuring the log level to `WARN`.
 
-2. Open Spark UI: You can go to [http://localhost:4040/](http://localhost:4040/) to see the Spark status.
+2. Open Spark UI: Go to [http://localhost:4040/](http://localhost:4040/) to see the Spark status.
 
-3. Prepare environment
+3. Prepare the environment:
 ```scala
 // Input data
 val DATAFILE="../data/smsData.txt"
````
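For reference, the log4j change the note above describes amounts to a one-line edit, assuming you start from the `log4j.properties.template` that ships with Spark (whose root category defaults to `INFO`):

```
# $SPARK_HOME/conf/log4j.properties
# Copy log4j.properties.template to log4j.properties, then lower the
# root category from INFO to WARN:
log4j.rootCategory=WARN, console
```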
````diff
@@ -52,21 +52,21 @@
 import water.Key
 ```
 
-4. Define representation of training message:
+4. Define the representation of the training message:
 ```scala
 // Representation of a training message
 case class SMS(target: String, fv: mllib.linalg.Vector)
 ```
 
-5. Define data loader and parser:
+5. Define the data loader and parser:
 ```scala
 def load(dataFile: String): RDD[Array[String]] = {
   // Load file into memory, split on TABs and filter all empty lines
   sc.textFile(dataFile).map(l => l.split("\t")).filter(r => !r(0).isEmpty)
 }
 ```
 
-6. Input messages tokenizer:
+6. Define the input messages tokenizer:
 ```scala
 // Tokenizer
 // For each sentence in input RDD it provides array of string representing individual interesting words in the sentence
````
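The `load` step above splits each raw line on TABs and keeps only rows whose first field (the `ham`/`spam` label) is non-empty. As a hedged sketch, the same split-and-filter logic can be run on a plain Scala collection, no Spark needed; the sample lines here are invented for illustration, not taken from `smsData.txt`:

```scala
// Same TAB-split / non-empty-label filter as `load`, applied to a plain
// Scala collection instead of an RDD (sample lines are illustrative).
object ParseSketch {
  def parse(lines: Seq[String]): Seq[Array[String]] =
    lines.map(_.split("\t")).filter(r => !r(0).isEmpty)

  def main(args: Array[String]): Unit = {
    val lines = Seq(
      "ham\tOk lar... Joking wif u oni",
      "spam\tFree entry in 2 a wkly comp",
      "\ta row with an empty label is dropped")
    val rows = parse(lines)
    println(rows.length)  // prints 2
    println(rows.head(0)) // prints ham
  }
}
```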
````diff
@@ -92,7 +92,7 @@
 }
 ```
 
-7. Spark's Tf-IDF model builder.
+7. Configure Spark's Tf-IDF model builder:
 ```scala
 def buildIDFModel(tokensRDD: RDD[Seq[String]],
                   minDocFreq: Int = 4,
````
````diff
@@ -108,9 +108,9 @@
 }
 ```
 
-> **Wikipedia** says: "tf–idf, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in information retrieval and text mining. The tf-idf value increases proportionally to the number of times a word appears in the document, but is offset by the frequency of the word in the corpus, which helps to adjust for the fact that some words appear more frequently in general.
+> **Wikipedia** defines TF-IDF as: "tf–idf, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in information retrieval and text mining. The tf-idf value increases proportionally to the number of times a word appears in the document, but is offset by the frequency of the word in the corpus, which helps to adjust for the fact that some words appear more frequently in general.
 
-8. H2O's DeepLearning model builder:
+8. Configure H2O's DeepLearning model builder:
 ```scala
 def buildDLModel(trainHF: Frame, validHF: Frame,
                  epochs: Int = 10, l1: Double = 0.001, l2: Double = 0.0,
````
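The tf–idf arithmetic quoted above can be worked through on a toy corpus in plain Scala. This is an illustrative sketch, not the tutorial's Spark code; it uses the smoothed formula `idf(t) = log((numDocs + 1) / (docFreq(t) + 1))` that Spark MLlib's `IDF` documents, and the two tiny "messages" are made up:

```scala
// Toy tf–idf on a two-document corpus (plain Scala, no Spark).
// idf(t) = log((numDocs + 1) / (docFreq(t) + 1)), as in MLlib's IDF.
object TfIdfToy {
  def tfIdf(corpus: Seq[Seq[String]]): Map[String, Seq[Double]] = {
    val numDocs = corpus.length
    val vocab = corpus.flatten.distinct
    vocab.map { term =>
      val docFreq = corpus.count(_.contains(term))
      val idf = math.log((numDocs + 1.0) / (docFreq + 1.0))
      // raw term frequency in each document, weighted by idf
      term -> corpus.map(doc => doc.count(_ == term) * idf)
    }.toMap
  }

  def main(args: Array[String]): Unit = {
    val corpus = Seq(
      Seq("free", "prize", "call", "now"),
      Seq("call", "me", "later"))
    val weights = tfIdf(corpus)
    // "call" appears in every document, so its idf (hence tf-idf) is 0;
    // "free" appears in only one document, so it gets a positive weight.
    println(weights("free").head > weights("call").head) // prints true
  }
}
```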
````diff
@@ -150,7 +150,7 @@
 }
 ```
 
-9. Initialize `H2OContext` and start H2O services on top of the Spark:
+9. Initialize `H2OContext` and start H2O services on top of Spark:
 ```scala
 // Create SQL support
 import org.apache.spark.sql._
````
````diff
@@ -167,9 +167,8 @@
 h2oContext.openFlow
 ```
 
-> At this point, you can go use H2O UI and see status of H2O cloud by typing `getCloud`.
-
-
+> At this point, you can use the H2O UI and see the status of the H2O cloud by typing `getCloud`.
+
 11. Build the final workflow by using all building pieces:
 ```scala
 // Data load
````
