Write a Python program called `gibberish.py` that uses the Markov chain algorithm to generate new words from the words in a set of training files. The program should take one or more positional arguments which are files that you read, word-by-word, noting which letters follow a given `-k|--kmer_size` (default `2`) grouping of letters. E.g., in the word "alabama" with `k=1`, the frequency table will look like:
````
a = l, b, m
l = a
b = a
m = a
````

That is, given this training set, if you started with `l` you could only choose an `a` to follow.
The program should generate `-n|--num_words` words (default `10`), each a random size between `k + 2` and a `-m|--max_word` size (default `12`). Be sure to accept `-s|--seed` to pass to `random.seed`. My solution also takes a `-d|--debug` flag that will emit debug messages to `.log` for you to inspect.
If provided no arguments or the `-h|--help` flag, generate a usage:
Choose the best words and create definitions for them:

* yulcogicism: the study of Christmas gnostics
* umjamp: skateboarding trick
* callots: insignia of officers in Greek army
* urchenev: fungal growth found under cobblestones
## Kmers
To create the Markov chains, you'll need to read all the words from each file. Use `str.lower` to lowercase all the text and then remove any character that is not in the regular English alphabet (a-z). You'll need to extract "k-mers" or "n-grams" from each word. In the text "abcd," if `k=2` then the 2-mers are "ab," "bc," and "cd." If `k=3`, then the 3-mers are "abc" and "bcd." It may be helpful to know that the number `n` of kmers in a string of length `l` is `n = l - k + 1`.
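The formula is easy to sanity-check with a bit of arithmetic on the string "abcd" (`l = 4`):

````
# Quick check of n = l - k + 1 using "abcd" (l = 4).
l = len('abcd')
assert l - 1 + 1 == 4  # four 1-mers: a, b, c, d
assert l - 2 + 1 == 3  # three 2-mers: ab, bc, cd
assert l - 3 + 1 == 2  # two 3-mers: abc, bcd
assert l - 4 + 1 == 1  # one 4-mer: abcd
````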
Consider writing a function `kmers(text, k=1)` that extracts the kmers from some text, and then add this test function to your program:
````
def test_kmers():
    """Test kmers"""

    assert kmers('abcd') == list('abcd')
    assert kmers('abcd', 2) == ['ab', 'bc', 'cd']
    assert kmers('abcd', 3) == ['abc', 'bcd']
    assert kmers('abcd', 4) == ['abcd']
    assert kmers('abcd', 5) == []
````
Run your program with `pytest -v gibberish.py` and see if it passes.
## Chains
To create the Markov chains, you'll need to get all the kmers of size `k + 1` for all the words in all the texts. That is, if `k=3` you need to find all the 4-mers so that you can find the character *after* each 3-mer in the texts. For example, in the text "The quick brown fox jumps over the lazy dog.", we need to create a data structure that looks like this:
````
106
+
>>> from pprint import pprint as pp
107
+
>>> pp(chains)
108
+
{'bro': ['w'],
109
+
'jum': ['p'],
110
+
'laz': ['y'],
111
+
'ove': ['r'],
112
+
'qui': ['c'],
113
+
'row': ['n'],
114
+
'uic': ['k'],
115
+
'ump': ['s']}
116
+
````
For every 3-mer, we need to know all the characters that follow each. Obviously this is not very exciting given the small size of the input text.
Consider writing a function `read_training(fhs, k=1)` that reads the input training files and returns a dictionary of kmer chains. Then add this test function to verify that it works properly:
````
def test_read_training():
    """Test read_training"""

    text = 'The quick brown fox jumps over the lazy dog.'
    assert read_training([io.StringIO(text)], k=3) == {
        'bro': ['w'], 'jum': ['p'], 'laz': ['y'], 'ove': ['r'],
        'qui': ['c'], 'row': ['n'], 'uic': ['k'], 'ump': ['s']}
````
Once you have the chains of letters that follow each kmer, you can use `random.choice` to find a starting kmer from the `keys` of your chain dictionary. Also use that function to select a length for your new word from the range of `k + 2` to the `args.max_word` (which defaults to `12`). Build up your new word by again using `random.choice` to select from the possibilities for the kmer, which will change through each iteration.
That is, if `k=3` and you start with the randomly selected kmer `ero`, you might get `n` as your next letter. On the next iteration of the loop, the `kmer` will be `ron` and you will look to see what letters follow that 3-mer. You might get `d`, and so the next time you would look for the letters following `ond`, and so forth. Continue until you've built a word that is the length you selected.
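That loop might be sketched like this. The names `chains`, `word`, and `word_len` are my own, and the tiny hand-built chain exists only for illustration; a real run would use the full dictionary built from the training files:

````
import random

# A tiny hand-built chain for illustration only (k=3); each key here has
# a single follower, so the walk is deterministic once it starts.
chains = {'ero': ['n'], 'ron': ['d'], 'ond': ['o'], 'ndo': ['n']}

kmer = random.choice(list(chains))           # random starting kmer
word = kmer
word_len = 7                                 # normally random.choice(range(k + 2, max_word + 1))
while len(word) < word_len and kmer in chains:
    next_char = random.choice(chains[kmer])  # a letter seen after this kmer
    word += next_char
    kmer = kmer[1:] + next_char              # slide the window forward
print(word)
````

The `kmer in chains` check stops the walk early if the new kmer was never seen in the training text, which can happen with real data.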
Hints:
* Define the input files with `type=argparse.FileType('r')` so that `argparse` will validate that the user provides readable files and will then `open` them for you.
* Consider using the `logging` module to print out debugging messages. Run the `solution.py` with the `-d` flag and then inspect the `.log` file.
As recommended in the description, I define my arguments in `get_args` to rely on `argparse` to validate as much as possible, e.g., to verify that I get `int` values or readable files, and I provide reasonable defaults for everything but the required `file` argument. I additionally define a `-d|--debug` flag that is only `True` when present so that I can add this bit of code:
````
logging.basicConfig(
    filename='.log',
    filemode='w',
    level=logging.DEBUG if args.debug else logging.CRITICAL)
````
This is a simple and effective way to turn debugging messages on and off. I usually write to a `.log` file, being sure to choose a name that starts with a `.` so that it will normally be hidden when I `ls` the directory. Since `filemode='w'`, the file will be overwritten on each run. I set the threshold to `logging.DEBUG` if the `debug` flag is `True`; otherwise the `logging` module will only emit messages at the `CRITICAL` level. As I don't have any "critical" messages, the `.log` file will be empty unless `--debug` is present. Then the `logging.debug()` calls throughout my code will only log messages when I ask. This is easier than putting `print` statements in your code that you have to remove or comment out when you are done debugging.
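For instance, a `logging.debug` call like this (the message and data here are hypothetical) only shows up in `.log` when the `DEBUG` level is in effect:

````
import logging

# Pretend args.debug is True for this sketch, so the level is DEBUG.
logging.basicConfig(
    filename='.log',
    filemode='w',
    level=logging.DEBUG)

chains = {'ab': ['c', 'd']}           # hypothetical data
logging.debug('chains = %s', chains)  # written to .log, not the screen
````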
## Finding kmers in text
If you followed my advice about breaking down the problem, then you probably created a `kmers` function:
````
>>> def kmers(text, k=1):
...     return [text[i:i + k] for i in range(len(text) - k + 1)]
...
````
Using the formula given in the intro for the number of kmers in a string, I use the `range` function to get the start position of each of those kmers and then get the slice of the `text` from that position to the position `k` away.
I can verify it works in the REPL:
````
>>> kmers('abcd', 2)
['ab', 'bc', 'cd']
>>> kmers('abcd', 3)
['abc', 'bcd']
````
But more importantly, I can write a `test_kmers` function that I embed in my code and run with `pytest`!
## Reading the input files
Since I used the `argparse.FileType` to define the `file` with `nargs='+'`, I have a `list` of *open file handles* that can be read. I defined a `read_training` function that iterates over all the words in each file by calling `fh.read().split()`. As this breaks the text on spaces, various bits of punctuation may still be attached:
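A quick REPL check illustrates the problem; splitting on whitespace leaves the final period attached:

````
text = 'The quick brown fox jumps over the lazy dog.'
words = text.split()
print(words[-1])  # 'dog.' -- the period is still attached
````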
So I use a regular expression to remove anything that is *not* in the set of letters "a-z" by defining a negated character class `[^a-z]`. I create a one-line function to `lower` the word and `clean` it:
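That one-liner might look something like this sketch (the name `clean` follows the prose above; the exact solution may differ):

````
import re


def clean(word):
    """Lowercase a word and strip anything that is not a-z (sketch)."""

    return re.sub('[^a-z]', '', word.lower())
````

For example, `clean('dog.')` gives `'dog'` and `clean("Can't")` gives `'cant'`.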
I can now get all the kmers for each word by using my `kmers` function. I put all this into a function called `read_training`. It takes a `list` of open file handles (which I get from `argparse`) and a `k` which defaults to `1`:
But, again, *more importantly, I can write a test that verifies it works*! If you copy in the `test_read_training` function, you have the assurance that you are creating valid chains.