NetHead21
diff --git a/‎book.md‎
Lines changed: 626 additions & 256 deletions b/‎book.md‎
diff --git a/‎bottles_of_beer/discussion.md‎
Lines changed: 2 additions & 0 deletions b/‎bottles_of_beer/discussion.md‎
diff --git a/‎gibberish/.gitignore‎
Lines changed: 1 addition & 0 deletions b/‎gibberish/.gitignore‎
diff --git a/‎gibberish/.log‎ b/‎gibberish/.log‎
diff --git a/‎gibberish/README.md‎
Lines changed: 24 additions & 16 deletions b/‎gibberish/README.md‎
diff --git a/‎gibberish/discussion.md‎
Lines changed: 52 additions & 8 deletions b/‎gibberish/discussion.md‎
diff --git a/‎gibberish/solution.py‎
Lines changed: 16 additions & 18 deletions b/‎gibberish/solution.py‎
@@ -11,3 +11,5 @@ return parser.parse_args()
 But here we capture the arguments inside `get_args` and add a bit of validation. If `args.num_bottles` is less than one, we call `parser.error` with the message we want to tell the user. We don't have to tell the program to stop executing as `argparse` will exit immediately. Even better is that it will indicate a non-zero exit value to the operating system to indicate there was some sort of error. If you ever start writing command-line programs that chain together to make workflows, this is a way for one program to indicate failure and halt the entire process until the error has been fixed!
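
The validation described above can be sketched as follows. This is a minimal illustration, not the actual solution: the `-n|--num_bottles` flag spelling, the default of `10`, the error message, and the `argv` parameter (added only so the function can be called from a test) are all assumptions. Only `args.num_bottles` and the call to `parser.error` come from the text.

```python
import argparse


def get_args(argv=None):
    """Parse arguments; reject a non-positive number of bottles."""
    parser = argparse.ArgumentParser(description='Bottles of beer song')
    parser.add_argument('-n', '--num_bottles', type=int, default=10,
                        help='How many bottles')
    args = parser.parse_args(argv)

    # parser.error prints the usage and this message to STDERR, then
    # exits with status 2, so nothing after it runs on bad input.
    if args.num_bottles < 1:
        parser.error('--num_bottles "{}" must be > 0'.format(
            args.num_bottles))

    return args
```

Because `parser.error` exits on its own, the caller never has to check a return value before using `args`.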
 
 Once you get to the line `args = get_args()` in `main`, a great deal of hard work has already occurred to get and validate the input from the user. From here, I decided to create a template for the song putting `{}` in the spots that change from verse to verse. Then I use the `reversed(range(...))` bit we discussed before to count down, with a `for` loop, using the current number `bottle` and `next_bottle` to print out the verse noting the presence or absence of the `s` where appropriate.
+
+I'd like to stress that there are literally hundreds of ways to solve this problem. The website http://www.99-bottles-of-beer.net/ claims to have 1500 variations in various languages, 15 in Python alone. As always, the solution you wrote and understand and that passes the test suite is the "right" solution.
@@ -0,0 +1 @@
+.log
@@ -11,7 +11,7 @@ m = a
 
 That is, given this training set, if you started with `l` you could only choose an `a`, but if you have `a` then you could choose `l`, `b`, or `m`.
 
-The program should generate `-n|--num_words` words (default `10`), each a random size between `k` + 2 and a `-m|--max_word` size (default `12`). Be sure to accept `-s|--seed` to pass to `random.seed`. My solution also takes a `-d|--debug` flag that will emit debug messages to `.log` for you to inspect.
+The program should generate `-n|--num_words` words (default `10`), each a random size between `k+2` and a `-m|--max_word` size (default `12`). Be sure to accept `-s|--seed` to pass to `random.seed`. My solution also takes a `-d|--debug` flag that will emit debug messages to `.log` for you to inspect.
 
 If provided no arguments or the `-h|--help` flag, generate a usage:
 
@@ -50,7 +50,7 @@ $ ./gibberish.py /usr/share/dict/words -s 1 -n 5
   5: woco
 ````
 
-Create different words by training on the US Constitution:
+Or train on the US Constitution:
 
 ````
 $ ./gibberish.py ../inputs/const.txt -s 2 -k 3 -n 4
@@ -78,22 +78,30 @@ Chose the best words and create definitions for them:
 * umjamp: skateboarding trick
 * callots: insignia of officers in Greek army
 * urchenev: fungal growth found under cobblestones
-
+ 
 ## Kmers
 
-To create the Markov chains, you'll need to read all the words from each file. Use `str.lower` to lowercase all the text and then remove any character that is not in the regular English alphabet (a-z). You'll need to extract "k-mers" or "n-grams" from each word. In the text "abcd," if `k=2` then the 2-mers are "ab," "bc," and "cd." If `k=3`, then the 3-mers are "abc" and "bcd." It may be helpful to know the number `n` of kmers `k` is proportional to the length `l` of the string `n = l - k + 1`. 
-
-Consider writing a function `kmers(text, k=1)` that only extracts kmers from some text, and then add this function to your program:
+To create the Markov chains, first you'll need to read all the words from each file. Use `str.lower` to lowercase all the text and then remove any characters that are not in the regular English alphabet (a-z). A regular expression is handy for that:
 
 ````
-def test_kmers():
-    """Test kmers"""
+>>> import re
+>>> re.sub('[^a-z]', '', 'H48,`b09e3!"')
+'be'
+````
 
-    assert kmers('abcd') == list('abcd')
-    assert kmers('abcd', 2) == ['ab', 'bc', 'cd']
-    assert kmers('abcd', 3) == ['abc', 'bcd']
-    assert kmers('abcd', 4) == ['abcd']
-    assert kmers('abcd', 5) == []
+You'll need to extract "k-mers" or "n-grams" from each word. In the text "abcd," if `k=2` then the 2-mers are "ab," "bc," and "cd." If `k=3`, then the 3-mers are "abc" and "bcd." It may be helpful to know that a string of length `l` contains `n = l - k + 1` kmers of size `k`.
+
+Consider writing a function `get_kmers(text, k=1)` that only extracts kmers from some text, and then add this function to your program:
+
+````
+def test_get_kmers():
+    """Test get_kmers"""
+
+    assert get_kmers('abcd') == list('abcd')
+    assert get_kmers('abcd', 2) == ['ab', 'bc', 'cd']
+    assert get_kmers('abcd', 3) == ['abc', 'bcd']
+    assert get_kmers('abcd', 4) == ['abcd']
+    assert get_kmers('abcd', 5) == []
 ````
 
 Run your program with `pytest -v gibberish.py` and see if it passes.
@@ -115,7 +123,7 @@ To create the Markov chains, you'll need to get all the kmers for `k+1` for all
  'ump': ['s']}
 ````
 
-For every 3-mer, we need to know all the characters that follow each. Obviously this is not very exciting given the small size of the input text.
+For every 3-mer, we need to know all the characters that follow each. Obviously this is not very exciting given the small size of the input text. If `k=2`, then you will see that `th` has two options, `e` and `e`. It's important to note how you will represent the choices for a given kmer. Will you use a `list`, a `set`, or a `collections.Counter`? Consider the implications. A `set` is smaller as it will represent only the *unique* letters but you will lose information about the *frequency* of letters. A `Counter` would store letters and counts, but how will you sample from that in a way that takes into account frequency? A `list` is probably the easiest structure.
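
One possible answer to the `Counter` question, as a sketch only: `random.choices` accepts a `weights` argument, so the counts can be used as weights. The letters and counts below are made up for illustration, and `sample_letter` is a hypothetical helper name.

```python
import random
from collections import Counter

# Hypothetical counts of the letters seen after a single kmer.
following = Counter({'e': 2, 'a': 1})


def sample_letter(counts):
    """Pick one letter with probability proportional to its count."""
    letters = list(counts.keys())
    weights = list(counts.values())
    return random.choices(letters, weights=weights, k=1)[0]


# A list that keeps the repeats carries the same information, which is
# why plain random.choice on a list is already frequency-weighted.
as_list = list(following.elements())
```

Either structure samples `e` about twice as often as `a`; the `list` simply trades memory for a shorter call.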
 
 Consider writing a function `read_training(fhs, k=1)` that reads the input training files and returns a dictionary of kmer chains. Then add this function to test that is works properly:
 
@@ -143,9 +151,9 @@ def test_read_training():
 
 ## Making new words
 
-Once you have the chains of letters that follow each kmer, you need can use `random.choice` to find a starting kmer from the `keys` of your chain dictionary. Also use that function to select a length for your new word from the range of `k + 2` to the `args.max_word` (which defaults to `12`). Build up your new word by again using `random.choice` to select from the possibilities for the kmer which will change through each iteration.
+Once you have the chains of letters that follow each kmer, you can use `random.choice` to find a starting kmer from the `keys` of your chain dictionary. Also use that function to select a length for your new word from the range of `k+2` to `args.max_word` (which defaults to `12`). Build up your new word by again using `random.choice` to select from the possibilities for the kmer, which will change through each iteration.
 
-That is, if you `k=3` and you start with the randomly selected kmer `ero`, you might get `n` as your next letter. On the next iteration of the loop, the `kmer` will be `ron` and you will look to see what letters follow that 3-mer. You might get `d`, and so the next time you would look for those letters following `ond`, and so forth. Continue until you've built a word that is the length you selected.
+That is, if `k=3` and you start with the randomly selected kmer `ero`, you might get `n` as your next letter. On the next iteration of the loop, the `kmer` will be `ron` and you will look to see what letters follow that 3-mer. You might get `d`, and so the next time you would look for those letters following `ond`, and so forth. Continue until you've built a word that is the length you selected.
 
 Hints: 
 
 
@@ -1,4 +1,4 @@
-As recommended in the description, I define my arguments in `get_args` to rely on `argparse` to validate as much as possible, e.g. verify that I get `int` values or readabnle files, and provide reasonable defaults for everything but the required `file` argument. I additionally define a `-d|--debug` flag that is only `True` when present so that I can add this bit of code:
+As recommended in the description, I define my arguments in `get_args` to rely on `argparse` to validate as much as possible, e.g., to verify that I get `int` values and readable files, and to provide reasonable defaults for everything but the required `file` argument. I additionally define a `-d|--debug` flag that is only `True` when present so that I can add this bit of code:
 
 ````
 logging.basicConfig(
@@ -11,10 +11,10 @@ This is a simple and effective way to turn debugging messages on and off. I usua
 
 ## Finding kmers in text
 
-If you followed my advice about breaking down the problem, then you probably created a `kmers` function:
+If you followed my advice about breaking down the problem, then you probably created a `get_kmers` function with the formula for the number of kmers in a given text (`n = l - k + 1`):
 
 ````
->>> def kmers(text, k=1):
+>>> def get_kmers(text, k=1):
 ...     return [text[i:i + k] for i in range(len(text) - k + 1)]
 ...
 ````
@@ -24,15 +24,15 @@ Using the formula given in the intro for the number of kmers in a string, I use
 I can verify it works in the REPL:
 
 ````
->>> kmers('abcd', 2)
+>>> get_kmers('abcd', 2)
 ['ab', 'bc', 'cd']
->>> kmers('abcd', 3)
+>>> get_kmers('abcd', 3)
 ['abc', 'bcd']
 ````
 
 But more importantly, I can write a `test_kmers` function that I embed in my code and run with `pytest`!
 
-## Reading the input files
+## Reading the training files
 
 Since I used the `argparse.FileType` to define the `file` with `nargs='+'`, I have a `list` of *open file handles* that can be read. I defined a `read_training` function that iterates over all the words in each file by calling `fh.read().split()`. As this breaks the text on spaces, various bits of punctuation may still be attached:
 
@@ -67,12 +67,14 @@ I can now get all the kmers for each word by using my `kmers` function. I put al
 ...     clean = lambda word: re.sub('[^a-z]', '', word.lower())
 ...     for fh in fhs:
 ...         for word in map(clean, fh.read().split()):
-...             for kmer in kmers(word, k + 1):
+...             for kmer in get_kmers(word, k + 1):
 ...                 chains[kmer[:-1]].append(kmer[-1])
 ...     return chains
 ...
 ````
 
+Note the handling of the kmers. I actually request `(k+1)`-mers and then slice `kmer[:-1]` to get the actual `k`-mer (everything except the last letter) and then `append` `kmer[-1]` (the last letter) to the `chains` for that `k`-mer.
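
To make the slicing concrete, here is a small worked sketch on the made-up word `'abcab'` with `k = 2`. It reuses the `get_kmers` definition shown earlier; the input word and the `prefix` and `last` names are chosen only for illustration.

```python
from collections import defaultdict


def get_kmers(text, k=1):
    """Return all k-mers in text (there are len(text) - k + 1 of them)."""
    return [text[i:i + k] for i in range(len(text) - k + 1)]


k = 2
chains = defaultdict(list)
for kmer in get_kmers('abcab', k + 1):    # 'abc', 'bca', 'cab'
    prefix, last = kmer[:-1], kmer[-1]    # e.g. 'ab' and 'c'
    chains[prefix].append(last)
```

Each 3-mer contributes exactly one entry: its first two letters are the key and its third letter is the choice stored under that key.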
+
 I can verify it works:
 
 ````
@@ -87,4 +89,46 @@ defaultdict(<class 'list'>,
              'suall': ['y']})
 ````
 
-But, again, *more importantly is that I can write a test that verifies it works*! If you copy in the `test_read_training` function, you have the knowledge that you are creating valid chains.
+But, again, *more important is that I can write a test that verifies it works*! If you copy in the `test_read_training` function, you have the assurance that you are creating valid chains.
+
+## Making new words
+
+Once I have the chains from all the input files, I need to use a `for` loop for the `range(args.num_words)`. Each time through the loop, I need to choose a starting kmer for a new word and a length:
+
+````
+>>> k = 3
+>>> max_word = 12
+>>> chains = read_training([open('../inputs/spiders.txt')], k)
+>>> kmers = list(chains.keys())
+>>> num_words = 3
+>>> for i in range(num_words):
+...     word = random.choice(kmers)
+...     length = random.choice(range(k + 2, max_word))
+...     print('Length "{}" starting with "{}"'.format(length, word))
+...
+Length "9" starting with "pid"
+Length "7" starting with "cas"
+Length "8" starting with "orr"
+````
+
+OK, that's our starting point. Given a starting kmer like `'pid'`, we need to create a `while` loop that will continue as long as the `len(word)` is less than the `length` we chose for the word. Each time through the loop, I'll set the current `kmer` to the last `k` letters of the `word`. I use `random.choice` to select from `chains[kmer]` to find the next `char` (character) and append that to the `word`:
+
+````
+>>> while len(word) < length:
+...     kmer = word[-1 * k:]
+...     if not chains[kmer]: break
+...     char = random.choice(list(chains[kmer]))
+...     word += char
+...
+>>> print(word)
+piders
+````
+
+It can happen sometimes that there are no options for a given `kmer`. That is, `chains[kmer]` is an empty list, so in my code I add a check to `break` out of the `while` loop if this evaluates to `False`.
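
The check works because `chains` is a `defaultdict(list)`, as in `read_training`. A short sketch of that behavior follows; the keys are made up, and the `.get` variant is shown only as an alternative, not as something the solution uses.

```python
from collections import defaultdict

chains = defaultdict(list)
chains['pid'].append('e')

# Indexing a missing key returns a new empty list, which is falsy,
# and also stores that empty list under the key as a side effect.
assert not chains['xyz']
assert 'xyz' in chains

# .get returns None for a missing key without inserting anything.
assert chains.get('qqq') is None
assert 'qqq' not in chains
```

With a plain `dict`, `chains[kmer]` would raise a `KeyError` instead, so the `if not chains[kmer]` test depends on this choice of container.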
+
+Finally I `print` out the number of the word and the word itself using a format string to align the numbers and text:
+
+````
+>>> print('{:3}: {}'.format(i+1, word))
+  3: piders
+````
@@ -4,10 +4,8 @@
 import argparse
 import io
 import logging
-import os
 import random
 import re
-import sys
 from collections import defaultdict
 
 
@@ -62,33 +60,33 @@ def get_args():
 
 
 # --------------------------------------------------
-def kmers(text, k=1):
-    return [text[i:i + k] for i in range(len(text) - k + 1)]
+def get_kmers(text, k=1):
     """Return k-mers from text"""
 
+    return [text[i:i + k] for i in range(len(text) - k + 1)]
 
 
 # --------------------------------------------------
-def test_kmers():
-    """Test kmers"""
+def test_get_kmers():
+    """Test get_kmers"""
 
-    assert kmers('abcd') == list('abcd')
-    assert kmers('abcd', 2) == ['ab', 'bc', 'cd']
-    assert kmers('abcd', 3) == ['abc', 'bcd']
-    assert kmers('abcd', 4) == ['abcd']
-    assert kmers('abcd', 5) == []
+    assert get_kmers('abcd') == list('abcd')
+    assert get_kmers('abcd', 2) == ['ab', 'bc', 'cd']
+    assert get_kmers('abcd', 3) == ['abc', 'bcd']
+    assert get_kmers('abcd', 4) == ['abcd']
+    assert get_kmers('abcd', 5) == []
 
 
 # --------------------------------------------------
 def read_training(fhs, k=1):
     """Read training files, return chains"""
 
     chains = defaultdict(list)
-    clean = lambda word: re.sub('[^a-z]', '', word.lower())
+    clean = lambda w: re.sub('[^a-z]', '', w.lower())
 
     for fh in fhs:
         for word in map(clean, fh.read().split()):
-            for kmer in kmers(word, k + 1):
+            for kmer in get_kmers(word, k + 1):
                 chains[kmer[:-1]].append(kmer[-1])
 
     return chains
@@ -129,26 +127,26 @@ def main():
         filemode='w',
         level=logging.DEBUG if args.debug else logging.CRITICAL)
 
-    chains = read_training(args.file, args.kmer_size)
+    chains = read_training(args.file, k)
     logging.debug(chains)
 
     kmers = list(chains.keys())
     for i in range(args.num_words):
         word = random.choice(kmers)
         length = random.choice(range(k + 2, args.max_word))
-        logging.debug('Length "{}" starting with "{}"'.format(length, word))
+        logging.debug('Length "%s" starting with "%s"', length, word)
 
         while len(word) < length:
             kmer = word[-1 * k:]
             if not chains[kmer]:
                 break
 
             char = random.choice(list(chains[kmer]))
-            logging.debug('char = "{}"'.format(char))
+            logging.debug('char = "%s"', char)
             word += char
 
-        logging.debug('word = "{}"'.format(word))
-        print('{:3}: {}'.format(i+1, word))
+        logging.debug('word = "%s"', word)
+        print('{:3}: {}'.format(i + 1, word))
 
 
 # --------------------------------------------------