Skip to content

Commit 8d9e3ac

Browse files
committed
gibberish
1 parent 7c4a422 commit 8d9e3ac

File tree

10 files changed

+859
-236184
lines changed

10 files changed

+859
-236184
lines changed

book.md

Lines changed: 626 additions & 256 deletions
Large diffs are not rendered by default.

bottles_of_beer/discussion.md

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -11,3 +11,5 @@ return parser.parse_args()
1111
But here we capture the arguments inside `get_args` and add a bit of validation. If `args.num_bottles` is less than one, we call `parser.error` with the message we want to tell the user. We don't have to tell the program to stop executing as `argparse` will exit immediately. Even better is that it will indicate a non-zero exit value to the operating system to indicate there was some sort of error. If you ever start writing command-line programs that chain together to make workflows, this is a way for one program to indicate failure and halt the entire process until the error has been fixed!
1212

1313
Once you get to the line `args = get_args()` in `main`, a great deal of hard work has already occurred to get and validate the input from the user. From here, I decided to create a template for the song putting `{}` in the spots that change from verse to verse. Then I use the `reversed(range(...))` bit we discussed before to count down, with a `for` loop, using the current number `bottle` and `next_bottle` to print out the verse noting the presence or absence of the `s` where appropriate.
14+
15+
I'd like to stress that there are literally hundreds of ways to solve this problem. The website http://www.99-bottles-of-beer.net/ claims to have 1500 variations in various languages, 15 in Python alone. As always, the solution you wrote and understand and that passes the test suite is the "right" solution.

gibberish/.gitignore

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1 @@
1+
.log

gibberish/.log

Whitespace-only changes.

gibberish/README.md

Lines changed: 24 additions & 16 deletions
Original file line numberDiff line numberDiff line change
@@ -11,7 +11,7 @@ m = a
1111

1212
That is, given this training set, if you started with `l` you could only choose an `a`, but if you have `a` then you could choose `l`, `b`, or `m`.
1313

14-
The program should generate `-n|--num_words` words (default `10`), each a random size between `k` + 2 and a `-m|--max_word` size (default `12`). Be sure to accept `-s|--seed` to pass to `random.seed`. My solution also takes a `-d|--debug` flag that will emit debug messages to `.log` for you to inspect.
14+
The program should generate `-n|--num_words` words (default `10`), each a random size between `k+2` and a `-m|--max_word` size (default `12`). Be sure to accept `-s|--seed` to pass to `random.seed`. My solution also takes a `-d|--debug` flag that will emit debug messages to `.log` for you to inspect.
1515

1616
If provided no arguments or the `-h|--help` flag, generate a usage:
1717

@@ -50,7 +50,7 @@ $ ./gibberish.py /usr/share/dict/words -s 1 -n 5
5050
5: woco
5151
````
5252

53-
Create different words by training on the US Constitution:
53+
Or train on the US Constitution:
5454

5555
````
5656
$ ./gibberish.py ../inputs/const.txt -s 2 -k 3 -n 4
@@ -78,22 +78,30 @@ Chose the best words and create definitions for them:
7878
* umjamp: skateboarding trick
7979
* callots: insignia of officers in Greek army
8080
* urchenev: fungal growth found under cobblestones
81-
81+
8282
## Kmers
8383

84-
To create the Markov chains, you'll need to read all the words from each file. Use `str.lower` to lowercase all the text and then remove any character that is not in the regular English alphabet (a-z). You'll need to extract "k-mers" or "n-grams" from each word. In the text "abcd," if `k=2` then the 2-mers are "ab," "bc," and "cd." If `k=3`, then the 3-mers are "abc" and "bcd." It may be helpful to know the number `n` of kmers `k` is proportional to the length `l` of the string `n = l - k + 1`.
85-
86-
Consider writing a function `kmers(text, k=1)` that only extracts kmers from some text, and then add this function to your program:
84+
To create the Markov chains, first you'll need to read all the words from each file. Use `str.lower` to lowercase all the text and then remove any character that are not in the regular English alphabet (a-z). A regular expression is handy for that:
8785

8886
````
89-
def test_kmers():
90-
"""Test kmers"""
87+
>>> import re
88+
>>> re.sub('[^a-z]', '', 'H48,`b09e3!"')
89+
'be'
90+
````
9191

92-
assert kmers('abcd') == list('abcd')
93-
assert kmers('abcd', 2) == ['ab', 'bc', 'cd']
94-
assert kmers('abcd', 3) == ['abc', 'bcd']
95-
assert kmers('abcd', 4) == ['abcd']
96-
assert kmers('abcd', 5) == []
92+
You'll need to extract "k-mers" or "n-grams" from each word. In the text "abcd," if `k=2` then the 2-mers are "ab," "bc," and "cd." If `k=3`, then the 3-mers are "abc" and "bcd." It may be helpful to know the number `n` of kmers `k` is proportional to the length `l` of the string `n = l - k + 1`.
93+
94+
Consider writing a function `get_kmers(text, k=1)` that only extracts kmers from some text, and then add this function to your program:
95+
96+
````
97+
def test_get_kmers():
98+
"""Test get_kmers"""
99+
100+
assert get_kmers('abcd') == list('abcd')
101+
assert get_kmers('abcd', 2) == ['ab', 'bc', 'cd']
102+
assert get_kmers('abcd', 3) == ['abc', 'bcd']
103+
assert get_kmers('abcd', 4) == ['abcd']
104+
assert get_kmers('abcd', 5) == []
97105
````
98106

99107
Run your program with `pytest -v gibberish.py` and see if it passes.
@@ -115,7 +123,7 @@ To create the Markov chains, you'll need to get all the kmers for `k+1` for all
115123
'ump': ['s']}
116124
````
117125

118-
For every 3-mer, we need to know all the characters that follow each. Obviously this is not very exciting given the small size of the input text.
126+
For every 3-mer, we need to know all the characters that follow each. Obviously this is not very exciting given the small size of the input text. If `k=2`, then you will see that `th` has two options, `e` and `e`. It's important to note how you will represent the choices for a given kmer. Will you use a `list`, a `set`, or a `collections.Counter`? Consider the implications. A `set` is smaller as it will represent only the *unique* letters but you will lose information about the *frequency* of letters. A `Counter` would store letters and counts, but how will you sample from that in a way that takes into account frequency? A `list` is probably the easiest structure.
119127

120128
Consider writing a function `read_training(fhs, k=1)` that reads the input training files and returns a dictionary of kmer chains. Then add this function to test that is works properly:
121129

@@ -143,9 +151,9 @@ def test_read_training():
143151

144152
## Making new words
145153

146-
Once you have the chains of letters that follow each kmer, you need can use `random.choice` to find a starting kmer from the `keys` of your chain dictionary. Also use that function to select a length for your new word from the range of `k + 2` to the `args.max_word` (which defaults to `12`). Build up your new word by again using `random.choice` to select from the possibilities for the kmer which will change through each iteration.
154+
Once you have the chains of letters that follow each kmer, you need can use `random.choice` to find a starting kmer from the `keys` of your chain dictionary. Also use that function to select a length for your new word from the range of `k+2` to the `args.max_word` (which defaults to `12`). Build up your new word by again using `random.choice` to select from the possibilities for the kmer which will change through each iteration.
147155

148-
That is, if you `k=3` and you start with the randomly selected kmer `ero`, you might get `n` as your next letter. On the next iteration of the loop, the `kmer` will be `ron` and you will look to see what letters follow that 3-mer. You might get `d`, and so the next time you would look for those letters following `ond`, and so forth. Continue until you've built a word that is the length you selected.
156+
That is, if `k=3` and you start with the randomly selected kmer `ero`, you might get `n` as your next letter. On the next iteration of the loop, the `kmer` will be `ron` and you will look to see what letters follow that 3-mer. You might get `d`, and so the next time you would look for those letters following `ond`, and so forth. Continue until you've built a word that is the length you selected.
149157

150158
Hints:
151159

gibberish/discussion.md

Lines changed: 52 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -1,4 +1,4 @@
1-
As recommended in the description, I define my arguments in `get_args` to rely on `argparse` to validate as much as possible, e.g. verify that I get `int` values or readabnle files, and provide reasonable defaults for everything but the required `file` argument. I additionally define a `-d|--debug` flag that is only `True` when present so that I can add this bit of code:
1+
As recommended in the description, I define my arguments in `get_args` to rely on `argparse` to validate as much as possible, e.g. verify that I get `int` values and readable files as well as provide reasonable defaults for everything but the required `file` argument. I additionally define a `-d|--debug` flag that is only `True` when present so that I can add this bit of code:
22

33
````
44
logging.basicConfig(
@@ -11,10 +11,10 @@ This is a simple and effective way to turn debugging messages on and off. I usua
1111

1212
## Finding kmers in text
1313

14-
If you followed my advice about breaking down the problem, then you probably created a `kmers` function:
14+
If you followed my advice about breaking down the problem, then you probably created a `kmers` function with the formula for the number of kmers in a given test (`n = l - k + 1`):
1515

1616
````
17-
>>> def kmers(text, k=1):
17+
>>> def get_kmers(text, k=1):
1818
... return [text[i:i + k] for i in range(len(text) - k + 1)]
1919
...
2020
````
@@ -24,15 +24,15 @@ Using the formula given in the intro for the number of kmers in a string, I use
2424
I can verify it works in the REPL:
2525

2626
````
27-
>>> kmers('abcd', 2)
27+
>>> get_kmers('abcd', 2)
2828
['ab', 'bc', 'cd']
29-
>>> kmers('abcd', 3)
29+
>>> get_kmers('abcd', 3)
3030
['abc', 'bcd']
3131
````
3232

3333
But more importantly, I can write a `test_kmers` function that I embed in my code and run with `pytest`!
3434

35-
## Reading the input files
35+
## Reading the training files
3636

3737
Since I used the `argparse.FileType` to define the `file` with `nargs='+'`, I have a `list` of *open file handles* that can be read. I defined a `read_training` function that iterates over all the words in each file by calling `fh.read().split()`. As this breaks the text on spaces, various bits of punctuation may still be attached:
3838

@@ -67,12 +67,14 @@ I can now get all the kmers for each word by using my `kmers` function. I put al
6767
... clean = lambda word: re.sub('[^a-z]', '', word.lower())
6868
... for fh in fhs:
6969
... for word in map(clean, fh.read().split()):
70-
... for kmer in kmers(word, k + 1):
70+
... for kmer in get_kmers(word, k + 1):
7171
... chains[kmer[:-1]].append(kmer[-1])
7272
... return chains
7373
...
7474
````
7575

76+
Note the handling of the kmers. I actually request `k+1`-mers and then slice `kmer[:-1]` to get the actual `k`-mer (everything up to the penultimate letter) and then `append` `kmer[-1]` (the last letter) to the `chains` for that `k`-mer.
77+
7678
I can verify it works:
7779

7880
````
@@ -87,4 +89,46 @@ defaultdict(<class 'list'>,
8789
'suall': ['y']})
8890
````
8991

90-
But, again, *more importantly is that I can write a test that verifies it works*! If you copy in the `test_read_training` function, you have the knowledge that you are creating valid chains.
92+
But, again, *more importantly is that I can write a test that verifies it works*! If you copy in the `test_read_training` function, you have the assurange that you are creating valid chains.
93+
94+
## Making new words
95+
96+
Once I have the chains from all the input files, I need to use a `for` loop for the `range(args.num_words)`. Each time through the loop, I need to choose a starting kmer for a new word and a length
97+
98+
````
99+
>>> k = 3
100+
>>> max_word = 12
101+
>>> chains = read_training([open('../inputs/spiders.txt')], k)
102+
>>> kmers = list(chains.keys())
103+
>>> num_words = 3
104+
>>> for i in range(num_words):
105+
... word = random.choice(kmers)
106+
... length = random.choice(range(k + 2, max_word))
107+
... print('Length "{}" starting with "{}"'.format(length, word))
108+
...
109+
Length "9" starting with "pid"
110+
Length "7" starting with "cas"
111+
Length "8" starting with "orr"
112+
````
113+
114+
OK, that's our starting point. Given a starting kmer like `'pid'`, we need to create a `while` loop that will continue as long as the `len(word)` is less than the `length` we chose for the word. Each time through the loop, I'll set the current `kmer` to the last `k` letters of the `word`. I use `random.choice` to select from `chains[kmer]` to find the next `char` (character) and append that to the `word`:
115+
116+
````
117+
>>> while len(word) < length:
118+
... kmer = word[-1 * k:]
119+
... if not chains[kmer]: break
120+
... char = random.choice(list(chains[kmer]))
121+
... word += char
122+
...
123+
>>> print(word)
124+
piders
125+
````
126+
127+
It can happen sometimes that there are no options for a given `kmer`. That is, `chains[kmer]` is an empty list, so I in my code I add a check to `break` out of the `while` loop if this evaluates to `False`.
128+
129+
Finally I `print` out the number of the word and the word itself using a format string to align the numbers and text:
130+
131+
````
132+
>>> print('{:3}: {}'.format(i+1, word))
133+
3: piders
134+
````

gibberish/solution.py

Lines changed: 16 additions & 18 deletions
Original file line numberDiff line numberDiff line change
@@ -4,10 +4,8 @@
44
import argparse
55
import io
66
import logging
7-
import os
87
import random
98
import re
10-
import sys
119
from collections import defaultdict
1210

1311

@@ -62,33 +60,33 @@ def get_args():
6260

6361

6462
# --------------------------------------------------
65-
def kmers(text, k=1):
66-
return [text[i:i + k] for i in range(len(text) - k + 1)]
63+
def get_kmers(text, k=1):
6764
"""Return k-mers from text"""
6865

66+
return [text[i:i + k] for i in range(len(text) - k + 1)]
6967

7068

7169
# --------------------------------------------------
72-
def test_kmers():
73-
"""Test kmers"""
70+
def test_get_kmers():
71+
"""Test get_kmers"""
7472

75-
assert kmers('abcd') == list('abcd')
76-
assert kmers('abcd', 2) == ['ab', 'bc', 'cd']
77-
assert kmers('abcd', 3) == ['abc', 'bcd']
78-
assert kmers('abcd', 4) == ['abcd']
79-
assert kmers('abcd', 5) == []
73+
assert get_kmers('abcd') == list('abcd')
74+
assert get_kmers('abcd', 2) == ['ab', 'bc', 'cd']
75+
assert get_kmers('abcd', 3) == ['abc', 'bcd']
76+
assert get_kmers('abcd', 4) == ['abcd']
77+
assert get_kmers('abcd', 5) == []
8078

8179

8280
# --------------------------------------------------
8381
def read_training(fhs, k=1):
8482
"""Read training files, return chains"""
8583

8684
chains = defaultdict(list)
87-
clean = lambda word: re.sub('[^a-z]', '', word.lower())
85+
clean = lambda w: re.sub('[^a-z]', '', w.lower())
8886

8987
for fh in fhs:
9088
for word in map(clean, fh.read().split()):
91-
for kmer in kmers(word, k + 1):
89+
for kmer in get_kmers(word, k + 1):
9290
chains[kmer[:-1]].append(kmer[-1])
9391

9492
return chains
@@ -129,26 +127,26 @@ def main():
129127
filemode='w',
130128
level=logging.DEBUG if args.debug else logging.CRITICAL)
131129

132-
chains = read_training(args.file, args.kmer_size)
130+
chains = read_training(args.file, k)
133131
logging.debug(chains)
134132

135133
kmers = list(chains.keys())
136134
for i in range(args.num_words):
137135
word = random.choice(kmers)
138136
length = random.choice(range(k + 2, args.max_word))
139-
logging.debug('Length "{}" starting with "{}"'.format(length, word))
137+
logging.debug('Length "%s" starting with "%s"', length, word)
140138

141139
while len(word) < length:
142140
kmer = word[-1 * k:]
143141
if not chains[kmer]:
144142
break
145143

146144
char = random.choice(list(chains[kmer]))
147-
logging.debug('char = "{}"'.format(char))
145+
logging.debug('char = "%s"', char)
148146
word += char
149147

150-
logging.debug('word = "{}"'.format(word))
151-
print('{:3}: {}'.format(i+1, word))
148+
logging.debug('word = "%s"', word)
149+
print('{:3}: {}'.format(i + 1, word))
152150

153151

154152
# --------------------------------------------------

0 commit comments

Comments
 (0)