
Commit 7c4a422

gibberish
1 parent e579998 commit 7c4a422

5 files changed

Lines changed: 298 additions & 87 deletions


gibberish/Makefile

Lines changed: 2 additions & 5 deletions
@@ -1,7 +1,4 @@
-.PHONY: pdf test
-
-pdf:
-	pandoc README.md -o README.pdf
+.PHONY: test

 test:
-	pytest -v test.py
+	pytest -xv gibberish.py test.py

gibberish/README.md

Lines changed: 116 additions & 41 deletions
@@ -1,6 +1,6 @@
 # Gibberish Generator

-Write a Python program called `gibberish.py` that uses the Markov chain algorithm to generate new words from a set of training files. The program should take one or more positional arguments which are files that you read, word-by-word, and note the options of letters after a given `-k|--kmer_size` (default `2`) grouping of letters. E.g., in the word "alabama" with `k=1`, the frequency table will look like:
+Write a Python program called `gibberish.py` that uses the Markov chain algorithm to generate new words from the words in a set of training files. The program should take one or more positional arguments which are files that you read, word-by-word, noting which letters can follow a given `-k|--kmer_size` (default `2`) grouping of letters. E.g., in the word "alabama" with `k=1`, the frequency table will look like:

 ````
 a = l, b, m
@@ -13,13 +13,8 @@ That is, given this training set, if you started with `l` you could only choose

 The program should generate `-n|--num_words` words (default `10`), each a random size between `k` + 2 and a `-m|--max_word` size (default `12`). Be sure to accept `-s|--seed` to pass to `random.seed`. My solution also takes a `-d|--debug` flag that will emit debug messages to `.log` for you to inspect.

-Chose the best words and create definitions for them:
-
-* yulcogicism: the study of Christmas gnostics
-* umjamp: skateboarding trick
-* callots: insignia of officers in Greek army
-* urchenev: fungal growth found under cobblestones
-
+If provided no arguments or the `-h|--help` flag, generate a usage:
+
 ````
 $ ./gibberish.py
 usage: gibberish.py [-h] [-n int] [-k int] [-m int] [-s int] [-d] FILE [FILE ...]
@@ -42,37 +37,117 @@ optional arguments:
                         Max word length (default: 12)
   -s int, --seed int    Random seed (default: None)
   -d, --debug           Debug to ".log" (default: False)
-$ ./gibberish.py /usr/share/dict/words -s 1
-1: oveli
-2: uming
-3: uylatiteda
-4: owsh
-5: uuse
-6: ismandl
-7: efortai
-8: eyhopy
-9: auretrab
-10: ozogralach
-$ ./gibberish.py ../inputs/const.txt -s 2 -k 3
-1: romot
-2: leasonsusp
-3: gdoned
-4: bunablished
-5: neithere
-6: achmen
-7: reason
-8: nmentyone
-9: effereof
-10: eipts
-$ ./gibberish.py -k 2 ../inputs/1945-boys.txt
-1: baronaler
-2: lip
-3: oselli
-4: ard
-5: vicharley
-6: melli
-7: denry
-8: jerictomank
-9: rick
-10: larvichaell
 ````
+
+Create new English words by training on a dictionary:
+
+````
+$ ./gibberish.py /usr/share/dict/words -s 1 -n 5
+1: salva
+2: xeroolizati
+3: upst
+4: azeconi
+5: woco
+````
+
+Create different words by training on the US Constitution:
+
+````
+$ ./gibberish.py ../inputs/const.txt -s 2 -k 3 -n 4
+1: lfare
+2: sachmentit
+3: such
+4: rcessadopti
+````
+
+Generate new names for boys:
+
+````
+$ ./gibberish.py -k 2 -n 6 ../inputs/1945-boys.txt
+1: marthomart
+2: danie
+3: muel
+4: osep
+5: tomandrenny
+6: alberber
+````
+
+Choose the best words and create definitions for them:
+
+* yulcogicism: the study of Christmas gnostics
+* umjamp: skateboarding trick
+* callots: insignia of officers in Greek army
+* urchenev: fungal growth found under cobblestones
+
+## Kmers
+
+To create the Markov chains, you'll need to read all the words from each file. Use `str.lower` to lowercase all the text and then remove any character that is not in the regular English alphabet (a-z). You'll need to extract "k-mers" or "n-grams" from each word. In the text "abcd," if `k=2` then the 2-mers are "ab," "bc," and "cd." If `k=3`, then the 3-mers are "abc" and "bcd." It may be helpful to know that the number `n` of kmers in a string of length `l` is `n = l - k + 1`.
+
+Consider writing a function `kmers(text, k=1)` that only extracts kmers from some text, and then add this test function to your program:
+
+````
+def test_kmers():
+    """Test kmers"""
+
+    assert kmers('abcd') == list('abcd')
+    assert kmers('abcd', 2) == ['ab', 'bc', 'cd']
+    assert kmers('abcd', 3) == ['abc', 'bcd']
+    assert kmers('abcd', 4) == ['abcd']
+    assert kmers('abcd', 5) == []
+````
+
+Run your program with `pytest -v gibberish.py` and see if it passes.
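A minimal sketch of a `kmers` function that would satisfy the tests above (an illustration added for this write-up, not part of the commit; the same approach appears in `discussion.md` below):

````
def kmers(text, k=1):
    """Return every k-length substring of text, left to right."""
    # A string of length l has l - k + 1 starting positions; when
    # k > len(text) the range is empty and so is the result.
    return [text[i:i + k] for i in range(len(text) - k + 1)]
````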
+
+## Chains
+
+To create the Markov chains, you'll need to get all the (k+1)-mers for all the words in all the texts. That is, if `k=3` you need to find all the 4-mers so that you can find the character *after* each 3-mer in the texts. For example, in the text "The quick brown fox jumps over the lazy dog.", we need to create a data structure that looks like this:
+
+````
+>>> from pprint import pprint as pp
+>>> pp(chains)
+{'bro': ['w'],
+ 'jum': ['p'],
+ 'laz': ['y'],
+ 'ove': ['r'],
+ 'qui': ['c'],
+ 'row': ['n'],
+ 'uic': ['k'],
+ 'ump': ['s']}
+````
+
+For every 3-mer, we need to know all the characters that follow it. Obviously this is not very exciting given the small size of the input text.
+
+Consider writing a function `read_training(fhs, k=1)` that reads the input training files and returns a dictionary of kmer chains. Then add this test function to verify that it works properly:
+
+````
+def test_read_training():
+    """Test read_training"""
+
+    text = 'The quick brown fox jumps over the lazy dog.'
+
+    expected3 = {
+        'qui': ['c'],
+        'uic': ['k'],
+        'bro': ['w'],
+        'row': ['n'],
+        'jum': ['p'],
+        'ump': ['s'],
+        'ove': ['r'],
+        'laz': ['y']
+    }
+    assert read_training([io.StringIO(text)], k=3) == expected3
+
+    expected4 = {'quic': ['k'], 'brow': ['n'], 'jump': ['s']}
+    assert read_training([io.StringIO(text)], k=4) == expected4
+````
+
+## Making new words
+
+Once you have the chains of letters that follow each kmer, you can use `random.choice` to find a starting kmer from the `keys` of your chain dictionary. Also use that function to select a length for your new word from the range of `k + 2` to the `args.max_word` (which defaults to `12`). Build up your new word by again using `random.choice` to select from the possibilities for the kmer, which will change through each iteration.
+
+That is, if `k=3` and you start with the randomly selected kmer `ero`, you might get `n` as your next letter. On the next iteration of the loop, the `kmer` will be `ron` and you will look to see what letters follow that 3-mer. You might get `d`, and so the next time you would look for those letters following `ond`, and so forth. Continue until you've built a word that is the length you selected.
+
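A rough sketch of that loop (an illustration only, assuming a `chains` dictionary like the one shown in the Chains section; the helper name `make_word` is made up, not taken from the solution):

````
import random

def make_word(chains, k=2, max_word=12):
    """Sketch: walk the kmer chains to build one gibberish word."""
    word = random.choice(list(chains.keys()))            # random starting kmer
    length = random.choice(range(k + 2, max_word + 1))   # random target length
    while len(word) < length:
        kmer = word[-k:]                                 # last k letters chosen so far
        if kmer not in chains:                           # dead end: no observed followers
            break
        word += random.choice(chains[kmer])              # pick one observed next letter
    return word
````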
+Hints:
+
+* Define the input files with `type=argparse.FileType('r')` so that `argparse` will validate that the user provides readable files and will then `open` them for you (see the sketch after this list).
+* Consider using the `logging` module to print out debugging messages. Run the `solution.py` with the `-d` flag and then inspect the `.log` file.
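A hedged sketch of how the first hint might look in `get_args` (the option names, metavars, and defaults are copied from the usage statement above; the exact help strings are guesses, not the commit's code):

````
import argparse

def get_args():
    """Sketch: declare the arguments shown in the usage statement."""
    parser = argparse.ArgumentParser(description='Markov chain gibberish generator')
    parser.add_argument('file', metavar='FILE', nargs='+',
                        type=argparse.FileType('r'),
                        help='Training file(s)')
    parser.add_argument('-n', '--num_words', metavar='int', type=int, default=10,
                        help='Number of words to generate')
    parser.add_argument('-k', '--kmer_size', metavar='int', type=int, default=2,
                        help='Kmer size')
    parser.add_argument('-m', '--max_word', metavar='int', type=int, default=12,
                        help='Max word length')
    parser.add_argument('-s', '--seed', metavar='int', type=int, default=None,
                        help='Random seed')
    parser.add_argument('-d', '--debug', action='store_true',
                        help='Debug to ".log"')
    return parser.parse_args()
````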

gibberish/discussion.md

Lines changed: 90 additions & 0 deletions
@@ -0,0 +1,90 @@
+As recommended in the description, I define my arguments in `get_args` to rely on `argparse` to validate as much as possible, e.g., verify that I get `int` values or readable files, and provide reasonable defaults for everything but the required `file` argument. I additionally define a `-d|--debug` flag that is only `True` when present so that I can add this bit of code:
+
+````
+logging.basicConfig(
+    filename='.log',
+    filemode='w',
+    level=logging.DEBUG if args.debug else logging.CRITICAL)
+````
+
+This is a simple and effective way to turn debugging messages on and off. I usually write to a `.log` file, being sure to choose a name that starts with a `.` so that it will normally be hidden when I `ls` the directory. Since `filemode='w'`, the file will be overwritten on each run. I set the threshold to `logging.DEBUG` if the `debug` flag is `True`; otherwise the `logging` module will only emit messages at the `CRITICAL` level. As I don't have any "critical" messages, the `.log` file will be empty unless `--debug` is present. Then I have `logging.debug()` calls throughout my code which will only log messages when I ask. This is easier than putting `print` statements in your code which you have to remove or comment out when you are done debugging.
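For instance, a tiny self-contained sketch (the messages here are made up, not taken from the solution) showing that only messages at or above the configured level reach the file:

````
import logging

logging.basicConfig(filename='.log', filemode='w', level=logging.CRITICAL)
logging.debug('reading training files')      # suppressed at the CRITICAL level
logging.critical('this one is written to .log')
````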
+
+## Finding kmers in text
+
+If you followed my advice about breaking down the problem, then you probably created a `kmers` function:
+
+````
+>>> def kmers(text, k=1):
+...     return [text[i:i + k] for i in range(len(text) - k + 1)]
+...
+````
+
+Using the formula given in the intro for the number of kmers in a string, I use the `range` function to get the start position of each of those kmers and then get the slice of the `text` from that position to the position `k` away.
+
+I can verify it works in the REPL:
+
+````
+>>> kmers('abcd', 2)
+['ab', 'bc', 'cd']
+>>> kmers('abcd', 3)
+['abc', 'bcd']
+````
+
+But more importantly, I can write a `test_kmers` function that I embed in my code and run with `pytest`!
+
+## Reading the input files
+
+Since I used the `argparse.FileType` to define the `file` with `nargs='+'`, I have a `list` of *open file handles* that can be read. I defined a `read_training` function that iterates over all the words in each file by calling `fh.read().split()`. As this breaks the text on spaces, various bits of punctuation may still be attached:
+
+````
+>>> fh = open('../inputs/spiders.txt')
+>>> fh.read().split()
+['Don’t', 'worry,', 'spiders,', 'I', 'keep', 'house', 'casually.']
+````
+
+So I use a regular expression to remove anything that is *not* in the set of letters "a-z" by defining a negated character class `[^a-z]`. I create a one-line function to `lower` the word and `clean` it:
+
+````
+>>> import re
+>>> clean = lambda word: re.sub('[^a-z]', '', word.lower())
+>>> clean('"Hey!"')
+'hey'
+````
+
+Now I can get cleaned, lowercase text:
+
+````
+>>> fh = open('../inputs/spiders.txt')
+>>> list(map(clean, fh.read().split()))
+['dont', 'worry', 'spiders', 'i', 'keep', 'house', 'casually']
+````
+
+I can now get all the kmers for each word by using my `kmers` function. I put all this into a function called `read_training`. It takes a `list` of open file handles (which I get from `argparse`) and a `k` which defaults to `1`:
+
+````
+>>> def read_training(fhs, k=1):
+...     chains = defaultdict(list)
+...     clean = lambda word: re.sub('[^a-z]', '', word.lower())
+...     for fh in fhs:
+...         for word in map(clean, fh.read().split()):
+...             for kmer in kmers(word, k + 1):
+...                 chains[kmer[:-1]].append(kmer[-1])
+...     return chains
+...
+````
+
+I can verify it works:
+
+````
+>>> from collections import defaultdict
+>>> from pprint import pprint as pp
+>>> pp(read_training([open('../inputs/spiders.txt')], k=5))
+defaultdict(<class 'list'>,
+            {'asual': ['l'],
+             'casua': ['l'],
+             'pider': ['s'],
+             'spide': ['r'],
+             'suall': ['y']})
+````
+
+But, again, *more important is that I can write a test that verifies it works*! If you copy in the `test_read_training` function, you can be confident that you are creating valid chains.
