Write a Python program called `gibberish.py` that uses the Markov chain algorithm to generate new words from the words in a set of training files. The program should take one or more positional arguments which are files that you read, word-by-word, noting which letters follow a given `-k|--kmer_size` (default `2`) grouping of letters. E.g., in the word "alabama" with `k=1`, the frequency table will look like:
````
a = l, b, m
l = a
b = a
m = a
````

That is, given this training set, if you started with `l` you could only choose an `a` to follow.
The program should generate `-n|--num_words` words (default `10`), each a random size between `k + 2` and a `-m|--max_word` size (default `12`). Be sure to accept `-s|--seed` to pass to `random.seed`. My solution also takes a `-d|--debug` flag that will emit debug messages to `.log` for you to inspect.
If provided no arguments or the `-h|--help` flag, generate a usage:
Choose the best words and create definitions for them:

* yulcogicism: the study of Christmas gnostics
* umjamp: skateboarding trick
* callots: insignia of officers in Greek army
* urchenev: fungal growth found under cobblestones
## Kmers
To create the Markov chains, you'll need to read all the words from each file. Use `str.lower` to lowercase all the text and then remove any character that is not in the regular English alphabet (a-z). You'll need to extract "k-mers" or "n-grams" from each word. In the text "abcd," if `k=2` then the 2-mers are "ab," "bc," and "cd." If `k=3`, then the 3-mers are "abc" and "bcd." It may be helpful to know that the number `n` of kmers in a string of length `l` is `n = l - k + 1`.
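The formula is easy to sanity-check with a bit of arithmetic on the string "abcd" (`l = 4`):

````
# Quick check of n = l - k + 1 using "abcd" (l = 4).
l = len('abcd')
assert l - 1 + 1 == 4  # four 1-mers: a, b, c, d
assert l - 2 + 1 == 3  # three 2-mers: ab, bc, cd
assert l - 3 + 1 == 2  # two 3-mers: abc, bcd
assert l - 4 + 1 == 1  # one 4-mer: abcd
````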
Consider writing a function `kmers(text, k=1)` that extracts the kmers from some text, and then add this test function to your program:
````
def test_kmers():
    """Test kmers"""

    assert kmers('abcd') == list('abcd')
    assert kmers('abcd', 2) == ['ab', 'bc', 'cd']
    assert kmers('abcd', 3) == ['abc', 'bcd']
    assert kmers('abcd', 4) == ['abcd']
    assert kmers('abcd', 5) == []
````
Run your program with `pytest -v gibberish.py` and see if it passes.
## Chains
To create the Markov chains, you'll need to get all the kmers of size `k + 1` for all the words in all the texts. That is, if `k=3` you need to find all the 4-mers so that you can find the character *after* each 3-mer in the texts. For example, in the text "The quick brown fox jumps over the lazy dog.", we need to create a data structure that looks like this:
````
106
+
>>> from pprint import pprint as pp
107
+
>>> pp(chains)
108
+
{'bro': ['w'],
109
+
'jum': ['p'],
110
+
'laz': ['y'],
111
+
'ove': ['r'],
112
+
'qui': ['c'],
113
+
'row': ['n'],
114
+
'uic': ['k'],
115
+
'ump': ['s']}
116
+
````
For every 3-mer, we need to know all the characters that follow each. Obviously this is not very exciting given the small size of the input text.
Consider writing a function `read_training(fhs, k=1)` that reads the input training files and returns a dictionary of kmer chains. Then add this test function to verify that it works properly:
````
def test_read_training():
    """Test read_training"""

    text = 'The quick brown fox jumps over the lazy dog.'
    assert read_training([io.StringIO(text)], k=3) == {
        'bro': ['w'], 'jum': ['p'], 'laz': ['y'], 'ove': ['r'],
        'qui': ['c'], 'row': ['n'], 'uic': ['k'], 'ump': ['s']}
````
Once you have the chains of letters that follow each kmer, you can use `random.choice` to find a starting kmer from the `keys` of your chain dictionary. Also use that function to select a length for your new word from the range of `k + 2` to the `args.max_word` (which defaults to `12`). Build up your new word by again using `random.choice` to select from the possibilities for the kmer, which will change through each iteration.
That is, if `k=3` and you start with the randomly selected kmer `ero`, you might get `n` as your next letter. On the next iteration of the loop, the `kmer` will be `ron` and you will look to see what letters follow that 3-mer. You might get `d`, and so the next time you would look for the letters following `ond`, and so forth. Continue until you've built a word that is the length you selected.
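That loop might be sketched like this. The names `chains`, `word`, and `word_len` are my own, and the tiny hand-built chain exists only for illustration; a real run would use the full dictionary built from the training files:

````
import random

# A tiny hand-built chain for illustration only (k=3); each key here has
# a single follower, so the walk is deterministic once it starts.
chains = {'ero': ['n'], 'ron': ['d'], 'ond': ['o'], 'ndo': ['n']}

kmer = random.choice(list(chains))           # random starting kmer
word = kmer
word_len = 7                                 # normally random.choice(range(k + 2, max_word + 1))
while len(word) < word_len and kmer in chains:
    next_char = random.choice(chains[kmer])  # a letter seen after this kmer
    word += next_char
    kmer = kmer[1:] + next_char              # slide the window forward
print(word)
````

The `kmer in chains` check stops the walk early if the new kmer was never seen in the training text, which can happen with real data.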
Hints:
* Define the input files with `type=argparse.FileType('r')` so that `argparse` will validate that the user provides readable files and will then `open` them for you.
* Consider using the `logging` module to print out debugging messages. Run the `solution.py` with the `-d` flag and then inspect the `.log` file.
As recommended in the description, I define my arguments in `get_args` to rely on `argparse` to validate as much as possible, e.g., to verify that I get `int` values or readable files, and I provide reasonable defaults for everything but the required `file` argument. I additionally define a `-d|--debug` flag that is only `True` when present so that I can add this bit of code:
````
logging.basicConfig(
    filename='.log',
    filemode='w',
    level=logging.DEBUG if args.debug else logging.CRITICAL)
````
This is a simple and effective way to turn debugging messages on and off. I usually write to a `.log` file, being sure to choose a name that starts with a `.` so that it will normally be hidden when I `ls` the directory. Since `filemode='w'`, the file will be overwritten on each run. I set the threshold to `logging.DEBUG` if the `debug` flag is `True`; otherwise the `logging` module will only emit messages at the `CRITICAL` level. As I don't have any "critical" messages, the `.log` file will be empty unless `--debug` is present. Then the `logging.debug()` calls throughout my code will only log messages when I ask. This is easier than putting `print` statements in your code that you have to remove or comment out when you are done debugging.
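For instance, a `logging.debug` call like this (the message and data here are hypothetical) only shows up in `.log` when the `DEBUG` level is in effect:

````
import logging

# Pretend args.debug is True for this sketch, so the level is DEBUG.
logging.basicConfig(
    filename='.log',
    filemode='w',
    level=logging.DEBUG)

chains = {'ab': ['c', 'd']}           # hypothetical data
logging.debug('chains = %s', chains)  # written to .log, not the screen
````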
## Finding kmers in text
If you followed my advice about breaking down the problem, then you probably created a `kmers` function:
````
>>> def kmers(text, k=1):
...     return [text[i:i + k] for i in range(len(text) - k + 1)]
...
````
Using the formula given in the intro for the number of kmers in a string, I use the `range` function to get the start position of each of those kmers and then get the slice of the `text` from that position to the position `k` away.
I can verify it works in the REPL:
````
>>> kmers('abcd', 2)
['ab', 'bc', 'cd']
>>> kmers('abcd', 3)
['abc', 'bcd']
````
But more importantly, I can write a `test_kmers` function that I embed in my code and run with `pytest`!
## Reading the input files
Since I used the `argparse.FileType` to define the `file` with `nargs='+'`, I have a `list` of *open file handles* that can be read. I defined a `read_training` function that iterates over all the words in each file by calling `fh.read().split()`. As this breaks the text on spaces, various bits of punctuation may still be attached:
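A quick REPL check illustrates the problem; splitting on whitespace leaves the final period attached:

````
text = 'The quick brown fox jumps over the lazy dog.'
words = text.split()
print(words[-1])  # 'dog.' -- the period is still attached
````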
So I use a regular expression to remove anything that is *not* in the set of letters "a-z" by defining a negated character class `[^a-z]`. I create a one-line function to `lower` the word and `clean` it:
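That one-liner might look something like this sketch (the name `clean` follows the prose above; the exact solution may differ):

````
import re


def clean(word):
    """Lowercase a word and strip anything that is not a-z (sketch)."""

    return re.sub('[^a-z]', '', word.lower())
````

For example, `clean('dog.')` gives `'dog'` and `clean("Can't")` gives `'cant'`.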
I can now get all the kmers for each word by using my `kmers` function. I put all this into a function called `read_training`. It takes a `list` of open file handles (which I get from `argparse`) and a `k` which defaults to `1`:
But, again, *more importantly, I can write a test that verifies it works*! If you copy in the `test_read_training` function, you have the assurance that you are creating valid chains.