bottles_of_beer/discussion.md (2 additions & 0 deletions)
@@ -11,3 +11,5 @@ return parser.parse_args()
11
11
But here we capture the arguments inside `get_args` and add a bit of validation. If `args.num_bottles` is less than one, we call `parser.error` with the message we want to tell the user. We don't have to tell the program to stop executing as `argparse` will exit immediately. Even better is that it will indicate a non-zero exit value to the operating system to indicate there was some sort of error. If you ever start writing command-line programs that chain together to make workflows, this is a way for one program to indicate failure and halt the entire process until the error has been fixed!
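As a minimal sketch of that validation step (not the author's exact code): only `args.num_bottles` and `parser.error` come from the text above. The `-n` short flag, the default of `10`, the wording of the message, and the optional `argv` parameter are assumptions made for illustration.

```python
import argparse


def get_args(argv=None):
    """Parse and validate command-line arguments."""
    parser = argparse.ArgumentParser(description='Bottles of beer song')
    # Flag name, default, and help text are assumptions for this sketch.
    parser.add_argument('-n', '--num_bottles', metavar='INT', type=int,
                        default=10, help='How many bottles')
    args = parser.parse_args(argv)

    # parser.error prints the usage plus this message to stderr and then
    # exits with status 2, so no explicit exit call is needed.
    if args.num_bottles < 1:
        parser.error('--num_bottles "{}" must be > 0'.format(args.num_bottles))

    return args
```

Calling `get_args(['-n', '0'])` raises `SystemExit` with code `2`, which is the non-zero exit value a calling workflow would see.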
12
12
13
13
Once you get to the line `args = get_args()` in `main`, a great deal of hard work has already occurred to get and validate the input from the user. From here, I decided to create a template for the song putting `{}` in the spots that change from verse to verse. Then I use the `reversed(range(...))` bit we discussed before to count down, with a `for` loop, using the current number `bottle` and `next_bottle` to print out the verse noting the presence or absence of the `s` where appropriate.
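The template idea could look roughly like the sketch below. The exact lyrics, the `verse` helper, and the "No more" wording are assumptions, and only `bottle`, `next_bottle`, and the `reversed(range(...))` countdown come from the description above.

```python
def verse(bottle):
    """Return one verse of the song for the given number of bottles."""
    next_bottle = bottle - 1
    s1 = '' if bottle == 1 else 's'        # plural for the current count
    s2 = '' if next_bottle == 1 else 's'   # plural for the next count
    num_next = 'No more' if next_bottle == 0 else next_bottle
    # The {} placeholders mark the spots that change from verse to verse.
    tmpl = '\n'.join([
        '{} bottle{} of beer on the wall,',
        '{} bottle{} of beer,',
        'Take one down, pass it around,',
        '{} bottle{} of beer on the wall!',
    ])
    return tmpl.format(bottle, s1, bottle, s1, num_next, s2)


# Count down from 2 to 1, printing a blank line between verses.
for bottle in reversed(range(1, 3)):
    print(verse(bottle), end='\n\n')
```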
14
+
15
+
I'd like to stress that there are literally hundreds of ways to solve this problem. The website http://www.99-bottles-of-beer.net/ claims to have 1500 variations in various languages, 15 in Python alone. As always, the solution you wrote and understand and that passes the test suite is the "right" solution.
gibberish/README.md (24 additions & 16 deletions)
@@ -11,7 +11,7 @@ m = a
11
11
12
12
That is, given this training set, if you started with `l` you could only choose an `a`, but if you have `a` then you could choose `l`, `b`, or `m`.
13
13
14
-
The program should generate `-n|--num_words` words (default `10`), each a random size between `k` + 2 and a `-m|--max_word` size (default `12`). Be sure to accept `-s|--seed` to pass to `random.seed`. My solution also takes a `-d|--debug` flag that will emit debug messages to `.log` for you to inspect.
14
+
The program should generate `-n|--num_words` words (default `10`), each a random size between `k+2` and a `-m|--max_word` size (default `12`). Be sure to accept `-s|--seed` to pass to `random.seed`. My solution also takes a `-d|--debug` flag that will emit debug messages to `.log` for you to inspect.
15
15
16
16
If provided no arguments or the `-h|--help` flag, generate a usage:
@@ -78,22 +78,30 @@ Chose the best words and create definitions for them:
78
78
* umjamp: skateboarding trick
79
79
* callots: insignia of officers in Greek army
80
80
* urchenev: fungal growth found under cobblestones
81
-
81
+
82
82
## Kmers
83
83
84
-
To create the Markov chains, you'll need to read all the words from each file. Use `str.lower` to lowercase all the text and then remove any character that is not in the regular English alphabet (a-z). You'll need to extract "k-mers" or "n-grams" from each word. In the text "abcd," if `k=2` then the 2-mers are "ab," "bc," and "cd." If `k=3`, then the 3-mers are "abc" and "bcd." It may be helpful to know the number `n` of kmers `k` is proportional to the length `l` of the string `n = l - k + 1`.
85
-
86
-
Consider writing a function `kmers(text, k=1)` that only extracts kmers from some text, and then add this function to your program:
84
+
To create the Markov chains, first you'll need to read all the words from each file. Use `str.lower` to lowercase all the text and then remove any characters that are not in the regular English alphabet (a-z). A regular expression is handy for that:
87
85
88
86
````
89
-
def test_kmers():
90
-
"""Test kmers"""
87
+
>>> import re
88
+
>>> re.sub('[^a-z]', '', 'H48,`b09e3!"')
89
+
'be'
90
+
````
91
91
92
-
assert kmers('abcd') == list('abcd')
93
-
assert kmers('abcd', 2) == ['ab', 'bc', 'cd']
94
-
assert kmers('abcd', 3) == ['abc', 'bcd']
95
-
assert kmers('abcd', 4) == ['abcd']
96
-
assert kmers('abcd', 5) == []
92
+
You'll need to extract "k-mers" or "n-grams" from each word. In the text "abcd," if `k=2` then the 2-mers are "ab," "bc," and "cd." If `k=3`, then the 3-mers are "abc" and "bcd." It may be helpful to know that the number `n` of k-mers of size `k` in a string of length `l` is `n = l - k + 1`.
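To check the formula by hand, here is a small sketch that only counts the k-mers without extracting them. The `num_kmers` helper is hypothetical, and clamping at zero is an assumption that matches the idea that a string shorter than `k` has no k-mers.

```python
def num_kmers(length, k):
    """Number of k-mers in a string of the given length (never negative)."""
    return max(0, length - k + 1)


# For "abcd" (l = 4): k=2 gives 3 k-mers, k=3 gives 2, k=5 gives 0.
for k in range(1, 6):
    print(k, num_kmers(len('abcd'), k))
```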
93
+
94
+
Consider writing a function `get_kmers(text, k=1)` that only extracts kmers from some text, and then add this function to your program:
95
+
96
+
````
97
+
def test_get_kmers():
98
+
    """Test get_kmers"""
99
+
100
+
    assert get_kmers('abcd') == list('abcd')
101
+
    assert get_kmers('abcd', 2) == ['ab', 'bc', 'cd']
102
+
    assert get_kmers('abcd', 3) == ['abc', 'bcd']
103
+
    assert get_kmers('abcd', 4) == ['abcd']
104
+
    assert get_kmers('abcd', 5) == []
97
105
````
98
106
99
107
Run your program with `pytest -v gibberish.py` and see if it passes.
@@ -115,7 +123,7 @@ To create the Markov chains, you'll need to get all the kmers for `k+1` for all
115
123
'ump': ['s']}
116
124
````
117
125
118
-
For every 3-mer, we need to know all the characters that follow each. Obviously this is not very exciting given the small size of the input text.
126
+
For every 3-mer, we need to know all the characters that follow each. Obviously this is not very exciting given the small size of the input text. If `k=2`, then you will see that `th` has two options, `e` and `e`. It's important to note how you will represent the choices for a given kmer. Will you use a `list`, a `set`, or a `collections.Counter`? Consider the implications. A `set` is smaller as it will represent only the *unique* letters but you will lose information about the *frequency* of letters. A `Counter` would store letters and counts, but how will you sample from that in a way that takes into account frequency? A `list` is probably the easiest structure.
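The trade-off between the three structures can be sketched as follows. The `followers` data is made up for illustration, and using `random.choices` with weights is one possible way to sample from a `Counter`, not necessarily the approach the solution takes.

```python
import random
from collections import Counter

followers = ['e', 'e', 'a']   # hypothetical letters seen after one kmer

random.seed(1)

# A list keeps duplicates, so random.choice is already frequency-weighted.
from_list = random.choice(followers)

# A set keeps only the unique letters; the frequencies are lost.
unique = set(followers)

# A Counter stores letter counts; random.choices can use them as weights.
counts = Counter(followers)
letters = list(counts.keys())
weights = [counts[c] for c in letters]
from_counter = random.choices(letters, weights=weights, k=1)[0]
```

With the list, `'e'` is twice as likely as `'a'` at no extra cost, which is why it is the simplest choice here.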
119
127
120
128
Consider writing a function `read_training(fhs, k=1)` that reads the input training files and returns a dictionary of kmer chains. Then add this function to test that it works properly:
121
129
@@ -143,9 +151,9 @@ def test_read_training():
143
151
144
152
## Making new words
145
153
146
-
Once you have the chains of letters that follow each kmer, you need can use `random.choice` to find a starting kmer from the `keys` of your chain dictionary. Also use that function to select a length for your new word from the range of `k + 2` to the `args.max_word` (which defaults to `12`). Build up your new word by again using `random.choice` to select from the possibilities for the kmer which will change through each iteration.
154
+
Once you have the chains of letters that follow each kmer, you can use `random.choice` to find a starting kmer from the `keys` of your chain dictionary. Use the same function to select a length for your new word from the range of `k+2` to `args.max_word` (which defaults to `12`). Build up your new word by again using `random.choice` to select from the possibilities for the current kmer, which changes on each iteration.
147
155
148
-
That is, if you `k=3` and you start with the randomly selected kmer `ero`, you might get `n` as your next letter. On the next iteration of the loop, the `kmer` will be `ron` and you will look to see what letters follow that 3-mer. You might get `d`, and so the next time you would look for those letters following `ond`, and so forth. Continue until you've built a word that is the length you selected.
156
+
That is, if `k=3` and you start with the randomly selected kmer `ero`, you might get `n` as your next letter. On the next iteration of the loop, the `kmer` will be `ron` and you will look to see what letters follow that 3-mer. You might get `d`, and so the next time you would look for those letters following `ond`, and so forth. Continue until you've built a word that is the length you selected.
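The walk described above can be traced with a tiny hand-built dictionary. Everything in this sketch is hypothetical: the `chains` contents, the fixed starting kmer, and the fixed target length stand in for values that would normally come from the training files and from `random.choice`.

```python
import random

# Hypothetical chains for k=3; real chains come from the training files.
chains = {'ero': ['n'], 'ron': ['d'], 'ond': ['e'], 'nde': ['r']}
k = 3

word = 'ero'   # starting kmer, normally chosen at random from the keys
length = 6     # target length, normally chosen between k+2 and max_word

while len(word) < length:
    kmer = word[-k:]                  # the last k letters of the word so far
    options = chains.get(kmer)
    if not options:                   # dead end: nothing follows this kmer
        break
    word += random.choice(options)

print(word)  # prints "eronde" because each kmer here has a single option
```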
gibberish/discussion.md (52 additions & 8 deletions)
@@ -1,4 +1,4 @@
1
-
As recommended in the description, I define my arguments in `get_args` to rely on `argparse` to validate as much as possible, e.g. verify that I get `int` values or readabnle files, and provide reasonable defaults for everything but the required `file` argument. I additionally define a `-d|--debug` flag that is only `True` when present so that I can add this bit of code:
1
+
As recommended in the description, I define my arguments in `get_args` to rely on `argparse` to validate as much as possible, e.g., to verify that I get `int` values and readable files, and to provide reasonable defaults for everything but the required `file` argument. I additionally define a `-d|--debug` flag that is only `True` when present so that I can add this bit of code:
2
2
3
3
````
4
4
logging.basicConfig(
@@ -11,10 +11,10 @@ This is a simple and effective way to turn debugging messages on and off. I usua
11
11
12
12
## Finding kmers in text
13
13
14
-
If you followed my advice about breaking down the problem, then you probably created a `kmers` function:
14
+
If you followed my advice about breaking down the problem, then you probably created a `get_kmers` function using the formula for the number of kmers in a given text (`n = l - k + 1`):
15
15
16
16
````
17
-
>>> def kmers(text, k=1):
17
+
>>> def get_kmers(text, k=1):
18
18
...     return [text[i:i + k] for i in range(len(text) - k + 1)]
19
19
...
20
20
````
@@ -24,15 +24,15 @@ Using the formula given in the intro for the number of kmers in a string, I use
24
24
I can verify it works in the REPL:
25
25
26
26
````
27
-
>>> kmers('abcd', 2)
27
+
>>> get_kmers('abcd', 2)
28
28
['ab', 'bc', 'cd']
29
-
>>> kmers('abcd', 3)
29
+
>>> get_kmers('abcd', 3)
30
30
['abc', 'bcd']
31
31
````
32
32
33
33
But more importantly, I can write a `test_get_kmers` function that I embed in my code and run with `pytest`!
34
34
35
-
## Reading the input files
35
+
## Reading the training files
36
36
37
37
Since I used the `argparse.FileType` to define the `file` with `nargs='+'`, I have a `list` of *open file handles* that can be read. I defined a `read_training` function that iterates over all the words in each file by calling `fh.read().split()`. As this breaks the text on spaces, various bits of punctuation may still be attached:
38
38
@@ -67,12 +67,14 @@ I can now get all the kmers for each word by using my `kmers` function. I put al
Note the handling of the kmers. I actually request `k+1`-mers and then slice `kmer[:-1]` to get the actual `k`-mer (everything but the last letter) and then `append` `kmer[-1]` (the last letter) to the `chains` for that `k`-mer.
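Since the function body is not shown in this part of the diff, here is a rough reconstruction of the idea from the description alone. It assumes a `defaultdict(list)` for the chains and a regular expression to strip non-letters, and it uses `io.StringIO` as a stand-in for an open file handle.

```python
import io
import re
from collections import defaultdict


def get_kmers(text, k=1):
    """Return all k-mers in text."""
    return [text[i:i + k] for i in range(len(text) - k + 1)]


def read_training(fhs, k=1):
    """Map each k-mer to the list of letters that follow it."""
    chains = defaultdict(list)
    for fh in fhs:
        for word in fh.read().lower().split():
            word = re.sub('[^a-z]', '', word)    # drop punctuation, digits
            # Ask for (k+1)-mers: the first k letters are the key and
            # the last letter is one observed follower of that key.
            for kmer in get_kmers(word, k + 1):
                chains[kmer[:-1]].append(kmer[-1])
    return chains


chains = read_training([io.StringIO('The then.')], k=2)
print(dict(chains))  # {'th': ['e', 'e'], 'he': ['n']}
```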
77
+
76
78
I can verify it works:
77
79
78
80
````
@@ -87,4 +89,46 @@ defaultdict(<class 'list'>,
87
89
'suall': ['y']})
88
90
````
89
91
90
-
But, again, *more importantly is that I can write a test that verifies it works*! If you copy in the `test_read_training` function, you have the knowledge that you are creating valid chains.
92
+
But, again, *more importantly, I can write a test that verifies it works*! If you copy in the `test_read_training` function, you have the assurance that you are creating valid chains.
93
+
94
+
## Making new words
95
+
96
+
Once I have the chains from all the input files, I need to use a `for` loop for the `range(args.num_words)`. Each time through the loop, I need to choose a starting kmer for a new word and a length
... print('Length "{}" starting with "{}"'.format(length, word))
108
+
...
109
+
Length "9" starting with "pid"
110
+
Length "7" starting with "cas"
111
+
Length "8" starting with "orr"
112
+
````
113
+
114
+
OK, that's our starting point. Given a starting kmer like `'pid'`, we need to create a `while` loop that will continue as long as the `len(word)` is less than the `length` we chose for the word. Each time through the loop, I'll set the current `kmer` to the last `k` letters of the `word`. I use `random.choice` to select from `chains[kmer]` to find the next `char` (character) and append that to the `word`:
115
+
116
+
````
117
+
>>> while len(word) < length:
118
+
...     kmer = word[-1 * k:]
119
+
...     if not chains[kmer]: break
120
+
...     char = random.choice(list(chains[kmer]))
121
+
...     word += char
122
+
...
123
+
>>> print(word)
124
+
piders
125
+
````
126
+
127
+
It can sometimes happen that there are no options for a given `kmer`. That is, `chains[kmer]` is an empty list, so in my code I add a check to `break` out of the `while` loop if `chains[kmer]` evaluates to `False`.
128
+
129
+
Finally I `print` out the number of the word and the word itself using a format string to align the numbers and text:
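The actual format string is not visible in this part of the diff, so the following is only one plausible shape for it. The `format_word` helper and the field width of `3` are assumptions.

```python
def format_word(num, word):
    """Right-align the word number in a 3-character field."""
    return '{:3}: {}'.format(num, word)


# 'piders' is the word built above; the list itself is hypothetical.
for i, word in enumerate(['piders'], start=1):
    print(format_word(i, word))
```

Integers are right-aligned by default in a `str.format` field, so the numbers line up once the count reaches two or three digits.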