Skip to content

Commit 71408fa

Browse files
committed
moog exercise
1 parent e29f565 commit 71408fa

File tree

4 files changed

+387
-0
lines changed

4 files changed

+387
-0
lines changed

extra/09_synthetic_dna/Makefile

Lines changed: 10 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,10 @@
1+
.PHONY: test pdf clean
2+
3+
pdf:
4+
asciidoctor-pdf README.adoc
5+
6+
test:
7+
pytest -xv --disable-pytest-warnings test.py
8+
9+
clean:
10+
rm -rf __pycache__ .pytest

extra/09_synthetic_dna/README.adoc

Lines changed: 230 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,230 @@
1+
= Creating synthetic DNA/RNA sequences
2+
3+
In this exercise, you will write a Python program called `moog.py` footnote:[Why "moog"?] that will generate a FASTA-formatted footnote:[https://en.wikipedia.org/wiki/FASTA_format] file of synthetic DNA or RNA.
4+
The program will accept the following optional arguments:
5+
6+
* `-o`|`--outfile`: The output file to write the sequences (default "out.fa")
7+
* `-t`|`--seqtype`: The sequence type, either `dna` or `rna` (`str`, default "dna")
8+
* `-n`|`--numseqs`: The number of sequences to generate (`int`, default 10)
9+
* `-m`|`--minlen`: The minimum length for any sequence (`int`, default 50)
10+
* `-x`|`--maxlen`: The maximum length for any sequence (`int`, default 75)
11+
* `-p`|`--pctgc`: The average percentage of GC content for a sequence (`float`, default 0.5 or 50%)
12+
* `-s`|`--seed`: An integer value to use for the random seed (`int`, default `None`) so that the random choices of the program can be repeated under testing conditions.
13+
14+
Here is the usage the program should generate:
15+
16+
----
17+
$ ./moog.py -h
18+
usage: moog.py [-h] [-o str] [-t str] [-n int] [-m int] [-x int] [-p float]
19+
[-s int]
20+
21+
Create synthetic sequences
22+
23+
optional arguments:
24+
-h, --help show this help message and exit
25+
-o str, --outfile str
26+
Output filename (default: out.fa)
27+
-t str, --seqtype str
28+
DNA or RNA (default: dna)
29+
-n int, --numseqs int
30+
Number of sequences to create (default: 10)
31+
-m int, --minlen int Minimum length (default: 50)
32+
-x int, --maxlen int Maximum length (default: 75)
33+
-p float, --pctgc float
34+
Percent GC (default: 0.5)
35+
-s int, --seed int Random seed (default: None)
36+
----
37+
38+
For instance, I can run it to create 3 sequences with the default values, and the program will tell me how many of what kind of sequences were placed into which output file:
39+
40+
----
41+
$ ./moog.py -n 3 -s 1
42+
Done, wrote 3 DNA sequences to "out.fa".
43+
----
44+
45+
The output file should be in FASTA format:
46+
47+
* Each sequence record takes up two lines
48+
* The first line for each sequence record starts with a literal `>` (greater than sign) and is followed by a unique identifier. For this exercise, the ID is not important so numbering the sequences in order is sufficient.
49+
* The second line of a record is the sequence itself. Note that some FASTA formats will limit the length of this line and so may break up the sequence over several lines. This is not necessary. The sequence can be one very long line. If you really want to break the sequence after something like 80 characters, that is fine, too.
50+
51+
Here is what the output for the above should (might) look like:
52+
53+
----
54+
$ cat out.fa
55+
>1
56+
ATTTGCATAGGAGCAGGACAAAGGGCTCGACTCTTCCGCGCCATGTTGTATCAGAACA
57+
>2
58+
CCCTTGATCGGCCCGGGGGTACGCATACCGTACAAGCTGGTTAATTACTAAAAATTACTGAAACGGAATGC
59+
>3
60+
TTCTGTGGGAGTCAGAGACCTATGAAGATTCTAATAGCAGACGCCAAGATCCGCAGCACAT
61+
----
62+
63+
== Writing `moog.py`
64+
65+
The first challenge is to define all your arguments correctly.
66+
Onerous as this is, perhaps 30% of the program is in accepting the arguments correctly!
67+
68+
Some tips:
69+
70+
* Be sure to use `choices` for the `seqtype` argument
71+
* Consider using `type=argparse.FileType('wt')` for the `--outfile`. You don't have to do this, but, if you do, then `args.outfile` will be an open, writable file handle!
72+
* You should also manually verify that `pctgc` is between 0 and 1 which can be done with a compound comparison like so:
73+
74+
----
75+
args = parser.parse_args()
76+
77+
if not 0 < args.pctgc < 1:
78+
parser.error(f'--pctgc "{args.pctgc}" must be between 0 and 1')
79+
----
80+
81+
You should be sure to set the random seed immediately after accepting the arguments, _before_ you do anything using the `random` module.
82+
Then your program will need to get a "pool" of bases for creating the sequences:
83+
84+
----
85+
def main():
86+
args = get_args()
87+
random.seed(args.seed)
88+
pool = create_pool(args.pctgc, args.maxlen, args.seqtype)
89+
----
90+
91+
You can use the following `create_pool` function.
92+
The function accepts three positional arguments:
93+
94+
. `pctgc`: The average percentage of GC content for each sequence
95+
. `max_len`: The maximum length for any sequence
96+
. `seq_type`: The sequence type, either "rna" or "dna"
97+
98+
----
99+
def create_pool(pctgc, max_len, seq_type):
100+
""" Create the pool of bases """
101+
102+
t_or_u = 'T' if seq_type == 'dna' else 'U' <1>
103+
num_gc = int((pctgc / 2) * max_len) <2>
104+
num_at = int(((1 - pctgc) / 2) * max_len) <3>
105+
pool = 'A' * num_at + 'C' * num_gc + 'G' * num_gc + t_or_u * num_at <4>
106+
107+
for _ in range(max_len - len(pool)): <5>
108+
pool += random.choice(pool)
109+
110+
return ''.join(sorted(pool)) <6>
111+
----
112+
113+
<1> Choose either "T" if `seq_type` is "dna" or choose "U" for "rna"
114+
<2> The number of G or C bases is the `pctgc` divided by 2 times the number of bases.
115+
<3> The number of A or T/U bases is the `1 - pctgc` divided by 2 times the number of bases.
116+
<4> The `pool` of bases will be each base in "ACG[TU]" repeated the correct number of times.
117+
<5> Because of rounding issues, we may not actually have enough bases, so pad the pool with random choices from the existing pool (in the hopes this essentially keeps the GC content the same).
118+
<6> Return a sorted string of the bases.
119+
120+
Here is the test for the function.
121+
Notice how the test also gives us a very clear understanding of how we'll pass in and receive values:
122+
123+
----
124+
def test_create_pool():
125+
""" Test create_pool """
126+
127+
state = random.getstate() <1>
128+
random.seed(1) <2>
129+
assert create_pool(.5, 10, 'dna') == 'AAACCCGGTT'
130+
assert create_pool(.6, 11, 'rna') == 'AACCCCGGGUU'
131+
assert create_pool(.7, 12, 'dna') == 'ACCCCCGGGGGT'
132+
assert create_pool(.7, 20, 'rna') == 'AAACCCCCCCGGGGGGGUUU'
133+
assert create_pool(.4, 15, 'dna') == 'AAAACCCGGGTTTTT'
134+
random.setstate(state) <3>
135+
----
136+
137+
<1> The state of the `random` module is _global_ to the program. Any change we make here could affect unknown parts of the program, so we save our current state.
138+
<2> Set the random seed to a known value. This is a _global change_ to our program. Any other calls to `random` after this line are affected, even if they are in another function or module!
139+
<3> Reset the global state to the original value.
140+
141+
With the above functions, your program is essentially left to fill in this:
142+
143+
----
144+
def main():
145+
args = get_args()
146+
random.seed(args.seed)
147+
pool = create_pool(args.pctgc, args.maxlen, args.seqtype)
148+
149+
for ...: <1>
150+
seq_len = ... <2>
151+
seq = ... <3>
152+
args.outfile.write(...)) <4>
153+
154+
print(...) <5>
155+
----
156+
157+
<1> You need to do this `args.numseqs` times
158+
<2> Use `random.choice` to select a number between `args.minlen` and `arg.maxlen`
159+
<3> Use `random.sample` to select `seq_len` number of bases from the `pool` and make a new `str` for your sequence.
160+
<4> Write the new `seq` to the output file in the FASTA format.
161+
<5> Print the output message.
162+
163+
== Using the `random` functions
164+
165+
As noted in the `abuse` chapter, we can use `random.seed` to control the pseudo-random generator in Python.
166+
This allows us to test that our random functions are _reproducible_!
167+
You should set this value before calling any functions in the `random` module.
168+
The default value for your `--seed` parameter should be `None`.
169+
If you set the seed to `None`, it is the same as not setting it at all.
170+
The seed can be a `str` or an `int`, but stick with using an `int` for this exercise.
171+
172+
For each sequence, you will use `random.randint` to select a length for the sequence between the min/max values:
173+
174+
----
175+
>>> import random
176+
>>> min_len = 5
177+
>>> max_len = 15
178+
>>> seq_len = random.randint(min_len, max_len)
179+
>>> seq_len
180+
12
181+
----
182+
183+
You will then use this value to select the bases for your new sequence:
184+
185+
----
186+
>>> random.sample(pool, seq_len)
187+
['A', 'T', 'A', 'T', 'C', 'C', 'G', 'C', 'G', 'A', 'G', 'T']
188+
----
189+
190+
The output should be written to the output file like so (assuming this is the first sequence):
191+
192+
----
193+
>1
194+
ATATCCGCGAGT
195+
----
196+
197+
== Testing
198+
199+
You may need to install BioPython and numpy in order to run the tests:
200+
201+
----
202+
$ python3 -m pip install biopython numpy
203+
----
204+
205+
I would recommend you study the `test.py` as it uses the BioPython module to parse the output file your program creates so as to check:
206+
207+
* that the output file is parsable as FASTA format
208+
* that the output file has the correct number of sequences
209+
* that the sequences lie in the range of min/max lengths
210+
* that the bases are correct for the given sequence type
211+
* that the average GC content of the sequences is close to the value indicated
212+
213+
Note that BioPython will emit a deprecation warning under `pytest`, so I have added an additional flag `--disable-pytest-warnings` to the `make test` target that you should use:
214+
215+
----
216+
$ make test
217+
pytest -xv --disable-pytest-warnings test.py
218+
============================= test session starts ==============================
219+
...
220+
collected 6 items
221+
222+
test.py::test_exists PASSED [ 16%]
223+
test.py::test_usage PASSED [ 33%]
224+
test.py::test_bad_seqtype PASSED [ 50%]
225+
test.py::test_bad_pctgc PASSED [ 66%]
226+
test.py::test_defaults PASSED [ 83%]
227+
test.py::test_options PASSED [100%]
228+
229+
========================= 6 passed, 1 warning in 0.87s =========================
230+
----

extra/09_synthetic_dna/README.pdf

120 KB
Binary file not shown.

extra/09_synthetic_dna/test.py

Lines changed: 147 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,147 @@
1+
#!/usr/bin/env python3
2+
"""tests for moog.py"""
3+
4+
import os
5+
import random
6+
import re
7+
import string
8+
from subprocess import getstatusoutput
9+
from Bio import SeqIO
10+
from Bio.SeqUtils import GC
11+
from numpy import mean
12+
from itertools import chain
13+
14+
prg = './moog.py'
15+
16+
17+
# --------------------------------------------------
18+
def random_string():
19+
"""generate a random string"""
20+
21+
return ''.join(random.choices(string.ascii_uppercase + string.digits, k=5))
22+
23+
24+
# --------------------------------------------------
25+
def test_exists():
26+
"""usage"""
27+
28+
assert os.path.isfile(prg)
29+
30+
31+
# --------------------------------------------------
32+
def test_usage():
33+
"""usage"""
34+
35+
for flag in ['-h', '--help']:
36+
rv, out = getstatusoutput('{} {}'.format(prg, flag))
37+
assert rv == 0
38+
assert re.match("usage", out, re.IGNORECASE)
39+
40+
41+
# --------------------------------------------------
42+
def test_bad_seqtype():
43+
"""die on bad seqtype"""
44+
45+
bad = random_string()
46+
rv, out = getstatusoutput(f'{prg} -t {bad}')
47+
assert rv != 0
48+
assert re.match('usage:', out, re.I)
49+
assert re.search(
50+
f"-t/--seqtype: invalid choice: '{bad}' \(choose from 'dna', 'rna'\)",
51+
out)
52+
53+
54+
# --------------------------------------------------
55+
def test_bad_pctgc():
56+
"""die on bad pctgc"""
57+
58+
bad = random.randint(1, 10)
59+
rv, out = getstatusoutput(f'{prg} -p {bad}')
60+
assert rv != 0
61+
assert re.match('usage:', out, re.I)
62+
assert re.search(f'--pctgc "{float(bad)}" must be between 0 and 1', out)
63+
64+
65+
# --------------------------------------------------
66+
def test_defaults():
67+
"""runs on good input"""
68+
69+
out_file = 'out.fa'
70+
try:
71+
if os.path.isfile(out_file):
72+
os.remove(out_file)
73+
74+
rv, out = getstatusoutput(prg)
75+
assert rv == 0
76+
assert out == f'Done, wrote 10 DNA sequences to "{out_file}".'
77+
assert os.path.isfile(out_file)
78+
79+
# correct number of seqs
80+
seqs = list(SeqIO.parse(out_file, 'fasta'))
81+
assert len(seqs) == 10
82+
83+
# the lengths are in the correct range
84+
seq_lens = list(map(lambda seq: len(seq.seq), seqs))
85+
assert max(seq_lens) <= 75
86+
assert min(seq_lens) >= 50
87+
88+
# bases are correct
89+
bases = ''.join(
90+
sorted(
91+
set(chain(map(lambda seq: ''.join(sorted(set(seq.seq))),
92+
seqs)))))
93+
assert bases == 'ACGT'
94+
95+
# the pct GC is about right
96+
gc = list(map(lambda seq: GC(seq.seq) / 100, seqs))
97+
assert .47 <= mean(gc) <= .53
98+
99+
finally:
100+
if os.path.isfile(out_file):
101+
os.remove(out_file)
102+
103+
104+
# --------------------------------------------------
105+
def test_options():
106+
"""runs on good input"""
107+
108+
out_file = random_string() + '.fasta'
109+
try:
110+
if os.path.isfile(out_file):
111+
os.remove(out_file)
112+
113+
min_len = random.randint(50, 99)
114+
max_len = random.randint(100, 150)
115+
num_seqs = random.randint(100, 150)
116+
pct_gc = random.random()
117+
cmd = (f'{prg} -m {min_len} -x {max_len} -o {out_file} '
118+
f'-n {num_seqs} -t rna -p {pct_gc:.02f} -s 1')
119+
rv, out = getstatusoutput(cmd)
120+
121+
assert rv == 0
122+
assert out == f'Done, wrote {num_seqs} RNA sequences to "{out_file}".'
123+
assert os.path.isfile(out_file)
124+
125+
# correct number of seqs
126+
seqs = list(SeqIO.parse(out_file, 'fasta'))
127+
assert len(seqs) == num_seqs
128+
129+
# the lengths are in the correct range
130+
seq_lens = list(map(lambda seq: len(seq.seq), seqs))
131+
assert max(seq_lens) <= max_len
132+
assert min(seq_lens) >= min_len
133+
134+
# bases are correct
135+
bases = ''.join(
136+
sorted(
137+
set(chain(map(lambda seq: ''.join(sorted(set(seq.seq))),
138+
seqs)))))
139+
assert bases == 'ACGU'
140+
141+
# the pct GC is about right
142+
gc = list(map(lambda seq: GC(seq.seq) / 100, seqs))
143+
assert pct_gc - .3 <= mean(gc) <= pct_gc + .3
144+
145+
finally:
146+
if os.path.isfile(out_file):
147+
os.remove(out_file)

0 commit comments

Comments
 (0)