Commit a61b846

adding documentation for the genomics module
1 parent 5099a7e commit a61b846

File tree

11 files changed

+454
-6
lines changed

README.md

Lines changed: 6 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -11,8 +11,14 @@ At the moment this tutorial is still under development. Please check the [BioJa
1111

1212
## Index
1313

14+
Quick [Installation](installation.md)
15+
1416
Book 1: [The Protein Structure modules](structure/README.md)
1517

18+
Book 2: [The Genomics Module](genomics/README.md)
19+
20+
Book 3: Alignments
21+
1622

1723
## License
1824

genomics/README.md

Lines changed: 61 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,61 @@
1+
The BioJava - Genomics Module
2+
=====================================================
3+
4+
A tutorial for the genomics module of [BioJava](http://www.biojava.org)
5+
6+
## About
7+
<table>
8+
<tr>
9+
<td>
10+
<img src="img/genomics.png"/>
11+
</td>
12+
<td>
13+
The <i>genome</i> module of BioJava provides an API that allows you to
14+
<ul>
15+
<li>Parse popular file formats used in genomics</li>
16+
<li>Convert from one file format to another</li>
17+
<li>Translate DNA sequences into protein sequences</li>
18+
</ul>
19+
20+
</td>
21+
</tr>
22+
</table>
23+
24+
## Index
25+
26+
This tutorial is split into several chapters.
27+
28+
Chapter 1 - Quick [Installation](installation.md)
29+
30+
Chapter 2 - Reading [gene names information](genenames.md) from genenames.org
31+
32+
Chapter 3 - Reading [chromosomal positions](chromosomeposition.md) for genes (UCSC's refFlat.txt.gz)
33+
34+
Chapter 4 - Reading [GTF and GFF files](gff.md)
35+
36+
Chapter 5 - Reading and writing a [Genbank](genebank.md) file
37+
38+
Chapter 6 - Reading [karyotype (cytoband)](karyotype.md) files
39+
40+
Chapter 7 - Reading UCSC's .2bit files
41+
42+
43+
44+
### Author:
45+
46+
[Andreas Prli&#263;](https://github.com/andreasprlic)
47+
48+
## Please cite
49+
50+
**BioJava: an open-source framework for bioinformatics in 2012**<br/>
51+
*Andreas Prlic; Andrew Yates; Spencer E. Bliven; Peter W. Rose; Julius Jacobsen; Peter V. Troshin; Mark Chapman; Jianjiong Gao; Chuan Hock Koh; Sylvain Foisy; Richard Holland; Gediminas Rimsa; Michael L. Heuer; H. Brandstatter-Muller; Philip E. Bourne; Scooter Willis* <br/>
52+
[Bioinformatics (2012) 28 (20): 2693-2695.](http://bioinformatics.oxfordjournals.org/content/28/20/2693.abstract) <br/>
53+
doi: 10.1093/bioinformatics/bts494
54+
55+
## License
56+
57+
The content of this tutorial is available under the [CC-BY](http://creativecommons.org/licenses/by/3.0/) license.
58+
59+
[view license](../license.md)
60+
61+

genomics/chromosomeposition.md

Lines changed: 41 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,41 @@
1+
Parse Chromosomal Information of Genes
2+
======================================
3+
4+
BioJava contains a parser for the [refFlat.txt.gz](http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/refFlat.txt.gz) file
5+
from the UCSC genome browser that contains a mapping of gene names to chromosome positions.
6+
7+
8+
```java
try {
    // Parse refFlat.txt.gz from its default UCSC location
    List<GeneChromosomePosition> genePositions = GeneChromosomePositionParser.getChromosomeMappings();
    System.out.println("got " + genePositions.size() + " gene positions");

    // Print the first entry for the gene FOLH1
    for (GeneChromosomePosition pos : genePositions) {
        if (pos.getGeneName().equals("FOLH1")) {
            System.out.println(pos);
            break;
        }
    }
} catch (Exception e) {
    e.printStackTrace();
}
```
25+
26+
If a local copy of the file is available, it can be provided as an input stream:
27+
28+
29+
```java
// Point the URL at the local copy (note the three slashes for an absolute path)
URL url = new URL("file:///local/copy/of/file");

InputStreamProvider prov = new InputStreamProvider();
InputStream inStream = prov.getInputStream(url);

GeneChromosomePositionParser.getChromosomeMappings(inStream);
```
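
For orientation, each line of refFlat.txt describes one transcript as tab-separated columns: gene name, transcript name, chromosome, strand, transcription start and end, coding start and end, exon count, exon starts and exon ends. The following is a minimal standalone sketch of that layout using only the JDK. `RefFlatSketch` and its `Entry` class are hypothetical names chosen for illustration and are not part of BioJava, and the sample line uses made-up coordinates.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; not part of BioJava.
class RefFlatSketch {

    /** The fields of one refFlat line that this sketch keeps. */
    static final class Entry {
        final String geneName;
        final String transcript;
        final String chromosome;
        final char strand;
        final int txStart;
        final int txEnd;
        final List<Integer> exonStarts;
        final List<Integer> exonEnds;

        Entry(String[] cols) {
            geneName = cols[0];
            transcript = cols[1];
            chromosome = cols[2];
            strand = cols[3].charAt(0);
            txStart = Integer.parseInt(cols[4]);
            txEnd = Integer.parseInt(cols[5]);
            // cdsStart, cdsEnd and exonCount (cols[6] to cols[8]) are skipped in this sketch
            exonStarts = parsePositions(cols[9]);
            exonEnds = parsePositions(cols[10]);
        }
    }

    /** Splits one tab-separated refFlat line into an Entry. */
    static Entry parseLine(String line) {
        String[] cols = line.split("\t");
        if (cols.length < 11) {
            throw new IllegalArgumentException("expected 11 columns, got " + cols.length);
        }
        return new Entry(cols);
    }

    /** Parses a comma-separated position list such as "1000,3000,". */
    static List<Integer> parsePositions(String csv) {
        List<Integer> positions = new ArrayList<Integer>();
        for (String token : csv.split(",")) {
            if (!token.isEmpty()) {
                positions.add(Integer.parseInt(token));
            }
        }
        return positions;
    }

    public static void main(String[] args) {
        // Made-up example line, not real annotation data
        String line = "GENE1\tNM_000000\tchr11\t-\t1000\t5000\t1200\t4800\t2\t1000,3000,\t2000,5000,";
        Entry e = parseLine(line);
        System.out.println(e.geneName + " " + e.chromosome + ":" + e.txStart + "-" + e.txEnd
                + " (" + e.strand + "), " + e.exonStarts.size() + " exons");
    }
}
```

In real code the BioJava parser above does this work for you; the sketch is only meant to show what kind of information ends up in each `GeneChromosomePosition`.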

genomics/genebank.md

Lines changed: 99 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,99 @@
1+
Reading and writing a Genbank file
2+
==================================
3+
4+
There are multiple ways to read a Genbank file.
5+
6+
## Method 1: Read a Genbank file using the GenbankProxySequenceReader
7+
8+
```java
// Load the protein record NP_000257; the first argument is a local cache directory
GenbankProxySequenceReader<AminoAcidCompound> genbankProteinReader
        = new GenbankProxySequenceReader<AminoAcidCompound>("/tmp", "NP_000257", AminoAcidCompoundSet.getAminoAcidCompoundSet());
ProteinSequence proteinSequence = new ProteinSequence(genbankProteinReader);
genbankProteinReader.getHeaderParser().parseHeader(genbankProteinReader.getHeader(), proteinSequence);
System.out.println("Sequence" + "(" + proteinSequence.getAccession() + "," + proteinSequence.getLength() + ")=" +
        proteinSequence.getSequenceAsString().substring(0, 10) + "...");

// Load the nucleotide record NM_001126 in the same way
GenbankProxySequenceReader<NucleotideCompound> genbankDNAReader
        = new GenbankProxySequenceReader<NucleotideCompound>("/tmp", "NM_001126", DNACompoundSet.getDNACompoundSet());
DNASequence dnaSequence = new DNASequence(genbankDNAReader);
genbankDNAReader.getHeaderParser().parseHeader(genbankDNAReader.getHeader(), dnaSequence);
System.out.println("Sequence" + "(" + dnaSequence.getAccession() + "," + dnaSequence.getLength() + ")=" +
        dnaSequence.getSequenceAsString().substring(0, 10) + "...");
```
25+
26+
27+
## Method 2: Read a Genbank file using GenbankReaderHelper
28+
29+
```java
File dnaFile = new File("src/test/resources/NM_000266.gb");
File protFile = new File("src/test/resources/BondFeature.gb");

// Read all DNA sequences from the file, keyed by their identifier
LinkedHashMap<String, DNASequence> dnaSequences = GenbankReaderHelper.readGenbankDNASequence(dnaFile);
for (DNASequence sequence : dnaSequences.values()) {
    System.out.println(sequence.getSequenceAsString());
}

// Read all protein sequences in the same way
LinkedHashMap<String, ProteinSequence> protSequences = GenbankReaderHelper.readGenbankProteinSequence(protFile);
for (ProteinSequence sequence : protSequences.values()) {
    System.out.println(sequence.getSequenceAsString());
}
```
44+
45+
## Method 3: Read a Genbank file using the GenbankReader Object
46+
47+
```java
// dnaFile, protFile, dnaSequences and protSequences are the variables declared in Method 2
FileInputStream is = new FileInputStream(dnaFile);
GenbankReader<DNASequence, NucleotideCompound> dnaReader = new GenbankReader<DNASequence, NucleotideCompound>(
        is,
        new GenericGenbankHeaderParser<DNASequence, NucleotideCompound>(),
        new DNASequenceCreator(DNACompoundSet.getDNACompoundSet())
);
dnaSequences = dnaReader.process();
is.close();
System.out.println(dnaSequences);

is = new FileInputStream(protFile);
GenbankReader<ProteinSequence, AminoAcidCompound> protReader = new GenbankReader<ProteinSequence, AminoAcidCompound>(
        is,
        new GenericGenbankHeaderParser<ProteinSequence, AminoAcidCompound>(),
        new ProteinSequenceCreator(AminoAcidCompoundSet.getAminoAcidCompoundSet())
);
protSequences = protReader.process();
is.close();
System.out.println(protSequences);
```
70+
71+
72+
# Write a Genbank file
73+
74+
75+
Use the GenbankWriterHelper to write DNA sequences into a Genbank file.
76+
77+
```java
// First let's read some DNA sequences from a Genbank file
File dnaFile = new File("src/test/resources/NM_000266.gb");
LinkedHashMap<String, DNASequence> dnaSequences = GenbankReaderHelper.readGenbankDNASequence(dnaFile);
ByteArrayOutputStream fragwriter = new ByteArrayOutputStream();
ArrayList<DNASequence> seqs = new ArrayList<DNASequence>();
for (DNASequence seq : dnaSequences.values()) {
    seqs.add(seq);
}

// Now we have some DNA sequence data. The next step is to write it
GenbankWriterHelper.writeNucleotideSequence(fragwriter, seqs,
        GenbankWriterHelper.LINEAR_DNA);

// The fragwriter object now contains a string representation in the Genbank format,
// which you could write into a file
// or print out on the console
System.out.println(fragwriter.toString());
```
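
The comments above mention that the buffer could also be written into a file. A minimal sketch of that step, using only the JDK, might look like the following. `WriteBufferSketch` and `saveTo` are hypothetical names for illustration, and the buffer is filled with stand-in bytes here, whereas in the tutorial code it would be filled by `GenbankWriterHelper`.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical illustration class; not part of BioJava.
class WriteBufferSketch {

    /** Writes the buffered bytes to the target path, replacing any existing file. */
    static Path saveTo(ByteArrayOutputStream buffer, Path target) {
        try {
            return Files.write(target, buffer.toByteArray());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        ByteArrayOutputStream fragwriter = new ByteArrayOutputStream();
        // Stand-in content; a real buffer would hold the Genbank records
        byte[] standIn = "LOCUS       EXAMPLE\n//\n".getBytes(StandardCharsets.UTF_8);
        fragwriter.write(standIn, 0, standIn.length);

        Path target = Paths.get(System.getProperty("java.io.tmpdir"), "sequences.gb");
        saveTo(fragwriter, target);
        System.out.println("wrote " + target.toFile().length() + " bytes to " + target);
    }
}
```

Going through a byte buffer is convenient for small files. For large sequence sets it would likely be better to hand a `FileOutputStream` to the writer directly, assuming it accepts any `OutputStream`.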

genomics/genenames.md

Lines changed: 50 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,50 @@
1+
Parse Gene Name Information
2+
===========================
3+
4+
The following code parses [a file from the www.genenames.org](http://www.genenames.org/cgi-bin/download?title=HGNC+output+data&hgnc_dbtag=on&col=gd_app_sym&col=gd_app_name&col=gd_status&col=gd_prev_sym&col=gd_prev_name&col=gd_aliases&col=gd_pub_chrom_map&col=gd_pub_acc_ids&col=md_mim_id&col=gd_pub_refseq_ids&col=md_ensembl_id&col=md_prot_id&col=gd_hgnc_id&status=Approved&status_opt=2&where=((gd_pub_chrom_map%20not%20like%20%27%patch%%27%20and%20gd_pub_chrom_map%20not%20like%20%27%ALT_REF%%27)%20or%20gd_pub_chrom_map%20IS%20NULL)%20and%20gd_locus_group%20%3d%20%27protein-coding%20gene%27&order_by=gd_app_sym_sort&format=text&limit=&submit=submit&.cgifields=&.cgifields=chr&.cgifields=status&.cgifields=hgnc_dbtag)
6+
website that contains a mapping of human gene names to other databases.
7+
8+
9+
```java
/**
 * Parses a file from the genenames website.
 *
 * @param args not used
 */
public static void main(String[] args) {

    try {
        // Returns a list of beans that contain key-value pairs for each gene name
        List<GeneName> geneNames = GeneNamesParser.getGeneNames();

        System.out.println("got " + geneNames.size() + " gene names");

        for (GeneName g : geneNames) {
            if (g.getApprovedSymbol().equals("FOLH1")) {
                System.out.println(g);
            }
        }

    } catch (Exception e) {
        e.printStackTrace();
    }
}
```
36+
37+
If you have a local copy of this file, then you can just provide an input stream for it:
38+
39+
```java
URL url = new URL("file:///local/copy/of/file");

InputStreamProvider prov = new InputStreamProvider();
InputStream inStream = prov.getInputStream(url);

GeneNamesParser.getGeneNames(inStream);
```

genomics/gff.md

Lines changed: 52 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,52 @@
1+
Reading GFF files
2+
=================
3+
4+
The biojava3-genome library leverages the sequence relationships in biojava3-core to read (gtf,gff2,gff3) files and
5+
write gff3 files. The file formats for gtf, gff2, gff3 are well defined but what gets written in the file is very
6+
flexible. We currently provide support for reading gff files generated by open source gene prediction applications
7+
GeneID, GeneMark and GlimmerHMM. Each prediction algorithm uses a different ontology to describe coding sequence,
8+
exons, start or stop codon which makes it difficult to write a general purpose gff parser that can create biologically
9+
meaningful objects. If the application is simply loading a gff file and drawing a colored glyph then you don't need to
10+
worry about the ontology used. It is easier to support the popular gene prediction algorithms by writing a parser that
11+
is aware of each gene prediction application's ontology.
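
To make the "well defined format, flexible content" point concrete: a GFF3 line always has nine tab-separated columns (seqid, source, type, start, end, score, strand, phase, attributes), but the values in the type and attributes columns are up to the producing application. GTF and GFF2 use the same nine columns with a different attribute syntax. The following is a minimal standalone sketch of splitting one GFF3 line, using only the JDK. `Gff3LineSketch` and `Feature` are hypothetical names for illustration and not BioJava classes, and the sample line is made up.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration class; not part of BioJava.
class Gff3LineSketch {

    /** The columns of one GFF3 line that this sketch keeps. */
    static final class Feature {
        final String seqId;
        final String source;
        final String type;
        final int start;
        final int end;
        final char strand;
        final Map<String, String> attributes;

        Feature(String seqId, String source, String type, int start, int end,
                char strand, Map<String, String> attributes) {
            this.seqId = seqId;
            this.source = source;
            this.type = type;
            this.start = start;
            this.end = end;
            this.strand = strand;
            this.attributes = attributes;
        }
    }

    /** Splits one GFF3 line into its columns and its key=value attributes. */
    static Feature parseLine(String line) {
        String[] cols = line.split("\t", -1);
        if (cols.length != 9) {
            throw new IllegalArgumentException("expected 9 columns, got " + cols.length);
        }
        Map<String, String> attributes = new LinkedHashMap<String, String>();
        for (String pair : cols[8].split(";")) {
            int eq = pair.indexOf('=');
            if (eq > 0) {
                attributes.put(pair.substring(0, eq).trim(), pair.substring(eq + 1).trim());
            }
        }
        // score (cols[5]) and phase (cols[7]) are ignored in this sketch
        return new Feature(cols[0], cols[1], cols[2],
                Integer.parseInt(cols[3]), Integer.parseInt(cols[4]),
                cols[6].charAt(0), attributes);
    }

    public static void main(String[] args) {
        // Made-up example line, not real prediction output
        Feature f = parseLine("scaffold1\tGeneMark\tCDS\t100\t400\t.\t+\t0\tID=cds1;Parent=gene1");
        System.out.println(f.type + " on " + f.seqId + " " + f.start + "-" + f.end
                + " parent=" + f.attributes.get("Parent"));
    }
}
```

The parsing itself is simple. Deciding what a given `type` or attribute means for a particular gene prediction program is the part that needs application-specific knowledge.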
12+
13+
14+
The following code example takes a 454scaffold file that was used by GeneMark to predict genes and returns a
15+
collection of ChromosomeSequences. Each chromosome sequence maps to a named entry in the fasta file and would
16+
contain N gene sequences. The gene sequences can be on the + or - strand, with frame shifts and multiple transcripts.
17+
18+
Passing the collection of ChromosomeSequences to GeneFeatureHelper.getProteinSequences would return all protein
19+
sequences. You can then write the protein sequences to a fasta file.
20+
21+
```java
// Load the scaffold sequences and attach the gene features predicted by GeneMark
LinkedHashMap<String, ChromosomeSequence> chromosomeSequenceList =
        GeneFeatureHelper.loadFastaAddGeneFeaturesFromGeneMarkGTF(
                new File("454Scaffolds.fna"), new File("genemark_hmm.gtf"));

// Collect the protein sequences and write them to a fasta file
LinkedHashMap<String, ProteinSequence> proteinSequenceList =
        GeneFeatureHelper.getProteinSequences(chromosomeSequenceList.values());
FastaWriterHelper.writeProteinSequence(new File("genemark_proteins.faa"), proteinSequenceList.values());
```
27+
28+
You can also output the gene sequences to a fasta file in which the coding regions are upper case and the non-coding regions are lower case.
29+
30+
```java
LinkedHashMap<String, GeneSequence> geneSequenceHashMap =
        GeneFeatureHelper.getGeneSequences(chromosomeSequenceList.values());
Collection<GeneSequence> geneSequences = geneSequenceHashMap.values();
FastaWriterHelper.writeGeneSequence(new File("genemark_genes.fna"), geneSequences, true);
```
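
The upper/lower case convention can be illustrated without BioJava. The sketch below is not how `FastaWriterHelper` is implemented. It is a hypothetical helper (`CaseMaskSketch.maskCase`) that assumes coding regions are given as 1-based, inclusive start/end pairs, and shows the kind of output to expect.

```java
// Hypothetical illustration class; not part of BioJava.
class CaseMaskSketch {

    /**
     * Returns the sequence in lower case, with every position inside one of the
     * coding ranges (1-based, inclusive start and end) switched to upper case.
     */
    static String maskCase(String sequence, int[][] codingRanges) {
        char[] out = sequence.toLowerCase().toCharArray();
        for (int[] range : codingRanges) {
            // Clamp the range to the sequence so out-of-bounds input is tolerated
            int from = Math.max(range[0], 1);
            int to = Math.min(range[1], out.length);
            for (int i = from - 1; i < to; i++) {
                out[i] = Character.toUpperCase(out[i]);
            }
        }
        return new String(out);
    }

    public static void main(String[] args) {
        int[][] coding = { {3, 6}, {10, 11} };
        // prints acGTACgtaCGt
        System.out.println(maskCase("acgtacgtacgt", coding));
    }
}
```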
36+
37+
You can easily write out a gff3 view of a ChromosomeSequence with the following code.
38+
39+
```java
FileOutputStream fo = new FileOutputStream("genemark.gff3");
GFF3Writer gff3Writer = new GFF3Writer();
gff3Writer.write(fo, chromosomeSequenceList);
fo.close();
```
45+
46+
The chromosome sequence becomes the middle layer that represents the essence of what is mapped in a gtf, gff2 or
47+
gff3 file. This makes it fairly easy to write code to convert from gtf to gff3 or from gff2 to gtf. The challenge
48+
is picking the correct ontology for writing into gtf or gff2 formats. You could use feature names used by a
49+
specific gene prediction program or features supported by your favorite genome browser. We would like to provide a
50+
complete set of java classes to do these conversions where the list of supported gene prediction applications and
51+
genome browsers will get longer based on end user requests.
52+

genomics/img/genomics.png

12.2 KB

genomics/installation.md

Lines changed: 52 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,52 @@
1+
## Quick Installation
2+
3+
To begin, here is a quick description of how to get access to BioJava.
4+
5+
BioJava is open source and you can get the code from [Github](https://github.com/biojava/biojava). However, it might be easier to get it through Maven:
6+
7+
BioJava uses [Maven](http://maven.apache.org/) as a build and distribution system. If you are new to Maven, take a look at the [Getting Started with Maven](http://maven.apache.org/guides/getting-started/index.html) guide.
8+
9+
Currently, we provide a BioJava-specific Maven repository at (http://biojava.org/download/maven/).
10+
11+
You can add the BioJava repository by adding the following XML to your project pom.xml file:
12+
13+
```xml
<repositories>
    ...
    <repository>
        <id>biojava-maven-repo</id>
        <name>BioJava repository</name>
        <url>http://www.biojava.org/download/maven/</url>
    </repository>
</repositories>
```
23+
24+
We are currently in the process of moving our distribution to Maven Central, which will make this configuration step unnecessary. Next, declare the genomics module as a dependency:
25+
26+
```xml
<dependencies>
    ...

    <!-- This imports the latest version of the BioJava genomics module -->
    <dependency>
        <groupId>org.biojava</groupId>
        <artifactId>biojava3-genome</artifactId>
        <version>3.0.8</version>
        <!-- note: the genomics module depends on the BioJava-core module and will import it automatically -->
    </dependency>

    <!-- other biojava jars as needed -->

</dependencies>
```
44+
45+
If you run
46+
47+
<pre>
mvn package
</pre>
50+
51+
on your project, the BioJava dependencies will be automatically downloaded and installed for you.
52+
