GenbankReader patch for multiple sequences from single file#251
Conversation
-added 2 test cases to GenbankReaderTest
There was a problem hiding this comment.
Isn't SOURCE_TAG redundant to your DBSOURCE ?
Anyway, why don't you consider to group and sort this tags with the already defined above? Would increase readibility
There was a problem hiding this comment.
HI Paolo,
when I include the little print function in GenbankSequenceParser:
private void printSection(List<String[]> sec) {
if(sec!=null && sec.size()>0) {
for(String[] sa : sec) {
StringBuffer sb=new StringBuffer();
for(String s : sa) {
sb.append(" "+s);
}
log.debug(new String(sb)+"\n");
}
}
}
and let it run for the SOURCE and DBSOURCE tags I get quite different results:
for Example during the test case Reading: /tmp/254839678.gb
DBSOURCE pdb: molecule 3IAN, chain 65, release Jul 29, 2009;
deposition: Jul 14, 2009;
class: Hydrolase;
source: Mmdb_id: 75718, Pdb_id 1: 3IAN;
Exp. method: X-Ray Diffraction.
SOURCE Lactococcus lactis subsp. lactis
ORGANISM Lactococcus lactis subsp. lactis
Bacteria; Firmicutes; Bacilli; Lactobacillales; Streptococcaceae;
Lactococcus.
while for test case Reading: /tmp/NP_000257.gb
DBSOURCE REFSEQ: accession NM_000266.3
SOURCE Homo sapiens (human)
ORGANISM Homo sapiens
Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
Catarrhini; Hominidae; Homo.
I would conclude, that the two section keys are not redundant
CheersStefan
paolopavan <notifications@github.com> schrieb am 23:29 Donnerstag, 12.Februar 2015:
In biojava-core/src/main/java/org/biojava/nbio/core/sequence/io/GenbankSequenceParser.java:> @@ -127,6 +128,9 @@
protected static final Pattern readableFiles = Pattern.compile(".*(g[bp]k*$|\\u002eg[bp].*)"); protected static final Pattern headerLine = Pattern.compile("^LOCUS.*");
- private static final String DBSOURCE = "DBSOURCE";
Isn't SOURCE_TAG redundant to your DBSOURCE ?—
Reply to this email directly or view it on GitHub.
reason there is an increment added in GenericInsdcHeaderFormat which is responsible for this effect. Changing the increment to 0 patches the writer.
|
@paolopavan Do you have any other concerns? Looks good to me otherwise and I'd merge this in. |
|
Andreas and Stefan, |
|
@andreasprlic, I'm pretty coinvinced of the interpretation I gave 2 days ago. |
|
Note that in addition to the INSDC feature table standard shared by GenBank/EMBL/DDBJ which you linked to by http://www.insdc.org/files/feature_table.html there is also the separate GenBank information at ftp://ftp.ncbi.nih.gov/genbank/README.genbank and ftp://ftp.ncbi.nih.gov/genbank/docs/ which includes some of the header lines you are talking about. DBLINK was a replacement for the short-lived PROJECT line type (effective as of GenBank release 172) and does seem to still be in active use, eg http://www.ncbi.nlm.nih.gov/nuccore/NC_000913 DBSOURCE / SOURCE / ORGANISM are all somewhat redundant with the source feature(s). Probably chimeric records are interesting test cases here. e.g. bacteria with integrated viruses, see http://blastedbio.blogspot.jp/2013/11/entrez-trouble-with-chimeras.html |
GenbankReader patch for multiple sequences from single file
|
Ok merged in Stefan's patch, however it seems this topic requires additional work. |
|
Thank you Peter for pointing to the additional material. If there is no urgency, I can take care of the update of the Reader about 2015-02-18 7:22 GMT+01:00 Andreas Prlic notifications@github.com:
|
|
The changes here don't seem to change the API. Would it have been more appropriate to merge with the minor or patch branches? If so, we can cherry-pick the commits backwards (never merge master into minor or patch!). |
|
I can confirm that this change will not affect API. Multiple sequence In the next update (new data parsing) I will try to do the same. Maybe some 2015-02-19 10:30 GMT+01:00 Spencer Bliven notifications@github.com:
|
|
Let's discuss our branching policy on the mailing list. I think master should be the most active branch, and as such this patch is correctly in master. However the version number on master is wrong. |
No description provided.