Skip to content

GenbankReader patch for multiple sequences from single file#251

Merged
andreasprlic merged 3 commits into
biojava:masterfrom
stefanharjes:master
Feb 18, 2015
Merged

GenbankReader patch for multiple sequences from single file#251
andreasprlic merged 3 commits into
biojava:masterfrom
stefanharjes:master

Conversation

@stefanharjes

Copy link
Copy Markdown
Contributor

No description provided.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't SOURCE_TAG redundant to your DBSOURCE ?
Anyway, why don't you consider to group and sort this tags with the already defined above? Would increase readibility

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HI Paolo,
when I include the little print function in GenbankSequenceParser:
    private void printSection(List<String[]> sec) {
        if(sec!=null && sec.size()>0) {
            for(String[] sa : sec) {
                StringBuffer sb=new StringBuffer();
                for(String s : sa) {
                    sb.append(" "+s);
                }
                log.debug(new String(sb)+"\n");
            }
        }
    }

and let it run for the SOURCE and DBSOURCE tags I get quite different results:
for Example during the test case Reading: /tmp/254839678.gb
 DBSOURCE pdb: molecule 3IAN, chain 65, release Jul 29, 2009;
deposition: Jul 14, 2009;
class: Hydrolase;
source: Mmdb_id: 75718, Pdb_id 1: 3IAN;
Exp. method: X-Ray Diffraction.
 SOURCE Lactococcus lactis subsp. lactis
 ORGANISM Lactococcus lactis subsp. lactis
Bacteria; Firmicutes; Bacilli; Lactobacillales; Streptococcaceae;
Lactococcus.

while for test case Reading: /tmp/NP_000257.gb
 DBSOURCE REFSEQ: accession NM_000266.3
 SOURCE Homo sapiens (human)
 ORGANISM Homo sapiens
Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
Catarrhini; Hominidae; Homo.

I would conclude, that the two section keys are not redundant
CheersStefan

 paolopavan <notifications@github.com> schrieb am 23:29 Donnerstag, 12.Februar 2015:

In biojava-core/src/main/java/org/biojava/nbio/core/sequence/io/GenbankSequenceParser.java:> @@ -127,6 +128,9 @@

 protected static final Pattern readableFiles = Pattern.compile(".*(g[bp]k*$|\\u002eg[bp].*)");
 protected static final Pattern headerLine = Pattern.compile("^LOCUS.*");
  • private static final String DBSOURCE = "DBSOURCE";
    Isn't SOURCE_TAG redundant to your DBSOURCE ?—
    Reply to this email directly or view it on GitHub.

reason there is an increment added in GenericInsdcHeaderFormat which is
responsible for this effect. Changing the increment to 0 patches the
writer.
@andreasprlic

Copy link
Copy Markdown
Member

@paolopavan Do you have any other concerns? Looks good to me otherwise and I'd merge this in.

@paolopavan

Copy link
Copy Markdown
Contributor

Andreas and Stefan,
SOURCE_TAG is not redundant to DBSOURCE but actually I'm perplexed since I can't find those tags in any official reference, I principally refer to this link.
My suspect is that DBSOURCE and DBLINK tags are obsolete and they were replaced by db_xref as a qualifier of SOURCE feature (this scheme is supported by our parser).
So said, Bioperl support them and consider DBSOURCE and DBLINK equivalent (see here).
We could choose to support those tags as well and instantiate a db_xref object to be attached to SOURCE annotation.
Finally I cannot find PRIMARY key anywhere.
Maybe @peterjc at biopython can gracefully explain what strategy they use at their side?

@paolopavan

Copy link
Copy Markdown
Contributor

@andreasprlic, I'm pretty coinvinced of the interpretation I gave 2 days ago.
Still remaining open to external contributes, If you want I think you can merge this in, also since the main issue here is supporting multiple sequences. Then I (or @stefanharjes himself, if he wants) will manage DBSOURCE and DBLINK tags as explained above.

@peterjc

peterjc commented Feb 18, 2015

Copy link
Copy Markdown

Note that in addition to the INSDC feature table standard shared by GenBank/EMBL/DDBJ which you linked to by http://www.insdc.org/files/feature_table.html there is also the separate GenBank information at ftp://ftp.ncbi.nih.gov/genbank/README.genbank and ftp://ftp.ncbi.nih.gov/genbank/docs/ which includes some of the header lines you are talking about.

DBLINK was a replacement for the short-lived PROJECT line type (effective as of GenBank release 172) and does seem to still be in active use, eg http://www.ncbi.nlm.nih.gov/nuccore/NC_000913

DBSOURCE / SOURCE / ORGANISM are all somewhat redundant with the source feature(s). Probably chimeric records are interesting test cases here. e.g. bacteria with integrated viruses, see http://blastedbio.blogspot.jp/2013/11/entrez-trouble-with-chimeras.html

andreasprlic added a commit that referenced this pull request Feb 18, 2015
GenbankReader patch for multiple sequences from single file
@andreasprlic andreasprlic merged commit 98b08e8 into biojava:master Feb 18, 2015
@andreasprlic

Copy link
Copy Markdown
Member

Ok merged in Stefan's patch, however it seems this topic requires additional work.

@paolopavan

Copy link
Copy Markdown
Contributor

Thank you Peter for pointing to the additional material.
I'm not surprised at all about those doubts, it is a situation that I have
already seen. Unfortunely this format continue to change and parsers must
be adapted accordingly.

If there is no urgency, I can take care of the update of the Reader about
the debated issues.
Cheers!

2015-02-18 7:22 GMT+01:00 Andreas Prlic notifications@github.com:

Ok merged in Stefan's patch, however it seems this topic requires
additional work.


Reply to this email directly or view it on GitHub
#251 (comment).

@sbliven

sbliven commented Feb 19, 2015

Copy link
Copy Markdown
Member

The changes here don't seem to change the API. Would it have been more appropriate to merge with the minor or patch branches? If so, we can cherry-pick the commits backwards (never merge master into minor or patch!).

@paolopavan

Copy link
Copy Markdown
Contributor

I can confirm that this change will not affect API. Multiple sequence
reading was indeed already prepared in the data structures.

In the next update (new data parsing) I will try to do the same. Maybe some
new getter will be required, but this can be consider a minor version
increase (second level) since adding new methods does not affect backward
compatibility.

2015-02-19 10:30 GMT+01:00 Spencer Bliven notifications@github.com:

The changes here don't seem to change the API. Would it have been more
appropriate to merge with the minor or patch branches? If so, we can
cherry-pick the commits backwards (never merge master into minor or patch!).


Reply to this email directly or view it on GitHub
#251 (comment).

@andreasprlic

Copy link
Copy Markdown
Member

Let's discuss our branching policy on the mailing list. I think master should be the most active branch, and as such this patch is correctly in master. However the version number on master is wrong.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants