GenbankReader patch for multiple sequences from single file by stefanharjes · Pull Request #251 · biojava/biojava

stefanharjes · 2015-02-10T09:15:08Z

No description provided.


paolopavan · 2015-02-12T10:29:51Z

Isn't SOURCE_TAG redundant with your DBSOURCE?
Anyway, why not group and sort these tags with the ones already defined above? That would improve readability.

Hi Paolo,
when I include this little print function in GenbankSequenceParser:
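    // Debug helper: logs each entry of a parsed section, fields separated by spaces.
    private void printSection(List<String[]> sec) {
        if (sec != null && !sec.isEmpty()) {
            for (String[] sa : sec) {
                // StringBuilder suffices here; no synchronization is needed.
                StringBuilder sb = new StringBuilder();
                for (String s : sa) {
                    sb.append(' ').append(s);
                }
                log.debug(sb + "\n");
            }
        }
    }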

and let it run for the SOURCE and DBSOURCE tags, I get quite different results.
For example, in the test case reading /tmp/254839678.gb:
DBSOURCE pdb: molecule 3IAN, chain 65, release Jul 29, 2009;
deposition: Jul 14, 2009;
class: Hydrolase;
source: Mmdb_id: 75718, Pdb_id 1: 3IAN;
Exp. method: X-Ray Diffraction.
SOURCE Lactococcus lactis subsp. lactis
ORGANISM Lactococcus lactis subsp. lactis
Bacteria; Firmicutes; Bacilli; Lactobacillales; Streptococcaceae;
Lactococcus.

while in the test case reading /tmp/NP_000257.gb:
DBSOURCE REFSEQ: accession NM_000266.3
SOURCE Homo sapiens (human)
ORGANISM Homo sapiens
Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
Catarrhini; Hominidae; Homo.

I would conclude that the two section keys are not redundant.
Cheers,
Stefan
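To make the comparison above concrete, here is a minimal sketch of grouping header lines by their section key. It is not the code in GenbankSequenceParser; the class name `HeaderSectionSketch`, the method `sections`, and the column layout of the sample header (keyword at column 1, indented continuation and sub-keyword lines) are assumptions for illustration. The sample values are taken from the NP_000257.gb output quoted above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of BioJava.
public class HeaderSectionSketch {

    /** Groups header lines under the top-level keyword that starts at column 1. */
    static Map<String, List<String>> sections(String header) {
        Map<String, List<String>> out = new LinkedHashMap<>();
        List<String> current = null;
        for (String line : header.split("\n")) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines
            }
            if (!Character.isWhitespace(line.charAt(0))) {
                // A new top-level section: the keyword, then its first value.
                String[] parts = line.split("\\s+", 2);
                current = out.computeIfAbsent(parts[0], k -> new ArrayList<>());
                if (parts.length > 1) {
                    current.add(parts[1].trim());
                }
            } else if (current != null) {
                // Indented line: a continuation or sub-keyword of the current section.
                current.add(line.trim());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Assumed layout; the spacing was lost in the log output quoted above.
        String header =
            "DBSOURCE    REFSEQ: accession NM_000266.3\n"
          + "SOURCE      Homo sapiens (human)\n"
          + "  ORGANISM  Homo sapiens\n"
          + "            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;\n";
        sections(header).forEach((key, values) -> System.out.println(key + " -> " + values));
    }
}
```

In this sketch DBSOURCE and SOURCE end up as separate map entries with different content, which mirrors the observation that the two keys carry different information.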

paolopavan <notifications@github.com> wrote at 23:29 on Thursday, 12 February 2015:

In biojava-core/src/main/java/org/biojava/nbio/core/sequence/io/GenbankSequenceParser.java:

> @@ -127,6 +128,9 @@
> protected static final Pattern readableFiles = Pattern.compile(".*(g[bp]k*$|\\u002eg[bp].*)");
> protected static final Pattern headerLine = Pattern.compile("^LOCUS.*");
>
> private static final String DBSOURCE = "DBSOURCE";

Isn't SOURCE_TAG redundant to your DBSOURCE ?


andreasprlic · 2015-02-15T07:45:52Z

@paolopavan Do you have any other concerns? Looks good to me otherwise and I'd merge this in.

paolopavan · 2015-02-15T19:55:47Z

Andreas and Stefan,
SOURCE_TAG is not redundant with DBSOURCE, but I'm actually perplexed, since I can't find those tags in any official reference; I mainly refer to this link.
My suspicion is that the DBSOURCE and DBLINK tags are obsolete and were replaced by db_xref as a qualifier of the SOURCE feature (a scheme our parser already supports).
That said, BioPerl supports them and considers DBSOURCE and DBLINK equivalent (see here).
We could choose to support those tags as well and instantiate a db_xref object to be attached to the SOURCE annotation.
Finally, I cannot find the PRIMARY key anywhere.
Maybe @peterjc at Biopython could kindly explain what strategy they use on their side?
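As a rough illustration of the idea of turning a DBSOURCE line into a db_xref-style value, here is a minimal sketch. Everything in it is an assumption: `DbSourceSketch` and `DbXref` are hypothetical names, not BioJava classes, and the pattern only covers the simple `<DB>: accession <ID>` form seen in NP_000257.gb. The multi-line pdb form quoted earlier is deliberately left unparsed.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of BioJava.
public class DbSourceSketch {

    /** Hypothetical holder for a database cross-reference. */
    static final class DbXref {
        final String database;
        final String id;

        DbXref(String database, String id) {
            this.database = database;
            this.id = id;
        }

        @Override
        public String toString() {
            return database + ":" + id;
        }
    }

    // Assumed pattern: matches only "<DB>: accession <ID>".
    private static final Pattern SIMPLE =
        Pattern.compile("^(\\w+):\\s+accession\\s+(\\S+)$");

    /** Returns a cross-reference for the simple form, or empty for anything else. */
    static Optional<DbXref> parse(String dbsourceValue) {
        Matcher m = SIMPLE.matcher(dbsourceValue.trim());
        if (m.matches()) {
            return Optional.of(new DbXref(m.group(1), m.group(2)));
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(parse("REFSEQ: accession NM_000266.3")
            .map(DbXref::toString).orElse("(unparsed)"));
        System.out.println(parse("pdb: molecule 3IAN, chain 65, release Jul 29, 2009;")
            .map(DbXref::toString).orElse("(unparsed)"));
    }
}
```

Returning an empty `Optional` for unrecognized forms keeps the raw DBSOURCE text available to the caller, which matters as long as the set of forms in real records is not fully known.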

paolopavan · 2015-02-17T19:00:23Z

@andreasprlic, I'm pretty convinced of the interpretation I gave 2 days ago.
While I remain open to outside contributions, I think you can merge this in if you want, especially since the main issue here is supporting multiple sequences. Then I (or @stefanharjes himself, if he wants) will handle the DBSOURCE and DBLINK tags as explained above.

peterjc · 2015-02-18T05:47:41Z

Note that in addition to the INSDC feature table standard shared by GenBank/EMBL/DDBJ, which you linked to at http://www.insdc.org/files/feature_table.html, there is also the separate GenBank information at ftp://ftp.ncbi.nih.gov/genbank/README.genbank and ftp://ftp.ncbi.nih.gov/genbank/docs/, which includes some of the header lines you are talking about.

DBLINK was a replacement for the short-lived PROJECT line type (effective as of GenBank release 172) and does seem to still be in active use, e.g. http://www.ncbi.nlm.nih.gov/nuccore/NC_000913

DBSOURCE / SOURCE / ORGANISM are all somewhat redundant with the source feature(s). Probably chimeric records are interesting test cases here. e.g. bacteria with integrated viruses, see http://blastedbio.blogspot.jp/2013/11/entrez-trouble-with-chimeras.html


andreasprlic · 2015-02-18T06:22:06Z

Ok merged in Stefan's patch, however it seems this topic requires additional work.

paolopavan · 2015-02-18T23:37:40Z

Thank you Peter for pointing to the additional material.
I'm not surprised at all by those doubts; it is a situation that I have
already seen. Unfortunately, this format continues to change and parsers must
be adapted accordingly.

If there is no urgency, I can take care of updating the Reader for
the debated issues.
Cheers!


sbliven · 2015-02-19T09:30:53Z

The changes here don't seem to change the API. Would it have been more appropriate to merge with the minor or patch branches? If so, we can cherry-pick the commits backwards (never merge master into minor or patch!).

paolopavan · 2015-02-19T11:34:17Z

I can confirm that this change will not affect API. Multiple sequence
reading was indeed already prepared in the data structures.
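
For readers unfamiliar with what reading multiple sequences from one file involves, here is a minimal sketch. It is not the patch merged here: `MultiRecordSketch` and `splitRecords` are hypothetical names, and the two sample records are made-up, heavily truncated placeholders. It relies only on the GenBank flat-file convention that each record is terminated by a `//` line.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the BioJava implementation.
public class MultiRecordSketch {

    /** Splits flat-file text into records; a line starting with "//" closes a record. */
    static List<List<String>> splitRecords(String text) {
        List<List<String>> records = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String line : text.split("\n")) {
            if (line.startsWith("//")) {
                if (!current.isEmpty()) {
                    records.add(current);
                    current = new ArrayList<>();
                }
            } else if (!line.trim().isEmpty()) {
                current.add(line); // blank lines are skipped
            }
        }
        if (!current.isEmpty()) {
            records.add(current); // tolerate a missing final terminator
        }
        return records;
    }

    public static void main(String[] args) {
        // Placeholder records; SEQ_A and SEQ_B are invented identifiers.
        String text =
            "LOCUS       SEQ_A\nORIGIN\n        1 atgc\n//\n"
          + "LOCUS       SEQ_B\nORIGIN\n        1 ggcc\n//\n";
        List<List<String>> records = splitRecords(text);
        System.out.println(records.size());
        for (List<String> record : records) {
            System.out.println(record.get(0));
        }
    }
}
```

A real reader would stream from the file rather than hold the whole text in memory, and would hand each record to the existing parser instead of collecting raw lines.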

In the next update (new data parsing) I will try to do the same. Maybe some
new getters will be required, but this can be considered a minor version
increase (second level), since adding new methods does not affect backward
compatibility.


andreasprlic · 2015-02-19T15:13:12Z

Let's discuss our branching policy on the mailing list. I think master should be the most active branch, and as such this patch is correctly in master. However the version number on master is wrong.

stefanharjes added 2 commits February 11, 2015 05:50

-patch GenbankReader to read several sequences from a single file

c860f19

-added 2 test cases to GenbankReaderTest

-new resource files for two new test cases of GenbankReaderTest

6721c1c

paolopavan reviewed Feb 12, 2015

GenbankWriter increments the start position of each location. For some reason there is an increment added in GenericInsdcHeaderFormat which is responsible for this effect. Changing the increment to 0 patches the writer.

021bf4e

andreasprlic added a commit that referenced this pull request Feb 18, 2015

Merge pull request #251 from stefanharjes/master

98b08e8

GenbankReader patch for multiple sequences from single file

andreasprlic merged commit 98b08e8 into biojava:master Feb 18, 2015

andreasprlic merged 3 commits into biojava:master from stefanharjes:master
