Commit d4b0bba

Add substantially new unified class for identifying structures: StructureIdentifier

This unifies various prior methods for identifying a structure from a string, which was previously spread over AtomCache methods and StructureTools/StructureIO utilities. It includes PDB IDs, residue range specifications, URLs, SCOP/CATH/PDP domains, and files.

Major Classes:

* StructureIdentifier is an interface for methods which know how to load a structure from some resource and reduce it to the requested substructure. The most important property is the identifier, which can be an arbitrary string.
* SubstructureIdentifier is considered the canonical way to specify a structure and is the major implementation for chain and residue level substructures. All identifiers can be converted to a SubstructureIdentifier.
* StructureName is suitable for wrapping user-supplied identifiers, and it dispatches the request to a more specific class based on a guess as to the type (PDB, URL, file, etc).

StructureIdentifiers can represent arbitrary strings (e.g. domain IDs). These are converted (possibly through some relatively slow process like downloading and parsing a file) into standard-format SubstructureIdentifier instances, which should be easier to serialize and recreate.

Detailed changes:

* Substantially changes StructureIdentifier, which existed but wasn't used anywhere.
  * Remove getPdbId() method, since not globally relevant. It's still present in most implementations, and can always be accessed via toCanonical().getPdbId()
* AtomCache
  * Accept chain indices ("4HHB.0") with a warning if there was not already a chain with that ID.

This commit rebases and squashes pre-4.1 development from sbliven's fix81 branch culminating in 007ea6e. For posterity, the original commit messages are listed below. Some changes may be omitted or modified when resolving the rebase.

Commit messages (oldest to youngest):

---
Major changes to StructureIdentifier (biojava#81)

Redefines StructureIdentifier as something which transforms a structure.
This will replace all of the disjoint places where strings are parsed into various structures and ranges and identifiers.

---
More work on using StructureIdentifiers

* Add AtomCache.getStructure(StructureIdentifier) method, and change other methods to use it
* Implement StructureName and other StructureIdentifiers
* Remove the 'A:+1-+5' range syntax from StructureTools. It was stupid.
* Remove string parsing from lots of places and replace with StructureIdentifiers
* Fix lots of tests

---
Adding support for models in SubstructureIdentifier

---
More improvements for StructureIdentifiers

* Accept chain indices ("4HHB.0") with a warning if there was not already a chain with that ID.
* Fixing bugs in StructureName due to differences with SubstructureIdentifier
* Removing StructureName.compareTo method, since they aren't well ordered
* Changing more AtomCache methods to use StructureIdentifiers
* Test fixes

---
Fixing bugs with loading structures from files and urls

These are now valid StructureName values. They are implemented using the PassthroughIdentifier, which makes AtomCache responsible for fetching the right structure.

---
Implementing PDP identifiers in StructureName

PDP parsing doesn't have good test coverage, but my basic checks work.

---
Last test fix. All (existing) tests now pass.

---
Fix some AtomCache synchronization.

---
Modifying StructureIdentifier interface

- Add loadStructure() method
- Remove getPdbId() method, since not globally relevant. It's still present in most implementations, and can always be accessed via toCanonical().getPdbId()
- In AtomCache, rename protected loadStructureByPdbId to public getStructureForPdbId() to bypass StructureIdentifier parsing

---
Fix bug loading PDP structures

---

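
The conversion the message describes, from an arbitrary identifier string to a canonical substructure string, can be illustrated with a small self-contained sketch. It does not use BioJava: `ToyIdentifier`, `CanonicalId`, `ToyDomainId`, and the hard-coded lookup table are hypothetical stand-ins (the table replaces the slow download-and-parse step), and the single mapping is the one asserted in the CathDomainTest change of this commit.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the StructureIdentifier idea; not BioJava's API.
interface ToyIdentifier {
    String getIdentifier();
    CanonicalId toCanonical();
}

// Canonical form: "pdbId" or "pdbId.ranges"; converting it again is a no-op.
final class CanonicalId implements ToyIdentifier {
    private final String pdbId;
    private final String ranges; // empty string means the whole structure

    CanonicalId(String pdbId, String ranges) {
        this.pdbId = pdbId;
        this.ranges = ranges;
    }

    String getPdbId() { return pdbId; }

    @Override
    public String getIdentifier() {
        return ranges.isEmpty() ? pdbId : pdbId + "." + ranges;
    }

    @Override
    public CanonicalId toCanonical() { return this; }
}

// A domain ID is an opaque string until its definition is looked up.
final class ToyDomainId implements ToyIdentifier {
    // Assumption: a fixed table stands in for downloading domain definitions.
    private static final Map<String, CanonicalId> DEFINITIONS;
    static {
        Map<String, CanonicalId> m = new HashMap<String, CanonicalId>();
        m.put("1qvrC03", new CanonicalId("1qvr", "C_332-400,C_514-540"));
        DEFINITIONS = Collections.unmodifiableMap(m);
    }

    private final String id;

    ToyDomainId(String id) { this.id = id; }

    @Override
    public String getIdentifier() { return id; }

    @Override
    public CanonicalId toCanonical() {
        CanonicalId c = DEFINITIONS.get(id);
        if (c == null) {
            throw new IllegalArgumentException("Unknown domain: " + id);
        }
        return c;
    }
}

final class CanonicalSketch {
    public static void main(String[] args) {
        ToyIdentifier id = new ToyDomainId("1qvrC03");
        System.out.println(id.getIdentifier() + " -> " + id.toCanonical().getIdentifier());
    }
}
```

In this toy model the PDB ID is only available on the canonical form, which mirrors the message's note that getPdbId() remains reachable through toCanonical().
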
1 parent 37bdc1e commit d4b0bba

29 files changed: +1510 -838 lines changed

biojava-integrationtest/src/test/java/org/biojava/nbio/structure/test/StructureToolsTest.java

Lines changed: 7 additions & 4 deletions
@@ -126,6 +126,7 @@ public void testGetNrAtoms(){
         assertEquals("did not find the expected number of Atoms (1087), but got " + length,1087,length);
     }
 
+    @SuppressWarnings("deprecation")
     public void testGetSubRanges() throws StructureException {
         String range;
         Structure substr;
@@ -230,13 +231,13 @@ public void testGetSubRanges() throws StructureException {
         try {
             range = "7-10";
             substr = StructureTools.getSubRanges(structure2, range);
-            fail("Illegal range '"+range+"'. Should throw StructureException");
-        } catch(StructureException ex) {} //expected
+            fail("Illegal range '"+range+"'. Should throw IllegalArgumentException");
+        } catch(IllegalArgumentException ex) {} //expected
         try {
             range = "A7-10";
             substr = StructureTools.getSubRanges(structure2, range);
-            fail("Illegal range '"+range+"'. Should throw StructureException");
-        } catch(StructureException ex) {} //expected
+            fail("Illegal range '"+range+"'. Should throw IllegalArgumentException");
+        } catch(IllegalArgumentException ex) {} //expected
     }
 
     public void testRevisedConvention() throws IOException, StructureException{
@@ -319,6 +320,7 @@ public void testRevisedConvention() throws IOException, StructureException{
      * Test some subranges that we used to have problems with
      * @throws StructureException
      */
+    @SuppressWarnings("deprecation")
     public void testGetSubRangesExtended() throws StructureException {
         String range;
         Structure substr;
@@ -379,6 +381,7 @@ public void testGetSubRangesExtended() throws StructureException {
      * Test insertion codes
      * @throws StructureException
      */
+    @SuppressWarnings("deprecation")
     public void testGetSubRangesInsertionCodes() throws StructureException {
         String range;
         Structure substr;

biojava-integrationtest/src/test/java/org/biojava/nbio/structure/test/cath/CathDomainTest.java

Lines changed: 1 addition & 1 deletion
@@ -36,6 +36,6 @@ public class CathDomainTest {
     public void test() {
         String id = "1qvrC03";
         CathDomain domain = CathFactory.getCathDatabase().getDomainByCathId(id);
-        assertEquals("1qvr.C_332-400,C_514-540", domain.getIdentifier());
+        assertEquals("1qvr.C_332-400,C_514-540", domain.toCanonical().getIdentifier());
     }
 }

biojava-structure/src/main/java/org/biojava/nbio/structure/AtomPositionMap.java

Lines changed: 9 additions & 0 deletions
@@ -165,6 +165,15 @@ public AtomPositionMap(Atom[] atoms, GroupMatcher matcher) {
         treeMap = new TreeMap<ResidueNumber, Integer>(vc);
         treeMap.putAll(hashMap);
     }
+
+    /**
+     * Creates a new AtomPositionMap containing representative atoms
+     * from a structure.
+     * @param s
+     */
+    public AtomPositionMap(Structure s) {
+        this(StructureTools.getRepresentativeAtomArray(s));
+    }
 
     /**
      * Calculates the number of residues of the specified chain in a given range, inclusive.

biojava-structure/src/main/java/org/biojava/nbio/structure/Identifier.java

Lines changed: 6 additions & 17 deletions
@@ -23,36 +23,25 @@
 
 package org.biojava.nbio.structure;
 
+import org.biojava.nbio.structure.align.client.StructureName;
 import org.biojava.nbio.structure.align.util.AtomCache;
-import org.biojava.nbio.structure.cath.CathFactory;
-import org.biojava.nbio.structure.scop.ScopFactory;
 
 /**
  * A collection of utilities to create {@link StructureIdentifier StructureIdentifiers}.
  * @author dmyersturnbull
+ * @deprecated Use {@link StructureName} instead. Deprecated in v. 4.2.0
  */
+@Deprecated
 public class Identifier {
 
-    private static final String CATH_PATTERN = "[0-9][a-z0-9]{3}.[0-9]{2}";
-    private static final String SCOP_PATTERN = "d[0-9][a-zA-Z0-9]{3,4}([a-zA-Z][0-9_]|\\.[0-9]+)";
-
     /**
      * Loads a {@link StructureIdentifier} from the specified string.
      * The type returned for any particular string can be considered relatively stable
      * but should not be relied on.
-     *
+     * @deprecated Create a new {@link StructureName} instead.
      */
+    @Deprecated
     public static StructureIdentifier loadIdentifier(String id, AtomCache cache) {
-        if (id.matches(CATH_PATTERN)) {
-            return CathFactory.getCathDatabase().getDescriptionByCathId(id);
-        } else if (id.matches(SCOP_PATTERN)) {
-            return ScopFactory.getSCOP().getDomainByScopID(id);
-        }
-        try {
-            return new SubstructureIdentifier(id, cache);
-        } catch (Exception e) {
-            throw new IllegalArgumentException("Couldn't understand id " + id, e);
-        }
+        return new StructureName(id);
     }
-
 }
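
For reference, the pattern-based dispatch that this change removes from `loadIdentifier` can be reproduced in isolation. The sketch below is self-contained and reuses only the two regular expressions from the deleted constants; the `IdTypeGuesser` class and its `Kind` enum are hypothetical names, and this is not how StructureName itself decides (its implementation is not part of this excerpt).

```java
import java.util.regex.Pattern;

// Hypothetical helper reproducing the order of checks in the removed code.
final class IdTypeGuesser {
    enum Kind { CATH, SCOP, OTHER }

    // Patterns copied from the removed CATH_PATTERN and SCOP_PATTERN constants.
    private static final Pattern CATH =
            Pattern.compile("[0-9][a-z0-9]{3}.[0-9]{2}");
    private static final Pattern SCOP =
            Pattern.compile("d[0-9][a-zA-Z0-9]{3,4}([a-zA-Z][0-9_]|\\.[0-9]+)");

    // CATH is tested first, then SCOP; anything else falls through.
    static Kind guess(String id) {
        if (CATH.matcher(id).matches()) {
            return Kind.CATH;
        }
        if (SCOP.matcher(id).matches()) {
            return Kind.SCOP;
        }
        return Kind.OTHER;
    }

    public static void main(String[] args) {
        for (String id : new String[] {"1qvrC03", "d2bq6a1", "4HHB.C"}) {
            System.out.println(id + " -> " + guess(id));
        }
    }
}
```

Note that the `.` in the CATH pattern is unescaped, so it accepts any character in the chain position, and that the PDB portion only accepts lowercase letters.
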
Lines changed: 46 additions & 0 deletions
@@ -0,0 +1,46 @@
+package org.biojava.nbio.structure;
+
+import java.io.IOException;
+import java.util.ArrayList;
+
+import org.biojava.nbio.structure.align.util.AtomCache;
+
+/**
+ * A stub StructureIdentifier, representing the full structure in all cases.
+ * @author Spencer Bliven
+ *
+ */
+public class PassthroughIdentifier implements StructureIdentifier {
+
+    private String identifier;
+    public PassthroughIdentifier(String identifier) {
+        this.identifier = identifier;
+    }
+    @Override
+    public String getIdentifier() {
+        return identifier;
+    }
+
+    /**
+     * @return A SubstructureIdentifier without ranges (e.g. including all residues)
+     */
+    @Override
+    public SubstructureIdentifier toCanonical() {
+        return new SubstructureIdentifier(null, new ArrayList<ResidueRange>());
+    }
+
+    @Override
+    public Structure reduce(Structure input) throws StructureException {
+        return input;
+    }
+    /**
+     * Passthrough identifiers don't know how to load a structure
+     * @return null
+     */
+    @Override
+    public Structure loadStructure(AtomCache cache) throws StructureException,
+            IOException {
+        return null;
+    }
+
+}

biojava-structure/src/main/java/org/biojava/nbio/structure/ResidueNumber.java

Lines changed: 4 additions & 1 deletion
@@ -157,9 +157,12 @@ public String toPDB() {
      * The string representation can be a integer followed by a character.
      *
      * @param pdb_code
-     * @return a ResidueNumber object
+     * @return a ResidueNumber object, or null if the input was null
      */
     public static ResidueNumber fromString(String pdb_code) {
+        if(pdb_code == null)
+            return null;
+
         ResidueNumber residueNumber = new ResidueNumber();
         Integer resNum = null;
         String icode = null;

biojava-structure/src/main/java/org/biojava/nbio/structure/ResidueRange.java

Lines changed: 40 additions & 6 deletions
@@ -31,6 +31,10 @@
 /**
  * A chain, a start residue, and an end residue.
  *
+ * Chain may be null when referencing a single-chain structure; for multi-chain
+ * structures omitting the chain is an error. Start and/or end may also be null,
+ * which is interpreted as the first and last residues in the chain, respectively.
+ *
  * @author dmyerstu
  * @see ResidueNumber
  * @see org.biojava.nbio.structure.ResidueRangeAndLength
@@ -58,8 +62,24 @@ public class ResidueRange {
     public static final Pattern CHAIN_REGEX = Pattern.compile("^\\s*([a-zA-Z0-9]+|_)$");
 
     /**
-     * @param s
-     *            A string of the form chain_start-end or chain.start-end. For example: <code>A.5-100</code> or <code>A_5-100</code>.
+     * Parse the residue range from a string. Several formats are accepted:
+     * <ul>
+     * <li> chain.start-end
+     * <li> chain.residue
+     * <li> chain_start-end (for better filename compatibility)
+     * </ul>
+     *
+     * <p>Residues can be positive or negative and may include insertion codes.
+     * See {@link ResidueNumber#fromString(String)}.
+     *
+     * <p>Examples:
+     * <ul>
+     * <li><code>A.5-100</code>
+     * <li><code>A_5-100</code>
+     * <li><code>A_-5</code>
+     * <li><code>A.-12I-+12I
+     *
+     * @param s residue string to parse
      * @return The unique ResidueRange corresponding to {@code s}
      */
     public static ResidueRange parse(String s) {
@@ -71,27 +91,41 @@ public static ResidueRange parse(String s) {
                 chain = matcher.group(1);
                 if (matcher.group(2) != null) {
                     start = ResidueNumber.fromString(matcher.group(2));
-                    end = ResidueNumber.fromString(matcher.group(3));
                     start.setChainId(chain);
-                    end.setChainId(chain);
+                    if(matcher.group(3) == null) {
+                        // single-residue range
+                        end = start;
+                    } else {
+                        end = ResidueNumber.fromString(matcher.group(3));
+                        end.setChainId(chain);
+                    }
                 }
+                return new ResidueRange(chain, start, end);
             } catch (IllegalStateException e) {
                 throw new IllegalArgumentException("Range " + s + " was not valid", e);
             }
-            return new ResidueRange(chain, start, end);
         } else if (CHAIN_REGEX.matcher(s).matches()) {
             return new ResidueRange(s, (ResidueNumber)null, null);
         }
-        throw new IllegalArgumentException("Could not understand ResidueRange string " + s);
+        throw new IllegalArgumentException("Illegal ResidueRange format:" + s);
     }
 
     /**
      * @param s
      *            A string of the form chain_start-end,chain_start-end, ... For example:
      *            <code>A.5-100,R_110-190,Z_200-250</code>.
      * @return The unique ResidueRange corresponding to {@code s}.
+     * @see #parse(String)
      */
     public static List<ResidueRange> parseMultiple(String s) {
+        s = s.trim();
+        // trim parentheses, for backwards compatibility
+        if ( s.startsWith("("))
+            s = s.substring(1);
+        if ( s.endsWith(")")) {
+            s = s.substring(0,s.length()-1);
+        }
+
         String[] parts = s.split(",");
         List<ResidueRange> list = new ArrayList<ResidueRange>(parts.length);
         for (String part : parts) {
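
The accepted range formats and the parenthesis trimming in this file can be tried out with a standalone sketch. The range regular expression used by `ResidueRange` is not visible in this diff, so `RANGE` below is an assumed grammar built from the documented formats; `RangeSketch` is a hypothetical class that keeps residues as plain strings instead of `ResidueNumber` objects.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical, simplified stand-in for ResidueRange parsing.
final class RangeSketch {
    // Assumed residue syntax: optional sign, digits, optional insertion code.
    private static final String RES = "[-+]?[0-9]+[A-Za-z]?";
    // Assumed grammar: chain, '.' or '_', start residue, optional '-' end residue.
    private static final Pattern RANGE = Pattern.compile(
            "^\\s*([a-zA-Z0-9]+|_)[._](" + RES + ")(?:-(" + RES + "))?\\s*$");
    // Chain-only pattern as shown in the diff context above.
    private static final Pattern CHAIN = Pattern.compile("^\\s*([a-zA-Z0-9]+|_)$");

    final String chain;
    final String start; // null means "from the first residue"
    final String end;   // null means "to the last residue"

    RangeSketch(String chain, String start, String end) {
        this.chain = chain;
        this.start = start;
        this.end = end;
    }

    static RangeSketch parse(String s) {
        Matcher m = RANGE.matcher(s);
        if (m.matches()) {
            String start = m.group(2);
            // A missing end residue means a single-residue range.
            String end = (m.group(3) == null) ? start : m.group(3);
            return new RangeSketch(m.group(1), start, end);
        }
        Matcher c = CHAIN.matcher(s);
        if (c.matches()) {
            return new RangeSketch(c.group(1), null, null);
        }
        throw new IllegalArgumentException("Illegal range format: " + s);
    }

    static List<RangeSketch> parseMultiple(String s) {
        s = s.trim();
        // Strip one pair of surrounding parentheses, as the diff does.
        if (s.startsWith("(")) {
            s = s.substring(1);
        }
        if (s.endsWith(")")) {
            s = s.substring(0, s.length() - 1);
        }
        List<RangeSketch> out = new ArrayList<RangeSketch>();
        for (String part : s.split(",")) {
            out.add(parse(part));
        }
        return out;
    }

    public static void main(String[] args) {
        for (RangeSketch r : parseMultiple("(A.5-100,R_110-190,Z_200-250)")) {
            System.out.println(r.chain + " " + r.start + " " + r.end);
        }
    }
}
```

Under this assumed grammar, strings without a chain separator such as `7-10` or `A7-10` are rejected with an IllegalArgumentException, which is the exception type the updated StructureToolsTest expects for those inputs.
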

biojava-structure/src/main/java/org/biojava/nbio/structure/StructureIdentifier.java

Lines changed: 53 additions & 45 deletions
@@ -23,65 +23,73 @@
 
 package org.biojava.nbio.structure;
 
-import java.util.List;
+import java.io.IOException;
+
+import org.biojava.nbio.structure.align.util.AtomCache;
+
 
 /**
- * An identifier that <em>uniquely</em> identifies a whole {@link Structure} or arbitrary substructure,
- * including whole chains, {@link org.biojava.nbio.structure.scop.ScopDomain ScopDomains}, and {@link org.biojava.nbio.structure.cath.CathDomain CathDomains}.
+ * An identifier that <em>uniquely</em> identifies a whole {@link Structure} or
+ * arbitrary substructure. Common examples would be reducing a structure to a
+ * single chain, domain, or residue range.
+ *
+ * StructureIdentifiers are represented by unique strings. The getId() and fromId()
+ * methods convert to and from the string representation.
+ *
+ * Implementations should provide a constructor which takes a String. A static
+ * <tt>fromId(String)</tt> method is also recommended.
+ *
  * @author dmyersturnbull
+ * @author Spencer Bliven
  */
 public interface StructureIdentifier {
 
     /**
-     * The unique identifier, using the following formal specification:
-     * <pre>
-     * name     := pdbID
-     *           | pdbID '.' chainID
-     *           | pdbID '.' range
-     *           | scopID
-     * range    := '('? range (',' range)? ')'?
-     *           | chainID
-     *           | chainID '_' resNum '-' resNum
-     * pdbID    := [0-9][a-zA-Z0-9]{3}
-     * chainID  := [a-zA-Z0-9]
-     * scopID   := 'd' pdbID [a-z_][0-9_]
-     * cathID   := pdbID [A-Z][0-9]{2}
-     * resNum   := [-+]?[0-9]+[A-Za-z]?
-     * </pre>
-     * For example:
-     * <pre>
-     * 1TIM     #whole structure
-     * 1tim     #same as above
-     * 4HHB.C   #single chain
-     * 3AA0.A,B #two chains
-     * d2bq6a1  #SCOP domain
-     * 1cukA01  #CATH domain
-     * 4GCR.A_1-40 #substructure
-     * 3iek.A_17-28,A_56-294,A_320-377 #substructure of 3 disjoint parts
-     * </pre>
-     * More options may be added to the specification at a future time.
+     * Get the String form of this identifier.
+     * @return The String form of this identifier
      */
     String getIdentifier();
-
+
+
     /**
-     * Returns the PDB identifier associated with this StructureIdentifier.
+     * Loads a structure encompassing the structure identified.
+     * The Structure returned should be suitable for passing as
+     * the input to {@link #reduce(Structure)}.
+     *
+     * It is recommended that the most complete structure available be returned
+     * (e.g. the full PDB) to allow processing of unselected portions where
+     * appropriate.
+     * @param AtomCache A potential sources of structures
+     * @return A Structure containing at least the atoms identified by this,
+     *  or null if Structures are not applicable.
+     * @throws StructureException For errors loading and parsing the structure
+     * @throws IOException Errors reading the structure from disk
      */
-    String getPdbId();
+    Structure loadStructure(AtomCache cache) throws StructureException, IOException;
 
     /**
-     * Returns the list of {@link ResidueRange ResidueRanges} that this StructureIdentifier defines.
-     * This is a unique representation.
+     * Convert to a canonical SubstructureIdentifier.
+     *
+     * <p>This allows all domains to be converted to a standard format String.
+     * @return A SubstructureIdentifier equivalent to this
+     * @throws StructureException Wraps exceptions that may be thrown by individual
+     *  implementations. For example, a SCOP identifier may require that the
+     *  domain definitions be available for download.
      */
-    List<? extends ResidueRange> getResidueRanges();
-
+    SubstructureIdentifier toCanonical() throws StructureException;
+
     /**
-     * Returns a list of ranges of the form described in {@link #getIdentifier()}. For example:
-     * <pre>
-     * getRanges().get(0): 'A'
-     * getRanges().get(1): 'B_5-100'
-     * </pre>
-     * This is a unique representation.
+     * Takes a complete structure as input and reduces it to the substructure
+     * represented by this StructureIdentifier.
+     *
+     * <p>The returned structure may be a shallow copy of the input, with shared
+     * Chains, Residues, etc.
+     * @param input A full structure, e.g. as loaded from the PDB. The structure
+     * ID should match that returned by getPdbId(), if applicable.
+     * @return
+     * @throws StructureException
+     * @see StructureTools#getReducedStructure(Structure, String)
      */
-    List<String> getRanges();
-
+    Structure reduce(Structure input) throws StructureException;
+
 }
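
The two-step contract documented in this interface, loading the most complete structure and then reducing it, can be sketched with toy types. Everything here is hypothetical and independent of BioJava: a "structure" is just a map from chain ID to a made-up sequence, `ToySource` plays the role of a cache, and the null fallback in `resolve` is an assumption about how a caller might handle identifiers that, like PassthroughIdentifier, return null from loadStructure.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// All types below are hypothetical toys; a "structure" is chain ID -> sequence.
interface ToySource {
    Map<String, String> fetch(String name);
}

interface ToyStructureId {
    String getIdentifier();
    // May return null when the identifier cannot load anything itself.
    Map<String, String> loadStructure(ToySource source);
    Map<String, String> reduce(Map<String, String> input);
}

// Selects a single chain, e.g. "4HHB.C".
final class ToyChainId implements ToyStructureId {
    private final String pdbId;
    private final String chain;

    ToyChainId(String pdbId, String chain) {
        this.pdbId = pdbId;
        this.chain = chain;
    }

    @Override
    public String getIdentifier() { return pdbId + "." + chain; }

    @Override
    public Map<String, String> loadStructure(ToySource source) {
        return source.fetch(pdbId); // load the full entry, not just the chain
    }

    @Override
    public Map<String, String> reduce(Map<String, String> input) {
        if (!input.containsKey(chain)) {
            throw new IllegalArgumentException("No chain " + chain + " in " + pdbId);
        }
        return Collections.singletonMap(chain, input.get(chain));
    }
}

// Mirrors the passthrough idea: no loading of its own, no reduction.
final class ToyPassthroughId implements ToyStructureId {
    private final String identifier;

    ToyPassthroughId(String identifier) { this.identifier = identifier; }

    @Override
    public String getIdentifier() { return identifier; }

    @Override
    public Map<String, String> loadStructure(ToySource source) { return null; }

    @Override
    public Map<String, String> reduce(Map<String, String> input) { return input; }
}

final class LoadReduceSketch {
    // Load, fall back to the source if the identifier returned null, then reduce.
    static Map<String, String> resolve(ToyStructureId id, ToySource source) {
        Map<String, String> full = id.loadStructure(source);
        if (full == null) {
            full = source.fetch(id.getIdentifier());
        }
        return id.reduce(full);
    }

    public static void main(String[] args) {
        final Map<String, String> entry = new LinkedHashMap<String, String>();
        entry.put("A", "AAAA"); // made-up sequences
        entry.put("C", "CCCC");
        ToySource source = new ToySource() {
            @Override
            public Map<String, String> fetch(String name) { return entry; }
        };
        System.out.println(resolve(new ToyChainId("4HHB", "C"), source));
        System.out.println(resolve(new ToyPassthroughId("local-file"), source));
    }
}
```
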
