Commit 43d0b3c

Merge pull request biojava#5 from lafita/multiplealignment
Extend structure alignment page with multiple alignments
2 parents 5d1bdfd + bbc77f4 commit 43d0b3c

24 files changed

Lines changed: 490 additions & 137 deletions

README.md

Lines changed: 3 additions & 4 deletions
Original file line number | Diff line number | Diff line change
@@ -6,7 +6,7 @@ A brief introduction into [BioJava](https://github.com/biojava/biojava).
66

77
The goal of this tutorial is to provide an educational introduction into some of the features that are provided by BioJava.
88

9-
At the moment this tutorial is still under development. Please check the [BioJava Cookbook](http://biojava.org/wiki/BioJava:CookBook3.0) for a more comprehensive collection of many examples of what is possible with BioJava and how to do things.
9+
At the moment this tutorial is still under development. Please check the [BioJava Cookbook](http://biojava.org/wiki/BioJava:CookBook3.0) for a more comprehensive collection of examples about what is possible with BioJava and how to do things.
1010

1111
## Index
1212

@@ -16,10 +16,9 @@ Book 1: [The Core module](core/README.md), basic working with sequences.
1616

1717
Book 2: [The Alignment module](alignment/README.md), pairwise and multiple alignments of protein sequences.
1818

19-
Book 3: [The Protein Structure modules](structure/README.md), everything related to working with 3D structures.
20-
21-
Book 4: [The Genomics Module](genomics/README.md), working with genomic data
19+
Book 3: [The Structure modules](structure/README.md), everything related to working with 3D structures.
2220

21+
Book 4: [The Genomics Module](genomics/README.md), working with genomic data.
2322

2423
## License
2524

alignment/README.md

Lines changed: 1 addition & 1 deletion
@@ -63,4 +63,4 @@ Navigation:
6363

6464
Prev: [Book 1: The Core module](../core/README.md)
6565

66-
Next: [Book 3: The Protein Structure modules](../structure/README.md)
66+
Next: [Book 3: The Structure modules](../structure/README.md)

bin/update_index.py

Lines changed: 4 additions & 4 deletions
@@ -110,7 +110,7 @@ def makefooter(self):
110110
name = p.makename()
111111
# Get a path to p relative to our own path
112112
link = os.path.relpath(p.rootlink(),os.path.dirname(self.rootlink()))
113-
linkmd.append("[{}]({})".format(name,link))
113+
linkmd.append("[{0}]({1})".format(name,link))
114114
p = p.parent
115115
linkmd.reverse()
116116
lines.append("\n| ".join(linkmd))
@@ -123,13 +123,13 @@ def makefooter(self):
123123
prev = self.parent.children[pos-1]
124124
name = prev.makename()
125125
link = os.path.relpath(prev.rootlink(),os.path.dirname(self.rootlink()))
126-
lines.append("Prev: [{}]({})".format(name,link))
126+
lines.append("Prev: [{0}]({1})".format(name,link))
127127
lines.append("")
128128
if pos < len(self.parent.children)-1:
129129
next = self.parent.children[pos+1]
130130
name = next.makename()
131131
link = os.path.relpath(next.rootlink(),os.path.dirname(self.rootlink()))
132-
lines.append("Next: [{}]({})".format(name,link))
132+
lines.append("Next: [{0}]({1})".format(name,link))
133133
lines.append("")
134134

135135
#lines.append(self.makename()+", "+self.link)
@@ -162,7 +162,7 @@ def __repr__(self):
162162

163163
# Output tree
164164
def pr(node,indent=""):
165-
print "{}{}".format(indent,node.link,node.rootlink())
165+
print "{0}{1}".format(indent,node.link,node.rootlink())
166166
for n in node.children:
167167
pr(n,indent+" ")
168168

genomics/README.md

Lines changed: 1 addition & 1 deletion
@@ -64,4 +64,4 @@ Navigation:
6464
[Home](../README.md)
6565
| Book 4: The Genomics Module
6666

67-
Prev: [Book 3: The Protein Structure modules](../structure/README.md)
67+
Prev: [Book 3: The Structure modules](../structure/README.md)

structure/README.md

Lines changed: 11 additions & 11 deletions
@@ -1,7 +1,7 @@
1-
The Protein Structure Modules of BioJava
1+
The Structure Modules of BioJava
22
=====================================================
33

4-
A tutorial for the protein structure modules of [BioJava](http://www.biojava.org)
4+
A tutorial for the structure modules of [BioJava](http://www.biojava.org)
55

66
## About
77
<table>
@@ -32,35 +32,35 @@ Chapter 1 - Quick [Installation](installation.md)
3232

3333
Chapter 2 - [First Steps](firststeps.md)
3434

35-
Chapter 3 - The [data model](structure-data-model.md) for the representation of macromolecular structures.
35+
Chapter 3 - The [Structure Data Model](structure-data-model.md), for the representation of macromolecular structures
3636

37-
Chapter 4 - [Local installations](caching.md) of PDB
37+
Chapter 4 - [Local Installations](caching.md) of PDB
3838

3939
Chapter 5 - The [Chemical Component Dictionary](chemcomp.md)
4040

41-
Chapter 6 - How to [work with mmCIF/PDBx files](mmcif.md)
41+
Chapter 6 - How to [Work with mmCIF/PDBx Files](mmcif.md)
4242

43-
Chapter 7 - [SEQRES and ATOM records](seqres.md), mapping to Uniprot (SIFTs)
43+
Chapter 7 - [SEQRES and ATOM Records](seqres.md), mapping to Uniprot (SIFTs)
4444

45-
Chapter 8 - Protein [Structure Alignments](alignment.md)
45+
Chapter 8 - [Structure Alignments](alignment.md)
4646

4747
Chapter 9 - [Biological Assemblies](bioassembly.md)
4848

4949
Chapter 10 - [External Databases](externaldb.md) like SCOP &amp; CATH
5050

5151
Chapter 11 - [Accessible Surface Areas](asa.md)
5252

53-
Chapter 12 - [Contacts within a chain and between chains](contact-map.md)
53+
Chapter 12 - [Contacts Within a Chain and between Chains](contact-map.md)
5454

55-
Chapter 13 - Finding all interfaces in crystal: [crystal contacts](crystal-contacts.md)
55+
Chapter 13 - Finding all Interfaces in Crystal: [Crystal Contacts](crystal-contacts.md)
5656

5757
Chapter 14 - Protein Symmetry
5858

5959
Chapter 15 - Bonds
6060

6161
Chapter 16 - [Special Cases](special.md)
6262

63-
Chapter 17 - [Lists](lists.md) of PDB IDs and PDB [status information](lists.md).
63+
Chapter 17 - [Lists](lists.md) of PDB IDs and PDB [Status Information](lists.md)
6464

6565

6666
### Author:
@@ -88,7 +88,7 @@ The content of this tutorial is available under the [CC-BY](http://creativecommo
8888

8989
Navigation:
9090
[Home](../README.md)
91-
| Book 3: The Protein Structure modules
91+
| Book 3: The Structure modules
9292

9393
Prev: [Book 2: The Alignment module](../alignment/README.md)
9494

structure/alignment-data-model.md

Lines changed: 229 additions & 0 deletions
@@ -0,0 +1,229 @@
1+
Structure Alignment Data Model
2+
===
3+
4+
## AFPChain Data Model
5+
6+
The `AFPChain` data structure was designed to store pairwise structural
7+
alignments. The class functions as a bean, and contains many variables
8+
used internally by the alignment algorithms implemented in BioJava.
9+
10+
Some of the important stored variables are:
11+
* Algorithm Name
12+
* Optimal Alignment: described later.
13+
* Optimal RMSD: final and total RMSD value of the alignment.
14+
* TM-score
15+
* BlockRotationMatrix: rotation component of the superposition transformation.
16+
* BlockShiftVector: translation component of the superposition transformation.
17+
18+
BioJava class: [org.biojava.nbio.structure.align.model.AFPChain](http://www.biojava.org/docs/api/org/biojava/nbio/structure/align/model/AFPChain.html)
19+
20+
### The Optimal Alignment
21+
22+
The residue equivalencies of the alignment (EQRs) are described in the optimal
23+
alignment variable, a three-dimensional array of integers, where the indices stand for:
24+
25+
```java
26+
int[][][] optAln = afpChain.getOptAln();
27+
int residue = optAln[block][chain][eqr];
28+
```
29+
30+
* **block**: the blocks divide the alignment into different parts. The
31+
division can be due to non-topological rearrangements (e.g. circular
32+
permutations) or due to flexible parts (e.g. domain switch). There can
33+
be any number of blocks in a structural alignment, defined by the structure
34+
alignment algorithm.
35+
* **chain**: in a pairwise alignment there are only two chains, or structures.
36+
* **eqr**: EQR stands for equivalent residue position, i.e. the alignment
37+
position. There are as many positions (EQRs) in a block as the length of
38+
the alignment block, and their number is equal for any of the two chains in
39+
the same block.
40+
41+
In each entry (combination of the three indices described above) an integer
42+
is stored, which corresponds to the residue index in the specified chain, i.e.
43+
the index in the Atom array of the chain. Within a given block, the stored
44+
integers (residues) are always in increasing order.
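
As a minimal sketch of how the three indices are used, the snippet below walks a hand-built `int[][][]` array with the same `[block][chain][eqr]` layout. It does not call BioJava: the class `OptAlnSketch`, its methods, and the sample indices are hypothetical and only stand in for what `afpChain.getOptAln()` would return. It also assumes that the inner arrays are exactly as long as the block; with a real `AFPChain` the block lengths are normally read from `getOptLen()` instead.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in: a plain array with the [block][chain][eqr] layout
// described above, not a real AFPChain.
class OptAlnSketch {

    // Collects the aligned residue pairs as "i-j" strings, block by block.
    static List<String> alignedPairs(int[][][] optAln) {
        List<String> pairs = new ArrayList<String>();
        for (int block = 0; block < optAln.length; block++) {
            // Both chains hold the same number of EQRs inside a block.
            int eqrs = optAln[block][0].length;
            for (int eqr = 0; eqr < eqrs; eqr++) {
                int res1 = optAln[block][0][eqr]; // index into the Atom array of chain 1
                int res2 = optAln[block][1][eqr]; // index into the Atom array of chain 2
                pairs.add(res1 + "-" + res2);
            }
        }
        return pairs;
    }

    // Checks that residue indices strictly increase within every block and chain.
    static boolean isIncreasingWithinBlocks(int[][][] optAln) {
        for (int[][] block : optAln) {
            for (int[] chain : block) {
                for (int i = 1; i < chain.length; i++) {
                    if (chain[i] <= chain[i - 1]) {
                        return false;
                    }
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Two blocks mimicking a circular permutation (made-up indices).
        int[][][] optAln = {
            { {0, 1, 2}, {5, 6, 7} },
            { {4, 5},    {0, 1}    }
        };
        System.out.println(alignedPairs(optAln));             // [0-5, 1-6, 2-7, 4-0, 5-1]
        System.out.println(isIncreasingWithinBlocks(optAln)); // true
    }
}
```

Note that the indices decrease from the first block to the second in chain 2, which is allowed: the increasing order only has to hold inside each block.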
45+
46+
### Examples
47+
48+
Some examples of how to get the basic properties of an `AFPChain`:
49+
50+
```java
51+
afpChain.getAlgorithmName(); //Name of the algorithm that generated the alignment
52+
afpChain.getBlockNum(); //Number of blocks
53+
afpChain.getTMScore(); //TM-score
54+
afpChain.getTotalRmsdOpt(); //Optimal RMSD
55+
Matrix rotation = afpChain.getBlockRotationMatrix()[0]; //get the rotation matrix of the first block
56+
Atom shift = afpChain.getBlockShiftVector()[0]; //get the translation vector of the first block
57+
```
58+
59+
### Overview
60+
61+
As an overview, the `AFPChain` data model:
62+
63+
* Only supports **pairwise alignments**, i.e. two chains or structures aligned.
64+
* Can support **flexible alignments** and **non-topological alignments**.
65+
However, their combination (a flexible alignment with topological rearrangements)
66+
cannot be represented, because the blocks mean either one or the other.
67+
* Cannot support **non-sequential alignments**: they would require a new block
68+
for each EQR, because sequentiality of the residues is assumed inside each block.
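
To illustrate why sequentiality inside a block matters, here is a small sketch in plain Java (not BioJava code) that splits a list of aligned residue pairs into blocks whenever the indices of either chain stop increasing. The class name `BlockSplitSketch`, its method, and the input pairs are invented for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of BioJava.
class BlockSplitSketch {

    // Each pair is {residue index in chain 1, residue index in chain 2}.
    // A new block is started whenever either index fails to increase.
    static List<List<int[]>> splitIntoBlocks(List<int[]> pairs) {
        List<List<int[]>> blocks = new ArrayList<List<int[]>>();
        List<int[]> current = null;
        int[] prev = null;
        for (int[] pair : pairs) {
            boolean sequential = prev != null
                    && pair[0] > prev[0]
                    && pair[1] > prev[1];
            if (!sequential) {
                current = new ArrayList<int[]>();
                blocks.add(current);
            }
            current.add(pair);
            prev = pair;
        }
        return blocks;
    }

    public static void main(String[] args) {
        // A circular permutation: two sequential stretches, so two blocks.
        List<int[]> permuted = new ArrayList<int[]>();
        permuted.add(new int[] {0, 5});
        permuted.add(new int[] {1, 6});
        permuted.add(new int[] {2, 7});
        permuted.add(new int[] {5, 0});
        permuted.add(new int[] {6, 1});
        System.out.println(splitIntoBlocks(permuted).size()); // 2

        // A fully non-sequential alignment: every pair ends up in its own block.
        List<int[]> reversed = new ArrayList<int[]>();
        reversed.add(new int[] {0, 2});
        reversed.add(new int[] {1, 1});
        reversed.add(new int[] {2, 0});
        System.out.println(splitIntoBlocks(reversed).size()); // 3
    }
}
```

The second case shows the cost mentioned above: when no two consecutive pairs are sequential, the number of blocks equals the number of EQRs.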
69+
70+
## MultipleAlignment Data Model
71+
72+
Since BioJava 4.1.0, a new data model is available to store structure alignments.
73+
The `MultipleAlignment` data structure is a general model that supports any of the
74+
following properties, and any combination:
75+
76+
* **Multiple structures**: the model is no longer restricted to pairwise alignments.
77+
* **Non-topological alignments**: such as circular permutations or domain rearrangements.
78+
* **Flexible alignments**: parts of the alignment with different superposition
79+
transformation.
80+
81+
In addition, the data structure is not limited in the number and types of scores
82+
it can store, because the scores are stored in a key:value fashion, as
83+
described later.
84+
85+
BioJava class: [org.biojava.nbio.structure.align.multiple.MultipleAlignment](http://www.biojava.org/docs/api/org/biojava/nbio/structure/align/multiple/MultipleAlignment.html)
86+
87+
### Object Hierarchy
88+
89+
The biggest difference with `AFPChain` is that the `MultipleAlignment` data
90+
structure is object oriented.
91+
The hierarchy of sub-objects is represented below:
92+
93+
<pre>
94+
MultipleAlignmentEnsemble
95+
|
96+
MultipleAlignment(s)
97+
|
98+
BlockSet(s)
99+
|
100+
Block(s)
101+
</pre>
102+
103+
* **MultipleAlignmentEnsemble**: the ensemble is the top level of the hierarchy.
104+
As a top level, it stores information regarding creation properties (algorithm,
105+
version, creation time, etc.), the structures involved in the alignment (Atoms,
106+
structure identifiers, etc.) and cached variables (atomic distance matrices).
107+
It contains a collection of `MultipleAlignment` that share the same properties
108+
stored in the ensemble. This construction allows the storage of alternative
109+
alignments inside the same data structure.
110+
111+
* **MultipleAlignment**: the `MultipleAlignment` stores the core information of a
112+
multiple structure alignment. It is designed to be the return type of the multiple
113+
structure alignment algorithms. The object contains a collection of `BlockSet` and
114+
it is linked to its parent `MultipleAlignmentEnsemble`.
115+
116+
* **BlockSet**: the `BlockSet` stores a flexible part of a multiple structure
117+
alignment. A flexible part needs the residue equivalencies involved, contained in
118+
a collection of `Block`, and a transformation matrix for every structure that
119+
describes the 3D superposition of all structures. It is linked to its parent
120+
`MultipleAlignment`.
121+
122+
* **Block**: the `Block` stores the aligned positions (equivalent residues) of a
123+
`BlockSet` that are in sequentially increasing order. Each `Block` represents a
124+
sequential part of a non-topological alignment, if more than one `Block` is present.
125+
It is linked to its parent `BlockSet`.
126+
127+
### The Optimal Alignment
128+
129+
In the `MultipleAlignment` data structure the aligned residues are stored in a
130+
double List for every `Block`. The indices of the double List are the following:
131+
132+
```java
133+
List<List<Integer>> optAln = block.getAlnRes();
134+
Integer residue = optAln.get(chain).get(eqr);
135+
```
136+
137+
The indices have the same meaning as in the optimal alignment of the `AFPChain`.
138+
As a reminder:
139+
140+
* **chain**: chain or structure index.
141+
* **eqr**: EQR stands for equivalent residue position, i.e. the alignment
142+
position. There are as many positions (EQRs) in a block as the length of
143+
the alignment block, and their number is equal for any of the chains in
144+
the same block.
145+
146+
As in `AFPChain`, each entry (combination of the two indices described above)
147+
is an Integer that corresponds to the residue index in the specified chain, i.e.
148+
the index in the Atom array of the chain. Caution has to be taken in the code,
149+
because a `MultipleAlignment` can contain gaps, which are represented as `null`
150+
in the List entries.
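
The following sketch shows one way to guard against gaps when reading such a double List. It uses a hand-built `List<List<Integer>>` rather than a real `Block`, and the class `GapSketch` and its method `coreLength` are hypothetical names chosen for this example. The method counts the alignment columns in which every chain has a residue, i.e. columns without any `null` entry.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper operating on the chain/eqr layout described above.
class GapSketch {

    // Counts the columns (EQRs) where no chain has a gap (null).
    static int coreLength(List<List<Integer>> alnRes) {
        if (alnRes.isEmpty()) {
            return 0;
        }
        int columns = alnRes.get(0).size(); // same length for every chain
        int core = 0;
        for (int eqr = 0; eqr < columns; eqr++) {
            boolean hasGap = false;
            for (List<Integer> chain : alnRes) {
                // Compare against null before any unboxing to avoid a NullPointerException.
                if (chain.get(eqr) == null) {
                    hasGap = true;
                    break;
                }
            }
            if (!hasGap) {
                core++;
            }
        }
        return core;
    }

    public static void main(String[] args) {
        // Three chains, four columns, made-up residue indices; null marks a gap.
        List<List<Integer>> alnRes = new ArrayList<List<Integer>>();
        alnRes.add(Arrays.asList(10, 11, 12, 13));
        alnRes.add(Arrays.asList(20, null, 22, 23));
        alnRes.add(Arrays.asList(30, 31, null, 33));
        System.out.println(coreLength(alnRes)); // 2
    }
}
```

Keeping the entries as `Integer` (rather than assigning them to an `int`) until the `null` check has passed is the important detail here.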
151+
152+
### Alignment Scores
153+
154+
All the objects in the hierarchy levels implement the `ScoresCache` interface.
155+
This interface allows the storage of any number of scores as a key:value set.
156+
The key is a `String` that describes the score and is used to retrieve it later,
157+
and the value is a double with the calculated score. The interface has only
158+
two methods: `putScore` and `getScore`.
159+
160+
The following lines of code are an example on how to do score manipulations
161+
on a `MultipleAlignment`:
162+
163+
```java
164+
//Put a score into the alignment and get it back
165+
alignment.putScore("myRMSD", 1.234);
166+
double myRMSD = alignment.getScore("myRMSD");
167+
168+
//The same can be done for BlockSets
169+
BlockSet bs = alignment.getBlockSets().get(0);
170+
bs.putScore("bsRMSD", 1.234);
171+
double bsRMSD = bs.getScore("bsRMSD");
172+
```
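
For readers who want to experiment without BioJava on the classpath, a key:value score store of this kind can be sketched with a plain `HashMap`. The class `ScoreStoreSketch` below is a hypothetical stand-in, not the BioJava `ScoresCache` interface, and its behavior for missing keys (returning `null`) is an assumption of this sketch.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a key:value score store; not the BioJava interface.
class ScoreStoreSketch {

    private final Map<String, Double> scores = new HashMap<String, Double>();

    // Stores (or overwrites) the score under the given name.
    void putScore(String name, Double score) {
        scores.put(name, score);
    }

    // Returns the stored score, or null when nothing was stored under that name.
    Double getScore(String name) {
        return scores.get(name);
    }

    public static void main(String[] args) {
        ScoreStoreSketch store = new ScoreStoreSketch();
        store.putScore("myRMSD", 1.234);
        System.out.println(store.getScore("myRMSD"));  // 1.234
        System.out.println(store.getScore("missing")); // null
    }
}
```

With a design like this, assigning the result of `getScore` directly to a primitive `double` throws a `NullPointerException` for a missing key, so checking for `null` first is the safer pattern.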
173+
174+
### Manipulating Multiple Alignments
175+
176+
Some classes are designed to contain utility methods for manipulating a `MultipleAlignment` object.
177+
The most important ones are enumerated and briefly described below:
178+
179+
* [MultipleAlignmentScorer](http://www.biojava.org/docs/api/org/biojava/nbio/structure/align/multiple/util/MultipleAlignmentScorer.html): contains names of commonly used scores and methods to calculate them.
180+
181+
* [MultipleAlignmentTools](http://www.biojava.org/docs/api/org/biojava/nbio/structure/align/multiple/util/MultipleAlignmentTools.html): contains helper methods, for example to calculate the sequence alignment, transform the atom arrays of the structures, or calculate aligned residue distances between all structures.
182+
183+
* [MultipleAlignmentWriter](http://www.biojava.org/docs/api/org/biojava/nbio/structure/align/multiple/util/MultipleAlignmentWriter.html): contains methods to generate different types of String outputs of the alignment, e.g. FASTA, XML, FatCat.
184+
185+
* [MultipleSuperimposer](http://www.biojava.org/docs/api/org/biojava/nbio/structure/align/multiple/util/MultipleSuperimposer.html): interface for implementations that calculate the structure superpositions of the alignment. Some examples of implementations are the ReferenceSuperimposer (superimposes all the structures to a reference) and the CoreSuperimposer (only uses EQRs present in all structures, without gaps, to superimpose them).
186+
187+
* [MultipleAlignmentXMLParser](http://www.biojava.org/docs/api/org/biojava/nbio/structure/align/xml/MultipleAlignmentXMLParser.html): contains a method to create a `MultipleAlignment` object from an XML file representation.
188+
189+
### Overview
190+
191+
As an overview, the `MultipleAlignment` data model:
192+
193+
* Supports any number of aligned structures, **multiple structures**.
194+
* Can support **flexible alignments** and **non-topological alignments**,
195+
and any of their combinations (e.g. a flexible alignment with topological
196+
rearrangements).
197+
* Cannot support **non-sequential alignments**: they would require a new
198+
`Block` for each EQR, because sequentiality of the residues is a requirement
199+
for each `Block`.
200+
* Can store **any score** at any of the four levels of the object hierarchy, making it
201+
easy to adapt to new requirements and algorithms.
202+
203+
For more examples and information about the `MultipleAlignment` data structure,
204+
see the demo package of the biojava-structure module or look through the interface
205+
files, where the javadoc explanations can be found.
206+
207+
## Conversion between Data Models
208+
209+
The conversion from an `AFPChain` to a `MultipleAlignment` is possible through the
210+
ensemble constructor. An example of how to do it programmatically is below:
211+
212+
```java
213+
AFPChain afpChain; //pairwise alignment to convert, obtained elsewhere
214+
Atom[] chain1; //atoms of the first structure
215+
Atom[] chain2; //atoms of the second structure
216+
boolean flexible = false;
217+
MultipleAlignmentEnsemble ensemble = new MultipleAlignmentEnsembleImpl(afpChain, chain1, chain2, flexible);
218+
MultipleAlignment converted = ensemble.getMultipleAlignments().get(0);
219+
```
220+
221+
There is no method to convert from a `MultipleAlignment` to an `AFPChain`, because
222+
the first representation supports any number of structures, while the second
223+
supports only pairwise alignments. However, the conversion can be done with a few
224+
lines of code if needed (instantiate a new `AFPChain` and copy, one by one, the
225+
properties that can be represented from the `MultipleAlignment`).
226+
227+
===
228+
229+
Go back to [Chapter 8 : Structure Alignments](alignment.md).
