
Commit 40b7c4d

committed
Merge remote-tracking branch 'origin/multiplealignment' into multiplealignment
2 parents 99c7cc2 + b6b0a01 commit 40b7c4d

10 files changed

Lines changed: 229 additions & 108 deletions

File tree

README.md

Lines changed: 3 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -6,7 +6,7 @@ A brief introduction into [BioJava](https://github.com/biojava/biojava).
66

77
The goal of this tutorial is to provide an educational introduction into some of the features that are provided by BioJava.
88

9-
At the moment this tutorial is still under development. Please check the [BioJava Cookbook](http://biojava.org/wiki/BioJava:CookBook3.0) for a more comprehensive collection of many examples of what is possible with BioJava and how to do things.
9+
At the moment this tutorial is still under development. Please check the [BioJava Cookbook](http://biojava.org/wiki/BioJava:CookBook3.0) for a more comprehensive collection of examples about what is possible with BioJava and how to do things.
1010

1111
## Index
1212

@@ -16,10 +16,9 @@ Book 1: [The Core module](core/README.md), basic working with sequences.
1616

1717
Book 2: [The Alignment module](alignment/README.md), pairwise and multiple alignments of protein sequences.
1818

19-
Book 3: [The Protein Structure modules](structure/README.md), everything related to working with 3D structures.
20-
21-
Book 4: [The Genomics Module](genomics/README.md), working with genomic data
19+
Book 3: [The Structure modules](structure/README.md), everything related to working with 3D structures.
2220

21+
Book 4: [The Genomics Module](genomics/README.md), working with genomic data.
2322

2423
## License
2524

structure/README.md

Lines changed: 11 additions & 11 deletions
@@ -1,7 +1,7 @@
1-
The Protein Structure Modules of BioJava
1+
The Structure Modules of BioJava
22
=====================================================
33

4-
A tutorial for the protein structure modules of [BioJava](http://www.biojava.org)
4+
A tutorial for the structure modules of [BioJava](http://www.biojava.org)
55

66
## About
77
<table>
@@ -32,35 +32,35 @@ Chapter 1 - Quick [Installation](installation.md)
3232

3333
Chapter 2 - [First Steps](firststeps.md)
3434

35-
Chapter 3 - The [data model](structure-data-model.md) for the representation of macromolecular structures.
35+
Chapter 3 - The [Structure Data Model](structure-data-model.md), for the representation of macromolecular structures
3636

37-
Chapter 4 - [Local installations](caching.md) of PDB
37+
Chapter 4 - [Local Installations](caching.md) of PDB
3838

3939
Chapter 5 - The [Chemical Component Dictionary](chemcomp.md)
4040

41-
Chapter 6 - How to [work with mmCIF/PDBx files](mmcif.md)
41+
Chapter 6 - How to [Work with mmCIF/PDBx Files](mmcif.md)
4242

43-
Chapter 7 - [SEQRES and ATOM records](seqres.md), mapping to Uniprot (SIFTs)
43+
Chapter 7 - [SEQRES and ATOM Records](seqres.md), mapping to Uniprot (SIFTs)
4444

45-
Chapter 8 - Protein [Structure Alignments](alignment.md)
45+
Chapter 8 - [Structure Alignments](alignment.md)
4646

4747
Chapter 9 - [Biological Assemblies](bioassembly.md)
4848

4949
Chapter 10 - [External Databases](externaldb.md) like SCOP & CATH
5050

5151
Chapter 11 - [Accessible Surface Areas](asa.md)
5252

53-
Chapter 12 - [Contacts within a chain and between chains](contact-map.md)
53+
Chapter 12 - [Contacts Within a Chain and between Chains](contact-map.md)
5454

55-
Chapter 13 - Finding all interfaces in crystal: [crystal contacts](crystal-contacts.md)
55+
Chapter 13 - Finding all Interfaces in Crystal: [Crystal Contacts](crystal-contacts.md)
5656

5757
Chapter 14 - Protein Symmetry
5858

5959
Chapter 15 - Bonds
6060

6161
Chapter 16 - [Special Cases](special.md)
6262

63-
Chapter 17 - [Lists](lists.md) of PDB IDs and PDB [status information](lists.md).
63+
Chapter 17 - [Lists](lists.md) of PDB IDs and PDB [Status Information](lists.md)
6464

6565

6666
### Author:
@@ -88,7 +88,7 @@ The content of this tutorial is available under the [CC-BY](http://creativecommo
8888

8989
Navigation:
9090
[Home](../README.md)
91-
| Book 3: The Protein Structure modules
91+
| Book 3: The Structure modules
9292

9393
Prev: [Book 2: The Alignment module](../alignment/README.md)
9494

structure/alignment-data-model.md

Lines changed: 22 additions & 2 deletions
@@ -15,6 +15,8 @@ Some of the important stored variables are:
1515
* BlockRotationMatrix: rotation component of the superposition transformation.
1616
* BlockShiftVector: translation component of the superposition transformation.
1717

18+
BioJava class: [org.biojava.nbio.structure.align.model.AFPChain](http://www.biojava.org/docs/api/org/biojava/nbio/structure/align/model/AFPChain.html)
19+
1820
### The Optimal Alignment
1921

2022
The residue equivalencies of the alignment (EQRs) are described in the optimal
@@ -80,6 +82,8 @@ In addtition, the data structure is not limited in the number and types of score
8082
it can store, because the scores are stored in a key:value fashion, as it will be
8183
described later.
8284

85+
BioJava class: [org.biojava.nbio.structure.align.multiple.MultipleAlignment](http://www.biojava.org/docs/api/org/biojava/nbio/structure/align/multiple/MultipleAlignment.html)
86+
8387
### Object Hierarchy
8488

8589
The biggest difference with `AFPChain` is that the `MultipleAlignment` data
@@ -167,8 +171,20 @@ on a `MultipleAlignment`:
167171
double bsRMSD = alignment.getScore("bsRMSD");
168172
```
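
The key:value storage of scores can be illustrated with plain Java collections. The following is a minimal sketch, not the BioJava implementation: the class `ScoreStore` and its method names are hypothetical and only mirror the `getScore` call shown above.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical illustration only: this is not a BioJava class.
final class ScoreStore {
    // Scores are kept by name, so any number and type of score can be added.
    private final Map<String, Double> scores = new TreeMap<>();

    /** Stores a score under the given name, overwriting any previous value. */
    void putScore(String name, double value) {
        scores.put(name, value);
    }

    /** Returns the stored score, or null if no score has that name. */
    Double getScore(String name) {
        return scores.get(name);
    }

    /** Returns the names of all stored scores as a read-only, sorted set. */
    Set<String> getScoreNames() {
        return Collections.unmodifiableSet(scores.keySet());
    }
}
```

Because the lookup returns `null` for an unknown name in this sketch, a caller would need to check the result before unboxing it into a `double`.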
169173

170-
Methods and names for some frequent scores are located in a util class called
171-
`MultipleAlignmentScorer`.
174+
### Manipulating Multiple Alignments
175+
176+
Some classes are designed to contain utility methods for manipulating a `MultipleAlignment` object.
177+
The most important ones are enumerated and briefly described below:
178+
179+
* [MultipleAlignmentScorer](http://www.biojava.org/docs/api/org/biojava/nbio/structure/align/multiple/util/MultipleAlignmentScorer.html): contains frequent names for scores and methods to calculate them.
180+
181+
* [MultipleAlignmentTools](http://www.biojava.org/docs/api/org/biojava/nbio/structure/align/multiple/util/MultipleAlignmentTools.html): contains helper methods, for example to calculate the sequence alignment, to transform the atom arrays of the structures, or to calculate the aligned residue distances between all structures.
182+
183+
* [MultipleAlignmentWriter](http://www.biojava.org/docs/api/org/biojava/nbio/structure/align/multiple/util/MultipleAlignmentWriter.html): contains methods to generate different types of String outputs of the alignment, e.g. FASTA, XML, FatCat.
184+
185+
* [MultipleSuperimposer](http://www.biojava.org/docs/api/org/biojava/nbio/structure/align/multiple/util/MultipleSuperimposer.html): interface for implementations that calculate the structure superpositions of the alignment. Some examples of implementations are the ReferenceSuperimposer (superimposes all the structures to a reference) and the CoreSuperimposer (only uses EQRs present in all structures, without gaps, to superimpose them).
186+
187+
* [MultipleAlignmentXMLParser](http://www.biojava.org/docs/api/org/biojava/nbio/structure/align/xml/MultipleAlignmentXMLParser.html): contains a method to create a `MultipleAlignment` object from an XML file representation.
172188

173189
### Overview
174190

@@ -207,3 +223,7 @@ the first representation supports any number of structures, while the second is
207223
only supporting pairwise alignments. However, the conversion can be done with some
208224
lines of code if needed (instantiate a new `AFPChain` and copy one by one the
209225
properties that can be represented from the `MultipleAlignment`).
226+
227+
===
228+
229+
Go back to [Chapter 8 : Structure Alignments](alignment.md).

structure/alignment.md

Lines changed: 136 additions & 31 deletions
@@ -1,46 +1,62 @@
1-
Structure Alignment
1+
Structure Alignments
22
===========================
33

44
## What is a Structure Alignment?
55

6-
A **structural alignment** attempts to establish equivalences between two or more polymer structures based on their shape and three-dimensional conformation. In contrast to simple structural superposition (see below), where at least some equivalent residues of the two structures are known, structural alignment requires no a priori knowledge of equivalent positions.
7-
8-
**Structural alignment** is a valuable tool for the comparison of proteins with low sequence similarity, where evolutionary relationships between proteins cannot be easily detected by standard sequence alignment techniques. **Structural alignment** can therefore be used to imply evolutionary relationships between proteins that share very little common sequence. However, caution should be exercised when using the results as evidence for shared evolutionary ancestry, because of the possible confounding effects of convergent evolution by which multiple unrelated amino acid sequences converge on a common tertiary structure.
9-
10-
**Structural alignment** of other biological structures can also be made in BioJava. For example, nucleic acids can
11-
be structurally aligned to find common structural motifs, independent of sequence simililarity. This is specially
12-
important for RNAs, because their 3D structure arrangement is important for their function.
6+
A **structural alignment** attempts to establish equivalences between two or
7+
more polymer structures based on their shape and three-dimensional conformation.
8+
In contrast to simple structural superposition (see below), where at least some
9+
equivalent residues of the two structures are known, structural alignment requires
10+
no a priori knowledge of equivalent positions.
11+
12+
A **structural alignment** is a valuable tool for the comparison of proteins with
13+
low sequence similarity, where evolutionary relationships between proteins cannot
14+
be easily detected by standard sequence alignment techniques. Therefore, a
15+
**structural alignment** can be used to imply evolutionary relationships between
16+
proteins that share very little common sequence. However, caution should be exercised
17+
when using the results as evidence for shared evolutionary ancestry, because of the
18+
possible confounding effects of convergent evolution by which multiple unrelated amino
19+
acid sequences converge on a common tertiary structure.
20+
21+
A **structural alignment** of other biological polymers can also be made in BioJava.
22+
For example, nucleic acids can be structurally aligned to find common structural motifs,
23+
independent of sequence similarity. This is especially important for RNAs, because their
24+
3D structure arrangement is important for their function.
1325

1426
For more info see the Wikipedia article on [structure alignment](http://en.wikipedia.org/wiki/Structural_alignment).
1527

1628
## Alignment Algorithms supported by BioJava
1729

1830
BioJava comes with a number of algorithms for aligning structures. The following
1931
five options are displayed by default in the graphical user interface (GUI),
20-
although others can be accessed programmatically using the methods in
21-
[StructureAlignmentFactory]
22-
(http://www.biojava.org/docs/api/org/biojava/nbio/structure/align/StructureAlignmentFactory.html).
32+
although others can be accessed programmatically using the methods in
33+
[StructureAlignmentFactory](http://www.biojava.org/docs/api/org/biojava/nbio/structure/align/StructureAlignmentFactory.html).
2334

2435
1. Combinatorial Extension (CE)
2536
2. Combinatorial Extension with Circular Permutation (CE-CP)
2637
3. FATCAT - rigid
2738
4. FATCAT - flexible.
2839
5. Smith-Waterman superposition
2940

30-
CE and FATCAT both use structural similarity to align the structures, while
31-
Smith-Waterman performs a local sequence alignment and then displays the result
41+
**CE** and **FATCAT** both use structural similarity to align the structures, while
42+
**Smith-Waterman** performs a local sequence alignment and then displays the result
3243
in 3D. See below for descriptions of the algorithms.
3344

34-
Since BioJava version 4.1.0, multiple structure alignments can be generated and visualized.
45+
Since BioJava version 4.1.0, multiple structures can be compared at the same time in
46+
a **multiple structure alignment**, which can later be visualized in Jmol.
3547
The algorithm is described in detail below. As an overview, it uses any pairwise alignment
36-
algorithm and a reference structure to align all of the structures. Then, it runs a Monte
37-
Carlo optimization method to determine the residue equivalencies between all the strucutures,
38-
identifying conserved structural motifs.
48+
algorithm and a **reference** structure to perform an alignment of all the structures.
49+
Then, it runs a **Monte Carlo** optimization to determine the residue equivalencies among
50+
all the structures, identifying conserved **structural motifs**.
3951

4052
## Alignment User Interface
4153

4254
Before going into the details of how to use the algorithms programmatically, let's take
43-
a look at the user interface that cames with the *biojava-structure-gui* module.
55+
a look at the user interface that comes with the *biojava-structure-gui* module.
56+
57+
### Pairwise Alignment GUI
58+
59+
Generating an instance of the GUI is just one line of code:
4460

4561
```java
4662
AlignmentGui.getInstance();
@@ -60,9 +76,45 @@ and also a 2D display, that interacts with the 3D display
6076

6177
![2D Alignment of PDB IDs 2hyn and 1zll](img/alignmentpanel.png)
6278

63-
The functionality to perform and visualize these alignments can of course be
64-
used also from your own code. Let's first have a look at the alignment
65-
algorithms.
79+
### Multiple Alignment GUI
80+
81+
Because of the inherent difference between multiple and pairwise alignments,
82+
a separate GUI is used to trigger multiple structural alignments. Generating
83+
an instance of the GUI is analogous to the pairwise alignment GUI:
84+
85+
```java
86+
MultipleAlignmentGUI.getInstance();
87+
```
88+
89+
This code shows the following user interface:
90+
91+
![Multiple Alignment GUI](img/multiple_gui.png)
92+
93+
The input format is a free text field, where the structure identifiers are
94+
indicated, space separated. A **structure identifier** is a String that
95+
uniquely identifies a structure. It is basically composed of the pdbID, the
96+
chain letters and the ranges of residues of each chain. For the formal description
97+
visit [StructureIdentifier](http://www.biojava.org/docs/api/org/biojava/nbio/structure/StructureIdentifier.html).
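
To make the input format more concrete, the following sketch splits a free-text field into identifiers and separates the PDB ID from an optional chain. It is a simplification under assumptions: the class `SimpleIdentifierParser` is hypothetical, it only understands the forms `3app` and `1bbs.A`, and it ignores residue ranges, which the real `StructureIdentifier` implementations handle.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: handles "3app" and "1bbs.A", nothing more.
final class SimpleIdentifierParser {
    /** Splits the free-text input on whitespace into individual identifiers. */
    static List<String> splitIdentifiers(String input) {
        List<String> ids = new ArrayList<>();
        for (String token : input.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                ids.add(token);
            }
        }
        return ids;
    }

    /** Returns {pdbId, chain}; chain is the empty string if none is given. */
    static String[] splitPdbIdAndChain(String identifier) {
        int dot = identifier.indexOf('.');
        if (dot < 0) {
            return new String[] { identifier, "" };
        }
        return new String[] {
            identifier.substring(0, dot),
            identifier.substring(dot + 1)
        };
    }
}
```

In real code the identifiers would normally be handed to BioJava as they are, since the library does the parsing itself.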
98+
99+
As an example, a multiple structure alignment of 6 globins is shown here.
100+
Their structure identifiers are shown in the previous figure of the GUI.
101+
The results are shown in a graphical way, as for the pairwise alignments:
102+
103+
![3D Globin Multiple Alignment](img/multiple_jmol_globins.png)
104+
105+
The only difference with the Pairwise Alignment View is the possibility to show
106+
a subset of structures to be visualized, by checking the boxes under the 3D
107+
window and pressing the Show Only button afterwards.
108+
109+
A **sequence alignment panel** that interacts with the 3D display can also be shown.
110+
111+
![3D Globin Multiple Panel](img/multiple_panel_globins.png)
112+
113+
Explore the coloring options in the *Edit* menu, and through the *View* menu for
114+
alternative representations of the alignment.
115+
116+
The functionality to perform and visualize these alignments can also be
117+
used from your own code. Let's first have a look at the alignment algorithms.
66118

67119
## Pairwise Alignment Algorithms
68120

@@ -175,9 +227,33 @@ interface.
175227

176228
## Multiple Structure Alignment
177229

178-
Since BioJava 4.1.0, multiple structure alignments can be generated.
230+
This Java implementation for multiple structure alignments, named MultipleMC, is based on the original CE-MC implementation by [Guda C, Scheeff ED, Bourne PE & Shindyalov IN in 2001](http://psb.stanford.edu/psb-online/proceedings/psb01/abstracts/p275.html)
231+
[![pubmed](http://img.shields.io/badge/in-pubmed-blue.svg?style=flat)](http://www.ncbi.nlm.nih.gov/pubmed/11262947).
232+
233+
The idea remains unchanged: perform **all-to-all pairwise alignments** of the structures, choose the
234+
**reference** as the most similar structure to all others and run a **Monte Carlo optimization** of
235+
the multiple residue equivalencies (EQRs) to minimize a score function that depends on the inter-residue
236+
distances.
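
The reference selection step can be sketched as follows. This is a minimal illustration, not the BioJava code: it assumes that the all-to-all pairwise alignments have already been summarized in a symmetric matrix of dissimilarities (for example RMSD values, where lower means more similar), and the class `ReferenceChooser` is hypothetical. The criterion actually used by MultipleMC may differ.

```java
// Hypothetical illustration only: this is not the BioJava implementation.
final class ReferenceChooser {
    /**
     * distance[i][j] is a pairwise dissimilarity between structures i and j
     * (lower means more similar). Returns the index of the structure whose
     * summed distance to all others is smallest, or -1 for an empty matrix.
     */
    static int chooseReference(double[][] distance) {
        int best = -1;
        double bestSum = Double.POSITIVE_INFINITY;
        for (int i = 0; i < distance.length; i++) {
            double sum = 0.0;
            for (int j = 0; j < distance.length; j++) {
                if (i != j) {
                    sum += distance[i][j];
                }
            }
            if (sum < bestSum) {
                bestSum = sum;
                best = i;
            }
        }
        return best;
    }
}
```

With a similarity score instead of a distance, the comparison would be inverted so that the largest sum wins.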
237+
238+
Although the main idea is the same as in the original algorithm, some details of the implementation have
239+
been changed in the BioJava version. They are described in the main class, but as a summary:
240+
241+
1. It accepts **any pairwise alignment** algorithm (instead of being attached to CE), so any
242+
of the algorithms described before is suitable for generating a seed for optimization. Note that
243+
this property allows *non-topological* and *flexible* multiple structure alignments, always restricted
244+
by the pairwise alignment algorithm limitations.
245+
2. The **moves** in the Monte Carlo optimization have been simplified to 3, instead of 4.
246+
3. A **new move** to insert and delete individual gaps has been added.
247+
4. The scoring function has been modified to a **continuous** function, maintaining the properties that the authors described.
248+
5. The **probability function** is normalized in synchronization with the optimization progression, to improve the convergence into a score maximum after some random exploration of the multidimensional space.
249+
250+
The algorithm performs similarly to other multiple structure alignment algorithms for most protein families.
251+
The parameters both for the pairwise aligner and the MC optimization can have an impact on the final result. There is not a unique set of parameters, because they usually depend on the specific case. Thus, trying some parameter combinations, keeping in mind the effect they produce in the score function, is a good practice when doing structure alignments.
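
To illustrate how an acceptance probability can follow the progress of the optimization, here is a generic annealing-style sketch. It does not reproduce the function used in `MultipleMcMain`: the class `McAcceptance`, the linear decay and the `scale` parameter are assumptions made for this example, and the score is assumed to be maximized.

```java
import java.util.Random;

// Hypothetical illustration only: a generic Metropolis-style acceptance rule.
final class McAcceptance {
    /**
     * Probability of accepting a move that changes the score by delta.
     * Improvements (delta >= 0) are always accepted. The tolerance for
     * worsening moves decays linearly as the iteration count advances.
     */
    static double acceptanceProbability(double delta, int iteration,
                                        int maxIterations, double scale) {
        if (delta >= 0) {
            return 1.0;
        }
        double progress = (double) iteration / maxIterations; // 0 at start, 1 at end
        double temperature = scale * (1.0 - progress);
        if (temperature <= 0) {
            return 0.0; // end of the run: only improvements are accepted
        }
        return Math.exp(delta / temperature);
    }

    /** Draws a random number and decides whether the move is accepted. */
    static boolean accept(double delta, int iteration, int maxIterations,
                          double scale, Random rng) {
        return rng.nextDouble()
                < acceptanceProbability(delta, iteration, maxIterations, scale);
    }
}
```

Early in the run, worsening moves are accepted fairly often, which allows random exploration. Towards the end they are almost always rejected, so the optimization settles into a nearby optimum. This is why the choice of `scale` and of the number of iterations can change the final alignment.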
179252

180-
## PDB-wide database searches
253+
BioJava class: [org.biojava.nbio.structure.align.multiple.mc.MultipleMcMain](http://www.biojava.org/docs/api/org/biojava/nbio/structure/align/multiple/mc/MultipleMcMain.html)
255+
256+
## PDB-wide Database Searches
181257

182258
The Alignment GUI also provides functionality for PDB-wide structural searches.
183259
This systematically compares a structure against a non-redundant set of all
@@ -213,10 +289,10 @@ the `PDB_DIR` environmental variable. This operation sped up the search from
213289
about 30 hours to less than 4 hours.
214290

215291

216-
## Creating alignments programmatically
292+
## Creating Alignments Programmatically
217293

218-
The various structure alignment algorithms in BioJava implement the
219-
`StructureAlignment` interface, and are normally accessed through
294+
The **pairwise structure alignment** algorithms in BioJava implement the
295+
`StructureAlignment` interface, and are usually accessed through
220296
`StructureAlignmentFactory`. Here's an example of how to create a CE-CP
221297
alignment and print some information about it.
222298

@@ -242,13 +318,43 @@ To display the alignment using Jmol, use:
242318

243319
```java
244320
GuiWrapper.display(afpChain, ca1, ca2);
245-
// Or StructureAlignmentDisplay.display(afpChain, ca1, ca2);
321+
// Or using the biojava-structure-gui module
322+
StructureAlignmentDisplay.display(afpChain, ca1, ca2);
246323
```
247324

248325
Note that these require that you include the structure-gui package and the Jmol
249326
binary in the classpath at runtime.
250327

251-
## Command-line tools
328+
For creating **multiple structure alignments**, the code is a little bit different, because the
329+
returned data structure and the number of input structures are different. Here is an
330+
example of how to create and display a multiple alignment:
331+
332+
```java
333+
//Specify the structures to align: some ASP-proteinases
334+
List<String> names = Arrays.asList("3app", "4ape", "5pep", "1psn", "4cms", "1bbs.A", "1smr.A");
335+
336+
//Load the CA atoms of the structures
337+
AtomCache cache = new AtomCache();
338+
List<Atom[]> atomArrays = new ArrayList<Atom[]>();
339+
for (String name:names) {
340+
    atomArrays.add(cache.getAtoms(name));
341+
}
342+
343+
//Generate the multiple alignment algorithm with the chosen pairwise algorithm
344+
StructureAlignment pairwise = StructureAlignmentFactory.getAlgorithm(CeMain.algorithmName);
345+
MultipleMcMain multiple = new MultipleMcMain(pairwise);
346+
347+
//Perform the alignment
348+
MultipleAlignment result = multiple.align(atomArrays);
349+
350+
//Output the FASTA sequence alignment
351+
System.out.println(MultipleAlignmentWriter.toFASTA(result));
352+
353+
//Display the results in a 3D view
354+
MultipleAlignmentDisplay.display(result);
355+
```
356+
357+
## Command-Line Tools
252358

253359
Many of the alignment algorithms are available in the form of command line
254360
tools. These can be accessed through the main methods of the StructureAlignment
@@ -265,8 +371,7 @@ alignments in batch mode, or full database searches. Some additional parameters
265371
are available which are not exposed in the GUI, such as outputting results to a
266372
file in various formats.
267373

268-
269-
## See Also
374+
## Alignment Data Model
270375

271376
For details about the structure alignment data models in biojava, see [Structure Alignment Data Model](alignment-data-model.md)
272377

@@ -280,7 +385,7 @@ Thanks to P. Bourne, Yuzhen Ye and A. Godzik for granting permission to freely u
280385

281386
Navigation:
282387
[Home](../README.md)
283-
| [Book 3: The Protein Structure modules](README.md)
388+
| [Book 3: The Structure modules](README.md)
284389
| Chapter 8 : Structure Alignments
285390

286391
Prev: [Chapter 7 : SEQRES and ATOM records](seqres.md)
