Commit af37345

Merge 'sbliven/master' with remote-tracking branch 'main/master'
Conflicts: structure/README.md
2 parents aa08476 + 2f05478 commit af37345

16 files changed

Lines changed: 328 additions & 64 deletions

structure/README.md

Lines changed: 8 additions & 6 deletions
@@ -1,7 +1,7 @@
11
The Protein Structure Modules of BioJava
22
=====================================================
33

4-
A tutorial for the protein structure modules of BioJava
4+
A tutorial for the protein structure modules of [BioJava](http://www.biojava.org)
55

66
## About
77
<table>
@@ -43,13 +43,15 @@ Chapter 7 - [SEQRES and ATOM records](seqres.md), mapping to Uniprot (SIFTs)
4343

4444
Chapter 8 - Protein [Structure Alignments](alignment.md)
4545

46-
Chapter 9 - Biological Assemblies
46+
Chapter 9 - [Biological Assemblies](bioassembly.md)
4747

48-
Chapter 10 - Protein Symmetry
48+
Chapter 10 - [External Databases](externaldb.md) like SCOP & CATH
49+
50+
Chapter 11 - Protein Symmetry
51+
52+
Chapter 12 - Bonds
4953

50-
Chapter 11 - Bonds
5154

52-
Chapter 12 - [External Databases](externaldb.md) like SCOP & CATH
5355

5456

5557
### Author:
@@ -67,6 +69,6 @@ doi: 10.1093/bioinformatics/bts494
6769

6870
The content of this tutorial is available under the [CC-BY](http://creativecommons.org/licenses/by/3.0/) license.
6971

70-
[view license](license.md)
72+
[view license](../license.md)
7173

7274

structure/alignment.md

Lines changed: 14 additions & 5 deletions
@@ -26,22 +26,31 @@ Before going the details how to use the algorithms programmatically, let's take
2626
AlignmentGui.getInstance();
2727
</pre>
2828

29-
shows this user interface:
29+
shows the following user interface.
3030

3131
![Alignment GUI](img/alignment_gui.png)
3232

33+
You can manually select protein chains, domains, or custom files to be aligned. Try to align 2hyn vs. 1zll. This will show the results in a graphical way, in 3D:
3334

35+
![3D Alignment of PDB IDs 2hyn and 1zll](img/2hyn_1zll.png)
3436

37+
and also in a 2D display that interacts with the 3D display:
3538

36-
## Combinatorial Extension (CE)
39+
![2D Alignment of PDB IDs 2hyn and 1zll](img/alignmentpanel.png)
40+
41+
The functionality to perform and visualize these alignments can of course also be used from your own code. Let's first have a look at the alignment algorithms:
42+
43+
## The Alignment Algorithms
44+
45+
### Combinatorial Extension (CE)
3746

3847
The Combinatorial Extension (CE) algorithm was originally developed by [Shindyalov and Bourne in 1998](http://peds.oxfordjournals.org/content/11/9/739.short).
3948

40-
## Combinatorial Extension with Circular Permutation (CE-CP)
49+
### Combinatorial Extension with Circular Permutation (CE-CP)
4150

42-
## FATCAT - rigid
51+
### FATCAT - rigid
4352

44-
## FATCAR - flexible
53+
### FATCAT - flexible
4554

4655

4756
## Acknowledgements

structure/bioassembly.md

Lines changed: 225 additions & 0 deletions
@@ -1,3 +1,228 @@
11
Asymmetric Unit and Biological Assembly
22
=======================================
33

4+
For many proteins, the asymmetric unit and the biological assembly are the same. However, for quite a few proteins they differ, and depending on what you are interested in, it can be important to work with the biological assembly instead of the asymmetric unit.
5+
6+
## Asymmetric Unit
7+
8+
The asymmetric unit is the smallest portion of a crystal structure to which symmetry operations can be applied in order to generate the complete unit cell (the crystal repeating unit).
9+
10+
A crystal asymmetric unit may contain:
11+
12+
* one biological assembly
13+
* a portion of a biological assembly
14+
* multiple biological assemblies
15+
16+
## Biological Assembly
17+
18+
The biological assembly (also sometimes referred to as the biological unit) is the macromolecular assembly that has either been shown to be or is believed to be the functional form of the molecule. For example, the functional form of hemoglobin has four chains.
19+
20+
The [StructureIO](http://www.biojava.org/docs/api/org/biojava3/structure/StructureIO.html) and [AtomCache](http://www.biojava.org/docs/api/org/biojava/bio/structure/align/util/AtomCache.html) classes in BioJava provide access methods to work with either the asymmetric unit or the biological assembly.
21+
22+
Let's load both representations of hemoglobin PDB ID [1HHO](http://www.rcsb.org/pdb/explore.do?structureId=1hho) and visualize them:
23+
24+
```java
25+
public static void main(String[] args){
26+
27+
try {
28+
Structure asymUnit = StructureIO.getStructure("1hho");
29+
30+
showStructure(asymUnit);
31+
32+
Structure bioAssembly = StructureIO.getBiologicalAssembly("1hho");
33+
34+
showStructure(bioAssembly);
35+
36+
} catch (Exception e){
37+
e.printStackTrace();
38+
}
39+
40+
}
41+
42+
public static void showStructure(Structure structure){
43+
44+
StructureAlignmentJmol jmolPanel = new StructureAlignmentJmol();
45+
46+
jmolPanel.setStructure(structure);
47+
48+
// send some commands to Jmol
49+
jmolPanel.evalString("select * ; color chain;");
50+
jmolPanel.evalString("select *; spacefill off; wireframe off; cartoon on; ");
51+
jmolPanel.evalString("select ligands; cartoon off; wireframe 0.3; spacefill 0.5; color cpk;");
52+
53+
}
54+
```
55+
56+
<table>
57+
<tr>
58+
<td>
59+
The <b>asymmetric unit</b> of hemoglobin PDB ID <a href="http://www.rcsb.org/pdb/explore.do?structureId=1hho">1HHO</a>
60+
</td>
61+
<td>
62+
The <b>biological assembly</b> of hemoglobin PDB ID <a href="http://www.rcsb.org/pdb/explore.do?structureId=1hho">1HHO</a>
63+
</td>
64+
</tr>
65+
<tr>
66+
<td>
67+
<img src="img/1hho_asym.png"/>
68+
</td>
69+
<td>
70+
<img src="img/1hho_biounit.png"/>
71+
</td>
72+
</tr>
73+
</table>
74+
75+
As we can see, the two representations are quite different! When investigating protein interfaces, ligand binding, and many other questions, you always want to work with the biological assembly.
76+
77+
Here is another example, the bacteriophage GA protein capsid PDB ID [1GAV](http://www.rcsb.org/pdb/explore.do?structureId=1gav):
78+
79+
<table>
80+
<tr>
81+
<td>
82+
The <b>asymmetric unit</b> of bacteriophage GA protein capsid PDB ID <a href="http://www.rcsb.org/pdb/explore.do?structureId=1gav">1GAV</a>
83+
</td>
84+
<td>
85+
The <b>biological assembly</b> of bacteriophage GA protein capsid PDB ID <a href="http://www.rcsb.org/pdb/explore.do?structureId=1gav">1GAV</a>
86+
</td>
87+
</tr>
88+
<tr>
89+
<td>
90+
<img src="img/1gav_asym.png"/>
91+
</td>
92+
<td>
93+
<img src="img/1gav_biounit.png"/>
94+
</td>
95+
</tr>
96+
</table>
97+
98+
## Re-creating Biological Assemblies
99+
100+
Since biological assemblies can be accessed via the StructureIO interface, in principle there is no need to access the lower-level code in BioJava that re-creates biological assemblies. If you are interested in the gory details, here are a couple of pointers into the code. There are two ways to obtain a biological assembly:
101+
102+
A) The biological assembly needs to be re-built, and the atom coordinates of the asymmetric unit need to be rotated and translated according to the instructions in the files. The information required to re-create the biological assemblies is available in both the PDB and mmCIF/PDBx files.
103+
104+
In PDB files the relevant transformations are stored in the *REMARK 350* records. For mmCIF/PDBx, the *_pdbx_struct_assembly* and *_pdbx_struct_oper_list* categories store the corresponding rules.
105+
106+
B) There is also a pre-computed file available that contains an assembled version of a structure. This file can be parsed directly, without having to perform rotation operations on coordinates.
107+
108+
BioJava contains utility classes to re-create biological assemblies for both PDB and mmCIF, as well as to parse the pre-computed file. The [BioUnitDataProvider](http://www.biojava.org/docs/api/org/biojava/bio/structure/quaternary/io/BioUnitDataProvider.html) interface defines what is required to re-build an assembly. The [BioUnitDataProviderFactory](http://www.biojava.org/docs/api/org/biojava/bio/structure/quaternary/io/BioUnitDataProviderFactory.html) lets you specify which of the BioUnitDataProviders is used.
109+
110+
Take a look at the method getBiologicalAssembly() in [StructureIO](http://www.biojava.org/docs/api/org/biojava/bio/structure/io/StructureIO.html) to see how the BioUnitDataProviders are used by the *BiologicalAssemblyBuilder*.
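
Option A above comes down to applying a rotation matrix and a translation vector to every atom coordinate of the asymmetric unit. The following is a minimal sketch of that arithmetic in plain Java, without any BioJava dependency. The class name `SymmetryOperatorSketch` and the method `apply` are hypothetical names chosen for illustration; this is not how the *BiologicalAssemblyBuilder* is implemented.

```java
import java.util.Arrays;

// Hypothetical illustration, not BioJava code:
// apply one operator (3x3 rotation plus translation) to a single coordinate.
class SymmetryOperatorSketch {

    /** Returns rotation * xyz + translation. */
    static double[] apply(double[][] rotation, double[] translation, double[] xyz) {
        double[] out = new double[3];
        for (int i = 0; i < 3; i++) {
            out[i] = translation[i];
            for (int j = 0; j < 3; j++) {
                out[i] += rotation[i][j] * xyz[j];
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // a two-fold rotation around the z axis, followed by a shift of 10 along x
        double[][] twoFoldZ = { {-1, 0, 0}, {0, -1, 0}, {0, 0, 1} };
        double[] shift = {10, 0, 0};
        double[] atom = {1.5, 2.0, 3.0};

        System.out.println(Arrays.toString(apply(twoFoldZ, shift, atom)));
    }
}
```

Re-building an assembly would repeat this for every atom and every operator listed for that assembly, which is why the pre-computed files of option B can be convenient.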
111+
112+
## Memory consumption
113+
114+
The example in the next section loads the structure of the PBCV-1 virus capsid (PDB ID [1M4X](http://www.rcsb.org/pdb/explore.do?structureId=1m4x)). It consists of 16 million atoms and has one of the largest, if not the largest, biological assemblies currently available in the PDB. Needless to say, it is important to increase the maximum heap size parameter, otherwise the structure cannot be loaded successfully. It requires a minimum of 9GB RAM to load (measured on Java 1.7 on OSX). You can change the heap size by providing the following startup parameter (assuming you have 10G or more of RAM available on your system):
115+
<pre>
116+
-Xmx10G
117+
</pre>
118+
119+
Note: when loading this structure with 9GB of memory, the Java VM spends a significant amount of time in garbage collection (GC). If you provide more RAM than the minimum requirement, then GC is triggered less often and the biological assembly loads faster.
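
Before starting such a large load, it can help to check how much heap the JVM was actually given. The sketch below uses only the standard `Runtime` API; the class name `HeapCheckSketch` and the helper `hasMaxHeap` are hypothetical, and the 9 GB threshold is simply the minimum quoted above.

```java
// Sketch: check the configured maximum heap before attempting a very large load.
class HeapCheckSketch {

    static final long MB = 1024L * 1024L;

    /** True if the JVM's maximum heap (as set via -Xmx) is at least requiredMb megabytes. */
    static boolean hasMaxHeap(long requiredMb) {
        return Runtime.getRuntime().maxMemory() / MB >= requiredMb;
    }

    public static void main(String[] args) {
        long requiredMb = 9L * 1024L; // the 9 GB minimum mentioned in the text

        if (hasMaxHeap(requiredMb)) {
            System.out.println("Max heap looks sufficient.");
        } else {
            long maxMb = Runtime.getRuntime().maxMemory() / MB;
            System.err.println("Max heap is only " + maxMb
                    + " MB; consider restarting with e.g. -Xmx10G");
        }
    }
}
```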
120+
121+
<table>
122+
<tr>
123+
<td>
124+
<img src="img/1m4x_bio_r_250.jpg"/>
125+
</td>
126+
</tr>
127+
<tr>
128+
<td>
129+
The biological assembly of the PBCV-1 virus capsid. (image source: <a href="http://www.rcsb.org/pdb/explore.do?structureId=1m4x">RCSB</a>)
130+
</td>
131+
</tr>
132+
</table>
133+
134+
## Low-level access to parsing pre-assembled biological assembly files
135+
136+
To load the pre-assembled biological assembly file directly, one can tweak the low-level PDB file parser like this:
137+
138+
```java
139+
140+
public static void main(String[] args){
141+
142+
public static void main(String[] args){
143+
144+
// This loads the PBCV-1 virus capsid, one of the biggest biological assemblies (if not the biggest) in terms of number of atoms.
145+
// The 1m4x.pdb1.gz file is 313 MB (compressed).
146+
// This Structure requires a minimum of 9 GB of memory to load.
147+
148+
String pdbId = "1M4X";
149+
150+
Structure bigStructure = readStructure(pdbId,1);
151+
152+
// let's take a look how much memory this consumes currently
153+
154+
Runtime r = Runtime.getRuntime();
155+
156+
// let's try to trigger the Java Garbage collector
157+
r.gc();
158+
159+
System.out.println("Memory consumption after " + pdbId +
160+
" structure has been loaded into memory:");
161+
162+
String mem = String.format("Total %dMB, Used %dMB, Free %dMB, Max %dMB",
163+
r.totalMemory() / 1048576,
164+
(r.totalMemory() - r.freeMemory()) / 1048576,
165+
r.freeMemory() / 1048576,
166+
r.maxMemory() / 1048576);
167+
168+
System.out.println(mem);
169+
170+
System.out.println("# atoms: " + StructureTools.getNrAtoms(bigStructure));
171+
172+
}
173+
/** Load a specific biological assembly for a PDB entry
174+
*
175+
* @param pdbId .. the PDB ID
176+
* @param bioAssemblyId .. the first assembly has the bioAssemblyId 1
177+
* @return a Structure object or null if something went wrong.
178+
*/
179+
public static Structure readStructure(String pdbId, int bioAssemblyId) {
180+
181+
// pre-computed files use lower case PDB IDs
182+
pdbId = pdbId.toLowerCase();
183+
184+
// we need to tweak the FileParsing parameters a bit
185+
FileParsingParameters p = new FileParsingParameters();
186+
187+
// some bio assemblies are large, we want an all atom representation and avoid
188+
// switching to a Calpha-only representation for large molecules
189+
// note, this requires several GB of memory for some of the largest assemblies, such as 1M4X
190+
p.setAtomCaThreshold(Integer.MAX_VALUE);
191+
192+
// parse remark 350
193+
p.setParseBioAssembly(true);
194+
195+
// The low level PDB file parser
196+
PDBFileReader pdbreader = new PDBFileReader();
197+
198+
// we just need this to track where to store PDB files
199+
// this checks the PDB_DIR property (and uses a tmp location if not set)
200+
AtomCache cache = new AtomCache();
201+
pdbreader.setPath(cache.getPath());
202+
203+
pdbreader.setFileParsingParameters(p);
204+
205+
// download missing files
206+
pdbreader.setAutoFetch(true);
207+
208+
pdbreader.setBioAssemblyId(bioAssemblyId);
209+
pdbreader.setBioAssemblyFallback(false);
210+
211+
Structure structure = null;
212+
try {
213+
structure = pdbreader.getStructureById(pdbId);
214+
if ( bioAssemblyId > 0 )
215+
structure.setBiologicalAssembly(true);
216+
structure.setPDBCode(pdbId);
217+
} catch (Exception e){
218+
e.printStackTrace();
219+
return null;
220+
}
221+
return structure;
222+
}
223+
```
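
The comments in the example mention the file `1m4x.pdb1.gz` and note that pre-computed files use lower-case PDB IDs. As a small sketch, the name of such a file could be derived as shown below. The general pattern (`<pdbId>.pdb<bioAssemblyId>.gz`) is an assumption inferred from that single file name, and `BioAssemblyFileNameSketch` is a hypothetical helper, not part of BioJava; how PDBFileReader locates the file internally may differ.

```java
import java.util.Locale;

// Hypothetical helper: derive the name of a pre-assembled biological assembly file.
// The naming pattern is an assumption based on the example file 1m4x.pdb1.gz.
class BioAssemblyFileNameSketch {

    static String fileName(String pdbId, int bioAssemblyId) {
        if (bioAssemblyId < 1) {
            // the first assembly has the bioAssemblyId 1
            throw new IllegalArgumentException("bioAssemblyId must be >= 1");
        }
        // pre-computed files use lower case PDB IDs
        return pdbId.toLowerCase(Locale.ROOT) + ".pdb" + bioAssemblyId + ".gz";
    }

    public static void main(String[] args) {
        System.out.println(fileName("1M4X", 1));
    }
}
```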
224+
225+
226+
## Further Reading
227+
228+
The RCSB PDB web site has a great [tutorial on Biological Assemblies](http://www.rcsb.org/pdb/101/static101.do?p=education_discussion/Looking-at-Structures/bioassembly_tutorial.html).

structure/caching.md

Lines changed: 10 additions & 10 deletions
@@ -8,16 +8,16 @@ The main class that provides this functionality is the [AtomCache](http://www.bi
88

99
It is hidden inside the StructureIO class, that we already encountered earlier.
1010

11-
<pre>
11+
```java
1212
Structure structure = StructureIO.getStructure("4hhb");
13-
</pre>
13+
```
1414

1515
is the same as
1616

17-
<pre>
17+
```java
1818
AtomCache cache = new AtomCache();
1919
cache.getStructure("4hhb");
20-
</pre>
20+
```
2121

2222

2323
## Where are the files getting written to?
@@ -33,11 +33,11 @@ you can configure the AtomCache by setting the PDB_DIR system property
3333

3434
An alternative is to hard-code the path in this way (but setting it as a property is better style)
3535

36-
<pre>
36+
```java
3737
AtomCache cache = new AtomCache();
3838

3939
cache.setPath("/path/to/pdb/files/");
40-
</pre>
40+
```
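
For comparison, the property-based approach can be sketched as follows. A system property can be passed on the command line with `-DPDB_DIR=/path/to/pdb/files/` or set programmatically via `System.setProperty`. The lookup below, which falls back to the Java temp directory when the property is not set, only illustrates the idea; `PdbDirSketch` and `resolvePdbDir` are hypothetical names and this is not AtomCache's actual implementation.

```java
// Sketch: read the PDB_DIR system property, falling back to a temp location.
// Hypothetical helper for illustration only.
class PdbDirSketch {

    static String resolvePdbDir() {
        String dir = System.getProperty("PDB_DIR");
        if (dir == null || dir.trim().isEmpty()) {
            // not configured: use the JVM's temp directory instead
            dir = System.getProperty("java.io.tmpdir");
        }
        return dir;
    }

    public static void main(String[] args) {
        System.out.println("PDB files would be stored in: " + resolvePdbDir());
    }
}
```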
4141

4242
## File Parsing Parameters
4343

@@ -47,7 +47,7 @@ class is the main place to influence the level of detail and as a consequence th
4747

4848
This example turns on the use of chemical components when loading a structure. (See also the [next chapter](chemcomp.md))
4949

50-
<pre>
50+
```java
5151
AtomCache cache = new AtomCache();
5252

5353
cache.setPath("/tmp/");
@@ -60,20 +60,20 @@ This example turns on the use of chemical components when loading a structure. (
6060

6161
Structure structure = StructureIO.getStructure("4hhb");
6262

63-
</pre>
63+
```
6464

6565
## Caching of other SCOP, CATH
6666

6767
The AtomCache not only provides access to PDB, it can also fetch Structure representations of protein domains, as defined by SCOP and CATH.
6868

69-
<pre>
69+
```java
7070
// uses a SCOP domain definition
7171
Structure domain1 = StructureIO.getStructure("d4hhba_");
7272

7373
// Get a specific protein chain, note: chain IDs are case sensitive, PDB IDs are not.
7474
Structure chain1 = StructureIO.getStructure("4HHB.A");
7575

76-
</pre>
76+
```
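
The note about case sensitivity can be made concrete with a small sketch that normalizes identifiers of the form `PDBID.CHAIN`: the PDB ID part is lower-cased, while the chain ID is kept exactly as given. `ChainIdentifierSketch` is a hypothetical helper written for this illustration; it does not reproduce how AtomCache parses identifiers and does not handle SCOP or CATH IDs.

```java
import java.util.Locale;

// Hypothetical helper: PDB IDs are case insensitive, chain IDs are case sensitive.
class ChainIdentifierSketch {

    /** Normalizes "4HHB.A" to "4hhb.A"; the chain ID after the dot is left untouched. */
    static String normalize(String id) {
        int dot = id.indexOf('.');
        if (dot < 0) {
            throw new IllegalArgumentException("expected PDBID.CHAIN, got: " + id);
        }
        String pdbId = id.substring(0, dot).toLowerCase(Locale.ROOT);
        String chainId = id.substring(dot + 1);
        return pdbId + "." + chainId;
    }

    public static void main(String[] args) {
        // same chain, PDB ID written in different case
        System.out.println(normalize("4HHB.A").equals(normalize("4hhb.A")));
        // different chains: "A" and "a" are not the same chain ID
        System.out.println(normalize("4hhb.A").equals(normalize("4hhb.a")));
    }
}
```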
7777

7878
There are quite a number of external database IDs that are supported here. See the
7979
<a href="http://www.biojava.org/docs/api/org/biojava/bio/structure/align/util/AtomCache.html#getStructure(java.lang.String)">AtomCache documentation</a> for more details on the supported options.
