Summary
Bio.PDB.PDBParser appears to have thread-safety issues when the same instance is shared across multiple threads in a ThreadPoolExecutor. This causes sporadic parsing failures with errors like "0 defined twice" and "'Atom' object has no attribute 'selected_child'".
Environment
Python: 3.11
BioPython: 1.86
OS: Linux (CentOS/Ubuntu)
Minimal Reproducible Example
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

from Bio.PDB import PDBParser


def process_pdb(pdb_path, parser):
    structure = parser.get_structure(pdb_path.stem, str(pdb_path))
    return len(list(structure.get_residues()))


# Shared parser instance
parser = PDBParser(QUIET=True, PERMISSIVE=1)
pdb_files = list(Path("pdb_directory").glob("*.pdb"))

# This fails randomly with "0 defined twice" errors
with ThreadPoolExecutor(max_workers=64) as executor:
    results = list(executor.map(lambda p: process_pdb(p, parser), pdb_files))
Expected Behavior
Parsing should succeed for all valid PDB files.
Actual Behavior
Random failures with errors:
- "0 defined twice" (residue ID collision)
- "'Atom' object has no attribute 'selected_child'"
About 70% of files fail when parsed with 64 threads, while all files parse successfully when processed sequentially.
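To measure a failure rate like the one above, each file can be parsed inside its own try/except so that a single error does not abort the whole run. The following is a minimal sketch of such a harness, not part of Biopython: `collect_failures` and its parameters are hypothetical names, and `func` stands in for a call such as `lambda p: process_pdb(p, parser)`.

```python
from concurrent.futures import ThreadPoolExecutor


def collect_failures(func, items, max_workers=64):
    """Run func on every item in a thread pool and record exceptions per item.

    Hypothetical helper for illustration. Returns (results, failures):
    results maps item -> return value, failures maps item -> "Type: message".
    """
    def safe_call(item):
        try:
            return item, func(item), None
        except Exception as exc:  # record the error instead of aborting the run
            return item, None, f"{type(exc).__name__}: {exc}"

    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        for item, value, error in executor.map(safe_call, items):
            if error is None:
                results[item] = value
            else:
                failures[item] = error
    return results, failures
```

With the reproduction script, `results, failures = collect_failures(lambda p: process_pdb(p, parser), pdb_files)` followed by `len(failures) / len(pdb_files)` would give the failing fraction, and the values of `failures` show which error messages occur.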
Workaround
Create a new PDBParser instance inside the worker function instead of sharing one across threads:
def process_pdb(pdb_path):
    parser = PDBParser(QUIET=True)  # New instance per call
    structure = parser.get_structure(pdb_path.stem, str(pdb_path))
    return len(list(structure.get_residues()))
Alternatively, use ProcessPoolExecutor instead of ThreadPoolExecutor, so that workers run in separate processes and do not share a parser object.
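If constructing a parser on every call is undesirable, a middle ground is one parser per worker thread, stored in `threading.local`. The sketch below assumes the problem is confined to state held on the parser instance; `get_thread_parser`, `count_residues`, `parse_all`, and the `factory` argument are hypothetical names. The factory is passed in so the helper itself has no Biopython dependency.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Each thread sees its own attributes on this object.
_local = threading.local()


def get_thread_parser(factory):
    """Return the calling thread's parser, creating it on first use."""
    parser = getattr(_local, "parser", None)
    if parser is None:
        parser = factory()
        _local.parser = parser
    return parser


def count_residues(pdb_path, factory):
    """Parse one file with the current thread's parser and count residues."""
    parser = get_thread_parser(factory)
    structure = parser.get_structure(pdb_path.stem, str(pdb_path))
    return sum(1 for _ in structure.get_residues())


def parse_all(pdb_files, factory, max_workers=8):
    """Parse all files in a thread pool; at most max_workers parsers are created."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(lambda p: count_residues(p, factory), pdb_files))
```

A call would look like `parse_all(pdb_files, lambda: PDBParser(QUIET=True))`. No parser object is ever touched by two threads, and the number of parsers created is bounded by the pool size rather than the number of files.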
Suggestion
Consider documenting that PDBParser is not thread-safe, or making internal state thread-local.
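As a rough illustration of what thread-local state could look like from the caller's side, the sketch below wraps a parser factory in an object that can be shared freely while delegating to a separate underlying parser per thread. This is not how Biopython is implemented and `PerThreadParser` is a hypothetical name; it only assumes the wrapped object offers `get_structure(id, file)`, as PDBParser does.

```python
import threading


class PerThreadParser:
    """Hypothetical wrapper: one shared object, one underlying parser per thread."""

    def __init__(self, factory):
        self._factory = factory          # zero-argument callable returning a parser
        self._local = threading.local()  # per-thread storage for the real parser

    def _parser(self):
        parser = getattr(self._local, "parser", None)
        if parser is None:
            parser = self._factory()
            self._local.parser = parser
        return parser

    def get_structure(self, structure_id, file):
        # Delegate to the calling thread's own parser instance.
        return self._parser().get_structure(structure_id, file)
```

With this wrapper, `parser = PerThreadParser(lambda: PDBParser(QUIET=True, PERMISSIVE=1))` could be passed to a worker function that expects a shared parser, without changing the worker itself.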