PDBParser is not thread-safe when used in ThreadPoolExecutor #5131

@yiyabo

Summary
Bio.PDB.PDBParser appears to have thread-safety issues when the same instance is shared across multiple threads in a ThreadPoolExecutor. This causes sporadic parsing failures with errors like "0 defined twice" and "'Atom' object has no attribute 'selected_child'".

Environment
Python: 3.11
BioPython: 1.86
OS: Linux (CentOS/Ubuntu)

Minimal Reproduction

from concurrent.futures import ThreadPoolExecutor
from Bio.PDB import PDBParser
from pathlib import Path

def process_pdb(pdb_path, parser):
    structure = parser.get_structure(pdb_path.stem, str(pdb_path))
    return len(list(structure.get_residues()))

# Shared parser instance
parser = PDBParser(QUIET=True, PERMISSIVE=1)
pdb_files = list(Path("pdb_directory").glob("*.pdb"))

# This fails randomly with "0 defined twice" errors
with ThreadPoolExecutor(max_workers=64) as executor:
    results = list(executor.map(lambda p: process_pdb(p, parser), pdb_files))

Expected Behavior
Parsing should succeed for all valid PDB files.

Actual Behavior
Random failures with errors:

  • "0 defined twice" (residue ID collision)
  • "'Atom' object has no attribute 'selected_child'"

~70% of files fail when using 64 threads, while 100% succeed when processed sequentially.

Workaround
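Create a new PDBParser instance inside the worker function, so that each call gets its own parser and no parser state is shared between threads:

from Bio.PDB import PDBParser

def process_pdb(pdb_path):
    parser = PDBParser(QUIET=True)  # New instance per call; never shared across threads
    structure = parser.get_structure(pdb_path.stem, str(pdb_path))
    return len(list(structure.get_residues()))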

Or use ProcessPoolExecutor instead of ThreadPoolExecutor.
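
For the ProcessPoolExecutor route, note that the lambda from the reproduction cannot be used as-is, because work submitted to a process pool has to be picklable. Below is a minimal sketch under that constraint. The names `count_residues` and `count_all`, the `worker` parameter, and the `max_workers` default are illustrative choices, not part of Biopython or the original report.

```python
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path


def count_residues(pdb_path):
    """Parse one PDB file in the worker process and count its residues."""
    # Imported here so this module can be imported without Biopython installed.
    # The parser is built inside the worker, so no parser object is shared.
    from Bio.PDB import PDBParser

    parser = PDBParser(QUIET=True)
    structure = parser.get_structure(Path(pdb_path).stem, str(pdb_path))
    return sum(1 for _ in structure.get_residues())


def count_all(pdb_dir, max_workers=4, worker=count_residues):
    """Map file name -> residue count for every *.pdb file in pdb_dir."""
    pdb_files = sorted(Path(pdb_dir).glob("*.pdb"))
    # A module-level function (not a lambda) is required here, because the
    # callable and its arguments are pickled and sent to the worker processes.
    with ProcessPoolExecutor(max_workers=max_workers) as executor:
        counts = executor.map(worker, pdb_files)
        return dict(zip((p.name for p in pdb_files), counts))
```

On platforms that start worker processes with spawn, `count_all("pdb_directory")` should be called from under an `if __name__ == "__main__":` guard.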

Suggestion
Consider documenting that PDBParser is not thread-safe, or making internal state thread-local.
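
Until something like this exists in the library, the thread-local idea can be approximated on the caller's side. The following is a rough sketch, not Biopython's implementation: `_thread_state`, `_new_parser`, `get_parser`, and the `factory` parameter are hypothetical names introduced here. It assumes that reusing one parser sequentially within a single thread is safe, which matches the report that sequential processing succeeds.

```python
import threading
from pathlib import Path

# Each thread sees its own independent attributes on this object.
_thread_state = threading.local()


def _new_parser():
    # Imported lazily so this module can be imported without Biopython installed.
    from Bio.PDB import PDBParser

    return PDBParser(QUIET=True)


def get_parser(factory=_new_parser):
    """Return the calling thread's parser, creating it on first use."""
    parser = getattr(_thread_state, "parser", None)
    if parser is None:
        parser = factory()
        _thread_state.parser = parser
    return parser


def count_residues(pdb_path):
    """Parse one file with this thread's own parser and count residues."""
    structure = get_parser().get_structure(Path(pdb_path).stem, str(pdb_path))
    return sum(1 for _ in structure.get_residues())
```

With this in place, `count_residues` can be passed directly to `ThreadPoolExecutor.map`. Each pool thread then constructs one parser and reuses it for every file it handles, rather than building a new parser per call.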
