Python Ecosystem
A modular, research-grade platform for health cohort data curation, quality assessment, semantic harmonization, and intelligent transformation, built for science by someone who reads and writes PEPs and who understands that beauty is found within.
Phenoteka is a scalable, extensible, and ethical platform for phenotypic data analysis, designed for cohort studies and population genomics.
It performs full ETL, semantic harmonization, quality control, and intelligent imputation of heterogeneous biomedical datasets, from clinical records to survey data.
Born in the Brazilian SUS ecosystem, Phenoteka embraces research-grade software design and aims to be a lighthouse in the fog of phenotype data.
- Full ETL Pipelines: Load, clean, transform, and export health data from multiple sources (CSV, Parquet, APIs, databases).
- Quality Metrics: Compute research-grade metrics like Completeness, Balancedness, Uniqueness, and Representativeness.
- Semantic Harmonization: Map local variables to biomedical ontologies: OMOP, SNOMED-CT, LOINC, HPO, UMLS, and more.
- Advanced Imputation: Integrate with TryDINN for deep learning-based missing data imputation.
- Modular Dashboards: Interactive Dash/Plotly UI for exploration, filtering, and data export.
- Pluggable Architecture: Extend Phenoteka with your own metrics, data sources, or pipelines without modifying core code.
- Scalable-Ready: From single-machine analysis to multi-user API-backed deployments.
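To make the Quality Metrics and Pluggable Architecture items more concrete, here is a minimal, standard-library sketch of how a user-defined metric could be registered and computed. It is not Phenoteka's actual API: the `METRICS` registry, the `register_metric` decorator, the row-of-dicts data model, and the exact metric definitions are all assumptions made for illustration.

```python
from typing import Callable

# Hypothetical data model: a dataset is a list of row dicts; None marks a missing value.
Row = dict[str, object]
Metric = Callable[[list[Row]], float]

# Hypothetical registry (an assumption, not Phenoteka's real plugin mechanism).
METRICS: dict[str, Metric] = {}


def register_metric(name: str) -> Callable[[Metric], Metric]:
    """Register a metric under a name so new metrics need no change to core code."""
    def decorator(func: Metric) -> Metric:
        METRICS[name] = func
        return func
    return decorator


@register_metric("completeness")
def completeness(rows: list[Row]) -> float:
    """Share of cells that hold a value (assumed definition)."""
    total = sum(len(row) for row in rows)
    if total == 0:
        return 0.0
    present = sum(1 for row in rows for value in row.values() if value is not None)
    return present / total


@register_metric("uniqueness")
def uniqueness(rows: list[Row]) -> float:
    """Number of distinct rows divided by the total number of rows (assumed definition)."""
    if not rows:
        return 0.0
    distinct = {tuple(sorted(row.items(), key=lambda item: item[0])) for row in rows}
    return len(distinct) / len(rows)


# Made-up sample data: one duplicated row and three missing cells.
rows = [
    {"id": 1, "age": 34, "sex": "F"},
    {"id": 2, "age": None, "sex": "M"},
    {"id": 2, "age": None, "sex": "M"},
    {"id": 3, "age": 51, "sex": None},
]

# Run every registered metric, including any added by third-party code.
report = {name: metric(rows) for name, metric in METRICS.items()}
print(report)
```

With a registry of this kind, adding a metric means writing one decorated function; the loop that builds `report` never changes.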
Phenoteka follows a modern src/ layout with clean separation of concerns.
Phenoteka/
├── src/phenoteka/
│   ├── cli/            # Command-line interface
│   ├── core/           # Config, constants, orchestration
│   ├── db/             # Database abstraction layer
│   ├── dimensions/     # Quality metrics modules
│   ├── etl/            # Extract-Transform-Load logic
│   ├── external/       # External datasets (DataSUS, PhysioNet, etc.)
│   ├── gui/            # Dash/Plotly dashboard
│   ├── harmonization/  # Ontology mapping and schema validation
│   ├── imputation/     # TryDINN integration
│   ├── network/        # API server and endpoints
│   ├── semantic/       # NLP & semantic analysis
│   ├── summary/        # Profiling and summary stats
│   └── utils/          # Reusable helper functions
# From PyPI (coming soon)
pip install phenoteka
# From source
git clone https://github.com/eggduzao/phenoteka.git
cd phenoteka
pip install -e ".[dev]"

Phenoteka comes with an argparse-based CLI for batch runs.
# Run ETL + Harmonization + Quality metrics
phenoteka run \
--input data/raw/epigen.csv \
--config configs/etl_config.yaml \
--output data/processed/epigen_clean.parquet

Available commands:
phenoteka etl # Extract, transform, load
phenoteka harmonize # Map to biomedical ontologies
phenoteka metrics # Run quality assessments
phenoteka summary # Generate profiling reports
phenoteka export # Export datasets in multiple formats

from phenoteka.etl import run_etl
from phenoteka.dimensions import completeness
df = run_etl("data/raw/study.csv", config="configs/etl_config.yaml")
score = completeness.compute(df)
print(f"Completeness: {score:.2f}")

1. Ingest: Load raw health data from files, APIs, or databases.
2. Clean & Transform: Standardize column names, types, and formats; remove identifiers.
3. Harmonize: Map variables to controlled vocabularies and ontologies.
4. Quality Assessment: Quantify completeness, balancedness, representativeness, etc.
5. Impute: Fill missing values intelligently with TryDINN.
6. Visualize & Export: Use the GUI or CLI to explore and save your results.
| Module | Purpose |
|---|---|
| cli/ | Batch-mode interface with subcommands |
| core/ | Routing, configuration, constants |
| db/ | Database connections and models |
| dimensions/ | Quality metric implementations |
| etl/ | Extract, transform, and load pipelines |
| external/ | Handlers for external datasets |
| gui/ | Dash/Plotly dashboard components |
| harmonization/ | Ontology mapping and schema checks |
| imputation/ | TryDINN adapters |
| network/ | API server and endpoints |
| semantic/ | NLP-based semantic variable analysis |
| summary/ | Profiling and summary statistics |
| utils/ | General-purpose helper functions |
Full documentation will be available at phenoteka.org/docs (coming soon).
# Lint & format
ruff check src tests
ruff format src tests
# Type check
mypy src
# Run tests
pytest --cov=phenoteka --cov-report=term-missing

We welcome contributions from researchers, developers, and data stewards. See CONTRIBUTING.md for guidelines.
Phenoteka is released under the GNU Affero General Public License, version 3 or later (AGPL-3.0-or-later).
If you use Phenoteka in your research, please cite:
@software{phenoteka2025,
author = {Gade Gusmao, Eduardo},
title = {Phenoteka: A modular platform for phenotypic data analysis},
year = {2025},
url = {https://phenoteka.org}
}
Phenoteka: inclusive, beautiful, rigorous, and ethical software for health data.