Phenoteka

A modular, research-grade platform for health cohort data curation, quality assessment, semantic harmonization, and intelligent transformation. Built for science, by someone who reads and writes PEPs and who understands that beauty is found within.

Phenoteka is a scalable, extensible, and ethical platform for phenotypic data analysis, designed for cohort studies and population genomics.
It performs full ETL, semantic harmonization, quality control, and intelligent imputation of heterogeneous biomedical datasets, from clinical records to survey data.
Born in the Brazilian SUS ecosystem, Phenoteka embraces research-grade software design and aims to be a lighthouse in the fog of phenotype data.

✨ Key Features

Full ETL Pipelines: Load, clean, transform, and export health data from multiple sources (CSV, Parquet, APIs, databases).
Quality Metrics: Compute research-grade metrics like Completeness, Balancedness, Uniqueness, and Representativeness.
Semantic Harmonization: Map local variables to biomedical ontologies: OMOP, SNOMED-CT, LOINC, HPO, UMLS, and more.
Advanced Imputation: Integrate with TryDINN for deep learning-based missing data imputation.
Modular Dashboards: Interactive Dash/Plotly UI for exploration, filtering, and data export.
Pluggable Architecture: Extend Phenoteka with your own metrics, data sources, or pipelines without modifying core code.
Scalable-Ready: From single-machine analysis to multi-user API-backed deployments.
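
To make the pluggable-architecture idea concrete, here is a minimal sketch of how a custom metric could be registered without touching core code. It is an illustration only: the names `METRICS`, `register_metric`, and `run_metrics` are hypothetical and are not taken from Phenoteka's actual API, and the table is modelled as a plain list of dicts rather than a DataFrame.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical types and registry (assumptions, not Phenoteka's real API).
Row = Dict[str, Optional[object]]
MetricFn = Callable[[List[Row]], float]
METRICS: Dict[str, MetricFn] = {}


def register_metric(name: str) -> Callable[[MetricFn], MetricFn]:
    """Decorator that adds a metric function to the registry under `name`."""
    def decorator(fn: MetricFn) -> MetricFn:
        if name in METRICS:
            raise ValueError(f"metric already registered: {name}")
        METRICS[name] = fn
        return fn
    return decorator


@register_metric("uniqueness")
def uniqueness(rows: List[Row]) -> float:
    """Fraction of rows that are distinct (1.0 means no duplicate rows)."""
    if not rows:
        return 1.0
    # Sort items by column name so key order does not affect the comparison.
    distinct = {tuple(sorted(row.items(), key=lambda kv: kv[0])) for row in rows}
    return len(distinct) / len(rows)


def run_metrics(rows: List[Row]) -> Dict[str, float]:
    """Run every registered metric and collect the scores by name."""
    return {name: fn(rows) for name, fn in METRICS.items()}
```

With this pattern, a third-party package would only need to import the decorator and define its own function; the runner picks it up through the registry.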

🗂 Project Structure

Phenoteka follows a modern src/ layout with clean separation of concerns.

Phenoteka/
├── src/phenoteka/
│   ├── cli/             # Command-line interface
│   ├── core/            # Config, constants, orchestration
│   ├── db/              # Database abstraction layer
│   ├── dimensions/      # Quality metrics modules
│   ├── etl/             # Extract-Transform-Load logic
│   ├── external/        # External datasets (DataSUS, PhysioNet, etc.)
│   ├── gui/             # Dash/Plotly dashboard
│   ├── harmonization/   # Ontology mapping and schema validation
│   ├── imputation/      # TryDINN integration
│   ├── network/         # API server and endpoints
│   ├── semantic/        # NLP & semantic analysis
│   ├── summary/         # Profiling and summary stats
│   └── utils/           # Reusable helper functions

🚀 Quick Start

1️⃣ Installation

# From PyPI (coming soon)
pip install phenoteka

# From source
git clone https://github.com/eggduzao/phenoteka.git
cd phenoteka
pip install -e ".[dev]"   # quotes keep shells such as zsh from expanding the brackets

2️⃣ Command-Line Usage

Phenoteka comes with an argparse-based CLI for batch runs.

# Run ETL + Harmonization + Quality metrics
phenoteka run \
    --input data/raw/epigen.csv \
    --config configs/etl_config.yaml \
    --output data/processed/epigen_clean.parquet

Available commands:

phenoteka etl         # Extract, transform, load
phenoteka harmonize   # Map to biomedical ontologies
phenoteka metrics     # Run quality assessments
phenoteka summary     # Generate profiling reports
phenoteka export      # Export datasets in multiple formats
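
For readers unfamiliar with argparse subcommands, the following is a rough sketch of how a CLI with the commands listed above could be wired together. It is not Phenoteka's actual `cli/` code: `build_parser` is a hypothetical name, and attaching `--input`, `--config`, and `--output` to every subcommand is an assumption made for illustration.

```python
import argparse

# Subcommand names and help texts copied from the list above.
COMMANDS = [
    ("etl", "Extract, transform, load"),
    ("harmonize", "Map to biomedical ontologies"),
    ("metrics", "Run quality assessments"),
    ("summary", "Generate profiling reports"),
    ("export", "Export datasets in multiple formats"),
]


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with one subparser per command (hypothetical layout)."""
    parser = argparse.ArgumentParser(prog="phenoteka")
    subparsers = parser.add_subparsers(dest="command", required=True)
    for name, help_text in COMMANDS:
        sub = subparsers.add_parser(name, help=help_text)
        # Assumption: every subcommand accepts the same three options.
        sub.add_argument("--input", required=True, help="Path to the input file")
        sub.add_argument("--config", help="Path to a YAML configuration file")
        sub.add_argument("--output", help="Path for the processed output")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(f"Would run '{args.command}' on {args.input}")
```

Setting `dest="command"` lets the caller dispatch on `args.command`, and `required=True` makes argparse exit with a usage error when no subcommand is given.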

3️⃣ Python API Usage

from phenoteka.etl import run_etl
from phenoteka.dimensions import completeness

df = run_etl("data/raw/study.csv", config="configs/etl_config.yaml")
score = completeness.compute(df)

print(f"Completeness: {score:.2f}")
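
To give a feel for what a completeness score can represent, here is a small standalone sketch that computes the fraction of non-missing cells in a table. It is one common definition, not necessarily the one `completeness.compute` uses, and it works on a plain list of dicts so that it runs without any dependencies; `is_missing` and `completeness_score` are hypothetical helper names.

```python
import math
from typing import Dict, List, Optional, Sequence

Row = Dict[str, Optional[object]]


def is_missing(value: object) -> bool:
    """Treat None, NaN, and blank strings as missing (an assumed convention)."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    if isinstance(value, str) and value.strip() == "":
        return True
    return False


def completeness_score(rows: List[Row], columns: Sequence[str]) -> float:
    """Return the share of cells, over the given columns, that hold a value."""
    total = len(rows) * len(columns)
    if total == 0:
        return 1.0  # nothing to be missing from an empty table
    present = sum(
        1 for row in rows for col in columns if not is_missing(row.get(col))
    )
    return present / total


rows = [
    {"age": 34, "sex": "F"},
    {"age": None, "sex": "M"},
    {"age": 51, "sex": ""},
]
print(f"Completeness: {completeness_score(rows, ['age', 'sex']):.2f}")  # prints "Completeness: 0.67"
```

Four of the six cells hold a value, so the score is 4/6. A column absent from a row counts as missing because `row.get(col)` returns `None`.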

📊 Example Workflow

1. Ingest: Load raw health data from files, APIs, or databases.
2. Clean & Transform: Standardize column names, types, and formats; remove identifiers.
3. Harmonize: Map variables to controlled vocabularies and ontologies.
4. Quality Assessment: Quantify completeness, balancedness, representativeness, etc.
5. Impute: Fill missing values intelligently with TryDINN.
6. Visualize & Export: Use the GUI or CLI to explore and save your results.
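
The clean and harmonize steps of this workflow can be sketched as a chain of small functions. This is a toy illustration under stated assumptions, not Phenoteka's implementation: the identifier column names, the `VOCABULARY` mapping, and its `EXAMPLE:` codes are placeholders invented for the example, not real ontology identifiers.

```python
import re
from typing import Callable, Dict, List

Row = Dict[str, object]

# Hypothetical identifier columns to drop during cleaning.
IDENTIFIER_COLUMNS = {"patient_name", "national_id"}

# Hypothetical local-variable to vocabulary mapping (placeholder codes).
VOCABULARY = {"sbp": "EXAMPLE:0001", "weight_kg": "EXAMPLE:0002"}


def standardize_name(name: str) -> str:
    """Lowercase a column name and collapse non-alphanumeric runs into '_'."""
    return re.sub(r"[^0-9a-z]+", "_", name.strip().lower()).strip("_")


def clean(rows: List[Row]) -> List[Row]:
    """Standardize column names, then remove identifier columns."""
    cleaned = []
    for row in rows:
        renamed = {standardize_name(key): value for key, value in row.items()}
        cleaned.append(
            {k: v for k, v in renamed.items() if k not in IDENTIFIER_COLUMNS}
        )
    return cleaned


def harmonize(rows: List[Row]) -> List[Row]:
    """Rename mapped variables to their vocabulary code; keep the rest as is."""
    return [{VOCABULARY.get(k, k): v for k, v in row.items()} for row in rows]


def run_pipeline(rows: List[Row]) -> List[Row]:
    """Apply each step in order, feeding the output of one into the next."""
    steps: List[Callable[[List[Row]], List[Row]]] = [clean, harmonize]
    for step in steps:
        rows = step(rows)
    return rows


raw = [{"Patient Name": "Jane Doe", "SBP ": 120, "Weight (kg)": 70.5}]
print(run_pipeline(raw))
```

Keeping each step as a function with the same input and output type makes it straightforward to insert further stages, such as quality assessment or imputation, into the `steps` list.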

🧩 Module Overview

Module	Purpose
cli/	Batch-mode interface with subcommands
core/	Routing, configuration, constants
db/	Database connections and models
dimensions/	Quality metric implementations
etl/	Extract, transform, and load pipelines
external/	Handlers for external datasets
gui/	Dash/Plotly dashboard components
harmonization/	Ontology mapping and schema checks
imputation/	TryDINN adapters
network/	API server and endpoints
semantic/	NLP-based semantic variable analysis
summary/	Profiling and summary statistics
utils/	General-purpose helper functions

📚 Documentation

Full documentation is available at: phenoteka.org/docs (coming soon)

🛠 Development

# Lint & format
ruff check src tests
ruff format src tests

# Type check
mypy src

# Run tests
pytest --cov=phenoteka --cov-report=term-missing

🤝 Contributing

We welcome contributions from researchers, developers, and data stewards. See CONTRIBUTING.md for guidelines.

📜 License

Phenoteka is released under the GNU Affero General Public License, version 3 or later (AGPL-3.0-or-later).

🌟 Citation

If you use Phenoteka in your research, please cite:

@software{phenoteka2025,
  author       = {Gade Gusmao, Eduardo},
  title        = {Phenoteka: A modular platform for phenotypic data analysis},
  year         = {2025},
  url          = {https://phenoteka.org}
}

Phenoteka: inclusive, beautiful, rigorous, and ethical software for health data.
