eggduzao/Wildlife

Phenoteka

🧬 DevOps, Meta & Vibes 💖

Phenoteka Sync Engine Phenoteka Release Last Commit Made with Love

🌍 Equity, Diversity & Inclusive Software

Global Health Diversity Encouraged Inclusive Software Responsive UI Color Blind Friendly Accessibility SUS Public Health Made In Brazil

🌸 Python Ecosystem

Python Modular Python Pandas NumPy SciPy Jupyter Regex

🎨 Visualization

OpenPyXL PyArrow Scikit Learn Matplotlib Seaborn

🌐 Web, Dashboards & Front-End

Dash Plotly Flask HTML5 CSS3 NetworkX

🖥️ Dev Environment & Tooling

Git GitHub Sublime Text Gedit VSCode Ubuntu Mac Friendly

🧠 Machine Learning & Deep Learning

XGBoost LightGBM TensorFlow Keras PyTorch HuggingFace Transformers

🧪 Testing, Linting, and Quality Tools

pytest tox black flake8 mypy pre-commit

📚 Documentation & Science Repos

YAML Sphinx Quarto Markdown Binder Zenodo

🐍 Relevant Programming Languages

Bash R C C++ C# Julia Rust Java JavaScript Go Swift Kotlin

💾 Databases & Query Systems

SQL SQLite SQLAlchemy PostgreSQL MySQL MongoDB DuckDB BigQuery

⚙️ Environment & Package Managers

Conda Miniconda Mamba Micromamba PyPI

🧰 Orchestration & CI/CD

Docker Kubernetes GitHub Actions Travis CI

🔐 Secrets, Security & Infra

Vault Encryption Secure Software Zero Trust

A modular, research-grade platform for health cohort data curation, quality assessment, semantic harmonization, and intelligent transformation - built for science, by someone who reads and writes PEPs and who understands that beauty is found within.

Phenoteka is a scalable, extensible, and ethical platform for phenotypic data analysis, designed for cohort studies and population genomics.
It performs full ETL, semantic harmonization, quality control, and intelligent imputation of heterogeneous biomedical datasets, from clinical records to survey data.
Born in the Brazilian SUS ecosystem, Phenoteka embraces research-grade software design and aims to be a lighthouse in the fog of phenotype data.


✨ Key Features

  • Full ETL Pipelines
    Load, clean, transform, and export health data from multiple sources (CSV, Parquet, APIs, databases).
  • Quality Metrics
    Compute research-grade metrics like Completeness, Balancedness, Uniqueness, and Representativeness.
  • Semantic Harmonization
    Map local variables to biomedical ontologies: OMOP, SNOMED-CT, LOINC, HPO, UMLS, and more.
  • Advanced Imputation
    Integrate with TryDINN for deep learning-based missing data imputation.
  • Modular Dashboards
    Interactive Dash/Plotly UI for exploration, filtering, and data export.
  • Pluggable Architecture
    Extend Phenoteka with your own metrics, data sources, or pipelines without modifying core code.
  • Scalable-Ready
    From single-machine analysis to multi-user API-backed deployments.
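
As a rough illustration of how a pluggable metric could be added without touching core code, the sketch below uses a small decorator-based registry. It is not Phenoteka's actual API: `METRICS`, `register_metric`, and `run_metrics` are hypothetical names, and the definitions of completeness (share of non-missing cells) and uniqueness (share of distinct rows) are simple assumptions chosen for the example.

```python
from typing import Callable, Dict, List, Optional

Row = Dict[str, Optional[str]]
Metric = Callable[[List[Row]], float]

# Hypothetical registry: maps a metric name to the function that computes it.
METRICS: Dict[str, Metric] = {}


def register_metric(name: str) -> Callable[[Metric], Metric]:
    """Decorator that stores a metric function in the registry under `name`."""
    def decorator(func: Metric) -> Metric:
        METRICS[name] = func
        return func
    return decorator


@register_metric("completeness")
def completeness(rows: List[Row]) -> float:
    """Fraction of cells that are neither None nor an empty string (assumed definition)."""
    cells = [value for row in rows for value in row.values()]
    if not cells:
        return 0.0
    filled = sum(1 for value in cells if value not in (None, ""))
    return filled / len(cells)


@register_metric("uniqueness")
def uniqueness(rows: List[Row]) -> float:
    """Number of distinct rows divided by the total number of rows (assumed definition)."""
    if not rows:
        return 0.0
    distinct = {tuple(sorted(row.items(), key=lambda kv: kv[0])) for row in rows}
    return len(distinct) / len(rows)


def run_metrics(rows: List[Row]) -> Dict[str, float]:
    """Apply every registered metric to the same records."""
    return {name: func(rows) for name, func in METRICS.items()}


# Made-up records: one empty cell, one None cell, and one duplicated row.
sample = [
    {"id": "1", "sex": "F", "age": "34"},
    {"id": "2", "sex": "", "age": "51"},
    {"id": "1", "sex": "F", "age": "34"},
    {"id": "3", "sex": "M", "age": None},
]
scores = run_metrics(sample)
for name, value in scores.items():
    print(f"{name}: {value:.2f}")
```

With this design, a new metric only needs its own decorated function; the loop in `run_metrics` picks it up without further changes.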

🗂 Project Structure

Phenoteka follows a modern src/ layout with clean separation of concerns.

Phenoteka/
├── src/phenoteka/
│   ├── cli/             # Command-line interface
│   ├── core/            # Config, constants, orchestration
│   ├── db/              # Database abstraction layer
│   ├── dimensions/      # Quality metrics modules
│   ├── etl/             # Extract-Transform-Load logic
│   ├── external/        # External datasets (DataSUS, PhysioNet, etc.)
│   ├── gui/             # Dash/Plotly dashboard
│   ├── harmonization/   # Ontology mapping and schema validation
│   ├── imputation/      # TryDINN integration
│   ├── network/         # API server and endpoints
│   ├── semantic/        # NLP & semantic analysis
│   ├── summary/         # Profiling and summary stats
│   └── utils/           # Reusable helper functions

🚀 Quick Start

1️⃣ Installation

# From PyPI (coming soon)
pip install phenoteka

# From source
git clone https://github.com/eggduzao/phenoteka.git
cd phenoteka
pip install -e ".[dev]"   # quotes keep shells such as zsh from expanding the brackets

2️⃣ Command-Line Usage

Phenoteka comes with an argparse-based CLI for batch runs.

# Run ETL + Harmonization + Quality metrics
phenoteka run \
    --input data/raw/epigen.csv \
    --config configs/etl_config.yaml \
    --output data/processed/epigen_clean.parquet

Available commands:

phenoteka etl         # Extract, transform, load
phenoteka harmonize   # Map to biomedical ontologies
phenoteka metrics     # Run quality assessments
phenoteka summary     # Generate profiling reports
phenoteka export      # Export datasets in multiple formats
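
To show how an argparse-based CLI with these subcommands might be wired, here is a minimal sketch. The subcommand names and help strings come from the list above (plus `run` from the earlier example); everything else is an assumption: giving every subcommand the same `--input`, `--config`, and `--output` options, and the `build_parser` and `main` helpers, are illustrative choices, not Phenoteka's actual implementation.

```python
import argparse
from typing import List, Optional

# Subcommand names and descriptions as listed in this README.
COMMANDS = {
    "run": "Run ETL + harmonization + quality metrics",
    "etl": "Extract, transform, load",
    "harmonize": "Map to biomedical ontologies",
    "metrics": "Run quality assessments",
    "summary": "Generate profiling reports",
    "export": "Export datasets in multiple formats",
}


def build_parser() -> argparse.ArgumentParser:
    """Create a parser with one subparser per command."""
    parser = argparse.ArgumentParser(prog="phenoteka")
    subparsers = parser.add_subparsers(dest="command", required=True)
    for name, help_text in COMMANDS.items():
        sub = subparsers.add_parser(name, help=help_text)
        # Assumed options, mirroring the flags of the `phenoteka run` example.
        sub.add_argument("--input", required=True, help="Path to the input dataset")
        sub.add_argument("--config", default=None, help="Path to a YAML config file")
        sub.add_argument("--output", default=None, help="Path for the result")
    return parser


def main(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse arguments; a real CLI would dispatch to the matching module here."""
    return build_parser().parse_args(argv)


args = main(["metrics", "--input", "data/raw/epigen.csv"])
print(args.command, args.input)
```

Passing `argv` explicitly keeps `main` easy to call from tests; with `argv=None`, argparse falls back to `sys.argv`.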

3️⃣ Python API Usage

from phenoteka.etl import run_etl
from phenoteka.dimensions import completeness

df = run_etl("data/raw/study.csv", config="configs/etl_config.yaml")
score = completeness.compute(df)

print(f"Completeness: {score:.2f}")

📊 Example Workflow

1. Ingest
   Load raw health data from files, APIs, or databases.
2. Clean & Transform
   Standardize column names, types, and formats; remove identifiers.
3. Harmonize
   Map variables to controlled vocabularies and ontologies.
4. Quality Assessment
   Quantify completeness, balancedness, representativeness, etc.
5. Impute
   Fill missing values intelligently with TryDINN.
6. Visualize & Export
   Use the GUI or CLI to explore and save your results.
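
The first four workflow steps can be sketched with the standard library alone. This is a toy illustration under stated assumptions, not Phenoteka's code: the sample records, the `IDENTIFIER_COLUMNS` set, the `EXAMPLE:` vocabulary codes, and all function names are made up for the example, and imputation and visualization (steps 5 and 6) are left out.

```python
import csv
import io
from typing import Dict, List, Optional

Row = Dict[str, Optional[str]]

# Made-up raw data with messy headers and two missing values.
RAW_CSV = """Patient Name,Sex ,Height (cm),Weight (kg)
Ana,F,162,58
Bruno,M,,81
Carla,F,170,
"""

# Hypothetical settings, for illustration only.
IDENTIFIER_COLUMNS = {"patient_name"}
VOCABULARY_MAP = {
    "sex": "EXAMPLE:SEX",
    "height_cm": "EXAMPLE:BODY_HEIGHT",
    "weight_kg": "EXAMPLE:BODY_WEIGHT",
}


def ingest(text: str) -> List[Row]:
    """Step 1: read raw records (here from an in-memory CSV string)."""
    return list(csv.DictReader(io.StringIO(text)))


def normalize_name(name: str) -> str:
    """Lower-case a column name and collapse non-alphanumeric runs into '_'."""
    cleaned = "".join(ch if ch.isalnum() else "_" for ch in name.strip().lower())
    return "_".join(part for part in cleaned.split("_") if part)


def clean(rows: List[Row]) -> List[Row]:
    """Step 2: standardize names, turn blanks into None, drop identifier columns."""
    result: List[Row] = []
    for row in rows:
        new_row: Row = {}
        for key, value in row.items():
            name = normalize_name(key)
            if name in IDENTIFIER_COLUMNS:
                continue
            new_row[name] = value.strip() if value and value.strip() else None
        result.append(new_row)
    return result


def harmonize(rows: List[Row]) -> List[Row]:
    """Step 3: rename columns to vocabulary codes; unmapped columns keep their name."""
    return [{VOCABULARY_MAP.get(k, k): v for k, v in row.items()} for row in rows]


def completeness(rows: List[Row]) -> float:
    """Step 4: fraction of cells that are not missing."""
    cells = [value for row in rows for value in row.values()]
    return sum(value is not None for value in cells) / len(cells) if cells else 0.0


records = harmonize(clean(ingest(RAW_CSV)))
score = completeness(records)
print(f"Completeness: {score:.2f}")
```

Keeping each step a plain function from records to records makes it straightforward to reorder, skip, or replace a step.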


🧩 Module Overview

Module           Purpose
cli/             Batch-mode interface with subcommands
core/            Routing, configuration, constants
db/              Database connections and models
dimensions/      Quality metric implementations
etl/             Extract, transform, and load pipelines
external/        Handlers for external datasets
gui/             Dash/Plotly dashboard components
harmonization/   Ontology mapping and schema checks
imputation/      TryDINN adapters
network/         API server and endpoints
semantic/        NLP-based semantic variable analysis
summary/         Profiling and summary statistics
utils/           General-purpose helper functions

📚 Documentation

Full documentation is available at: phenoteka.org/docs (coming soon)


🛠 Development

# Lint & format
ruff check src tests
ruff format src tests

# Type check
mypy src

# Run tests
pytest --cov=phenoteka --cov-report=term-missing

🀝 Contributing

We welcome contributions from researchers, developers, and data stewards. See CONTRIBUTING.md for guidelines.


📜 License

Phenoteka is released under the GNU Affero General Public License, version 3 or later (AGPL-3.0-or-later).


🌟 Citation

If you use Phenoteka in your research, please cite:

@software{phenoteka2025,
  author       = {Gade Gusmao, Eduardo},
  title        = {Phenoteka: A modular platform for phenotypic data analysis},
  year         = {2025},
  url          = {https://phenoteka.org}
}

Phenoteka: inclusive, beautiful, rigorous, and ethical software for health data.


About

A unified deep learning framework for high-performance multimodal data imputation, integrating neural operators for tabular, EHR, imaging, audio, video, and biological datasets
