Adeliyio

Tommy Adeliyi

AI Engineer · LLM Systems · Evaluation & Reliability

I build production-grade AI systems that organizations can trust.

My work goes beyond prototypes and demos — I focus on the engineering discipline required to make AI reliable in real-world environments: evaluation systems, calibrated confidence, self-correction loops, safety guardrails, and cost-aware architectures.

🧭 Engineering Philosophy

Most AI systems don’t fail loudly; they fail silently and convincingly.

A model works on five examples, hallucinates on the sixth, and there’s no infrastructure to catch it. I design systems for that reality.

Evaluation is a system → regression testing, golden datasets, LLM-as-judge
Confidence is engineered → calibrated scoring, uncertainty-aware escalation
Safety is layered → deterministic checks + semantic review
Cost matters → hybrid routing (frontier + local models)
Impact is measured → A/B testing, calibration curves, failure analysis
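To make "evaluation is a system" concrete, here is a minimal sketch of a golden-dataset regression gate. It is illustrative only and not taken from any of the repositories below: `GoldenCase`, `pass_rate`, `check_regression`, the substring-match criterion, and the 0.02 tolerance are all assumptions, and `stub_model` stands in for a real model call.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class GoldenCase:
    prompt: str
    must_contain: str  # substring the answer must include to count as a pass


def pass_rate(model: Callable[[str], str], cases: list[GoldenCase]) -> float:
    """Fraction of golden cases whose output contains the expected substring."""
    if not cases:
        raise ValueError("golden dataset is empty")
    passed = sum(
        case.must_contain.lower() in model(case.prompt).lower() for case in cases
    )
    return passed / len(cases)


def check_regression(current: float, baseline: float, tolerance: float = 0.02) -> bool:
    """True if the pass rate has not dropped more than `tolerance` below baseline."""
    return current >= baseline - tolerance


# Hypothetical stand-in for a real model call; one answer is deliberately wrong.
def stub_model(prompt: str) -> str:
    answers = {"capital of France?": "The capital is Paris.", "2 + 2?": "5"}
    return answers.get(prompt, "")


GOLDEN = [
    GoldenCase("capital of France?", "paris"),
    GoldenCase("2 + 2?", "4"),
]

if __name__ == "__main__":
    rate = pass_rate(stub_model, GOLDEN)
    print(f"pass rate: {rate:.2f}, ok: {check_regression(rate, baseline=0.90)}")
```

In a CI job, a `False` result from `check_regression` would fail the build, so a silent quality drop surfaces the same way a failing unit test does.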

🚀 Featured Systems

🔧 LLM Reliability Engine

→ https://github.com/Adeliyio/ai-system-debugger

LLM systems don’t fail like traditional software — they fail silently with plausible but incorrect outputs.

Ensemble failure detection (LLM-as-judge + embeddings + rules)
Root cause analysis (retrieval, prompt, model, context)
Self-healing with regression-tested fixes
Hybrid routing (GPT-4o + Llama via Ollama)
Meta-evaluation of evaluator reliability
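One way the ensemble idea could look in code is a weighted combination of the three signals. This is a sketch under stated assumptions, not the repository's implementation: the `Signals` fields, the weights, and the 0.5 threshold are hypothetical, and the judge and embedding scores are assumed to be computed elsewhere.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signals:
    judge_fail: float     # hypothetical LLM-as-judge failure score in [0, 1]
    embedding_sim: float  # similarity between answer and source context in [0, 1]
    rule_violations: int  # number of deterministic rule hits


# Assumed weights for illustration; in practice these would be tuned on labeled failures.
WEIGHTS = {"judge": 0.5, "embedding": 0.3, "rules": 0.2}


def failure_score(s: Signals) -> float:
    """Weighted average of the three detectors, each mapped to a failure score in [0, 1]."""
    embedding_fail = 1.0 - s.embedding_sim  # low similarity suggests an ungrounded answer
    rules_fail = 1.0 if s.rule_violations > 0 else 0.0
    return (
        WEIGHTS["judge"] * s.judge_fail
        + WEIGHTS["embedding"] * embedding_fail
        + WEIGHTS["rules"] * rules_fail
    )


def is_failure(s: Signals, threshold: float = 0.5) -> bool:
    """Flag the output when the combined score reaches the (assumed) threshold."""
    return failure_score(s) >= threshold


if __name__ == "__main__":
    suspicious = Signals(judge_fail=0.9, embedding_sim=0.2, rule_violations=1)
    healthy = Signals(judge_fail=0.1, embedding_sim=0.9, rule_violations=0)
    print(is_failure(suspicious), is_failure(healthy))
```

Combining independent detectors means a single noisy signal, such as an overconfident judge, is less likely to decide the outcome on its own.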

📊 Financial Intelligence Copilot (10-K Q&A)

→ https://github.com/Adeliyio/SEC-filings-knowledge-copilot

A hallucinated financial figure is worse than no answer — correctness must be provable.

Multi-agent LangGraph reasoning pipeline
Claim-level grounding + hallucination correction
RAGAS + regression-based evaluation
Transparent confidence + citations
Fully local deployment (Ollama)

🎧 AI Support Copilot

→ https://github.com/Adeliyio/customer-support-copilot

The goal isn’t generating replies — it’s ensuring responses are safe to send.

Agentic classification with deterministic fallback
Hybrid retrieval (dense + BM25 + reranking)
Dual-layer safety (rules + LLM review)
Confidence-aware escalation
Five-layer evaluation pipeline
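The hybrid retrieval step needs some way to merge the dense and BM25 result lists before reranking. The sketch below assumes reciprocal rank fusion (RRF), a common choice that the project description does not name, so treat it as one plausible option. The document IDs are made up, and `k=60` is the constant conventionally used with RRF.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists: each document scores sum(1 / (k + rank)) over the lists it appears in."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical top-3 results from each retriever for the same query.
dense_hits = ["doc_refund", "doc_shipping", "doc_billing"]
bm25_hits = ["doc_billing", "doc_refund", "doc_returns"]

fused = reciprocal_rank_fusion([dense_hits, bm25_hits])

if __name__ == "__main__":
    print(fused)
```

Documents that both retrievers rank highly rise to the top, and the fused list would then go to a cross-encoder reranker. RRF uses only ranks, so the dense and BM25 scores never need to be put on a common scale.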

📈 Deal Intelligence System

→ https://github.com/Adeliyio/deal-intelligence-for-sales-teams

The real question isn’t scoring deals — it’s whether AI measurably improves sales decisions.

Calibrated win probabilities (Platt scaling)
Temporal feature engineering (deal behavior tracking)
Evidence-grounded risk explanations
Counterfactual simulations for strategy decisions
A/B tested agent impact
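Platt scaling fits a sigmoid that maps a model's raw score to a probability. The dependency-free sketch below fits the two parameters by gradient descent on log loss. It is a simplified illustration, not the project's code: the scores and outcomes are invented, and it uses plain 0/1 targets rather than the smoothed targets in Platt's original formulation.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_platt(
    scores: list[float], labels: list[int], lr: float = 0.1, epochs: int = 2000
) -> tuple[float, float]:
    """Fit p = sigmoid(a * score + b) by batch gradient descent on log loss."""
    a, b = 0.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            err = sigmoid(a * s + b) - y  # derivative of log loss w.r.t. the logit
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def calibrate(score: float, a: float, b: float) -> float:
    """Map a raw model score to a calibrated win probability."""
    return sigmoid(a * score + b)


# Invented held-out data: raw deal scores and whether each deal was won (1) or lost (0).
raw_scores = [-2.0, -1.5, -0.5, 0.0, 0.3, 1.0, 1.8, 2.5]
outcomes = [0, 0, 0, 1, 0, 1, 1, 1]

a, b = fit_platt(raw_scores, outcomes)

if __name__ == "__main__":
    for s in (-2.0, 0.0, 2.5):
        print(f"score {s:+.1f} -> p(win) = {calibrate(s, a, b):.2f}")
```

The calibrator should be fitted on held-out data rather than the model's training set. Otherwise the resulting probabilities tend to be overconfident.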

🛠️ Core Stack

LLM & GenAI: OpenAI · Anthropic Claude · LangChain · LangGraph · LlamaIndex · Ollama

Retrieval & Search: FAISS · Pinecone · ChromaDB · BM25 · Cross-encoders

Evaluation Systems: RAGAS · LLM-as-judge · Golden datasets · Custom eval harnesses

ML / Modeling: PyTorch · TensorFlow · scikit-learn · XGBoost

MLOps / LLMOps: MLflow · DVC · GitHub Actions · Docker

Deployment: AWS · FastAPI · Streamlit

📦 MLOps Foundations

End-to-end ML pipelines (CI/CD + experiment tracking)
Model versioning (DVC + MLflow)
Real-time computer vision systems (YOLOv8)
Scalable inference APIs (FastAPI + AWS)

🏭 Where I've applied AI across business functions

Domain	Representative systems
HR & Talent	Resume screening · Candidate ranking · Attrition prediction
Finance	Fraud detection · Cash flow forecasting · Invoice automation
Sales & Marketing	Lead scoring · Churn prediction · Customer segmentation
Customer Support	Support copilots · Ticket classification · Escalation systems
Data & Analytics	NL-to-SQL · Forecasting · Insight generation
Legal & Compliance	Contract analysis · Policy enforcement
Operations	Workflow automation · Document processing
Engineering / Product	Code review · Bug triage · Log analysis
Executive / Strategy	KPI dashboards · Scenario planning · Decision support

✍️ Writing

Why Most RAG Systems Fail in Production (And How to Fix Them)
Continuous Integration for Data Science: Automating Model Building Pipelines
Version Control for Machine Learning Models: Best Practices and Tools
Automated Model Testing and Monitoring: The Bedrock of Startup MLOps
Taming Data and Model Drift in Startup MLOps
Building Scalable Machine Learning Pipelines in Startup Environments

→ https://medium.com/@tommyadeliyi

🤝 Let’s Connect

If you're building AI systems that need to be:

reliable in production
measurable and testable
safe under real-world conditions

I’m always open to conversations about building high-impact AI systems.
