Commit 55fad85

Add ML System Insights and Agent Instructions documentation
- ML_SYSTEM_INSIGHTS.md: Comprehensive analysis of 300+ production ML systems
  - Core ML system categories and patterns
  - Architecture patterns and scaling strategies
  - Common pitfalls and best practices
  - MLOps maturity levels and monitoring guidelines
- AGENTS.md: Specific instructions for AI agents
  - Role-specific guidance (coding, analysis, documentation, etc.)
  - Templates and checklists for each agent type
  - Quick reference guides and troubleshooting
- Updated README.md with links to new resources
1 parent d428898 commit 55fad85

4 files changed

Lines changed: 697 additions & 1 deletion

.claude/settings.local.json

Lines changed: 5 additions & 1 deletion
@@ -2,7 +2,11 @@
   "permissions": {
     "allow": [
       "Bash(git init:*)",
-      "Bash(git add:*)"
+      "Bash(git add:*)",
+      "WebFetch(domain:github.com)",
+      "WebFetch(domain:raw.githubusercontent.com)",
+      "WebFetch(domain:api.github.com)",
+      "Bash(git commit:*)"
     ],
     "deny": [],
     "ask": []

AGENTS.md

Lines changed: 375 additions & 0 deletions
@@ -0,0 +1,375 @@
# Agents Reference Guide

## AI Agent Instructions for ML System Design

This document provides specific instructions for AI agents working with ML systems, referencing insights from [ML_SYSTEM_INSIGHTS.md](./ML_SYSTEM_INSIGHTS.md).

---

## 🤖 For Coding Agents

### When Building ML Systems

#### 1. **Start with Architecture Pattern**
Reference: `ML_SYSTEM_INSIGHTS.md#universal-architecture-patterns`

```python
# Placeholder components (stubs here; replace with real implementations).
class OfflineTrainer: ...
class StreamProcessor: ...
class PredictionServer: ...


# Always implement the three-layer stack
class MLSystem:
    def __init__(self):
        self.offline_training = OfflineTrainer()
        self.nearline_processing = StreamProcessor()
        self.online_serving = PredictionServer()
```
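
#### 2. **Feature Store Implementation**
```python
from datetime import datetime
from typing import List, Protocol

from pandas import DataFrame  # assumes DataFrame refers to pandas


class FeatureDefinition: ...  # placeholder for a feature's name, type, and source


# Standard feature store interface
class FeatureStore(Protocol):
    def get_online_features(self, entity_ids: List[str]) -> DataFrame: ...
    def get_offline_features(self, entity_ids: List[str], timestamp: datetime) -> DataFrame: ...
    def register_feature(self, feature_def: FeatureDefinition) -> None: ...
```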
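
As a rough illustration of the interface above, the sketch below keeps values for a single feature in memory and answers offline queries point-in-time, that is, with the latest value written at or before the requested timestamp. `InMemoryFeatureStore`, its `write` method, and the plain-dict return type (used instead of a DataFrame to stay dependency-free) are assumptions for this example, not part of any real feature store API.

```python
from bisect import bisect_right
from datetime import datetime
from typing import Any, Dict, List, Optional, Tuple


class InMemoryFeatureStore:
    """Toy store for one feature: entity_id -> time-ordered (timestamp, value) history."""

    def __init__(self) -> None:
        self._history: Dict[str, List[Tuple[datetime, Any]]] = {}

    def write(self, entity_id: str, timestamp: datetime, value: Any) -> None:
        rows = self._history.setdefault(entity_id, [])
        rows.append((timestamp, value))
        rows.sort(key=lambda row: row[0])  # keep each history ordered by time

    def get_online_features(self, entity_ids: List[str]) -> Dict[str, Optional[Any]]:
        # Online serving reads the latest known value for each entity.
        return {
            e: (self._history[e][-1][1] if self._history.get(e) else None)
            for e in entity_ids
        }

    def get_offline_features(
        self, entity_ids: List[str], timestamp: datetime
    ) -> Dict[str, Optional[Any]]:
        # Point-in-time lookup: latest value written at or before `timestamp`,
        # so training rows never see values from the future.
        result: Dict[str, Optional[Any]] = {}
        for e in entity_ids:
            rows = self._history.get(e, [])
            idx = bisect_right([ts for ts, _ in rows], timestamp)
            result[e] = rows[idx - 1][1] if idx > 0 else None
        return result
```

Serving the online and offline paths from the same history is one way to reduce training/serving skew; a production store would add persistence, multiple features per entity, and feature registration.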

#### 3. **Model Serving Checklist**
- [ ] Implement health checks
- [ ] Add request validation
- [ ] Set up caching layer
- [ ] Configure timeout handling
- [ ] Add fallback mechanism
- [ ] Implement gradual rollout
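
The last checklist item, gradual rollout, is often implemented by assigning each user to a stable hash bucket and sending only a configured percentage of buckets to the new model. The following is a minimal sketch under that assumption; `in_rollout`, `route`, and the `salt` value are hypothetical names chosen for illustration.

```python
import hashlib


def in_rollout(user_id: str, rollout_percent: float, salt: str = "model-v2") -> bool:
    """Return True if this user falls inside the rollout percentage."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) % 10_000  # stable bucket in [0, 9999]
    return bucket < rollout_percent * 100  # e.g. 5.0% covers buckets 0..499


def route(user_id: str, rollout_percent: float) -> str:
    """Pick which model variant serves this user."""
    return "candidate" if in_rollout(user_id, rollout_percent) else "stable"
```

Because the bucket depends only on the salt and the user ID, a user keeps the same assignment across requests, and raising the percentage only adds users to the candidate group without removing any.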
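
### Error Handling Patterns

```python
import logging

logger = logging.getLogger(__name__)


class ModelTimeout(Exception):
    """Raised when the primary model exceeds its latency budget."""


# Always implement fallback strategies
# (primary_model, fallback_model and default_response() are assumed to be set up elsewhere)
class PredictionService:
    def predict(self, request):
        try:
            return self.primary_model.predict(request)
        except ModelTimeout:
            # Primary model too slow: serve from the fallback model instead.
            return self.fallback_model.predict(request)
        except Exception:
            # Anything else: log with traceback and return a safe default.
            logger.exception("Prediction failed")
            return self.default_response()
```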

---

## 🔍 For Analysis Agents

### System Analysis Framework

Reference: `ML_SYSTEM_INSIGHTS.md#key-design-decisions`

#### 1. **Performance Analysis Checklist**
- [ ] Check latency percentiles (p50, p95, p99)
- [ ] Analyze throughput bottlenecks
- [ ] Review cache hit rates
- [ ] Evaluate model complexity vs accuracy trade-off
- [ ] Assess infrastructure costs
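
For the first checklist item, latency percentiles can be computed directly from raw request timings. The sketch below uses the nearest-rank definition (other definitions interpolate between samples and give slightly different values); `percentile` and `latency_summary` are illustrative names, not functions from an existing library.

```python
import math
from typing import Dict, Sequence


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank into the sorted samples
    return ordered[rank - 1]


def latency_summary(latencies_ms: Sequence[float]) -> Dict[str, float]:
    """Summarize request latencies at the percentiles named in the checklist."""
    return {f"p{p}": percentile(latencies_ms, p) for p in (50, 95, 99)}
```

Tail percentiles such as p99 need enough samples to be meaningful: with fewer than 100 requests, the nearest-rank p99 is simply the slowest request.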

#### 2. **Data Quality Assessment**
```python
# Standard data quality checks
# (the check_* helpers are placeholders to be implemented per dataset)
quality_metrics = {
    "completeness": check_missing_values(),
    "consistency": check_data_types(),
    "timeliness": check_data_freshness(),
    "validity": check_value_ranges(),
    "uniqueness": check_duplicates()
}
```
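
To make three of these checks concrete, here is a small dependency-free sketch that operates on rows represented as dictionaries. The function names, their signatures, and the row layout are assumptions for illustration; the `check_*` helpers above are not defined in the guide, so this is one possible shape for them rather than their actual implementation.

```python
from typing import Any, Dict, List, Sequence

Row = Dict[str, Any]


def completeness(rows: List[Row], field: str) -> float:
    """Fraction of rows where `field` is present and not None."""
    if not rows:
        return 0.0
    return sum(r.get(field) is not None for r in rows) / len(rows)


def validity(rows: List[Row], field: str, low: float, high: float) -> float:
    """Fraction of non-null values of `field` that fall inside [low, high]."""
    values = [r[field] for r in rows if r.get(field) is not None]
    if not values:
        return 0.0
    return sum(low <= v <= high for v in values) / len(values)


def duplicate_count(rows: List[Row], key_fields: Sequence[str]) -> int:
    """Number of rows whose key repeats the key of an earlier row."""
    seen = set()
    dupes = 0
    for r in rows:
        key = tuple(r.get(f) for f in key_fields)
        if key in seen:
            dupes += 1
        seen.add(key)
    return dupes
```

Returning ratios and counts rather than booleans lets the caller decide which thresholds are acceptable for a given dataset.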

#### 3. **Drift Detection Analysis**
- Monitor feature distributions
- Track prediction distributions
- Analyze label shift
- Evaluate concept drift
- Check upstream data changes
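
One common way to monitor a numeric feature's distribution is the two-sample Kolmogorov-Smirnov statistic, which measures the largest gap between the empirical CDFs of a reference window and a current window. The sketch below is a plain-Python version for illustration; `ks_statistic` is a name chosen here, and the right alerting threshold depends on sample size and the feature in question.

```python
from bisect import bisect_right
from typing import Sequence


def ks_statistic(reference: Sequence[float], current: Sequence[float]) -> float:
    """Two-sample Kolmogorov-Smirnov statistic: max gap between empirical CDFs."""
    if not reference or not current:
        raise ValueError("both samples must be non-empty")
    ref, cur = sorted(reference), sorted(current)
    max_gap = 0.0
    # The largest gap is always attained at one of the observed values.
    for x in set(ref) | set(cur):
        cdf_ref = bisect_right(ref, x) / len(ref)  # fraction of reference values <= x
        cdf_cur = bisect_right(cur, x) / len(cur)  # fraction of current values <= x
        max_gap = max(max_gap, abs(cdf_ref - cdf_cur))
    return max_gap
```

A value of 0 means the two empirical distributions coincide, and 1 means they do not overlap at all. The statistic only covers distribution shift in one numeric feature; label shift and concept drift need labeled outcomes to detect.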

### Root Cause Analysis Template

1. **Symptom**: What is the observed issue?
2. **Impact**: Business metrics affected
3. **Timeline**: When did it start?
4. **Hypothesis**: Potential causes (reference common pitfalls)
5. **Investigation**: Data/logs to examine
6. **Resolution**: Fix and prevention

---

## 📝 For Documentation Agents

### ML System Documentation Template

Reference: `ML_SYSTEM_INSIGHTS.md#system-design-templates`

#### 1. **System Overview**
```markdown
## System Name

### Purpose
[Business problem being solved]

### Architecture
[Reference architecture pattern from ML_SYSTEM_INSIGHTS.md]

### Key Metrics
- Business: [Revenue, engagement]
- Model: [Accuracy, AUC]
- System: [Latency, throughput]
```

#### 2. **Data Pipeline Documentation**
```markdown
## Data Pipeline

### Sources
- Source A: [Description, update frequency]
- Source B: [Description, update frequency]

### Transformations
1. [Step 1]: [Description]
2. [Step 2]: [Description]

### Output Schema
| Field | Type | Description |
|-------|------|-------------|
| user_id | string | Unique user identifier |
| features | array | Computed feature vector |
```

#### 3. **Model Documentation**
```markdown
## Model Specification

### Training
- Algorithm: [e.g., XGBoost, BERT]
- Training Frequency: [Daily, Weekly]
- Data Window: [e.g., Last 90 days]

### Serving
- Latency SLA: [e.g., <100ms p99]
- Throughput: [e.g., 10K QPS]
- Deployment: [e.g., Kubernetes, SageMaker]

### Monitoring
- Alerts: [List of alert conditions]
- Dashboards: [Links to dashboards]
- On-call: [Team responsible]
```

---

## 🏗️ For Architecture Agents

### Design Decision Framework

Reference: `ML_SYSTEM_INSIGHTS.md#scaling-strategies`

#### 1. **Batch vs Real-time Decision Tree**
```python
# Illustrative helper; the function and parameter names are placeholders.
def choose_serving_mode(latency_requirement_ms, predictions_per_day, features_change_frequently):
    if latency_requirement_ms < 100:
        return "real_time"
    elif predictions_per_day > 1_000_000:
        return "batch"
    elif features_change_frequently:
        return "nearline"
    else:
        return "hybrid"
```

#### 2. **Technology Selection Guide**

| Component | Small Scale | Medium Scale | Large Scale |
|-----------|------------|--------------|-------------|
| Feature Store | PostgreSQL | Redis + PostgreSQL | Feast/Tecton |
| Model Training | Scikit-learn | XGBoost/LightGBM | Distributed TensorFlow |
| Model Serving | Flask | FastAPI + Redis | TorchServe/Triton |
| Monitoring | CloudWatch | Datadog | Custom stack |

#### 3. **Scaling Recommendations**
- **Vertical**: Upgrade instance types for quick wins
- **Horizontal**: Add replicas for stateless services
- **Caching**: Implement multi-tier caching
- **Async**: Move non-critical paths to async
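
The multi-tier caching recommendation can be sketched as a fast in-process tier in front of a slower shared tier, with expensive computation only on a miss in both. In the example below both tiers are in-memory stand-ins so that it runs on its own; in a real deployment the shared tier would typically be an external cache. `TTLCache` and `TwoTierCache` are hypothetical classes written for this illustration.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class TTLCache:
    """In-memory cache whose entries expire `ttl_seconds` after being set."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._data: Dict[str, Tuple[float, Any]] = {}

    def get(self, key: str) -> Optional[Any]:
        entry = self._data.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._data[key]  # evict lazily when an expired entry is read
            return None
        return value

    def set(self, key: str, value: Any) -> None:
        self._data[key] = (self._clock() + self._ttl, value)


class TwoTierCache:
    """Check the fast local tier first, then the shared tier, then compute."""

    def __init__(self, local: TTLCache, shared: TTLCache) -> None:
        self._local, self._shared = local, shared

    def get_or_compute(self, key: str, compute: Callable[[], Any]) -> Any:
        value = self._local.get(key)
        if value is not None:
            return value
        value = self._shared.get(key)
        if value is None:
            value = compute()  # miss in both tiers: do the expensive work
            self._shared.set(key, value)
        self._local.set(key, value)  # promote to the fast tier
        return value
```

Giving the local tier a shorter TTL than the shared tier keeps per-replica memory small while still absorbing repeated requests. Note that this sketch treats `None` as a miss, so `None` results are never cached.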

---

## 🔧 For DevOps Agents

### MLOps Implementation Guide

Reference: `ML_SYSTEM_INSIGHTS.md#mlops-maturity-levels`

#### 1. **CI/CD Pipeline Setup**
```yaml
# .github/workflows/ml-pipeline.yml
# Schematic stage order only; a complete GitHub Actions workflow also needs `on:` and `jobs:`.
steps:
  - data_validation
  - feature_engineering
  - model_training
  - model_validation
  - staged_deployment
  - monitoring_setup
```

#### 2. **Infrastructure as Code**
```terraform
# Standard ML infrastructure
module "ml_platform" {
  source = "./modules/ml_platform" # hypothetical path; module blocks require a `source`

  feature_store          = true
  model_registry         = true
  experiment_tracking    = true
  monitoring             = true
  serving_infrastructure = true
}
```

#### 3. **Monitoring Setup**
```python
# Essential metrics to track
metrics = {
    "model": ["accuracy", "auc", "f1"],
    "system": ["latency_p99", "error_rate", "throughput"],
    "business": ["conversion_rate", "revenue_impact"],
    "data": ["feature_coverage", "null_rate", "drift_score"]
}
```

---

## 🧪 For Testing Agents

### ML Testing Strategy

Reference: `ML_SYSTEM_INSIGHTS.md#best-practices-for-production-ml`

#### 1. **Test Pyramid for ML**
```
        /\
       /  \        End-to-end tests (5%)
      /    \
     /      \      Integration tests (15%)
    /        \
   /          \    Component tests (30%)
  /            \
 /______________\  Unit tests (50%)
```

#### 2. **Test Categories**
```python
# `features`, `model`, `test_data`, `predict_latency_p99`, `treatment_allocation`
# and `minimum_sample_size_met` are assumed to come from fixtures or helpers
# defined elsewhere in the test suite.

# Data validation tests
def test_feature_ranges():
    assert features["age"].min() >= 0
    assert features["age"].max() <= 120

# Model validation tests
def test_model_performance():
    assert model.evaluate(test_data)["auc"] > 0.75

# System integration tests
def test_prediction_latency():
    assert predict_latency_p99() < 100  # ms

# A/B test validation
def test_experiment_setup():
    assert treatment_allocation == 0.5
    assert minimum_sample_size_met()
```
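
The A/B test check above depends on knowing how many samples are enough. As a hedged sketch, the function below uses the standard normal-approximation formula for comparing two proportions with a two-sided test; `min_sample_size_per_arm` and `has_minimum_sample` are hypothetical helpers, and the default `alpha` and `power` values are common conventions rather than values taken from this guide.

```python
import math
from statistics import NormalDist


def min_sample_size_per_arm(
    baseline_rate: float,
    min_detectable_effect: float,
    alpha: float = 0.05,
    power: float = 0.8,
) -> int:
    """Approximate per-arm sample size for a two-sided two-proportion z-test."""
    p1 = baseline_rate
    p2 = baseline_rate + min_detectable_effect
    if not (0 < p1 < 1 and 0 < p2 < 1) or p1 == p2:
        raise ValueError("rates must lie in (0, 1) and differ")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the significance level
    z_beta = NormalDist().inv_cdf(power)           # quantile for the desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


def has_minimum_sample(n_control: int, n_treatment: int, required_per_arm: int) -> bool:
    """True once the smaller arm has reached the required sample size."""
    return min(n_control, n_treatment) >= required_per_arm
```

The required size grows with the inverse square of the effect, so halving the minimum detectable effect roughly quadruples the number of users needed per arm.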

---

## 🚨 For Debugging Agents

### Troubleshooting Guide

Reference: `ML_SYSTEM_INSIGHTS.md#common-pitfalls-solutions`

#### 1. **Debug Decision Tree**
```
Performance Issue?
├── Yes → Check System Metrics
│   ├── High Latency → Profile code, check caching
│   ├── Low Throughput → Scale horizontally
│   └── High Error Rate → Check logs, validate inputs
└── No → Check Model Metrics
    ├── Low Accuracy → Analyze data drift, retrain
    ├── Bias Issues → Check data distribution
    └── Overfitting → Add regularization, reduce complexity
```

#### 2. **Common Issues & Solutions**

| Symptom | Likely Cause | Solution |
|---------|-------------|----------|
| Predictions all same | Feature pipeline broken | Validate feature generation |
| Sudden accuracy drop | Data drift | Implement drift detection |
| Slow predictions | Model too complex | Use model distillation |
| Memory leaks | Caching issues | Implement TTL, monitor memory |
| Training fails | Data quality issues | Add data validation |

---

## 📊 For Monitoring Agents

### Observability Setup

Reference: `ML_SYSTEM_INSIGHTS.md#monitoring-observability`

#### 1. **Alert Configuration**
```yaml
alerts:
  - name: model_accuracy_degradation
    condition: accuracy < 0.8
    severity: warning

  - name: high_latency
    condition: p99_latency > 200ms
    severity: critical

  - name: data_drift_detected
    condition: ks_statistic > 0.1
    severity: warning
```
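
To show how rules like these might be evaluated against a metrics snapshot, the sketch below encodes the three example alerts as data and returns the ones that are currently firing. The `AlertRule` structure and `firing_alerts` function are assumptions made for illustration and do not correspond to a specific alerting tool's configuration format.

```python
import operator
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence


@dataclass(frozen=True)
class AlertRule:
    name: str
    metric: str
    compare: Callable[[float, float], bool]
    threshold: float
    severity: str


# Mirrors the three example alerts; latency is expressed in milliseconds.
RULES = (
    AlertRule("model_accuracy_degradation", "accuracy", operator.lt, 0.8, "warning"),
    AlertRule("high_latency", "p99_latency", operator.gt, 200.0, "critical"),
    AlertRule("data_drift_detected", "ks_statistic", operator.gt, 0.1, "warning"),
)


def firing_alerts(metrics: Dict[str, float], rules: Sequence[AlertRule] = RULES) -> List[AlertRule]:
    """Return rules whose condition holds; metrics missing from the snapshot are skipped."""
    return [
        rule
        for rule in rules
        if rule.metric in metrics and rule.compare(metrics[rule.metric], rule.threshold)
    ]
```

Skipping missing metrics keeps the evaluator from raising on partial snapshots, though in practice a metric that stops reporting is itself worth alerting on.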

#### 2. **Dashboard Requirements**
- Model performance metrics (real-time)
- System health indicators
- Data quality metrics
- Business impact metrics
- Cost tracking

---

## 🔄 Quick Reference for All Agents

### Priority Order for ML Systems
1. **Correctness**: Ensure predictions are accurate
2. **Reliability**: System stays up and handles failures
3. **Latency**: Meet performance requirements
4. **Scalability**: Handle growth in usage
5. **Efficiency**: Optimize resource usage

### Universal Checklist
- [ ] Data validation implemented
- [ ] Model versioning in place
- [ ] Monitoring configured
- [ ] Rollback mechanism ready
- [ ] Documentation complete
- [ ] Tests passing
- [ ] Security review done
- [ ] Cost analysis performed

### When to Escalate
- Data privacy concerns
- Security vulnerabilities
- Significant accuracy degradation
- System-wide outages
- Budget overruns

---

*Reference: [ML_SYSTEM_INSIGHTS.md](./ML_SYSTEM_INSIGHTS.md) for detailed patterns and examples*
*Last Updated: December 2024*
