
Commit ae2c65f

Clean up: Move experimental files out of project directory
Moved all benchmark, test, debug, and analysis files to ../experiments/:

- 10 benchmark files
- 6 test files
- 3 debug files
- 3 profile/analysis files
- 1 c-code.py experiment

Keeps the project directory clean with only production code.
1 parent f474efa commit ae2c65f

27 files changed

Lines changed: 4970 additions & 400 deletions

.gitignore

Lines changed: 3 additions & 1 deletion
```diff
@@ -207,4 +207,6 @@ marimo/_static/
 marimo/_lsp/
 __marimo__/
 c_code.py
-toondb_python.txt
+c-code.py
+toondb_python.txt
+sochdb_python.txt
```

README.md

Lines changed: 2 additions & 0 deletions
````diff
@@ -857,6 +857,8 @@ db.copy_between_namespaces(
 
 Collections store documents with embeddings for semantic search using HNSW.
 
+**Strategy note:** HNSW is the default, correctness-first navigator (training-free, robust under updates). A learned navigator (CHN) is only supported behind a feature gate with strict acceptance checks (recall@k, worst-case fallback to HNSW, and drift detection). This keeps production behavior stable while allowing controlled experimentation.
+
 ### Collection Configuration
 
 ```python
````

benchmarks/README.md

Lines changed: 264 additions & 0 deletions
# SochDB Competitive Benchmarks

This directory contains comprehensive benchmarks comparing SochDB against major vector database competitors.

## Quick Start

```bash
# Install dependencies
pip install chromadb qdrant-client lancedb faiss-cpu python-dotenv openai

# Run the ultimate showdown
python benchmarks/ultimate_showdown.py

# Run the real embedding demo (requires Azure OpenAI)
python benchmarks/real_search_demo.py
```
## Benchmark Scripts

### 1. `ultimate_showdown.py` - Comprehensive Comparison

Tests SochDB against all available competitors:

- **ChromaDB** - Python-based, simple embedded database
- **Qdrant** - Rust-based with excellent filtering
- **FAISS** - Facebook's C++ library (no persistence)
- **LanceDB** - Columnar embedded database

Dimensions tested: 384 (MiniLM), 768 (BERT), 1536 (OpenAI)

### 2. `real_search_demo.py` - Real Embedding Demo

Demonstrates semantic search using actual Azure OpenAI embeddings. Requires a `.env` file with:

```
AZURE_OPENAI_API_KEY=your-key
AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com/
```

### 3. `competitive_benchmark.py` - Full Competitive Suite

Extensive benchmark with real embeddings across multiple test sizes.

### 4. `rag_benchmark.py` - RAG-Realistic Workloads

Simulates actual RAG (Retrieval-Augmented Generation) workloads:

- Document ingestion
- Semantic search
- Batch queries (concurrent users)
- Filtered search
- Memory usage

### 5. `feature_benchmark.py` - Feature Differentiators

Tests SochDB's unique features:

- All commercial embedding dimensions (128-3072)
- Concurrent read/write access
- Batch operation efficiency
- Real embedding performance
## Expected Results

Based on testing, SochDB provides:

| Metric | SochDB | ChromaDB | Qdrant | FAISS | LanceDB |
|--------|--------|----------|--------|-------|---------|
| Insert (vec/s) | 2,000-10,000 | 3,000-5,000 | 5,000-10,000 | 50,000+ | 15,000+ |
| Search p50 | 0.3-0.5ms | 1-2ms | 0.5-1ms | 0.1-0.2ms | 5-10ms |
| Filtering | ✅ | ✅ | ✅ | ❌ | ✅ |
| Embedded | ✅ | ✅ | ❌ | ✅ | ✅ |
| SQL Interface | ✅ | ❌ | ❌ | ❌ | ❌ |
## SochDB Advantages

1. **🚀 Rust-Native Performance** - SIMD-accelerated distance calculations (NEON/AVX2)
2. **📦 Truly Embedded** - No server required, like SQLite for vectors
3. **🔢 All Dimensions** - Supports 128-3072 (MiniLM to OpenAI text-embedding-3-large)
4. **💾 SQL Interface** - Query vectors with familiar SQL syntax
5. **🔒 MVCC Transactions** - Safe concurrent reads and writes
6. **🕸️ Graph + Vector** - Hybrid knowledge graph + semantic search
7. **🐍 Python Simplicity** - Native Python bindings via FFI

## Competitors Overview

| Database | Type | Best For | Limitations |
|----------|------|----------|-------------|
| **Pinecone** | Cloud | Managed simplicity | Cloud-only, cost |
| **Weaviate** | Server | Hybrid search | Requires server |
| **Milvus** | Distributed | Large scale | Complexity |
| **Qdrant** | Server | Filtering | Requires server |
| **ChromaDB** | Embedded | Simple Python | Slower performance |
| **FAISS** | Library | Raw speed | No persistence |
| **LanceDB** | Embedded | Analytics | Slower search |
| **pgvector** | Extension | PostgreSQL users | Limited scale |
| **SochDB** | Embedded | AI/ML apps | Feature-rich |
## Running Benchmarks

```bash
# Full competitive analysis
cd sochdb-python-sdk
python benchmarks/ultimate_showdown.py

# Real embeddings (requires Azure OpenAI)
python benchmarks/real_search_demo.py

# RAG-realistic workloads
python benchmarks/rag_benchmark.py

# Feature tests
python benchmarks/feature_benchmark.py
```
## Environment Setup

For real embedding benchmarks, create `.env` in the project root:

```env
AZURE_OPENAI_API_KEY=your-key
AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com/
AZURE_OPENAI_API_VERSION=2024-12-01-preview
```
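The dependency list above includes `python-dotenv`, whose `load_dotenv()` is the standard way to read this file. For illustration, here is a stdlib-only sketch of the same idea; the helper name `load_env` and its deliberately simplified parsing are ours, not part of any library:

```python
import os

def load_env(path=".env"):
    """Minimal .env reader: KEY=VALUE lines, '#' comment lines ignored.

    Unlike python-dotenv, this handles no quoting or multiline values.
    """
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#") and "=" in line:
                key, _, value = line.partition("=")
                os.environ.setdefault(key.strip(), value.strip())

# Usage in a benchmark script (assumes .env exists in the working directory):
# load_env()
# endpoint = os.environ["AZURE_OPENAI_ENDPOINT"]
```

Note that `partition("=")` splits on the first `=` only, so URLs and keys containing `=` survive intact.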
## Results

Results are saved to:

- `showdown_results.json` - Ultimate showdown results
- `benchmark_results.json` - Competitive benchmark results
- `rag_benchmark_results.json` - RAG benchmark results
- `feature_benchmark_results.json` - Feature benchmark results

---
## 📊 Industry-Standard Performance Metrics

Based on **ANN-Benchmarks** (ann-benchmarks.com), **VectorDBBench** (Zilliz), and **Qdrant Benchmarks**:

### Primary Metrics (Required for Credible Benchmarks)

| Metric | Definition | Why It Matters |
|--------|------------|----------------|
| **Recall@k** | Fraction of true k-nearest neighbors found | Measures search accuracy - the most critical metric |
| **QPS (Queries Per Second)** | Number of queries processed per second | Raw throughput for parallel workloads |
| **Latency p50/p95/p99** | Response time percentiles | User-perceived performance |
| **Index Build Time** | Time to construct the HNSW index | Critical for data ingestion pipelines |
| **Index Size (Memory)** | RAM required for the index | Cost and scalability factor |

### Recall vs QPS Tradeoff (The Gold Standard)

> **"The speed of vector databases should only be compared if they achieve the same precision."**
> — Qdrant Benchmarks

ANN search is fundamentally about trading **precision for speed**. Any benchmark comparing two systems must use the **same recall threshold** (typically 0.95 or 0.99).

```
Recall@10 = (# of true neighbors in results) / 10
```
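The formula generalizes to any k and can be checked in a few lines of plain Python (the helper name `recall_at_k` is ours, not from any benchmark suite):

```python
def recall_at_k(true_neighbors, returned_ids, k):
    """Fraction of the true k-nearest neighbors present in the ANN results."""
    return len(set(true_neighbors[:k]) & set(returned_ids[:k])) / k

# Exact top-10 (brute force) vs. what a hypothetical ANN index returned
truth = [7, 3, 11, 42, 5, 19, 8, 23, 91, 60]
found = [7, 3, 11, 42, 5, 19, 8, 23, 17, 28]
print(recall_at_k(truth, found, 10))  # 8 of 10 true neighbors found -> 0.8
```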
### Standard Benchmark Datasets (ANN-Benchmarks)

| Dataset | Vectors | Dimensions | Distance | Use Case |
|---------|---------|------------|----------|----------|
| **SIFT-1M** | 1,000,000 | 128 | Euclidean | Classic image descriptors |
| **GloVe-100** | 1,200,000 | 100 | Cosine | Word embeddings |
| **Fashion-MNIST** | 60,000 | 784 | Euclidean | Image classification |
| **GIST-960** | 1,000,000 | 960 | Euclidean | Scene recognition |
| **DBpedia-OpenAI-1M** | 1,000,000 | 1536 | Cosine | Real OpenAI embeddings |
| **Deep-Image-96** | 10,000,000 | 96 | Cosine | Large-scale images |

### VectorDBBench Scenarios

VectorDBBench (github.com/zilliztech/VectorDBBench) tests:

| Case Type | Vectors | Dimensions | Purpose |
|-----------|---------|------------|---------|
| Performance768D1M | 1M | 768 | BERT-class embeddings |
| Performance768D10M | 10M | 768 | Scale test |
| Performance1536D500K | 500K | 1536 | OpenAI embeddings |
| Performance1536D5M | 5M | 1536 | Large OpenAI scale |
| CapacityDim128 | Max | 128 | Stress test (SIFT) |
| CapacityDim960 | Max | 960 | Stress test (GIST) |
### Latency Percentiles Explained

| Percentile | Meaning | Target |
|------------|---------|--------|
| **p50 (median)** | Half of requests faster than this | < 1ms |
| **p95** | 95% of requests faster than this | < 5ms |
| **p99** | 99% of requests faster than this | < 10ms |
| **p999** | 99.9% (tail latency) | < 50ms |

High p99/p999 indicates **tail latency issues** that affect user experience.
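Given raw per-query timings, these percentiles need nothing beyond the stdlib. A sketch using the nearest-rank method (the helper name `percentile` is ours); note how a single slow outlier barely moves p50 but dominates p99:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest value with >= p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# Nine fast queries and one slow outlier (milliseconds)
latencies_ms = [0.4, 0.5, 0.5, 0.6, 0.7, 0.8, 0.9, 1.2, 4.8, 12.0]
print(percentile(latencies_ms, 50))  # 0.7  - typical request
print(percentile(latencies_ms, 99))  # 12.0 - tail latency
```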
### HNSW Index Parameters

| Parameter | Effect on Recall | Effect on Speed | Effect on Memory |
|-----------|------------------|-----------------|------------------|
| **M** (connections) | ↑ M = ↑ Recall | ↑ M = ↓ Speed | ↑ M = ↑ Memory |
| **ef_construction** | ↑ ef = ↑ Recall | ↑ ef = ↓ Build | No effect |
| **ef_search** | ↑ ef = ↑ Recall | ↑ ef = ↓ QPS | No effect |

Typical configurations:

- **High Recall (0.99+)**: M=32, ef_construction=256, ef_search=256
- **Balanced (0.95-0.98)**: M=16, ef_construction=128, ef_search=100
- **High Speed (0.90-0.95)**: M=8, ef_construction=64, ef_search=50
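The three presets map naturally onto the constructor arguments used by the benchmark script in this commit (`VectorIndex(dimension=..., max_connections=..., ef_construction=...)` plus the `ef_search` attribute, where `max_connections` plays the role of M); whether your build exposes exactly these names should be verified. A sketch:

```python
# The presets above as plain dicts (max_connections is the HNSW "M" parameter).
HNSW_PRESETS = {
    "high_recall": {"max_connections": 32, "ef_construction": 256, "ef_search": 256},
    "balanced":    {"max_connections": 16, "ef_construction": 128, "ef_search": 100},
    "high_speed":  {"max_connections": 8,  "ef_construction": 64,  "ef_search": 50},
}

def index_config(dimension, preset="balanced"):
    """Return the kwargs for building an index with the chosen recall/speed tradeoff."""
    cfg = dict(HNSW_PRESETS[preset])
    cfg["dimension"] = dimension
    return cfg

# With SochDB installed, a config would be applied like this (untested sketch):
# from sochdb.vector import VectorIndex
# cfg = index_config(384, "high_recall")
# index = VectorIndex(dimension=cfg["dimension"],
#                     max_connections=cfg["max_connections"],
#                     ef_construction=cfg["ef_construction"])
# index.ef_search = cfg["ef_search"]
```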
### Benchmark Methodology (Best Practices)

1. **Same Hardware**: All systems must run on identical hardware
2. **Same Dataset**: Use standard datasets (SIFT, GloVe, DBpedia)
3. **Same Recall**: Only compare at equivalent precision thresholds
4. **Warm Cache**: Run warmup queries before measurement
5. **Multiple Runs**: Report the median of 5+ runs
6. **Separate Client/Server**: Use different machines for client and server (if applicable)
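Points 4 and 5 fold into a tiny measurement harness; a stdlib-only sketch (the helper name `measure` is ours):

```python
import time
import statistics

def measure(fn, warmup=100, runs=5, queries=1000):
    """Warm the cache first, then report the median QPS over several runs."""
    for _ in range(warmup):              # 4. warm cache before measuring
        fn()
    qps_per_run = []
    for _ in range(runs):                # 5. multiple runs, median reported
        start = time.perf_counter()
        for _ in range(queries):
            fn()
        elapsed = time.perf_counter() - start
        qps_per_run.append(queries / elapsed)
    return statistics.median(qps_per_run)

# Stand-in workload; replace with e.g. lambda: index.search(query, k=10)
print(f"median QPS: {measure(lambda: sum(range(100))):,.0f}")
```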
### Reference Hardware (VectorDBBench Standard)

```
Client: 8 vCPUs, 16 GB RAM (Azure Standard D8ls v5)
Server: 8 vCPUs, 32 GB RAM (Azure Standard D8s v3)
CPU:    Intel Xeon Platinum 8375C @ 2.90GHz
Memory Limit: 25 GB (to ensure fairness)
```
### How to Interpret Results

#### A Good Benchmark Report Shows:

✅ Recall@k vs QPS curves (the gold-standard chart)
✅ Multiple precision thresholds (0.90, 0.95, 0.99)
✅ Latency percentiles (p50, p95, p99)
✅ Index build time and memory usage
✅ Dataset and hardware specifications

#### Red Flags in Benchmarks:

❌ No recall measurement (speed without accuracy is meaningless)
❌ A single data point (no precision/speed tradeoff shown)
❌ Unknown or unreproducible hardware
❌ Proprietary datasets

---
## 🏆 SochDB Performance Targets

Based on industry benchmarks, SochDB targets:

| Metric | Target | Compared To |
|--------|--------|-------------|
| Recall@10 | ≥ 0.95 | Standard ANN threshold |
| QPS (single-thread) | ≥ 1,000 | ChromaDB baseline |
| Latency p50 | < 1ms | Qdrant/Milvus class |
| Latency p99 | < 10ms | Production-ready |
| Index Build | < 60s per 1M vectors | Competitive |
| Memory | < 2x raw vector size | Efficient |
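The memory target is stated relative to raw vector size, which is easy to compute: float32 vectors cost 4 bytes per component. For the Performance768D1M case above:

```python
def raw_vector_bytes(n_vectors, dimensions, bytes_per_component=4):
    """Storage for the raw float32 vectors alone, before any index overhead."""
    return n_vectors * dimensions * bytes_per_component

raw = raw_vector_bytes(1_000_000, 768)  # VectorDBBench Performance768D1M case
budget = 2 * raw                        # SochDB target: stay under 2x raw size
print(f"raw: {raw / 2**30:.2f} GiB, memory budget: {budget / 2**30:.2f} GiB")
```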
### Distance Metrics Supported

| Metric | Formula | Use Case |
|--------|---------|----------|
| **Cosine** | 1 - (a·b / \|a\|\|b\|) | Text embeddings (default) |
| **Euclidean (L2)** | √Σ(aᵢ-bᵢ)² | Image features |
| **Dot Product** | -a·b | Pre-normalized vectors |
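For reference, the three formulas implemented directly in plain Python; a real engine would use the SIMD paths mentioned above, but these are handy for checking results by hand:

```python
import math

def cosine_distance(a, b):
    """1 - (a.b / |a||b|): 0 for identical directions, 1 for orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1 - dot / norm

def euclidean_distance(a, b):
    """L2 distance: sqrt of the sum of squared component differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dot_product_distance(a, b):
    """Negated dot product, so smaller means more similar like the others."""
    return -sum(x * y for x, y in zip(a, b))

a, b = [1.0, 0.0], [0.0, 1.0]
print(cosine_distance(a, b))     # orthogonal unit vectors -> 1.0
print(euclidean_distance(a, b))  # sqrt(2) ~= 1.414
```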
---

## 📚 References

- **ANN-Benchmarks**: https://ann-benchmarks.com/
- **VectorDBBench**: https://github.com/zilliztech/VectorDBBench
- **Qdrant Benchmarks**: https://qdrant.tech/benchmarks/
- **Zilliz Leaderboard**: https://zilliz.com/benchmark
- **Erik Bernhardsson's ANN Benchmarks**: https://github.com/erikbern/ann-benchmarks

benchmarks/compare_search_fast.py

Lines changed: 108 additions & 0 deletions
```python
#!/usr/bin/env python3
"""
Compare search() vs search_fast() performance.
"""

import sys
import time
import numpy as np
from pathlib import Path

sys.path.insert(0, str(Path(__file__).parent.parent / 'src'))

from sochdb.vector import VectorIndex

# Configuration
DIM = 384
N_VECTORS = 10000
N_QUERIES = 1000
K = 10

np.random.seed(42)

print("=" * 70)
print("SEARCH vs SEARCH_FAST COMPARISON")
print("=" * 70)
print(f"Config: {N_VECTORS} vectors, {DIM}D, {N_QUERIES} queries, k={K}")

# Generate unit-norm data so cosine similarity reduces to a dot product
vectors = np.random.randn(N_VECTORS, DIM).astype(np.float32)
vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

# Queries are lightly perturbed copies of the first N_QUERIES vectors
queries = vectors[:N_QUERIES].copy() + np.random.randn(N_QUERIES, DIM).astype(np.float32) * 0.01
queries = queries / np.linalg.norm(queries, axis=1, keepdims=True)

# Exact ground truth via brute-force similarity
similarities = queries @ vectors.T
ground_truth = np.argsort(-similarities, axis=1)[:, :K]

# Create index
print("\nCreating index...")
index = VectorIndex(dimension=DIM, max_connections=32, ef_construction=200)
ids = np.arange(N_VECTORS, dtype=np.uint64)
index.insert_batch_fast(ids, vectors)
index.ef_search = 500
print(f"Index created with {len(index)} vectors")

# Warmup (populate caches before timing)
print("\nWarming up...")
for i in range(100):
    index.search(queries[i % N_QUERIES], k=K)
    index.search_fast(queries[i % N_QUERIES], k=K)

# Benchmark search()
print("\nBenchmarking search()...")
times_search = []
recalls_search = []
for i in range(N_QUERIES):
    start = time.perf_counter_ns()
    results = index.search(queries[i], k=K)
    elapsed = time.perf_counter_ns() - start
    times_search.append(elapsed / 1000)  # µs

    pred = [r[0] for r in results]
    recall = len(set(pred) & set(ground_truth[i])) / K
    recalls_search.append(recall)

p50_search = np.percentile(times_search, 50)
p99_search = np.percentile(times_search, 99)
recall_search = np.mean(recalls_search)

print(f"  search():      p50={p50_search:.1f}µs ({p50_search/1000:.2f}ms), p99={p99_search:.1f}µs, recall={recall_search:.3f}")

# Benchmark search_fast()
print("\nBenchmarking search_fast()...")
times_fast = []
recalls_fast = []
for i in range(N_QUERIES):
    start = time.perf_counter_ns()
    results = index.search_fast(queries[i], k=K)
    elapsed = time.perf_counter_ns() - start
    times_fast.append(elapsed / 1000)  # µs

    pred = [r[0] for r in results]
    recall = len(set(pred) & set(ground_truth[i])) / K
    recalls_fast.append(recall)

p50_fast = np.percentile(times_fast, 50)
p99_fast = np.percentile(times_fast, 99)
recall_fast = np.mean(recalls_fast)

print(f"  search_fast(): p50={p50_fast:.1f}µs ({p50_fast/1000:.2f}ms), p99={p99_fast:.1f}µs, recall={recall_fast:.3f}")

# Summary
print("\n" + "=" * 70)
print("SUMMARY")
print("=" * 70)
speedup = p50_search / p50_fast
print(f"  search():      {p50_search:.1f}µs ({p50_search/1000:.2f}ms)")
print(f"  search_fast(): {p50_fast:.1f}µs ({p50_fast/1000:.2f}ms)")
print(f"  Speedup:       {speedup:.2f}x")
print(f"  Recall:        {recall_search:.3f} vs {recall_fast:.3f}")

if speedup > 1.2:
    print(f"\n  ✅ search_fast() is {speedup:.1f}x faster!")
elif speedup < 0.8:
    print("\n  ⚠️ search_fast() is slower - investigation needed")
else:
    print("\n  ≈ Performance is similar")
```
