# SochDB Competitive Benchmarks

This directory contains comprehensive benchmarks comparing SochDB against major vector database competitors.

## Quick Start

```bash
# Install dependencies
pip install chromadb qdrant-client lancedb faiss-cpu python-dotenv openai

# Run the ultimate showdown
python benchmarks/ultimate_showdown.py

# Run real embedding demo (requires Azure OpenAI)
python benchmarks/real_search_demo.py
```

## Benchmark Scripts

### 1. `ultimate_showdown.py` - Comprehensive Comparison
Tests SochDB against all available competitors:
- **ChromaDB** - Python-based, simple embedded database
- **Qdrant** - Rust-based with excellent filtering
- **FAISS** - Facebook's C++ library (no persistence)
- **LanceDB** - Columnar embedded database

Dimensions tested: 384 (MiniLM), 768 (BERT), 1536 (OpenAI)

### 2. `real_search_demo.py` - Real Embedding Demo
Demonstrates semantic search using actual Azure OpenAI embeddings. Requires `.env` with:
```
AZURE_OPENAI_API_KEY=your-key
AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com/
```
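
A minimal sketch of the flow this demo relies on: load the keys above, then request embeddings through the `openai` package's `AzureOpenAI` client. The deployment name `text-embedding-3-small`, the helper names, and the hand-rolled `.env` parser are illustrative assumptions, not the script's actual code (in practice `python-dotenv` handles the parsing):

```python
import os

def load_env(path=".env"):
    """Parse simple KEY=value lines from a .env file (skips blanks and comments)."""
    env = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            env[key.strip()] = value.strip()
    return env

def embed(texts, deployment="text-embedding-3-small"):
    # Lazy import so the .env helper above works without the openai package.
    from openai import AzureOpenAI
    cfg = {**load_env(), **os.environ}
    client = AzureOpenAI(
        api_key=cfg["AZURE_OPENAI_API_KEY"],
        azure_endpoint=cfg["AZURE_OPENAI_ENDPOINT"],
        api_version=cfg.get("AZURE_OPENAI_API_VERSION", "2024-12-01-preview"),
    )
    resp = client.embeddings.create(model=deployment, input=list(texts))
    return [d.embedding for d in resp.data]
```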

### 3. `competitive_benchmark.py` - Full Competitive Suite
Extensive benchmark with real embeddings across multiple test sizes.

### 4. `rag_benchmark.py` - RAG-Realistic Workloads
Simulates actual RAG (Retrieval-Augmented Generation) workloads:
- Document ingestion
- Semantic search
- Batch queries (concurrent users)
- Filtered search
- Memory usage

### 5. `feature_benchmark.py` - Feature Differentiators
Tests SochDB's unique features:
- All commercial embedding dimensions (128-3072)
- Concurrent read/write access
- Batch operation efficiency
- Real embedding performance

## Expected Results

Based on testing, SochDB provides:

| Metric | SochDB | ChromaDB | Qdrant | FAISS | LanceDB |
|--------|--------|----------|--------|-------|---------|
| Insert (vec/s) | 2,000-10,000 | 3,000-5,000 | 5,000-10,000 | 50,000+ | 15,000+ |
| Search p50 | 0.3-0.5ms | 1-2ms | 0.5-1ms | 0.1-0.2ms | 5-10ms |
| Filtering | ✅ | ✅ | ✅ | ❌ | ✅ |
| Embedded | ✅ | ✅ | ❌ | ✅ | ✅ |
| SQL Interface | ✅ | ❌ | ❌ | ❌ | ❌ |

## SochDB Advantages

1. **🚀 Rust-Native Performance** - SIMD-accelerated distance calculations (NEON/AVX2)
2. **📦 Truly Embedded** - No server required, like SQLite for vectors
3. **🔢 All Dimensions** - Supports 128-3072 (MiniLM to OpenAI text-embedding-3-large)
4. **💾 SQL Interface** - Query vectors with familiar SQL syntax
5. **🔒 MVCC Transactions** - Safe concurrent reads and writes
6. **🕸️ Graph + Vector** - Hybrid knowledge graph + semantic search
7. **🐍 Python Simplicity** - Native Python bindings via FFI

## Competitors Overview

| Database | Type | Best For | Limitations |
|----------|------|----------|-------------|
| **Pinecone** | Cloud | Managed simplicity | Cloud-only, cost |
| **Weaviate** | Server | Hybrid search | Requires server |
| **Milvus** | Distributed | Large scale | Complexity |
| **Qdrant** | Server | Filtering | Requires server |
| **ChromaDB** | Embedded | Simple Python | Slower performance |
| **FAISS** | Library | Raw speed | No persistence |
| **LanceDB** | Embedded | Analytics | Slower search |
| **pgvector** | Extension | PostgreSQL users | Limited scale |
| **SochDB** | Embedded | AI/ML apps | Feature-rich |

## Running Benchmarks

```bash
# Full competitive analysis
cd sochdb-python-sdk
python benchmarks/ultimate_showdown.py

# Real embeddings (requires Azure OpenAI)
python benchmarks/real_search_demo.py

# RAG-realistic workloads
python benchmarks/rag_benchmark.py

# Feature tests
python benchmarks/feature_benchmark.py
```

## Environment Setup

For real embedding benchmarks, create `.env` in the project root:

```env
AZURE_OPENAI_API_KEY=your-key
AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com/
AZURE_OPENAI_API_VERSION=2024-12-01-preview
```

## Results

Results are saved to:
- `showdown_results.json` - Ultimate showdown results
- `benchmark_results.json` - Competitive benchmark results
- `rag_benchmark_results.json` - RAG benchmark results
- `feature_benchmark_results.json` - Feature benchmark results

---

## 📊 Industry-Standard Performance Metrics

Based on **ANN-Benchmarks** (ann-benchmarks.com), **VectorDBBench** (Zilliz), and **Qdrant Benchmarks**:

### Primary Metrics (Required for Credible Benchmarks)

| Metric | Definition | Why It Matters |
|--------|------------|----------------|
| **Recall@k** | Fraction of true k-nearest neighbors found | Measures search accuracy - the most critical metric |
| **QPS (Queries Per Second)** | Number of queries processed per second | Raw throughput for parallel workloads |
| **Latency p50/p95/p99** | Response time percentiles | User-perceived performance |
| **Index Build Time** | Time to construct the HNSW index | Critical for data ingestion pipelines |
| **Index Size (Memory)** | RAM required for the index | Cost and scalability factor |

### Recall vs QPS Tradeoff (The Gold Standard)

> **"The speed of vector databases should only be compared if they achieve the same precision."**
> — Qdrant Benchmarks

ANN search is fundamentally about trading **precision for speed**. Any benchmark comparing two systems must use the **same recall threshold** (typically 0.95 or 0.99).

```
Recall@10 = (# of true neighbors in results) / 10
```
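
In code, this is a one-liner; here is a minimal sketch, with the example ID lists invented for illustration:

```python
def recall_at_k(approx_ids, true_ids, k=10):
    """Fraction of the true k nearest neighbors present in the approximate result."""
    return len(set(approx_ids[:k]) & set(true_ids[:k])) / k

# Ground truth from exact (brute-force) search vs. an ANN result
# that found 9 of the true top-10:
truth  = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
approx = [1, 2, 3, 4, 5, 6, 7, 8, 9, 42]
print(recall_at_k(approx, truth))  # 0.9
```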

### Standard Benchmark Datasets (ANN-Benchmarks)

| Dataset | Vectors | Dimensions | Distance | Use Case |
|---------|---------|------------|----------|----------|
| **SIFT-1M** | 1,000,000 | 128 | Euclidean | Classic image descriptors |
| **GloVe-100** | 1,200,000 | 100 | Cosine | Word embeddings |
| **Fashion-MNIST** | 60,000 | 784 | Euclidean | Image classification |
| **GIST-960** | 1,000,000 | 960 | Euclidean | Scene recognition |
| **DBpedia-OpenAI-1M** | 1,000,000 | 1536 | Cosine | Real OpenAI embeddings |
| **Deep-Image-96** | 10,000,000 | 96 | Cosine | Large-scale images |

### VectorDBBench Scenarios

VectorDBBench (github.com/zilliztech/VectorDBBench) tests:

| Case Type | Vectors | Dimensions | Purpose |
|-----------|---------|------------|---------|
| Performance768D1M | 1M | 768 | BERT-class embeddings |
| Performance768D10M | 10M | 768 | Scale test |
| Performance1536D500K | 500K | 1536 | OpenAI embeddings |
| Performance1536D5M | 5M | 1536 | Large OpenAI scale |
| CapacityDim128 | Max | 128 | Stress test (SIFT) |
| CapacityDim960 | Max | 960 | Stress test (GIST) |

### Latency Percentiles Explained

| Percentile | Meaning | Target |
|------------|---------|--------|
| **p50 (median)** | Half of requests faster than this | < 1ms |
| **p95** | 95% of requests faster than this | < 5ms |
| **p99** | 99% of requests faster than this | < 10ms |
| **p999** | 99.9% (tail latency) | < 50ms |

High p99/p999 indicates **tail latency issues** that affect user experience.

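A small stdlib-only sketch of computing these percentiles from raw per-query latencies, using the nearest-rank method (the sample numbers are invented; one slow outlier is enough to drag p99 far above the median):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample >= p% of all samples."""
    s = sorted(samples)
    idx = max(0, math.ceil(p / 100 * len(s)) - 1)
    return s[idx]

latencies_ms = [0.4, 0.5, 0.5, 0.6, 0.7, 0.8, 0.9, 1.1, 2.0, 9.5]
p50 = percentile(latencies_ms, 50)  # median-class latency
p99 = percentile(latencies_ms, 99)  # tail latency, dominated by the 9.5 ms outlier
```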
### HNSW Index Parameters

| Parameter | Effect on Recall | Effect on Speed | Effect on Memory |
|-----------|------------------|-----------------|------------------|
| **M** (connections) | ↑ M = ↑ Recall | ↑ M = ↓ Speed | ↑ M = ↑ Memory |
| **ef_construction** | ↑ ef = ↑ Recall | ↑ ef = ↓ Build | No effect |
| **ef_search** | ↑ ef = ↑ Recall | ↑ ef = ↓ QPS | No effect |

Typical configurations:
- **High Recall (0.99+)**: M=32, ef_construction=256, ef_search=256
- **Balanced (0.95-0.98)**: M=16, ef_construction=128, ef_search=100
- **High Speed (0.90-0.95)**: M=8, ef_construction=64, ef_search=50

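For sizing, a rough back-of-envelope: HNSW memory is approximately the raw float32 vectors plus ~2·M layer-0 links of 4 bytes each per element (upper layers add a few percent more and are ignored here). This is a generic estimate, not SochDB's actual allocator behavior:

```python
def hnsw_memory_mb(n_vectors, dim, M, bytes_per_float=4, bytes_per_link=4):
    """Approximate HNSW index RAM: vector data + layer-0 graph links."""
    per_element = dim * bytes_per_float + 2 * M * bytes_per_link
    return n_vectors * per_element / (1024 ** 2)

# 1M x 768-dim float32, M=16:
# raw vectors alone are 768*4 = 3072 B each (~2930 MB);
# the graph adds 2*16*4 = 128 B each, for ~3052 MB total.
estimate = hnsw_memory_mb(1_000_000, 768, 16)
```

Doubling M roughly adds another 128 bytes per element at this dimension, which is why the "High Recall" configuration above costs noticeably more RAM.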
### Benchmark Methodology (Best Practices)

1. **Same Hardware**: All systems must run on identical hardware
2. **Same Dataset**: Use standard datasets (SIFT, GloVe, DBpedia)
3. **Same Recall**: Only compare at equivalent precision thresholds
4. **Warm Cache**: Run warmup queries before measurement
5. **Multiple Runs**: Report median of 5+ runs
6. **Separate Client/Server**: Use different machines for client and server (if applicable)

### Reference Hardware (VectorDBBench Standard)

```
Client: 8 vCPUs, 16 GB RAM (Azure Standard D8ls v5)
Server: 8 vCPUs, 32 GB RAM (Azure Standard D8s v3)
CPU: Intel Xeon Platinum 8375C @ 2.90GHz
Memory Limit: 25 GB (to ensure fairness)
```

### How to Interpret Results

#### Good Benchmark Report Shows:
✅ Recall@k vs QPS curves (the gold standard chart)
✅ Multiple precision thresholds (0.90, 0.95, 0.99)
✅ Latency percentiles (p50, p95, p99)
✅ Index build time and memory usage
✅ Dataset and hardware specifications

#### Red Flags in Benchmarks:
❌ No recall measurement (speed without accuracy is meaningless)
❌ Single data point (no precision/speed tradeoff shown)
❌ Unknown or unreproducible hardware
❌ Proprietary datasets

---

## 🏆 SochDB Performance Targets

Based on industry benchmarks, SochDB targets:

| Metric | Target | Compared To |
|--------|--------|-------------|
| Recall@10 | ≥ 0.95 | Standard ANN threshold |
| QPS (single-thread) | ≥ 1,000 | ChromaDB baseline |
| Latency p50 | < 1ms | Qdrant/Milvus class |
| Latency p99 | < 10ms | Production-ready |
| Index Build | < 60s/1M vectors | Competitive |
| Memory | < 2x raw vector size | Efficient |

### Distance Metrics Supported

| Metric | Formula | Use Case |
|--------|---------|----------|
| **Cosine** | 1 - (a·b / \|a\|\|b\|) | Text embeddings (default) |
| **Euclidean (L2)** | √Σ(aᵢ-bᵢ)² | Image features |
| **Dot Product** | -a·b | Pre-normalized vectors |

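The three formulas above, written out as a plain-Python sketch (a real engine computes these with SIMD over float32 arrays; the dot product is negated so that smaller always means closer, matching the other two metrics):

```python
import math

def cosine_distance(a, b):
    """1 - (a.b / |a||b|); 0 for identical directions, 1 for orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """L2 distance: sqrt of summed squared component differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dot_distance(a, b):
    """Negated dot product, for pre-normalized vectors."""
    return -sum(x * y for x, y in zip(a, b))
```

For unit-length vectors, cosine distance and negated dot product rank neighbors identically, which is why pre-normalizing and using dot product is a common optimization.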
---

## 📚 References

- **ANN-Benchmarks**: https://ann-benchmarks.com/
- **VectorDBBench**: https://github.com/zilliztech/VectorDBBench
- **Qdrant Benchmarks**: https://qdrant.tech/benchmarks/
- **Zilliz Leaderboard**: https://zilliz.com/benchmark
- **Erik Bernhardsson's ANN Benchmarks**: https://github.com/erikbern/ann-benchmarks