Related: #94545 #94544
Motivation
The existing `vector_similarity` index (HNSW via USearch) is purely in-memory, which becomes prohibitively expensive at billion-vector scale. SPANN (NeurIPS 2021, deployed in Microsoft Bing) solves this with a memory-disk hybrid inverted index: only cluster centroids live in memory, while posting lists live on SSD. This is the natural follow-up to #94545, which the team redirected toward SPANN.
Plan
Implement SPANN as a new index type in MergeTree, mirroring the structure of `MergeTreeIndexVectorSimilarity`. The first version will start with random centroid sampling, then iterate toward hierarchical balanced clustering, as @rschu1ze suggested.
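
To make the first iteration concrete, here is a minimal sketch of the random-centroid-sampling variant. It is an illustration only, not the proposed implementation: `ToySpannIndex`, `buildIndex`, `search`, and the `nprobe` parameter are hypothetical names, nothing here uses ClickHouse APIs, the posting lists are kept in memory as a stand-in for the SSD-resident lists, and hierarchical balanced clustering is not attempted.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

using Vector = std::vector<float>;

// Squared L2 distance between two vectors of equal dimension.
static float squaredL2(const Vector & a, const Vector & b)
{
    assert(a.size() == b.size());
    float sum = 0.0f;
    for (size_t i = 0; i < a.size(); ++i)
    {
        const float d = a[i] - b[i];
        sum += d * d;
    }
    return sum;
}

// Hypothetical toy index (assumption, not a ClickHouse type).
// Centroids are the in-memory part; postings stand in for on-disk posting lists.
struct ToySpannIndex
{
    std::vector<Vector> centroids;
    std::vector<std::vector<size_t>> postings; // postings[c] = row ids assigned to centroid c
};

// Ids of the `count` centroids closest to `query`, nearest first.
static std::vector<size_t> nearestCentroids(const ToySpannIndex & index, const Vector & query, size_t count)
{
    std::vector<std::pair<float, size_t>> scored;
    for (size_t c = 0; c < index.centroids.size(); ++c)
        scored.emplace_back(squaredL2(index.centroids[c], query), c);
    count = std::min(count, scored.size());
    std::partial_sort(scored.begin(), scored.begin() + static_cast<std::ptrdiff_t>(count), scored.end());
    std::vector<size_t> result;
    for (size_t i = 0; i < count; ++i)
        result.push_back(scored[i].second);
    return result;
}

// Build step: sample distinct rows as centroids, then assign every row to its nearest centroid.
ToySpannIndex buildIndex(const std::vector<Vector> & rows, size_t num_centroids, unsigned seed)
{
    assert(!rows.empty() && num_centroids > 0);
    num_centroids = std::min(num_centroids, rows.size());

    std::vector<size_t> ids(rows.size());
    std::iota(ids.begin(), ids.end(), size_t{0});
    std::mt19937 rng(seed);
    std::shuffle(ids.begin(), ids.end(), rng); // random centroid sampling

    ToySpannIndex index;
    for (size_t i = 0; i < num_centroids; ++i)
        index.centroids.push_back(rows[ids[i]]);
    index.postings.resize(num_centroids);

    for (size_t r = 0; r < rows.size(); ++r)
        index.postings[nearestCentroids(index, rows[r], 1)[0]].push_back(r);
    return index;
}

// Query step: probe the `nprobe` nearest posting lists and return up to `k` row ids, nearest first.
std::vector<size_t> search(const ToySpannIndex & index, const std::vector<Vector> & rows,
                           const Vector & query, size_t nprobe, size_t k)
{
    std::vector<std::pair<float, size_t>> candidates;
    for (size_t c : nearestCentroids(index, query, nprobe))
        for (size_t r : index.postings[c]) // in a real index this would be a disk read
            candidates.emplace_back(squaredL2(rows[r], query), r);

    k = std::min(k, candidates.size());
    std::partial_sort(candidates.begin(), candidates.begin() + static_cast<std::ptrdiff_t>(k), candidates.end());
    std::vector<size_t> result;
    for (size_t i = 0; i < k; ++i)
        result.push_back(candidates[i].second);
    return result;
}
```

In this sketch only `centroids` needs to stay resident. Each query touches `nprobe` posting lists, so `nprobe` is the knob that trades recall against the number of disk reads.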
Open questions
- Per-part vs. global centroid index
- Part merge rebuild strategy
- Async I/O integration point for posting list reads
References