Add SPANN memory-disk hybrid vector similarity index #102146

Related: #94545 #94544

## Motivation

The existing `vector_similarity` index (HNSW via USearch) is held entirely in memory, which becomes prohibitively expensive at billion-vector scale. SPANN (NeurIPS 2021, deployed in Microsoft Bing) addresses this with a memory-disk hybrid inverted index: only the cluster centroids live in memory, while the posting lists live on SSD. This is the natural follow-up to #94545, which the team redirected toward SPANN.

## Plan

Implement SPANN as a new index type in MergeTree, mirroring the structure of `MergeTreeIndexVectorSimilarity`. The first version will use random centroid sampling, then iterate toward hierarchical balanced clustering, as @rschu1ze suggested.
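
For discussion, here is a minimal Python sketch of what the first iteration (random centroid sampling) could look like. It is not the proposed C++ implementation: the names `build_index`, `search`, and `nprobe` are hypothetical, the posting lists are plain in-memory lists standing in for SSD-resident data, and squared L2 distance is assumed. It also leaves out balanced clustering and any other refinements from the paper.

```python
import heapq
import random


def l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_index(vectors, num_centroids, seed=0):
    """Sample centroids at random and assign each vector to its nearest one.

    Hypothetical helper: returns (centroids, posting_lists), where
    posting_lists[c] holds the row ids assigned to centroid c.
    """
    rng = random.Random(seed)
    centroid_rows = rng.sample(range(len(vectors)), num_centroids)
    centroids = [vectors[i] for i in centroid_rows]  # would stay in memory
    posting_lists = [[] for _ in centroids]          # would live on SSD
    for row_id, vec in enumerate(vectors):
        nearest = min(range(num_centroids), key=lambda c: l2(vec, centroids[c]))
        posting_lists[nearest].append(row_id)
    return centroids, posting_lists


def search(query, vectors, centroids, posting_lists, k, nprobe):
    """Scan the nprobe posting lists closest to the query, return top-k row ids."""
    # Step 1: in-memory lookup of the closest centroids.
    probe = heapq.nsmallest(
        nprobe, range(len(centroids)), key=lambda c: l2(query, centroids[c])
    )
    # Step 2: read only the selected posting lists (the disk reads in SPANN).
    candidates = [row_id for c in probe for row_id in posting_lists[c]]
    # Step 3: exact distance ranking over the candidates.
    return heapq.nsmallest(k, candidates, key=lambda r: l2(query, vectors[r]))
```

In this sketch `nprobe` trades recall for I/O: with `nprobe` equal to the number of centroids every posting list is read and the result matches a brute-force scan, while smaller values read fewer lists and may miss neighbours that fall in unprobed clusters.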

## Open questions

- Per-part vs. global centroid index
- Part merge rebuild strategy
- Async I/O integration point for posting list reads

## References

- Paper: https://arxiv.org/pdf/2111.08566
- SPTAG implementation: https://github.com/microsoft/SPTAG/tree/main/AnnService/inc/Core/SPANN
- turbopuffer writeup: https://turbopuffer.com/blog/ann-v3
