Add SPANN memory-disk hybrid vector similarity index #102146

@zex-hyd

Related: #94545 #94544

Motivation

The existing vector_similarity index (HNSW via USearch) is purely in-memory, which becomes prohibitively expensive at billion-scale. SPANN (NeurIPS 2021, deployed in Microsoft Bing) solves this with a memory-disk hybrid inverted index: only cluster centroids live in memory, posting lists live on SSD. This is the natural follow-up to #94545, which the team redirected toward SPANN.
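To make the memory-disk split concrete, here is a minimal sketch of the layout described above. It is a toy illustration, not ClickHouse code or the SPANN reference implementation: the class `HybridIndex`, the record format, and the `nprobe` parameter name are all assumptions made for this example. Centroids and a small offset table stay in memory, and each posting list is a contiguous byte range in a file that is read only when its centroid is probed.

```python
import math
import struct


def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class HybridIndex:
    """Toy hybrid index: centroids in RAM, posting lists in a file (hypothetical layout)."""

    def __init__(self, centroids, vectors, path):
        self.centroids = centroids            # resident in memory
        self.path = path                      # posting lists live here
        self.offsets = []                     # (byte offset, record count) per centroid
        dim = len(centroids[0])
        # One record = 64-bit row id followed by `dim` float32 components.
        self.record = struct.Struct(f"<q{dim}f")

        # Assign every vector to its nearest centroid.
        lists = [[] for _ in centroids]
        for row_id, vec in enumerate(vectors):
            c = min(range(len(centroids)), key=lambda i: l2(vec, centroids[i]))
            lists[c].append((row_id, vec))

        # Write each posting list as one contiguous run of records.
        with open(path, "wb") as f:
            for plist in lists:
                self.offsets.append((f.tell(), len(plist)))
                for row_id, vec in plist:
                    f.write(self.record.pack(row_id, *vec))

    def search(self, query, k=1, nprobe=1):
        """Return row ids of the k nearest vectors among the nprobe closest posting lists."""
        order = sorted(range(len(self.centroids)),
                       key=lambda i: l2(query, self.centroids[i]))
        candidates = []
        with open(self.path, "rb") as f:
            for c in order[:nprobe]:
                offset, count = self.offsets[c]
                if count == 0:
                    continue
                f.seek(offset)                # one sequential read per probed list
                data = f.read(count * self.record.size)
                for row_id, *vec in self.record.iter_unpack(data):
                    candidates.append((l2(query, vec), row_id))
        return [row_id for _, row_id in sorted(candidates)[:k]]
```

In this sketch memory use grows with the number of centroids rather than the number of vectors, and each query costs `nprobe` sequential reads, which is the trade-off the proposal is after.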

Plan

Implement SPANN as a new index type in MergeTree, mirroring the structure of MergeTreeIndexVectorSimilarity. The first iteration will use random centroid sampling; later iterations will move toward hierarchical balanced clustering, as @rschu1ze suggested.
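As a rough illustration of the first iteration, random centroid sampling could look like the sketch below. The function names `sample_centroids` and `build_posting_lists`, the `seed` parameter, and the use of squared L2 distance are assumptions for this example and do not reflect the planned C++ interface.

```python
import random


def squared_l2(a, b):
    """Squared Euclidean distance; sufficient for nearest-centroid comparisons."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def sample_centroids(vectors, num_centroids, seed=0):
    """Pick centroids by sampling existing vectors uniformly without replacement."""
    rng = random.Random(seed)
    return rng.sample(vectors, num_centroids)


def build_posting_lists(vectors, centroids):
    """Assign each row id to the posting list of its nearest centroid."""
    postings = [[] for _ in centroids]
    for row_id, vec in enumerate(vectors):
        nearest = min(range(len(centroids)),
                      key=lambda i: squared_l2(vec, centroids[i]))
        postings[nearest].append(row_id)
    return postings
```

Nothing in random sampling bounds the length of a posting list, so dense regions of the data can produce very long lists. That imbalance is what the later move to hierarchical balanced clustering is meant to address.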

Open questions

  • Per-part vs. global centroid index
  • Part merge rebuild strategy
  • Async I/O integration point for posting list reads
