| 1 | +# [Alpha] Vector Database |
| 2 | +**Warning**: This is an _experimental_ feature. To our knowledge, this is stable, but there are still rough edges in the experience. Contributions are welcome! |
| 3 | + |
| 4 | +## Overview |
| 5 | +A vector database lets users store and retrieve embeddings. Feast provides general-purpose APIs for both operations. |
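
To make "retrieve" concrete: retrieval from a vector store generally means ranking stored embeddings by their similarity to a query vector. The sketch below shows that idea in plain Python, independent of Feast. The helper names `cosine_similarity` and `top_k` and the toy data are hypothetical, and the distance metric an actual online store uses may differ.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query: Sequence[float], documents: Dict[str, List[float]], k: int
) -> List[Tuple[str, float]]:
    """Brute-force search: score every document, return the k most similar."""
    scored = [
        (doc_id, cosine_similarity(query, vector))
        for doc_id, vector in documents.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy two-dimensional "embeddings" for illustration only.
docs = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
results = top_k([1.0, 0.1], docs, k=2)
print([doc_id for doc_id, _ in results])
```

A real vector database replaces the brute-force loop with an index, which is what the "Indexing" column in the table below refers to.
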
| 6 | + |
| 7 | +## Integration |
| 8 | +The table below lists the supported vector databases and the features implemented for each: |
| 9 | + |
| 10 | +| Vector Database | Retrieval | Indexing | |
| 11 | +|-----------------|-----------|----------| |
| 12 | +| Pgvector | [x] | [ ] | |
| 13 | +| Elasticsearch | [ ] | [ ] | |
| 14 | +| Milvus | [ ] | [ ] | |
| 15 | +| Faiss | [ ] | [ ] | |
| 16 | + |
| 17 | + |
| 18 | +## Example |
| 19 | + |
| 20 | +See [https://github.com/feast-dev/feast-workshop/blob/rag/module_4_rag](https://github.com/feast-dev/feast-workshop/blob/rag/module_4_rag) for an example of how to use a vector database with Feast. |
| 21 | + |
| 22 | +### **Prepare offline embedding dataset** |
| 23 | +Run the following commands to prepare the embedding dataset: |
| 24 | +```shell |
| 25 | +python pull_states.py |
| 26 | +python batch_score_documents.py |
| 27 | +``` |
| 28 | +The output will be stored in `data/city_wikipedia_summaries.csv`. |
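
Before moving on, it can be useful to look at what the scripts produced. The sketch below reads the header and the first few rows using only the standard library. It makes no assumption about the column names in the file, and `preview_rows` is a hypothetical helper, not part of the workshop code.

```python
import csv
from itertools import islice
from typing import Iterable, List


def preview_rows(lines: Iterable[str], n_rows: int = 3) -> List[List[str]]:
    """Return the header row plus up to n_rows data rows from CSV text lines."""
    reader = csv.reader(lines)
    return list(islice(reader, n_rows + 1))


# Usage, assuming the file from the previous step exists:
# with open("data/city_wikipedia_summaries.csv", newline="", encoding="utf-8") as f:
#     for row in preview_rows(f):
#         print([cell[:40] for cell in row])  # truncate long cells such as embeddings
```
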
| 29 | + |
| 30 | +### **Initialize Feast feature store and materialize the data to the online store** |
| 31 | +Use the `feature_store.yaml` file to initialize the feature store. This configuration uses the prepared data as the offline store and Pgvector as the online store. |
| 32 | + |
| 33 | +```yaml |
| 34 | +project: feast_demo_local |
| 35 | +provider: local |
| 36 | +registry: |
| 37 | +  registry_type: sql |
| 38 | +  path: postgresql://@localhost:5432/feast |
| 39 | +online_store: |
| 40 | +  type: postgres |
| 41 | +  pgvector_enabled: true |
| 42 | +  vector_len: 384 |
| 43 | +  host: 127.0.0.1 |
| 44 | +  port: 5432 |
| 45 | +  database: feast |
| 46 | +  user: "" |
| 47 | +  password: "" |
| 48 | + |
| 49 | + |
| 50 | +offline_store: |
| 51 | +  type: file |
| 52 | +entity_key_serialization_version: 2 |
| 53 | +``` |
| 54 | +Run the following command in a terminal to apply the feature store configuration: |
| 55 | + |
| 56 | +```shell |
| 57 | +feast apply |
| 58 | +``` |
| 59 | + |
| 60 | +Running `feast apply` registers the following feature view, which is used for retrieval later: |
| 61 | + |
| 62 | +```python |
| 63 | +city_embeddings_feature_view = FeatureView( |
| 64 | +    name="city_embeddings", |
| 65 | +    entities=[item], |
| 66 | +    schema=[ |
| 67 | +        Field(name="Embeddings", dtype=Array(Float32)), |
| 68 | +    ], |
| 69 | +    source=source, |
| 70 | +    ttl=timedelta(hours=2), |
| 71 | +) |
| 72 | +``` |
| 73 | + |
| 74 | +Then run the following command in the terminal to materialize the data to the online store: |
| 75 | + |
| 76 | +```shell |
| 77 | +CURRENT_TIME=$(date -u +"%Y-%m-%dT%H:%M:%S") |
| 78 | +feast materialize-incremental $CURRENT_TIME |
| 79 | +``` |
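
The same step can also be driven from Python instead of the CLI. The sketch below wraps the `materialize_incremental` method of `FeatureStore` in a hypothetical helper, `materialize_until_now`. It assumes a timezone-aware UTC timestamp is acceptable as the end date. Feast is imported only in the commented usage lines so the helper itself has no dependency on it.

```python
from datetime import datetime, timezone


def materialize_until_now(store):
    """Materialize new data up to the current UTC time and return that time.

    `store` is expected to be a feast.FeatureStore; only its
    materialize_incremental(end_date=...) method is called.
    """
    end_date = datetime.now(timezone.utc)
    store.materialize_incremental(end_date=end_date)
    return end_date


# Usage, assuming Feast is installed and the feature repo is the current directory:
# from feast import FeatureStore
# materialize_until_now(FeatureStore(repo_path="."))
```
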
| 80 | + |
| 81 | +### **Prepare a query embedding** |
| 82 | +```python |
| 83 | +from batch_score_documents import run_model, TOKENIZER, MODEL |
| 84 | +from transformers import AutoTokenizer, AutoModel |
| 85 | + |
| 86 | +question = "the most populous city in the U.S. state of Texas?" |
| 87 | + |
| 88 | +tokenizer = AutoTokenizer.from_pretrained(TOKENIZER) |
| 89 | +model = AutoModel.from_pretrained(MODEL) |
| 90 | +query_embedding = run_model(question, tokenizer, model) |
| 91 | +query = query_embedding.detach().cpu().numpy().tolist()[0] |
| 92 | +``` |
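
The online store above is configured with `vector_len: 384`, so the query presumably needs to be a flat list of 384 numbers. The sketch below is a hedged sanity check to run before querying. `check_query_vector` and `EXPECTED_VECTOR_LEN` are hypothetical names, and the assumption that a length mismatch causes a retrieval error is not confirmed by the source.

```python
from typing import List, Sequence

EXPECTED_VECTOR_LEN = 384  # assumed to mirror vector_len in feature_store.yaml


def check_query_vector(
    query: Sequence[float], expected_len: int = EXPECTED_VECTOR_LEN
) -> List[float]:
    """Validate that query is a flat sequence of numbers with the expected length."""
    for value in query:
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise TypeError(
                f"expected a flat list of numbers, found {type(value).__name__}"
            )
    if len(query) != expected_len:
        raise ValueError(f"expected {expected_len} values, got {len(query)}")
    return [float(value) for value in query]


# Usage with the `query` list built above:
# query = check_query_vector(query)
```

A `TypeError` here usually means the batch dimension was not stripped, for example when the trailing `[0]` is omitted and `query` is a list of lists.
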
| 93 | + |
| 94 | +### **Retrieve the top 5 similar documents** |
| 95 | +First create a feature store instance, then use the `retrieve_online_documents` API to retrieve the five documents most similar to the query. |
| 96 | + |
| 97 | +```python |
| 98 | +from feast import FeatureStore |
| 99 | +store = FeatureStore(repo_path=".") |
| 100 | +features = store.retrieve_online_documents( |
| 101 | +    feature="city_embeddings:Embeddings", |
| 102 | +    query=query, |
| 103 | +    top_k=5 |
| 104 | +).to_dict() |
| 105 | + |
| 106 | +def print_online_features(features): |
| 107 | +    for key, value in sorted(features.items()): |
| 108 | +        print(key, " : ", value) |
| 109 | + |
| 110 | +print_online_features(features) |
| 111 | +``` |