Commit 37f36b6: fix: Add vector database doc (#4165)
# [Alpha] Vector Database
**Warning**: This is an _experimental_ feature. To our knowledge it is stable, but there are still rough edges in the experience. Contributions are welcome!


## Overview
A vector database lets users store and retrieve embeddings. Feast provides general APIs for storing embeddings in, and retrieving them from, a supported vector database.


## Integration
Below are the supported vector databases and the features implemented for each:


| Vector Database | Retrieval | Indexing |
|-----------------|-----------|----------|
| Pgvector        | [x]       | [ ]      |
| Elasticsearch   | [ ]       | [ ]      |
| Milvus          | [ ]       | [ ]      |
| Faiss           | [ ]       | [ ]      |


## Example

See [https://github.com/feast-dev/feast-workshop/blob/rag/module_4_rag](https://github.com/feast-dev/feast-workshop/blob/rag/module_4_rag) for a full example of how to use a vector database.

### **Prepare offline embedding dataset**
Run the following commands to prepare the embedding dataset:
```shell
python pull_states.py
python batch_score_documents.py
```
The output will be stored in `data/city_wikipedia_summaries.csv`.

### **Initialize Feast feature store and materialize the data to the online store**
Use the `feature_store.yaml` file to initialize the feature store. This configuration uses a local file as the offline store and Pgvector as the online store.

```yaml
project: feast_demo_local
provider: local
registry:
  registry_type: sql
  path: postgresql://@localhost:5432/feast
online_store:
  type: postgres
  pgvector_enabled: true
  vector_len: 384
  host: 127.0.0.1
  port: 5432
  database: feast
  user: ""
  password: ""

offline_store:
  type: file
entity_key_serialization_version: 2
```
Run the following command in the terminal to apply the feature store configuration:

```shell
feast apply
```

Note that when you run `feast apply`, you apply the following Feature View, which we will use for retrieval later:

```python
from datetime import timedelta

from feast import FeatureView, Field
from feast.types import Array, Float32

# `item` (the entity) and `source` (the data source) are defined
# elsewhere in the feature repo.
city_embeddings_feature_view = FeatureView(
    name="city_embeddings",
    entities=[item],
    schema=[
        Field(name="Embeddings", dtype=Array(Float32)),
    ],
    source=source,
    ttl=timedelta(hours=2),
)
```

Then run the following command in the terminal to materialize the data to the online store:

```shell
CURRENT_TIME=$(date -u +"%Y-%m-%dT%H:%M:%S")
feast materialize-incremental $CURRENT_TIME
```
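
The cutoff timestamp above can equivalently be produced in Python instead of with `date` (a small illustrative snippet, not part of the workshop code):

```python
from datetime import datetime, timezone

# Same format as `date -u +"%Y-%m-%dT%H:%M:%S"`: UTC, second resolution.
current_time = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S")
print(current_time)  # e.g. 2024-06-01T12:34:56
```

Any timestamp in this format works as the end date for `feast materialize-incremental`.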

### **Prepare a query embedding**
```python
from batch_score_documents import run_model, TOKENIZER, MODEL
from transformers import AutoTokenizer, AutoModel

question = "the most populous city in the U.S. state of Texas?"

tokenizer = AutoTokenizer.from_pretrained(TOKENIZER)
model = AutoModel.from_pretrained(MODEL)
query_embedding = run_model(question, tokenizer, model)
query = query_embedding.detach().cpu().numpy().tolist()[0]
```

### **Retrieve the top 5 similar documents**
First create a feature store instance, then use the `retrieve_online_documents` API to retrieve the top 5 documents most similar to the query:

```python
from feast import FeatureStore

store = FeatureStore(repo_path=".")
features = store.retrieve_online_documents(
    feature="city_embeddings:Embeddings",
    query=query,
    top_k=5,
).to_dict()

def print_online_features(features):
    for key, value in sorted(features.items()):
        print(key, " : ", value)

print_online_features(features)
```
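
Conceptually, the call above ranks every stored embedding by its distance to the query vector and keeps the `top_k` closest. A minimal pure-Python sketch of that ranking, using cosine similarity (illustrative only; in practice Feast delegates this to the vector database, e.g. Pgvector, and the toy document ids below are made up):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_k(query, documents, k=5):
    """Return the ids of the k documents most similar to the query embedding."""
    scored = sorted(
        documents.items(),
        key=lambda item: cosine_similarity(query, item[1]),
        reverse=True,
    )
    return [doc_id for doc_id, _ in scored[:k]]

# Toy 2-dimensional embeddings keyed by document id.
docs = {"houston": [1.0, 0.1], "dallas": [0.9, 0.2], "maine": [0.0, 1.0]}
print(top_k([1.0, 0.0], docs, k=2))  # → ['houston', 'dallas']
```

Real deployments use higher-dimensional vectors (384 here, per `vector_len`) and approximate indexes rather than this exhaustive scan, but the ranking idea is the same.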
