
Commit 8cbf60a (1 parent: b5d0f4f)

add point on to_bigquery()

Signed-off-by: Danny Chiao <danny@tecton.ai>

File tree: 1 file changed (+33, −4 lines)


module_0/README.md

Lines changed: 33 additions & 4 deletions
@@ -52,18 +52,19 @@ A quick explanation of what's happening here:

- Generally, custom offline + online stores and providers are supported and can plug in.
  - e.g. see [adding a new offline store](https://docs.feast.dev/how-to-guides/adding-a-new-offline-store), [adding a new online store](https://docs.feast.dev/how-to-guides/adding-support-for-a-new-online-store)
- **Project**
  - Users can only request features from a single project.
- **Provider**
  - Defaults can be easily overridden in `feature_store.yaml`.
  - For example, one can use the `aws` provider and specify Snowflake as the offline store.
- **Offline Store**
  - We recommend using a data warehouse or Spark as the offline store for performant training dataset generation.
  - Here, we use file sources for instructional purposes. These are read directly from files (local or remote), with Dask executing the point-in-time joins.
  - A project can only support one type of offline store (it cannot mix Snowflake + file sources, for example).
  - Each offline store has its own configuration, which maps to YAML (e.g. see [BigQueryOfflineStoreConfig](https://rtd.feast.dev/en/master/index.html#feast.infra.offline_stores.bigquery.BigQueryOfflineStoreConfig)).
- **Online Store**
  - If you don't need to power real-time models with fresh features, this is not needed.
  - If you are precomputing predictions in batch ("batch scoring"), the online store is optional; use the offline store and run `feature_store.get_historical_features`.
  - Each online store has its own configuration, which maps to YAML (e.g. [RedisOnlineStoreConfig](https://rtd.feast.dev/en/master/feast.infra.online_stores.html#feast.infra.online_stores.redis.RedisOnlineStoreConfig)).

With the `feature_store.yaml` setup, you can now run `feast apply` to create & populate the registry.
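To make the bullet points above concrete, a minimal `feature_store.yaml` for a file-based setup might look like the following. This is a sketch, not the repo's actual config: the project name and paths are placeholders, and it assumes Feast's `local` provider with its default SQLite online store and `file` offline store.

```yaml
# Hypothetical minimal config for the file-based setup described above.
project: feast_demo_local
provider: local
registry: data/registry.pb      # the registry could also live remotely, e.g. in S3
online_store:
  type: sqlite                  # local provider's default online store
  path: data/online_store.db
offline_store:
  type: file                    # file sources; Dask executes point-in-time joins
```

With a file like this in the feature repo, `feast apply` would register the defined objects and populate the registry at the configured path.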
### Step 2: Adding the feature repo to version control
@@ -154,6 +155,34 @@ training_df = store.get_historical_features(
predictions = model.predict(training_df)
```

### A note on scalability
You may note that the above example uses the `to_df()` method to load the training dataset into memory, and wonder how this scales if you have very large datasets.

`get_historical_features` actually returns a `RetrievalJob` object that lazily executes the point-in-time join. Each offline store extends the `RetrievalJob` class to allow flushing results to, e.g., a data warehouse or data lake.

Let's look at an example with BigQuery as the offline store:
```yaml
project: feast_demo_gcp
provider: gcp
registry: gs://[YOUR BUCKET]/registry.pb
offline_store:
  type: bigquery
  location: EU
flags:
  alpha_features: true
  on_demand_transforms: true
```

Retrieving the data with `get_historical_features` gives a `BigQueryRetrievalJob` object ([reference](https://rtd.feast.dev/en/master/index.html#feast.infra.offline_stores.bigquery.BigQueryRetrievalJob)), which exposes a `to_bigquery()` method. Thus, you can do:

```python
path = store.get_historical_features(
    entity_df=entity_df,
    features=store.get_feature_service("model_v2"),
).to_bigquery()

# Continue with distributed training or batch predictions from the BigQuery dataset.
```
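The lazy-execution idea behind `RetrievalJob` can be sketched in plain Python. This is a hypothetical stand-in, not Feast's actual classes: `LazyRetrievalJob` and `to_storage` are invented names used only to illustrate why constructing the job is cheap and why `to_bigquery()` avoids pulling results into local memory.

```python
class LazyRetrievalJob:
    """Hypothetical stand-in for Feast's RetrievalJob: the point-in-time
    join is held as deferred work and only runs when materialized."""

    def __init__(self, run_query):
        self._run_query = run_query  # nothing executes yet
        self.executed = False

    def to_df(self):
        # Materialize the full result into memory (like RetrievalJob.to_df()).
        self.executed = True
        return self._run_query()

    def to_storage(self, path):
        # Flush results to storage instead of local memory -- analogous to
        # BigQueryRetrievalJob.to_bigquery(), which writes a table and
        # returns a reference to it rather than a DataFrame.
        rows = self.to_df()
        return f"{path} ({len(rows)} rows)"


# Constructing the job is cheap; the (fake) join runs only on materialization.
job = LazyRetrievalJob(lambda: [{"driver_id": 1001, "trips": 3}])
assert job.executed is False
rows = job.to_df()
assert job.executed is True and rows[0]["driver_id"] == 1001
```

The design point is that the expensive join is deferred until a `to_*` method names a destination, so each offline store can route results to wherever downstream training or batch scoring runs.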
# Conclusion

As a result:
- You have file sources (possibly remote) and a remote registry (e.g. in S3)
