- [Examine the Airflow DAG](#examine-the-airflow-dag)
- [Q: What if different feature views have different freshness requirements?](#q-what-if-different-feature-views-have-different-freshness-requirements)
- [A note on Feast feature servers + push servers](#a-note-on-feast-feature-servers--push-servers)
- [Conclusion](#conclusion)
- [FAQ](#faq)
- [How do you synchronize materialized features with pushed features from streaming?](#how-do-you-synchronize-materialized-features-with-pushed-features-from-streaming)
- [Does Feast allow pushing features to the offline store?](#does-feast-allow-pushing-features-to-the-offline-store)
- [Can feature / push servers refresh their registry in response to an event? e.g. after a PR merges and `feast apply` is run?](#can-feature--push-servers-refresh-their-registry-in-response-to-an-event-eg-after-a-pr-merges-and-feast-apply-is-run)
- [How do I speed up or scale up materialization?](#how-do-i-speed-up-or-scale-up-materialization)
# Workshop
## Step 1: Install Feast
The key things to note for now are that the registry has been swapped for a SQL-backed registry (Postgres) and that the online store has been configured to be Redis. This is specifically for a single Redis node. If you want to use a Redis cluster, then you'd change this to something like:
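For instance, a `feature_store.yaml` along these lines configures Feast's Redis online store in cluster mode (the node hosts and ports here are illustrative placeholders):

```yaml
online_store:
  type: redis
  redis_type: redis_cluster
  # Comma-separated list of cluster nodes (placeholder hosts/ports).
  connection_string: "redis1:6379,redis2:6379,redis3:6379"
```

Extra connection options (e.g. a password or SSL settings) can also be appended to the `connection_string` as key-value pairs.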

Because we use `redis-py` under the hood, this means Feast also works well with hosted Redis instances like AWS ElastiCache ([docs](https://docs.aws.amazon.com/AmazonElastiCache/latest/red-ug/ElastiCache-Getting-Started-Tutorials-Connecting.html)).

We then use Docker Compose to spin up the services we need.
- This leverages a script (in `kafka_demo/`) that creates a topic, reads from `feature_repo/data/driver_stats.parquet`, generates newer timestamps, and emits them to the topic.
- This also deploys an instance of Redis.
- **Note:** one big difference between this and the previous module is its choice of using Postgres as the registry. See [Using Scalable Registry](https://docs.feast.dev/tutorials/using-scalable-registry) for details.
- This also deploys a Feast push server (on port 6567) + a Feast feature server (on port 6566).
- These servers embed a `feature_store.yaml` file that enables them to connect to a remote registry. The Dockerfile mostly delegates to calling the `feast serve` CLI command, which instantiates a Feast python server ([docs](https://docs.feast.dev/reference/feature-servers/python-feature-server)):

```dockerfile
# Needed to reach online store and registry within Docker network.
RUN sed -i 's/localhost:6379/redis:6379/g' feature_store.yaml
RUN sed -i 's/127.0.0.1:55001/registry:5432/g' feature_store.yaml
ENV FEAST_USAGE=False

CMD ["feast", "serve", "-h", "0.0.0.0"]
```

```
Creating broker ... done
Creating feast_feature_server ... done
Creating feast_push_server ... done
Creating kafka_events ... done
Creating registry ... done
Attaching to zookeeper, redis, broker, feast_push_server, feast_feature_server, kafka_events, registry
...
```

## Step 5: Why register streaming features in Feast?
We'll switch gears into a Jupyter notebook. This will guide you through:
- Registering a `FeatureView` that has a single schema across both a batch source (`FileSource`) with aggregate features and a stream source (`PushSource`).
- **Note:** Feast also supports directly authoring a `StreamFeatureView` that contains stream transformations / aggregations (e.g. via Spark, Flink, or Bytewax), but the onus is on you to actually execute those transformations.
- Materializing feature view values from batch sources to the online store (e.g. Redis).
- Ingesting feature view values from streaming sources (e.g. window aggregate features from Spark + Kafka)
- Retrieving features at low latency from Redis through Feast.
- Working with a Feast push server + feature server to ingest and retrieve features through HTTP endpoints (instead of needing `feature_store.yaml` and `FeatureStore` instances)
Run the Jupyter notebook ([feature_repo/workshop.ipynb](feature_repo/module_1.ipynb)).
### Configuring materialization
By default, materialization will pull all the latest feature values for each unique entity into memory, and then write to the online store.

You can speed up / scale this up in different ways:
- Using a more scalable materialization mechanism (e.g. using the Bytewax or Spark materialization engines)
- Running materialization jobs on a per feature view basis
- Running materialization jobs in parallel

To run many parallel materialization jobs, you'll want to use the **SQL registry** (which is already used in this module).
Then you could run multiple materialization jobs in parallel (e.g. using `feast materialize start_time end_time --views FEATURE_VIEW_NAME`).
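A sketch of what that might look like from a shell, assuming two feature views named `driver_hourly_stats` and `driver_daily_features` exist in your repo (both names and timestamps are illustrative):

```bash
feast materialize 2022-01-01T00:00:00 2022-01-02T00:00:00 --views driver_hourly_stats &
feast materialize 2022-01-01T00:00:00 2022-01-02T00:00:00 --views driver_daily_features &
wait  # block until both background jobs finish
```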
### Scheduling materialization
To ensure fresh features, you'll want to schedule materialization jobs regularly. This can be as simple as having a cron job that calls `feast materialize-incremental`.

Users may also be interested in integrating with Airflow, in which case you can build a custom Airflow image with the Feast SDK installed, and then use a `PythonOperator` (with `store.materialize`).

#### Airflow PythonOperator
We set up a standalone version of Airflow and use the `PythonOperator` (newer versions of Airflow prefer the `@task` decorator for this).

The example DAG runs on a daily basis and materializes *all* feature views over the given start and end interval. Note that there is a 1 hr overlap in the start time to account for potentially late-arriving data in the offline store.

**Note:** normally, you'll probably have different feature views with different freshness requirements, instead of materializing all feature views every day.
See also [FAQ: How do I speed up or scale up materialization?](#how-do-i-speed-up-or-scale-up-materialization)
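A minimal sketch of such a DAG, assuming Airflow 2.2+ with the Feast SDK installed in the image and a `feature_repo/` directory visible to the workers (the DAG id, task id, and repo path are illustrative):

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

from feast import FeatureStore


def materialize(data_interval_start=None, data_interval_end=None, **context):
    store = FeatureStore(repo_path="feature_repo/")
    # Start 1 hr early to pick up late-arriving data in the offline store.
    store.materialize(
        start_date=data_interval_start - timedelta(hours=1),
        end_date=data_interval_end,
    )


with DAG(
    dag_id="feast_materialize",
    schedule_interval="@daily",
    start_date=datetime(2022, 1, 1),
    catchup=False,
) as dag:
    PythonOperator(task_id="materialize", python_callable=materialize)
```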

In this test case, you can also use a single command `feast materialize-incremental $(date +%Y-%m-%d)`, which will materialize features up to the current time.

#### Q: What if different feature views have different freshness requirements?
There's no built-in mechanism for this, but you could store this logic in the feature view tags (e.g. a `batch_schedule`).

Then, you can parse these feature view tags in your Airflow job. You could for example have one DAG that runs all the daily `batch_schedule` feature views, and another DAG that runs all feature views with an hourly `batch_schedule`.
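A sketch of that parsing step in pure Python, with plain dicts standing in for Feast `FeatureView` objects and their tags (the `batch_schedule` tag and the view names are hypothetical):

```python
from collections import defaultdict

def group_views_by_schedule(feature_views, default="daily"):
    """Group feature view names by their (hypothetical) batch_schedule tag."""
    groups = defaultdict(list)
    for fv in feature_views:
        groups[fv["tags"].get("batch_schedule", default)].append(fv["name"])
    return dict(groups)

views = [
    {"name": "driver_hourly_stats", "tags": {"batch_schedule": "hourly"}},
    {"name": "driver_daily_stats", "tags": {"batch_schedule": "daily"}},
    {"name": "miles_driven", "tags": {}},  # untagged -> falls back to daily
]
print(group_views_by_schedule(views))
# → {'hourly': ['driver_hourly_stats'], 'daily': ['driver_daily_stats', 'miles_driven']}
```

Each DAG would then materialize only the views in its group.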
### A note on Feast feature servers + push servers
The above notebook introduces a way to curl an HTTP endpoint to push or retrieve features from Redis.
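For reference, requests against the servers in this module look roughly like this (ports match the Docker Compose setup above; the feature, entity, and push source names are illustrative):

```bash
# Retrieve features from the feature server (port 6566).
curl -X POST http://localhost:6566/get-online-features -d '{
  "features": ["driver_hourly_stats:conv_rate"],
  "entities": {"driver_id": [1001]}
}'

# Push a feature row through the push server (port 6567).
curl -X POST http://localhost:6567/push -d '{
  "push_source_name": "driver_stats_push_source",
  "df": {"driver_id": [1001], "conv_rate": [0.85], "event_timestamp": ["2022-01-01 00:00:00"]}
}'
```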
This relies on individual online store implementations. The existing Redis online store implementation, for example, checks the event timestamp on write so that an older value does not overwrite a fresher one.
Doing this event timestamp checking is expensive though and slows down writes. In many cases, this is not preferred. Databases often support storing multiple versions of the same value, so you can leverage that (+ TTLs) to query the most recent version at read time.
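To make the tradeoff concrete, here is a toy sketch (not Feast's actual code) of the kind of timestamp check such a write path performs; the entity and feature names are illustrative:

```python
from datetime import datetime

online_store = {}  # (entity_key, feature) -> (event_timestamp, value)

def write_if_fresher(entity_key, feature, value, event_ts):
    """Only write if the incoming event timestamp is at least as fresh."""
    key = (entity_key, feature)
    current = online_store.get(key)
    if current is not None and event_ts < current[0]:
        return False  # stale write (e.g. from materialization) is dropped
    online_store[key] = (event_ts, value)
    return True

# A streamed push lands first, then an older materialized value arrives:
write_if_fresher("driver_1001", "conv_rate", 0.9, datetime(2022, 1, 2))
assert not write_if_fresher("driver_1001", "conv_rate", 0.5, datetime(2022, 1, 1))
assert online_store[("driver_1001", "conv_rate")][1] == 0.9
```

The read-before-write on every set is exactly the per-write cost described above.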
### Does Feast allow pushing features to the offline store?
Yes! See more details at https://docs.feast.dev/reference/data-sources/push#pushing-data
### Can feature / push servers refresh their registry in response to an event? e.g. after a PR merges and `feast apply` is run?
Unfortunately, the servers don't currently support this. Feel free to contribute a PR to enable it! The tricky part here is that Feast would need to keep track of these servers in the registry (or in some other way), which is not how Feast is currently designed.
### How do I speed up or scale up materialization?
Materialization in Feast by default pulls the latest feature values for each unique entity locally and writes in batches to the online store.
- Feast users can materialize multiple feature views by using the CLI (e.g. `feast materialize` with the `--views` option).
- **Caveat**: By default, Feast's registry store is a single protobuf written to a file. This means that there's the chance that metadata around materialization intervals gets lost if the registry has changed during materialization.
- The community is ideating on how to improve this. See [RFC-035: Scalable Materialization](https://docs.google.com/document/d/1tCZzClj3H8CfhJzccCytWK-bNDw_lkZk4e3fUbPYIP0/edit#)
- Users often also implement their own custom providers. The provider interface has a `materialize_single_feature_view` method, which users are free to implement differently (e.g. materializing with Spark or Dataflow jobs).
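A hypothetical sketch of that provider hook (the method signature is abbreviated; check the `Provider` interface in your Feast version for the exact one, and `submit_spark_job` is a made-up helper):

```python
from feast.infra.passthrough_provider import PassthroughProvider

class SparkMaterializationProvider(PassthroughProvider):
    def materialize_single_feature_view(
        self, config, feature_view, start_date, end_date, registry, project, tqdm_builder
    ):
        # Instead of pulling rows into local memory, hand the interval to a
        # Spark job that reads the batch source and writes to the online store.
        submit_spark_job(feature_view, start_date, end_date)  # hypothetical
```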
In general, the community is actively investigating ways to speed up materialization. Contributions are welcome!