- [Step 7b: Examine the Airflow DAG](#step-7b-examine-the-airflow-dag)
- [Q: What if different feature views have different freshness requirements?](#q-what-if-different-feature-views-have-different-freshness-requirements)
- [Step 7c: Enable the Airflow DAG](#step-7c-enable-the-airflow-dag)
- [Step 7d (optional): Run a backfill](#step-7d-optional-run-a-backfill)
- [Conclusion](#conclusion)
- [FAQ](#faq)
  - [How do you synchronize materialized features with pushed features from streaming?](#how-do-you-synchronize-materialized-features-with-pushed-features-from-streaming)
# Workshop

## Step 1: Install Feast

First, we install Feast with Spark, Postgres, and Redis support:

```bash
pip install "feast[spark,postgres,redis]"
```

## Step 2: Inspect the data
We'll now switch gears into a Jupyter notebook.
Run the Jupyter notebook ([feature_repo/workshop.ipynb](feature_repo/module_1.ipynb)).

### A note on Feast feature servers + push servers

The above notebook introduces a way to curl an HTTP endpoint to push features to, or retrieve features from, Redis.
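For instance, a feature server started with `feast serve` (which listens on port 6566 by default) can be queried roughly like this. The feature refs, entity keys, and push source name below are illustrative assumptions, not values from this repo:

```bash
# Retrieve online features (feature refs and entity keys are hypothetical):
curl -s -X POST http://localhost:6566/get-online-features \
  -H 'Content-Type: application/json' \
  -d '{"features": ["driver_hourly_stats:conv_rate"], "entities": {"driver_id": [1001, 1002]}}'

# Push a row of fresh feature values to the online store
# (push source name and schema are hypothetical):
curl -s -X POST http://localhost:6566/push \
  -H 'Content-Type: application/json' \
  -d '{"push_source_name": "driver_stats_push_source", "to": "online", "df": {"driver_id": [1001], "conv_rate": [0.85], "event_timestamp": ["2022-01-01 00:00:00"]}}'
```

Both commands require a running feature server; adjust the host and port to your deployment.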
By default, the servers cache the registry, expiring and reloading it every 10 minutes. You can customize that period in `feature_store.yaml`.

Let's look at the `feature_store.yaml` used in this module (which configures the registry differently than in the previous module):

```yaml
project: feast_demo_local
provider: local
registry:
  path: data/local_registry.db
  cache_ttl_seconds: 5
online_store:
  type: redis
  connection_string: localhost:6379
offline_store:
  type: file
```

The `registry` config maps to constructor arguments of the `RegistryConfig` Pydantic model ([reference](https://rtd.feast.dev/en/master/index.html#feast.repo_config.RegistryConfig)).
- In the `feature_store.yaml` above, note the `cache_ttl_seconds` of 5. Every five seconds, the feature server and push server expire their registry cache; on the next request, they refresh it by pulling from the registry path.
- As a convenience, if you specify just `registry: [path]`, Feast maps that to `RegistryConfig(path=[your path])`.
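The caching behavior can be sketched in plain Python (an illustrative model of the TTL logic, not Feast's actual implementation):

```python
import time

class TTLCache:
    """Illustrative model of the registry cache: a value is reused until
    ttl_seconds elapses, and the next read after that reloads it."""

    def __init__(self, load, ttl_seconds):
        self.load = load        # callable that fetches a fresh copy
        self.ttl = ttl_seconds
        self.value = None
        self.loaded_at = None   # monotonic timestamp of the last load

    def get(self):
        now = time.monotonic()
        if self.loaded_at is None or now - self.loaded_at >= self.ttl:
            self.value, self.loaded_at = self.load(), now
        return self.value

loads = []
cache = TTLCache(lambda: loads.append("registry") or len(loads), ttl_seconds=5)
cache.get()
cache.get()        # within the TTL: served from cache, no reload
print(len(loads))  # → 1
```

A lower `cache_ttl_seconds` means fresher registry metadata at the cost of more frequent reads from the registry path.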
## Step 7: Scaling up and scheduling materialization

### Background: configuring materialization
By default, materialization will pull all the latest feature values for each unique entity into memory, and then write to the online store.
You can speed this up and scale it out in different ways:
To run many parallel materialization jobs, you'll want to use the **SQL registry** (which is already used in this module). Then you can run multiple materialization jobs in parallel (e.g. using `feast materialize [FEATURE_VIEW_NAME] start_time end_time`).
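One simple way to parallelize, sketched below, is to split the overall time range into disjoint windows and launch one materialization job per window. The feature view name in the printed commands is a hypothetical placeholder:

```python
from datetime import datetime

def materialize_windows(start, end, n_jobs):
    """Split [start, end) into n_jobs equal windows, one per parallel job."""
    step = (end - start) / n_jobs
    return [(start + i * step, start + (i + 1) * step) for i in range(n_jobs)]

windows = materialize_windows(datetime(2022, 1, 1), datetime(2022, 1, 3), n_jobs=4)
for start, end in windows:
    # Each of these could run as a separate process or worker:
    print(f"feast materialize driver_hourly_stats {start.isoformat()} {end.isoformat()}")
```

Because the windows are disjoint, the jobs do not overwrite each other's latest values for the same entity.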
### Step 7a: Scheduling materialization

To ensure fresh features, you'll want to schedule materialization jobs regularly. This can be as simple as having a cron job that calls `feast materialize-incremental`.
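For example, a crontab entry that materializes up to the top of each hour might look like this (the repo path is a placeholder, and the schedule is an assumption for illustration):

```
# m h dom mon dow  command
0 * * * *  cd /path/to/feature_repo && feast materialize-incremental $(date -u +"\%Y-\%m-\%dT\%H:\%M:\%S")
```

Note that `%` must be escaped in crontab entries, as above.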
Users may also be interested in integrating with Airflow, in which case you can build a custom Airflow image with the Feast SDK installed, and then use a `PythonOperator` (with `store.materialize`).
We set up a standalone version of Airflow to run the `PythonOperator`:

```bash
cd airflow_demo; sh setup_airflow.sh
```

#### Step 7b: Examine the Airflow DAG

The example DAG runs on a daily basis and materializes *all* feature views based on the start and end interval. Note that there is a 1 hr overlap in the start time to account for potentially late-arriving data in the offline store.
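The interval logic can be sketched as follows (an illustrative model of the overlap described above, not the repo's actual DAG code):

```python
from datetime import datetime, timedelta

def materialization_interval(interval_start, interval_end, overlap=timedelta(hours=1)):
    """Widen the scheduled interval's start by `overlap` so that feature rows
    that arrived late in the offline store are still picked up."""
    return interval_start - overlap, interval_end

start, end = materialization_interval(datetime(2022, 1, 2), datetime(2022, 1, 3))
print(start, end)  # → 2022-01-01 23:00:00 2022-01-03 00:00:00
```

Re-materializing the overlapping hour is safe because materialization writes the latest value per entity.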
#### Q: What if different feature views have different freshness requirements?

There's no built-in mechanism for this, but you could store this logic in the feature views. Then, you can parse these feature views in your Airflow job. You could, for example, have one DAG that runs all feature views with a daily `batch_schedule`, and another DAG that runs all feature views with an hourly `batch_schedule`.
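A sketch of that parsing step, assuming each feature view carries a hypothetical `batch_schedule` tag (this tag key is an illustrative convention, not a built-in Feast field):

```python
from collections import defaultdict

# Stand-ins for feature view metadata; in a real job you would read these
# from the registry, e.g. via store.list_feature_views().
feature_views = [
    {"name": "driver_hourly_stats", "tags": {"batch_schedule": "hourly"}},
    {"name": "user_daily_stats", "tags": {"batch_schedule": "daily"}},
    {"name": "merchant_daily_stats", "tags": {"batch_schedule": "daily"}},
]

by_schedule = defaultdict(list)
for fv in feature_views:
    by_schedule[fv["tags"].get("batch_schedule", "daily")].append(fv["name"])

# The hourly DAG materializes one group, the daily DAG the other:
print(by_schedule["hourly"])  # → ['driver_hourly_stats']
print(by_schedule["daily"])   # → ['user_daily_stats', 'merchant_daily_stats']
```

Each DAG then calls `feast materialize` (or `store.materialize`) only for its own group of feature views.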
#### Step 7c: Enable the Airflow DAG

Now go to `localhost:8080`, use Airflow's auto-generated admin password to log in, and toggle on the `materialize_dag`. It should run one task automatically.

### Step 7d (optional): Run a backfill

To run a backfill (i.e. process previous days of the above while letting Airflow manage state), you can run the following (from the `airflow_demo` directory):

> **Warning:** This works correctly with the Redis online store because it conditionally writes. This logic has not been implemented for other online stores yet, and so can result in incorrect behavior.

```bash
export AIRFLOW_HOME=$(pwd)/airflow_home
airflow dags backfill \
    --start-date 2019-11-21 \
    --end-date 2019-11-25 \
    materialize_dag
```
# Conclusion
By the end of this module, you will have learned how to build streaming features that power real-time models with Feast. Feast abstracts away the need to think about data modeling in the online store and helps you: