Commit 3d7a1ca: Merge pull request #11 from adchia/main
Updating READMEs for the Airflow components
2 parents 5e1f0df + 3ac536e

File tree: 1 file changed, +48 −27 lines

module_1/README.md

Lines changed: 48 additions & 27 deletions
@@ -18,11 +18,14 @@ In this module, we focus on building features for online serving, and keeping th
 - [Step 5: Why register streaming features in Feast?](#step-5-why-register-streaming-features-in-feast)
   - [Understanding the PushSource](#understanding-the-pushsource)
 - [Step 6: Materialize batch features & ingest streaming features](#step-6-materialize-batch-features--ingest-streaming-features)
-  - [Configuring materialization](#configuring-materialization)
-  - [Scheduling materialization](#scheduling-materialization)
-  - [Examine the Airflow DAG](#examine-the-airflow-dag)
-    - [Q: What if different feature views have different freshness requirements?](#q-what-if-different-feature-views-have-different-freshness-requirements)
   - [A note on Feast feature servers + push servers](#a-note-on-feast-feature-servers--push-servers)
+- [Step 7: Scaling up and scheduling materialization](#step-7-scaling-up-and-scheduling-materialization)
+  - [Background: configuring materialization](#background-configuring-materialization)
+  - [Step 7a: Scheduling materialization](#step-7a-scheduling-materialization)
+  - [Step 7b: Examine the Airflow DAG](#step-7b-examine-the-airflow-dag)
+    - [Q: What if different feature views have different freshness requirements?](#q-what-if-different-feature-views-have-different-freshness-requirements)
+  - [Step 7c: Enable the Airflow DAG](#step-7c-enable-the-airflow-dag)
+  - [Step 7d (optional): Run a backfill](#step-7d-optional-run-a-backfill)
 - [Conclusion](#conclusion)
 - [FAQ](#faq)
   - [How do you synchronize materialized features with pushed features from streaming?](#how-do-you-synchronize-materialized-features-with-pushed-features-from-streaming)
@@ -32,9 +35,9 @@ In this module, we focus on building features for online serving, and keeping th
 # Workshop
 ## Step 1: Install Feast

-First, we install Feast with Spark and Redis support:
+First, we install Feast with Spark, Postgres, and Redis support:
 ```bash
-pip install "feast[spark,redis]"
+pip install "feast[spark,postgres,redis]"
 ```

 ## Step 2: Inspect the data
@@ -181,7 +184,32 @@ We'll switch gears into a Jupyter notebook. This will guide you through:

 Run the Jupyter notebook ([feature_repo/workshop.ipynb](feature_repo/module_1.ipynb)).

-### Configuring materialization
+### A note on Feast feature servers + push servers
+The above notebook introduces a way to curl an HTTP endpoint to push or retrieve features from Redis.
+
+The servers by default cache the registry (expiring and reloading every 10 minutes). If you want to customize that time period, you can do so in `feature_store.yaml`.
+
+Let's look at the `feature_store.yaml` used in this module (which configures the registry differently than in the previous module):
+
+```yaml
+project: feast_demo_local
+provider: local
+registry:
+  path: data/local_registry.db
+  cache_ttl_seconds: 5
+online_store:
+  type: redis
+  connection_string: localhost:6379
+offline_store:
+  type: file
+```
+
+The `registry` config maps to constructor arguments for the `RegistryConfig` Pydantic model ([reference](https://rtd.feast.dev/en/master/index.html#feast.repo_config.RegistryConfig)).
+- In the `feature_store.yaml` above, note that there is a `cache_ttl_seconds` of 5. This ensures that every five seconds, the feature server and push server will expire their registry cache. On the following request, they will refresh the registry by pulling from the registry path.
+- Feast adds a convenience wrapper so that if you specify just `registry: [path]`, Feast will map that to `RegistryConfig(path=[your path])`.
+
+## Step 7: Scaling up and scheduling materialization
+### Background: configuring materialization
 By default, materialization will pull all the latest feature values for each unique entity into memory, and then write to the online store.

 You can speed up / scale this up in different ways:
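As an aside on the `registry: [path]` shorthand described above, the convenience mapping can be pictured with a small sketch. This is not Feast's actual implementation; it is an illustrative stand-in that mirrors the documented behavior (a bare path becomes a full registry config, and the cache TTL defaults to the 10 minutes mentioned above unless overridden):

```python
# Illustrative sketch only, NOT Feast's code: mimics how a bare
# `registry: <path>` value could expand into RegistryConfig-style
# keyword arguments, with a 10-minute (600 s) default cache TTL.
def normalize_registry_config(registry):
    if isinstance(registry, str):
        # `registry: data/local_registry.db` -> RegistryConfig(path=...)
        registry = {"path": registry}
    # Keep an explicit cache_ttl_seconds (e.g. the module's 5), else default.
    registry.setdefault("cache_ttl_seconds", 600)
    return registry

print(normalize_registry_config("data/local_registry.db"))
print(normalize_registry_config({"path": "data/local_registry.db",
                                 "cache_ttl_seconds": 5}))
```

With the module's YAML above, the explicit `cache_ttl_seconds: 5` survives; with the shorthand, only the path is set and defaults fill in the rest.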
@@ -192,7 +220,7 @@ You can speed up / scale this up in different ways:
 To run many parallel materialization jobs, you'll want to use the **SQL registry** (which is already used in this module).
 Then you could run multiple materialization jobs in parallel (e.g. using `feast materialize [FEATURE_VIEW_NAME] start_time end_time`).

-### Scheduling materialization
+### Step 7a: Scheduling materialization
 To ensure fresh features, you'll want to schedule materialization jobs regularly. This can be as simple as having a cron job that calls `feast materialize-incremental`.

 Users may also be interested in integrating with Airflow, in which case you can build a custom Airflow image with the Feast SDK installed, and then use a `PythonOperator` (with `store.materialize`).
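The `PythonOperator` approach mentioned above could look roughly like the following sketch. The window helper is plain Python and runnable; the Airflow wiring is left as comments because it needs an Airflow image with the Feast SDK installed, and the DAG id, repo path, and one-hour overlap are illustrative assumptions:

```python
from datetime import datetime, timedelta

def materialize_window(data_interval_end, overlap=timedelta(hours=1)):
    """Daily materialization window, padded at the start by `overlap`
    to tolerate late-arriving data in the offline store."""
    start = data_interval_end - timedelta(days=1) - overlap
    return start, data_interval_end

# Hypothetical Airflow wiring (assumes the Feast SDK is in the Airflow image):
#
# from airflow.operators.python import PythonOperator
# from feast import FeatureStore
#
# def materialize(data_interval_end=None, **context):
#     store = FeatureStore(repo_path="feature_repo/")
#     start, end = materialize_window(data_interval_end)
#     store.materialize(start, end)
#
# materialize_task = PythonOperator(
#     task_id="materialize", python_callable=materialize, dag=dag)

print(materialize_window(datetime(2021, 1, 2)))
```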
@@ -203,7 +231,7 @@ We setup a standalone version of Airflow to set up the PythonOperator (Airflow n
 cd airflow_demo; sh setup_airflow.sh
 ```

-#### Examine the Airflow DAG
+#### Step 7b: Examine the Airflow DAG

 The example DAG runs on a daily basis and materializes *all* feature views based on the start and end interval. Note that there is a 1 hr overlap in the start time to account for potentially late-arriving data in the offline store.

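On the question of feature views with different freshness requirements (answered in the hunk below by storing a schedule in feature view tags), a hypothetical grouping helper might look like this. The `batch_schedule` tag key comes from the README text; the function, view names, and data shape are illustrative, not a Feast API:

```python
# Hypothetical sketch: group feature views by a "batch_schedule" tag so one
# DAG can materialize the daily views and another the hourly ones.
# `feature_views` is a list of (name, tags) pairs for illustration; in real
# code you would read registered feature views via the Feast SDK.
def group_by_batch_schedule(feature_views, default="daily"):
    groups = {}
    for name, tags in feature_views:
        schedule = tags.get("batch_schedule", default)
        groups.setdefault(schedule, []).append(name)
    return groups

views = [
    ("user_profile_features", {"batch_schedule": "daily"}),
    ("transaction_count_features", {"batch_schedule": "hourly"}),
    ("merchant_features", {}),  # untagged -> falls back to the default
]
print(group_by_batch_schedule(views))
```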
@@ -246,29 +274,22 @@ There's no built in mechanism for this, but you could store this logic in the fe

 Then, you can parse these feature views in your Airflow job. For example, you could have one DAG that runs all the daily `batch_schedule` feature views, and another DAG that runs all feature views with an hourly `batch_schedule`.

-### A note on Feast feature servers + push servers
-The above notebook introduces a way to curl an HTTP endpoint to push or retrieve features from Redis.
+#### Step 7c: Enable the Airflow DAG
+Now go to `localhost:8080`, use Airflow's auto-generated admin password to log in, and toggle on the `materialize_dag`. It should run one task automatically.

-The servers by default cache the registry (expiring and reloading every 10 minutes). If you want to customize that time period, you can do so in `feature_store.yaml`.
+### Step 7d (optional): Run a backfill
+To run a backfill (i.e. process previous days of the above while letting Airflow manage state), you can run (from the `airflow_demo` directory):

-Let's look at the `feature_store.yaml` used in this module (which configures the registry differently than in the previous module):
+> **Warning:** This works correctly with the Redis online store because it conditionally writes. This logic has not been implemented for other online stores yet, so it can result in incorrect behavior.

-```yaml
-project: feast_demo_local
-provider: local
-registry:
-  path: data/local_registry.db
-  cache_ttl_seconds: 5
-online_store:
-  type: redis
-  connection_string: localhost:6379
-offline_store:
-  type: file
+```bash
+export AIRFLOW_HOME=$(pwd)/airflow_home
+airflow dags backfill \
+    --start-date 2019-11-21 \
+    --end-date 2019-11-25 \
+    materialize_dag
 ```

-The `registry` config maps to constructor arguments for the `RegistryConfig` Pydantic model ([reference](https://rtd.feast.dev/en/master/index.html#feast.repo_config.RegistryConfig)).
-- In the `feature_store.yaml` above, note that there is a `cache_ttl_seconds` of 5. This ensures that every five seconds, the feature server and push server will expire their registry cache. On the following request, they will refresh the registry by pulling from the registry path.
-- Feast adds a convenience wrapper so that if you specify just `registry: [path]`, Feast will map that to `RegistryConfig(path=[your path])`.

 # Conclusion
 By the end of this module, you will have learned how to build streaming features that power real-time models with Feast. Feast abstracts away the need to think about data modeling in the online store and helps you:
