Commit b5e734b

docs: Add docs on using Feast with dbt + Airflow + missing m13n docs (feast-dev#3304)
* docs: Add documentation on using Feast with dbt + Airflow + adding missing docs
* fix SUMMARY
* revert
* fix
* fix
* fix
* link to lambda docker
* link to lambda docker

Signed-off-by: Danny Chiao <danny@tecton.ai>
1 parent 54faf61 commit b5e734b

14 files changed

Lines changed: 127 additions & 24 deletions

README.md

Lines changed: 2 additions & 2 deletions
@@ -27,10 +27,10 @@ Feast allows ML platform teams to:
 * **Avoid data leakage** by generating point-in-time correct feature sets so data scientists can focus on feature engineering rather than debugging error-prone dataset joining logic. This ensures that future feature values do not leak to models during training.
 * **Decouple ML from data infrastructure** by providing a single data access layer that abstracts feature storage from feature retrieval, ensuring models remain portable as you move from training models to serving models, from batch models to realtime models, and from one data infra system to another.
 
-Please see our [documentation](https://docs.feast.dev/) for more information about the project.
+Please see our [documentation](https://docs.feast.dev/) for more information about the project, or sign up for an [email newsletter](https://feast.dev/).
 
 ## 📐 Architecture
-![](docs/assets/feast-marchitecture.png)
+![](docs/assets/feast_marchitecture.png)
 
 The above architecture is the minimal Feast deployment. Want to run the full Feast on Snowflake/GCP/AWS? Click [here](https://docs.feast.dev/how-to-guides/feast-snowflake-gcp-aws).
 
docs/SUMMARY.md

Lines changed: 2 additions & 0 deletions
@@ -104,6 +104,8 @@
 * [Batch Materialization Engines](reference/batch-materialization/README.md)
   * [Bytewax](reference/batch-materialization/bytewax.md)
   * [Snowflake](reference/batch-materialization/snowflake.md)
+  * [AWS Lambda (alpha)](reference/batch-materialization/lambda.md)
+  * [Spark (contrib)](reference/batch-materialization/spark.md)
 * [Feature repository](reference/feature-repository/README.md)
   * [feature\_store.yaml](reference/feature-repository/feature-store-yaml.md)
   * [.feastignore](reference/feature-repository/feast-ignore.md)
Binary file changed (-77.7 KB; not shown).
Binary file changed (219 KB).

docs/getting-started/quickstart.md

Lines changed: 1 addition & 0 deletions
@@ -555,6 +555,7 @@ show up in the upcoming concepts + architecture + tutorial pages as well.
 
 ## Next steps
 
+* Join the [email newsletter](https://feast.dev/) to get new updates on Feast / feature stores.
 * Read the [Concepts](concepts/) page to understand the Feast data model.
 * Read the [Architecture](architecture-and-components/) page.
 * Check out our [Tutorials](../tutorials/tutorials-overview/) section for more examples on how to use Feast.

docs/how-to-guides/running-feast-in-production.md

Lines changed: 18 additions & 14 deletions
@@ -34,6 +34,8 @@ The first step to setting up a deployment of Feast is to create a Git repository
 
 Out of the box, Feast serializes all of its state into a file-based registry. When running Feast in production, we recommend using the more scalable SQL-based registry that is backed by a database. Details are available [here](./scaling-feast.md#scaling-feast-registry).
 
+> **Note:** A SQL-based registry primarily works with a Python feature server. The Java feature server does not understand this registry type yet.
+
 ### 1.3 Setting up CI/CD to automatically update the registry
 
 We typically recommend setting up CI/CD to automatically run `feast plan` and `feast apply` when pull requests are opened / merged.
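The SQL-based registry recommended above is configured through the `registry` block of `feature_store.yaml`. A minimal sketch, assuming a Postgres backend — the connection string, credentials, and database name here are illustrative, not from this commit:

```yaml
# feature_store.yaml -- illustrative values for a SQL-based registry
project: feast_demo_aws
provider: aws
registry:
  registry_type: sql
  # Any SQLAlchemy-compatible URL; Postgres shown as an example
  path: postgresql://feast_user:feast_password@db-host:5432/feast
online_store:
  type: dynamodb
  region: us-west-2
```

With `registry_type: sql`, registry state lives in the database reachable at `path` rather than in a single serialized protobuf file, which avoids the read/write contention of a file-based registry.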
@@ -78,7 +80,7 @@ batch_engine:
 key: aws-secret-access-key
 ```
 
-### 2.2 Scheduled materialization
+### 2.2 Scheduled materialization with Airflow
 
 > See also [data ingestion](../getting-started/concepts/data-ingestion.md#batch-data-ingestion) for code snippets
 
@@ -91,34 +93,34 @@ However, the amount of work can quickly outgrow the resources of a single machin
 If you are using Airflow as a scheduler, Feast can be invoked through a [PythonOperator](https://airflow.apache.org/docs/apache-airflow/stable/howto/operator/python.html) after the [Python SDK](https://pypi.org/project/feast/) has been installed into a virtual environment and your feature repo has been synced:
 
 ```python
-import datetime
-from airflow.operators.python_operator import PythonOperator
+from airflow.decorators import task
 from feast import RepoConfig, FeatureStore
 from feast.infra.online_stores.dynamodb import DynamoDBOnlineStoreConfig
 from feast.repo_config import RegistryConfig
 
 # Define Python callable
-def materialize():
+@task()
+def materialize(data_interval_start=None, data_interval_end=None):
     repo_config = RepoConfig(
         registry=RegistryConfig(path="s3://[YOUR BUCKET]/registry.pb"),
         project="feast_demo_aws",
         provider="aws",
         offline_store="file",
-        online_store=DynamoDBOnlineStoreConfig(region="us-west-2")
+        online_store=DynamoDBOnlineStoreConfig(region="us-west-2"),
+        entity_key_serialization_version=2
     )
     store = FeatureStore(config=repo_config)
     # Option 1: materialize just one feature view
     # store.materialize_incremental(datetime.datetime.now(), feature_views=["my_fv_name"])
     # Option 2: materialize all feature views incrementally
-    store.materialize_incremental(datetime.datetime.now())
-
-# Use Airflow PythonOperator
-materialize_python = PythonOperator(
-    task_id='materialize_python',
-    python_callable=materialize,
-)
+    # store.materialize_incremental(datetime.datetime.now())
+    # Option 3: Let Airflow manage materialization state
+    # Add 1 hr overlap to account for late data
+    store.materialize(data_interval_start.subtract(hours=1), data_interval_end)
 ```
 
+You can see more in an example at [Feast Workshop - Module 1](https://github.com/feast-dev/feast-workshop/blob/main/module_1/README.md#step-7-scaling-up-and-scheduling-materialization).
+
 {% hint style="success" %}
 Important note: Airflow worker must have read and write permissions to the registry file on GCS / S3 since it pulls configuration and updates materialization history.
 {% endhint %}
@@ -128,6 +130,8 @@ See more details at [data ingestion](../getting-started/concepts/data-ingestion.
 
 This supports pushing feature values into Feast to both online and offline stores.
 
+### 2.4 Scheduled batch transformations with Airflow + dbt
+Feast does not orchestrate batch transformation DAGs. For this, you can rely on tools like Airflow + dbt. See [Feast Workshop - Module 3](https://github.com/feast-dev/feast-workshop/blob/main/module_3/) for an example and some tips.
 
 ## 3. How to use Feast for model training
 
@@ -238,7 +242,7 @@ helm install feast-release feast-charts/feast-feature-server \
   --set feature_store_yaml_base64=$(base64 feature_store.yaml)
 ```
 
-This will deploy a single service. The service must have read access to the registry file on cloud storage. It will keep a copy of the registry in its memory and periodically refresh it, so expect some delays in update propagation in exchange for better performance.
+This will deploy a single service. The service must have read access to the registry file on cloud storage and to the online store (e.g. via [podAnnotations](https://kubernetes-on-aws.readthedocs.io/en/latest/user-guide/iam-roles.html)). It will keep a copy of the registry in its memory and periodically refresh it, so expect some delays in update propagation in exchange for better performance.
 
 ## 5. Using environment variables in your yaml configuration
 
@@ -272,7 +276,7 @@ In summary, the overall architecture in production may look like:
 
 * Feast SDK is being triggered by CI (e.g., GitHub Actions). It applies the latest changes from the feature repo to the Feast database-backed registry
 * Data ingestion
-  * **Batch data**: Airflow manages materialization jobs to ingest batch data from DWH to the online store periodically. When working with large datasets to materialize, we recommend using a batch materialization engine
+  * **Batch data**: Airflow manages batch transformation jobs + materialization jobs to ingest batch data from DWH to the online store periodically. When working with large datasets to materialize, we recommend using a batch materialization engine
     * If your offline and online workloads are in Snowflake, the Snowflake materialization engine is likely the best option.
     * If your offline and online workloads are not using Snowflake, but using Kubernetes is an option, the Bytewax materialization engine is likely the best option.
     * If none of these engines suit your needs, you may continue using the in-process engine, or write a custom engine (e.g. with Spark or Ray).

docs/reference/batch-materialization/README.md

Lines changed: 4 additions & 0 deletions
@@ -5,3 +5,7 @@ Please see [Batch Materialization Engine](../../getting-started/architecture-and
 {% page-ref page="snowflake.md" %}
 
 {% page-ref page="bytewax.md" %}
+
+{% page-ref page="lambda.md" %}
+
+{% page-ref page="spark.md" %}
docs/reference/batch-materialization/lambda.md

Lines changed: 24 additions & 0 deletions
@@ -0,0 +1,24 @@
+# AWS Lambda (alpha)
+
+## Description
+
+The AWS Lambda batch materialization engine is considered alpha status. It relies on the offline store to output feature values to S3 via `to_remote_storage`, and then loads them into the online store.
+
+See [LambdaMaterializationEngineConfig](https://rtd.feast.dev/en/master/index.html?highlight=LambdaMaterializationEngine#feast.infra.materialization.aws_lambda.lambda_engine.LambdaMaterializationEngineConfig) for configuration options.
+
+See also the [Dockerfile](https://github.com/feast-dev/feast/blob/master/sdk/python/feast/infra/materialization/aws_lambda/Dockerfile) that can be used with `materialization_image` below.
+
+## Example
+
+{% code title="feature_store.yaml" %}
+```yaml
+...
+offline_store:
+  type: snowflake.offline
+...
+batch_engine:
+  type: lambda
+  lambda_role: [your iam role]
+  materialization_image: [image uri of above Docker image]
+```
+{% endcode %}
docs/reference/batch-materialization/spark.md

Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
+# Spark (alpha)
+
+## Description
+
+The Spark batch materialization engine is considered alpha status. It relies on the offline store to output feature values to S3 via `to_remote_storage`, and then loads them into the online store.
+
+See [SparkMaterializationEngine](https://rtd.feast.dev/en/master/index.html?highlight=SparkMaterializationEngine#feast.infra.materialization.spark.spark_materialization_engine.SparkMaterializationEngineConfig) for configuration options.
+
+## Example
+
+{% code title="feature_store.yaml" %}
+```yaml
+...
+offline_store:
+  type: snowflake.offline
+...
+batch_engine:
+  type: spark.engine
+  partitions: [optional num partitions to use to write to online store]
+```
+{% endcode %}

infra/templates/README.md.jinja2

Lines changed: 2 additions & 2 deletions
@@ -25,10 +25,10 @@ Feast allows ML platform teams to:
 * **Avoid data leakage** by generating point-in-time correct feature sets so data scientists can focus on feature engineering rather than debugging error-prone dataset joining logic. This ensures that future feature values do not leak to models during training.
 * **Decouple ML from data infrastructure** by providing a single data access layer that abstracts feature storage from feature retrieval, ensuring models remain portable as you move from training models to serving models, from batch models to realtime models, and from one data infra system to another.
 
-Please see our [documentation](https://docs.feast.dev/) for more information about the project.
+Please see our [documentation](https://docs.feast.dev/) for more information about the project, or sign up for an [email newsletter](https://feast.dev/).
 
 ## 📐 Architecture
-![](docs/assets/feast-marchitecture.png)
+![](docs/assets/feast_marchitecture.png)
 
 The above architecture is the minimal Feast deployment. Want to run the full Feast on Snowflake/GCP/AWS? Click [here](https://docs.feast.dev/how-to-guides/feast-snowflake-gcp-aws).
 
