docs: Add docs on using Feast with dbt + Airflow + missing m13n docs (feast-dev#3304)

* docs: Add documentation on using Feast with dbt + Airflow + adding missing docs
* fix SUMMARY
* revert
* fix
* link to lambda docker

Signed-off-by: Danny Chiao <danny@tecton.ai>
## README.md (2 additions, 2 deletions)
```diff
@@ -27,10 +27,10 @@ Feast allows ML platform teams to:
 * **Avoid data leakage** by generating point-in-time correct feature sets so data scientists can focus on feature engineering rather than debugging error-prone dataset joining logic. This ensures that future feature values do not leak to models during training.
 * **Decouple ML from data infrastructure** by providing a single data access layer that abstracts feature storage from feature retrieval, ensuring models remain portable as you move from training models to serving models, from batch models to realtime models, and from one data infra system to another.
 
-Please see our [documentation](https://docs.feast.dev/) for more information about the project.
+Please see our [documentation](https://docs.feast.dev/) for more information about the project, or sign up for an [email newsletter](https://feast.dev/).
 
 ## 📐 Architecture
-
+
 
 The above architecture is the minimal Feast deployment. Want to run the full Feast on Snowflake/GCP/AWS? Click [here](https://docs.feast.dev/how-to-guides/feast-snowflake-gcp-aws).
```
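The "point-in-time correct" joins mentioned above can be illustrated with a small pandas sketch. This is illustrative only, not Feast's implementation; the entity/feature column names and values are hypothetical:

```python
import pandas as pd

# Entity dataframe: the label events we want training rows for
entity_df = pd.DataFrame({
    "driver_id": [1, 1],
    "event_timestamp": pd.to_datetime(["2022-01-05", "2022-01-10"]),
})

# Feature values as they became available over time
features = pd.DataFrame({
    "driver_id": [1, 1, 1],
    "event_timestamp": pd.to_datetime(["2022-01-01", "2022-01-07", "2022-01-12"]),
    "trips_today": [10, 20, 30],
})

# For each label event, join only the latest feature value known at that time,
# so future values (the 2022-01-12 row) never leak into training data
joined = pd.merge_asof(
    entity_df.sort_values("event_timestamp"),
    features.sort_values("event_timestamp"),
    on="event_timestamp",
    by="driver_id",
)
print(joined["trips_today"].tolist())  # [10, 20]
```

A naive equi-join on `driver_id` alone would attach all three feature rows, including the one from the future, which is exactly the leakage the point-in-time join avoids.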
## docs/how-to-guides/running-feast-in-production.md (18 additions, 14 deletions)
```diff
@@ -34,6 +34,8 @@ The first step to setting up a deployment of Feast is to create a Git repository
 
 Out of the box, Feast serializes all of its state into a file-based registry. When running Feast in production, we recommend using the more scalable SQL-based registry that is backed by a database. Details are available [here](./scaling-feast.md#scaling-feast-registry).
 
+> **Note:** A SQL-based registry primarily works with a Python feature server. The Java feature server does not understand this registry type yet.
+
 ### 1.3 Setting up CI/CD to automatically update the registry
 
 We typically recommend setting up CI/CD to automatically run `feast plan` and `feast apply` when pull requests are opened / merged.
```
````diff
@@ -78,7 +80,7 @@ batch_engine:
       key: aws-secret-access-key
 ```
 
-### 2.2 Scheduled materialization
+### 2.2 Scheduled materialization with Airflow
 
 > See also [data ingestion](../getting-started/concepts/data-ingestion.md#batch-data-ingestion) for code snippets
````
````diff
@@ -91,34 +93,34 @@ However, the amount of work can quickly outgrow the resources of a single machine
 
 If you are using Airflow as a scheduler, Feast can be invoked through a [PythonOperator](https://airflow.apache.org/docs/apache-airflow/stable/howto/operator/python.html) after the [Python SDK](https://pypi.org/project/feast/) has been installed into a virtual environment and your feature repo has been synced:
 
 ```python
-import datetime
-from airflow.operators.python_operator import PythonOperator
+from airflow.decorators import task
 from feast import RepoConfig, FeatureStore
 from feast.infra.online_stores.dynamodb import DynamoDBOnlineStoreConfig
````

````diff
 You can see more in an example at [Feast Workshop - Module 1](https://github.com/feast-dev/feast-workshop/blob/main/module_1/README.md#step-7-scaling-up-and-scheduling-materialization).
+
 {% hint style="success" %}
 Important note: the Airflow worker must have read and write permissions to the registry file on GCS / S3, since it pulls configuration and updates materialization history.
 {% endhint %}
````
```diff
@@ -128,6 +130,8 @@ See more details at [data ingestion](../getting-started/concepts/data-ingestion.md).
 
 This supports pushing feature values into Feast, to either the online or offline store.
 
+### 2.4 Scheduled batch transformations with Airflow + dbt
+
+Feast does not orchestrate batch transformation DAGs. For this, you can rely on tools like Airflow + dbt. See [Feast Workshop - Module 3](https://github.com/feast-dev/feast-workshop/blob/main/module_3/) for an example and some tips.
```
```diff
-This will deploy a single service. The service must have read access to the registry file on cloud storage. It will keep a copy of the registry in its memory and periodically refresh it, so expect some delays in update propagation in exchange for better performance.
+This will deploy a single service. The service must have read access to the registry file on cloud storage and to the online store (e.g. via [podAnnotations](https://kubernetes-on-aws.readthedocs.io/en/latest/user-guide/iam-roles.html)). It will keep a copy of the registry in its memory and periodically refresh it, so expect some delays in update propagation in exchange for better performance.
 
 ## 5. Using environment variables in your yaml configuration
```
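The environment-variable mechanism referenced by that section can be sketched in plain Python: Feast expands `${ENV_VAR}` references when it loads `feature_store.yaml`. This sketch uses the stdlib `os.path.expandvars` to illustrate the idea; the yaml fragment and variable name are hypothetical, not from the docs:

```python
import os

# Hypothetical feature_store.yaml fragment referencing an environment variable
yaml_text = """
project: production
online_store:
    type: redis
    connection_string: ${REDIS_CONNECTION_STRING}
"""

# Set the variable as the deployment environment would (e.g. via Kubernetes secrets)
os.environ["REDIS_CONNECTION_STRING"] = "redis.prod.internal:6379"

# Expand ${...} references, analogous to what Feast's config loading does
expanded = os.path.expandvars(yaml_text)
print("redis.prod.internal:6379" in expanded)  # True
```

This keeps credentials like connection strings out of the committed feature repo while letting each environment (staging, production) supply its own values.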
```diff
@@ -272,7 +276,7 @@ In summary, the overall architecture in production may look like:
 
 * The Feast SDK is triggered by CI (e.g., GitHub Actions). It applies the latest changes from the feature repo to the Feast database-backed registry
 * Data ingestion
-  * **Batch data**: Airflow manages materialization jobs to ingest batch data from the DWH to the online store periodically. When working with large datasets to materialize, we recommend using a batch materialization engine
+  * **Batch data**: Airflow manages batch transformation jobs + materialization jobs to ingest batch data from the DWH to the online store periodically. When working with large datasets to materialize, we recommend using a batch materialization engine
     * If your offline and online workloads are in Snowflake, the Snowflake materialization engine is likely the best option.
     * If your offline and online workloads are not using Snowflake, but using Kubernetes is an option, the Bytewax materialization engine is likely the best option.
     * If none of these engines suits your needs, you may continue using the in-process engine, or write a custom engine (e.g. with Spark or Ray).
```
## New docs page: AWS Lambda batch materialization engine

The AWS Lambda batch materialization engine is considered alpha status. It relies on the offline store to output feature values to S3 via `to_remote_storage`, and then loads them into the online store.

See [LambdaMaterializationEngineConfig](https://rtd.feast.dev/en/master/index.html?highlight=LambdaMaterializationEngine#feast.infra.materialization.aws_lambda.lambda_engine.LambdaMaterializationEngineConfig) for configuration options.

See also the [Dockerfile](https://github.com/feast-dev/feast/blob/master/sdk/python/feast/infra/materialization/aws_lambda/Dockerfile) for a Dockerfile that can be used below with `materialization_image`.

## Example

{% code title="feature_store.yaml" %}
```yaml
...
offline_store:
  type: snowflake.offline
...
batch_engine:
  type: lambda
  lambda_role: [your iam role]
  materialization_image: [image uri of above Docker image]
```
{% endcode %}
## New docs page: Spark batch materialization engine

The Spark batch materialization engine is considered alpha status. It relies on the offline store to output feature values to S3 via `to_remote_storage`, and then loads them into the online store.

See [SparkMaterializationEngine](https://rtd.feast.dev/en/master/index.html?highlight=SparkMaterializationEngine#feast.infra.materialization.spark.spark_materialization_engine.SparkMaterializationEngineConfig) for configuration options.

## Example

{% code title="feature_store.yaml" %}
```yaml
...
offline_store:
  type: snowflake.offline
...
batch_engine:
  type: spark.engine
  partitions: [optional num partitions to use to write to online store]
```
{% endcode %}
## infra/templates/README.md.jinja2 (2 additions, 2 deletions)
```diff
@@ -25,10 +25,10 @@ Feast allows ML platform teams to:
 * **Avoid data leakage** by generating point-in-time correct feature sets so data scientists can focus on feature engineering rather than debugging error-prone dataset joining logic. This ensures that future feature values do not leak to models during training.
 * **Decouple ML from data infrastructure** by providing a single data access layer that abstracts feature storage from feature retrieval, ensuring models remain portable as you move from training models to serving models, from batch models to realtime models, and from one data infra system to another.
 
-Please see our [documentation](https://docs.feast.dev/) for more information about the project.
+Please see our [documentation](https://docs.feast.dev/) for more information about the project, or sign up for an [email newsletter](https://feast.dev/).
 
 ## 📐 Architecture
-
+
 
 The above architecture is the minimal Feast deployment. Want to run the full Feast on Snowflake/GCP/AWS? Click [here](https://docs.feast.dev/how-to-guides/feast-snowflake-gcp-aws).
```