Commit b5e734b

docs: Add docs on using Feast with dbt + Airflow + missing m13n docs (feast-dev#3304)
* docs: Add documentation on using Feast with dbt + Airflow + adding missing docs
* fix SUMMARY
* revert
* fix
* fix
* fix
* link to lambda docker
* link to lambda docker

Signed-off-by: Danny Chiao <danny@tecton.ai>
1 parent 54faf61 commit b5e734b

14 files changed

Lines changed: 127 additions & 24 deletions

README.md

Lines changed: 2 additions & 2 deletions
@@ -27,10 +27,10 @@ Feast allows ML platform teams to:
 * **Avoid data leakage** by generating point-in-time correct feature sets so data scientists can focus on feature engineering rather than debugging error-prone dataset joining logic. This ensures that future feature values do not leak to models during training.
 * **Decouple ML from data infrastructure** by providing a single data access layer that abstracts feature storage from feature retrieval, ensuring models remain portable as you move from training models to serving models, from batch models to realtime models, and from one data infra system to another.
 
-Please see our [documentation](https://docs.feast.dev/) for more information about the project.
+Please see our [documentation](https://docs.feast.dev/) for more information about the project, or sign up for an [email newsletter](https://feast.dev/).
 
 ## 📐 Architecture
-![](docs/assets/feast-marchitecture.png)
+![](docs/assets/feast_marchitecture.png)
 
 The above architecture is the minimal Feast deployment. Want to run the full Feast on Snowflake/GCP/AWS? Click [here](https://docs.feast.dev/how-to-guides/feast-snowflake-gcp-aws).
 
docs/SUMMARY.md

Lines changed: 2 additions & 0 deletions
@@ -104,6 +104,8 @@
 * [Batch Materialization Engines](reference/batch-materialization/README.md)
   * [Bytewax](reference/batch-materialization/bytewax.md)
   * [Snowflake](reference/batch-materialization/snowflake.md)
+  * [AWS Lambda (alpha)](reference/batch-materialization/lambda.md)
+  * [Spark (contrib)](reference/batch-materialization/spark.md)
 * [Feature repository](reference/feature-repository/README.md)
   * [feature\_store.yaml](reference/feature-repository/feature-store-yaml.md)
   * [.feastignore](reference/feature-repository/feast-ignore.md)
Binary file changed (-77.7 KB; not shown).
Binary file changed (219 KB).

docs/getting-started/quickstart.md

Lines changed: 1 addition & 0 deletions
@@ -555,6 +555,7 @@ show up in the upcoming concepts + architecture + tutorial pages as well.
 
 ## Next steps
 
+* Join the [email newsletter](https://feast.dev/) to get new updates on Feast / feature stores.
 * Read the [Concepts](concepts/) page to understand the Feast data model.
 * Read the [Architecture](architecture-and-components/) page.
 * Check out our [Tutorials](../tutorials/tutorials-overview/) section for more examples on how to use Feast.

docs/how-to-guides/running-feast-in-production.md

Lines changed: 18 additions & 14 deletions
@@ -34,6 +34,8 @@ The first step to setting up a deployment of Feast is to create a Git repository
 
 Out of the box, Feast serializes all of its state into a file-based registry. When running Feast in production, we recommend using the more scalable SQL-based registry that is backed by a database. Details are available [here](./scaling-feast.md#scaling-feast-registry).
 
+> **Note:** A SQL-based registry primarily works with a Python feature server. The Java feature server does not understand this registry type yet.
+
 ### 1.3 Setting up CI/CD to automatically update the registry
 
 We typically recommend setting up CI/CD to automatically run `feast plan` and `feast apply` when pull requests are opened / merged.
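The SQL-based registry recommended above is configured through the `registry` block of `feature_store.yaml`. A minimal sketch, assuming a Postgres backend — the connection string, credentials, and database name here are illustrative, not from this commit:

```yaml
# feature_store.yaml -- illustrative values for a SQL-based registry
project: feast_demo_aws
provider: aws
registry:
  registry_type: sql
  # Any SQLAlchemy-compatible URL; Postgres shown as an example
  path: postgresql://feast_user:feast_password@db-host:5432/feast
online_store:
  type: dynamodb
  region: us-west-2
```

With `registry_type: sql`, registry state lives in the database reachable at `path` rather than in a single serialized protobuf file, which avoids the read/write contention of a file-based registry.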
@@ -78,7 +80,7 @@ batch_engine:
 key: aws-secret-access-key
 ```
 
-### 2.2 Scheduled materialization
+### 2.2 Scheduled materialization with Airflow
 
 > See also [data ingestion](../getting-started/concepts/data-ingestion.md#batch-data-ingestion) for code snippets
 
@@ -91,34 +93,34 @@ However, the amount of work can quickly outgrow the resources of a single machin
 If you are using Airflow as a scheduler, Feast can be invoked through a [PythonOperator](https://airflow.apache.org/docs/apache-airflow/stable/howto/operator/python.html) after the [Python SDK](https://pypi.org/project/feast/) has been installed into a virtual environment and your feature repo has been synced:
 
 ```python
-import datetime
-from airflow.operators.python_operator import PythonOperator
+from airflow.decorators import task
 from feast import RepoConfig, FeatureStore
 from feast.infra.online_stores.dynamodb import DynamoDBOnlineStoreConfig
 from feast.repo_config import RegistryConfig
 
 # Define Python callable
-def materialize():
+@task()
+def materialize(data_interval_start=None, data_interval_end=None):
     repo_config = RepoConfig(
         registry=RegistryConfig(path="s3://[YOUR BUCKET]/registry.pb"),
         project="feast_demo_aws",
         provider="aws",
         offline_store="file",
-        online_store=DynamoDBOnlineStoreConfig(region="us-west-2")
+        online_store=DynamoDBOnlineStoreConfig(region="us-west-2"),
+        entity_key_serialization_version=2
     )
     store = FeatureStore(config=repo_config)
     # Option 1: materialize just one feature view
     # store.materialize_incremental(datetime.datetime.now(), feature_views=["my_fv_name"])
     # Option 2: materialize all feature views incrementally
-    store.materialize_incremental(datetime.datetime.now())
-
-# Use Airflow PythonOperator
-materialize_python = PythonOperator(
-    task_id='materialize_python',
-    python_callable=materialize,
-)
+    # store.materialize_incremental(datetime.datetime.now())
+    # Option 3: Let Airflow manage materialization state
+    # Add 1 hr overlap to account for late data
+    store.materialize(data_interval_start.subtract(hours=1), data_interval_end)
 ```
 
+You can see more in an example at [Feast Workshop - Module 1](https://github.com/feast-dev/feast-workshop/blob/main/module_1/README.md#step-7-scaling-up-and-scheduling-materialization).
+
 {% hint style="success" %}
 Important note: Airflow worker must have read and write permissions to the registry file on GCS / S3 since it pulls configuration and updates materialization history.
 {% endhint %}
@@ -128,6 +130,8 @@ See more details at [data ingestion](../getting-started/concepts/data-ingestion.
 
 This supports pushing feature values into Feast to both online and offline stores.
 
+### 2.4 Scheduled batch transformations with Airflow + dbt
+Feast does not orchestrate batch transformation DAGs. For this, you can rely on tools like Airflow + dbt. See [Feast Workshop - Module 3](https://github.com/feast-dev/feast-workshop/blob/main/module_3/) for an example and some tips.
 
 ## 3. How to use Feast for model training
 
@@ -238,7 +242,7 @@ helm install feast-release feast-charts/feast-feature-server \
   --set feature_store_yaml_base64=$(base64 feature_store.yaml)
 ```
 
-This will deploy a single service. The service must have read access to the registry file on cloud storage. It will keep a copy of the registry in its memory and periodically refresh it, so expect some delays in update propagation in exchange for better performance.
+This will deploy a single service. The service must have read access to the registry file on cloud storage and to the online store (e.g. via [podAnnotations](https://kubernetes-on-aws.readthedocs.io/en/latest/user-guide/iam-roles.html)). It will keep a copy of the registry in its memory and periodically refresh it, so expect some delays in update propagation in exchange for better performance.
 
 ## 5. Using environment variables in your yaml configuration
 
@@ -272,7 +276,7 @@ In summary, the overall architecture in production may look like:
 
 * Feast SDK is being triggered by CI (e.g., GitHub Actions). It applies the latest changes from the feature repo to the Feast database-backed registry
 * Data ingestion
-  * **Batch data**: Airflow manages materialization jobs to ingest batch data from DWH to the online store periodically. When working with large datasets to materialize, we recommend using a batch materialization engine
+  * **Batch data**: Airflow manages batch transformation jobs + materialization jobs to ingest batch data from DWH to the online store periodically. When working with large datasets to materialize, we recommend using a batch materialization engine
     * If your offline and online workloads are in Snowflake, the Snowflake materialization engine is likely the best option.
     * If your offline and online workloads are not using Snowflake, but using Kubernetes is an option, the Bytewax materialization engine is likely the best option.
     * If none of these engines suit your needs, you may continue using the in-process engine, or write a custom engine (e.g. with Spark or Ray).

docs/reference/batch-materialization/README.md

Lines changed: 4 additions & 0 deletions
@@ -5,3 +5,7 @@ Please see [Batch Materialization Engine](../../getting-started/architecture-and
 {% page-ref page="snowflake.md" %}
 
 {% page-ref page="bytewax.md" %}
+
+{% page-ref page="lambda.md" %}
+
+{% page-ref page="spark.md" %}
docs/reference/batch-materialization/lambda.md

Lines changed: 24 additions & 0 deletions
@@ -0,0 +1,24 @@
+# AWS Lambda (alpha)
+
+## Description
+
+The AWS Lambda batch materialization engine is considered alpha status. It relies on the offline store to output feature values to S3 via `to_remote_storage`, and then loads them into the online store.
+
+See [LambdaMaterializationEngineConfig](https://rtd.feast.dev/en/master/index.html?highlight=LambdaMaterializationEngine#feast.infra.materialization.aws_lambda.lambda_engine.LambdaMaterializationEngineConfig) for configuration options.
+
+See also the [Dockerfile](https://github.com/feast-dev/feast/blob/master/sdk/python/feast/infra/materialization/aws_lambda/Dockerfile) that can be used with `materialization_image` below.
+
+## Example
+
+{% code title="feature_store.yaml" %}
+```yaml
+...
+offline_store:
+  type: snowflake.offline
+...
+batch_engine:
+  type: lambda
+  lambda_role: [your iam role]
+  materialization_image: [image uri of above Docker image]
+```
+{% endcode %}
docs/reference/batch-materialization/spark.md

Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
+# Spark (alpha)
+
+## Description
+
+The Spark batch materialization engine is considered alpha status. It relies on the offline store to output feature values to S3 via `to_remote_storage`, and then loads them into the online store.
+
+See [SparkMaterializationEngine](https://rtd.feast.dev/en/master/index.html?highlight=SparkMaterializationEngine#feast.infra.materialization.spark.spark_materialization_engine.SparkMaterializationEngineConfig) for configuration options.
+
+## Example
+
+{% code title="feature_store.yaml" %}
+```yaml
+...
+offline_store:
+  type: snowflake.offline
+...
+batch_engine:
+  type: spark.engine
+  partitions: [optional num partitions to use to write to online store]
+```
+{% endcode %}

infra/templates/README.md.jinja2

Lines changed: 2 additions & 2 deletions
@@ -25,10 +25,10 @@ Feast allows ML platform teams to:
 * **Avoid data leakage** by generating point-in-time correct feature sets so data scientists can focus on feature engineering rather than debugging error-prone dataset joining logic. This ensures that future feature values do not leak to models during training.
 * **Decouple ML from data infrastructure** by providing a single data access layer that abstracts feature storage from feature retrieval, ensuring models remain portable as you move from training models to serving models, from batch models to realtime models, and from one data infra system to another.
 
-Please see our [documentation](https://docs.feast.dev/) for more information about the project.
+Please see our [documentation](https://docs.feast.dev/) for more information about the project, or sign up for an [email newsletter](https://feast.dev/).
 
 ## 📐 Architecture
-![](docs/assets/feast-marchitecture.png)
+![](docs/assets/feast_marchitecture.png)
 
 The above architecture is the minimal Feast deployment. Want to run the full Feast on Snowflake/GCP/AWS? Click [here](https://docs.feast.dev/how-to-guides/feast-snowflake-gcp-aws).
 
