Commit a3dd520

achals authored and adchia committed
docs: Update Running Feast in Production guide (#3160)
* docs: Update Running Feast in Production guide
* remove bad import
* fix ordering

Signed-off-by: Achal Shah <achals@gmail.com>
1 parent 0e13573 commit a3dd520

File tree

2 files changed: +65 -38 lines changed

docs/how-to-guides/running-feast-in-production.md

Lines changed: 65 additions & 38 deletions
@@ -9,9 +9,13 @@ Overview of typical production configuration is given below:
![Overview](production-simple.png)

{% hint style="success" %}
**Important note:** Feast is highly customizable and modular.

Most Feast blocks are loosely connected and can be used independently. Hence, you are free to build your own production configuration.

For example, you might not have a stream source and, thus, no need to write features in real-time to an online store. Or you might not need to retrieve online features. Feast also often provides multiple options to achieve the same goal. We discuss tradeoffs below.

Additionally, please check the how-to guide for some specific recommendations on [how to scale Feast](./scaling-feast.md).
{% endhint %}
In this guide we will show you how to:
@@ -24,15 +28,19 @@ In this guide we will show you how to:
## 1. Automatically deploying changes to your feature definitions

### 1.1 Setting up a feature repository

The first step in setting up a deployment of Feast is to create a Git repository that contains your feature definitions. The recommended way to version and track your feature definitions is by committing them to a repository and tracking changes through commits. If you recall, running `feast apply` commits feature definitions to a **registry**, which users can then read elsewhere.
### 1.2 Setting up a database-backed registry

Out of the box, Feast serializes all of its state into a file-based registry. When running Feast in production, we recommend using the more scalable SQL-based registry that is backed by a database. Details are available [here](./scaling-feast.md#scaling-feast-registry).
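As a sketch, a `feature_store.yaml` using the SQL-based registry could look like the following (the project name, connection string, and online store choice are hypothetical placeholders, not from this guide):

```yaml
project: my_project              # hypothetical project name
provider: local
registry:
  registry_type: sql             # use the SQL-based registry instead of a file
  # hypothetical PostgreSQL DSN; point this at your registry database
  path: postgresql://feast:feast_password@registry-db:5432/feast
online_store:
  type: redis                    # hypothetical online store choice
  connection_string: redis-host:6379
```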
### 1.3 Setting up CI/CD to automatically update the registry

We typically recommend setting up CI/CD to automatically run `feast plan` and `feast apply` when pull requests are opened / merged.
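For example, such a pipeline can be a short GitHub Actions workflow that runs `feast apply` on merges to the main branch (a similar workflow could run `feast plan` on pull requests). This is a hedged sketch: the workflow file name, repo path, and secret names are hypothetical assumptions, not from the source:

```yaml
# .github/workflows/feast-apply.yml (hypothetical file name)
name: feast-apply
on:
  push:
    branches: [main]   # apply feature definitions on merge to main
jobs:
  apply:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: "3.10"
      - run: pip install feast
      # Assumes the feature repo lives under feature_repo/ and that
      # registry credentials are provided via repository secrets
      - run: feast -c feature_repo apply
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
```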
### 1.4 Setting up multiple environments

A common scenario when using Feast in production is needing to test changes to Feast object definitions. For this, we recommend setting up a _staging_ environment for your offline and online stores, which mirrors _production_ (with a potentially smaller data set).
Having this separate environment allows users to test changes by first applying them to staging, and then promoting the changes to production after verifying the changes on staging.
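One simple way to structure this is one directory per environment, each with its own `feature_store.yaml`; the layout below is illustrative, not prescribed by this guide:

```
repo/
├── staging/
│   └── feature_store.yaml     # points at the staging offline / online stores
└── production/
    └── feature_store.yaml     # points at the production stores
```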
@@ -43,7 +51,37 @@ Different options are presented in the [how-to guide](structuring-repos.md).
To keep your online store up to date, you need to run a job that loads feature data from your feature view sources into your online store. In Feast, this loading operation is called materialization.

### 2.1 Scalable materialization

Out of the box, Feast's materialization process uses an in-process materialization engine. This engine loads all of the data being materialized into memory from the offline store, and writes it into the online store.

This approach may not scale to the large amounts of data that users of Feast may be dealing with in production.
In this case, we recommend using one of the more [scalable materialization engines](./scaling-feast.md#scaling-materialization), such as the [Bytewax Materialization Engine](../reference/batch-materialization/bytewax.md) or the [Snowflake Materialization Engine](../reference/batch-materialization/snowflake.md).
Users may also need to [write a custom materialization engine](../how-to-guides/customizing-feast/creating-a-custom-materialization-engine.md) to work with their existing infrastructure.

The Bytewax materialization engine can run materialization on an existing Kubernetes cluster. An example configuration of this in `feature_store.yaml` is as follows:

```yaml
batch_engine:
  type: bytewax
  namespace: bytewax
  image: bytewax/bytewax-feast:latest
  env:
    - name: AWS_ACCESS_KEY_ID
      valueFrom:
        secretKeyRef:
          name: aws-credentials
          key: aws-access-key-id
    - name: AWS_SECRET_ACCESS_KEY
      valueFrom:
        secretKeyRef:
          name: aws-credentials
          key: aws-secret-access-key
```

### 2.2 Manual materialization

The simplest way to schedule materialization is to run an **incremental** materialization using the Feast CLI:
@@ -53,7 +91,7 @@ feast materialize-incremental 2022-01-01T00:00:00
The above command will load all feature values from all feature view sources into the online store up to the time `2022-01-01T00:00:00`.

A timestamp is required to set the end date for materialization. If your source is fully up-to-date then the end date would be the current time. However, if you are querying a source where data is not yet available, then you do not want to set the timestamp to the current time. You would want to use a timestamp that ends at a date for which data is available. The next time `materialize-incremental` is run, Feast will load data that starts from the previous end date, so it is important to ensure that the materialization interval does not overlap with time periods for which data has not been made available. This is commonly the case when your source is an ETL pipeline that is scheduled on a daily basis.

An alternative approach to incremental materialization (where Feast tracks the intervals of data that need to be ingested) is to call Feast directly from your scheduler, like Airflow. In this case, Airflow is the system that tracks the intervals that have been ingested.

@@ -65,13 +103,15 @@ In the above example we are materializing the source data from the `driver_hourl
The timestamps above should match the interval of data that has been computed by the data transformation system.

### 2.3 Automate periodic materialization

It is up to you which orchestrator/scheduler to use to periodically run `feast materialize`. Feast keeps the history of materialization in its registry, so the choice could be as simple as a [unix cron util](https://en.wikipedia.org/wiki/Cron). A cron util should be sufficient when you have just a few materialization jobs (usually one materialization job per feature view) triggered infrequently. However, the amount of work can quickly outgrow the resources of a single machine. That happens because the materialization job needs to repackage all rows before writing them to the online store, which leads to high CPU and memory utilization. In this case, you might want to use a job orchestrator to run multiple jobs in parallel using several workers. Kubernetes Jobs or Airflow are good choices for more comprehensive job orchestration.

If you are using Airflow as a scheduler, Feast can be invoked through the [BashOperator](https://airflow.apache.org/docs/apache-airflow/stable/howto/operator/bash.html) after the [Python SDK](https://pypi.org/project/feast/) has been installed into a virtual environment and your feature repo has been synced:

```python
import datetime

materialize = BashOperator(
    task_id='materialize',
    bash_command=f'feast materialize-incremental {datetime.datetime.now().replace(microsecond=0).isoformat()}',
```
@@ -82,6 +122,8 @@ materialize = BashOperator(
Important note: Airflow worker must have read and write permissions to the registry file on GS / S3, since it pulls configuration and updates materialization history.
{% endhint %}

## 3. How to use Feast for model training

After we've defined our features and data sources in the repository, we can generate training datasets.
@@ -91,6 +133,8 @@ The first thing we need to do in our training code is to create a `FeatureStore`
One way to ensure your production clients have access to the feature store is to provide a copy of the `feature_store.yaml` to those pipelines. This `feature_store.yaml` file will have a reference to the feature store registry, which allows clients to retrieve features from offline or online stores.

```python
from feast import FeatureStore

fs = FeatureStore(repo_path="production/")
```

@@ -114,11 +158,12 @@ model = ml.fit(training_df)
The most common way to productionize ML models is by storing and versioning models in a "model store", and then deploying these models into production. When using Feast, it is recommended that the list of feature references also be saved alongside the model. This ensures that models and the features they were trained on are paired together when being shipped into production:

```python
import json

# Save model
model.save('my_model.bin')

# Save features
with open('feature_refs.json', 'w') as f:
    json.dump(feature_refs, f)
```
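To illustrate the pairing end-to-end, the saved list can be reloaded at deployment time and checked against what was written. This is a minimal standalone sketch; the feature references themselves are hypothetical examples, not from this guide:

```python
import json

# Hypothetical feature references a model was trained on
feature_refs = [
    "driver_hourly_stats:conv_rate",
    "driver_hourly_stats:acc_rate",
]

# Save the references alongside the model artifact
with open('feature_refs.json', 'w') as f:
    json.dump(feature_refs, f)

# At serving time, reload the exact feature list the model expects
with open('feature_refs.json') as f:
    loaded_refs = json.load(f)

assert loaded_refs == feature_refs
```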

@@ -217,6 +262,10 @@ from feast import FeatureStore
store = FeatureStore(...)

spark = SparkSession.builder.getOrCreate()

streamingDF = spark.readStream.format(...).load()

def feast_writer(spark_df):
    pandas_df = spark_df.toPandas()
    store.push("driver_hourly_stats", pandas_df)
@@ -230,15 +279,7 @@ Alternatively, if you want to ingest features directly from a broker (eg, Kafka
If you are using Kafka, [HTTP Sink](https://docs.confluent.io/kafka-connect-http/current/overview.html) could be utilized as a middleware. In this case, the "push service" can be deployed on Kubernetes or as a serverless function.

## 6. Using environment variables in your yaml configuration
You might want to dynamically set parts of your configuration from your environment: for instance, to deploy Feast to production and development with the same configuration but a different server, or to inject secrets without exposing them in your git repo. To do this, it is possible to use the `${ENV_VAR}` syntax in your `feature_store.yaml` file. For instance:

@@ -268,30 +309,16 @@ online_store:
To summarize, we show several architecture options that are most frequently used in production:

### Current Recommendation

* The Feast SDK is triggered by CI (e.g., GitHub Actions). It applies the latest changes from the feature repo to the Feast registry.
* Airflow manages materialization jobs to periodically ingest data from the DWH into the online store.
* For stream ingestion, the Feast Python SDK is used in the existing Spark / Beam pipeline.
* For the batch materialization engine:
  * If your offline and online workloads are in Snowflake, the Snowflake engine is likely the best option.
  * If your offline and online workloads are not using Snowflake, but using Kubernetes is an option, the Bytewax engine is likely the best option.
  * If none of these engines suit your needs, you may continue using the in-process engine, or write a custom engine.
* Online features are served via the Python feature server over HTTP, or consumed using the Feast Python SDK.
* The Feast Python SDK is called locally to generate a training dataset.

![From Repository to Production: Feast Production Architecture](production-spark-bytewax.png)
