docs: Update Running Feast in Production guide (#3160)
* docs: Update Running Feast in Production guide
Signed-off-by: Achal Shah <achals@gmail.com>
* remove bad import:
Signed-off-by: Achal Shah <achals@gmail.com>
* fix ordering
Signed-off-by: Achal Shah <achals@gmail.com>
Signed-off-by: Achal Shah <achals@gmail.com>
File: docs/how-to-guides/running-feast-in-production.md
Lines changed: 65 additions & 38 deletions
@@ -9,9 +9,13 @@ Overview of typical production configuration is given below:
9
9

10
10
11
11
{% hint style="success" %}
12
-
**Important note:** Feast is highly customizable and modular. Most Feast blocks are loosely connected and can be used independently. Hence, you are free to build your own production configuration.
12
+
**Important note:** Feast is highly customizable and modular.
13
+
14
+
Most Feast blocks are loosely connected and can be used independently. Hence, you are free to build your own production configuration.
13
15
14
16
For example, you might not have a stream source and, thus, no need to write features in real-time to an online store. Or you might not need to retrieve online features. Feast also often provides multiple options to achieve the same goal. We discuss tradeoffs below.
17
+
18
+
Additionally, please check the how-to guide for some specific recommendations on [how to scale Feast](./scaling-feast.md).
15
19
{% endhint %}
16
20
17
21
In this guide we will show you how to:
@@ -24,15 +28,19 @@ In this guide we will show you how to:
24
28
25
29
## 1. Automatically deploying changes to your feature definitions
26
30
27
-
### Setting up a feature repository
31
+
### 1.1 Setting up a feature repository
28
32
29
33
The first step to setting up a deployment of Feast is to create a Git repository that contains your feature definitions. The recommended way to version and track your feature definitions is by committing them to a repository and tracking changes through commits. If you recall, running `feast apply` commits feature definitions to a **registry**, which users can then read elsewhere.
30
34
31
-
### Setting up CI/CD to automatically update the registry
35
+
### 1.2 Setting up a database-backed registry
36
+
37
+
Out of the box, Feast serializes all of its state into a file-based registry. When running Feast in production, we recommend using the more scalable SQL-based registry that is backed by a database. Details are available [here](./scaling-feast.md#scaling-feast-registry).
38
+
39
+
### 1.3 Setting up CI/CD to automatically update the registry
32
40
33
41
We typically recommend setting up CI/CD to automatically run `feast plan` and `feast apply` when pull requests are opened / merged.
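
A minimal sketch of how a CI job might choose between the two commands. It assumes the CI system exposes the triggering event as a string; the event names `pull_request` and `push` and the helpers `feast_command` and `run_feast` are hypothetical. Only the `feast plan` and `feast apply` CLI commands come from the guide itself.

```python
import subprocess
from typing import List


def feast_command(event: str) -> List[str]:
    """Map a (hypothetical) CI event name to the Feast CLI command to run."""
    if event == "pull_request":
        # A pull request was opened: preview registry changes only.
        return ["feast", "plan"]
    if event == "push":
        # Changes were merged: update the registry.
        return ["feast", "apply"]
    raise ValueError(f"unsupported CI event: {event!r}")


def run_feast(event: str, repo_path: str) -> None:
    """Run the selected command from inside the feature repository."""
    subprocess.run(feast_command(event), cwd=repo_path, check=True)
```

In a real pipeline, `run_feast` would be called from the CI step with the path to the checked-out feature repository; `check=True` makes the step fail if the Feast command exits with an error.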
34
42
35
-
### Setting up multiple environments
43
+
### 1.4 Setting up multiple environments
36
44
37
45
A common need when using Feast in production is to test changes to Feast object definitions. For this, we recommend setting up a _staging_ environment for your offline and online stores, which mirrors _production_ (with potentially a smaller data set).
38
46
Having this separate environment allows users to test changes by first applying them to staging, and then promoting the changes to production after verifying the changes on staging.
@@ -43,7 +51,37 @@ Different options are presented in the [how-to guide](structuring-repos.md).
43
51
44
52
To keep your online store up to date, you need to run a job that loads feature data from your feature view sources into your online store. In Feast, this loading operation is called materialization.
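
As a conceptual illustration only, the loading operation can be pictured as copying the most recent feature values per entity from the source into a key-value store. The sketch below is a toy model under that assumption: the function `materialize`, the tuple layout of `rows`, and the use of a plain `dict` as the online store are all hypothetical and are not Feast's implementation.

```python
from datetime import datetime


def materialize(rows, online_store, start, end):
    """Write the latest in-window value per entity key into `online_store`.

    rows: iterable of (entity_key, event_timestamp, feature_values) tuples.
    online_store: a dict standing in for a real online store.
    """
    latest = {}  # entity_key -> newest event_timestamp seen in this run
    for key, ts, values in rows:
        if not (start <= ts <= end):
            continue  # outside the materialization window
        if key not in latest or ts > latest[key]:
            latest[key] = ts
            online_store[key] = values
    return online_store


rows = [
    ("driver_1", datetime(2022, 1, 1, 1), {"trips": 1}),
    ("driver_1", datetime(2022, 1, 1, 3), {"trips": 3}),
    ("driver_1", datetime(2022, 1, 1, 2), {"trips": 2}),
    ("driver_2", datetime(2022, 1, 2, 0), {"trips": 9}),  # after the window
]
store = materialize(rows, {}, datetime(2022, 1, 1), datetime(2022, 1, 1, 12))
```

Only the newest in-window row for `driver_1` ends up in `store`; the `driver_2` row falls outside the window and is skipped.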
45
53
46
-
### 2.1. Manual materializations
54
+
### 2.1 Scalable Materialization
55
+
56
+
Out of the box, Feast's materialization process uses an in-process materialization engine. This engine loads all the data being materialized into memory from the offline store, and writes it into the online store.
57
+
58
+
This approach may not scale to large amounts of data, which users of Feast may be dealing with in production.
59
+
In this case, we recommend using one of the more [scalable materialization engines](./scaling-feast.md#scaling-materialization), such as the [Bytewax Materialization Engine](../reference/batch-materialization/bytewax.md), or the [Snowflake Materialization Engine](../reference/batch-materialization/snowflake.md).
60
+
Users may also need to [write a custom materialization engine](../how-to-guides/customizing-feast/creating-a-custom-materialization-engine.md) to work on their existing infrastructure.
61
+
62
+
The Bytewax materialization engine can run materialization on an existing Kubernetes cluster. An example configuration of this in a `feature_store.yaml` is as follows:
63
+
64
+
```yaml
batch_engine:
  type: bytewax
  namespace: bytewax
  image: bytewax/bytewax-feast:latest
  env:
    - name: AWS_ACCESS_KEY_ID
      valueFrom:
        secretKeyRef:
          name: aws-credentials
          key: aws-access-key-id
    - name: AWS_SECRET_ACCESS_KEY
      valueFrom:
        secretKeyRef:
          name: aws-credentials
          key: aws-secret-access-key
```
81
+
82
+
83
+
84
+
### 2.2 Manual materialization
47
85
48
86
The simplest way to schedule materialization is to run an **incremental** materialization using the Feast CLI:
The above command will load all feature values from all feature view sources into the online store up to the time `2022-01-01T00:00:00`.
55
93
56
-
A timestamp is required to set the end date for materialization. If your source is fully up to date then the end date would be the current time. However, if you are querying a source where data is not yet available, then you do not want to set the timestamp to the current time. You would want to use a timestamp that ends at a date for which data is available. The next time `materialize-incremental` is run, Feast will load data that starts from the previous end date, so it is important to ensure that the materialization interval does not overlap with time periods for which data has not been made available. This is commonly the case when your source is an ETL pipeline that is scheduled on a daily basis.
94
+
A timestamp is required to set the end date for materialization. If your source is fully up-to-date then the end date would be the current time. However, if you are querying a source where data is not yet available, then you do not want to set the timestamp to the current time. You would want to use a timestamp that ends at a date for which data is available. The next time `materialize-incremental` is run, Feast will load data that starts from the previous end date, so it is important to ensure that the materialization interval does not overlap with time periods for which data has not been made available. This is commonly the case when your source is an ETL pipeline that is scheduled on a daily basis.
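
One way to pick a safe end timestamp for a daily ETL source is sketched below. It assumes each day's data is complete a fixed delay after midnight; the six-hour default and the helper names `safe_end_timestamp` and `materialize_incremental_command` are hypothetical. Only the `feast materialize-incremental` command and the timestamp format come from the guide.

```python
from datetime import datetime, timedelta
from typing import List


def safe_end_timestamp(now: datetime, etl_lag: timedelta) -> str:
    """Return the latest midnight for which the daily ETL output is assumed complete."""
    cutoff = now - etl_lag
    # Truncate to midnight: data before this instant is assumed to be available.
    end = cutoff.replace(hour=0, minute=0, second=0, microsecond=0)
    return end.strftime("%Y-%m-%dT%H:%M:%S")


def materialize_incremental_command(
    now: datetime, etl_lag: timedelta = timedelta(hours=6)
) -> List[str]:
    """Build the CLI invocation with an end date that avoids unavailable data."""
    return ["feast", "materialize-incremental", safe_end_timestamp(now, etl_lag)]
```

With a six-hour lag, a run at 03:00 on January 2 still ends at midnight on January 1, while a run at 07:00 on January 2 can advance to midnight on January 2.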
57
95
58
96
An alternative approach to incremental materialization (where Feast tracks the intervals of data that need to be ingested), is to call Feast directly from your scheduler like Airflow. In this case, Airflow is the system that tracks the intervals that have been ingested.
59
97
@@ -65,13 +103,15 @@ In the above example we are materializing the source data from the `driver_hourl
65
103
66
104
The timestamps above should match the interval of data that has been computed by the data transformation system.
67
105
68
-
### 2.2. Automate periodic materializations
106
+
### 2.3 Automate periodic materialization
69
107
70
108
It is up to you which orchestrator/scheduler to use to periodically run `$ feast materialize`. Feast keeps the history of materialization in its registry, so the choice could be as simple as a [unix cron util](https://en.wikipedia.org/wiki/Cron). A cron util should be sufficient when you have just a few materialization jobs (usually one materialization job per feature view) triggered infrequently. However, the amount of work can quickly outgrow the resources of a single machine, because the materialization job needs to repackage all rows before writing them to an online store, which leads to high CPU and memory utilization. In this case, you might want to use a job orchestrator to run multiple jobs in parallel using several workers. Kubernetes Jobs or Airflow are good choices for more comprehensive job orchestration.
71
109
72
110
If you are using Airflow as a scheduler, Feast can be invoked through the [BashOperator](https://airflow.apache.org/docs/apache-airflow/stable/howto/operator/bash.html) after the [Python SDK](https://pypi.org/project/feast/) has been installed into a virtual environment and your feature repo has been synced:
Important note: Airflow worker must have read and write permissions to the registry file on GS / S3 since it pulls configuration and updates materialization history.
83
123
{% endhint %}
84
124
125
+
126
+
85
127
## 3. How to use Feast for model training
86
128
87
129
After we've defined our features and data sources in the repository, we can generate training datasets.
@@ -91,6 +133,8 @@ The first thing we need to do in our training code is to create a `FeatureStore`
91
133
One way to ensure your production clients have access to the feature store is to provide a copy of the `feature_store.yaml` to those pipelines. This `feature_store.yaml` file will have a reference to the feature store registry, which allows clients to retrieve features from offline or online stores.
92
134
93
135
```python
136
+
from feast import FeatureStore
137
+
94
138
fs = FeatureStore(repo_path="production/")
95
139
```
96
140
@@ -114,11 +158,12 @@ model = ml.fit(training_df)
114
158
The most common way to productionize ML models is by storing and versioning models in a "model store", and then deploying these models into production. When using Feast, it is recommended that the list of feature references also be saved alongside the model. This ensures that models and the features they are trained on are paired together when being shipped into production:
115
159
116
160
```python
161
+
import json
117
162
# Save model
118
163
model.save('my_model.bin')
119
164
120
165
# Save features
121
-
open('feature_refs.json', 'w') as f:
166
+
with open('feature_refs.json', 'w') as f:
    json.dump(feature_refs, f)
123
168
```
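
The counterpart at serving time is to load the saved references back before requesting features. The sketch below shows only the JSON round trip; the helper names and the example feature names (`conv_rate`, `acc_rate`) are hypothetical.

```python
import json
import os
import tempfile


def save_feature_refs(feature_refs, path):
    """Persist the list of feature references next to the model artifact."""
    with open(path, "w") as f:
        json.dump(feature_refs, f)


def load_feature_refs(path):
    """Load the feature references that were saved with the model."""
    with open(path) as f:
        return json.load(f)


# Round trip with hypothetical feature references.
refs = ["driver_hourly_stats:conv_rate", "driver_hourly_stats:acc_rate"]
path = os.path.join(tempfile.mkdtemp(), "feature_refs.json")
save_feature_refs(refs, path)
loaded = load_feature_refs(path)
# `loaded` could then be passed as `features=` to
# FeatureStore.get_online_features(...) when serving the model.
```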
124
169
@@ -217,6 +262,10 @@ from feast import FeatureStore
217
262
218
263
store = FeatureStore(...)
219
264
265
+
spark = SparkSession.builder.getOrCreate()
266
+
267
+
streamingDF = spark.readStream.format(...).load()
268
+
220
269
def feast_writer(spark_df):
    pandas_df = spark_df.to_pandas()
    store.push("driver_hourly_stats", pandas_df)
@@ -230,15 +279,7 @@ Alternatively, if you want to ingest features directly from a broker (eg, Kafka
230
279
231
280
If you are using Kafka, [HTTP Sink](https://docs.confluent.io/kafka-connect-http/current/overview.html) could be utilized as a middleware. In this case, the "push service" can be deployed on Kubernetes or as a Serverless function.
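
A sketch of the kind of JSON body such a middleware might send to a push service. The payload shape (a `push_source_name` plus a column-oriented `df` object) is an assumption modeled on the Feast Python feature server's push endpoint and should be checked against the version you deploy; `build_push_payload` and the example column names are hypothetical.

```python
import json


def build_push_payload(push_source_name, rows):
    """Convert a list of row dicts into a column-oriented JSON request body.

    Assumes every row has the same keys.
    """
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return json.dumps({"push_source_name": push_source_name, "df": columns})


body = build_push_payload(
    "driver_hourly_stats",
    [
        {"driver_id": 1001, "conv_rate": 0.5},
        {"driver_id": 1002, "conv_rate": 0.7},
    ],
)
```

The resulting string would be sent as the body of an HTTP POST from the sink connector or serverless function to the push service.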
232
281
233
-
## 6. Monitoring
234
-
235
-
Feast services can report their metrics to a StatsD-compatible collector. To activate this function, you'll need to provide a StatsD IP address and a port when deploying the helm chart (in future, this will be added to `feature_store.yaml`).
236
-
237
-
We use an [InfluxDB-style extension](https://github.com/prometheus/statsd\_exporter#tagging-extensions) for StatsD format to be able to send tags along with metrics. Keep that in mind while selecting the collector ([telegraph](https://www.influxdata.com/blog/getting-started-with-sending-statsd-metrics-to-telegraf-influxdb/#introducing-influx-statsd) will work for sure).
238
-
239
-
We chose StatsD since it's a de-facto standard with various implementations (eg, [1](https://github.com/prometheus/statsd\_exporter), [2](https://github.com/influxdata/telegraf/blob/master/plugins/inputs/statsd/README.md)) and metrics can be easily exported to Prometheus, InfluxDB, AWS CloudWatch, etc.
240
-
241
-
## 7. Using environment variables in your yaml configuration
282
+
## 6. Using environment variables in your yaml configuration
242
283
243
284
You might want to dynamically set parts of your configuration from your environment. For instance, you might deploy Feast to production and development with the same configuration but a different server, or inject secrets without exposing them in your git repo. To do this, you can use the `${ENV_VAR}` syntax in your `feature_store.yaml` file. For instance:
244
285
@@ -268,30 +309,16 @@ online_store:
268
309
269
310
To summarize, here are the architecture options that will be most frequently used in production:
270
311
271
-
### Option #1 (currently preferred)
312
+
### Current Recommendation
272
313
273
314
* The Feast SDK is triggered by CI (e.g., GitHub Actions) and applies the latest changes from the feature repo to the Feast registry
274
315
* Airflow manages materialization jobs to ingest data from DWH to the online store periodically
275
316
* For the stream ingestion Feast Python SDK is used in the existing Spark / Beam pipeline
276
-
* Online features are served via either a Python feature server or a high performance Go feature server
277
-
* The Go feature server can be deployed on a Kubernetes cluster (via Helm charts)
317
+
* For Batch Materialization Engine:
318
+
* If your offline and online workloads are in Snowflake, the Snowflake Engine is likely the best option.
319
+
* If your offline and online workloads are not using Snowflake, but using Kubernetes is an option, the Bytewax engine is likely the best option.
320
+
* If none of these engines suit your needs, you may continue using the in-process engine, or write a custom engine.
321
+
* Online features are served via the Python feature server over HTTP, or consumed using the Feast Python SDK.
278
322
* Feast Python SDK is called locally to generate a training dataset
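
The batch materialization engine guidance in the list above can be read as a simple decision rule. The sketch below encodes it; the function name `recommend_batch_engine` is hypothetical, and the returned strings are descriptive labels, not values for `feature_store.yaml`.

```python
def recommend_batch_engine(uses_snowflake: bool, kubernetes_available: bool) -> str:
    """Return a label for the engine suggested by the recommendations above."""
    if uses_snowflake:
        # Offline and online workloads already live in Snowflake.
        return "Snowflake engine"
    if kubernetes_available:
        # No Snowflake, but a Kubernetes cluster can run the jobs.
        return "Bytewax engine"
    # Otherwise stay with the default or build your own.
    return "in-process or custom engine"
```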
279
323
280
-

281
-
282
-
### Option #2 _(still in development)_
283
-
284
-
Same as Option #1, except:
285
-
286
-
* Push service is deployed as AWS Lambda / Google Cloud Run and is configured as a sink for Kinesis or PubSub to ingest features directly from a stream broker. Lambda / Cloud Run is being managed by Feast SDK (from CI environment)
287
-
* Materialization jobs are managed inside Kubernetes via Kubernetes Job (currently not managed by Helm)
288
-
289
-

290
-
291
-
### Option #3 _(still in development)_
292
-
293
-
Same as Option #2, except:
294
-
295
-
* Push service is deployed on Kubernetes cluster and exposes an HTTP API that can be used as a sink for Kafka (via kafka-http connector) or accessed directly.
296
-
297
-

324
+
