
Commit 191fecd

Adding module 0 with an extensive README

Signed-off-by: Danny Chiao <danny@tecton.ai>
1 parent f492cb9

12 files changed: +129 additions, -22 deletions

README.md

Lines changed: 13 additions & 11 deletions
```diff
@@ -2,7 +2,7 @@
 ## Overview

-This workshop aims to teach basic Feast concepts and walk you through focuses on how to achieve common architectures
+This workshop aims to teach basic Feast concepts and walk you through how to achieve common architectures.

 ## Pre-requisites
 This workshop assumes you have the following installed:
```
```diff
@@ -12,14 +12,16 @@ This workshop assumes you have the following installed:
 - Docker & Docker Compose (e.g. `brew install docker docker-compose`)

 ## This workshop is composed of several modules
+*See also: [Feast quickstart](https://docs.feast.dev/getting-started/quickstart)*
+
+| Description                                       | Module                         |
+| :------------------------------------------------ | ------------------------------ |
+| Setting up and using an initial feature repo      | [Module 0](module_0/README.md) |
+| Online feature retrieval with Kafka, Spark, Redis | [Module 1](module_1/README.md) |
+| On demand feature views                           | TBD                            |
+| Fetching features for batch scoring               | TBD                            |
+| Feast Web UI                                      | TBD                            |
+| Versioning features / models in Feast             | TBD                            |
+| Data quality monitoring in Feast                  | TBD                            |
+| Deploying a feature server to AWS Lambda          | TBD                            |

-| Description | Module |
-| --- | --- |
-| Feast Concepts and basic flows | [Quickstart](https://docs.feast.dev/getting-started/quickstart) |
-| Powering low latency online feature retrieval with Kafka, Spark, and Redis | [Module 1](module_1/README.md) |
-| Using remote registry and file sources, platform vs client user flows, on demand transformations | [Module 2](module_2/README.md) |
-| Fetching features for batch scoring | TBD |
-| Feast Web UI | TBD |
-| Versioning features / models in Feast | TBD |
-| Data quality monitoring in Feast | TBD |
-| Deploying a feature server to AWS Lambda | TBD |
```

module_0/README.md

Lines changed: 116 additions & 0 deletions
# Module 0: Setting up and using an initial Feast feature repo

Welcome! Here we use a basic example to explain key concepts and user flows in Feast.

We focus on a specific example (one that does not include online features + models):
- **Use case**: building a platform for data scientists to share features for training offline models
- **Stack**: you have data in a combination of data warehouses (to be explored in a future module) and data lakes (e.g. S3)

# 1. Mapping to Feast concepts
To support this, you'll need:

| Concept         | Requirements                                                                                         |
| :-------------- | :--------------------------------------------------------------------------------------------------- |
| Data sources    | `FileSource`s (with S3 paths and endpoint overrides), registered with `feast apply`                   |
| Feature views   | Feature views tied to data sources that are shared by data scientists, registered with `feast apply`  |
| Provider        | In `feature_store.yaml`, specifying the `aws` provider to ensure your registry can be stored in S3    |
| Registry        | In `feature_store.yaml`, specifying a path (within an existing S3 bucket) the registry is written to  |
| Transformations | Feast supports last mile transformations with `OnDemandFeatureView`s that can be re-used              |

# 2. User flows
There are three user groups worth considering here: the ML platform team, the data scientists, and the ML engineers scheduling models in batch. We visit the first two of these below.

## 2a. ML Platform Team
The team here sets up the centralized Feast feature repository in GitHub. This is what's seen in `feature_repo_aws/`.

### Step 1: Set up the feature repo
Here, the first thing a platform team needs to do is set up the `feature_store.yaml` within a version controlled repo like GitHub:

```yaml
project: feast_demo_aws
provider: aws
registry: s3://[YOUR BUCKET]/registry.pb
online_store: null
offline_store:
  type: file
flags:
  alpha_features: true
  on_demand_transforms: true
```

A quick recap of what's happening here:
- The `project` gives infrastructure isolation. Commonly, users start with one large project shared by multiple teams.
  - All Feast objects like `FeatureView`s have an associated project. Users can only request features from a single project.
  - Online stores (when relevant) also isolate their data by project.
- The `provider` options available out of the box (`gcp`, `aws`, `local`) set where the registry lives (S3 vs GCS vs a local file) and the defaults for offline / online stores if none are specified.
- The `registry` is the source of truth on registered Feast objects. Users + model servers will pull from this to get the latest registered features + metadata.
  - **Note**: technically, multiple projects can use the same registry, though Feast was not designed with this in mind. Discovery of adjacent features is possible in this flow, but not retrieval.
- The `online_store` here is set to `null`. If you don't need to power real time models with fresh features, it is not needed. If you are batch scoring, for example, then the online store is optional.
- The `offline_store` can only be one type.
  - Here, for instructional purposes, we use `file` sources. These directly read from files (local or remote) and use Dask to execute point-in-time joins. We **do not** recommend this for production usage.
  - Generally, we recommend users bias towards data warehouses as their offline store, since they are very performant at generating training datasets.
  - There is also a contrib plugin (`SparkOfflineStore`) which supports retrieving features with Spark.
- The `flags` control a couple of features today. We're likely to deprecate this system soon, but today it still gates `OnDemandFeatureView`, which is still under development.
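To make the point-in-time join concrete, here is a minimal sketch (with made-up data, using pandas rather than Feast's Dask-based implementation) of what the offline store does for each entity row: it picks the latest feature value at or before that row's event timestamp.

```python
import pandas as pd

# Entity dataframe: which driver we want features for, and as of when
entity_df = pd.DataFrame({
    "driver_id": [1001, 1001],
    "event_timestamp": pd.to_datetime(["2022-05-01 10:00", "2022-05-02 10:00"]),
})

# Offline feature data, e.g. rows read from a Parquet file in S3
features = pd.DataFrame({
    "driver_id": [1001, 1001, 1001],
    "event_timestamp": pd.to_datetime(
        ["2022-05-01 09:00", "2022-05-01 23:00", "2022-05-02 09:00"]
    ),
    "conv_rate": [0.1, 0.2, 0.3],
})

# For each entity row, take the most recent feature value at or before
# its timestamp (this avoids feature leakage from the future)
training_df = pd.merge_asof(
    entity_df.sort_values("event_timestamp"),
    features.sort_values("event_timestamp"),
    on="event_timestamp",
    by="driver_id",
)
print(training_df["conv_rate"].tolist())  # [0.1, 0.3]
```

Note the 23:00 value is skipped for the first row even though it is "fresher", because it lies after that row's event timestamp.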

With the `feature_store.yaml` set up, you can now run `feast apply` to populate the registry. At this point, you can move on to version control.

### Step 2: Adding the feature repo to version control
TODO

Here we also set up CI/CD. You'll want a workflow that runs `feast apply` on PR merge.

See https://github.com/feast-dev/feast-demo/blob/main/.github/workflows/feast_plan.yml for an example of a workflow that automatically runs `feast plan` on new incoming PRs, alerting you to what changes will occur. This is useful for helping PR reviewers understand the effects of a change.

One example: a PR may change features that are already depended on in production by another model (e.g. through a `FeatureService`).
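A hypothetical sketch of such a workflow (the action versions, repo path, and secret names here are assumptions, not taken from the linked file):

```yaml
name: feast plan
on: [pull_request]

jobs:
  feast-plan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: "3.9"
      - run: pip install feast
      # Print the registry / infrastructure changes this PR would cause
      - run: cd feature_repo_aws && feast plan
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
```

A matching workflow that runs `feast apply` would trigger on pushes to the main branch instead of on pull requests.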

### Step 3 (optional): Access control for the registry
We don't dive into this deeply, but you don't want to allow arbitrary users to clone the feature repository, change definitions, and run `feast apply`. Thus, you should lock down your registry (e.g. with an S3 bucket policy) to only allow changes from your CI/CD user and perhaps some ML engineers.
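For instance, a bucket policy along these lines (the bucket name, account ID, and role name are hypothetical) denies registry writes from everyone except the CI/CD role:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "OnlyCICanWriteRegistry",
      "Effect": "Deny",
      "Principal": "*",
      "Action": ["s3:PutObject", "s3:DeleteObject"],
      "Resource": "arn:aws:s3:::your-bucket/registry.pb",
      "Condition": {
        "ArnNotEquals": {
          "aws:PrincipalArn": "arn:aws:iam::123456789012:role/feast-ci"
        }
      }
    }
  ]
}
```

Read access can stay broader, since data scientists and model servers only need to fetch the registry, not mutate it.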

### Step 4 (optional): Set up a Web UI endpoint
Feast comes with an experimental Web UI. Users can already spin this up locally with `feast ui`, but you may want a Web UI that is universally available. Here, you'd likely deploy a service that runs `feast ui` on top of a `feature_store.yaml`, with some configuration on how frequently the UI should refresh its registry.

### Other best practices
Many Feast users use `tags` on objects extensively. Some examples of how these may be used:
- To give more detailed documentation on a `FeatureView`
- To highlight what groups you need to join to gain access to certain feature views
- To denote whether a feature service is in production or in staging

Additionally, users will often want to have a dev/staging environment that's separate from production. In this case, one pattern that works is to have separate projects:
```bash
├── .github
│   └── workflows
│       ├── production.yml
│       └── staging.yml
├── staging
│   ├── driver_repo.py
│   └── feature_store.yaml
└── production
    ├── driver_repo.py
    └── feature_store.yaml
```
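One (hypothetical) way tooling or CI can pick the right repo directory from this layout, sketched in Python; the `FEAST_ENV` variable name is an assumption for illustration:

```python
import os

VALID_ENVS = ("staging", "production")

def feature_repo_path(env=None):
    """Return the feature repo directory for the current environment."""
    env = env or os.environ.get("FEAST_ENV", "staging")
    if env not in VALID_ENVS:
        raise ValueError("unknown environment: %r" % env)
    # The returned directory can then be passed to the CLI,
    # e.g. `feast --chdir <path> apply`
    return env

print(feature_repo_path("production"))  # production
```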

## 2b. Data scientists
TODO

Two ways of working:
- Use the `client/` folder approach of not authoring features, and primarily re-using features already used in production.
- Have a local copy of the feature repository (e.g. via `git clone`). The data scientist can then iterate on features locally, apply features to their own dev project with a local registry, and then submit PRs to include features that should be used in production (including for A/B experiments or model training iterations).

Data scientists can also investigate other models and their dependent features / data sources / on demand transformations through the repository or through the Web UI (by running `feast ui`).

## 2c. ML engineers
TODO

Discuss the `client/` folder, which only needs the `feature_store.yaml` to fetch features and schedule periodic training + model inference jobs.

# Conclusion
As a result:
- You have file sources in S3.
- You have data scientists who are able to author + reuse features based on a centrally managed registry.
- You have CI/CD running `feast apply` on merged changes.
- You have a remote server that needs to call Feast to retrieve features, including executing on demand transformations to pass for model inference.
- As a result of having multiple services needing central access to a registry, you also have your registry stored in S3.
- You have multiple data scientists needing access to features.

TODO
File renamed without changes.
File renamed without changes.
