Commit cad91c0

Flesh out module 2

Signed-off-by: Danny Chiao <danny@tecton.ai>
1 parent fc07882 · commit cad91c0

File tree: 7 files changed, +89 −102 lines

module_0/feature_repo_aws/feature_services.py (1 addition, 1 deletion)

```diff
@@ -8,6 +8,6 @@
     owner="test3@gmail.com",
 )
 
-feature_service = FeatureService(
+feature_service_2 = FeatureService(
     name="model_v2", features=[driver_hourly_stats_view], owner="test3@gmail.com",
 )
```
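The rename fixes a name collision: both services in this file were bound to the same variable `feature_service`, and in Python a second assignment to a module-level name simply rebinds it, so anything that discovers objects by scanning the module's variables would only ever see the second service. A minimal sketch of the shadowing behavior (plain dicts standing in for `FeatureService` objects):

```python
# Two assignments to one name: after the second, the first object is no
# longer reachable through the module-level name.
feature_service = {"name": "model_v1"}
feature_service = {"name": "model_v2"}  # rebinds the name, shadowing model_v1

assert feature_service["name"] == "model_v2"
```

With distinct names (`feature_service_2`, `feature_service_3`), every service object stays reachable in the module namespace.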

module_1/feature_repo/feature_services.py (1 addition, 1 deletion)

```diff
@@ -8,6 +8,6 @@
     owner="test3@gmail.com",
 )
 
-feature_service = FeatureService(
+feature_service_2 = FeatureService(
     name="model_v2", features=[driver_hourly_stats_view], owner="test3@gmail.com",
 )
```

module_2/README.md (63 additions, 16 deletions)

````diff
@@ -1,30 +1,77 @@
-# Module 2: On demand transformations
-TODO
+<h1>Module 2: On demand transformations</h1>
+
+In this module, we introduce the concept of on demand transforms. These are transformations that execute on-the-fly and accept as input other feature views or request data.
 
+TODO:
+- add architecture
 - Define request data
 - Define on demand transforms
 - Note that this can also transforms pushed features (e.g. stream features)
 - Note that this can combine multiple feature views and request data
 
+<h2>Table of Contents</h2>
 
-<h1>Module 2: On demand transformations</h1>
+- [Workshop](#workshop)
+- [Step 1: Install Feast](#step-1-install-feast)
+- [Step 2: Look at the data we have](#step-2-look-at-the-data-we-have)
+- [Step 3: Apply features](#step-3-apply-features)
+- [Step 3: Materialize batch features](#step-3-materialize-batch-features)
+- [Step 4: Test retrieve features](#step-4-test-retrieve-features)
+- [Conclusion](#conclusion)
+
+# Workshop
+## Step 1: Install Feast
 
-In this module, we introduce the concept of on demand transforms. These are transformations that execute on-the-fly and accept as input other feature views or request data.
+First, we install Feast as well as a Geohash module we want to use:
+```bash
+pip install feast
+pip install pygeohash
+```
 
-We and focus on building features for online serving, and keeping them fresh with a combination of batch feature materialization and stream feature ingestion. We'll be roughly working towards the following:
+## Step 2: Look at the data we have
+We used `data/gen_lat_lon.py` to append randomly generated latitude and longitudes to the original driver stats dataset.
 
-- **Data sources**: Kafka + File source
-- **Online store**: Redis
-- **Use case**: Predicting churn for drivers in real time.
+```python
+import pandas as pd
+pd.read_parquet("data/driver_stats_lat_lon.parquet")
+```
 
-<img src="architecture.png" width=750>
+![](data.png)
 
-<h2>Table of Contents</h2>
+## Step 3: Apply features
+```console
+$ feast apply
 
-# Workshop
-## Step 1: Install Feast
+Created entity driver
+Created feature view driver_daily_features
+Created feature view driver_hourly_stats
+Created on demand feature view transformed_conv_rate
+Created on demand feature view avg_hourly_miles_driven
+Created on demand feature view location_features_from_push
+Created feature service model_v3
+Created feature service model_v2
+Created feature service model_v1
 
-First, we install Feast with Spark and Redis support:
-```bash
-pip install "feast[spark,redis]"
-```
+Created sqlite table feast_demo_odfv_driver_daily_features
+Created sqlite table feast_demo_odfv_driver_hourly_stats
+```
+
+## Step 3: Materialize batch features
+```console
+$ feast materialize-incremental $(date +%Y-%m-%d)
+
+Materializing 2 feature views to 2022-05-17 12:41:18-04:00 into the sqlite online store.
+
+driver_hourly_stats from 1748-08-01 16:41:20-04:56:02 to 2022-05-17 12:41:18-04:00:
+100%|████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 495.03it/s]
+driver_daily_features from 1748-08-01 16:41:20-04:56:02 to 2022-05-17 12:41:18-04:00:
+100%|███████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 1274.48it/s]
+```
+
+## Step 4: Test retrieve features
+Now we'll see how these transformations are executed offline at `get_historical_features` and online at `get_online_features` time. We'll also see how `OnDemandFeatureView` interacts with request data, regular feature views, and streaming / push features.
+
+Try out the Jupyter notebook in [client/module_2_client.ipynb](client/module_2_client.ipynb). This is in a separate directory that contains just a `feature_store.yaml`.
+
+# Conclusion
+TODO
````
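To make the README's core idea concrete: an on demand transform is just a pandas function over a frame that joins precomputed feature values with request-time data, executed at retrieval. A minimal standalone sketch of the `transformed_conv_rate` logic (column names follow this module's feature repo; the input values are made up for illustration):

```python
import pandas as pd

# Precomputed feature (conv_rate) joined with request-time data (val_to_add).
# The transform runs row-wise when features are requested; nothing is
# materialized ahead of time.
inputs = pd.DataFrame({"conv_rate": [0.5, 0.7], "val_to_add": [1, 2]})

df = pd.DataFrame()
df["conv_rate_plus_val1"] = inputs["conv_rate"] + inputs["val_to_add"]
print(df["conv_rate_plus_val1"].tolist())  # [1.5, 2.7]
```

The same function body serves both `get_historical_features` (offline, over entity dataframes) and `get_online_features` (online, over rows fetched from the online store plus the request payload).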

module_2/client/test_fetch.py (0 additions, 61 deletions)

This file was deleted.

module_2/feature_repo/data_sources.py (7 additions, 3 deletions)

```diff
@@ -8,7 +8,7 @@
 
 driver_stats = FileSource(
     name="driver_stats_source",
-    path="../data/driver_stats_lat_lon.parquet",  # Should be a remote path in reality for re-use
+    path="../data/driver_stats_lat_lon.parquet",
     timestamp_field="event_timestamp",
     created_timestamp_column="created",
     description="A table describing the stats of a driver based on hourly logs",
@@ -24,10 +24,14 @@
 # available at request time (e.g. part of the user initiated HTTP request)
 driver_request = RequestSource(
     name="driver_request",
+    schema=[Field(name="lat", dtype=Float32), Field(name="lon", dtype=Float32),],
+)
+
+
+val_to_add_request = RequestSource(
+    name="vals_to_add",
     schema=[
         Field(name="val_to_add", dtype=Int64),
         Field(name="val_to_add_2", dtype=Int64),
-        Field(name="lat", dtype=Float32),
-        Field(name="lon", dtype=Float32),
     ],
 )
```
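Splitting one `RequestSource` into two means each on demand feature view can declare only the request fields it actually consumes, so a client requesting only one transform's features no longer has to supply the other's inputs. A sketch of hypothetical request payloads before and after the split (plain dicts; field names from the schemas above):

```python
# After the split: each transform's request payload carries only its own fields.
request_for_conv_rate = {"val_to_add": 1, "val_to_add_2": 10}
request_for_location = {"lat": 40.7128, "lon": -74.0060}

# Before the split: a single RequestSource forced every request to carry all
# four fields, even for transforms that ignored lat/lon entirely.
combined_before = {**request_for_conv_rate, **request_for_location}
assert set(combined_before) == {"val_to_add", "val_to_add_2", "lat", "lon"}
```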

module_2/feature_repo/feature_services.py (4 additions, 14 deletions)

```diff
@@ -8,25 +8,15 @@
     owner="test3@gmail.com",
 )
 
-feature_service = FeatureService(
+feature_service_2 = FeatureService(
     name="model_v2",
-    features=[
-        driver_hourly_stats_view,
-        driver_daily_features_view,
-        transformed_conv_rate,
-    ],
+    features=[driver_hourly_stats_view[["conv_rate"]], transformed_conv_rate,],
     owner="test3@gmail.com",
 )
 
-feature_service = FeatureService(
+feature_service_3 = FeatureService(
     name="model_v3",
-    features=[
-        driver_hourly_stats_view,
-        driver_daily_features_view,
-        transformed_conv_rate,
-        avg_hourly_miles_driven,
-        location_features,
-    ],
+    features=[driver_daily_features_view, location_features_from_push,],
     owner="test3@gmail.com",
 )
 
```
3222

module_2/feature_repo/features.py (13 additions, 6 deletions)

```diff
@@ -32,7 +32,11 @@
     name="driver_daily_features",
     entities=["driver"],
     ttl=timedelta(seconds=8640000000),
-    schema=[Field(name="daily_miles_driven", dtype=Float32),],
+    schema=[
+        Field(name="daily_miles_driven", dtype=Float32),
+        Field(name="lat", dtype=Float32),
+        Field(name="lon", dtype=Float32),
+    ],
     online=True,
     source=driver_stats_push_source,
     tags={"production": "True"},
@@ -43,7 +47,7 @@
 # Define an on demand feature view which can generate new features based on
 # existing feature views and RequestSource features
 @on_demand_feature_view(
-    sources=[driver_hourly_stats_view, driver_request],
+    sources=[driver_hourly_stats_view, val_to_add_request],
     schema=[
         Field(name="conv_rate_plus_val1", dtype=Float64),
         Field(name="conv_rate_plus_val2", dtype=Float64),
@@ -67,14 +71,17 @@ def avg_hourly_miles_driven(inputs: pd.DataFrame) -> pd.DataFrame:
 
 
 @on_demand_feature_view(
-    sources=[driver_daily_features_view, driver_request],
+    sources=[driver_daily_features_view],
     schema=[Field(name=f"geohash_{i}", dtype=String) for i in range(1, 7)],
 )
-def location_features(inputs: pd.DataFrame) -> pd.DataFrame:
+def location_features_from_push(inputs: pd.DataFrame) -> pd.DataFrame:
     import pygeohash as gh
 
     df = pd.DataFrame()
-    geohash = df.apply(lambda x: gh.encode(x.lat, x.lon), axis=1)
+    df["geohash"] = inputs.apply(lambda x: gh.encode(x.lat, x.lon), axis=1).astype(
+        "string"
+    )
+
     for i in range(1, 7):
-        df[f"geohash_{i}"] = geohash.str[:i]
+        df[f"geohash_{i}"] = df["geohash"].str[:i].astype("string")
     return df
```

Note the bug this hunk fixes: the old code applied the geohash lambda to the empty `df` instead of to `inputs`, so the transform produced no rows; the new code encodes `inputs` and also casts to pandas `string` dtype to match the declared `String` fields.
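The prefix-truncation step of the transform can be exercised standalone without pygeohash, since geohash precision is just string length: each shorter prefix names a coarser bounding box. A sketch with a hardcoded geohash (an arbitrary example value, not data from this repo):

```python
import pandas as pd

# A full-precision geohash; the real transform computes this from lat/lon
# with pygeohash.encode. geohash_1..geohash_6 are progressively coarser
# spatial buckets: each extra character refines the bounding box.
inputs = pd.DataFrame({"geohash": ["dr5regw3pg6f"]})

df = pd.DataFrame()
for i in range(1, 7):
    df[f"geohash_{i}"] = inputs["geohash"].str[:i].astype("string")

print(df.iloc[0].tolist())  # ['d', 'dr', 'dr5', 'dr5r', 'dr5re', 'dr5reg']
```

Emitting all six precisions as separate features lets a model (or a downstream aggregation) pick whichever spatial granularity is useful.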
