Commit ec4c02b

adchia and gitbook-bot authored and committed
GitBook: [#313] Instructions for custom data sources
1 parent 41affbb commit ec4c02b

1 file changed

+89
-9
lines changed

docs/how-to-guides/adding-a-new-offline-store.md

Lines changed: 89 additions & 9 deletions
@@ -2,7 +2,7 @@
 
 ## Overview
 
-Feast makes adding support for a new offline store (database) easy. Developers can simply implement the [OfflineStore](../../sdk/python/feast/infra/offline_stores/offline_store.py#L41) interface to add support for a new store (other than the existing stores like Parquet files, Redshift, and Bigquery).
+Feast makes adding support for a new offline store (database) easy. Developers can simply implement the [OfflineStore](../../sdk/python/feast/infra/offline\_stores/offline\_store.py#L41) interface to add support for a new store (other than the existing stores like Parquet files, Redshift, and BigQuery).
 
 In this guide, we will show you how to extend the existing File offline store and use it in a feature repo. While we will be implementing a specific store, this guide should be representative of adding support for any new offline store.

@@ -13,15 +13,16 @@ The process for using a custom offline store consists of 4 steps:
 1. Defining an `OfflineStore` class.
 2. Defining an `OfflineStoreConfig` class.
 3. Defining a `RetrievalJob` class for this offline store.
-4. Referencing the `OfflineStore` in a feature repo's `feature_store.yaml` file.
+4. Defining a `DataSource` class for the offline store.
+5. Referencing the `OfflineStore` in a feature repo's `feature_store.yaml` file.
 
 ## 1. Defining an OfflineStore class
 
 {% hint style="info" %}
-OfflineStore class names must end with the OfflineStore suffix!
+ OfflineStore class names must end with the OfflineStore suffix!
 {% endhint %}
 
-The OfflineStore class contains a couple of methods to read features from the offline store. Unlike the OnlineStore class, Feast does not manage any infrastructure for the offline store.
+The OfflineStore class contains a couple of methods to read features from the offline store. Unlike the OnlineStore class, Feast does not manage any infrastructure for the offline store.
 
 There are two methods that deal with reading data from the offline store: `get_historical_features` and `pull_latest_from_table_or_query`.

@@ -72,11 +73,11 @@ There are two methods that deal with reading data from the offline stores`get_hi
 
 Additional configuration may be needed to allow the OfflineStore to talk to the backing store. For example, Redshift needs connection information for the Redshift instance, credentials for connecting to the database, etc.
 
-To facilitate configuration, all OfflineStore implementations are **required** to also define a corresponding OfflineStoreConfig class in the same file. This OfflineStoreConfig class should inherit from the `FeastConfigBaseModel` class, which is defined [here](../../sdk/python/feast/repo_config.py#L44).
+To facilitate configuration, all OfflineStore implementations are **required** to also define a corresponding OfflineStoreConfig class in the same file. This OfflineStoreConfig class should inherit from the `FeastConfigBaseModel` class, which is defined [here](../../sdk/python/feast/repo\_config.py#L44).
 
 The `FeastConfigBaseModel` is a [pydantic](https://pydantic-docs.helpmanual.io) class, which parses YAML configuration into Python objects. Pydantic also allows the model classes to define validators for the config classes, to make sure that the config classes are correctly defined.
 
-This config class **must** container a `type` field, which contains the fully qualified class name of its corresponding OfflineStore class.
+This config class **must** contain a `type` field, which contains the fully qualified class name of its corresponding OfflineStore class.
 
 Additionally, the name of the config class must be the name of the OfflineStore class with the `Config` suffix appended.

@@ -102,7 +103,7 @@ This configuration can be specified in the `feature_store.yaml` as follows:
 ```
 {% endcode %}
 
-This configuration information is available to the methods of the OfflineStore, via the`config: RepoConfig` parameter which is passed into the methods of the OfflineStore interface, specifically at the `config.offline_store` field of the `config` parameter.
+This configuration information is available to the methods of the OfflineStore through the `config: RepoConfig` parameter, which is passed into the methods of the OfflineStore interface, specifically in the `config.offline_store` field of the `config` parameter.
 
 {% code title="feast_custom_offline_store/file.py" %}
 ```python
@@ -153,9 +154,69 @@ class CustomFileRetrievalJob(RetrievalJob):
 ```
 {% endcode %}
 
-## 4. Using the custom offline store
+## 4. Defining a DataSource class for the offline store
 
-After implementing these classes, the custom offline store can be used by referencing it in a feature repo's `feature_store.yaml` file, specifically in the `offline_store` field. The value specified should be the fully qualified class name of the OfflineStore.
+Before this offline store can be used as the batch source for a feature view in a feature repo, a subclass of the `DataSource` [base class](https://rtd.feast.dev/en/master/index.html?highlight=DataSource#feast.data\_source.DataSource) needs to be defined. This class is responsible for holding information needed by specific feature views to support reading historical values from the offline store. For example, a feature view using Redshift as the offline store may need to know which table contains historical feature values.
+
+The data source class should implement two methods: `from_proto` and `to_proto`.
+
+For custom offline stores that are not implemented in the main Feast repository, the `custom_options` field should be used to store any configuration needed by the data source. In this case, the implementer is responsible for serializing this configuration into bytes in the `to_proto` method and reading the value back from bytes in the `from_proto` method.
+
+{% code title="feast_custom_offline_store/file.py" %}
+```python
+class CustomFileDataSource(FileSource):
+    """Custom data source class for local files"""
+    def __init__(
+        self,
+        event_timestamp_column: Optional[str] = "",
+        path: Optional[str] = None,
+        field_mapping: Optional[Dict[str, str]] = None,
+        created_timestamp_column: Optional[str] = "",
+        date_partition_column: Optional[str] = "",
+    ):
+        super(CustomFileDataSource, self).__init__(
+            event_timestamp_column,
+            created_timestamp_column,
+            field_mapping,
+            date_partition_column,
+        )
+        self._path = path
+
+
+    @staticmethod
+    def from_proto(data_source: DataSourceProto):
+        custom_source_options = str(
+            data_source.custom_options.configuration, encoding="utf8"
+        )
+        path = json.loads(custom_source_options)["path"]
+        return CustomFileDataSource(
+            field_mapping=dict(data_source.field_mapping),
+            path=path,
+            event_timestamp_column=data_source.event_timestamp_column,
+            created_timestamp_column=data_source.created_timestamp_column,
+            date_partition_column=data_source.date_partition_column,
+        )
+
+    def to_proto(self) -> DataSourceProto:
+        config_json = json.dumps({"path": self._path})
+        data_source_proto = DataSourceProto(
+            type=DataSourceProto.CUSTOM_SOURCE,
+            custom_options=DataSourceProto.CustomSourceOptions(
+                configuration=bytes(config_json, encoding="utf8")
+            ),
+        )
+
+        data_source_proto.event_timestamp_column = self.event_timestamp_column
+        data_source_proto.created_timestamp_column = self.created_timestamp_column
+        data_source_proto.date_partition_column = self.date_partition_column
+
+        return data_source_proto
+```
+{% endcode %}
+
+## 5. Using the custom offline store
+
+After implementing these classes, the custom offline store can be used by referencing it in a feature repo's `feature_store.yaml` file, specifically in the `offline_store` field. The value specified should be the fully qualified class name of the OfflineStore.
 
 As long as your OfflineStore class is available in your Python environment, it will be imported by Feast dynamically at runtime.

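Note on the hunk above: `to_proto` and `from_proto` store the data source's settings in `custom_options` as JSON encoded to UTF-8 bytes. The sketch below isolates that round trip using only the standard library, so it runs without Feast installed. The helper names `encode_custom_options` and `decode_custom_options` are hypothetical and are not part of Feast.

```python
import json
from typing import Any, Dict


def encode_custom_options(options: Dict[str, Any]) -> bytes:
    """Serialize options to UTF-8 JSON bytes (hypothetical helper, mirrors to_proto)."""
    return json.dumps(options).encode("utf8")


def decode_custom_options(configuration: bytes) -> Dict[str, Any]:
    """Parse UTF-8 JSON bytes back into a dict (hypothetical helper, mirrors from_proto)."""
    return json.loads(configuration.decode("utf8"))


# Round trip: the decoded options should equal what was encoded.
payload = encode_custom_options({"path": "feature_repo/data/driver_stats.parquet"})
restored = decode_custom_options(payload)
```

Any encoding would work as long as both methods agree on it; JSON is simply what the guide's example uses.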
@@ -181,3 +242,22 @@ provider: local
 offline_store: feast_custom_offline_store.file.CustomFileOfflineStore
 ```
 {% endcode %}
+
+Finally, the custom data source class can be used in the feature repo to define a data source, which can then be referenced in a feature view definition.
+
+{% code title="feature_repo/repo.py" %}
+```python
+driver_hourly_stats = CustomFileDataSource(
+    path="feature_repo/data/driver_stats.parquet",
+    event_timestamp_column="event_timestamp",
+    created_timestamp_column="created",
+)
+
+
+driver_hourly_stats_view = FeatureView(
+    batch_source=driver_hourly_stats,
+    ...
+)
+
+```
+{% endcode %}
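
Note on `offline_store: feast_custom_offline_store.file.CustomFileOfflineStore`: the guide says the class named there is imported dynamically at runtime. The sketch below shows one common way to resolve a fully qualified class name with the standard library. `import_class` is a hypothetical helper, not Feast's actual loader, and it is exercised on a standard-library class so that it runs without Feast or the custom package installed.

```python
import importlib


def import_class(fully_qualified_name: str) -> type:
    """Resolve 'package.module.ClassName' to the class object (hypothetical helper)."""
    module_name, _, class_name = fully_qualified_name.rpartition(".")
    if not module_name:
        raise ValueError(
            f"Expected 'module.ClassName', got {fully_qualified_name!r}"
        )
    # Import the module part, then look up the class attribute on it.
    module = importlib.import_module(module_name)
    cls = getattr(module, class_name)
    if not isinstance(cls, type):
        raise TypeError(f"{fully_qualified_name} is not a class")
    return cls


# Stand-in for "feast_custom_offline_store.file.CustomFileOfflineStore".
ordered_dict_cls = import_class("collections.OrderedDict")
```

A lookup of this kind only succeeds if the module is importable, which is why the class has to be available in the Python environment that runs Feast.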
