
Commit 98b8d8d

kevjumba and adchia authored
feat: Feast Spark Offline Store (feast-dev#2349)
Squashed commits:

* State of feast
* Clean up changes
* Fix random incorrect changes
* Fix lint
* Fix build errors
* Fix lint
* Add spark offline store components to test against current integration tests
* Fix lint
* Rename to pass checks
* Fix issues
* Fix type checking issues
* Fix lint
* Clean up print statements for first review
* Fix lint
* Fix flake 8 lint tests
* Add warnings for alpha version release
* Format
* Address review
* Address review
* Fix lint
* Add file store functionality
* lint
* Add example feature repo
* Update data source creator
* Make cli work for feast init with spark
* Update the docs
* Clean up code
* Clean up more code
* Uncomment repo configs
* Fix setup.py
* Update dependencies
* Fix ci dependencies
* Screwed up rebase
* Screwed up rebase
* Screwed up rebase
* Realign with master
* Fix accidental changes
* Make type map change cleaner
* Address review comments
* Fix tests accidentally broken
* Add comments
* Reformat
* Fix logger
* Remove unused imports
* Fix imports
* Fix CI dependencies
* Prefix destinations with project name
* Update comment
* Fix 3.8
* temporary fix
* rollback
* update
* Update ci?
* Move third party to contrib
* Fix imports
* Remove third_party refactor
* Revert ci requirements and update comment in type map
* Revert 3.8-requirements

Signed-off-by: Kevin Zhang <kzhang@tecton.ai>
Signed-off-by: Danny Chiao <danny@tecton.ai>
Co-authored-by: Danny Chiao <danny@tecton.ai>
1 parent 74f887f commit 98b8d8d

21 files changed

Lines changed: 1401 additions & 208 deletions

File tree

docs/reference/data-sources/README.md

Lines changed: 2 additions & 0 deletions
```diff
@@ -9,3 +9,5 @@ Please see [Data Source](../../getting-started/concepts/feature-view.md#data-sou
 {% page-ref page="bigquery.md" %}
 
 {% page-ref page="redshift.md" %}
+
+{% page-ref page="spark.md" %}
```
docs/reference/data-sources/spark.md

Lines changed: 45 additions & 0 deletions

# Spark

## Description

**NOTE**: The Spark data source API is currently in alpha development. The API is not yet stable and may change in future releases.

The Spark data source API allows retrieving historical feature values from file or database sources, both for building training datasets and for materializing features into an online store.

* Either a table name, a SQL query, or a file path can be provided.

## Examples

Using a table reference from a SparkSession (for example, an in-memory table or a Hive Metastore table):

```python
from feast import SparkSource

my_spark_source = SparkSource(
    table="FEATURE_TABLE",
)
```

Using a query:

```python
from feast import SparkSource

my_spark_source = SparkSource(
    query="SELECT timestamp as ts, created, f1, f2 "
          "FROM spark_table",
)
```

Using a file reference:

```python
from feast import SparkSource

my_spark_source = SparkSource(
    path=f"{CURRENT_DIR}/data/driver_hourly_stats",
    file_format="parquet",
    event_timestamp_column="event_timestamp",
    created_timestamp_column="created",
)
```
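The "either a table name, a SQL query, or a file path" rule suggests the source expects exactly one of these options. The following is a toy, stdlib-only sketch of what that kind of mutual-exclusion check looks like; `validate_spark_source_args` is a hypothetical helper for illustration, not Feast's actual `SparkSource` code.

```python
# Hypothetical sketch of a "pick exactly one of table / query / path" check,
# mirroring the rule described above. Not Feast's actual SparkSource code.
from typing import Optional


def validate_spark_source_args(
    table: Optional[str] = None,
    query: Optional[str] = None,
    path: Optional[str] = None,
) -> str:
    """Return which source kind was chosen, or raise if not exactly one."""
    provided = {
        name: value
        for name, value in {"table": table, "query": query, "path": path}.items()
        if value
    }
    if len(provided) != 1:
        raise ValueError(
            f"Exactly one of table, query, or path must be set; got {sorted(provided)}"
        )
    return next(iter(provided))


print(validate_spark_source_args(table="FEATURE_TABLE"))  # table
```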

docs/reference/offline-stores/README.md

Lines changed: 2 additions & 0 deletions
```diff
@@ -9,3 +9,5 @@ Please see [Offline Store](../../getting-started/architecture-and-components/off
 {% page-ref page="bigquery.md" %}
 
 {% page-ref page="redshift.md" %}
+
+{% page-ref page="spark.md" %}
```
docs/reference/offline-stores/spark.md

Lines changed: 38 additions & 0 deletions

# Spark

## Description

The Spark offline store, currently in alpha development, provides support for reading [SparkSources](../data-sources/spark.md).

## Disclaimer

The Spark offline store does not yet have full test coverage and still fails some integration tests in the Feast universal test suite. Please do NOT assume the API is stable.

* Spark tables and views loaded from a Spark store (e.g. Hive or in-memory tables) are allowed as sources.
* Entity dataframes can be provided as a SQL query or as a Pandas dataframe. A Pandas dataframe is converted to a Spark dataframe and processed as a temporary view.
* A `SparkRetrievalJob` is returned when calling `get_historical_features()`. This allows you to call:
  * `to_df` to retrieve the result as a Pandas dataframe,
  * `to_arrow` to retrieve it as a PyArrow Table,
  * `to_spark_df` to retrieve it as a Spark dataframe.

## Example

{% code title="feature_store.yaml" %}
```yaml
project: my_project
registry: data/registry.db
provider: local
offline_store:
  type: spark
  spark_conf:
    spark.master: "local[*]"
    spark.ui.enabled: "false"
    spark.eventLog.enabled: "false"
    spark.sql.catalogImplementation: "hive"
    spark.sql.parser.quotedRegexColumnNames: "true"
    spark.sql.session.timeZone: "UTC"
online_store:
  path: data/online_store.db
```
{% endcode %}
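Conceptually, `get_historical_features()` performs a point-in-time join: for each entity row, it picks the latest feature row whose timestamp does not exceed the entity's event timestamp. The following is a toy, stdlib-only illustration of that semantics over plain dicts; it ignores TTLs and created timestamps, and it is not Feast's Spark implementation, which expresses the join in Spark SQL.

```python
# Toy point-in-time join over plain dicts, illustrating the semantics of
# get_historical_features(); the real Spark offline store does this in Spark SQL.
from datetime import datetime
from typing import List


def point_in_time_join(
    entity_rows: List[dict], feature_rows: List[dict]
) -> List[dict]:
    out = []
    for e in entity_rows:
        # Latest feature row for the same driver_id with ts <= event_timestamp.
        candidates = [
            f
            for f in feature_rows
            if f["driver_id"] == e["driver_id"]
            and f["ts"] <= e["event_timestamp"]
        ]
        best = max(candidates, key=lambda f: f["ts"], default=None)
        out.append({**e, "conv_rate": best["conv_rate"] if best else None})
    return out


entities = [{"driver_id": 1001, "event_timestamp": datetime(2022, 1, 2)}]
features = [
    {"driver_id": 1001, "ts": datetime(2022, 1, 1), "conv_rate": 0.5},
    {"driver_id": 1001, "ts": datetime(2022, 1, 3), "conv_rate": 0.9},  # too late
]
print(point_in_time_join(entities, features))  # conv_rate resolves to 0.5
```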

sdk/python/feast/__init__.py

Lines changed: 4 additions & 0 deletions
```diff
@@ -3,6 +3,9 @@
 from pkg_resources import DistributionNotFound, get_distribution
 
 from feast.infra.offline_stores.bigquery_source import BigQuerySource
+from feast.infra.offline_stores.contrib.spark_offline_store.spark_source import (
+    SparkSource,
+)
 from feast.infra.offline_stores.file_source import FileSource
 from feast.infra.offline_stores.redshift_source import RedshiftSource
 from feast.infra.offline_stores.snowflake_source import SnowflakeSource
@@ -47,4 +50,5 @@
     "RedshiftSource",
     "RequestFeatureView",
     "SnowflakeSource",
+    "SparkSource",
 ]
```

sdk/python/feast/cli.py

Lines changed: 3 additions & 1 deletion
```diff
@@ -477,7 +477,9 @@ def materialize_incremental_command(ctx: click.Context, end_ts: str, views: List
 @click.option(
     "--template",
     "-t",
-    type=click.Choice(["local", "gcp", "aws", "snowflake"], case_sensitive=False),
+    type=click.Choice(
+        ["local", "gcp", "aws", "snowflake", "spark"], case_sensitive=False
+    ),
     help="Specify a template for the created project",
     default="local",
 )
```
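This change simply adds `spark` to the allowed `--template` values for `feast init`. The same constrained-choice pattern can be sketched with stdlib `argparse` as an analogue for illustration; Feast's CLI actually uses `click.Choice` as shown in the diff, and this sketch omits click's `case_sensitive=False` behavior.

```python
# Stdlib argparse analogue of the click.Choice change above: restrict
# --template to a fixed set of values, now including "spark".
import argparse

parser = argparse.ArgumentParser(prog="feast")
parser.add_argument(
    "-t",
    "--template",
    choices=["local", "gcp", "aws", "snowflake", "spark"],
    default="local",
    help="Specify a template for the created project",
)

args = parser.parse_args(["--template", "spark"])
print(args.template)  # spark
```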

sdk/python/feast/inference.py

Lines changed: 5 additions & 2 deletions
```diff
@@ -8,6 +8,7 @@
     FileSource,
     RedshiftSource,
     SnowflakeSource,
+    SparkSource,
 )
 from feast.data_source import DataSource
 from feast.errors import RegistryInferenceFailure
@@ -84,7 +85,9 @@ def update_data_sources_with_inferred_event_timestamp_col(
 ):
         # prepare right match pattern for data source
         ts_column_type_regex_pattern = ""
-        if isinstance(data_source, FileSource):
+        if isinstance(data_source, FileSource) or isinstance(
+            data_source, SparkSource
+        ):
             ts_column_type_regex_pattern = r"^timestamp"
         elif isinstance(data_source, BigQuerySource):
             ts_column_type_regex_pattern = "TIMESTAMP|DATETIME"
@@ -97,7 +100,7 @@ def update_data_sources_with_inferred_event_timestamp_col(
                 "DataSource",
                 """
                 DataSource inferencing of event_timestamp_column is currently only supported
-                for FileSource and BigQuerySource.
+                for FileSource, SparkSource, BigQuerySource, RedshiftSource, and SnowflakeSource.
                 """,
             )
     # for informing the type checker
```
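The inference change above keys the timestamp-column regex off the source type: file-based sources (and now `SparkSource`) match lowercase column types like `timestamp`, while BigQuery matches `TIMESTAMP|DATETIME`. The following is a simplified, hypothetical sketch of that dispatch using string source kinds and a plain schema dict; Feast's actual code works on real `DataSource` objects via `isinstance` checks.

```python
# Simplified sketch of per-source timestamp-type regex dispatch, mirroring
# update_data_sources_with_inferred_event_timestamp_col. Hypothetical helpers
# over string source kinds, not Feast's actual API.
import re


def ts_pattern_for(source_kind: str) -> str:
    # File-based sources and Spark expose lowercase types like "timestamp";
    # BigQuery exposes uppercase "TIMESTAMP" / "DATETIME".
    if source_kind in ("file", "spark"):
        return r"^timestamp"
    elif source_kind == "bigquery":
        return "TIMESTAMP|DATETIME"
    raise ValueError(f"unsupported source kind: {source_kind}")


def infer_event_timestamp_col(source_kind: str, schema: dict) -> str:
    """Return the first column whose type matches the source's pattern."""
    pattern = ts_pattern_for(source_kind)
    for col, col_type in schema.items():
        if re.match(pattern, col_type):
            return col
    raise ValueError("no timestamp column found")


schema = {"driver_id": "bigint", "event_timestamp": "timestamp"}
print(infer_event_timestamp_col("spark", schema))  # event_timestamp
```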

sdk/python/feast/infra/offline_stores/contrib/spark_offline_store/__init__.py

Whitespace-only changes.

0 commit comments
