Merged
29 commits
57574ed
Update README.md
xiaoyongzhu Jun 28, 2022
5cf8c41
Merge branch 'main' into xiaoyzhu/doc_fix2
xiaoyongzhu Jul 15, 2022
91533d3
update docs per feedback
xiaoyongzhu Jul 15, 2022
e1c16c6
Update feathr-concepts-for-beginners.md
xiaoyongzhu Jul 15, 2022
1fa08a7
Update feathr-concepts-for-beginners.md
xiaoyongzhu Jul 15, 2022
e1b0aec
update materialization setting doc
xiaoyongzhu Jul 19, 2022
1f26445
Update get-offline-features.md
xiaoyongzhu Jul 19, 2022
c752e01
Update get-offline-features.md
xiaoyongzhu Jul 19, 2022
132e1bd
Update feathr-concepts-for-beginners.md
xiaoyongzhu Jul 19, 2022
85a1c44
resolve comments
xiaoyongzhu Jul 20, 2022
3f26272
Update job_utils.py
xiaoyongzhu Jul 20, 2022
bea2a17
fix typos
xiaoyongzhu Jul 20, 2022
16b5c6d
Update job_utils.py
xiaoyongzhu Jul 20, 2022
f2e06e0
Update client.py
xiaoyongzhu Jul 21, 2022
7d9d488
format doc
xiaoyongzhu Jul 30, 2022
283f06d
Merge branch 'main' into xiaoyzhu/doc_fix2
xiaoyongzhu Jul 30, 2022
f7bdc21
Address comments
xiaoyongzhu Jul 30, 2022
1e12b97
Update WriteToHDFSOutputProcessor.scala
xiaoyongzhu Jul 30, 2022
2bb7369
Update WriteToHDFSOutputProcessor.scala
xiaoyongzhu Jul 30, 2022
f38a903
resolve comments
xiaoyongzhu Aug 1, 2022
d17c5fa
Resolve comments
xiaoyongzhu Aug 1, 2022
e8266da
Merge branch 'main' into xiaoyzhu/doc_fix2
xiaoyongzhu Aug 1, 2022
64760ce
fix test failures and typos
xiaoyongzhu Aug 1, 2022
8916be4
Update job_utils.py
xiaoyongzhu Aug 1, 2022
9a213ba
fix comments and formats/typos
xiaoyongzhu Aug 1, 2022
8edb706
fix typos and test failures
xiaoyongzhu Aug 1, 2022
5a97bcd
update test names
xiaoyongzhu Aug 1, 2022
2f00667
Update test_fixture.py
xiaoyongzhu Aug 1, 2022
e0c7427
Update test_fixture.py
xiaoyongzhu Aug 1, 2022
14 changes: 12 additions & 2 deletions docs/concepts/feathr-concepts-for-beginners.md
Original file line number Diff line number Diff line change
Expand Up @@ -126,9 +126,19 @@ client.get_online_features(feature_table = "agg_features",
## Illustration

An illustration of the concepts and process that we talked about is like this:
![Feature Join Process](../images/observation_data.jpg)
![Observation Data and Feature Query Process](../images/observation_data.jpg)

## Point in time joins and aggregations
## FAQs on the Concepts

### A bit more on `Observation Data`

"Observation Data" is a concept that some beginners find a bit confusing. Simply think of it as an immutable dataset that can be enhanced by other datasets. For example, you usually cannot drop a column from your "observation data", but you can add additional columns to it.

### What's the relationship between `Source` and `Anchor`?

Usually an `Anchor` can have only one `Source`, but one `Source` can be consumed by different anchors. Between `Source` and `Anchor` there might be an intermediate step, the "preprocessing" function, which allows you to customize the input a bit.
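
The relationship can be pictured with a small toy model. This is only a sketch: `ToySource`, `ToyAnchor`, and `drop_negative_amounts` are hypothetical names invented for illustration and are not Feathr classes or functions. It shows one source shared by two anchors, with an optional preprocessing step applied when the source is loaded.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Rows = List[Dict[str, int]]


@dataclass
class ToySource:
    """Hypothetical stand-in for a source: raw rows plus optional preprocessing."""
    name: str
    rows: Rows
    preprocessing: Optional[Callable[[Rows], Rows]] = None

    def load(self) -> Rows:
        # Copy the raw rows, then apply the preprocessing step if one is set.
        rows = [dict(r) for r in self.rows]
        return self.preprocessing(rows) if self.preprocessing else rows


@dataclass
class ToyAnchor:
    """Hypothetical stand-in for an anchor: exactly one source, many features."""
    name: str
    source: ToySource
    features: List[str]


def drop_negative_amounts(rows: Rows) -> Rows:
    # Example preprocessing: filter out refunds before features are computed.
    return [r for r in rows if r["amount"] >= 0]


purchases = ToySource(
    name="purchases",
    rows=[{"user_id": 1, "amount": 20}, {"user_id": 1, "amount": -5}],
    preprocessing=drop_negative_amounts,
)

# Two anchors consume the same source; each anchor has only one source.
spend_anchor = ToyAnchor("spend_features", purchases, ["total_spend"])
count_anchor = ToyAnchor("count_features", purchases, ["purchase_count"])
```

Both anchors see the preprocessed rows from `purchases.load()`, which is the point of placing the preprocessing step on the source side.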

### Point in time joins and aggregations - why do we need them?

Most users are already familiar with "regular" joins, for example inner joins or outer joins. In many use cases, however, we also care about time.

94 changes: 0 additions & 94 deletions docs/concepts/feature-join.md

This file was deleted.

87 changes: 87 additions & 0 deletions docs/concepts/get-offline-features.md
@@ -0,0 +1,87 @@
---
layout: default
title: Getting Offline Features using Feature Query
parent: Feathr Concepts
---

# Getting Offline Features using Feature Query

## Intuitions

After the feature producers have defined the features (as described in the [Feature Definition](./feature-definition.md) part), the feature consumers may want to consume those features.

For example, suppose the dataset looks like the one below, where there are 3 tables that feature producers want to extract features from: `user_profile_mock_data`, `user_purchase_history_mock_data`, and `product_detail_mock_data`.

Feature consumers will usually use a central dataset ("observation data", `user_observation_mock_data` in this case) which contains a couple of IDs (`user_id` and `product_id` in this case), timestamps, and other columns. Feature consumers will use this "observation data" to query from different feature tables (using `Feature Query` below).

![Feature Flow](https://github.com/linkedin/feathr/blob/main/docs/images/product_recommendation_advanced.jpg?raw=true)

As we can see, the use case for getting offline features using Feathr is straightforward. Feature consumers want to get a few features: for a particular user, what's the gift card balance? What's the total purchase in the last 90 days? Feature consumers can also get a few features for other entities in the same `Feature Query`. For example, at the same time, feature consumers can also query product features such as product quantity and price.

In this case, Feathr users can simply specify the names of the features they want to query and the entity/key they want to query on, like below. Note that feature consumers don't have to query all the features; instead, they can query just a subset of the features that the feature producers have defined.

```python
from feathr import FeatureQuery

# `user_id` and `product_id` are the key objects created alongside the
# feature definitions; they are not defined in this snippet.
user_feature_query = FeatureQuery(
    feature_list=[
        "feature_user_age",
        "feature_user_tax_rate",
        "feature_user_gift_card_balance",
        "feature_user_has_valid_credit_card",
        "feature_user_total_purchase_in_90days",
        "feature_user_purchasing_power",
    ],
    key=user_id)

product_feature_query = FeatureQuery(
    feature_list=[
        "feature_product_quantity",
        "feature_product_price",
    ],
    key=product_id)
```

And specify the location for the observation data:

```python
settings = ObservationSettings(
    observation_path="wasbs://public@azurefeathrstorage.blob.core.windows.net/sample_data/product_recommendation_sample/user_observation_mock_data.csv",
    event_timestamp_column="event_timestamp",
    timestamp_format="yyyy-MM-dd")
```
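
Before submitting a job, it can help to check locally that the timestamp column really matches the declared format. The sketch below assumes that `yyyy-MM-dd` is a Java/Spark-style date pattern (the source does not say so explicitly) and that its Python `strptime` counterpart is `%Y-%m-%d`. The helper `matches_format` and the sample values are hypothetical.

```python
from datetime import datetime

# Assumption: "yyyy-MM-dd" is a Java/Spark-style pattern; "%Y-%m-%d" is the
# closest Python strptime equivalent.
PYTHON_FORMAT = "%Y-%m-%d"


def matches_format(value: str, fmt: str = PYTHON_FORMAT) -> bool:
    """Return True if `value` parses with `fmt`, False otherwise."""
    try:
        datetime.strptime(value, fmt)
        return True
    except ValueError:
        return False


# Hypothetical values as they might appear in the `event_timestamp` column.
sample_values = ["2021-01-01", "2021/01/01", "01-01-2021"]
check = {value: matches_format(value) for value in sample_values}
```

Any value flagged `False` here would be worth fixing in the observation file, or the `timestamp_format` setting should be changed to match the data.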

Finally, pass in the feature queries and trigger the computation:

```python
client.get_offline_features(observation_settings=settings,
                            feature_query=[user_feature_query, product_feature_query],
                            output_path=output_path)
```

More details on the above APIs can be found in:

- [ObservationSettings API doc](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.ObservationSettings)
- [client.get_offline_features API doc](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.FeathrClient.get_offline_features)

## More on `Observation data`

The observation data is specified as the path of a dataset that serves as the 'spine' for the to-be-created training dataset. We call this input 'spine' dataset the 'observation' dataset. Typically, each row of the observation data contains:

1. **Entity ID Column:** Column(s) representing entity id(s), which will be used as the join key to query feature values.

2. **Timestamp Column:** A column representing the event time of the row. By default, Feathr will make sure the feature values queried have a timestamp earlier than the timestamp in observation data, ensuring no data leakage in the resulting training dataset. Refer to [Point in time Joins](./point-in-time-join.md) for more details.

3. **Other columns:** These are simply passed through to the output training dataset and can be treated as immutable columns.
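
The three kinds of columns can be illustrated with a small pure-Python sketch of a point-in-time lookup. This is not how Feathr implements the join; the data, the helper `point_in_time_value`, and the strict "earlier than" boundary are assumptions chosen to mirror the description above.

```python
from datetime import date

# Hypothetical feature history: (user_id, feature timestamp, gift card balance).
feature_rows = [
    (1, date(2021, 1, 1), 100),
    (1, date(2021, 3, 1), 40),
    (2, date(2021, 2, 1), 75),
]

# Hypothetical observation rows: entity ID, event timestamp, and one
# pass-through column ("label").
observation_rows = [
    {"user_id": 1, "event_timestamp": date(2021, 2, 15), "label": 1},
    {"user_id": 1, "event_timestamp": date(2021, 4, 1), "label": 0},
    {"user_id": 2, "event_timestamp": date(2021, 1, 15), "label": 1},
]


def point_in_time_value(user_id, event_timestamp):
    """Return the most recent feature value strictly before event_timestamp."""
    candidates = [
        (ts, value)
        for uid, ts, value in feature_rows
        if uid == user_id and ts < event_timestamp
    ]
    # max() picks the candidate with the latest timestamp; None if nothing qualifies.
    return max(candidates)[1] if candidates else None


# Each output row keeps every observation column and gains one feature column.
training_rows = [
    {
        **row,
        "feature_user_gift_card_balance": point_in_time_value(
            row["user_id"], row["event_timestamp"]
        ),
    }
    for row in observation_rows
]
```

The third observation row gets no feature value because the only record for that user is dated after the event, which is exactly the leakage the timestamp column guards against.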

## More on `Feature Query`

After you have defined all the features, you probably don't want to use all of them in a particular program. In this case, instead of putting every feature in the `FeatureQuery`, you can put just a selected list of features. Note that all features in one `FeatureQuery` have to share the same key.
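
One way to respect the same-key constraint is to group the requested feature names by key first and then build one query per group. The sketch below is plain Python, not a Feathr API; the `feature_keys` mapping and the helper `group_by_key` are hypothetical.

```python
from collections import defaultdict

# Hypothetical mapping from feature name to the key it is defined on.
feature_keys = {
    "feature_user_age": "user_id",
    "feature_user_gift_card_balance": "user_id",
    "feature_product_price": "product_id",
    "feature_product_quantity": "product_id",
}


def group_by_key(requested):
    """Group requested feature names by key, so each group can become one query."""
    grouped = defaultdict(list)
    for name in requested:
        grouped[feature_keys[name]].append(name)
    return dict(grouped)


# Only a subset of the defined features is requested here.
queries = group_by_key(
    ["feature_user_age", "feature_product_price", "feature_user_gift_card_balance"]
)
```

Each entry in `queries` corresponds to one feature list and one key, matching the shape of the two `FeatureQuery` objects used earlier in this document.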

## Difference between `materialize_features` and `get_offline_features` API

It is easy to confuse "getting offline features" in this document with the "[getting materialized features](./materializing-features.md)" part, given that they both seem to "get features and put them somewhere". However, there are some differences, and you should know when to use which:

1. With the `get_offline_features` API, feature consumers usually need a central `observation data` so they can use `Feature Query` to query different features for different entities from different tables. With the `materialize_features` API, feature consumers don't have the `observation data`, because they don't need to query from existing feature definitions. In this case, feature consumers only need to specify, for a specific entity (say `user_id`), which features they want to materialize to the offline or online store. Note that for a feature table in the materialization settings, feature consumers can only materialize features that share the same key.

2. For the timestamps, in the `get_offline_features` API, Feathr will make sure the feature values queried have a timestamp earlier than the timestamp in observation data, ensuring no data leakage in the resulting training dataset. For the `materialize_features` API, Feathr will always materialize the latest feature available in the dataset.

3. The two APIs are used in two different stages of the feature engineering pipeline and serve different purposes. `get_offline_features` is usually used to get data for model training and focuses on getting historical data from offline storage, while `materialize_features` is usually used to pre-compute features for model inference via an online store.
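
The timestamp difference in point 2 can be made concrete with a toy comparison. The sketch below is plain Python under stated assumptions: the data and the helpers `latest_per_key` and `as_of` are hypothetical, and they only imitate the behavior described above rather than call Feathr.

```python
from datetime import date

# Hypothetical feature history: (user_id, feature timestamp, gift card balance).
feature_rows = [
    (1, date(2021, 1, 1), 100),
    (1, date(2021, 3, 1), 40),
    (2, date(2021, 2, 1), 75),
]


def latest_per_key(rows):
    """Materialization-style view: keep only the newest value for each key."""
    latest = {}
    for user_id, ts, value in rows:
        if user_id not in latest or ts > latest[user_id][0]:
            latest[user_id] = (ts, value)
    return {user_id: value for user_id, (ts, value) in latest.items()}


def as_of(rows, user_id, event_timestamp):
    """Offline-query-style view: newest value strictly before the event time."""
    candidates = [
        (ts, value)
        for uid, ts, value in rows
        if uid == user_id and ts < event_timestamp
    ]
    return max(candidates)[1] if candidates else None


materialized = latest_per_key(feature_rows)
historical = as_of(feature_rows, user_id=1, event_timestamp=date(2021, 2, 15))
```

For user 1, the materialized view holds the newest balance, while the historical lookup returns the balance that was known on the observation date. This is why the first suits online inference and the second suits building training data.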
@@ -1,16 +1,16 @@
---
layout: default
title: Feature Generation and Materialization
title: Feature Materialization (also known as feature generation)
parent: Feathr Concepts
---

# Feature Generation and Materialization
# Feature Materialization (also known as feature generation)

Feature generation (also known as feature materialization) is the process to create features from raw source data into a certain persisted storage in either offline store (for further reuse), or online store (for online inference).
Feature materialization (also known as feature generation) is the process to create features for a certain entity from raw source data into a certain persisted storage in either offline store (for further reuse), or online store (for online inference).

User can utilize feature generation to pre-compute and materialize pre-defined features to online and/or offline storage. This is desirable when the feature transformation is computation intensive or when the features can be reused (usually in offline setting). Feature generation is also useful in generating embedding features, where those embeddings distill information from large data and is usually more compact.
Users can use feature generation to pre-compute and materialize pre-defined features to online and/or offline storage. This is desirable when the feature transformation is computation intensive or when the features can be reused (usually in an offline setting). Feature generation is also useful for generating embedding features, where the embeddings distill information from large data and are usually more compact. Also, please note that you can only materialize features for a specific entity/key in the same `materialize_features` call.

## Generating Features to Online Store
## Materializing Features to Online Store

When the models are served in an online environment, we also need to serve the corresponding features in the same online environment as well. Feathr provides APIs to generate features to online storage for future consumption. For example:

Expand Down Expand Up @@ -119,7 +119,7 @@ client.materialize_features(settings, execution_configurations={ "spark.feathr.o
For reading those materialized features, Feathr has a convenient helper function called `get_result_df` to help you view the data. For example, you can use the sample code below to read from the materialized result in offline store:

```python

from feathr import get_result_df
path = "abfss://feathrazuretest3fs@feathrazuretest3storage.dfs.core.windows.net/materialize_offline_test_data/df0/daily/2020/05/20/"
res = get_result_df(client=client, format="parquet", res_url=path)
```
22 changes: 16 additions & 6 deletions docs/dev_guide/feathr_overall_release_guide.md
Expand Up @@ -4,26 +4,36 @@ title: Developer Guide for Feathr Overall Release Guide
parent: Developer Guides
---

# When to Release
- For each major and minor version release, please follow these steps.
# Feathr Overall Release Guide

This document describes the release process for the development team.

## When to Release

- For each major and minor version release, please follow these steps.
- For patch versions, there should be no releases.

# Writing Release Note
## Writing Release Note

Write a release note following past examples [here](https://github.com/linkedin/feathr/releases).
Read through the [commit log](https://github.com/linkedin/feathr/commits/main) to identify the commits made since the last release that should be included in the release note. Here are the major things to include:

- highlights of the release
- improvements and changes of this release
- new contributors of this release

## Release Maven

# Release Maven
See [Developer Guide for publishing to maven](publish_to_maven.md)

## Upload Feathr Jar

Run the command to generate the Java jar. After the jar is generated, please upload to [Azure storage](https://ms.portal.azure.com/#view/Microsoft_Azure_Storage/ContainerMenuBlade/~/overview/storageAccountId/%2Fsubscriptions%2Fa6c2a7cc-d67e-4a1a-b765-983f08c0423a%2FresourceGroups%2Fazurefeathrintegration%2Fproviders%2FMicrosoft.Storage%2FstorageAccounts%2Fazurefeathrstorage/path/public/etag/%220x8D9E6F64D62D599%22/defaultEncryptionScope/%24account-encryption-key/denyEncryptionScopeOverride//defaultId//publicAccessVal/Container) for faster access.

# Release PyPi
## Release PyPi

See [Python Package Release Note](python_package_release.md)

# Announcement
## Announcement

Please announce the release in our #general Slack channel.
20 changes: 14 additions & 6 deletions docs/dev_guide/publish_to_maven.md
Expand Up @@ -3,13 +3,15 @@ layout: default
title: Developer Guide for publishing to maven
parent: Developer Guides
---

# Developer Guide for publishing to maven

## Manual Publishing

1. Get account details to login to https://oss.sonatype.org/
2. Install GPG, setup keys, and export to a key server
```

```bash
$ gpg --gen-key
...
Real name: Central Repo Test
Expand All @@ -32,37 +34,46 @@ $ gpg --keyserver keyserver.ubuntu.com --recv-keys CA925CD6C9E8D064FF05B4728190C
If the programmatic export to the key server fails, you can export the key manually and upload it to http://keyserver.ubuntu.com/ via `submit key`.

Run the following command to generate the ASCII-armored public key needed by the key server:

```
gpg --armor --export user-id > pubkey.asc
```

https://www.linuxbabe.com/security/a-practical-guide-to-gpg-part-1-generate-your-keypair

3. Setup your credentials locally at `$HOME/.sbt/0.13/sonatype.sbt`

```
credentials += Credentials("Sonatype Nexus Repository Manager",
"oss.sonatype.org",
"(Sonatype user name)",
"(Sonatype password)")
```

(ref, https://github.com/xerial/sbt-sonatype)

4. Publish to maven via sbt
In your feathr directory, clear your cache to prevent stale errors
In your feathr directory, clear your cache to prevent stale errors

```
rm -rf target/sonatype-staging/
```

Start sbt console by running

```
sbt -java-home /Library/Java/JavaVirtualMachines/jdk1.8.0_282-msft.jdk/Contents/Home
```

Execute command in sbt console to publish to maven

```
reload
; publishSigned; sonatypeBundleRelease
```

5. "Upon release, your component will be published to Central: this typically occurs within 30 minutes, though updates to search can take up to four hours."
https://central.sonatype.org/publish/publish-guide/#releasing-to-central
https://central.sonatype.org/publish/publish-guide/#releasing-to-central

6. After new version is released via Maven, use the released version to run a test to ensure it actually works. You can do this by running a codebase that imports Feathr scala code.

Expand All @@ -72,9 +83,6 @@ https://central.sonatype.org/publish/publish-guide/#releasing-to-central

### References



https://central.sonatype.org/publish/publish-guide/#deployment

https://www.scala-sbt.org/1.x/docs/Using-Sonatype.html
