Various documentation fix #477
Merged
29 commits
57574ed
Update README.md
xiaoyongzhu 5cf8c41
Merge branch 'main' into xiaoyzhu/doc_fix2
xiaoyongzhu 91533d3
update docs per feedback
xiaoyongzhu e1c16c6
Update feathr-concepts-for-beginners.md
xiaoyongzhu 1fa08a7
Update feathr-concepts-for-beginners.md
xiaoyongzhu e1b0aec
update materialization setting doc
xiaoyongzhu 1f26445
Update get-offline-features.md
xiaoyongzhu c752e01
Update get-offline-features.md
xiaoyongzhu 132e1bd
Update feathr-concepts-for-beginners.md
xiaoyongzhu 85a1c44
resolve comments
xiaoyongzhu 3f26272
Update job_utils.py
xiaoyongzhu bea2a17
fix typos
xiaoyongzhu 16b5c6d
Update job_utils.py
xiaoyongzhu f2e06e0
Update client.py
xiaoyongzhu 7d9d488
format doc
xiaoyongzhu 283f06d
Merge branch 'main' into xiaoyzhu/doc_fix2
xiaoyongzhu f7bdc21
Address comments
xiaoyongzhu 1e12b97
Update WriteToHDFSOutputProcessor.scala
xiaoyongzhu 2bb7369
Update WriteToHDFSOutputProcessor.scala
xiaoyongzhu f38a903
resolve comments
xiaoyongzhu d17c5fa
Resolve comments
xiaoyongzhu e8266da
Merge branch 'main' into xiaoyzhu/doc_fix2
xiaoyongzhu 64760ce
fix test failures and typos
xiaoyongzhu 8916be4
Update job_utils.py
xiaoyongzhu 9a213ba
fix comments and formats/typos
xiaoyongzhu 8edb706
fix typos and test failures
xiaoyongzhu 5a97bcd
update test names
xiaoyongzhu 2f00667
Update test_fixture.py
xiaoyongzhu e0c7427
Update test_fixture.py
xiaoyongzhu
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,87 @@ | ||
| --- | ||
| layout: default | ||
| title: Getting Offline Features using Feature Query | ||
| parent: Feathr Concepts | ||
| --- | ||
|
|
||
| # Getting Offline Features using Feature Query | ||
|
|
||
| ## Intuitions | ||
|
|
||
| After the feature producers have defined the features (as described in the [Feature Definition](./feature-definition.md) part), the feature consumers may want to consume those features. | ||
|
|
||
| For example, in the dataset below, feature producers want to extract features from three tables: `user_profile_mock_data`, `user_purchase_history_mock_data`, and `product_detail_mock_data`. | ||
|
|
||
| Feature consumers usually start from a central dataset (the "observation data", `user_observation_mock_data` in this case) that contains a couple of IDs (`user_id` and `product_id` in this case), timestamps, and other columns. They use this observation data to query the different feature tables (using the `Feature Query` below). | ||
|
|
||
|  | ||
|
|
||
| As we can see, the use case for getting offline features using Feathr is straightforward. Feature consumers want to get a few features for a particular user: what is the gift card balance? What is the total purchase in the last 90 days? They can also get features for other entities in the same `get_offline_features` call, using a separate `FeatureQuery` per entity. For example, they can query product features such as product quantity and price at the same time. | ||
|
|
||
| In this case, Feathr users simply specify the names of the features they want to query and the entity/key to query them on, as shown below. Note that feature consumers don't have to query all the features; they can query just a subset of the features that the feature producers have defined. | ||
|
|
||
| ```python | ||
| from feathr import FeatureQuery | ||
| | ||
| # `user_id` is assumed to be the key object defined together with the user features. | ||
| user_feature_query = FeatureQuery( | ||
|     feature_list=["feature_user_age", | ||
|                   "feature_user_tax_rate", | ||
|                   "feature_user_gift_card_balance", | ||
|                   "feature_user_has_valid_credit_card", | ||
|                   "feature_user_total_purchase_in_90days", | ||
|                   "feature_user_purchasing_power" | ||
|                   ], | ||
|     key=user_id) | ||
|
|
||
| product_feature_query = FeatureQuery( | ||
|     feature_list=[ | ||
|         "feature_product_quantity", | ||
|         "feature_product_price" | ||
|     ], | ||
|     key=product_id) | ||
| ``` | ||
|
|
||
| Then specify the location of the observation data: | ||
|
|
||
| ```python | ||
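| from feathr import ObservationSettings | ||
| | ||
| # The timestamp column and its format describe the event time of each observation row. | ||
| settings = ObservationSettings( | ||
|     observation_path="wasbs://public@azurefeathrstorage.blob.core.windows.net/sample_data/product_recommendation_sample/user_observation_mock_data.csv", | ||
|     event_timestamp_column="event_timestamp", | ||
|     timestamp_format="yyyy-MM-dd") | ||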
| ``` | ||
|
|
||
| Finally, pass the feature queries to the client and trigger the computation: | ||
|
|
||
| ```python | ||
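| # Assumes `client` is an already-configured FeathrClient and `output_path` is a writable location. | ||
| client.get_offline_features(observation_settings=settings, | ||
|                             feature_query=[user_feature_query, product_feature_query], | ||
|                             output_path=output_path) | ||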
|
|
||
| ``` | ||
|
|
||
| More details on the above APIs can be found in: | ||
|
|
||
| - [ObservationSettings API doc](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.ObservationSettings) | ||
| - [client.get_offline_features API doc](https://feathr.readthedocs.io/en/latest/feathr.html#feathr.FeathrClient.get_offline_features) | ||
|
|
||
| ## More on `Observation data` | ||
|
|
||
| The observation data is the dataset that serves as the 'spine' of the to-be-created training dataset; its path is passed as `observation_path`. We call this input 'spine' dataset the 'observation' dataset. Typically, each row of the observation data contains: | ||
|
|
||
| 1. **Entity ID Column:** Column(s) representing entity ID(s), which will be used as the join key to query feature values. | ||
|
|
||
| 2. **Timestamp Column:** A column representing the event time of the row. By default, Feathr will make sure the feature values queried have a timestamp earlier than the timestamp in observation data, ensuring no data leakage in the resulting training dataset. Refer to [Point in time Joins](./point-in-time-join.md) for more details. | ||
|
|
||
| 3. **Other columns** are simply passed through to the output training dataset and can be treated as immutable columns. | ||
|
|
||
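The three column roles above can be illustrated with a minimal, pure-Python sketch of a point-in-time lookup. This is not Feathr's implementation: the in-memory history, the helper names `point_in_time_value` and `join_features`, and the strict "earlier than" boundary are all assumptions made for illustration.

```python
from bisect import bisect_left
from datetime import date

# Hypothetical feature history per user_id: (timestamp, value), sorted by timestamp.
gift_card_balance_history = {
    1: [(date(2021, 1, 1), 100), (date(2021, 3, 1), 250), (date(2021, 6, 1), 40)],
    2: [(date(2021, 2, 15), 10)],
}

def point_in_time_value(history, event_time):
    """Return the latest value recorded strictly before event_time, or None."""
    timestamps = [ts for ts, _ in history]
    # bisect_left gives the count of entries with timestamp < event_time.
    idx = bisect_left(timestamps, event_time)
    return history[idx - 1][1] if idx > 0 else None

def join_features(rows, feature_history, feature_name):
    """Attach one feature to each observation row; other columns pass through unchanged."""
    joined = []
    for row in rows:
        history = feature_history.get(row["user_id"], [])
        value = point_in_time_value(history, row["event_timestamp"])
        joined.append({**row, feature_name: value})
    return joined

# Observation rows: entity ID column, timestamp column, and a pass-through column.
observation_rows = [
    {"user_id": 1, "event_timestamp": date(2021, 4, 1), "label": 1},
    {"user_id": 2, "event_timestamp": date(2021, 1, 1), "label": 0},
]

training_rows = join_features(
    observation_rows, gift_card_balance_history, "feature_user_gift_card_balance"
)
```

In this sketch the first row picks up the value recorded on 2021-03-01 rather than the later one from 2021-06-01, and the second row gets no value because the only recorded balance is newer than the observation.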
| ## More on `Feature Query` | ||
|
|
||
| After you have defined all the features, you probably don't want to use all of them in a particular program. In this case, instead of putting every feature in the `FeatureQuery`, you can put just a selected list of features. Note that all features in a single `FeatureQuery` must share the same key. | ||
|
|
||
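One way to respect the same-key constraint is to split a requested feature list into one group per key before building the queries. The sketch below is an assumption-laden illustration: `FEATURE_KEYS` and `group_features_by_key` are hypothetical helpers, not part of Feathr.

```python
from collections import defaultdict

# Hypothetical registry mapping each feature name to the key it is defined on.
FEATURE_KEYS = {
    "feature_user_age": "user_id",
    "feature_user_gift_card_balance": "user_id",
    "feature_product_quantity": "product_id",
    "feature_product_price": "product_id",
}

def group_features_by_key(requested):
    """Split requested feature names into one list per key, preserving request order."""
    groups = defaultdict(list)
    for name in requested:
        if name not in FEATURE_KEYS:
            raise KeyError(f"unknown feature: {name}")
        groups[FEATURE_KEYS[name]].append(name)
    return dict(groups)

requested = [
    "feature_user_age",
    "feature_product_price",
    "feature_user_gift_card_balance",
]
grouped = group_features_by_key(requested)
```

Each resulting group would then back one `FeatureQuery`, for example one keyed on `user_id` and one keyed on `product_id`.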
| ## Difference between `materialize_features` and `get_offline_features` API | ||
|
|
||
|
|
||
| "Getting offline features" as described in this document is sometimes confused with "[getting materialized features](./materializing-features.md)", since both seem to "get features and put them somewhere". However, there are some differences, and you should know when to use which: | ||
|
|
||
| 1. For the `get_offline_features` API, feature consumers usually need a central `observation data` so they can use `Feature Query` to query different features for different entities from different tables. For the `materialize_features` API, feature consumers don't have `observation data`, because they don't need to query from existing feature definitions. In this case, feature consumers only need to specify, for a given entity (say `user_id`), which features they want to materialize to the offline or online store. Note that for a feature table in the materialization settings, feature consumers can only materialize features that share the same key. | ||
|
|
||
| 2. For timestamps, the `get_offline_features` API makes sure the feature values queried have a timestamp earlier than the timestamp in the observation data, ensuring no data leakage in the resulting training dataset. The `materialize_features` API always materializes the latest feature values available in the dataset. | ||
|
|
||
| 3. The two APIs are used in two different stages of the feature engineering pipeline and serve different purposes. `get_offline_features` is usually used to get data for model training and focuses on getting historical data from offline storage, while `materialize_features` is usually used to pre-compute features for model inference via an online store. | ||
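The timestamp difference between the two APIs can be sketched in plain Python. This is only an illustration under stated assumptions: the rows, `latest_per_key`, and `value_before` are hypothetical and do not reflect how Feathr computes either result.

```python
from datetime import date

# Hypothetical feature rows: (user_id, timestamp, total purchase in the last 90 days).
feature_rows = [
    (1, date(2021, 1, 1), 120.0),
    (1, date(2021, 5, 1), 300.0),
    (2, date(2021, 2, 1), 45.0),
]

def latest_per_key(rows):
    """Materialization-style view: keep only the most recent value per key."""
    latest = {}
    for key, ts, value in rows:
        if key not in latest or ts > latest[key][0]:
            latest[key] = (ts, value)
    return {key: value for key, (_, value) in latest.items()}

def value_before(rows, key, event_time):
    """Training-style view: most recent value for `key` strictly before event_time."""
    candidates = [(ts, value) for k, ts, value in rows if k == key and ts < event_time]
    return max(candidates)[1] if candidates else None

# What an online store would hold after materialization in this sketch.
online_view = latest_per_key(feature_rows)

# What a training row observed on 2021-03-01 would see for user 1.
training_value = value_before(feature_rows, 1, date(2021, 3, 1))
```

Here `online_view` holds the May value for user 1, while the training-style lookup for an observation dated 2021-03-01 returns the January value, which is what prevents leakage of later data into the training set.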