Commit 729a06a

zhilingc authored and committed
Add documentation
1 parent 3f66f0e commit 729a06a

2 files changed: +160 -0 lines changed

docs/assets/statistics-sources.png

162 KB

docs/user-guide/statistics.md

Lines changed: 160 additions & 0 deletions
@@ -0,0 +1,160 @@
# Statistics

Data is a first-class citizen in machine learning projects, so it is critical to have tests and validations around it. To that end, Feast provides feature statistics that give users visibility into the data ingested into the system.

![overview](../assets/statistics-sources.png)

Feast exposes feature statistics at two points in the Feast system:

1. Inflight feature statistics from the population job
2. Historical feature statistics from the warehouse stores

## Historical Feature Statistics

Feast supports the computation of feature statistics over data already written to warehouse stores. These feature statistics, which can be retrieved over distinct sets of historical data, are fully compatible with [TFX's Data Validation](https://tensorflow.google.cn/tfx/tutorials/data_validation/tfdv_basic).

### Retrieving Statistics

Statistics can be retrieved from Feast using the Python SDK's `get_statistics` method. This requires a connection to Feast Core.

Feature statistics can be retrieved for a single feature set, from a single valid warehouse store. Users can retrieve feature statistics for a discrete subset of data by providing an `ingestion_id`, a unique id generated for a dataset when it is ingested into Feast:

```{python}
# A unique ingestion id is returned for each batch ingestion
ingestion_id = client.ingest(feature_set, df)

stats = client.get_statistics(
    feature_set_id='project/feature_set',
    store='warehouse',
    features=['feature_1', 'feature_2'],
    ingestion_ids=[ingestion_id])
```

Alternatively, select data within a time range by providing a `start_date` and `end_date` (the start date is inclusive, the end date is exclusive):

```{python}
from datetime import datetime

# Covers 2020-10-01 only: start_date is inclusive, end_date is exclusive
start_date = datetime(2020, 10, 1, 0, 0, 0)
end_date = datetime(2020, 10, 2, 0, 0, 0)

stats = client.get_statistics(
    feature_set_id='project/feature_set',
    store='warehouse',
    features=['feature_1', 'feature_2'],
    start_date=start_date,
    end_date=end_date)
```

{% hint style="info" %}
Although `get_statistics` accepts Python `datetime` objects for `start_date` and `end_date`, statistics are computed at day granularity.
{% endhint %}

Note that when a time range is provided, Feast will NOT filter out duplicated rows. It is therefore highly recommended to provide `ingestion_ids` whenever possible.

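As a rough illustration of the inclusive start, exclusive end, and day granularity described above, the sketch below lists the calendar days that a given range would cover. `days_covered` is a hypothetical helper, not part of the Feast SDK, and it assumes that the time-of-day component of both bounds is simply dropped; the source does not spell out how Feast rounds the bounds.

```python
from datetime import date, datetime, timedelta
from typing import List


def days_covered(start_date: datetime, end_date: datetime) -> List[date]:
    """List the calendar days in [start_date, end_date) at day granularity.

    Hypothetical helper, not part of the Feast SDK. It assumes that the
    time-of-day component is dropped from both bounds.
    """
    first, last = start_date.date(), end_date.date()
    # range() is empty when the end day is not after the start day
    return [first + timedelta(days=i) for i in range((last - first).days)]


# Under the assumption above, this range covers 1 and 2 October 2020
print(days_covered(datetime(2020, 10, 1, 9, 30), datetime(2020, 10, 3, 18, 0)))
```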
Feast returns the statistics in the form of the protobuf [DatasetFeatureStatisticsList](https://github.com/tensorflow/metadata/blob/master/tensorflow_metadata/proto/v0/statistics.proto#L36), which can subsequently be passed to TFDV methods to [validate the dataset](https://www.tensorflow.org/tfx/data_validation/get_started#checking_the_data_for_errors):

```{python}
import tensorflow_data_validation as tfdv

# Compare the retrieved statistics against the schema exported from the feature set
anomalies = tfdv.validate_statistics(
    statistics=stats, schema=feature_set.export_tfx_schema())
tfdv.display_anomalies(anomalies)
```

Or [visualise the statistics](https://www.tensorflow.org/tfx/data_validation/get_started#computing_descriptive_data_statistics) in [Facets](https://github.com/PAIR-code/facets):

```{python}
tfdv.visualize_statistics(stats)
```

Refer to the [example notebook](https://github.com/feast-dev/feast/blob/master/examples/statistics/Historical%20Feature%20Statistics%20with%20Feast,%20TFDV%20and%20Facets.ipynb) for an end-to-end example showcasing Feast's integration with TFDV and Facets.

### Aggregating Statistics

Feast supports retrieval of feature statistics across multiple datasets or days.

```{python}
stats = client.get_statistics(
    feature_set_id='project/feature_set',
    store='warehouse',
    features=['feature_1', 'feature_2'],
    ingestion_ids=[ingestion_id_1, ingestion_id_2])
```

However, when querying across multiple datasets, Feast computes the statistics for each dataset independently (for caching purposes) and then aggregates the results. Statistics that cannot be aggregated, such as medians, uniqueness counts, and histograms, are dropped in the process.

Refer to the table in the Supported Statistics section below for the list of statistics that will be dropped.

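To see why some statistics survive aggregation and others do not, consider the minimal sketch below. It is not Feast's implementation: `Summary`, `summarise`, and `merge` are hypothetical names, and the merge uses the standard pairwise formula for combining means and variances. Counts, minima, maxima, means, and standard deviations can be combined from per-dataset summaries alone, whereas the median of the combined rows cannot be derived from the per-dataset medians.

```python
import math
import statistics
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Summary:
    """Per-dataset summary made only of statistics that can be merged."""
    count: int
    minimum: float
    maximum: float
    mean: float
    m2: float  # sum of squared deviations from the mean

    @property
    def stdev(self) -> float:
        # Population standard deviation
        return math.sqrt(self.m2 / self.count)


def summarise(values: Sequence[float]) -> Summary:
    mean = sum(values) / len(values)
    m2 = sum((v - mean) ** 2 for v in values)
    return Summary(len(values), min(values), max(values), mean, m2)


def merge(a: Summary, b: Summary) -> Summary:
    """Combine two summaries without revisiting the underlying rows."""
    count = a.count + b.count
    delta = b.mean - a.mean
    mean = a.mean + delta * b.count / count
    m2 = a.m2 + b.m2 + delta ** 2 * a.count * b.count / count
    return Summary(count, min(a.minimum, b.minimum),
                   max(a.maximum, b.maximum), mean, m2)


day_1 = [1.0, 2.0, 3.0]
day_2 = [10.0, 20.0, 30.0, 40.0]
merged = merge(summarise(day_1), summarise(day_2))

# The merged summary matches a direct computation over all rows.
assert math.isclose(merged.mean, statistics.fmean(day_1 + day_2))
assert math.isclose(merged.stdev, statistics.pstdev(day_1 + day_2))

# The median cannot be recovered the same way: the per-dataset medians are
# 2.0 and 25.0, but the median of the combined rows is 10.0.
print(statistics.median(day_1), statistics.median(day_2),
      statistics.median(day_1 + day_2))
```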
### Caching

Feast caches the results of all feature statistics requests and, by default, returns the cached results. To recompute previously computed feature statistics, set `force_refresh` to `True` when retrieving the statistics:

```{python}
stats = client.get_statistics(
    feature_set_id='project/feature_set',
    store='warehouse',
    features=['feature_1', 'feature_2'],
    ingestion_ids=[ingestion_id],
    force_refresh=True)
```

This forces Feast to recompute the statistics and replace any previously cached values.

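The caching behaviour described above can be pictured with a toy in-memory cache. This is only a sketch under stated assumptions: `StatisticsCache`, its `get` method, and the shape of the cache key are hypothetical and say nothing about how Feast stores cached statistics internally.

```python
from typing import Callable, Dict, Hashable, List


class StatisticsCache:
    """Toy in-memory cache; illustrative only, not Feast's implementation."""

    def __init__(self) -> None:
        self._results: Dict[Hashable, dict] = {}

    def get(self, key: Hashable, compute: Callable[[], dict],
            force_refresh: bool = False) -> dict:
        # Recompute on a cache miss, or when the caller explicitly asks for it.
        if force_refresh or key not in self._results:
            self._results[key] = compute()
        return self._results[key]


calls: List[str] = []


def compute_stats() -> dict:
    # Stand-in for an expensive statistics computation
    calls.append("computed")
    return {"feature_1": {"mean": 0.5}}


cache = StatisticsCache()
key = ("project/feature_set", "warehouse", "ingestion-1")  # hypothetical cache key

cache.get(key, compute_stats)                      # miss: computes
cache.get(key, compute_stats)                      # hit: served from the cache
cache.get(key, compute_stats, force_refresh=True)  # recomputes and overwrites
print(len(calls))  # number of times the statistics were actually computed
```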
### Supported Statistics

Feast supports most, but not all, of the feature statistics defined in TFX's [FeatureNameStatistics](https://github.com/tensorflow/metadata/blob/master/tensorflow_metadata/proto/v0/statistics.proto#L147). For the definition of each statistic and information about how each one is computed, refer to the [protobuf definition](https://github.com/tensorflow/metadata/blob/master/tensorflow_metadata/proto/v0/statistics.proto#L147).

| Type | Statistic | Supported | Aggregatable |
| --- | --- | --- | --- |
| Common | NumNonMissing | ✔ | ✔ |
| | NumMissing | ✔ | ✔ |
| | MinNumValues | ✔ | ✔ |
| | MaxNumValues | ✔ | ✔ |
| | AvgNumValues | ✔ | ✔ |
| | TotalNumValues | ✔ | ✔ |
| | NumValuesHist | | |
| Numeric | Min | ✔ | ✔ |
| | Max | ✔ | ✔ |
| | Median | ✔ | |
| | Mean | ✔ | ✔ |
| | Stdev | ✔ | ✔ |
| | NumZeroes | ✔ | ✔ |
| | Quantiles | ✔ | |
| | Histogram | ✔ | |
| String | RankHistogram | ✔ | |
| | TopValues | ✔ | |
| | Unique | ✔ | |
| | AvgLength | ✔ | ✔ |
| Bytes | MinNumBytes | ✔ | ✔ |
| | MaxNumBytes | ✔ | ✔ |
| | AvgNumBytes | ✔ | ✔ |
| | Unique | ✔ | |
| Struct/List | - (uses common statistics only) | - | - |

## Inflight Feature Statistics

For insight into data currently flowing into Feast through the population jobs, [statsd](https://github.com/statsd/statsd) is used to capture feature value statistics.

Inflight feature statistics are windowed (the default window length is 30s) and computed at two points in the feature population pipeline:

1. Prior to store writes, after successful validation
2. After successful store writes

The following metrics are written at the end of each window as [statsd gauges](https://github.com/statsd/statsd/blob/master/docs/metric_types.md#gauges):

```
feast_ingestion_feature_value_min
feast_ingestion_feature_value_max
feast_ingestion_feature_value_mean
feast_ingestion_feature_value_percentile_25
feast_ingestion_feature_value_percentile_50
feast_ingestion_feature_value_percentile_90
feast_ingestion_feature_value_percentile_95
feast_ingestion_feature_value_percentile_99
```

{% hint style="info" %}
The gauge metric type is used instead of the histogram type because statsd only supports positive values for histogram metrics, while numerical feature values can be any double value.
{% endhint %}

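The per-window gauges listed above could be computed along the following lines. This is a minimal sketch, not the population job's code: `window_gauges` and `percentile` are hypothetical helpers, and the nearest-rank percentile method is an assumption, since the source does not say which method is used.

```python
from typing import Dict, List

PREFIX = "feast_ingestion_feature_value_"
PERCENTILES = (25, 50, 90, 95, 99)


def percentile(sorted_values: List[float], p: int) -> float:
    """Nearest-rank percentile (an assumed method, see the note above)."""
    # Integer ceiling of p * n / 100, clamped to at least rank 1
    rank = max(1, (p * len(sorted_values) + 99) // 100)
    return sorted_values[rank - 1]


def window_gauges(values: List[float]) -> Dict[str, float]:
    """Summarise one window of values for a single feature."""
    ordered = sorted(values)
    gauges = {
        PREFIX + "min": ordered[0],
        PREFIX + "max": ordered[-1],
        PREFIX + "mean": sum(ordered) / len(ordered),
    }
    for p in PERCENTILES:
        gauges[f"{PREFIX}percentile_{p}"] = percentile(ordered, p)
    return gauges


# Feature values collected during one window; negative values are allowed.
window = [-2.0, 0.5, 1.0, 3.0, 4.5, 7.0, 8.0, 9.5, 12.0, 20.0]
for name, value in window_gauges(window).items():
    print(name, value)
```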
The metrics are tagged with, and can be aggregated by, the following keys:

| key | description |
| --- | --- |
| feast_store | store the population job is writing to |
| feast_project_name | feast project name |
| feast_featureSet_name | feature set name |
| feast_feature_name | feature name |
| ingestion_job_name | id of the population job writing the feature values |
