| 1 | +# Statistics |
| 2 | + |
| 3 | +Data is a first-class citizen in machine learning projects, so it is critical to have tests and validations around it. To that end, Feast exposes a range of feature statistics that give users visibility into the data that has been ingested into the system. |
| 4 | + |
| 5 | + |
| 6 | + |
| 7 | +Feast exposes feature statistics at two points in the Feast system: |
| 8 | +1. Inflight feature statistics from the population job |
| 9 | +2. Historical feature statistics from the warehouse stores |
| 10 | + |
| 11 | +## Historical Feature Statistics |
| 12 | + |
| 13 | +Feast supports the computation of feature statistics over data already written to warehouse stores. These feature statistics, which can be retrieved over distinct sets of historical data, are fully compatible with [TFX's Data Validation](https://tensorflow.google.cn/tfx/tutorials/data_validation/tfdv_basic). |
| 14 | + |
| 15 | +### Retrieving Statistics |
| 16 | + |
| 17 | +Statistics can be retrieved from Feast using the Python SDK's `get_statistics` method. This requires a connection to Feast Core. |
| 18 | + |
| 19 | +Feature statistics can be retrieved for a single feature set, from a single valid warehouse store. Users can retrieve feature statistics for a discrete subset of data by providing an `ingestion_id`, a unique ID generated for a dataset when it is ingested into Feast: |
| 20 | + |
| 21 | +```{python} |
| 22 | +# A unique ingestion id is returned for each batch ingestion |
| 23 | +ingestion_id = client.ingest(feature_set, df) |
| 24 | + |
| 25 | +stats = client.get_statistics( |
| 26 | +    feature_set_id='project/feature_set', |
| 27 | +    store='warehouse', |
| 28 | +    features=['feature_1', 'feature_2'], |
| 29 | +    ingestion_ids=[ingestion_id]) |
| 30 | +``` |
| 31 | + |
| 32 | +Alternatively, select data within a time range by providing a `start_date` and `end_date` (the start date is inclusive; the end date is exclusive): |
| 33 | + |
| 34 | +```{python} |
| 35 | +start_date = datetime(2020, 10, 1, 0, 0, 0) |
| 36 | +end_date = datetime(2020, 10, 2, 0, 0, 0) |
| 37 | + |
| 38 | +stats = client.get_statistics( |
| 39 | +    feature_set_id='project/feature_set', |
| 40 | +    store='warehouse', |
| 41 | +    features=['feature_1', 'feature_2'], |
| 42 | +    start_date=start_date, |
| 43 | +    end_date=end_date) |
| 44 | +``` |
| 45 | + |
| 46 | +{% hint style="info" %} |
| 47 | +Although `get_statistics` accepts Python `datetime` objects for `start_date` and `end_date`, statistics are computed at day granularity. |
| 48 | +{% endhint %} |
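The half-open date range described above can be illustrated with a small sketch. The helper `days_in_range` is hypothetical and not part of the Feast SDK, and dropping the time-of-day component is an assumption about what "day granularity" means here:

```python
from datetime import date, datetime, timedelta
from typing import List


def days_in_range(start: datetime, end: datetime) -> List[date]:
    """Days in the half-open interval [start, end), at day granularity.

    Hypothetical helper: the time-of-day component is simply dropped,
    which is an assumption and may differ from what Feast does internally.
    """
    first, last = start.date(), end.date()
    return [first + timedelta(days=i) for i in range((last - first).days)]


# The start date is included, the end date is not.
covered = days_in_range(datetime(2020, 10, 1), datetime(2020, 10, 2))
three_days = days_in_range(datetime(2020, 10, 1), datetime(2020, 10, 4))
```

Under these assumptions, a range from 1 October to 2 October covers only 1 October, and passing the same day as both start and end covers no days at all.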
| 49 | + |
| 50 | +Note that when providing a time range, Feast will NOT filter out duplicated rows. It is therefore highly recommended to provide `ingestion_id`s whenever possible. |
| 51 | + |
| 52 | +Feast returns the statistics in the form of the protobuf [DatasetFeatureStatisticsList](https://github.com/tensorflow/metadata/blob/master/tensorflow_metadata/proto/v0/statistics.proto#L36), which can then be passed to TFDV methods to [validate the dataset](https://www.tensorflow.org/tfx/data_validation/get_started#checking_the_data_for_errors): |
| 53 | + |
| 54 | +```{python} |
| 55 | +anomalies = tfdv.validate_statistics( |
| 56 | +    statistics=stats, schema=feature_set.export_tfx_schema()) |
| 57 | +tfdv.display_anomalies(anomalies) |
| 58 | +``` |
| 59 | + |
| 60 | +Or [visualise the statistics](https://www.tensorflow.org/tfx/data_validation/get_started#computing_descriptive_data_statistics) in [Facets](https://github.com/PAIR-code/facets): |
| 61 | + |
| 62 | +```{python} |
| 63 | +tfdv.visualize_statistics(stats) |
| 64 | +``` |
| 65 | + |
| 66 | +Refer to the [example notebook](https://github.com/feast-dev/feast/blob/master/examples/statistics/Historical%20Feature%20Statistics%20with%20Feast,%20TFDV%20and%20Facets.ipynb) for an end-to-end example showcasing Feast's integration with TFDV and Facets. |
| 67 | + |
| 68 | +### Aggregating Statistics |
| 69 | + |
| 70 | +Feast supports retrieval of feature statistics across multiple datasets or days. |
| 71 | + |
| 72 | +```{python} |
| 73 | +stats = client.get_statistics( |
| 74 | +    feature_set_id='project/feature_set', |
| 75 | +    store='warehouse', |
| 76 | +    features=['feature_1', 'feature_2'], |
| 77 | +    ingestion_ids=[ingestion_id_1, ingestion_id_2]) |
| 78 | +``` |
| 79 | + |
| 80 | +However, when querying across multiple datasets, Feast computes the statistics for each dataset independently (for caching purposes) and then aggregates the results. As a result, statistics that cannot be aggregated, such as medians, unique counts, and histograms, are dropped. |
| 81 | + |
| 82 | +Refer to the table under Supported Statistics below for the full list of statistics that are dropped (those not marked as aggregatable). |
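To see why some statistics survive aggregation and others do not, consider the following minimal sketch. It is not Feast's implementation: the `NumericStats` type and `merge` function are hypothetical, and the use of the population standard deviation is an assumption. Counts, means, standard deviations, minimums, and maximums can be combined from per-dataset summaries alone, whereas a median cannot.

```python
import math
import statistics
from dataclasses import dataclass


@dataclass
class NumericStats:
    """Hypothetical per-dataset summary; not a Feast or TFDV type."""
    count: int
    mean: float
    stdev: float  # population standard deviation (an assumption)
    min: float
    max: float


def merge(a: NumericStats, b: NumericStats) -> NumericStats:
    """Combine two summaries without access to the underlying rows."""
    count = a.count + b.count
    mean = (a.count * a.mean + b.count * b.mean) / count
    # E[x^2] of each dataset is stdev^2 + mean^2; pool them, then subtract
    # the square of the combined mean to recover the combined variance.
    second_moment = (
        a.count * (a.stdev ** 2 + a.mean ** 2)
        + b.count * (b.stdev ** 2 + b.mean ** 2)
    ) / count
    stdev = math.sqrt(max(second_moment - mean ** 2, 0.0))
    return NumericStats(count, mean, stdev, min(a.min, b.min), max(a.max, b.max))


rows_1 = [1.0, 2.0, 3.0]
rows_2 = [10.0, 20.0]
day_1 = NumericStats(len(rows_1), statistics.fmean(rows_1),
                     statistics.pstdev(rows_1), min(rows_1), max(rows_1))
day_2 = NumericStats(len(rows_2), statistics.fmean(rows_2),
                     statistics.pstdev(rows_2), min(rows_2), max(rows_2))
combined = merge(day_1, day_2)

# The per-dataset medians (2.0 and 15.0) do not determine the overall median.
median_of_all_rows = statistics.median(rows_1 + rows_2)
```

The merged mean and standard deviation match those computed over all five rows directly, while the overall median depends on how the individual values interleave, which the per-dataset summaries no longer record.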
| 83 | + |
| 84 | +### Caching |
| 85 | + |
| 86 | +Feast caches the results of all feature statistics requests and, by default, returns the cached results. To recompute previously computed feature statistics, set `force_refresh` to `True` when retrieving the statistics: |
| 87 | + |
| 88 | +```{python} |
| 89 | +stats = client.get_statistics( |
| 90 | +    feature_set_id='project/feature_set', |
| 91 | +    store='warehouse', |
| 92 | +    features=['feature_1', 'feature_2'], |
| 93 | +    ingestion_ids=[ingestion_id], |
| 94 | +    force_refresh=True) |
| 95 | +``` |
| 96 | + |
| 97 | +This will force Feast to recompute the statistics, and replace any previously cached values. |
| 98 | + |
| 99 | +### Supported Statistics |
| 100 | + |
| 101 | +Feast supports most, but not all, of the feature statistics defined in TFX's [FeatureNameStatistics](https://github.com/tensorflow/metadata/blob/master/tensorflow_metadata/proto/v0/statistics.proto#L147). For the definition of each statistic and information about how each one is computed, refer to the [protobuf definition](https://github.com/tensorflow/metadata/blob/master/tensorflow_metadata/proto/v0/statistics.proto#L147). |
| 102 | + |
| 103 | +| Type | Statistic | Supported | Aggregatable | |
| 104 | +| --- | --- | --- | --- | |
| 105 | +| Common | NumNonMissing | ✔ | ✔ | |
| 106 | +| | NumMissing | ✔ | ✔ | |
| 107 | +| | MinNumValues | ✔ | ✔ | |
| 108 | +| | MaxNumValues | ✔ | ✔ | |
| 109 | +| | AvgNumValues | ✔ | ✔ | |
| 110 | +| | TotalNumValues | ✔ | ✔ | |
| 111 | +| | NumValuesHist | | | |
| 112 | +| Numeric | Min | ✔ | ✔ | |
| 113 | +| | Max | ✔ | ✔ | |
| 114 | +| | Median | ✔ | | |
| 115 | +| | Mean | ✔ | ✔ | |
| 116 | +| | Stdev | ✔ | ✔ | |
| 117 | +| | NumZeroes | ✔ | ✔ | |
| 118 | +| | Quantiles | ✔ | | |
| 119 | +| | Histogram | ✔ | | |
| 120 | +| String | RankHistogram | ✔ | | |
| 121 | +| | TopValues | ✔ | | |
| 122 | +| | Unique | ✔ | | |
| 123 | +| | AvgLength | ✔ | ✔ | |
| 124 | +| Bytes | MinNumBytes | ✔ | ✔ | |
| 125 | +| | MaxNumBytes | ✔ | ✔ | |
| 126 | +| | AvgNumBytes | ✔ | ✔ | |
| 127 | +| | Unique | ✔ | | |
| 128 | +| Struct/List | - (uses common statistics only) | - | - | |
| 129 | + |
| 130 | +## Inflight Feature Statistics |
| 131 | + |
| 132 | +For insight into data currently flowing into Feast through the population jobs, [statsd](https://github.com/statsd/statsd) is used to capture feature value statistics. |
| 133 | + |
| 134 | +Inflight feature statistics are windowed (default window length is 30s) and computed at two points in the feature population pipeline: |
| 135 | + |
| 136 | +1. Prior to store writes, after successful validation |
| 137 | +2. After successful store writes |
| 138 | + |
| 139 | +The following metrics are written at the end of each window as [statsd gauges](https://github.com/statsd/statsd/blob/master/docs/metric_types.md#gauges): |
| 140 | + |
| 141 | +``` |
| 142 | +feast_ingestion_feature_value_min |
| 143 | +feast_ingestion_feature_value_max |
| 144 | +feast_ingestion_feature_value_mean |
| 145 | +feast_ingestion_feature_value_percentile_25 feast_ingestion_feature_value_percentile_50 feast_ingestion_feature_value_percentile_90 feast_ingestion_feature_value_percentile_95 feast_ingestion_feature_value_percentile_99 |
| 146 | +``` |
| 147 | + |
| 148 | +{% hint style="info" %} |
| 149 | +The gauge metric type is used instead of the histogram type because statsd only supports positive values for histogram metrics, while numerical feature values can be any double value. |
| 150 | +{% endhint %} |
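As a rough illustration of what is computed at the end of each window, the sketch below summarises one window of feature values into the metric names listed above. It is not the population job's code: `summarize_window` and `percentile` are hypothetical helpers, and the nearest-rank percentile definition is an assumption (other definitions interpolate between values).

```python
import math
from typing import Dict, List


def percentile(sorted_values: List[float], p: float) -> float:
    """Nearest-rank percentile over pre-sorted values (assumed definition)."""
    rank = max(1, math.ceil(p / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def summarize_window(values: List[float]) -> Dict[str, float]:
    """Reduce one window of feature values to the gauge values to report."""
    ordered = sorted(values)
    prefix = "feast_ingestion_feature_value_"
    summary = {
        prefix + "min": ordered[0],
        prefix + "max": ordered[-1],
        prefix + "mean": sum(ordered) / len(ordered),
    }
    for p in (25, 50, 90, 95, 99):
        summary[f"{prefix}percentile_{p}"] = percentile(ordered, p)
    return summary


# Feature values may be negative, which is why gauges are used.
window = [-2.5, 0.0, 1.0, 4.0, 10.0, 3.5, -1.0, 7.0]
gauges = summarize_window(window)
```

Each entry in `gauges` corresponds to one gauge written for the window. In practice each value would also carry the tags described below.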
| 151 | + |
| 152 | +The metrics are tagged with and can be aggregated by the following keys: |
| 153 | + |
| 154 | +| key | description | |
| 155 | +| --- | --- | |
| 156 | +| feast_store | store the population job is writing to | |
| 157 | +| feast_project_name | Feast project name | |
| 158 | +| feast_featureSet_name | feature set name | |
| 159 | +| feast_feature_name | feature name | |
| 160 | +| ingestion_job_name | id of the population job writing the feature values | |