# Data Quality Monitoring

Data Quality Monitoring (DQM) is a Feast module aimed at helping users validate their data against a user-curated set of rules. Validation can be applied during:

* Historical retrieval (training dataset generation)
* [planned] Writing features into an online store
* [planned] Reading features from an online store

Its goal is to address several complex data problems, namely:

* Data consistency - a new training dataset can be significantly different from previous datasets. This might require a change in model architecture.
* Issues/bugs in the upstream pipeline - bugs in upstream pipelines can cause invalid values to overwrite existing valid values in an online store.
* Training/serving skew - distribution shift can significantly decrease the performance of the model.

> To monitor data quality, we check that the characteristics of the tested dataset (aka the tested dataset's profile) are "equivalent" to the characteristics of the reference dataset.
> How exactly profile equivalency should be measured is up to the user.

### Overview

The validation process consists of the following steps:

1. The user prepares a reference dataset (currently only [saved datasets](../getting-started/concepts/dataset.md) from historical retrieval are supported; see the sketch further down for saving one).
2. The user defines a profiler function, which produces a profile from a given dataset (currently only profilers based on [Great Expectations](https://docs.greatexpectations.io) are supported).
3. The tested dataset is validated, with the reference dataset and the profiler passed as parameters.

### Preparations

Feast with Great Expectations support can be installed via

```shell
pip install 'feast[ge]'
```

### Dataset profile

Currently, Feast supports only [Great Expectations'](https://greatexpectations.io/) [ExpectationSuite](https://legacy.docs.greatexpectations.io/en/latest/autoapi/great_expectations/core/expectation_suite/index.html#great_expectations.core.expectation_suite.ExpectationSuite) as a dataset profile. Hence, the user needs to define a function (profiler) that receives a dataset and returns an [ExpectationSuite](https://legacy.docs.greatexpectations.io/en/latest/autoapi/great_expectations/core/expectation_suite/index.html#great_expectations.core.expectation_suite.ExpectationSuite).

Great Expectations supports automatic profiling as well as manually specifying expectations:

```python
from great_expectations.dataset import Dataset
from great_expectations.core.expectation_suite import ExpectationSuite

from feast.dqm.profilers.ge_profiler import ge_profiler

@ge_profiler
def automatic_profiler(dataset: Dataset) -> ExpectationSuite:
    from great_expectations.profile.user_configurable_profiler import UserConfigurableProfiler

    # Build an expectation suite automatically from the dataset's statistics,
    # ignoring the 'conv_rate' column.
    return UserConfigurableProfiler(
        profile_dataset=dataset,
        ignored_columns=['conv_rate'],
        value_set_threshold='few'
    ).build_suite()
```

However, in our experience the capabilities of the automatic profiler are quite limited, so we recommend crafting your own expectations:

```python
@ge_profiler
def manual_profiler(dataset: Dataset) -> ExpectationSuite:
    # Expect the maximum of "column" to fall between 1 and 2.
    dataset.expect_column_max_to_be_between("column", 1, 2)
    return dataset.get_expectation_suite()
```
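### Saving a reference dataset

Before validation can run, a reference dataset has to exist (step 1 in the overview above). Below is a minimal sketch of persisting the result of a historical retrieval as a saved dataset named `my_reference_dataset`; it assumes the `create_saved_dataset` API together with file-based saved dataset storage, and the storage path is purely illustrative:

```python
from feast import FeatureStore
from feast.infra.offline_stores.file_source import SavedDatasetFileStorage

fs = FeatureStore(".")

# Retrieve the data that will serve as the reference
# (entity dataframe and feature list omitted, as in the examples above).
job = fs.get_historical_features(...)

# Persist the retrieval result as a saved dataset so it can later be used
# as a validation reference (file path is illustrative).
fs.create_saved_dataset(
    from_=job,
    name="my_reference_dataset",
    storage=SavedDatasetFileStorage(path="my_reference_dataset.parquet"),
)
```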
### Validating Training Dataset

During retrieval of historical features, `validation_reference` can be passed as a parameter to the `.to_df(validation_reference=...)` or `.to_arrow(validation_reference=...)` methods of `RetrievalJob`. If this parameter is provided, Feast will run validation once the dataset is materialized.

If validation succeeds, the materialized dataset is returned. Otherwise, a `feast.dqm.errors.ValidationFailed` exception is raised, containing the details of all expectations that did not pass.

```python
from feast import FeatureStore

fs = FeatureStore(".")

job = fs.get_historical_features(...)
job.to_df(
    validation_reference=fs
    .get_saved_dataset("my_reference_dataset")
    .as_reference(profiler=manual_profiler)
)
```
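For completeness, here is a sketch of handling a failed validation; it assumes the raised `ValidationFailed` exception exposes the failed expectations via a `validation_report` attribute:

```python
from feast.dqm.errors import ValidationFailed

# fs, job and manual_profiler as defined in the examples above.
try:
    df = job.to_df(
        validation_reference=fs
        .get_saved_dataset("my_reference_dataset")
        .as_reference(profiler=manual_profiler)
    )
except ValidationFailed as exc:
    # Each entry describes an expectation that did not pass.
    print(exc.validation_report)
```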