
Use parameterized queries in HiveStatsCollectionOperator#66751

Open

Har1sh-k wants to merge 1 commit into apache:main from Har1sh-k:parameterize-hive-stats-sql

Conversation


@Har1sh-k Har1sh-k commented May 12, 2026

Summary

HiveStatsCollectionOperator builds its bookkeeping SQL (the SELECT and DELETE against hive_stats in the MySQL metastore, and the SELECT ... FROM <table> WHERE <partition_key> = '<value>' against Presto) by f-string-interpolating template-rendered fields (table, partition, dttm) directly into raw SQL strings.
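A minimal, hypothetical sketch (not the operator's actual code) of why this pattern is risky: with f-string interpolation the value becomes part of the SQL text, so a crafted partition value rewrites the query instead of being treated as a literal.

```python
def build_query(table: str, part_col: str, part_val: str) -> str:
    # f-string interpolation: the partition value lands inside the SQL body
    return f"SELECT COUNT(*) FROM {table} WHERE {part_col} = '{part_val}'"

benign = build_query("my_table", "ds", "2026-05-12")
crafted = build_query("my_table", "ds", "x' OR '1'='1")
# The crafted value changes the query's meaning: the WHERE clause now
# contains an always-true OR branch rather than a single string literal.
```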

This PR is intended as defense-in-depth. It aligns the operator with the documented guidance in security_model.rst and sql.rst that Dag authors should avoid passing unsanitized input into operators and hooks, so the built-in bookkeeping and partition-predicate paths no longer rely on each Dag author to sanitize these values. extra_exprs / assignment_func still accept Dag-authored SQL expressions by design and are intentionally unchanged.

  • The MySQL SELECT and DELETE use %s placeholders with the parameters= kwarg of MySqlHook.get_records / .run, so values are bound by the driver instead of interpolated.
  • For the Presto SELECT, partition values are passed as bound parameters using the hook's declared placeholder (PrestoHook overrides DbApiHook's default %s to ?).
  • Identifiers (table name and partition column names) cannot be parameterized in standard SQL — they are validated against ^[A-Za-z_][A-Za-z0-9_]*(?:\.[A-Za-z_][A-Za-z0-9_]*)?$ and ^[A-Za-z_][A-Za-z0-9_]*$ respectively, raising AirflowException on mismatch.
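The two paths above can be sketched as follows. This is an illustrative reconstruction, not the PR's diff: ValueError stands in for AirflowException, and the hive_stats column names are placeholders.

```python
import re

# Allowlist regexes from the description: optional db.table for the table
# name, a single bare identifier for partition columns.
TABLE_RE = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*(?:\.[A-Za-z_][A-Za-z0-9_]*)?$")
COLUMN_RE = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")

def validate_identifier(name: str, pattern: re.Pattern) -> str:
    # Identifiers cannot be bound as parameters in standard SQL, so they
    # are allowlisted instead; ValueError stands in for AirflowException.
    if not pattern.match(name):
        raise ValueError(f"Invalid SQL identifier: {name!r}")
    return name

table = validate_identifier("mydb.my_table", TABLE_RE)
col = validate_identifier("ds", COLUMN_RE)

# Values are bound by the driver; only validated identifiers are interpolated.
presto_sql = f"SELECT COUNT(*) FROM {table} WHERE {col} = ?"   # PrestoHook's "?" placeholder
mysql_sql = "DELETE FROM hive_stats WHERE table_name = %s AND dttm = %s"  # column names illustrative
```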

Test plan

  • New test_execute_rejects_invalid_table_identifier and test_execute_rejects_invalid_partition_column exercise the allowlist regexes.
  • New test_execute_parameterizes_mysql_bookkeeping_queries and test_execute_parameterizes_presto_partition_values verify the SQL is parameterized (no f-string interpolation of table / partition values into the SQL body) and that the Presto query uses the hook's ? placeholder.
  • Updated test_execute_delete_previous_runs_rows and test_runs_for_hive_stats to assert the parameterized form.
  • Existing test_hive_stats.py tests continue to pass.
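A hedged sketch of how such a parameterization test can assert the split between SQL text and bound values; a Mock stands in for the real MySqlHook here, and the SQL and column names are illustrative, not the PR's actual test code.

```python
from unittest import mock

hook = mock.Mock()
table, dttm = "my_table", "2026-05-12T00:00:00"

# The code under test would call the hook like this: values travel via
# the parameters= kwarg, never inside the SQL string itself.
hook.run(
    "DELETE FROM hive_stats WHERE table_name = %s AND dttm = %s",
    parameters=(table, dttm),
)

sql_arg = hook.run.call_args.args[0]
assert table not in sql_arg                       # no interpolation into the SQL body
assert hook.run.call_args.kwargs["parameters"] == (table, dttm)
```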

Was generative AI tooling used to co-author this PR?
  • Yes (Claude Opus 4.7)

I used Claude Opus 4.7 to assist with this PR, following the Gen-AI Assisted contributions guidelines:

  • I reviewed and understand the generated code; the regex allowlists, the move to presto.placeholder, the MySQL parameterization, and the tests were iterated against airflow-core/docs/security/security_model.rst and airflow-core/docs/security/sql.rst rather than blindly accepted.
  • Ruff lint and format checks pass on both changed files; python -m py_compile and ast.parse succeed.
  • The diff is scoped to two files under providers/apache/hive/ with no unrelated changes.
  • The added unit tests are part of the diff and will run in CI.
  • I take final responsibility for the change.


boring-cyborg Bot commented May 12, 2026

Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about anything, please check our Contributors' Guide
Here are some useful points:

  • Pay attention to the quality of your code (ruff, mypy and type annotations). Our prek-hooks will help you with that.
  • In case of a new feature, add useful documentation (in docstrings or in the docs/ directory). Adding a new operator? Check this short guide. Consider adding an example Dag that shows how users should use it.
  • Consider using the Breeze environment for testing locally; it's a heavy Docker setup, but it ships with a working Airflow and a lot of integrations.
  • Be patient and persistent. It might take some time to get a review or get the final approval from Committers.
  • Please follow ASF Code of Conduct for all communication including (but not limited to) comments on Pull Requests, Mailing list and Slack.
  • Be sure to read the Airflow Coding style.
  • Always keep your Pull Requests rebased, otherwise your build might fail due to changes not related to your commits.
    Apache Airflow is a community-driven project and together we are making it better 🚀.
    In case of doubts contact the developers at:
    Mailing List: dev@airflow.apache.org
    Slack: https://s.apache.org/airflow-slack

HiveStatsCollectionOperator builds its bookkeeping SQL (the SELECT
and DELETE against hive_stats in the MySQL metastore, and the
SELECT ... FROM <table> WHERE <partition_key> = '<value>' against
Presto) by f-string-interpolating template-rendered fields (table,
partition, dttm) directly into raw SQL strings.

Per the security model in airflow-core/docs/security/security_model.rst
and airflow-core/docs/security/sql.rst, Dag authors are trusted users
responsible for sanitizing input before passing it to operators. The
change here is defense-in-depth so that the operator does not rely on
each Dag author to sanitize.

The MySQL bookkeeping SELECT and DELETE now use %s placeholders with
the parameters= kwarg of MySqlHook.get_records / .run, so the values
are bound by the driver instead of interpolated.

For the Presto SELECT, partition values are passed as bound parameters
using the hook's declared placeholder (PrestoHook overrides the
DbApiHook default to "?"). Identifiers (table name and partition
column names) cannot be parameterized in standard SQL — they are
validated against ^[A-Za-z_][A-Za-z0-9_]*(?:\.[A-Za-z_][A-Za-z0-9_]*)?$
for the table and ^[A-Za-z_][A-Za-z0-9_]*$ for partition keys, raising
AirflowException on mismatch.
@Har1sh-k Har1sh-k force-pushed the parameterize-hive-stats-sql branch from 7ed5db6 to 4bb0946 on May 12, 2026 at 03:59
@Har1sh-k Har1sh-k marked this pull request as ready for review May 12, 2026 04:04