
ClickHouseIO: Add DateTime64 support for sub-second timestamp precision #38510


Eliaaazzz commented May 15, 2026

ClickHouseIO's TableSchema and column-type parser only recognized DateTime (second precision), so pipelines emitting sub-second timestamps (log/event ingestion, financial data) could not write to ClickHouse tables declared with DateTime64(precision[, 'timezone']) columns.

This change adds first-class DateTime64 support to ClickHouseIO:

  • Schema model — new TypeName.DATETIME64; ColumnType carries precision (0–9, validated) and an optional timezone, with a ColumnType.dateTime64(precision[, timezone]) factory.
  • Parser — JavaCC grammar rule for DateTime64(<precision>[, '<timezone>']), also reachable through Nullable(...) and Array(...) via the existing primitive() rule.
  • Beam field-type mapping — picks the narrowest logical type that round-trips the requested precision:
    • precision ≤ 3 → Joda DATETIME (preserves existing pipelines).
    • precision 4–6 → SqlTypes.TIMESTAMP (MicrosInstant).
    • precision ≥ 7 → NanosInstant, the only built-in logical type that preserves full nanosecond precision through a Row; MicrosInstant rejects non-micro-aligned nanos.
  • Writer — serializes DateTime64 as a little-endian Int64 of epoch_seconds * 10^precision + sub_second_units, accepting both Joda ReadableInstant and java.time.Instant. Uses Math.floorDiv / Math.floorMod so negative timestamps match ClickHouse's encoding, and Math.multiplyExact / Math.addExact for overflow safety.
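The encoding described in the Writer bullet can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the class `DateTime64Sketch` and its methods `encode`, `encodeEpochMillis` and `toLittleEndian` are hypothetical names, and only the JDK is used.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.time.Instant;

/** Hypothetical helper; names are illustrative and not taken from the PR. */
final class DateTime64Sketch {
  private DateTime64Sketch() {}

  /** Returns 10^exp for exp in [0, 9]. */
  private static long pow10(int exp) {
    long result = 1L;
    for (int i = 0; i < exp; i++) {
      result *= 10L;
    }
    return result;
  }

  /** Encodes an instant as a count of 10^-precision second ticks since the Unix epoch. */
  static long encode(Instant instant, int precision) {
    if (precision < 0 || precision > 9) {
      throw new IllegalArgumentException("precision must be in [0, 9]: " + precision);
    }
    long ticksPerSecond = pow10(precision);
    long nanosPerTick = pow10(9 - precision);
    // Instant keeps getNano() in [0, 999_999_999], so getEpochSecond() is already
    // the floored second, also for instants before the epoch.
    long subSecondTicks = instant.getNano() / nanosPerTick; // drops digits below one tick
    // multiplyExact/addExact throw ArithmeticException instead of silently wrapping.
    return Math.addExact(
        Math.multiplyExact(instant.getEpochSecond(), ticksPerSecond), subSecondTicks);
  }

  /** Same encoding for an epoch-millis input, as a Joda-style instant would supply. */
  static long encodeEpochMillis(long epochMillis, int precision) {
    // floorDiv/floorMod keep the sub-second part non-negative for negative timestamps.
    long seconds = Math.floorDiv(epochMillis, 1000L);
    long millisOfSecond = Math.floorMod(epochMillis, 1000L);
    return encode(Instant.ofEpochSecond(seconds, millisOfSecond * 1_000_000L), precision);
  }

  /** Lays the encoded value out as 8 little-endian bytes. */
  static byte[] toLittleEndian(long value) {
    return ByteBuffer.allocate(8).order(ByteOrder.LITTLE_ENDIAN).putLong(value).array();
  }
}
```

Under these assumptions, an instant half a second before the epoch encodes to -500 at precision 3, and a precision-7 column keeps only 100 ns ticks, so the last two nanosecond digits are dropped.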

Tests:

  • TableSchemaTest — parser cases for DateTime64(3), DateTime64(6, 'UTC'), DateTime64(9), Nullable(DateTime64(...)), Array(DateTime64(...)); schema-mapping tests for the millis, micros and nanos buckets; precision-range validation.
  • ClickHouseWriterTest — encoder unit tests covering Joda and java.time.Instant inputs, precision 0/3/6/7/9, negative timestamps and the precision-7 100 ns truncation path.
  • ClickHouseIOIT — round-trip integration tests against the ClickHouse test container for precisions 3/6/9 (the 9-precision case uses non-micro-aligned nanos) and Nullable(DateTime64(6)).

fixes #38466



ClickHouse's DateTime64(precision[, 'timezone']) was not recognized by
TableSchema or the column-type parser, so pipelines emitting sub-second
timestamps (log/event ingestion, financial data) could not write to
DateTime64 columns.

This adds:
* TypeName.DATETIME64 with precision (0-9) and optional timezone fields,
  plus a ColumnType.dateTime64(precision[, timezone]) factory.
* Parser grammar for DateTime64(<precision>[, '<timezone>']) so the type
  is also recognized inside Nullable(...) and Array(...) via the
  existing primitive() rule.
* Beam schema mapping picks the narrowest logical type that round-trips
  the requested precision:
    precision <= 3 → Joda DATETIME (preserves existing pipelines).
    precision 4-6 → SqlTypes.TIMESTAMP (MicrosInstant).
    precision >= 7 → NanosInstant, the only built-in logical type that
                     carries full nanosecond precision through a Row;
                     MicrosInstant would reject sub-micro nanos.
* Writer serialization as a little-endian Int64 of
  epoch_seconds * 10^precision + sub_second_units, accepting both Joda
  ReadableInstant and java.time.Instant inputs; floor division on
  negative timestamps matches ClickHouse's own encoding.
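The precision buckets listed above can be illustrated with a small stand-alone sketch. The enum `BeamTimeType` and the method `bucketFor` are hypothetical stand-ins so the example runs without Beam on the classpath; the PR itself returns Beam `Schema.FieldType` values.

```java
/** Illustrative only: the enum and method names are hypothetical, not Beam API. */
final class DateTime64Buckets {
  private DateTime64Buckets() {}

  /** Stand-ins for Joda DATETIME, SqlTypes.TIMESTAMP (MicrosInstant) and NanosInstant. */
  enum BeamTimeType {
    JODA_DATETIME,
    MICROS_INSTANT,
    NANOS_INSTANT
  }

  /** Picks the narrowest bucket that can round-trip the given DateTime64 precision. */
  static BeamTimeType bucketFor(int precision) {
    if (precision < 0 || precision > 9) {
      throw new IllegalArgumentException("precision must be in [0, 9]: " + precision);
    }
    if (precision <= 3) {
      return BeamTimeType.JODA_DATETIME; // millisecond resolution is enough
    }
    if (precision <= 6) {
      return BeamTimeType.MICROS_INSTANT; // microsecond resolution is enough
    }
    return BeamTimeType.NANOS_INSTANT; // precisions 7 to 9 need nanoseconds
  }
}
```

Choosing the narrowest bucket keeps precision 0 to 3 columns on the same Beam type that plain DateTime columns already use, which is the compatibility argument the description makes.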

Tests: parser cases for DateTime64(3), DateTime64(6,'UTC'),
DateTime64(9), Nullable(DateTime64(...)) and Array(DateTime64(...));
schema-mapping tests for the micros and nanos buckets; encoder unit
tests covering Joda/java.time inputs, zero/nano/negative edge cases and
the precision-7 100 ns tick truncation path; round-trip integration
tests against the ClickHouse testcontainer for precisions 3/6/9 (with
non-micro-aligned nanos for the nanos case) plus a nullable case.

Closes apache#38466
@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces support for ClickHouse's DateTime64 data type, allowing users to write timestamps with sub-second precision (up to nanoseconds) to ClickHouse tables. The changes include updates to the schema model, parser, and writer to accommodate the precision and timezone requirements of DateTime64, while maintaining backward compatibility for existing pipelines using standard DateTime.

Highlights

  • DateTime64 Support: Added first-class support for ClickHouse DateTime64 columns, enabling sub-second timestamp precision in pipelines.
  • Schema Mapping: Implemented dynamic Beam field-type mapping based on precision: Joda DATETIME for <=3, SqlTypes.TIMESTAMP for 4-6, and NanosInstant for >=7.
  • Writer Logic: Updated the ClickHouse writer to handle serialization of DateTime64 as a 64-bit integer, ensuring correct handling of negative timestamps and overflow safety.
  • Parser Updates: Extended the JavaCC grammar to support parsing DateTime64(precision[, 'timezone']) definitions.
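To make the accepted syntax concrete, here is a rough sketch that recognizes `DateTime64(precision[, 'timezone'])` with a regular expression. This swaps in a regex for the PR's JavaCC grammar purely for illustration: the class `DateTime64TypeSketch` is hypothetical, and it does not handle the `Nullable(...)` or `Array(...)` wrappers that the real grammar supports.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical regex-based recognizer; the PR itself uses a JavaCC grammar rule. */
final class DateTime64TypeSketch {
  // A single digit keeps the precision within 0-9; the quoted timezone is optional.
  private static final Pattern TYPE =
      Pattern.compile("\\s*DateTime64\\(\\s*(\\d)\\s*(?:,\\s*'([^']*)'\\s*)?\\)\\s*");

  final int precision;
  final Optional<String> timezone;

  private DateTime64TypeSketch(int precision, Optional<String> timezone) {
    this.precision = precision;
    this.timezone = timezone;
  }

  /** Parses a type string, rejecting anything that is not a bare DateTime64 declaration. */
  static DateTime64TypeSketch parse(String text) {
    Matcher matcher = TYPE.matcher(text);
    if (!matcher.matches()) {
      throw new IllegalArgumentException("not a DateTime64 type: " + text);
    }
    int precision = Integer.parseInt(matcher.group(1));
    // group(2) is null when the timezone argument is absent.
    return new DateTime64TypeSketch(precision, Optional.ofNullable(matcher.group(2)));
  }
}
```

With this sketch, `parse("DateTime64(6, 'UTC')")` yields precision 6 with timezone `UTC`, while `parse("DateTime64(10)")` is rejected because the precision is limited to one digit.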
gemini-code-assist (bot) left a comment


Code Review

This pull request adds support for ClickHouse DateTime64 columns with sub-second precision to ClickHouseIO. The implementation includes a new encoding method for timestamps, updates to the ColumnType parser, and logic to map ClickHouse precisions to appropriate Beam logical types like DATETIME, SqlTypes.TIMESTAMP, and NanosInstant. Comprehensive tests were added to verify the encoding and schema mapping. Reviewers recommended using a constant for NanosInstant to improve performance and suggested adding Objects.requireNonNull when unboxing the precision value to prevent potential NullPointerExceptions.

    } else if (p <= 6) {
      return Schema.FieldType.logicalType(SqlTypes.TIMESTAMP);
    } else {
      return Schema.FieldType.logicalType(new NanosInstant());

medium

For better performance and consistency with other logical types (like SqlTypes.TIMESTAMP), consider defining a private constant for NanosInstant instead of instantiating it every time getEquivalentFieldType is called for a DATETIME64 column with precision ≥ 7.

Suggested change
- return Schema.FieldType.logicalType(new NanosInstant());
+ return Schema.FieldType.logicalType(NANOS_INSTANT);

      break;

    case DATETIME64:
      BinaryStreamUtils.writeInt64(stream, encodeDateTime64(value, columnType.precision()));

medium

The precision() method on ColumnType is marked as @Nullable. While the factory and parser ensure it is set for DATETIME64, unboxing it here to an int for the encodeDateTime64 call could theoretically throw a NullPointerException if a ColumnType was manually constructed via the builder without a precision. Consider adding a null check or using Objects.requireNonNull for robustness.

Suggested change
- BinaryStreamUtils.writeInt64(stream, encodeDateTime64(value, columnType.precision()));
+ BinaryStreamUtils.writeInt64(stream, encodeDateTime64(value, java.util.Objects.requireNonNull(columnType.precision())));
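The point about unboxing a nullable precision can be shown in isolation. In this sketch the nested `Column` class is a hypothetical stand-in for a column type whose precision may be unset; it is not Beam's `ColumnType`. Note that `Objects.requireNonNull` still throws a `NullPointerException`, but it does so explicitly and with a message that names the problem.

```java
import java.util.Objects;

/** Hypothetical stand-in types; not the ClickHouseIO classes from the PR. */
final class PrecisionUnboxingSketch {
  private PrecisionUnboxingSketch() {}

  /** A column whose precision is null when it was never set. */
  static final class Column {
    private final Integer precision;

    Column(Integer precision) {
      this.precision = precision;
    }

    Integer precision() {
      return precision;
    }
  }

  /** Unboxes the precision, failing with a descriptive message when it is missing. */
  static int requirePrecision(Column column) {
    return Objects.requireNonNull(
        column.precision(), "DateTime64 column is missing a precision");
  }
}
```

Without the explicit check, passing a null `Integer` where an `int` is expected fails during auto-unboxing with a less informative exception at the call site.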

@github-actions

Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment `assign set of reviewers`
