perf(spanner): optimize query result decoding#17375
Conversation
There was a problem hiding this comment.
Code Review
This pull request optimizes performance across Spanner streaming and helper modules by caching list append methods, pre-compiling regular expressions, and replacing helper functions with direct attribute getters or lambdas. However, several critical issues were identified in _helpers.py: missing imports for datetime and base64 will cause runtime errors, BYTES values are incorrectly encoded to UTF-8 instead of being base64-decoded, and the new timestamp parsing logic ignores timezone offsets for non-UTC inputs, resulting in incorrect times.
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces several performance optimizations to the Spanner client library, such as caching local variables and methods during row merging, using lambdas and operator.attrgetter for type decoding, pre-compiling regex patterns, and optimizing timestamp parsing. The reviewer provided valuable feedback to further boost performance, including checking the more common number_value field first for float decoding, using string slicing instead of regular expressions for timestamp parsing, and removing the unused re import.
rahul2393
left a comment
There was a problem hiding this comment.
LGTM - with minor comments/questions and test assert fix
| if val[19] == ".": | ||
| if val.endswith("Z"): | ||
| offset = "Z" | ||
| fraction = val[20:-1] |
There was a problem hiding this comment.
Spanner Timestamp seems to be always UTC Z at the API/wire level. Non-Z offset only come from CAST(ts as String), which is a STRING column and routes through the STRING decoder.
+/-HH:MM handling plus the astimezone call won't ever be executed on real TIMESTAMP value? Can we drop it to shrink the risky code, or comment marking it deliberately defensive. The Z fast-path is the legitimate win and short-circuits before this anyway?
|
|
||
| parsed = self._callFUT(value_pb, field_type, field_name) | ||
| self.assertIsInstance(parsed, datetime_helpers.DatetimeWithNanoseconds) | ||
| self.assertEqual(parsed, expected) |
There was a problem hiding this comment.
DatetimeWithNanoseconds.eq is inherited from datetime and compares only to microsecond - it ignores .nanosecond. So this passes even if sub-us digits are wrong.
| index = 0 | ||
| else: | ||
| for value in values: | ||
| if value.HasField("null_value"): |
There was a problem hiding this comment.
value.HasField("null_value") duplicates _parse_nullables's semantics(still used in lazy/array/single-value paths). Should we add comment here to keep in sync so duplication doesn't silently drift?
| return operator.attrgetter("bool_value") | ||
| elif type_code == TypeCode.INT64: | ||
| return _parse_int64 | ||
| return lambda value_pb: int(value_pb.string_value) |
There was a problem hiding this comment.
Maybe a comment on why only STRING/BOOL use attrgetter(faster) while transform trypes use lambdas?

Work in progress.
Optimizes the decoding and reading of (large) result sets for Spanner.