fix: Handle array of strings columns in Athena materialization #6324

Merged
ntkathole merged 3 commits into feast-dev:master from alan-gauthier-jt:fix-empty-string-array
Jun 3, 2026

Conversation

@alan-gauthier-jt alan-gauthier-jt commented Apr 24, 2026

What this PR does / why we need it

Fixes two related bugs that cause TypeError and ValueError when materializing
feature views with array-typed columns (e.g. Array(String), Array(Int64)) using
the Athena offline store.

Arrow/Athena deserializes array columns as numpy.ndarray (object dtype) instead of
plain Python lists. This affects three code paths in type_map.py, the first of which
is already guarded:

  1. _convert_scalar_values_to_proto: pd.isnull(ndarray) returns an array of bools,
    and not <array> raises ValueError: The truth value of an empty array is ambiguous.
    → Already guarded by _is_array_like in newer Feast versions; no change needed here.

  2. _convert_list_values_to_proto (generic list path): proto_type(val=ndarray) passes
    the raw numpy array to the protobuf constructor, which only accepts Python lists →
    TypeError: bad argument type for built-in operation. Additionally, Arrow nullable
    columns can yield None elements inside the ndarray, which protobuf repeated fields
    also reject.

  3. _validate_collection_item_types: None elements inside an ndarray failed the
    type(item) in valid_types check before reaching the sanitization step.
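
The third failure mode can be illustrated without Feast or NumPy. The sketch below is a simplified stand-in, not the actual Feast code: the function name validate_item_types and the skip_none flag are hypothetical and exist only to contrast a strict type check with one that tolerates None elements.

```python
from typing import Iterable, Tuple


def validate_item_types(
    items: Iterable, valid_types: Tuple[type, ...], skip_none: bool
) -> None:
    """Raise TypeError if an element's exact type is not in valid_types.

    Hypothetical stand-in for a collection item check; not Feast's API.
    """
    for item in items:
        # Nullable Arrow columns can yield None elements; optionally skip them.
        if item is None and skip_none:
            continue
        if type(item) not in valid_types:
            raise TypeError(f"Unexpected element type {type(item).__name__}")


column_value = ["a", None, "b"]

# Strict check: the None element is rejected before any sanitization can run.
try:
    validate_item_types(column_value, (str,), skip_none=False)
except TypeError as exc:
    print("strict check:", exc)

# Tolerant check: None is skipped and left for a later sanitization step.
validate_item_types(column_value, (str,), skip_none=True)
```

Skipping None at validation time only makes sense if a later step replaces or drops those elements before the protobuf message is built.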

Changes

feast/type_map.py

  • Add module-level _LIST_NONE_DEFAULTS dict mapping each list ValueType to a
    type-appropriate zero/empty default value used to replace None elements:

    • STRING_LIST, UUID_LIST, TIME_UUID_LIST, DECIMAL_LIST → ""
    • BYTES_LIST → b""
    • INT32_LIST, INT64_LIST → 0
    • FLOAT_LIST, DOUBLE_LIST → 0.0
    • BOOL_LIST → False
    • UNIX_TIMESTAMP_LIST → NULL_TIMESTAMP_INT_VALUE
  • Add module-level _sanitize_list_value(value, feast_value_type) helper that:

    • Calls .tolist() on any numpy.ndarray to produce a plain Python list
      (empty ndarray → None, treated as a missing row)
    • Replaces None elements with the type-appropriate default from _LIST_NONE_DEFAULTS
    • Is a no-op for plain Python lists without None and for scalar values
  • Apply sanitization upfront in _convert_list_values_to_proto: both values and
    sample are normalised via _sanitize_list_value before any type-checking or proto
    conversion, removing the need for per-path ndarray handling.

  • Remove the old _to_proto_safe_list / _DROP_NONE / _LIST_TYPE_NONE_REPLACEMENT
    module-level helpers, which have been superseded by the above.

  • Skip None elements in _validate_collection_item_types: None entries are
    valid in nullable Arrow columns and are sanitized upstream; raising a TypeError on
    them before that point was incorrect.
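
The sanitization behaviour described in the list above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code merged in this PR: ListType is a hypothetical stand-in for the list members of Feast's ValueType, only a subset of the defaults is shown, array-likes are detected by duck typing on .tolist() so the sketch runs without NumPy, and leaving an empty plain list untouched is an assumption.

```python
from enum import Enum, auto
from typing import Any


class ListType(Enum):
    """Hypothetical stand-in for the list members of Feast's ValueType."""

    STRING_LIST = auto()
    BYTES_LIST = auto()
    INT64_LIST = auto()
    DOUBLE_LIST = auto()
    BOOL_LIST = auto()


# Type-appropriate replacement for None elements (subset of the mapping above).
LIST_NONE_DEFAULTS = {
    ListType.STRING_LIST: "",
    ListType.BYTES_LIST: b"",
    ListType.INT64_LIST: 0,
    ListType.DOUBLE_LIST: 0.0,
    ListType.BOOL_LIST: False,
}


def sanitize_list_value(value: Any, list_type: ListType) -> Any:
    """Return a plain-list version of value with None elements replaced."""
    # Array-likes such as numpy.ndarray expose .tolist(); plain lists do not.
    if not isinstance(value, list) and hasattr(value, "tolist"):
        value = value.tolist()
        # An empty array is treated as a missing row.
        if isinstance(value, list) and not value:
            return None
    if not isinstance(value, list):
        return value  # scalars and None rows pass through unchanged
    if any(item is None for item in value):
        default = LIST_NONE_DEFAULTS[list_type]
        return [default if item is None else item for item in value]
    return value  # plain list without None: returned as-is


print(sanitize_list_value(["a", None, "b"], ListType.STRING_LIST))  # ['a', '', 'b']
print(sanitize_list_value([1, None], ListType.INT64_LIST))  # [1, 0]
```

Normalising once, before type checking and proto conversion, keeps the downstream paths free of array-specific branches, which matches the motivation given for applying the helper upfront.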

Testing

Added TestArrowArrayStringListMaterialization in
sdk/python/tests/unit/test_type_map.py covering:

| Test | Scenario |
| --- | --- |
| test_sanitize_list_value_ndarray | ndarray → plain list |
| test_sanitize_list_value_empty_ndarray | empty ndarray → None (missing row) |
| test_sanitize_list_value_ndarray_with_none | None elements in STRING_LIST replaced with "" |
| test_sanitize_list_value_plain_list | plain list passthrough |
| test_sanitize_list_value_plain_list_with_none | None in plain STRING_LIST list replaced with "" |
| test_sanitize_list_value_numeric_none_replaced | None in numeric/bool lists replaced with zero default |
| test_sanitize_list_value_bytes_none_replaced | None in BYTES_LIST replaced with b"" |
| test_sanitize_list_value_scalar_passthrough | non-list, non-ndarray values unchanged |
| test_string_list_from_ndarray | full round-trip via python_values_to_proto_values |
| test_string_list_from_empty_ndarray | empty ndarray no longer raises ValueError |
| test_string_list_from_ndarray_with_none_elements | None in ndarray no longer raises TypeError |
| test_string_list_null_row_produces_empty_proto | None rows produce empty ProtoValue |
| test_mixed_batch_simulating_athena_chunk | full simulation of a failing Athena materialization batch |

pytest sdk/python/tests/unit/test_type_map.py::TestArrowArrayStringListMaterialization -v

Which issues this PR fixes

Fixes #6325

Does this PR introduce a user-facing change?

Yes — materialization of array-typed feature columns from Athena no longer fails with
TypeError or ValueError when a batch contains empty arrays, None rows, or None
elements inside arrays. None elements inside an array are now stored as the
type-appropriate zero/empty value (e.g. "" for strings, 0 for integers).

Previously:
  TypeError: bad argument type for built-in operation
  ValueError: The truth value of an empty array is ambiguous

After this fix:
  Materialization completes successfully.
  None elements inside arrays are replaced with type-appropriate defaults.

@alan-gauthier-jt alan-gauthier-jt changed the title fix: handle numpy.ndarray Array(String) columns in Athena materialization fix: handle numpyndarray Array(String) columns in Athena materialization Apr 24, 2026
@alan-gauthier-jt alan-gauthier-jt changed the title fix: handle numpyndarray Array(String) columns in Athena materialization fix: Handle array of strings columns in Athena materialization Apr 24, 2026
@alan-gauthier-jt alan-gauthier-jt requested a review from ntkathole May 7, 2026 09:57
@alan-gauthier-jt alan-gauthier-jt force-pushed the fix-empty-string-array branch from d21d32c to ac81649 Compare May 13, 2026 07:46
@alan-gauthier-jt alan-gauthier-jt force-pushed the fix-empty-string-array branch from ac81649 to 66173d7 Compare May 19, 2026 07:03
@alan-gauthier-jt alan-gauthier-jt force-pushed the fix-empty-string-array branch 2 times, most recently from 828cd23 to 4030d4b Compare June 2, 2026 11:52
Signed-off-by: Alan Gauthier <alan.gauthier@jobteaser.com>
@ntkathole ntkathole force-pushed the fix-empty-string-array branch from 4030d4b to 066c9a2 Compare June 3, 2026 09:01
@ntkathole ntkathole merged commit 4ed0278 into feast-dev:master Jun 3, 2026
20 of 27 checks passed
franciscojavierarceo pushed a commit that referenced this pull request Jun 13, 2026
# [0.64.0](v0.63.0...v0.64.0) (2026-06-13)

### Bug Fixes

* Add async_supported property to RedisOnlineStore ([9b088fe](9b088fe))
* Add missing feast init templates to operator CRD and enhance persistence documentation ([1941d4d](1941d4d))
* Allow to publish from reference branch ([5458ec8](5458ec8))
* API calls list ([4203eb7](4203eb7))
* **bigquery:** Enable list inference for parquet loads in offline_write_batch ([9243497](9243497)), closes [#5845](#5845)
* Bump grpcio dependencies ([07b4782](07b4782))
* **compute-engine/local:** Honor field_mapping on join keys in dedup + join nodes ([#6395](#6395)) ([bd01824](bd01824))
* **dynamodb:** Avoid tag race condition by using diff-based tag updates ([#6479](#6479)) ([bad2b7d](bad2b7d)), closes [#6418](#6418)
* **dynamodb:** Fix mypy type for _build_projection_expression return ([217b4da](217b4da))
* Fix intermittent async test failures for DynamoDB and Redis ([63c5eb1](63c5eb1))
* Fix mongodb blog title ([57d28d4](57d28d4))
* Fix shared SQL registry crash - avoid unnecessary UDF deserialization in proto cache building ([ac588d7](ac588d7))
* Fix SparkRetrievalJob.persist() failing for SparkSource ([209d7cd](209d7cd))
* Fixed formatting and image for mongo blog ([#6377](#6377)) ([f8389fb](f8389fb))
* Fixes for ray source ([7f592a4](7f592a4))
* **go:** skip registry refresh when cache_ttl_seconds <= 0 ([97ed40c](97ed40c))
* Handle array of strings columns in Athena materialization ([#6324](#6324)) ([4ed0278](4ed0278))
* make milvus VARCHAR max_length configurable, remove hardcoded 512 limit ([3b98c22](3b98c22))
* **operator:** Set appProtocol: grpc on registry gRPC Service ([#6367](#6367)) ([c9ae2b4](c9ae2b4))
* PyJWT 2.10+ added validation that rejects empty HMAC keys ([e756ffe](e756ffe))
* RemoteOnlineStore sends all features in a single HTTP request ([8f187dd](8f187dd))
* Remove registry proto dump to enforce RBAC and add permission checks to Commit/Refresh RPCs ([328431f](328431f))
* Remove selector migration job - no longer needed ([51c325e](51c325e))
* replace broken .claude skill symlink with correct relative path ([4541690](4541690))
* Replace selector label strip patch with migration Job for upgrade-safe selector uniqueness ([00dea50](00dea50))
* Scope feature view name conflict check to current project in file-based registry ([#6369](#6369)) ([a4fde83](a4fde83)), closes [#6209](#6209)
* **snowflake:** Stop double-quoting connection identifiers ([#6462](#6462)) ([e914d59](e914d59))
* **spark:** S3/GCS PyArrow filesystem resolution for staging paths ([#6442](#6442)) ([ae50414](ae50414))
* **trino:** Clean up temporary entity tables after retrieval ([#6381](#6381)) ([d86b13d](d86b13d)), closes [#6306](#6306)
* Update go-feature-server base image to Go 1.25 and fix operator Dockerfile COPY permissions ([86ef0bc](86ef0bc))

### Features

* [Backend] Data Quality Monitoring with native compute, multi-backend support, REST API, CLI ([#6202](#6202)) ([5458c37](5458c37))
* Add apache flink compute engine ([#6476](#6476)) ([9636d6a](9636d6a))
* Add demo notebooks for users ([e362173](e362173))
* Add enabled/disabled toggle for feature views ([#6401](#6401)) ([5f1fa0d](5f1fa0d)), closes [#6395](#6395)
* Add Label View to init template ([ec272d5](ec272d5))
* Add mTLS support to remote registry gRPC client ([#6474](#6474)) ([c9602d8](c9602d8))
* Add Prometheus gauges for FeatureStore installation telemetry ([#6354](#6354)) ([1b681b7](1b681b7))
* Adds registry REST API endpoints for managing entities, data sources, and feature views ([#6413](#6413)) ([f77bd1d](f77bd1d))
* Allow CRUD on entities, data sources, and feature views from UI ([#6412](#6412)) ([2321c07](2321c07))
* Allow default openlineage configuration ([#6467](#6467)) ([276b6df](276b6df))
* **bigquery:** Support DATE-type event timestamp columns ([#6362](#6362)) ([753dee5](753dee5)), closes [#2530](#2530)
* **cli:** Add `feast projects delete` command (closes [#5095](#5095)) ([#6318](#6318)) ([1a4b96c](1a4b96c))
* Data Quality Monitoring added in feast UI ([#6422](#6422)) ([fa271be](fa271be))
* **dynamodb:** Use ProjectionExpression when requested_features is set ([0adc906](0adc906)), closes [#6058](#6058)
* Enhance DataSource and FeatureView modals with error handling and submission states ([96d7169](96d7169))
* Expose registry endpoints on feature server for MCP access ([f77981c](f77981c))
* Feast First-Class LabelView Implementation ([#6292](#6292)) ([c0e7e5d](c0e7e5d))
* Feast-MLflow Integration ([#6235](#6235)) ([7279c75](7279c75))
* Operational metrics for offline store and SOX metrics for both ([#6340](#6340)) ([65b1b80](65b1b80))
* Pre-compute feature service ([8011550](8011550))
* REST API-backed UI for RBAC compatibility and per-page lazy loading ([#6414](#6414)) ([6ae80af](6ae80af))
* Support non-string map key types ([#6382](#6382)) ([#6383](#6383)) ([728aa2e](728aa2e))
* Update FeatureStore CRD with DRA Fields ([01241e4](01241e4))

### Performance Improvements

* Cache feature view resolution in get_online_features to reduce per-request overhead ([55c2f18](55c2f18))
* Optimize feature serving latency with batched async Redis, cached checks fix ([103809a](103809a))
* Replace MessageToDict with optimized custom dict builder ([#6015](#6015)) ([9902064](9902064))

Development

Successfully merging this pull request may close these issues.

Bug: TypeError / ValueError when materializing Array(String) feature views with Athena offline store

3 participants