fix: validate empty CSV column names and improve mismatch error messages#1010
Conversation
Co-Authored-By: bot_apk <apk@cognition.ai>
🤖 Devin AI EngineerI'll be helping with this pull request! Here's what you should know: ✅ I will automatically:
Note: I can only respond to comments from users who have write access to this repository. ⚙️ Control Options:
|
👋 Greetings, Airbyte Team Member!Here are some helpful tips and reminders for your convenience. 💡 Show Tips and TricksTesting This CDK VersionYou can test this version of the CDK using the following: # Run the CLI from this branch:
uvx 'git+https://github.com/airbytehq/airbyte-python-cdk.git@devin/1777652889-fix-csv-empty-column-validation#egg=airbyte-python-cdk[dev]' --help
# Update a connector to use the CDK from this branch ref:
cd airbyte-integrations/connectors/source-example
poe use-cdk-branch devin/1777652889-fix-csv-empty-column-validationPR Slash CommandsAirbyte Maintainers can execute the following slash commands on your PR:
|
|
↩️ Triggering Reason: CDK PR is linked to a connector oncall issue, CI is passing or already has local/full pytest evidence, and no AI review marker is present. |
|
Correction: attempted to trigger the CDK This PR remains marked for human review/next-step decision. |
|
/prerelease
|
There was a problem hiding this comment.
Pull request overview
This PR improves the file-based CDK CSV parsing UX by (1) failing fast on malformed CSV headers that contain empty/whitespace-only column names and (2) making CSV field-mismatch errors clearer and more actionable during parsing/discovery.
Changes:
- Add validation in
_CsvReader._get_headers()to raise aFailureType.config_errorwhen any header column name is empty/whitespace-only. - Update
FileBasedSourceErrormessages for “more columns than headers” vs “fewer columns than headers” mismatches to be user-facing and unambiguous. - Adjust
CsvParser.parse_records()to preserve the originalRecordParseErrordetail (instead of replacing it with a generic parsing error), plus add unit tests for these cases.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
airbyte_cdk/sources/file_based/file_types/csv_parser.py |
Adds empty-header validation and changes how parse errors are propagated from parse_records(). |
airbyte_cdk/sources/file_based/exceptions.py |
Rewords CSV mismatch error messages to be clearer for end users. |
unit_tests/sources/file_based/file_types/test_csv_parser.py |
Adds tests for empty header validation and for preserving mismatch error detail. |
Comments suppressed due to low confidence (1)
airbyte_cdk/sources/file_based/file_types/csv_parser.py:123
- The
_get_headers()docstring says it “will reset” the file pointer, but theCsvHeaderUserProvidedearly-return path doesn’t callfp.seek(0). Either update the docstring to reflect the actual behavior or ensure the pointer is reset consistently across all branches.
def _get_headers(self, fp: IOBase, config_format: CsvFormat, dialect_name: str) -> List[str]:
"""Assumes the fp is pointing to the beginning of the files and will reset it as such."""
# Note that this method assumes the dialect has already been registered if we're parsing the headers
if isinstance(config_format.header_definition, CsvHeaderUserProvided):
return config_format.header_definition.column_names
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…est assertion - Move csv.unregister_dialect to broader try/finally in read_data() to prevent dialect leak when _get_headers() raises AirbyteTracedException - Remove unnecessary catch/re-raise in parse_records(); let exceptions propagate naturally with original context - Guard data_generator.close() against UnboundLocalError if generator was never assigned - Assert expected_empty_count in test to validate internal_message Co-Authored-By: bot_apk <apk@cognition.ai>
|
/prerelease
|
Alfredo Garcia (agarctfi)
left a comment
There was a problem hiding this comment.
We tested this locally, and now it will fail in CHECK, with a valid error message.
Summary
When a CSV file has trailing empty columns (e.g.,
col1,col2,col3,,,), the CDK'scsv_parser.py_get_headers()silently accepts empty-string column names. These propagate into the discovered schema and catalog, causing the platform to fail with an opaque "Init container error" (KotlinInvalidNullExceptiononField.name). The customer never sees any hint about the CSV data issue.This PR adds validation, improves error messages, and fixes error propagation:
_get_headers()— raisesAirbyteTracedExceptionwithfailure_type=config_errorwhen CSV headers contain empty/whitespace-only column names, surfacing the problem during discover.ERROR_PARSING_RECORD_MISMATCHED_COLUMNSandERROR_PARSING_RECORD_MISMATCHED_ROWSto use clear user-facing language instead of confusing references toNone.parse_records()— removed unnecessary catch/re-raise that double-wrapped error messages; addeddata_generator = Noneguard againstUnboundLocalError.read_data()— movedcsv.unregister_dialect()to a broadertry/finallyso dialects are always cleaned up, even when_get_headers()raises.Resolves https://github.com/airbytehq/oncall/issues/12144:
Review & Testing Checklist for Human
col1,col2,col3,,,) — confirm the error surfaces during discover, not as an opaque platform errorRecommended test plan: upload a CSV file with trailing empty columns to an S3 source and run discover — you should see the new
config_errormessage instead of the previous "Init container error".Notes
ignore_errors_on_fields_mismatch), airbytehq/airbyte#20509 (empty columns array)Link to Devin session: https://app.devin.ai/sessions/c0ac93b0ed1a401ba346b7fcc93bc41b