Skip to content

fix: validate empty CSV column names and improve mismatch error messages#1010

Merged
Alfredo Garcia (agarctfi) merged 3 commits into
mainfrom
devin/1777652889-fix-csv-empty-column-validation
May 13, 2026
Merged

fix: validate empty CSV column names and improve mismatch error messages#1010
Alfredo Garcia (agarctfi) merged 3 commits into
mainfrom
devin/1777652889-fix-csv-empty-column-validation

Conversation

@devin-ai-integration
Copy link
Copy Markdown
Contributor

@devin-ai-integration devin-ai-integration Bot commented May 1, 2026

Summary

When a CSV file has trailing empty columns (e.g., col1,col2,col3,,,), the CDK's csv_parser.py _get_headers() silently accepts empty-string column names. These propagate into the discovered schema and catalog, causing the platform to fail with an opaque "Init container error" (KotlinInvalidNullException on Field.name). The customer never sees any hint about the CSV data issue.

This PR adds validation, improves error messages, and fixes error propagation:

  1. Validate empty column names in _get_headers() — raises AirbyteTracedException with failure_type=config_error when CSV headers contain empty/whitespace-only column names, surfacing the problem during discover.
  2. Improve mismatch error messages — updated ERROR_PARSING_RECORD_MISMATCHED_COLUMNS and ERROR_PARSING_RECORD_MISMATCHED_ROWS to use clear user-facing language instead of confusing references to None.
  3. Clean up error propagation in parse_records() — removed unnecessary catch/re-raise that double-wrapped error messages; added data_generator = None guard against UnboundLocalError.
  4. Fix dialect leak in read_data() — moved csv.unregister_dialect() to a broader try/finally so dialects are always cleaned up, even when _get_headers() raises.

Resolves https://github.com/airbytehq/oncall/issues/12144:

Review & Testing Checklist for Human

  • Verify the empty column name error message is clear and actionable for end users
  • Test with a real CSV file containing trailing commas in the header row (e.g., col1,col2,col3,,,) — confirm the error surfaces during discover, not as an opaque platform error
  • Verify existing CSV sources with valid headers are unaffected (regression check)

Recommended test plan: upload a CSV file with trailing empty columns to an S3 source and run discover — you should see the new config_error message instead of the previous "Init container error".

Notes

Link to Devin session: https://app.devin.ai/sessions/c0ac93b0ed1a401ba346b7fcc93bc41b

@devin-ai-integration
Copy link
Copy Markdown
Contributor Author

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

  • Disable automatic comment and CI monitoring

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 1, 2026

👋 Greetings, Airbyte Team Member!

Here are some helpful tips and reminders for your convenience.

💡 Show Tips and Tricks

Testing This CDK Version

You can test this version of the CDK using the following:

# Run the CLI from this branch:
uvx 'git+https://github.com/airbytehq/airbyte-python-cdk.git@devin/1777652889-fix-csv-empty-column-validation#egg=airbyte-python-cdk[dev]' --help

# Update a connector to use the CDK from this branch ref:
cd airbyte-integrations/connectors/source-example
poe use-cdk-branch devin/1777652889-fix-csv-empty-column-validation

PR Slash Commands

Airbyte Maintainers can execute the following slash commands on your PR:

  • /autofix - Fixes most formatting and linting issues
  • /poetry-lock - Updates poetry.lock file
  • /test - Runs connector tests with the updated CDK
  • /prerelease - Triggers a prerelease publish with default arguments
  • /poe build - Regenerate git-committed build artifacts, such as the pydantic models which are generated from the manifest JSON schema in YAML.
  • /poe <command> - Runs any poe command in the CDK environment
📚 Show Repo Guidance

Helpful Resources

📝 Edit this welcome message.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 1, 2026

PyTest Results (Fast)

4 066 tests  +7   4 055 ✅ +7   6m 20s ⏱️ - 1m 25s
    1 suites ±0      11 💤 ±0 
    1 files   ±0       0 ❌ ±0 

Results for commit 8dcbe83. ± Comparison against base commit d3d1346.

♻️ This comment has been updated with latest results.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 1, 2026

PyTest Results (Full)

4 069 tests  +7   4 057 ✅ +7   10m 57s ⏱️ -1s
    1 suites ±0      12 💤 ±0 
    1 files   ±0       0 ❌ ±0 

Results for commit 8dcbe83. ± Comparison against base commit d3d1346.

♻️ This comment has been updated with latest results.

@devin-ai-integration
Copy link
Copy Markdown
Contributor Author

↩️ Triggering /ai-review per Hands-Free AI Triage Project triage next step.

Reason: CDK PR is linked to a connector oncall issue, CI is passing or already has local/full pytest evidence, and no AI review marker is present.

Devin session

@devin-ai-integration
Copy link
Copy Markdown
Contributor Author

Correction: attempted to trigger the CDK /ai-review workflow programmatically, but this repository does not expose an ai-review-command.yml workflow and slash_command_dispatch.yml is not workflow-dispatchable. No AI review workflow was actually started.

This PR remains marked for human review/next-step decision.

Devin session

@agarctfi
Copy link
Copy Markdown
Contributor

Alfredo Garcia (agarctfi) commented May 13, 2026

/prerelease

Prerelease Job Info

This job triggers the publish workflow with default arguments to create a prerelease.

Prerelease job started... Check job output.

✅ Prerelease workflow triggered successfully.

View the publish workflow run: https://github.com/airbytehq/airbyte-python-cdk/actions/runs/25825981629

@agarctfi Alfredo Garcia (agarctfi) marked this pull request as ready for review May 13, 2026 21:13
Copilot AI review requested due to automatic review settings May 13, 2026 21:13
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR improves the file-based CDK CSV parsing UX by (1) failing fast on malformed CSV headers that contain empty/whitespace-only column names and (2) making CSV field-mismatch errors clearer and more actionable during parsing/discovery.

Changes:

  • Add validation in _CsvReader._get_headers() to raise a FailureType.config_error when any header column name is empty/whitespace-only.
  • Update FileBasedSourceError messages for “more columns than headers” vs “fewer columns than headers” mismatches to be user-facing and unambiguous.
  • Adjust CsvParser.parse_records() to preserve the original RecordParseError detail (instead of replacing it with a generic parsing error), plus add unit tests for these cases.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File Description
airbyte_cdk/sources/file_based/file_types/csv_parser.py Adds empty-header validation and changes how parse errors are propagated from parse_records().
airbyte_cdk/sources/file_based/exceptions.py Rewords CSV mismatch error messages to be clearer for end users.
unit_tests/sources/file_based/file_types/test_csv_parser.py Adds tests for empty header validation and for preserving mismatch error detail.
Comments suppressed due to low confidence (1)

airbyte_cdk/sources/file_based/file_types/csv_parser.py:123

  • The _get_headers() docstring says it “will reset” the file pointer, but the CsvHeaderUserProvided early-return path doesn’t call fp.seek(0). Either update the docstring to reflect the actual behavior or ensure the pointer is reset consistently across all branches.
    def _get_headers(self, fp: IOBase, config_format: CsvFormat, dialect_name: str) -> List[str]:
        """Assumes the fp is pointing to the beginning of the files and will reset it as such."""
        # Note that this method assumes the dialect has already been registered if we're parsing the headers
        if isinstance(config_format.header_definition, CsvHeaderUserProvided):
            return config_format.header_definition.column_names


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread airbyte_cdk/sources/file_based/file_types/csv_parser.py
Comment thread airbyte_cdk/sources/file_based/file_types/csv_parser.py Outdated
Comment thread unit_tests/sources/file_based/file_types/test_csv_parser.py
…est assertion

- Move csv.unregister_dialect to broader try/finally in read_data() to
  prevent dialect leak when _get_headers() raises AirbyteTracedException
- Remove unnecessary catch/re-raise in parse_records(); let exceptions
  propagate naturally with original context
- Guard data_generator.close() against UnboundLocalError if generator
  was never assigned
- Assert expected_empty_count in test to validate internal_message

Co-Authored-By: bot_apk <apk@cognition.ai>
@ZaneHyattAB
Copy link
Copy Markdown
Contributor

ZaneHyattAB commented May 13, 2026

/prerelease

Prerelease Job Info

This job triggers the publish workflow with default arguments to create a prerelease.

Prerelease job started... Check job output.

✅ Prerelease workflow triggered successfully.

View the publish workflow run: https://github.com/airbytehq/airbyte-python-cdk/actions/runs/25827509667

Copy link
Copy Markdown
Contributor

@agarctfi Alfredo Garcia (agarctfi) left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We tested this locally, and now it will fail in CHECK, with a valid error message.

@agarctfi Alfredo Garcia (agarctfi) merged commit 19a7083 into main May 13, 2026
31 of 33 checks passed
@agarctfi Alfredo Garcia (agarctfi) deleted the devin/1777652889-fix-csv-empty-column-validation branch May 13, 2026 21:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants