Skip to content

fix float columns being converted to string in from_csv#3043

Open
knQzx wants to merge 1 commit intospeechbrain:developfrom
knQzx:fix/csv-float-type-inference
Open

fix float columns being converted to string in from_csv#3043
knQzx wants to merge 1 commit intospeechbrain:developfrom
knQzx:fix/csv-float-type-inference

Conversation

@knQzx
Copy link
Copy Markdown

@knQzx knQzx commented Mar 29, 2026

load_data_csv was only casting the duration field to float, leaving all other numeric columns as strings. replaced that with a general int/float cast for all fields so numeric values are preserved properly.

fixes #3027

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates SpeechBrain’s CSV loader (DynamicItemDataset.from_csv via load_data_csv) to preserve numeric types by automatically casting numeric-looking fields to int/float, addressing #3027 where non-duration numeric columns were being loaded as strings.

Changes:

  • Update load_data_csv documentation to describe automatic numeric casting.
  • Replace the previous special-casing of duration with generic per-field numeric casting (int, then float).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread speechbrain/dataio/dataio.py Outdated
Comment on lines +105 to +109
Uses the SpeechBrain legacy CSV data format, where the CSV must have an
'ID' field.
If there is a field called duration, it is interpreted as a float.
The rest of the fields are left as they are (legacy _format and _opts fields
are not used to load the data in any special way).
Numeric fields are automatically cast to int or float where possible.
The legacy _format and _opts fields are not used to load the data in any
special way.
Copy link

Copilot AI Apr 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

load_data_csv now attempts int/float casting for every field, which is a backward-incompatible behavior change vs. the previous contract (“all fields left as-is except duration”). This will also coerce any numeric-looking string (e.g., speaker IDs like "0012", class labels like "1") into numbers and may silently drop leading zeros or change downstream type expectations. Consider adding a cast_numeric/infer_types flag (defaulting to the legacy behavior) or providing an explicit allow/deny list of columns to cast so callers can opt out of coercion where string semantics matter.

Copilot uses AI. Check for mistakes.
Comment thread speechbrain/dataio/dataio.py Outdated
Comment on lines +169 to +177
# Cast numeric values to int/float:
for key, value in row.items():
try:
row[key] = int(value)
except ValueError:
try:
row[key] = float(value)
except ValueError:
pass
Copy link

Copilot AI Apr 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The numeric casting loop relies on nested try/except ValueError for every cell. For large CSVs this can be a noticeable CPU cost, especially for non-numeric columns (paths, text) that will raise twice per value. Consider a lightweight pre-check (e.g., regex/fullmatch or fast character checks) to identify numeric candidates before attempting conversion, to avoid exceptions in the common non-numeric case.

Copilot uses AI. Check for mistakes.
Comment thread speechbrain/dataio/dataio.py Outdated
Comment on lines 169 to 178
# Cast numeric values to int/float:
for key, value in row.items():
try:
row[key] = int(value)
except ValueError:
try:
row[key] = float(value)
except ValueError:
pass
result[data_id] = row
Copy link

Copilot AI Apr 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change introduces new type-inference behavior in load_data_csv, but there are no unit tests asserting numeric columns stay numeric (including the regression case from #3027). Adding a focused test (e.g., in tests/unittests/test_data_io.py) that writes a small CSV with float/int columns and verifies the returned dict contains float/int types would help prevent future regressions.

Copilot uses AI. Check for mistakes.
@Adel-Moumen
Copy link
Copy Markdown
Collaborator

Hi @knQzx , thanks for your PR!

I think your PR try to address an important problem, but I am not so convinced by the way you've done it. I think this way of automatically -- and silently -- converting columns to "compatible" types might lead to incorrect behaviours. For instance, a string "00123" will be converted into an int "123" which loses information, the first two digits that might be important.

I think another solution could be to pass to DynamicItemDataset.from_csv a types argument (e.g. a dict) that explictly tells the type of each column e.g. "filename: str, duration: float, speak_id: int". This way we leave the conversion to the user rather than something automatic.

cc @pplantinga

@knQzx knQzx force-pushed the fix/csv-float-type-inference branch from f781a86 to 48ada05 Compare April 4, 2026 16:08
@knQzx
Copy link
Copy Markdown
Author

knQzx commented Apr 4, 2026

hey, thanks for the feedback - you're right, auto-casting is a bad idea since it silently breaks stuff like "00123" -> 123.

reworked it to use an explicit col_types dict like you suggested. now load_data_csv and from_csv accept col_types={"duration": float, "spk_id": int} etc - only casts what the user asks for, default behavior is unchanged.

added a test for it too. lmk if this looks better!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

DynamicItemDataset changes fields to string

4 participants