fix float columns being converted to string in from_csv by knQzx · Pull Request #3043 · speechbrain/speechbrain

knQzx · 2026-03-29T09:05:10Z

load_data_csv was only casting the duration field to float, leaving all other numeric columns as strings. replaced that with a general int/float cast for all fields so numeric values are preserved properly.

fixes #3027

Copilot

Pull request overview

This PR updates SpeechBrain’s CSV loader (DynamicItemDataset.from_csv via load_data_csv) to preserve numeric types by automatically casting numeric-looking fields to int/float, addressing #3027 where non-duration numeric columns were being loaded as strings.

Changes:

Update load_data_csv documentation to describe automatic numeric casting.
Replace the previous special-casing of duration with generic per-field numeric casting (int, then float).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-04-03T15:49:31Z

    Uses the SpeechBrain legacy CSV data format, where the CSV must have an
    'ID' field.
-    If there is a field called duration, it is interpreted as a float.
-    The rest of the fields are left as they are (legacy _format and _opts fields
-    are not used to load the data in any special way).
+    Numeric fields are automatically cast to int or float where possible.
+    The legacy _format and _opts fields are not used to load the data in any
+    special way.


load_data_csv now attempts int/float casting for every field, which is a backward-incompatible behavior change vs. the previous contract (“all fields left as-is except duration”). This will also coerce any numeric-looking string (e.g., speaker IDs like "0012", class labels like "1") into numbers and may silently drop leading zeros or change downstream type expectations. Consider adding a cast_numeric/infer_types flag (defaulting to the legacy behavior) or providing an explicit allow/deny list of columns to cast so callers can opt out of coercion where string semantics matter.

Copilot · 2026-04-03T15:49:31Z

+            # Cast numeric values to int/float:
+            for key, value in row.items():
+                try:
+                    row[key] = int(value)
+                except ValueError:
+                    try:
+                        row[key] = float(value)
+                    except ValueError:
+                        pass


The numeric casting loop relies on nested try/except ValueError for every cell. For large CSVs this can be a noticeable CPU cost, especially for non-numeric columns (paths, text) that will raise twice per value. Consider a lightweight pre-check (e.g., regex/fullmatch or fast character checks) to identify numeric candidates before attempting conversion, to avoid exceptions in the common non-numeric case.

Copilot · 2026-04-03T15:49:32Z

+            # Cast numeric values to int/float:
+            for key, value in row.items():
+                try:
+                    row[key] = int(value)
+                except ValueError:
+                    try:
+                        row[key] = float(value)
+                    except ValueError:
+                        pass
            result[data_id] = row


This change introduces new type-inference behavior in load_data_csv, but there are no unit tests asserting numeric columns stay numeric (including the regression case from #3027). Adding a focused test (e.g., in tests/unittests/test_data_io.py) that writes a small CSV with float/int columns and verifies the returned dict contains float/int types would help prevent future regressions.

Adel-Moumen · 2026-04-03T15:56:48Z

Hi @knQzx , thanks for your PR!

I think your PR try to address an important problem, but I am not so convinced by the way you've done it. I think this way of automatically -- and silently -- converting columns to "compatible" types might lead to incorrect behaviours. For instance, a string "00123" will be converted into an int "123" which loses information, the first two digits that might be important.

I think another solution could be to pass to DynamicItemDataset.from_csv a types argument (e.g. a dict) that explictly tells the type of each column e.g. "filename: str, duration: float, speak_id: int". This way we leave the conversion to the user rather than something automatic.

cc @pplantinga

knQzx · 2026-04-04T16:08:49Z

hey, thanks for the feedback - you're right, auto-casting is a bad idea since it silently breaks stuff like "00123" -> 123.

reworked it to use an explicit col_types dict like you suggested. now load_data_csv and from_csv accept col_types={"duration": float, "spk_id": int} etc - only casts what the user asks for, default behavior is unchanged.

added a test for it too. lmk if this looks better!

Adel-Moumen assigned pplantinga and unassigned pplantinga Apr 3, 2026

Adel-Moumen requested review from Copilot and pplantinga April 3, 2026 15:46

Copilot started reviewing on behalf of Adel-Moumen April 3, 2026 15:46 View session

Copilot AI reviewed Apr 3, 2026

View reviewed changes

use explicit col_types dict instead of auto-casting

48ada05

knQzx force-pushed the fix/csv-float-type-inference branch from f781a86 to 48ada05 Compare April 4, 2026 16:08

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix float columns being converted to string in from_csv#3043

fix float columns being converted to string in from_csv#3043
knQzx wants to merge 1 commit intospeechbrain:developfrom
knQzx:fix/csv-float-type-inference

knQzx commented Mar 29, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Apr 3, 2026

Uh oh!

Copilot AI Apr 3, 2026

Uh oh!

Copilot AI Apr 3, 2026

Uh oh!

Adel-Moumen commented Apr 3, 2026

Uh oh!

knQzx commented Apr 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

knQzx commented Mar 29, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI Apr 3, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 3, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 3, 2026

Choose a reason for hiding this comment

Uh oh!

Adel-Moumen commented Apr 3, 2026

Uh oh!

knQzx commented Apr 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants