fix float columns being converted to string in from_csv#3043
fix float columns being converted to string in from_csv#3043knQzx wants to merge 1 commit intospeechbrain:developfrom
Conversation
There was a problem hiding this comment.
Pull request overview
This PR updates SpeechBrain’s CSV loader (DynamicItemDataset.from_csv via load_data_csv) to preserve numeric types by automatically casting numeric-looking fields to int/float, addressing #3027 where non-duration numeric columns were being loaded as strings.
Changes:
- Update
load_data_csvdocumentation to describe automatic numeric casting. - Replace the previous special-casing of
durationwith generic per-field numeric casting (int, then float).
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| Uses the SpeechBrain legacy CSV data format, where the CSV must have an | ||
| 'ID' field. | ||
| If there is a field called duration, it is interpreted as a float. | ||
| The rest of the fields are left as they are (legacy _format and _opts fields | ||
| are not used to load the data in any special way). | ||
| Numeric fields are automatically cast to int or float where possible. | ||
| The legacy _format and _opts fields are not used to load the data in any | ||
| special way. |
There was a problem hiding this comment.
load_data_csv now attempts int/float casting for every field, which is a backward-incompatible behavior change vs. the previous contract (“all fields left as-is except duration”). This will also coerce any numeric-looking string (e.g., speaker IDs like "0012", class labels like "1") into numbers and may silently drop leading zeros or change downstream type expectations. Consider adding a cast_numeric/infer_types flag (defaulting to the legacy behavior) or providing an explicit allow/deny list of columns to cast so callers can opt out of coercion where string semantics matter.
| # Cast numeric values to int/float: | ||
| for key, value in row.items(): | ||
| try: | ||
| row[key] = int(value) | ||
| except ValueError: | ||
| try: | ||
| row[key] = float(value) | ||
| except ValueError: | ||
| pass |
There was a problem hiding this comment.
The numeric casting loop relies on nested try/except ValueError for every cell. For large CSVs this can be a noticeable CPU cost, especially for non-numeric columns (paths, text) that will raise twice per value. Consider a lightweight pre-check (e.g., regex/fullmatch or fast character checks) to identify numeric candidates before attempting conversion, to avoid exceptions in the common non-numeric case.
| # Cast numeric values to int/float: | ||
| for key, value in row.items(): | ||
| try: | ||
| row[key] = int(value) | ||
| except ValueError: | ||
| try: | ||
| row[key] = float(value) | ||
| except ValueError: | ||
| pass | ||
| result[data_id] = row |
There was a problem hiding this comment.
This change introduces new type-inference behavior in load_data_csv, but there are no unit tests asserting numeric columns stay numeric (including the regression case from #3027). Adding a focused test (e.g., in tests/unittests/test_data_io.py) that writes a small CSV with float/int columns and verifies the returned dict contains float/int types would help prevent future regressions.
|
Hi @knQzx , thanks for your PR! I think your PR try to address an important problem, but I am not so convinced by the way you've done it. I think this way of automatically -- and silently -- converting columns to "compatible" types might lead to incorrect behaviours. For instance, a string "00123" will be converted into an int "123" which loses information, the first two digits that might be important. I think another solution could be to pass to cc @pplantinga |
f781a86 to
48ada05
Compare
|
hey, thanks for the feedback - you're right, auto-casting is a bad idea since it silently breaks stuff like "00123" -> 123. reworked it to use an explicit col_types dict like you suggested. now load_data_csv and from_csv accept col_types={"duration": float, "spk_id": int} etc - only casts what the user asks for, default behavior is unchanged. added a test for it too. lmk if this looks better! |
load_data_csvwas only casting thedurationfield to float, leaving all other numeric columns as strings. replaced that with a general int/float cast for all fields so numeric values are preserved properly.fixes #3027