Make calls to "open" deterministic (code from Samsung AI Center Cambridge) by rogiervd · Pull Request #2734 · speechbrain/speechbrain

rogiervd · 2024-10-28T17:45:10Z

What does this PR do?

This PR makes calls to open use UTF-8 encoding.

The current behaviour is that the encoding depends on the local environment. Under Linux, if environment variable LC_ALL=C.UTF-8 is set, files are read as UTF-8. However, we have found that in practice, behaviour changes from user to user or machine to machine, and transcriptions with non-ASCII characters get messed up. This PR makes the behaviour deterministic.

When opening files for csv, this also add newline="", as required by https://docs.python.org/3/library/csv.html#csv.reader.

Change after review: the PR now also adds a check for the encoding to the pre-commit checks.

Breaking changes

For users/machines that have LC_ALL=C.UTF-8 set, nothing changes. For users/machines that have LC_ALL set differently, the behaviour changes to the correct one.

Remarks

I believe I have changed all relevant calls to open but it is possible I have missed some. This highlights that it might have been better to have all parsing, including calls to open, in one place instead of scattered around.
This is a bit of a big PR that touches many files. I'd be happy to split it up into multiple changes per subdirectory or something if that's more useful.
I have not been able to run all recipes.

cc @TParcollet for visibility.

asumagic · 2024-10-30T09:44:37Z

I didn't realize Linux could even be configured to anything else than UTF-8, but this is likely more useful for Windows, which seems to default to cp1252 from what I could find.

This is a bit of a big PR that touches many files. I'd be happy to split it up into multiple changes per subdirectory or something if that's more useful.

Shouldn't be. I'm doubtful this change can break anything (save for typos, and unless it was already broken).

I think the linter we use is able to detect the open with missing encoding, maybe we should look into making it into a CI error down the line.

asumagic · 2024-10-30T09:47:05Z

Although the CRLF thing wrt. CSV file reading is interesting... Didn't know about that gotcha.

If newline='' is not specified, newlines embedded inside quoted fields will not be interpreted correctly, and on platforms that use \r\n linendings on write an extra \r will be added. It should always be safe to specify newline='', since the csv module does its own (universal) newline handling.

rogiervd · 2024-10-30T09:50:09Z

I think the linter we use is able to detect the open with missing encoding, maybe we should look into making it into a CI error down the line.

That would be best. I'm happy to try and switch this on, but I don't see it. Where can I do this?

asumagic · 2024-10-30T09:59:21Z

I remember encountering it while running linting tools locally, but upon closer inspection, I can only find it being implemented as part of a flake8 plugin, flake8-encodings. I didn't install it manually and it doesn't seem to be used in SpeechBrain, so I don't know what would have pulled it.

If I'm reading this right this might just be a matter of adding that package to the lint-requirements.txt?

TBH it wouldn't be a blocker for this PR but it would be better to have down the line.

asumagic · 2024-10-30T10:00:10Z

+ potentially in .flake8 adding ENC to select =

rogiervd · 2024-10-30T10:04:56Z

Let me try this, first in a separate PR. It would be best to solve this once and for all, if I can.

rogiervd · 2024-10-30T10:11:26Z

From local experimentation, it looks like installing flake8-encodings should be all that's necessary. c3d8b3b should switch this on, and find 64 (!) instances of open that I'd missed.

rogiervd · 2024-10-30T11:08:45Z

Is linting supposed to be run in CI? Because I don’t see where, and my commit attempting to switch on the detection of encodings hasn’t triggered a CI failure in the way I hoped it would.

asumagic · 2024-10-30T11:23:28Z

Seems like the pre-commit workflow (pre-commit.yaml) does not actually honor the lint-requirements unlike the test workflow (pythonapp.yml).

(Noting that requirements.txt does include lint-requirements.txt.)

Seems like an oversight, will address this in a separate PR quickly.

asumagic · 2024-10-30T11:26:42Z

Actually it might just be simpler to include the dependency in .pre-commit-config.yaml under flake8. pre-commit kind of does its own dependency management so it might be better to avoid pulling all SB dependencies there.

Then, it would probably make sense to remove it from lint-requirements.txt.

asumagic · 2024-10-30T11:36:08Z

Seems like this works as expected now.

rogiervd · 2024-10-30T11:36:13Z

Great. The linting (e34eab0) is now causing the CI to fail as expected.

They now use UTF-8 encoding. This is the same as the old behaviour if (on Linux) environment variable LC_ALL=C.UTF-8 was set. However, often this environment variable was not set and behaviour changed from environment to environment. For CSV files, this also uses newline="".

These had been forgotten, but the added flake8 test found them.

TParcollet · 2024-11-04T16:09:37Z

@asumagic I also don't see how this could break anything serious from the files touched. Feel free to merge when you want. As this is touching the CI, maybe @pplantinga wants to have a very quick look at the new requirements.

pplantinga

Good cleanup of the repo, making it more usable on Windows! LGTM

asumagic added this to the v1.1.0 milestone Oct 30, 2024

Rogier van Dalen added 5 commits October 30, 2024 16:08

Make flake8 check for encodings on open()

dd2d489

Move flake8-encodings requirement to .pre-commit-config.yaml

8f479d0

Make more calls to "open" deterministic

02dd647

These had been forgotten, but the added flake8 test found them.

Remove accidental addition to docstring

cfdc9aa

rogiervd force-pushed the open-encoding branch from b7f6fd9 to cfdc9aa Compare October 30, 2024 16:09

TParcollet added the correctness Functionality not objectively broken, but may be surprising or wrong e.g. regarding literature label Nov 4, 2024

pplantinga approved these changes Nov 4, 2024

View reviewed changes

pplantinga merged commit c4a4243 into speechbrain:develop Nov 4, 2024

Conversation

rogiervd commented Oct 28, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does this PR do?

Breaking changes

Remarks

Uh oh!

asumagic commented Oct 30, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

asumagic commented Oct 30, 2024

Uh oh!

rogiervd commented Oct 30, 2024

Uh oh!

asumagic commented Oct 30, 2024

Uh oh!

asumagic commented Oct 30, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

rogiervd commented Oct 30, 2024

Uh oh!

rogiervd commented Oct 30, 2024

Uh oh!

rogiervd commented Oct 30, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

asumagic commented Oct 30, 2024

Uh oh!

asumagic commented Oct 30, 2024

Uh oh!

asumagic commented Oct 30, 2024

Uh oh!

rogiervd commented Oct 30, 2024

Uh oh!

TParcollet commented Nov 4, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

pplantinga left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

rogiervd commented Oct 28, 2024 •

edited

Loading

asumagic commented Oct 30, 2024 •

edited

Loading

asumagic commented Oct 30, 2024 •

edited

Loading

rogiervd commented Oct 30, 2024 •

edited

Loading

TParcollet commented Nov 4, 2024 •

edited

Loading