Make calls to "open" deterministic (code from Samsung AI Center Cambridge)#2734
Make calls to "open" deterministic (code from Samsung AI Center Cambridge)#2734pplantinga merged 5 commits intospeechbrain:developfrom
Conversation
|
I didn't realize Linux could even be configured to anything else than UTF-8, but this is likely more useful for Windows, which seems to default to cp1252 from what I could find.
Shouldn't be. I'm doubtful this change can break anything (save for typos, and unless it was already broken). I think the linter we use is able to detect the |
|
Although the CRLF thing wrt. CSV file reading is interesting... Didn't know about that gotcha.
|
That would be best. I'm happy to try and switch this on, but I don't see it. Where can I do this? |
|
I remember encountering it while running linting tools locally, but upon closer inspection, I can only find it being implemented as part of a flake8 plugin, If I'm reading this right this might just be a matter of adding that package to the TBH it wouldn't be a blocker for this PR but it would be better to have down the line. |
|
+ potentially in |
|
Let me try this, first in a separate PR. It would be best to solve this once and for all, if I can. |
|
From local experimentation, it looks like installing |
|
Is linting supposed to be run in CI? Because I don’t see where, and my commit attempting to switch on the detection of encodings hasn’t triggered a CI failure in the way I hoped it would. |
|
Seems like the pre-commit workflow ( (Noting that Seems like an oversight, will address this in a separate PR quickly. |
|
Actually it might just be simpler to include the dependency in Then, it would probably make sense to remove it from |
|
Seems like this works as expected now. |
|
Great. The linting (e34eab0) is now causing the CI to fail as expected. |
They now use UTF-8 encoding. This is the same as the old behaviour if (on Linux) environment variable LC_ALL=C.UTF-8 was set. However, often this environment variable was not set and behaviour changed from environment to environment. For CSV files, this also uses newline="".
These had been forgotten, but the added flake8 test found them.
b7f6fd9 to
cfdc9aa
Compare
|
@asumagic I also don't see how this could break anything serious from the files touched. Feel free to merge when you want. As this is touching the CI, maybe @pplantinga wants to have a very quick look at the new requirements. |
pplantinga
left a comment
There was a problem hiding this comment.
Good cleanup of the repo, making it more usable on Windows! LGTM
What does this PR do?
This PR makes calls to
openuse UTF-8 encoding.The current behaviour is that the encoding depends on the local environment. Under Linux, if environment variable
LC_ALL=C.UTF-8is set, files are read as UTF-8. However, we have found that in practice, behaviour changes from user to user or machine to machine, and transcriptions with non-ASCII characters get messed up. This PR makes the behaviour deterministic.When opening files for
csv, this also addnewline="", as required by https://docs.python.org/3/library/csv.html#csv.reader.Change after review: the PR now also adds a check for the encoding to the pre-commit checks.
Breaking changes
For users/machines that have
LC_ALL=C.UTF-8set, nothing changes. For users/machines that haveLC_ALLset differently, the behaviour changes to the correct one.Remarks
openbut it is possible I have missed some. This highlights that it might have been better to have all parsing, including calls toopen, in one place instead of scattered around.cc @TParcollet for visibility.