IWSLT 2022 speech translation recipe by mzboito · Pull Request #1475 · speechbrain/speechbrain

mzboito · 2022-06-27T09:23:06Z

This is a recipe for wav2vec 2.0 fine-tuning on the speech translation task. It includes data processing for the Tamasheq-French dataset and the parameters of the best system in the low-resource task (which will be used as the baseline next year).

TParcollet · 2022-07-05T14:26:40Z

Hey @mzboito, did you finally solve your issues? I'm afraid I won't have time to look at all of this for a while :/

mzboito · 2022-07-05T14:29:03Z

Dear reviewers.
I updated some headers, removed some imports, and added the recipe information to tests/recipes.csv.
I'm unsure why the pre-commit checks are failing. It's my first time doing a PR, so I'd appreciate your guidance.

When I try to run "pytest tests" locally, I get an error related to fairseq's progress_bar.
This problem seems to happen even when I test the main branch of speechbrain. Do I need a specific fairseq distribution in order to run the tests? (By the way, my recipe doesn't use fairseq.)

Thanks for your time.

mzboito · 2022-07-05T14:39:46Z

Thanks @TParcollet, I just received the log from the failed tests. I fixed the errors (unused variables). Could you try to run it again?

mzboito · 2022-07-05T15:01:44Z

Hello again. Sorry about the mess! I fixed my formatting and now "pre-commit run --all-files" passes on my machine.

TParcollet

Hi @mzboito and thank you so much for this work! Once my three comments are addressed, we will be able to merge! Thanks again!

mzboito · 2022-08-31T10:32:06Z

Hello Titouan, thanks for all your comments! I applied all the changes and trained a model from scratch to make sure nothing breaks with the new tokenizer integrated into train.py.
My code passes pre-commit and the pytest tests execute without a problem, but when I tried to commit, I got a bunch of errors about the syntax of unrelated files. After some obscure git commands, I managed to push the new version. I hope everything works!
Thanks again!

anautsch · 2022-08-31T10:33:06Z

> The error is at speechbrain/alignment/ctc_segmentation.py, which I do not use.

Hi @mzboito, this is a known issue. There are scripts in the tests folder that shadow this. You can try running them locally:

./tests/.run-linters.sh
./tests/.run-unittests.sh
./tests/.run-doctests.sh

Just saw your message while I drafted this one - re-starting the tests, let's see.

TParcollet

One more change, and we're good to go! Thanks @mzboito

mzboito · 2022-08-31T16:35:20Z

Hi again Titouan! Thanks for the feedback: it looks much better like this. :)
I incorporated all your changes:

added the git clone step to the README
removed the data_proc folder
moved the Python script to the root folder of this recipe
added a call to this script using run_on_main in train.py
I hope everything is ok now. Let me know otherwise! :)

TParcollet · 2022-09-01T09:03:17Z

Nice! Lemme try it!

mzboito · 2022-09-01T09:34:47Z

Oh I'm sorry, the error is that I didn't update the recipe csv file!!

anautsch

it's just for the minor comments. lgtm otherwise !

( obligatory curiosity question: do you plan to upload the models ? )

anautsch · 2022-09-12T16:16:02Z

GitHub indicates a conflict in the file tests/recipes.csv. Can you please merge the latest develop version into your PR?
This should help resolve it. It looks like other recipes have been updated (no clue, though, why git flags it; it should not).

mzboito · 2022-09-13T09:49:52Z

Hello @anautsch ,
Thanks for your feedback. In the last commit I made the following changes:

I added the new recipes.csv copied from the main branch, with my recipe appended at the end.
Replaced the value inside debug with a parameter defined in the hparams file.
Changed the import for data_proc as requested.

I hope it works fine now! For some reason github is still saying there's a conflict at recipes.csv

anautsch · 2022-09-13T10:05:03Z

Thanks @mzboito !
It literally was just this now

<<<<<<< iwslt_speech_translation
recipe0144,ST,Tamasheq-French,recipes/IWSLT22_lowresource/train.py,recipes/IWSLT22_lowresource/hparams/train_w2v2_st.yaml,recipes/IWSLT22_lowresource/prepare_iwslt22.py,recipes/IWSLT22_lowresource/README.md,,,,
=======
>>>>>>> develop

Before, more lines were affected. It could simply be a tree/history versioning thing: GitHub still thinking of your version at branch time, and now this one being copied over. Resolving the merge cleared it up. Thanks for preparing this, it made things easy!
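A leftover conflict block like the one above is easy to miss in a long CSV. The following is a minimal sketch, not part of the PR, of how such markers could be located before pushing. The function name `find_conflict_markers` is hypothetical, and the sample row is shortened from the one quoted above.

```python
from typing import List, Tuple


def find_conflict_markers(text: str) -> List[Tuple[int, str]]:
    """Return (line_number, line) for every git conflict marker in text."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        is_marker = (
            line.startswith("<<<<<<< ")
            or line.startswith(">>>>>>> ")
            or line == "======="
        )
        if is_marker:
            hits.append((number, line))
    return hits


# Shortened version of the conflict block quoted in the comment above.
sample = "\n".join(
    [
        "<<<<<<< iwslt_speech_translation",
        "recipe0144,ST,Tamasheq-French,recipes/IWSLT22_lowresource/train.py",
        "=======",
        ">>>>>>> develop",
    ]
)

for number, line in find_conflict_markers(sample):
    print(f"line {number}: {line}")
```

On a real checkout the same function could be fed the contents of tests/recipes.csv read with `open(...).read()`.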

anautsch · 2022-09-13T10:09:43Z

@mzboito linters got to be kidding...

Trim Trailing Whitespace.................................................Failed
- hook id: trailing-whitespace
- exit code: 1
- files were modified by this hook

Fixing recipes/IWSLT22_lowresource/hparams/train_w2v2_st.yaml

edit: seriously, I can't see what it complains about - one empty line too much?

@@ -43,7 +43,7 @@ ckpt_interval_minutes: 15 # save checkpoint every N min
 sorting: ascending
 sorting_min_duration: 1
 # this replaces sorting_min_duration in debug mode
-sorting_debug_duration: 3 
+sorting_debug_duration: 3
 sorting_max_duration: 5

mzboito · 2022-09-13T10:12:12Z

Sorry @anautsch, I'm on a new machine and I forgot to install black and flake8!
The trailing character should be gone now.

anautsch · 2022-09-13T10:14:11Z

it's about trailing whitespaces, the other linters passed.

Fix End of Files.........................................................Passed
Fix requirements.txt.....................................................Passed
Mixed line ending........................................................Passed
Check for added large files..............................................Passed
black....................................................................Passed
flake8...................................................................Passed
yamllint.................................................................Passed

mzboito · 2022-09-13T10:21:46Z

That's a bit odd. I don't know why the linter is mad about my comment, but I moved it somewhere else. Let's see!

anautsch · 2022-09-13T10:22:20Z

Indeed. Heading for lunch; fingers crossed !

mzboito · 2022-09-13T10:28:53Z

Bon app @anautsch ! Locally, when I run black giving as input the hparams file I get the following error:
error: cannot format recipes/IWSLT22_lowresource/hparams/train_w2v2_st.yaml: Cannot parse: 13:11: __set_seed: !!python/object/apply:torch.manual_seed [!ref ]
All done! 💥 💔 💥
1 file failed to reformat.

I don't understand why just now it is complaining about this line. Before it was running fine. Moreover, I checked a different recipe, and the line is identical (e.g. https://github.com/speechbrain/speechbrain/blob/develop/templates/speaker_id/train.yaml)

Not sure how to fix this.

anautsch · 2022-09-13T12:25:56Z

> Bon app !

Thanks; let's see what we have here.

> Locally, when I run black giving as input the hparams file I get the following error [...] I don't understand why just now it is complaining about this line.

black should handle py files only; cf linters: git ls-files | grep -E "\.py$" | xargs black --check --diff

> I checked a different recipe, and the line is identical (e.g. https://github.com/speechbrain/speechbrain/blob/develop/templates/speaker_id/train.yaml)

It has the same error when given to black.

> Not sure how to fix this.

yamllint recipes/IWSLT22_lowresource/hparams/train_w2v2_st.yaml

gives

  19:81     warning  line too long (106 > 80 characters)  (line-length)
  31:81     warning  line too long (102 > 80 characters)  (line-length)
  42:81     warning  line too long (93 > 80 characters)  (line-length)
  75:81     warning  line too long (89 > 80 characters)  (line-length)
  198:61    error    no new line character at the end of file  (new-line-at-end-of-file)

When I added an empty line at the end that error disappeared (but it wasn't reported before either).

Chasing down this lead: grep -r trailing-whitespace .
=> .pre-commit-config.yaml points to https://github.com/pre-commit/pre-commit-hooks

So, I tried out: pre-commit run trailing-whitespace --files recipes/IWSLT22_lowresource/hparams/train_w2v2_st.yaml

Trim Trailing Whitespace.................................................Failed
- hook id: trailing-whitespace
- exit code: 1
- files were modified by this hook

Fixing recipes/IWSLT22_lowresource/hparams/train_w2v2_st.yaml

ran a meld on the supposedly fixed one and another local copy: Files are identical !!
Ran it a second time:

pre-commit run trailing-whitespace --files recipes/IWSLT22_lowresource/hparams/train_w2v2_st.yaml
Trim Trailing Whitespace.................................................Passed

Idk - some cache issue?

Please run twice:

pre-commit run trailing-whitespace --files recipes/IWSLT22_lowresource/hparams/train_w2v2_st.yaml
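One plausible reading of the fail-then-pass behaviour, consistent with the "files were modified by this hook" message: the hook rewrites the file on the first run and reports failure because it changed something, so the second run finds nothing left to fix. The sketch below is a simplified stand-in for that behaviour, not the actual pre-commit-hooks implementation; `strip_trailing_whitespace` is a hypothetical name and the sample text is taken from the hparams lines quoted in this thread.

```python
from typing import Tuple


def strip_trailing_whitespace(text: str) -> Tuple[str, int]:
    """Return (fixed_text, exit_code); exit_code is 1 if anything changed."""
    fixed = "\n".join(line.rstrip(" \t") for line in text.split("\n"))
    return fixed, (1 if fixed != text else 0)


# Note the single space after the "3" on the second line.
original = (
    "sorting_min_duration: 1\n"
    "sorting_debug_duration: 3 \n"
    "sorting_max_duration: 5\n"
)

first_pass, first_code = strip_trailing_whitespace(original)
second_pass, second_code = strip_trailing_whitespace(first_pass)

print("first run exit code:", first_code)    # the file was modified
print("second run exit code:", second_code)  # nothing left to fix
```

Under this reading, comparing the file after the first run with another already-clean copy would show no difference, because the fix has already been written to disk.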

anautsch · 2022-09-13T12:34:36Z

This is funny and frightening at the same time ... what is git up to...

$ git diff
diff --git a/recipes/IWSLT22_lowresource/hparams/train_w2v2_st.yaml b/recipes/IWSLT22_lowresource/hparams/train_w2v2_st.yaml
index a889f2e9..63fb2b10 100644
--- a/recipes/IWSLT22_lowresource/hparams/train_w2v2_st.yaml
+++ b/recipes/IWSLT22_lowresource/hparams/train_w2v2_st.yaml
@@ -42,7 +42,7 @@ ckpt_interval_minutes: 15 # save checkpoint every N min
 # Data sorting parameters: sorting_debug_duration replaces sorting_min_duration in debug mode
 sorting: ascending
 sorting_min_duration: 1
-sorting_debug_duration: 3 
+sorting_debug_duration: 3
 sorting_max_duration: 5

The only thing that comes to mind is the "\r\n" vs "\n" end-of-line character thing, or some other invisible character that is not an end-of-line and thus raises the above error because it is treated as whitespace, which is then trailing. Dunno.
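To tell an ordinary trailing space apart from a carriage return, the file can be inspected at the byte level. This is a minimal sketch under stated assumptions: `describe_line_endings` is a hypothetical helper, and the third sample line is given a CRLF ending purely for illustration (the diff above only shows a trailing space after the `3`).

```python
from typing import List, Tuple


def describe_line_endings(data: bytes) -> List[Tuple[int, str]]:
    """Report lines that end in a carriage return or in spaces/tabs."""
    findings = []
    for number, raw in enumerate(data.split(b"\n"), start=1):
        if raw.endswith(b"\r"):
            findings.append((number, "carriage return (CRLF line ending)"))
            raw = raw[:-1]
        if raw != raw.rstrip(b" \t"):
            findings.append((number, "trailing space or tab"))
    return findings


# Line 2 mirrors the diff above; line 3 is an invented CRLF example.
sample = (
    b"sorting_min_duration: 1\n"
    b"sorting_debug_duration: 3 \n"
    b"sorting_max_duration: 5\r\n"
)

for number, problem in describe_line_endings(sample):
    print(f"line {number}: {problem}")
```

For a real file, the bytes could come from `pathlib.Path(path).read_bytes()`, which avoids any newline translation by the text layer.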

mzboito · 2022-09-13T13:19:27Z

Hello @anautsch, sorry for the delay.
I think the problem was an extra space after the number 3. Sorry about that.
Now it's passing pre-commit.

pre-commit run trailing-whitespace --files recipes/IWSLT22_lowresource/hparams/train_w2v2_st.yaml
Trim Trailing Whitespace.................................................Passed

all points addressed/resolved

anautsch

one last bit - just went through a final time
(never ending story here... sorry for that)

(about the white space, at some point, I need to reconfigure my tools and/or clean my glasses... expected it but didn't catch it before)

mzboito added 2 commits June 27, 2022 11:18

IWSLT 2022 low-resource recipe

f861670

IWSLT 2022 low-resource recipe

76cad55

mravanelli requested a review from anautsch June 27, 2022 15:22

mravanelli added the enhancement New feature or request label Jun 28, 2022

mzboito added 2 commits July 5, 2022 16:23

recipe headers update/unused import removal/inclusion at recipes.csv

117709b

Merge branch 'develop' into iwslt_speech_translation

968852f

mzboito added 2 commits July 5, 2022 16:37

removed unused variables (errors in test 3.7)

19f5f22

Merge branch 'iwslt_speech_translation' of https://github.com/mzboito…

70c74e8

…/speechbrain into iwslt_speech_translation

mzboito added 2 commits July 5, 2022 16:54

flake8 friendly version of train.py

03e4132

pre-commit passed on local

b4f0b85

TParcollet requested changes Aug 27, 2022

Comment thread recipes/IWSLT22_lowresource/data_proc/prepare_tamasheq.sh Outdated

Comment thread recipes/IWSLT22_lowresource/data_proc/to_json.py

Comment thread recipes/IWSLT22_lowresource/train.py Outdated

changes in IWSLT22: new tokenizer and scripts names

3e82ee7

TParcollet previously requested changes Aug 31, 2022

Comment thread recipes/IWSLT22_lowresource/README.md Outdated

Comment thread recipes/IWSLT22_lowresource/train.py

changes in data processing: integrating script inside train.py main f…

72f3ef8

…unction

Update recipes.csv

6519b16

anautsch suggested changes Sep 12, 2022

Comment thread recipes/IWSLT22_lowresource/hparams/train_w2v2_st.yaml Outdated

anautsch reviewed Sep 12, 2022

Comment thread recipes/IWSLT22_lowresource/train.py Outdated

anautsch reviewed Sep 12, 2022

Comment thread recipes/IWSLT22_lowresource/train.py Outdated

changes asked by reviewer @anautsch

085b182

Merge branch 'develop' into iwslt_speech_translation

d7ea83b

black fix to train.py

d09becd

Merge branch 'iwslt_speech_translation' of https://github.com/mzboito…

e0c48ad

…/speechbrain into iwslt_speech_translation

removing comment at hparam that is triggering error

ad96763

fix trailing-whitespace

f1a2ade

anautsch approved these changes Sep 13, 2022

anautsch suggested changes Sep 13, 2022

Comment thread recipes/IWSLT22_lowresource/README.md Outdated

fix at README: wrong file nmae

9819e11

anautsch approved these changes Sep 13, 2022

anautsch merged commit 85c6a0d into speechbrain:develop Sep 13, 2022

mzboito deleted the iwslt_speech_translation branch September 13, 2022 15:37
