Skip to content

IWSLT 2022 speech translation recipe#1475

Merged
anautsch merged 18 commits intospeechbrain:developfrom
mzboito:iwslt_speech_translation
Sep 13, 2022
Merged

IWSLT 2022 speech translation recipe#1475
anautsch merged 18 commits intospeechbrain:developfrom
mzboito:iwslt_speech_translation

Conversation

@mzboito
Copy link
Copy Markdown
Contributor

@mzboito mzboito commented Jun 27, 2022

This is a recipe for wav2vec 2.0 fine-tuning in the speech translation task. It includes data processing for the Tamasheq-French dataset, and the parameters from the best system in the low-resource task (that will be used as baseline next year).

@mravanelli mravanelli requested a review from anautsch June 27, 2022 15:22
@mravanelli mravanelli added the enhancement New feature or request label Jun 28, 2022
@TParcollet
Copy link
Copy Markdown
Collaborator

Hey @mzboito did you finally solved your issues ? I won't have time to have a look at all that before a while I am afraid :/

@mzboito
Copy link
Copy Markdown
Contributor Author

mzboito commented Jul 5, 2022

Dear reviewers.
I updated some headers, removed some imports and added information at the tests/recipes.csv.
I'm unsure about why the pre-commits are failing. It's my first time doing a PR, so I'll appreciate your guidance.

When I try to run "pytest tests" locally, I get an error related to fairseq's progress_bar.
This problem seems to happen even when I'm testing the main branch of speechbrain. Do I need an specific fairseq distribution in order to run the tests? (btw, my recipe doesn't use fairseq)

Thanks for your time.

@mzboito
Copy link
Copy Markdown
Contributor Author

mzboito commented Jul 5, 2022

Thanks @TParcollet, I just received the log from the failed tests. I fixed the errors (unused variables). Could you try to run it again?

@mzboito
Copy link
Copy Markdown
Contributor Author

mzboito commented Jul 5, 2022

Hello again. Sorry about the mess! I fixed my formatting and now "pre-commit run --all-files" passes on my machine.

Copy link
Copy Markdown
Collaborator

@TParcollet TParcollet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @mzboito and thank you so much for this work! Once my tree comment will be fixed, we will be able to merge! Thanks again!

Comment thread recipes/IWSLT22_lowresource/data_proc/prepare_tamasheq.sh Outdated
Comment thread recipes/IWSLT22_lowresource/data_proc/to_json.py
Comment thread recipes/IWSLT22_lowresource/train.py Outdated
@mzboito
Copy link
Copy Markdown
Contributor Author

mzboito commented Aug 31, 2022

Hello Titouan, thanks for all your comments! I applied all the changes and trained a model from scratch to make sure nothing breaks with the new tokenizer integrated into the train.py.
My code passes the pre-commit and the pytest tests executes without a problem, but when I tried to commit, I got a bunch of errors about the syntax of unrelated files. After some obscure git commands, I managed to push the new version. I hope everything works!
Thanks again!

@anautsch
Copy link
Copy Markdown
Collaborator

The error is at speechbrain/alignment/ctc_segmentation.py, which I do not use.

Hi @mzboito known issue. There are scripts in the tests folder that shadow this. You can try to run locally:

./tests/.run-linters.sh
./tests/.run-unittests.sh
./tests/.run-doctests.sh

Just saw your message while I drafted this one - re-starting the tests, let's see.

Copy link
Copy Markdown
Collaborator

@TParcollet TParcollet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One more change, and we're good to go! Thanks @mzboito

Comment thread recipes/IWSLT22_lowresource/README.md Outdated
Comment thread recipes/IWSLT22_lowresource/train.py
@mzboito
Copy link
Copy Markdown
Contributor Author

mzboito commented Aug 31, 2022

Hi again Titouan! Thanks for the feedback: it looks much better like this. :)
I incorporated all your changes:

  1. added git clone to read me
  2. removed data_proc folder
  3. moved the python script to the root folder for this recipe
  4. added call to this script using run_on_main at train.py
    I hope everything is ok now. Let me know otherwise! :)

@TParcollet
Copy link
Copy Markdown
Collaborator

Nice! Lemme try it!

@mzboito
Copy link
Copy Markdown
Contributor Author

mzboito commented Sep 1, 2022

Oh I'm sorry, the error is that I didn't update the recipe csv file!!

Copy link
Copy Markdown
Collaborator

@anautsch anautsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's just for the minor comments. lgtm otherwise !

( obligatory curiosity question: do you plan to upload the models ? )

Comment thread recipes/IWSLT22_lowresource/hparams/train_w2v2_st.yaml Outdated
Comment thread recipes/IWSLT22_lowresource/train.py Outdated
Comment thread recipes/IWSLT22_lowresource/train.py Outdated
@anautsch
Copy link
Copy Markdown
Collaborator

GitHub indicates a conflict with the file tests/recipes.csv - can you please fetch the latest develop version into your PR?
This should help resolving it. It looks like other recipes have updates (no clue though why git remarks it; it should not).

@mzboito
Copy link
Copy Markdown
Contributor Author

mzboito commented Sep 13, 2022

Hello @anautsch ,
Thanks for your feedback. In the last commit I did the following changes:

  1. I added new recipes.csv copied from the main branch. I added my recipe at the end.
  2. Replaced the value inside debug by a parameter defined at the hparams file
  3. Changed the import for the data_proc as requested.

I hope it works fine now! For some reason github is still saying there's a conflict at recipes.csv

@anautsch
Copy link
Copy Markdown
Collaborator

Hello @anautsch , Thanks for your feedback. In the last commit I did the following changes:

  1. I added new recipes.csv copied from the main branch. I added my recipe at the end.
  2. Replaced the value inside debug by a parameter defined at the hparams file
  3. Changed the import for the data_proc as requested.

I hope it works fine now! For some reason github is still saying there's a conflict at recipes.csv

Thanks @mzboito !
It literally was just this now

<<<<<<< iwslt_speech_translation
recipe0144,ST,Tamasheq-French,recipes/IWSLT22_lowresource/train.py,recipes/IWSLT22_lowresource/hparams/train_w2v2_st.yaml,recipes/IWSLT22_lowresource/prepare_iwslt22.py,recipes/IWSLT22_lowresource/README.md,,,,
=======
>>>>>>> develop

before, more lines were affected. It could be simply a tree/history versioning thingy (github still thinking of your version at branch time, and now this one being copied over - so a resolve merge cleared it up - thanks for preparing this, so it was easy!)

@anautsch
Copy link
Copy Markdown
Collaborator

anautsch commented Sep 13, 2022

@mzboito linters got to be kidding...

Trim Trailing Whitespace.................................................Failed
- hook id: trailing-whitespace
- exit code: 1
- files were modified by this hook

Fixing recipes/IWSLT22_lowresource/hparams/train_w2v2_st.yaml

edit: seriously, I can't see what it complains about - one empty line too much?

@@ -43,7 +43,7 @@ ckpt_interval_minutes: 15 # save checkpoint every N min
 sorting: ascending
 sorting_min_duration: 1
 # this replaces sorting_min_duration in debug mode
-sorting_debug_duration: 3 
+sorting_debug_duration: 3
 sorting_max_duration: 5
 

@mzboito
Copy link
Copy Markdown
Contributor Author

mzboito commented Sep 13, 2022

Sorry @anautsch , I'm in a new machine and i forgot to install black and flake8!
The trailing character should be gone now.

@anautsch
Copy link
Copy Markdown
Collaborator

it's about trailing whitespaces, the other linters passed.

Fix End of Files.........................................................Passed
Fix requirements.txt.....................................................Passed
Mixed line ending........................................................Passed
Check for added large files..............................................Passed
black....................................................................Passed
flake8...................................................................Passed
yamllint.................................................................Passed

@mzboito
Copy link
Copy Markdown
Contributor Author

mzboito commented Sep 13, 2022

That's a bit odd, I don't know why linters is mad about my comment, but I moved it somewhere else. Let's see!

@anautsch
Copy link
Copy Markdown
Collaborator

That's a bit odd, I don't know why linters is mad about my comment, but I moved it somewhere else. Let's see!

Indeed. Heading for lunch; fingers crossed !

@mzboito
Copy link
Copy Markdown
Contributor Author

mzboito commented Sep 13, 2022

Bon app @anautsch ! Locally, when I run black giving as input the hparams file I get the following error:
error: cannot format recipes/IWSLT22_lowresource/hparams/train_w2v2_st.yaml: Cannot parse: 13:11: __set_seed: !!python/object/apply:torch.manual_seed [!ref ]
All done! 💥 💔 💥
1 file failed to reformat.

I don't understand why just now it is complaining about this line. Before it was running fine. Moreover, I checked a different recipe, and the line is identical (e.g. https://github.com/speechbrain/speechbrain/blob/develop/templates/speaker_id/train.yaml)

Not sure how to fix this.

@anautsch
Copy link
Copy Markdown
Collaborator

anautsch commented Sep 13, 2022

Bon app !

Thanks; let's see what we have here.

Locally, when I run black giving as input the hparams file I get the following error [...] I don't understand why just now it is complaining about this line.

black should handle py files only; cf linters: git ls-files | grep -E "\.py$" | xargs black --check --diff

I checked a different recipe, and the line is identical (e.g. https://github.com/speechbrain/speechbrain/blob/develop/templates/speaker_id/train.yaml)

It has the same error when given to black.

Not sure how to fix this.

yamllint recipes/IWSLT22_lowresource/hparams/train_w2v2_st.yaml

gives

  19:81     warning  line too long (106 > 80 characters)  (line-length)
  31:81     warning  line too long (102 > 80 characters)  (line-length)
  42:81     warning  line too long (93 > 80 characters)  (line-length)
  75:81     warning  line too long (89 > 80 characters)  (line-length)
  198:61    error    no new line character at the end of file  (new-line-at-end-of-file)

When I added an empty line at the end that error disappeared (but it wasn't reported before either).

Chasing down this lead: grep -r trailing-whitespace .
=> .pre-commit-config.yaml points to https://github.com/pre-commit/pre-commit-hooks

So, I tried out: pre-commit run trailing-whitespace --files recipes/IWSLT22_lowresource/hparams/train_w2v2_st.yaml

Trim Trailing Whitespace.................................................Failed
- hook id: trailing-whitespace
- exit code: 1
- files were modified by this hook

Fixing recipes/IWSLT22_lowresource/hparams/train_w2v2_st.yaml

ran a meld on the supposedly fixed one and another local copy: Files are identical !!
Ran it a second time:

pre-commit run trailing-whitespace --filesml recipes/IWSLT22_lowresource/hparams/train_w2v2_st.yaml
Trim Trailing Whitespace.................................................Passed

Idk - some cache issue?

Please run twice:

pre-commit run trailing-whitespace --files recipes/IWSLT22_lowresource/hparams/train_w2v2_st.yaml

@anautsch
Copy link
Copy Markdown
Collaborator

anautsch commented Sep 13, 2022

This is funny and frightening at the same time ... what is git up to...

$ git diff
diff --git a/recipes/IWSLT22_lowresource/hparams/train_w2v2_st.yaml b/recipes/IWSLT22_lowresource/hparams/train_w2v2_st.yaml
index a889f2e9..63fb2b10 100644
--- a/recipes/IWSLT22_lowresource/hparams/train_w2v2_st.yaml
+++ b/recipes/IWSLT22_lowresource/hparams/train_w2v2_st.yaml
@@ -42,7 +42,7 @@ ckpt_interval_minutes: 15 # save checkpoint every N min
 # Data sorting parameters: sorting_debug_duration replaces sorting_min_duration in debug mode
 sorting: ascending
 sorting_min_duration: 1
-sorting_debug_duration: 3 
+sorting_debug_duration: 3
 sorting_max_duration: 5

the only thing coming to mind is "\n\r" vs "\n" end of the line character thingy; or some other 'invisible' command that is not an end-of-line and thus raises the above error as its casted as "whitespace" which is then trailing. dunno.

@mzboito
Copy link
Copy Markdown
Contributor Author

mzboito commented Sep 13, 2022

Hello @anautsch , sorry for the delay.
I think the problem was an extra whitespace after the number 3. Sorry about that.
Now it's passing pre-commit.

pre-commit run trailing-whitespace --files recipes/IWSLT22_lowresource/hparams/train_w2v2_st.yaml
Trim Trailing Whitespace.................................................Passed

@anautsch anautsch dismissed TParcollet’s stale review September 13, 2022 13:51

all points addressed/resolved

Copy link
Copy Markdown
Collaborator

@anautsch anautsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one last bit - just went through a final time
(never ending story here... sorry for that)

(about the white space, at some point, I need to reconfigure my tools and/or clean my glasses... expected it but didn't catch it before)

Comment thread recipes/IWSLT22_lowresource/README.md Outdated
@anautsch anautsch merged commit 85c6a0d into speechbrain:develop Sep 13, 2022
@mzboito mzboito deleted the iwslt_speech_translation branch September 13, 2022 15:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants