Skip to content

Add People's Speech (30,000 hours) Conformer ASR (Code from Samsung AI Center Cambridge)#2767

Merged
Adel-Moumen merged 57 commits intospeechbrain:developfrom
TParcollet:people_speech
Nov 28, 2024
Merged

Add People's Speech (30,000 hours) Conformer ASR (Code from Samsung AI Center Cambridge)#2767
Adel-Moumen merged 57 commits intospeechbrain:developfrom
TParcollet:people_speech

Conversation

@TParcollet
Copy link
Copy Markdown
Collaborator

@TParcollet TParcollet commented Nov 22, 2024

This PR introduces an ASR training and an optional data preparation for performing ASR with the People's Speech Dataset.

To-do:

  • Finish training the model
  • Answer the question of to download or not to download?
  • Add recipe testing. WE CAN T RECIPE TEST RECIPES BASED ON HF DATASET BECAUSE CSV ARE NOT USED

Bellow question has been partially answered here (see bellow)
This PR raises an important discussion that we must have with core maintainers and people wishing to participate (@pplantinga, @mravanelli, @Adel-Moumen, @asumagic, @poonehmousavi, @ycemsubakan, @Gastron): What should we do with HuggingFace datasets download? HuggingFace datasets are going to be more and more around, and luckily, the brilliant @Gastron built us a DynamicItemDataset and a set of functions that just work out of the box for it... which means that it is ultra simple to integrate in SB (as seen in this recipe). HOWEVER, HuggingFace datasets must be downloaded before being loaded. SpeechBrain politic has always been to let the user download the dataset, and I think this totally make sense, and i'd like to keep it that way. However, for Gigaspeech for instance, we (I) already broke this rule (sigh). My idea (Adel's actually), would be to dissociate the data_prep and download scripts. We should provide the users another .py that can be run before the recipe to download the dataset. OR we can also not provide any script to avoid maintenance and just give a few instructions to our users. The problem with the latter is that downloading an huggingface dataset is slightly more complex that wget-ing a link -- and users could get it wrong i.e. struggle to match it with our further data prep (csv creation).

The proposal in this PR
It is to the user to download the dataset via HuggingFace beforehand, much like we do for pretty much all the other recipes. I voluntarily deactivate the ability of HF to go look on the internet the dataset. This will force users to DL it where they want beforehand. Being consistent is the most important thing, and it will ease maintainance.

@TParcollet TParcollet added ready to review Waiting on reviewer to provide feedback and removed work in progress Not ready for merge labels Nov 25, 2024
Comment thread recipes/PeoplesSpeech/ASR/transformer/README.md Outdated
Comment thread recipes/PeoplesSpeech/ASR/transformer/extra_requirements.txt Outdated
Comment thread recipes/PeoplesSpeech/ASR/transformer/hparams/conformer_large.yaml Outdated
Comment thread recipes/PeoplesSpeech/ASR/transformer/hparams/conformer_large.yaml Outdated
Comment thread recipes/PeoplesSpeech/ASR/transformer/hparams/conformer_large.yaml
Comment thread recipes/PeoplesSpeech/peoples_speech_prepare.py Outdated
Comment thread recipes/PeoplesSpeech/peoples_speech_prepare.py
Comment thread recipes/PeoplesSpeech/peoples_speech_prepare.py
Comment thread recipes/PeoplesSpeech/peoples_speech_prepare.py
Comment thread recipes/PeoplesSpeech/ASR/transformer/train.py Outdated
@TParcollet
Copy link
Copy Markdown
Collaborator Author

thanks @Adel-Moumen fixed.

Copy link
Copy Markdown
Collaborator

@Adel-Moumen Adel-Moumen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

Comment thread recipes/PeoplesSpeech/peoples_speech_prepare.py Outdated
else:
hf_caching_dir = os.environ["XDG_CACHE_HOME"]

if hf_caching_dir != hf_download_folder:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually, I don't understand the variable hf_download_folder. What's the meaning of this one?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is where the arrow files are extracted ... and we don't want our users to be confused by HF hiding super heavy stuff in random places (what an absolutely horrendous software design).

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i need to think more about this but ok

Comment thread recipes/PeoplesSpeech/ASR/transformer/hparams/conformer_large.yaml
Comment thread recipes/PeoplesSpeech/peoples_speech_prepare.py Outdated
kwargs={
"hf_download_folder": hparams["hf_download_folder"],
"subsets": hparams["subsets"],
"save_folder": hparams["save_folder"],
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: I found many recipes saving csv directly inside of the output_folder. In this case, you made the choice of saving manifests in the save_folder (i.e. output_folder/save). I was wondering, but, which one should we stick to in the future?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's a long standing issue, I don't have an answer for that to be honest..

@Adel-Moumen
Copy link
Copy Markdown
Collaborator

Just left some new comments. I run the code and I could extract and train for a few steps the transformer model. Lets merge after the new batch of comments is fixed :)

@pplantinga
Copy link
Copy Markdown
Collaborator

I agree with Titouan that we need some consistent way for dataset download. Here's one more place where this already exists in the repo:

https://github.com/speechbrain/speechbrain/blob/develop/recipes/Voicebank/voicebank_prepare.py#L394

In this case this is just a function that users can call if they want to download, but its not explicitly called in the recipe (iirc). I would be fine with either something like this or a short additional file that calls a function like this. that people can choose to run themselves.

@Adel-Moumen Adel-Moumen merged commit dc57e8f into speechbrain:develop Nov 28, 2024
@Adel-Moumen
Copy link
Copy Markdown
Collaborator

Thanks @TParcollet :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

help wanted Extra attention is needed ready to review Waiting on reviewer to provide feedback recipes Changes to recipes only (add/edit)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants