Add People's Speech (30,000 hours) Conformer ASR (Code from Samsung AI Center Cambridge)#2767
Conversation
Thanks @Adel-Moumen, fixed.
```python
else:
    hf_caching_dir = os.environ["XDG_CACHE_HOME"]

if hf_caching_dir != hf_download_folder:
```
Actually, I don't understand the variable `hf_download_folder`. What is it for?
That is where the arrow files are extracted ... and we don't want our users to be confused by HF hiding super heavy stuff in random places (what an absolutely horrendous software design).
I need to think more about this, but ok.
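To make the discussion above concrete, here is a minimal sketch of the kind of check being reviewed: resolving where HuggingFace will actually cache data and warning when that differs from the folder the user asked for. The helper name and the warning text are illustrative assumptions; only the `XDG_CACHE_HOME` fallback comes from the quoted diff, and `HF_DATASETS_CACHE` is a documented `datasets` environment variable.

```python
import os


def check_hf_cache_location(hf_download_folder):
    """Resolve where HuggingFace will cache data; warn if it differs
    from the folder the user asked the arrow files to be extracted to."""
    if "HF_DATASETS_CACHE" in os.environ:
        hf_caching_dir = os.environ["HF_DATASETS_CACHE"]
    elif "XDG_CACHE_HOME" in os.environ:
        hf_caching_dir = os.environ["XDG_CACHE_HOME"]
    else:
        # Default HuggingFace cache root when nothing is configured.
        hf_caching_dir = os.path.join(os.path.expanduser("~"), ".cache")

    if hf_caching_dir != hf_download_folder:
        print(
            f"Warning: HuggingFace will cache data under {hf_caching_dir}, "
            f"not under {hf_download_folder}. Heavy files may end up there."
        )
    return hf_caching_dir
```

The point of the check is exactly what the comment above describes: surfacing HF's hidden cache location so users are not surprised by large files in unexpected places.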
```python
kwargs={
    "hf_download_folder": hparams["hf_download_folder"],
    "subsets": hparams["subsets"],
    "save_folder": hparams["save_folder"],
```
Question: I found many recipes saving csv files directly inside the output_folder. In this case, you chose to save the manifests in the save_folder (i.e. output_folder/save). I was wondering: which one should we stick to in the future?
Yes, it's a long-standing issue; I don't have an answer for that, to be honest...
Just left some new comments. I ran the code and was able to extract the data and train the transformer model for a few steps. Let's merge after the new batch of comments is addressed :)
I agree with Titouan that we need a consistent way to handle dataset downloads. Here's one more place where this already exists in the repo: https://github.com/speechbrain/speechbrain/blob/develop/recipes/Voicebank/voicebank_prepare.py#L394 In that case it is just a function that users can call if they want to download, but it's not explicitly called in the recipe (iirc). I would be fine with either something like this, or a short additional file that calls such a function, that people can choose to run themselves.
Thanks @TParcollet :)
This PR introduces an ASR training recipe and optional data preparation for performing ASR with the People's Speech dataset.
To-do:
The question below has been partially answered here (see below).
This PR raises an important discussion that we must have with core maintainers and people wishing to participate (@pplantinga, @mravanelli, @Adel-Moumen, @asumagic, @poonehmousavi, @ycemsubakan, @Gastron): what should we do about HuggingFace dataset downloads?

HuggingFace datasets are going to become more and more common, and luckily the brilliant @Gastron built us a DynamicItemDataset and a set of functions that work with them out of the box, which means they are ultra simple to integrate in SB (as seen in this recipe). HOWEVER, HuggingFace datasets must be downloaded before being loaded. SpeechBrain's policy has always been to let the user download the dataset themselves; I think this totally makes sense, and I'd like to keep it that way. However, for Gigaspeech, for instance, we (I) already broke this rule (sigh).

My idea (Adel's, actually) would be to dissociate the data preparation and download scripts: we provide users with another .py that can be run before the recipe to download the dataset. OR we provide no script at all, to avoid maintenance, and just give our users a few instructions. The problem with the latter is that downloading a HuggingFace dataset is slightly more complex than wget-ing a link, and users could get it wrong, i.e. struggle to match it with our subsequent data preparation (csv creation).
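For illustration, the "dissociated download script" idea could look roughly like the sketch below: a small standalone file users run once before launching the recipe. The argument names, defaults, and the `MLCommons/peoples_speech` dataset id are assumptions for this sketch, not the actual PR code.

```python
# Hypothetical download_peoples_speech.py -- run once, before the recipe:
#     python download_peoples_speech.py --hf-download-folder /path/to/data
import argparse


def build_parser():
    parser = argparse.ArgumentParser(
        description="Download the People's Speech dataset ahead of training."
    )
    parser.add_argument(
        "--hf-download-folder",
        required=True,
        help="Where HuggingFace should store the downloaded arrow files.",
    )
    parser.add_argument(
        "--subset",
        default="clean",
        help="Dataset subset to fetch (illustrative default).",
    )
    return parser


def main(argv=None):
    args = build_parser().parse_args(argv)
    # Imported lazily so argument parsing stays testable without `datasets`.
    from datasets import load_dataset  # pip install datasets

    load_dataset(
        "MLCommons/peoples_speech",  # assumed HF dataset id
        args.subset,
        cache_dir=args.hf_download_folder,
    )


if __name__ == "__main__":
    main()
```

Keeping this file separate from the recipe means the recipe itself never touches the network, matching the policy discussed above.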
The proposal in this PR
It is up to the user to download the dataset via HuggingFace beforehand, much like we do for pretty much all the other recipes. I voluntarily deactivated HF's ability to go look for the dataset on the internet. This forces users to download it wherever they want beforehand. Being consistent is the most important thing, and it will ease maintenance.
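A minimal sketch of how "deactivating HF's ability to look on the internet" can be done: `HF_DATASETS_OFFLINE` and `HF_HUB_OFFLINE` are documented environment variables of the `datasets` and `huggingface_hub` libraries that forbid network access, so loading fails fast unless the data is already on disk. The helper names and the dataset id are assumptions for illustration, not the actual PR code.

```python
import os


def enable_hf_offline_mode():
    """Forbid HuggingFace libraries from reaching the internet; they will
    then only load datasets already present on disk."""
    os.environ["HF_DATASETS_OFFLINE"] = "1"
    os.environ["HF_HUB_OFFLINE"] = "1"  # also covers hub file resolution


def load_local_dataset(hf_download_folder):
    enable_hf_offline_mode()
    from datasets import load_dataset  # pip install datasets

    # With offline mode on, this only succeeds if the user has already
    # downloaded the dataset into hf_download_folder beforehand.
    return load_dataset(
        "MLCommons/peoples_speech",  # assumed dataset id
        cache_dir=hf_download_folder,
    )
```

Failing fast here is the point: a user who skipped the download step gets an immediate, explicit error instead of HF silently fetching terabytes into a hidden cache.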