Librilight data preparation for SpeechBrain SSL (code from Samsung AI Center Cambridge)#2765
Conversation
Adel-Moumen
left a comment
Hey @shucongzhang, how are you doing? Nice to see a data prep for Libri-Light! I left some early comments in case you want to take a look. Also, do you plan to add a recipe, e.g. BEST-RQ? It would be nice to have a working prototype alongside this data prep. Thanks :)
```python
import argparse
import csv
import multiprocessing
```
Could you please use `parallel_map` from our toolkit instead? (see: https://github.com/speechbrain/speechbrain/blob/develop/recipes/CommonVoice/common_voice_prepare.py#L368)
```python
    n_processes: int
        Number of parallel processes
    """
    print("Processing each subfolder of this split")
```
Could you please use our loggers instead of `print`? (e.g. https://github.com/speechbrain/speechbrain/blob/develop/recipes/CommonVoice/common_voice_prepare.py#L24)
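For reference, a minimal sketch of the module-level logger pattern used across SpeechBrain prepare scripts (the function name here is illustrative, not from the PR):

```python
import logging

# Module-level logger, mirroring other SpeechBrain prepare scripts
logger = logging.getLogger(__name__)

def report_split(split_name):
    # Replaces print("Processing each subfolder of this split")
    logger.info("Processing each subfolder of split %s", split_name)

if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    report_split("small")
```

Unlike `print`, this lets the recipe's logging config control verbosity and destination.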
```python
with multiprocessing.Pool(processes=n_processes) as pool:
    for _ in tqdm.tqdm(
        pool.imap_unordered(make_csv_for_each, tasks), total=len(tasks)
    ):
        pass
```
This code will have to change accordingly once `parallel_map` is used.
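I haven't pinned down `parallel_map`'s exact signature here; as a stdlib approximation of the same shape (one worker call per task, results consumed lazily), the refactor could look like the sketch below, with `make_csv_for_each` standing in for the real per-subfolder worker:

```python
from concurrent.futures import ThreadPoolExecutor

def make_csv_for_each(task):
    # Stand-in worker: the real one writes one csv per subfolder
    return task * 2

def run_all(tasks, n_processes=4):
    # SpeechBrain's parallel_map additionally chunks the work and
    # shows a tqdm progress bar; executor.map gives the same
    # one-call-per-item structure without the Pool boilerplate
    with ThreadPoolExecutor(max_workers=n_processes) as pool:
        return list(pool.map(make_csv_for_each, tasks))

results = run_all([1, 2, 3])
```

For CPU-heavy workers the real script would use processes rather than threads.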
```python
    csv_file_folder: str
        Path to the output folder of generated csv files
    """
    print("Merging the csvs of each subfolder into one csv")
```
```python
# filter out bad rows
for row in reader:
    if len(row) == 3 and os.path.exists(row[-1]):
        new_row = [row[-1], row[1], row[2]]
        csv_writer.writerow(new_row)
    else:
        print(f"bad row {row}")
```
Can you add in the docstring of the function what it means to have "bad rows"?
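For instance, the docstring could spell out both failure modes the filter guards against (the function name and return value below are assumptions, not from the PR):

```python
import os

def merge_csv_rows(reader, csv_writer):
    """Merge per-subfolder csv rows into one csv, skipping bad rows.

    A row is considered "bad" (and is dropped) when either:
    - it does not have exactly 3 fields (a malformed or truncated line), or
    - the audio path stored in its last field no longer exists on disk,
      e.g. the flac file was moved or deleted after the csv was written.

    Returns the number of rows kept.
    """
    kept = 0
    for row in reader:
        if len(row) == 3 and os.path.exists(row[-1]):
            csv_writer.writerow([row[-1], row[1], row[2]])
            kept += 1
    return kept
```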
```python
import shutil

shutil.rmtree(f"{csv_file_folder}/tmp")
```
```python
)
for i, flac_file in enumerate(subpath_1.glob("**/*.flac")):
    flac_file_name = flac_file.stem
    waveform, sample_rate = torchaudio.load(str(flac_file))
```
I think it would be better to use `read_audio_info` instead (see: https://github.com/speechbrain/speechbrain/blob/develop/speechbrain/dataio/dataio.py#L166-L221)
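The point of `read_audio_info` is that duration can come from the file header without decoding the whole waveform. A stdlib sketch of the same idea for wav files (flac would go through `read_audio_info` itself; the names here are illustrative):

```python
import os
import tempfile
import wave

def wav_duration(path):
    """Duration in seconds from header metadata only; no samples decoded."""
    with wave.open(path, "rb") as w:
        return w.getnframes() / w.getframerate()

# Demo: write one second of 16 kHz mono silence, then measure it
demo_path = os.path.join(tempfile.gettempdir(), "duration_demo.wav")
with wave.open(demo_path, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)  # 16-bit samples
    w.setframerate(16000)
    w.writeframes(b"\x00\x00" * 16000)
```

On a dataset this size, skipping the waveform decode makes the metadata pass far cheaper.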
```python
def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--input_dir",
        type=str,
        default=None,
        help="Path to the Libri-Light split after vad",
        required=True,
    )
    parser.add_argument(
        "--output_dir",
        type=str,
        default=None,
        help="Path to the output folder of generated csv files",
        required=True,
    )
    parser.add_argument(
        "--max_length",
        type=float,
        default=20.2,
        help="The max length for each prepared audio clip (default is 20.2)",
    )
    parser.add_argument(
        "--n_processes",
        type=int,
        default=32,
        help="Number of parallel processes",
    )
    args = parser.parse_args()
    return args
```
@TParcollet do you have an opinion on that? Usually the data prep is done directly in train.py, BUT, given how large the dataset is, I would say that 99% of folks will prefer to first process the dataset and then train models (which means a clear way of executing only the data prep). In this regard, I would say we could stick to argparse for this recipe, but I'd like to get the goat's opinion on this design choice. Thanks
This is an excellent question, as I have been dealing with recipes with 10k+ hours recently and I've had this exact same issue. I resorted to adding a yaml argument to just do the data prep.
I think that the key here is to find the potential issue that it may cause. If one starts a training with a very long data prep on a single process, personally, I don't think it is a problem, and it will work just fine. The problem is that these very long data preps imply that people will want to use multiple GPUs, hence DDP, hence a timeout on the data prep. There is also the cluster occupation issue, as it is inefficient to occupy multiple GPUs for a few hours.
I think that what @shucongzhang proposes here makes sense. However, we would also need to discuss this more with people like @pplantinga, @mravanelli and @ASU. Another possibility would simply be to always respect the SB way of doing things, and therefore have one librilight_prepare.py and call it in a recipe with some checks to make sure that it's not run with DDP.
I would still, however, by convention, name the file librilight_prepare.py
Hi @Adel-Moumen, how are you? Thank you so much for the quick and detailed comments! I really appreciate them! For the script, I've rewritten it with SpeechBrain utils functions. For having a separate data preparation, I think the reasons are:
- Before the SpeechBrain data preparation, the user needs to use external VAD to pre-process the data, which is already a separate step.
- There is no dev set for Libri-Light. Thus, the dev.csv should be taken from somewhere else, which is also already a separate data preparation.
- As you said, the data is huge, and preparing it first before training may be a good option.

I'm also discussing this with Titouan. I'll update this PR after there's a conclusion about how we should do the data preparation for this dataset.
Thank you again!
Hi!

> 1. Before the SpeechBrain data preparation, the user needs to use external VAD to pre-process the data, which is already a separate step.

Can you please elaborate on this? I didn't know you had to run VAD on top of Libri-Light. According to the official paper, they already ran a VAD, no?

> 2. There is no dev set for Libri-Light. Thus, the dev.csv should be taken from somewhere else, which is also already a separate data preparation.

For the dev set, I guess we can take N% of the training set as a dev set, directly in train.py or something like that?

Let me know the outcome of the discussion.

Do you plan to add a recipe, e.g. BEST-RQ or wav2vec? Would be nice to have that with this PR.
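If the N% route were taken, a hedged sketch of carving a dev set out of the prepared rows could look like this (the row format and function name are assumptions):

```python
import random

def split_train_dev(rows, dev_fraction=0.01, seed=1234):
    """Hold out a fraction of prepared rows as a dev set.

    Deterministic for a fixed seed, so train and dev stay disjoint
    across runs; rows are (snt_id, duration, file_path) triples here.
    """
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_dev = max(1, int(len(rows) * dev_fraction))
    return rows[n_dev:], rows[:n_dev]

rows = [("id%d" % i, 10.0, "clip%d.flac" % i) for i in range(200)]
train, dev = split_train_dev(rows)
```

The seeded shuffle matters: re-running the prep must not leak dev utterances into train.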
@Adel-Moumen Hi Adel, I've rewritten the scripts with SB libraries, along with a script for training BEST-RQ.
For VAD, we have to do it. Please refer to "1B. Segmenting" in https://github.com/facebookresearch/libri-light/tree/main/data_preparation
For the dev set, I still think using, for example, LibriSpeech dev-clean would be good. The reasons are:
- We do not really care about the loss on dev-clean; we just want to use it to monitor the training.
- If we use a percentage of the training set as the dev set, then we are wasting that amount of data.
- LibriSpeech is easy to access, and the dev/test sets do not overlap with the Libri-Light dataset.

I'm looking forward to your further comments. Thx!
Adel-Moumen
left a comment
Hi @shucongzhang, I left you some comments. Thanks again for the PR :) I will try to train an SSL model so that we have something to release along with this PR :)
2- Use the ```cut_by_vad.py``` script from the Libri-Light repo to do the VAD of each downloaded split. If you want to use the small split, and you want most clips after VAD to be 20 seconds:

```shell
python cut_by_vad.py \
    --input_dir path_to_Libri-Light/small \
    --output_dir Libri-Light_VAD/small_vad \
    --target_len_sec 20
```
I guess maybe you should specify that you need to git clone the repo :p
```python
 * Mirco Ravanelli, 2020
 * Ju-Chieh Chou, 2020
 * Loren Lugosch, 2020
 * Pierre Champion, 2023
 * Adel Moumen, 2024
```
Not sure how relevant those people are with respect to this file. If they didn't contribute to this file, then you can remove them.
```python
data_folder = data_folder
splits = vad_splits
save_folder = save_folder

if not os.path.exists(save_folder):
    os.makedirs(save_folder)
```
→ `os.makedirs(save_folder, exist_ok=True)`
```python
# snt_id = wav_file.split("/")[-1].replace(".flac", "")
snt_id = wav_file
```
Under different folders, there are duplicated names for the .flac files. I've double-checked, and under different folders, .flac files with the same name are different utterances. Thus, in the beginning, I just used the full audio path as the ID. Now, the ID is changed to the subfolder name under each split plus the name of the .flac file.
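A sketch of that ID scheme with pathlib (the folder layout `<split>/<speaker>/<book>/<file>.flac` and the function name are assumptions based on the comment above):

```python
from pathlib import Path

def make_snt_id(wav_file):
    """Unique utterance ID: parent subfolders plus the flac stem.

    Plain stems collide across folders in Libri-Light, so the two
    enclosing directory names are kept as part of the ID.
    """
    p = Path(wav_file)
    return "_".join(p.parts[-3:-1] + (p.stem,))

snt_id = make_snt_id("small/100/book_a/utt1.flac")
```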
```python
duration = info.num_frames / info.sample_rate

return LLRow(
    snt_id=snt_id,
```
Wait. Is snt_id == file_path?
```python
@dataclass
class LLRow:
    snt_id: str
    duration: float
    file_path: str
```
I think you should add a small docstring about this :)
```python
wavs, wav_lens, mask = (
    wavs.to(self.device),
    wav_lens.to(self.device),
    mask.to(self.device),
)
```
Not needed. You just need to do batch.to(self.device) and it will move everything to the right device :p
The batch variable in this function is a list, so batch.to(self.device) is not feasible. I guess there's a way to pass a tensor rather than a list to this function, but I'm less familiar with this function and not sure how to do this, sorry :(
Hi @Adel-Moumen, thank you for the comments again! Sorry for the delay, I was ill last week. I have modified the PR based on your comments. I'll take holidays now and be back next year, and I wish a good holiday season to you! :)
TParcollet
left a comment
CI keeps crashing due to a storage issue. But once it's green, LGTM!
… Center Cambridge) (speechbrain#2765) Co-authored-by: Shucong Zhang/Embedded AI /SRUK/Engineer/Samsung Electronics <s1.zhang@sruk-ccn4.eu.corp.samsungelectronics.net> Co-authored-by: Parcollet Titouan <parcollet.titouan@gmail.com>
What does this PR do?
This PR contains a script that will create a train.csv file for the LibriLight dataset. The train.csv can be directly used as the "train_csv" in any SpeechBrain SSL yaml file.
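For illustration, here is how the generated train.csv would plug into an SSL hparams file; every key name except `train_csv` is an assumption, not taken from an actual recipe:

```yaml
# hparams sketch (key names other than train_csv are illustrative)
data_folder: path/to/output_dir
train_csv: path/to/output_dir/train.csv   # produced by this data prep
# Libri-Light has no dev split; a dev csv would come from another prep,
# e.g. LibriSpeech dev-clean, as discussed above
valid_csv: path/to/dev-clean.csv
```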