Librilight data preparation for SpeechBrain SSL (code from Samsung AI Center Cambridge)#2765

Merged
TParcollet merged 17 commits into speechbrain:develop from shucongzhang:librilight_prep
May 28, 2025
Conversation

@shucongzhang (Contributor):

What does this PR do?

This PR contains a script that will create a train.csv file for the LibriLight dataset. The train.csv can be directly used as the "train_csv" in any SpeechBrain SSL yaml file.
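As a hedged sketch of what such a file might contain (the ID/duration/wav column layout is assumed from common SpeechBrain recipes, not copied from this PR, and the IDs and paths are made up), a minimal train.csv can be written with the standard library:

```python
import csv
import io

# Hypothetical rows: one utterance per line, with an ID, a duration in
# seconds, and the path to the VAD-segmented flac file.
rows = [
    {"ID": "spk100_book1_0001", "duration": 20.1,
     "wav": "/data/librilight_vad/small/100/book1/0001.flac"},
    {"ID": "spk100_book1_0002", "duration": 19.8,
     "wav": "/data/librilight_vad/small/100/book1/0002.flac"},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["ID", "duration", "wav"])
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue().strip())
```

A training yaml would then point its "train_csv" entry at the file this script produces.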

Shucong Zhang/Embedded AI /SRUK/Engineer/Samsung Electronics added 3 commits November 20, 2024 17:29
@Adel-Moumen self-requested a review November 21, 2024 16:26
@Adel-Moumen (Collaborator) left a comment:

Hey @shucongzhang, how are you doing? Nice to see a data prep for Librilight! I left some early comments in case you want to take a look. Also, do you plan to add a recipe, e.g. BEST-RQ? It would be nice to have a working prototype alongside this data prep. Thanks :)


import argparse
import csv
import multiprocessing

n_processes: int
Number of parallel processes
"""
print("Processing each subfolder of this split")

Comment on lines +87 to +91
with multiprocessing.Pool(processes=n_processes) as pool:
for _ in tqdm.tqdm(
pool.imap_unordered(make_csv_for_each, tasks), total=len(tasks)
):
pass
Collaborator:

This code will have to change according to the parallel_map.
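For context, the pool-plus-tqdm loop in the snippet above can be expressed as a single map helper, which is roughly what a parallel_map utility provides. A minimal stand-in (the name my_parallel_map and the worker count are hypothetical, and SpeechBrain's real helper has its own signature) showing the pattern with the standard library:

```python
from concurrent.futures import ThreadPoolExecutor

def my_parallel_map(fn, items, n_workers=4):
    """Apply fn to every item using a worker pool, yielding results
    in input order. Stand-in for a library-provided parallel map."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        yield from pool.map(fn, items)

tasks = list(range(10))
results = list(my_parallel_map(lambda x: x * x, tasks))
print(results)  # squares of 0..9, in order
```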

csv_file_folder: str
Path to the output folder of generated csv files
"""
print("Merging the csvs of each subfolder into one csv")
Collaborator:

same logger

Comment on lines +118 to +124
# filter out bad rows
for row in reader:
if len(row) == 3 and os.path.exists(row[-1]):
new_row = [row[-1], row[1], row[2]]
csv_writer.writerow(new_row)
else:
print(f"bad row {row}")
Collaborator:

Can you add to the docstring of the function what it means to have "bad rows"?
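For illustration, the "bad row" criteria can be read off the snippet above: a row is dropped when it does not have exactly three fields or when its audio path no longer exists. A self-contained sketch with made-up rows:

```python
import os
import tempfile

# Create one real file so exactly one row passes the existence check.
with tempfile.NamedTemporaryFile(suffix=".flac", delete=False) as f:
    existing = f.name

rows = [
    ["id1", "20.0", existing],         # good: 3 fields, file exists
    ["id2", "19.5"],                   # bad: only 2 fields
    ["id3", "18.0", "/no/such.flac"],  # bad: path does not exist
]

good = [r for r in rows if len(r) == 3 and os.path.exists(r[-1])]
print([r[0] for r in good])  # only id1 survives
os.remove(existing)
```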

Comment on lines +126 to +128
import shutil

shutil.rmtree(f"{csv_file_folder}/tmp")
Collaborator:

oof?

)
for i, flac_file in enumerate(subpath_1.glob("**/*.flac")):
flac_file_name = flac_file.stem
waveform, sample_rate = torchaudio.load(str(flac_file))
Collaborator:

I think it would be better to instead use the read_audio_info (see: https://github.com/speechbrain/speechbrain/blob/develop/speechbrain/dataio/dataio.py#L166-L221)
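The point of an info-style read is that only the header is parsed, so the duration comes from the frame count and sample rate without decoding the whole waveform. A rough torchaudio-free analogy using the standard-library wave module (the file and numbers here are purely illustrative):

```python
import os
import tempfile
import wave

# Write a 1-second silent 16 kHz mono WAV purely for illustration.
path = os.path.join(tempfile.mkdtemp(), "clip.wav")
with wave.open(path, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)       # 16-bit samples
    w.setframerate(16000)
    w.writeframes(b"\x00\x00" * 16000)

# Header-only read: frame count and rate are enough for the csv duration.
with wave.open(path, "rb") as w:
    duration = w.getnframes() / w.getframerate()
print(duration)  # 1.0
```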

Comment on lines +131 to +160
def parse_args():
parser = argparse.ArgumentParser()
parser.add_argument(
"--input_dir",
type=str,
default=None,
help="Path to the Libri-Light split after vad",
required=True,
)
parser.add_argument(
"--output_dir",
type=str,
default=None,
help="Path to the output folder of generated csv files",
required=True,
)
parser.add_argument(
"--max_length",
type=float,
default=20.2,
help="The max length for each prepared audio clip (default is 20.2)",
)
parser.add_argument(
"--n_processes",
type=int,
default=32,
help="Number of parallel processes",
)
args = parser.parse_args()
return args
Collaborator:

@TParcollet do you have an opinion on that? Usually the data prep is done directly in the train.py, but, given how large the dataset is, I would say that 99% of the folks will prefer to first process the dataset and then train models (which means a clear way of executing only the data prep). In this regard, I would say we could stick to argparse for this recipe, but I'd like to get the goat opinion on this design choice. Thanks.

@TParcollet (Collaborator), Nov 22, 2024:

This is an excellent question, as I have been dealing with recipes with 10k+ hours recently and I've had this exact same issue. I resorted to adding a yaml argument to just do the data prep.

I think that the key here is to find the potential issue that it may cause. If one starts a training with a very long data prep on a single process, personally, I don't think it is a problem and it will work just fine. Problem is, these very long data preps imply that people will want to use multiple GPUs, hence DDP, hence a timeout on the data prep. There is also the cluster occupation issue, as it is inefficient to occupy multiple GPUs for a few hours.

I think that what @shucongzhang proposes here makes sense. However, we would also need to discuss this more with people like @pplantinga, @mravanelli and @ASU. Another possibility would simply be to always respect the SB way of doing things ... and therefore have one librilight_prepare.py and call it in a recipe with some checks to make sure that it's not run with DDP.
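The DDP timeout concern above is usually avoided by letting only one rank run the prep (SpeechBrain's run_on_main serves this purpose). A generic stand-in using the conventional RANK environment variable (the function name is illustrative, and a real DDP version would also need a barrier so other ranks wait for the csv files):

```python
import os

def run_prep_on_main_only(fn, *args, **kwargs):
    """Run fn only on rank 0 (or when RANK is unset), so a multi-process
    job does the data preparation exactly once."""
    if int(os.environ.get("RANK", "0")) == 0:
        return fn(*args, **kwargs)
    return None

calls = []
os.environ["RANK"] = "0"
run_prep_on_main_only(calls.append, "prepared")  # runs on rank 0
os.environ["RANK"] = "1"
run_prep_on_main_only(calls.append, "skipped")   # skipped on other ranks
print(calls)  # ['prepared']
```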

Collaborator:

I would still, however, by convention, name the file librilight_prepare.py

Contributor Author:

Hi @Adel-Moumen, how are you? Thank you so much for the quick and detailed comments! I really appreciate it! For the script, I've rewritten it with SpeechBrain utils functions. For having a separate data preparation, I think the reasons are:

  1. Before the SpeechBrain data preparation, the user needs to use external VAD to pre-process the data, which is already a separate step.
  2. There is no dev set for LibriLight. Thus, the dev.csv should be taken from somewhere else, which is also already a separate data preparation.
  3. As you said, the data is huge, and preparing it first before training may be a good option.

I'm also discussing this with Titouan. I'll update this PR once there's a conclusion about how we should do the data preparation for this dataset.

Thank you again!

Collaborator:
Hi!

> 1. Before the SpeechBrain data preparation, the user needs to use external VAD to pre-process the data, which is already a separate step.

Can you please elaborate on this? I didn't know you had to run VAD on top of Libri-Light. According to the official paper, they already ran a VAD, no?

> 2. There is no dev set for LibriLight. Thus, the dev.csv should be taken from somewhere else, which is also already a separate data preparation.

For the dev set, I guess we can directly take N% of the training set as the dev set in train.py or something like that?
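Holding out N% of the training rows as a dev set could be sketched like this (the 5% fraction, seed, and utterance names are made up):

```python
import random

rows = [f"utt{i:03d}" for i in range(100)]  # stand-in for train.csv rows
rng = random.Random(0)                      # fixed seed: reproducible split
shuffled = rows[:]
rng.shuffle(shuffled)

dev_frac = 0.05                             # hypothetical 5% dev split
n_dev = int(len(shuffled) * dev_frac)
dev, train = shuffled[:n_dev], shuffled[n_dev:]
print(len(train), len(dev))  # 95 5
```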

Let me know the outcome of the discussion.

Do you plan to add a recipe, e.g. best-rq or wav2vec? Would be nice to have that with this PR.

Contributor Author:

@Adel-Moumen Hi Adel, I've rewritten the scripts with SB libraries, along with a script for training BEST-RQ.
For VAD, we have to do it. Please refer to "1B. Segmenting" in https://github.com/facebookresearch/libri-light/tree/main/data_preparation
For the dev set, I still think using, for example, LibriSpeech dev-clean would be good. The reasons are:

  1. We do not really care about the loss on dev-clean. We just want to use it to monitor the training.
  2. If we use a percentage of the training set as the dev set, then we are wasting that amount of data.
  3. LibriSpeech is easy to access, and the dev/test sets do not overlap with the Libri-Light dataset.

I'm looking forward to your further comments. Thx!

@Adel-Moumen (Collaborator) left a comment:

Hi @shucongzhang, I left you some comments. Thanks again for the PR :) I will try to train an SSL model so that we have something to release along this PR :)

Comment on lines +12 to +18
2- Use the ```cut_by_vad.py``` script from the Libri-Light repo to do the VAD of each downloaded split. For example, if you want to use the small split and you want most clips after VAD to be 20 seconds:

python cut_by_vad.py \
--input_dir path_to_Libri-Light/small \
--output_dir Libri-Light_VAD/small_vad \
--target_len_sec 20

Collaborator:

I guess maybe you should specify that you need to git clone the repo :p

Comment thread recipes/Libri-Light/self-supervised-learning/hparams/BEST-RQ.yaml
Comment on lines +12 to +16
* Mirco Ravanelli, 2020
* Ju-Chieh Chou, 2020
* Loren Lugosch, 2020
* Pierre Champion, 2023
* Adel Moumen, 2024
Collaborator:

Not sure how relevant those people are with respect to this file. If they didn't contribute to this file, then you can remove them.

Comment on lines +83 to +85
data_folder = data_folder
splits = vad_splits
save_folder = save_folder
Collaborator:

Why?

Comment on lines +89 to +90
if not os.path.exists(save_folder):
os.makedirs(save_folder)
Collaborator:

-> os.makedirs(save_folder, exist_ok=True)

Comment on lines +141 to +142
# snt_id = wav_file.split("/")[-1].replace(".flac", "")
snt_id = wav_file
Collaborator:

Why?

Contributor Author:

Under different folders, there are duplicated names for the .flac files. I've double-checked, and under different folders, .flac files with the same name are different utterances. Thus, in the beginning, I just used the full audio path as the ID. Now, the ID is the subfolder name under each split plus the name of the .flac file.
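The collision can be reproduced with hypothetical paths: the bare stem repeats across subfolders, while subfolder-chain-plus-stem stays unique:

```python
from pathlib import Path

# Hypothetical layout: the stem "0001" appears under two different subfolders.
paths = [
    Path("/data/small_vad/speaker_a/book_1/0001.flac"),
    Path("/data/small_vad/speaker_b/book_9/0001.flac"),
]

stems = [p.stem for p in paths]                              # collide
ids = ["_".join(p.parts[-3:-1] + (p.stem,)) for p in paths]  # unique
print(sorted(ids))
```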

duration = info.num_frames / info.sample_rate

return LLRow(
snt_id=snt_id,
Collaborator:

wait. Is snt_id == file_path?

Comment on lines +133 to +137
@dataclass
class LLRow:
snt_id: str
duration: float
file_path: str
Collaborator:

I think you should add a small docstring about this :)

Comment on lines +37 to +41
wavs, wav_lens, mask = (
wavs.to(self.device),
wav_lens.to(self.device),
mask.to(self.device),
)
Collaborator:

Not needed. You just need to do batch.to(self.device) and it will move everything to the right device :p

Contributor Author:

The batch variable in this function is a list, so batch.to(self.device) is not feasible. I guess there's a way to pass a tensor rather than a list to this function, but I'm less familiar with this function and not sure how to do this, sorry :(

Shucong Zhang/Embedded AI /SRUK/Engineer/Samsung Electronics added 3 commits December 17, 2024 12:19
@shucongzhang (Contributor Author):

> Hi @shucongzhang, I left you some comments. Thanks again for the PR :) I will try to train an SSL model so that we have something to release along this PR :)

Hi @Adel-Moumen , thank you for the comments again! Sorry for the delay, I was ill last week. I have modified the PR based on your comments. I'll take holidays now and be back next year, and I wish a good holiday season to you! :)

@TParcollet (Collaborator) left a comment:

CI keeps crashing due to a storage issue. But once it's green, LGTM!

@TParcollet merged commit 075d5f2 into speechbrain:develop May 28, 2025 (4 of 5 checks passed)
pplantinga pushed a commit to pplantinga/speechbrain that referenced this pull request Jun 2, 2025
… Center Cambridge) (speechbrain#2765)

Co-authored-by: Shucong Zhang/Embedded AI /SRUK/Engineer/Samsung Electronics <s1.zhang@sruk-ccn4.eu.corp.samsungelectronics.net>
Co-authored-by: Parcollet Titouan <parcollet.titouan@gmail.com>