Librilight data preparation for SpeechBrain SSL (code from Samsung AI Center Cambridge)#2765
Conversation
Adel-Moumen
left a comment
Hey @shucongzhang, how are you doing? Nice to see a data prep for Libri-Light! I left some early comments in case you want to take a look. Also, do you plan to add a recipe, e.g. BEST-RQ? It would be nice to have a working prototype alongside this data prep. Thanks :)
```python
import argparse
import csv
import multiprocessing
```
Could you please use `parallel_map` from our toolkit instead? (see: https://github.com/speechbrain/speechbrain/blob/develop/recipes/CommonVoice/common_voice_prepare.py#L368)
```python
    n_processes: int
        Number of parallel processes
    """
    print("Processing each subfolder of this split")
```
Could you please use our loggers instead of `print`? (e.g. https://github.com/speechbrain/speechbrain/blob/develop/recipes/CommonVoice/common_voice_prepare.py#L24)
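For reference, a minimal sketch of the module-level logger pattern used across SpeechBrain prepare scripts (the function name here is illustrative, not from the PR):

```python
import logging

# Module-level logger, mirroring other SpeechBrain prepare scripts
logger = logging.getLogger(__name__)

def report_split(split_name):
    # Replaces print("Processing each subfolder of this split")
    logger.info("Processing each subfolder of split %s", split_name)

if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    report_split("small")
```

Unlike `print`, this lets the recipe's logging config control verbosity and destination.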
```python
with multiprocessing.Pool(processes=n_processes) as pool:
    for _ in tqdm.tqdm(
        pool.imap_unordered(make_csv_for_each, tasks), total=len(tasks)
    ):
        pass
```
This code will have to change accordingly once `parallel_map` is used.
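I haven't pinned down `parallel_map`'s exact signature here; as a stdlib approximation of the same shape (one worker call per task, results consumed lazily), the refactor could look like the sketch below, with `make_csv_for_each` standing in for the real per-subfolder worker:

```python
from concurrent.futures import ThreadPoolExecutor

def make_csv_for_each(task):
    # Stand-in worker: the real one writes one csv per subfolder
    return task * 2

def run_all(tasks, n_processes=4):
    # SpeechBrain's parallel_map additionally chunks the work and
    # shows a tqdm progress bar; executor.map gives the same
    # one-call-per-item structure without the Pool boilerplate
    with ThreadPoolExecutor(max_workers=n_processes) as pool:
        return list(pool.map(make_csv_for_each, tasks))

results = run_all([1, 2, 3])
```

For CPU-heavy workers the real script would use processes rather than threads.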
```python
    csv_file_folder: str
        Path to the output folder of generated csv files
    """
    print("Merging the csvs of each subfolder into one csv")
```
```python
# filter out bad rows
for row in reader:
    if len(row) == 3 and os.path.exists(row[-1]):
        new_row = [row[-1], row[1], row[2]]
        csv_writer.writerow(new_row)
    else:
        print(f"bad row {row}")
```
Can you add in the docstring of the function what it means to have "bad rows"?
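For instance, the docstring could spell out both failure modes the filter guards against (the function name and return value below are assumptions, not from the PR):

```python
import os

def merge_csv_rows(reader, csv_writer):
    """Merge per-subfolder csv rows into one csv, skipping bad rows.

    A row is considered "bad" (and is dropped) when either:
    - it does not have exactly 3 fields (a malformed or truncated line), or
    - the audio path stored in its last field no longer exists on disk,
      e.g. the flac file was moved or deleted after the csv was written.

    Returns the number of rows kept.
    """
    kept = 0
    for row in reader:
        if len(row) == 3 and os.path.exists(row[-1]):
            csv_writer.writerow([row[-1], row[1], row[2]])
            kept += 1
    return kept
```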
```python
import shutil

shutil.rmtree(f"{csv_file_folder}/tmp")
```
```python
)
for i, flac_file in enumerate(subpath_1.glob("**/*.flac")):
    flac_file_name = flac_file.stem
    waveform, sample_rate = torchaudio.load(str(flac_file))
```
I think it would be better to use `read_audio_info` instead (see: https://github.com/speechbrain/speechbrain/blob/develop/speechbrain/dataio/dataio.py#L166-L221)
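The point of `read_audio_info` is that duration can come from the file header without decoding the whole waveform. A stdlib sketch of the same idea for wav files (flac would go through `read_audio_info` itself; the names here are illustrative):

```python
import os
import tempfile
import wave

def wav_duration(path):
    """Duration in seconds from header metadata only; no samples decoded."""
    with wave.open(path, "rb") as w:
        return w.getnframes() / w.getframerate()

# Demo: write one second of 16 kHz mono silence, then measure it
demo_path = os.path.join(tempfile.gettempdir(), "duration_demo.wav")
with wave.open(demo_path, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)  # 16-bit samples
    w.setframerate(16000)
    w.writeframes(b"\x00\x00" * 16000)
```

On a dataset this size, skipping the waveform decode makes the metadata pass far cheaper.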
```python
def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--input_dir",
        type=str,
        default=None,
        help="Path to the Libri-Light split after vad",
        required=True,
    )
    parser.add_argument(
        "--output_dir",
        type=str,
        default=None,
        help="Path to the output folder of generated csv files",
        required=True,
    )
    parser.add_argument(
        "--max_length",
        type=float,
        default=20.2,
        help="The max length for each prepared audio clip (default is 20.2)",
    )
    parser.add_argument(
        "--n_processes",
        type=int,
        default=32,
        help="Number of parallel processes",
    )
    args = parser.parse_args()
    return args
```
@TParcollet do you have an opinion on that? Usually the data prep is done directly in train.py, BUT, given how large the dataset is, I would say that 99% of folks will prefer to first process the dataset and then train models (which means a clear way of executing only the data prep). In this regard, I would say we could stick to argparse for this recipe, but I'd like to get the goat's opinion on this design choice. Thanks
This is an excellent question, as I have been dealing with recipes with 10k+ hours recently and I've had this exact same issue. I resorted to adding a yaml argument to just do the data prep.
I think that the key here is to find the potential issue that it may cause. If one starts a training with a very long data prep on a single process, personally, I don't think it is a problem, and it will work just fine. The problem is that these very long data preps imply that people will want to use multiple GPUs, hence DDP, hence a timeout on the data prep. There is also the cluster occupation issue, as it is inefficient to occupy multiple GPUs for a few hours.
I think that what @shucongzhang proposes here makes sense. However, we would also need to discuss this more with people like @pplantinga, @mravanelli and @ASU. Another possibility would simply be to always respect the SB way of doing things, and therefore have one librilight_prepare.py and call it in a recipe with some checks to make sure that it's not run with DDP.
I would still, however, by convention, name the file librilight_prepare.py
Hi @Adel-Moumen, how are you? Thank you so much for the quick and detailed comments! I really appreciate them! For the script, I've rewritten it with SpeechBrain utils functions. For having a separate data preparation, I think the reasons are:
- Before the SpeechBrain data preparation, the user needs to use external VAD to pre-process the data, which is already a separate step.
- There is no dev set for Libri-Light. Thus, the dev.csv should be taken from somewhere else, which is also already a separate data preparation.
- As you said, the data is huge, and preparing it first before training may be a good option.

I'm also discussing this with Titouan. I'll update this PR after there's a conclusion about how we should do the data preparation for this dataset.
Thank you again!
Hi!

> 1. Before the SpeechBrain data preparation, the user needs to use external VAD to pre-process the data, which is already a separate step.

Can you please elaborate on this? I didn't know you had to run VAD on top of Libri-Light. According to the official paper, they already ran a VAD, no?

> 2. There is no dev set for Libri-Light. Thus, the dev.csv should be taken from somewhere else, which is also already a separate data preparation.

For the dev set, I guess we can take N% of the training set as a dev set, directly in train.py or something like that?

Let me know the outcome of the discussion.

Do you plan to add a recipe, e.g. BEST-RQ or wav2vec? Would be nice to have that with this PR.
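If the N% route were taken, a hedged sketch of carving a dev set out of the prepared rows could look like this (the row format and function name are assumptions):

```python
import random

def split_train_dev(rows, dev_fraction=0.01, seed=1234):
    """Hold out a fraction of prepared rows as a dev set.

    Deterministic for a fixed seed, so train and dev stay disjoint
    across runs; rows are (snt_id, duration, file_path) triples here.
    """
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_dev = max(1, int(len(rows) * dev_fraction))
    return rows[n_dev:], rows[:n_dev]

rows = [("id%d" % i, 10.0, "clip%d.flac" % i) for i in range(200)]
train, dev = split_train_dev(rows)
```

The seeded shuffle matters: re-running the prep must not leak dev utterances into train.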
@Adel-Moumen Hi Adel, I've rewritten the scripts with SB libraries, along with a script for training BEST-RQ.
For VAD, we have to do it. Please refer to "1B. Segmenting" in https://github.com/facebookresearch/libri-light/tree/main/data_preparation
For the dev set, I still think using, for example, LibriSpeech dev-clean would be good. The reasons are:
- We do not really care about the loss on dev-clean; we just want to use it to monitor the training.
- If we use a percentage of the training set as the dev set, then we are wasting that amount of data.
- LibriSpeech is easy to access, and the dev/test sets do not overlap with the Libri-Light dataset.

I'm looking forward to your further comments. Thx!
Adel-Moumen
left a comment
Hi @shucongzhang, I left you some comments. Thanks again for the PR :) I will try to train an SSL model so that we have something to release along with this PR :)
2- Use the ```cut_by_vad.py``` script from the Libri-Light repo to do the VAD of each downloaded split. If you want to use the small split, and you want most clips after VAD to be 20 seconds:

```shell
python cut_by_vad.py \
    --input_dir path_to_Libri-Light/small \
    --output_dir Libri-Light_VAD/small_vad \
    --target_len_sec 20
```
I guess maybe you should specify that you need to git clone the repo :p
```python
 * Mirco Ravanelli, 2020
 * Ju-Chieh Chou, 2020
 * Loren Lugosch, 2020
 * Pierre Champion, 2023
 * Adel Moumen, 2024
```
Not sure how relevant those people are with respect to this file. If they didn't contribute to this file, then you can remove them.
```python
data_folder = data_folder
splits = vad_splits
save_folder = save_folder

if not os.path.exists(save_folder):
    os.makedirs(save_folder)
```
→ `os.makedirs(save_folder, exist_ok=True)`
```python
# snt_id = wav_file.split("/")[-1].replace(".flac", "")
snt_id = wav_file
```
Under different folders, there are duplicated names for the .flac files. I've double-checked, and under different folders, .flac files with the same name are different utterances. Thus, in the beginning, I just used the full audio path as the ID. Now, the ID is changed to the subfolder name under each split plus the name of the .flac file.
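A sketch of that ID scheme with pathlib (the folder layout `<split>/<speaker>/<book>/<file>.flac` and the function name are assumptions based on the comment above):

```python
from pathlib import Path

def make_snt_id(wav_file):
    """Unique utterance ID: parent subfolders plus the flac stem.

    Plain stems collide across folders in Libri-Light, so the two
    enclosing directory names are kept as part of the ID.
    """
    p = Path(wav_file)
    return "_".join(p.parts[-3:-1] + (p.stem,))

snt_id = make_snt_id("small/100/book_a/utt1.flac")
```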
```python
duration = info.num_frames / info.sample_rate

return LLRow(
    snt_id=snt_id,
```
Wait. Is snt_id == file_path?
```python
@dataclass
class LLRow:
    snt_id: str
    duration: float
    file_path: str
```
I think you should add a small docstring about this :)
```python
wavs, wav_lens, mask = (
    wavs.to(self.device),
    wav_lens.to(self.device),
    mask.to(self.device),
)
```
Not needed. You just need to do batch.to(self.device) and it will move everything to the right device :p
The batch variable in this function is a list, so batch.to(self.device) is not feasible. I guess there's a way to pass a tensor rather than a list to this function, but I'm less familiar with this function and not sure how to do this, sorry :(
Hi @Adel-Moumen, thank you for the comments again! Sorry for the delay, I was ill last week. I have modified the PR based on your comments. I'll take holidays now and be back next year, and I wish a good holiday season to you! :)
TParcollet
left a comment
CI keeps crashing due to a storage issue. But once it's green, LGTM!
… Center Cambridge) (speechbrain#2765) Co-authored-by: Shucong Zhang/Embedded AI /SRUK/Engineer/Samsung Electronics <s1.zhang@sruk-ccn4.eu.corp.samsungelectronics.net> Co-authored-by: Parcollet Titouan <parcollet.titouan@gmail.com>
What does this PR do?
This PR contains a script that will create a train.csv file for the LibriLight dataset. The train.csv can be directly used as the "train_csv" in any SpeechBrain SSL yaml file.
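For illustration, here is how the generated train.csv would plug into an SSL hparams file; every key name except `train_csv` is an assumption, not taken from an actual recipe:

```yaml
# hparams sketch (key names other than train_csv are illustrative)
data_folder: path/to/output_dir
train_csv: path/to/output_dir/train.csv   # produced by this data prep
# Libri-Light has no dev split; a dev csv would come from another prep,
# e.g. LibriSpeech dev-clean, as discussed above
valid_csv: path/to/dev-clean.csv
```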