
Add new Audio Tokenizers #2751

Merged
pplantinga merged 37 commits into speechbrain:develop from poonehmousavi:audiotokenizers
Jan 10, 2025

Conversation

@poonehmousavi poonehmousavi commented Nov 7, 2024

What does this PR do?

  • Add interface for Mimi tokenizer
  • Add interface for WavTokenizer
  • Add interface for SQ-Codec (removed and moved to another PR)
  • Refactor SpeechTokenizer
  • Add support for accepting .safetensors files from Hugging Face
  • Update discrete_ssl docstring and HF repo for hubert and wav2vec2
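The .safetensors support can be illustrated with a small sketch. Note this is a hypothetical helper, not SpeechBrain's actual implementation: the idea is simply that a checkpoint fetched from Hugging Face gets routed to the right loader based on its file extension.

```python
from pathlib import Path


def pick_checkpoint_loader(filename: str) -> str:
    """Choose a loader for a fetched checkpoint based on its extension.

    Hypothetical illustration: '.safetensors' files would be loaded with
    safetensors.torch.load_file, common torch formats with torch.load.
    """
    suffix = Path(filename).suffix
    if suffix == ".safetensors":
        return "safetensors"
    if suffix in (".pt", ".pth", ".ckpt", ".bin"):
        return "torch"
    raise ValueError(f"Unsupported checkpoint format: {suffix}")
```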

Fixes #<issue_number>

Before submitting
  • Did you read the contributor guideline?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you list all the breaking changes introduced by this pull request?
  • Does your code adhere to project-specific code style and conventions?

PR review

Reviewer checklist
  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified
  • Confirm that the changes adhere to compatibility requirements (e.g., Python version, platform)
  • Review the self-review checklist to ensure the code is ready for review

@TParcollet TParcollet self-requested a review November 9, 2024 10:15
@poonehmousavi poonehmousavi self-assigned this Nov 13, 2024
@poonehmousavi poonehmousavi added the enhancement New feature or request label Nov 13, 2024
@poonehmousavi poonehmousavi marked this pull request as ready for review November 17, 2024 17:29
@poonehmousavi
Collaborator Author

@mravanelli and @TParcollet, this PR is ready for review. It would be super helpful to review it ASAP since many tasks depend on these new features.

@poonehmousavi poonehmousavi mentioned this pull request Nov 21, 2024
13 tasks
@TParcollet
Collaborator

TParcollet commented Nov 22, 2024

Thanks @poonehmousavi for the work! I am on it, but this will take some time because it's a lot of new code.

@TParcollet TParcollet left a comment

Thanks for the work again. Quite a few minor things. As mentioned already, I don't think that these new lobes should be put in a discrete folder. They should either go to the HF folder if they use transformers, or to the general lobe folder. This raises a good question about how to start regrouping models under a better directory tree, but we must all talk about that -- and I am pretty sure that the output type shouldn't be a criterion.

Comment thread speechbrain/lobes/models/discrete/speechTokenizer.py
@@ -0,0 +1,157 @@
"""This lobe enables the integration of pretrained WavTokenizer.

Note that you need topip install git+https://github.com/Tomiinek/WavTokenizer` to use this module.
Collaborator

Watch out for typos. Shouldn't our pre-commit catch this, @pplantinga?

Collaborator Author

fixed



class WavTokenizer(nn.Module):
"""An wrapper for the WavTokenizer model
Collaborator

Same: an -> a. Not caught by tests, @pplantinga?

Collaborator Author

fixed

Collaborator

This is a grammar error, not spelling

Collaborator

Don't see it fixed.



class WavTokenizer(nn.Module):
"""An wrapper for the WavTokenizer model
Collaborator

This does not describe the lobe / model / reference / wrapper enough, nor from what source it comes.

Collaborator Author

added more info

save_path : str, optional
Directory where the model and configuration files are saved (default is None).
config : str, optional
Configuration filename for the model (default is 'config.yaml').
Collaborator

Config of what? HF? SB?

Collaborator Author

HF. You need to specify which file to use because of the way the model is loaded in the original code.

Collaborator

Is this standard to HF, or is it a requirement of our implementation? Either way, this needs to be explained in the docstring, maybe giving an example of a path in a HF repo?

Collaborator Author
@poonehmousavi poonehmousavi Dec 2, 2024

In general, their code is not very well organized, and there are different versions of it. The version that I used is the code that was sent to me by the author. They upload everything to HF in one zip file without providing any interface, and in their code they expect you to download and extract everything manually.

Collaborator Author

I added a little bit more info in the docstring.

Comment thread speechbrain/lobes/models/discrete/SQ-Codec.py Outdated
self.dim_codebook = dim_codebook
self.n_codebook = n_codebook
self.bw = bw
self.freq = self.n_codebook * 50
Collaborator

This 50 seems to be arbitrary? Shouldn't it be a parameter? Nothing should be hard-coded.

Collaborator Author

There is no use case for this attribute, so I removed it.

"""
exp_model_config = OmegaConf.load(config)
scalar_codec = ScalarModel(**exp_model_config.generator.config)
parameter_dict = torch.load(self.ckpt_path)
Collaborator

This operation looks like something that our Pretrainer class should be able to handle; it would be wiser to use it for clarity and our users' habits.

out = self.scalar_codec.decode(emb_quant.float().to(self.device))
return out.detach().cpu().squeeze(0)

def infer(self, wav_root):
Collaborator

Infer is not very self-descriptive, especially since there is also an encode function. Maybe rename it to something clearer?

Collaborator Author

changed to "reconstruct"

@TParcollet TParcollet left a comment

@mravanelli requested that I do another review of this PR. I started, but the problem is that I don't see the fixes to my previous comments (probably not pushed yet, even though they were marked as done?). I therefore stopped the review to limit the number of review rounds. Please @poonehmousavi let me know when it's pushed so I can do a last review.

Thanks!

from torch.nn.utils import remove_weight_norm, weight_norm


def download_and_extract(repo_id, filename, save_path):
Collaborator

Right. Here is what we should do then: this function should be removed; as mentioned, we avoid utility functions in lobes when possible. So let's use fetch for the download, and let's create in the fetch.py file another function that enables the extraction of files. Could you please try to write this function so that it allows a few more extraction formats? Maybe using gzip in addition to zipfile? Maybe this function could be linked directly to fetch, i.e. if fetch detects certain extensions (tar / tar.gz / others), then it triggers the extraction? @Gastron may have a point of view on this.
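The extraction-by-extension behavior suggested here could be sketched as follows (the function name and its placement in fetch.py are illustrative assumptions, not the actual SpeechBrain code):

```python
import tarfile
import zipfile
from pathlib import Path


def extract_archive(archive_path: str, save_path: str) -> bool:
    """Extract an archive based on its extension; return True if extracted.

    Illustrative sketch: zip archives go through zipfile, tar / tar.gz /
    tgz through tarfile. Unknown extensions are left untouched so a
    fetch-style caller can return the file as-is.
    """
    path = Path(archive_path)
    if path.suffix == ".zip":
        with zipfile.ZipFile(path) as zf:
            zf.extractall(save_path)
        return True
    if path.suffix in (".tar", ".tgz") or path.name.endswith(".tar.gz"):
        with tarfile.open(path) as tf:
            tf.extractall(save_path)
        return True
    return False
```

Linking this directly to fetch, as suggested, would just mean calling it on every downloaded file and ignoring the `False` case.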

print(f"File downloaded, extracted to '{save_path}', and ZIP file removed.")


def decimal_to_ternary_matrix(decimals, D):
Collaborator

Well I don't see it :p

print(f"File downloaded, extracted to '{save_path}', and ZIP file removed.")


def decimal_to_ternary_matrix(decimals, D):
Collaborator

Still missing.
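For context on what is being reviewed: functions with these names presumably convert integer codes to and from a matrix of base-3 digits (SQ-Codec uses scalar quantization with ternary levels). A plausible pure-Python round-trip sketch, written here for illustration only and not taken from the PR:

```python
def decimal_to_ternary_matrix(decimals, D):
    """Expand each integer into D base-3 digits, least significant first.

    Hypothetical re-implementation: returns a D x N list of lists,
    one column per input code.
    """
    matrix = [[0] * len(decimals) for _ in range(D)]
    for col, value in enumerate(decimals):
        for row in range(D):
            matrix[row][col] = value % 3
            value //= 3
    return matrix


def ternary_matrix_to_decimal(matrix):
    """Inverse of the above: collapse base-3 digit columns back to ints."""
    D = len(matrix)
    n_codes = len(matrix[0])
    return [
        sum(matrix[row][col] * 3**row for row in range(D))
        for col in range(n_codes)
    ]
```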

return ternary_matrix


def ternary_matrix_to_decimal(matrix):
Collaborator

Still missing.

return int((kernel_size * dilation - dilation) / 2)


class round_func5(InplaceFunction):
Collaborator

Could you push it? I don't see it. Thanks.
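The quoted helper computes "same" padding for a stride-1 dilated convolution. A quick illustrative check (using the standard conv output-length formula, as in e.g. torch.nn.Conv1d docs) that this padding preserves sequence length for odd kernel sizes:

```python
def get_padding(kernel_size, dilation):
    """'Same' padding for a stride-1 dilated conv: dilation*(kernel-1)/2."""
    return int((kernel_size * dilation - dilation) / 2)


def conv_out_len(in_len, kernel_size, dilation, padding, stride=1):
    """Standard convolution output-length formula."""
    return (in_len + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1
```

With odd kernel sizes, `dilation * (kernel_size - 1)` is even, so the padding is exact and the output length equals the input length.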

return int((kernel_size * dilation - dilation) / 2)


class round_func5(InplaceFunction):
Collaborator

I guess that a right place for this would then be in nnet.
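A rounding function with a custom backward pass (as round_func5 appears to be, given it subclasses InplaceFunction) typically implements a straight-through estimator: round in the forward pass, pass gradients through unchanged in the backward pass. A framework-free sketch of the idea, for illustration only (the PR's actual code uses torch.autograd):

```python
def ste_round(x):
    """Forward pass: quantize to the nearest integer."""
    return float(round(x))


def ste_round_backward(upstream_grad):
    """Backward pass: straight-through estimator.

    round() has zero gradient almost everywhere, which would stop
    training; the STE pretends the op was the identity and passes
    the upstream gradient through unchanged.
    """
    return upstream_grad


def ste_loss_grad(x, target):
    """Gradient of loss = (ste_round(x) - target)**2 w.r.t. x under the STE."""
    upstream = 2.0 * (ste_round(x) - target)  # d(loss)/d(round(x))
    return ste_round_backward(upstream)       # identity through round()
```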

@poonehmousavi
Collaborator Author

@mravanelli requested that I do another review of this PR. I started, but the problem is that I don't see the fixes to my previous comments (probably not pushed yet, even though they were marked as done?). I therefore stopped the review to limit the number of review rounds. Please @poonehmousavi let me know when it's pushed so I can do a last review.

Thanks!

@TParcollet I think I have already pushed all the changes. You can find the new SQ-Codec in this file: https://github.com/poonehmousavi/speechbrain/blob/af69859606c4ccaab06f20d17c866898c6d4e9f0/speechbrain/lobes/models/discrete/sq_codec.py#L528

Also, this class is going to be moved to the integration folder, so we might not need to change the loading function.

@TParcollet TParcollet left a comment

Hi @poonehmousavi I did another pass, thanks for this very important work! I think most comments are minor, but at least one is quite important and will require a bit of re-thinking. Once the comments are addressed, I think we can merge this PR (or wait for the integrations one and adapt?) But please do ping me, I'll make a final review :-)

Comment thread conftest.py
torch.Size([8, 10, 2])
>>> wav=model.decode(tokens)
>>> print(wav.shape)
>>> wav=model.decode(tokens) # doctest: +SKIP
Collaborator

Why this? It seems bad to have the test skipped in the doctest when it's supposed to exercise the main feature.

Collaborator Author

removed

@@ -0,0 +1,1394 @@
"""This lobe enables the integration of speech codec model (SQ-Codec) with scalar quantization,.
Collaborator

Maybe prepare the docstring header for the move to integration?

Collaborator Author

@pplantinga is there any specific format I need to follow here?

filename,
save_path=None,
config="config.yaml",
checkpoint="ckpt_00190000.pth",
Collaborator

Please remove this hardcoded model; let's avoid having any forced model in the arguments. If it's mandatory for some reason, this must be justified and explained in the docstring.

Collaborator Author
@poonehmousavi poonehmousavi Dec 28, 2024

I removed the hardcoded one.

emb, emb_quant, x = self.scalar_codec.inference(wav)
return x.detach().cpu().squeeze(0)

@property
Collaborator

What is this? Return True? This does not look good.

Collaborator Author

It will return the quantized signal (x).

Collaborator

This function does nothing in that case, as it's not even based on a class attribute that may change. I propose to just remove this function (it's always true, so there is no point in having it).



class WavTokenizer(nn.Module):
# """A wrapper for the WavTokenizer model
Collaborator

Remove this line.

Collaborator Author

done

Comment thread speechbrain/lobes/models/discrete/wavTokenzier.py


class Mimi(HFTransformersInterface):
# """An wrapper for the HuggingFace Mimi model
Collaborator

remove line

Collaborator Author

done

whether the model will be frozen (e.g. not trainable if used
as part of training another model)
num_codebooks : int (default: 8)
Number of qunatizer. It could be [2,3,4,5,6,7,8]
Collaborator

Dunno what a qunatizer is.

@poonehmousavi
Collaborator Author

@pplantinga I'm adding new tokenizers in this PR, and I believe all of them need to be transferred to the integration folder. I’ve added you to this PR to ensure that we’re aligned on the settings and the steps required for the transfer. This will help us converge the effort efficiently.

@pplantinga
Collaborator

@pplantinga I'm adding new tokenizers in this PR, and I believe all of them need to be transferred to the integration folder. I’ve added you to this PR to ensure that we’re aligned on the settings and the steps required for the transfer. This will help us converge the effort efficiently.

From the perspective of the integrations folder, I have no objections to this PR moving forward as-is. Once it's merged, they can be moved to the integrations folder without too much trouble.

@poonehmousavi
Collaborator Author

@TParcollet

I have addressed most of your points. Regarding the unit tests and the suggestion to use the SB module instead of the provided module: as I mentioned in the comments, I'm currently working on another recipe that incorporates several quantization techniques. This requires re-implementing many of these techniques in SB from scratch.

Once that work is complete, I’ll focus on modifying these models accordingly. For now, since these models are in the integration folder, there shouldn’t be any issues.

@TParcollet TParcollet left a comment

Ignore this comment and review; GitHub completely bugged out and I couldn't see the updated version of the code. I'll do another review with the correct code.

print(f"File downloaded, extracted to '{save_path}', and ZIP file removed.")


def decimal_to_ternary_matrix(decimals, D):
Collaborator

@poonehmousavi I still don't see it, even though it is marked as 'done'?

@@ -0,0 +1,157 @@
"""This lobe enables the integration of pretrained WavTokenizer.
Collaborator

wavTokenzier.py --> wavTokenizer.py. Also @pplantinga, maybe we want to have a look at unifying the naming in lobes as part of the integration folder? Like camel case for all models?



class WavTokenizer(nn.Module):
"""An wrapper for the WavTokenizer model
Collaborator

Don't see it fixed.

@TParcollet TParcollet left a comment

@poonehmousavi thanks for your work! This is a conditional approval, as a few important comments have been pushed to another PR. If that other PR does not come in, we will not be able to release this PR into the master branch (i.e. it will be reverted).

emb, emb_quant, x = self.scalar_codec.inference(wav)
return x.detach().cpu().squeeze(0)

@property
Collaborator

This function does nothing in that case, as it's not even based on a class attribute that may change. I propose to just remove this function (it's always true, so there is no point in having it).


class ScalarModel(nn.Module):
"""
A custom neural network model for encoding and decoding audio signals.
Collaborator

Sure, but a SpeechBrain user will not look at another class's docstring when looking at this one. Can we add the relevant reference to this class docstring as well?

return x


class CustomRoundingFunction(Function):
Collaborator

I'll trust you on this. It is critical, don't forget the unit test as it definitely is a function that could go yolo in the future. Please ping me on the upcoming PR so I can double check as well.

return output


class DownsampleLayer(nn.Module):
Collaborator

Well, @pplantinga is just supposed to transpose code to another folder (with the integration folder), not to re-develop modules -- especially ones that he does not master. I'll trust you again on that, but the unit tests and this are important issues; if we don't see a PR addressing them at some point, we will have to revert. It also becomes harder for the reviewers to track, because your new PR will certainly add many new features that require review, and we will forget about this -- so please, when creating the PR, refer back to this one and this comment.

remove_weight_norm(self.layer)


class UpsampleLayer(nn.Module):
Collaborator

Tag for later -- this must be verified in new PR.

return x


class Conv1d(nn.Conv1d):
Collaborator

That is starting to be a lot of things. If not done in next PR, we will have to revert.

return super(Conv1d, self).forward(x)


class ConvTranspose1d(nn.ConvTranspose1d):
Collaborator

tag for further review next PR

print(f"File downloaded, extracted to '{save_path}', and ZIP file removed.")


def decimal_to_ternary_matrix(decimals, D):
Collaborator

tag for newer PR

return ternary_matrix


def ternary_matrix_to_decimal(matrix):
Collaborator

tag for newer PR

@@ -0,0 +1,165 @@
"""This lobe enables the integration of pretrained WavTokenizer.
Collaborator

filename (wavTokenzier -> wavTokenizer)

@poonehmousavi
Collaborator Author

@TParcollet Since all the remaining points are related to SQ-Codec, I could remove the file and we merge only WavTokenizer and Mimi for now. Later I will open a new PR for SQ-Codec.

@TParcollet
Collaborator

@poonehmousavi sounds good to me!

@poonehmousavi
Collaborator Author

@TParcollet and @mravanelli I have removed the sq_codec for now. The other tokenizers are, I think, ready to merge.

@poonehmousavi
Collaborator Author

No idea why I get an error for MERT here; I didn't change anything, just removed a file... is there any new update to CI that could cause this?

@pplantinga
Collaborator

No idea why I get an error for MERT here; I didn't change anything, just removed a file... is there any new update to CI that could cause this?

Probably an error on the huggingface side. I have run the test locally on this branch and it passes. I restarted the test, hopefully it passes this time. And in the future, this will be part of the integrations, so failures here won't cause CI failures.

@poonehmousavi
Collaborator Author

Still the same problem. I remember it had the same issue before, but we fixed it then by adding a WARNING: ... line in the docstring to catch that error.

@pplantinga
Collaborator

Aha, this is actually due to a new version of transformers. I just upgraded my local version and got the same error.

@pplantinga
Collaborator

This test will be moved to integrations folder soon, so let's find a simple way to skip it for now

@poonehmousavi
Collaborator Author

Could we simply add a skip for the test for now?
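One simple way to skip the failing test for now is the stdlib unittest skip decorator (a hedged sketch: the test-case and method names below are hypothetical, and the actual SpeechBrain CI may use pytest markers or doctest directives instead):

```python
import unittest


class MERTIntegrationTest(unittest.TestCase):
    # Hypothetical test name, for illustration; skipped until the test
    # moves to the integrations folder and CI is decoupled from it.
    @unittest.skip("Fails with newer transformers; moving to integrations")
    def test_mert_inference(self):
        raise RuntimeError("would hit Hugging Face and fail on CI")
```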

@pplantinga pplantinga merged commit e1fe891 into speechbrain:develop Jan 10, 2025
@poonehmousavi poonehmousavi deleted the audiotokenizers branch February 5, 2025 01:34

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants