
Add new Audio Tokenizers #2751

Merged
pplantinga merged 37 commits into speechbrain:develop from poonehmousavi:audiotokenizers
Jan 10, 2025

Conversation

@poonehmousavi poonehmousavi commented Nov 7, 2024

What does this PR do?

  • Add interface for Mimi tokenizer
  • Add interface for WavTokenizer
  • Add interface for SQ-Codec (removed and moved to another PR)
  • Refactor SpeechTokenizer
  • Add support for accepting .safetensors files from Hugging Face
  • Update discrete_ssl docstring and HF repo for hubert and wav2vec2
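The .safetensors support can be illustrated with a small sketch. Note this is a hypothetical helper, not SpeechBrain's actual implementation: the idea is simply that a checkpoint fetched from Hugging Face gets routed to the right loader based on its file extension.

```python
from pathlib import Path


def pick_checkpoint_loader(filename: str) -> str:
    """Choose a loader for a fetched checkpoint based on its extension.

    Hypothetical illustration: '.safetensors' files would be loaded with
    safetensors.torch.load_file, common torch formats with torch.load.
    """
    suffix = Path(filename).suffix
    if suffix == ".safetensors":
        return "safetensors"
    if suffix in (".pt", ".pth", ".ckpt", ".bin"):
        return "torch"
    raise ValueError(f"Unsupported checkpoint format: {suffix}")
```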

Fixes #<issue_number>

Before submitting
  • Did you read the contributor guideline?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you list all the breaking changes introduced by this pull request?
  • Does your code adhere to project-specific code style and conventions?

PR review

Reviewer checklist
  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified
  • Confirm that the changes adhere to compatibility requirements (e.g., Python version, platform)
  • Review the self-review checklist to ensure the code is ready for review

@TParcollet TParcollet self-requested a review November 9, 2024 10:15
@poonehmousavi poonehmousavi self-assigned this Nov 13, 2024
@poonehmousavi poonehmousavi added the enhancement New feature or request label Nov 13, 2024
@poonehmousavi poonehmousavi marked this pull request as ready for review November 17, 2024 17:29
@poonehmousavi
Collaborator Author

@mravanelli and @TParcollet, this PR is ready for review. It would be super helpful to review it ASAP since many tasks depend on these new features.

@poonehmousavi poonehmousavi mentioned this pull request Nov 21, 2024
13 tasks
@TParcollet
Collaborator

TParcollet commented Nov 22, 2024

Thanks @poonehmousavi for the work! I am on it, but this will take some time because it's a lot of new code.

@TParcollet TParcollet left a comment

Thanks for the work again. Quite a few minor things. As mentioned already, I don't think that these new lobes should be put in a discrete folder. They should either go to the HF folder if they use transformers, or to the general lobe folder. This raises a good question about how to start regrouping models under a better directory tree, but we must all talk about that -- and I am pretty sure that the output type shouldn't be a criterion.

Comment thread speechbrain/lobes/models/discrete/speechTokenizer.py
@@ -0,0 +1,157 @@
"""This lobe enables the integration of pretrained WavTokenizer.

Note that you need topip install git+https://github.com/Tomiinek/WavTokenizer` to use this module.
Collaborator

Watch out for typos. Shouldn't our pre-commit catch this, @pplantinga?

Collaborator Author

fixed



class WavTokenizer(nn.Module):
"""An wrapper for the WavTokenizer model
Collaborator

Same: an -> a. Not caught by tests, @pplantinga?

Collaborator Author

fixed

Collaborator

This is a grammar error, not spelling

Collaborator

Don't see it fixed.



class WavTokenizer(nn.Module):
"""An wrapper for the WavTokenizer model
Collaborator

This does not describe the lobe / model / reference / wrapper enough, nor from what source it comes.

Collaborator Author

added more info

save_path : str, optional
Directory where the model and configuration files are saved (default is None).
config : str, optional
Configuration filename for the model (default is 'config.yaml').
Collaborator

Config of what? HF? SB?

Collaborator Author

HF. You need to specify which file to use because of the way the model is loaded in the original code.

Collaborator

Is this standard to HF, or is it a requirement of our implementation? Either way, this needs to be explained in the docstring, maybe giving an example of a path in a HF repo?

Collaborator Author
@poonehmousavi poonehmousavi Dec 2, 2024

In general, their code is not very well organized, and there are different versions of it. The version that I used is the code that was sent to me by the author. They upload everything to HF in one zip file without providing any interface, and in their code they expect you to download and extract everything manually.

Collaborator Author

I added a little bit more info in the docstring.

Comment thread speechbrain/lobes/models/discrete/SQ-Codec.py Outdated
self.dim_codebook = dim_codebook
self.n_codebook = n_codebook
self.bw = bw
self.freq = self.n_codebook * 50
Collaborator

This 50 seems to be arbitrary? Shouldn't it be a parameter? Nothing should be hard-coded.

Collaborator Author

There is no use case for this attribute, so I removed it.

"""
exp_model_config = OmegaConf.load(config)
scalar_codec = ScalarModel(**exp_model_config.generator.config)
parameter_dict = torch.load(self.ckpt_path)
Collaborator

This operation looks like something that our Pretrainer class should be able to handle; it would be wiser to use it for clarity and our users' habits.

out = self.scalar_codec.decode(emb_quant.float().to(self.device))
return out.detach().cpu().squeeze(0)

def infer(self, wav_root):
Collaborator

Infer is not very self-descriptive, especially since there is also an encode function. Maybe rename it to something clearer?

Collaborator Author

changed to "reconstruct"

@TParcollet TParcollet left a comment

@mravanelli requested that I do another review of this PR. I started, but the problem is that I don't see the fixes to my previous comments (probably not pushed yet, even though they were marked as done?). I therefore stopped the review to limit the number of review rounds. Please @poonehmousavi let me know when it's pushed so I can do a last review.

Thanks!

from torch.nn.utils import remove_weight_norm, weight_norm


def download_and_extract(repo_id, filename, save_path):
Collaborator

Right. Here is what we should do then: this function should be removed; as mentioned, we avoid utility functions in lobes when possible. So let's use fetch for the download, and let's create in the fetch.py file another function that enables the extraction of files. Could you please try to write this function so that it allows a few more extraction formats? Maybe using gzip in addition to zipfile? Maybe this function could be linked directly to fetch, i.e. if fetch detects certain extensions (tar / tar.gz / others), then it triggers the extraction? @Gastron may have a point of view on this.
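The extraction-by-extension behavior suggested here could be sketched as follows (the function name and its placement in fetch.py are illustrative assumptions, not the actual SpeechBrain code):

```python
import tarfile
import zipfile
from pathlib import Path


def extract_archive(archive_path: str, save_path: str) -> bool:
    """Extract an archive based on its extension; return True if extracted.

    Illustrative sketch: zip archives go through zipfile, tar / tar.gz /
    tgz through tarfile. Unknown extensions are left untouched so a
    fetch-style caller can return the file as-is.
    """
    path = Path(archive_path)
    if path.suffix == ".zip":
        with zipfile.ZipFile(path) as zf:
            zf.extractall(save_path)
        return True
    if path.suffix in (".tar", ".tgz") or path.name.endswith(".tar.gz"):
        with tarfile.open(path) as tf:
            tf.extractall(save_path)
        return True
    return False
```

Linking this directly to fetch, as suggested, would just mean calling it on every downloaded file and ignoring the `False` case.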

print(f"File downloaded, extracted to '{save_path}', and ZIP file removed.")


def decimal_to_ternary_matrix(decimals, D):
Collaborator

Well I don't see it :p

print(f"File downloaded, extracted to '{save_path}', and ZIP file removed.")


def decimal_to_ternary_matrix(decimals, D):
Collaborator

Still missing.
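For context on what is being reviewed: functions with these names presumably convert integer codes to and from a matrix of base-3 digits (SQ-Codec uses scalar quantization with ternary levels). A plausible pure-Python round-trip sketch, written here for illustration only and not taken from the PR:

```python
def decimal_to_ternary_matrix(decimals, D):
    """Expand each integer into D base-3 digits, least significant first.

    Hypothetical re-implementation: returns a D x N list of lists,
    one column per input code.
    """
    matrix = [[0] * len(decimals) for _ in range(D)]
    for col, value in enumerate(decimals):
        for row in range(D):
            matrix[row][col] = value % 3
            value //= 3
    return matrix


def ternary_matrix_to_decimal(matrix):
    """Inverse of the above: collapse base-3 digit columns back to ints."""
    D = len(matrix)
    n_codes = len(matrix[0])
    return [
        sum(matrix[row][col] * 3**row for row in range(D))
        for col in range(n_codes)
    ]
```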

return ternary_matrix


def ternary_matrix_to_decimal(matrix):
Collaborator

Still missing.

return int((kernel_size * dilation - dilation) / 2)


class round_func5(InplaceFunction):
Collaborator

Could you push it? I don't see it. Thanks.
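The quoted helper computes "same" padding for a stride-1 dilated convolution. A quick illustrative check (using the standard conv output-length formula, as in e.g. torch.nn.Conv1d docs) that this padding preserves sequence length for odd kernel sizes:

```python
def get_padding(kernel_size, dilation):
    """'Same' padding for a stride-1 dilated conv: dilation*(kernel-1)/2."""
    return int((kernel_size * dilation - dilation) / 2)


def conv_out_len(in_len, kernel_size, dilation, padding, stride=1):
    """Standard convolution output-length formula."""
    return (in_len + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1
```

With odd kernel sizes, `dilation * (kernel_size - 1)` is even, so the padding is exact and the output length equals the input length.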

return int((kernel_size * dilation - dilation) / 2)


class round_func5(InplaceFunction):
Collaborator

I guess that a right place for this would then be in nnet.
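A rounding function with a custom backward pass (as round_func5 appears to be, given it subclasses InplaceFunction) typically implements a straight-through estimator: round in the forward pass, pass gradients through unchanged in the backward pass. A framework-free sketch of the idea, for illustration only (the PR's actual code uses torch.autograd):

```python
def ste_round(x):
    """Forward pass: quantize to the nearest integer."""
    return float(round(x))


def ste_round_backward(upstream_grad):
    """Backward pass: straight-through estimator.

    round() has zero gradient almost everywhere, which would stop
    training; the STE pretends the op was the identity and passes
    the upstream gradient through unchanged.
    """
    return upstream_grad


def ste_loss_grad(x, target):
    """Gradient of loss = (ste_round(x) - target)**2 w.r.t. x under the STE."""
    upstream = 2.0 * (ste_round(x) - target)  # d(loss)/d(round(x))
    return ste_round_backward(upstream)       # identity through round()
```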

@poonehmousavi
Collaborator Author

@mravanelli requested that I do another review of this PR. I started, but the problem is that I don't see the fixes to my previous comments (probably not pushed yet, even though they were marked as done?). I therefore stopped the review to limit the number of review rounds. Please @poonehmousavi let me know when it's pushed so I can do a last review.

Thanks!

@TParcollet I think I have already pushed all the changes. You can find the new SQ-Codec in this file: https://github.com/poonehmousavi/speechbrain/blob/af69859606c4ccaab06f20d17c866898c6d4e9f0/speechbrain/lobes/models/discrete/sq_codec.py#L528

Also, this class is going to be moved to the integration folder, so we might not need to change the loading function.

@TParcollet TParcollet left a comment

Hi @poonehmousavi I did another pass, thanks for this very important work! I think most comments are minor, but at least one is quite important and will require a bit of re-thinking. Once the comments are addressed, I think we can merge this PR (or wait for the integrations one and adapt?) But please do ping me, I'll make a final review :-)

Comment thread conftest.py
torch.Size([8, 10, 2])
>>> wav=model.decode(tokens)
>>> print(wav.shape)
>>> wav=model.decode(tokens) # doctest: +SKIP
Collaborator

Why this? It seems bad to have the test skipped in the doctest when it's supposed to exercise the main feature.

Collaborator Author

removed

@@ -0,0 +1,1394 @@
"""This lobe enables the integration of speech codec model (SQ-Codec) with scalar quantization,.
Collaborator

Maybe prepare the docstring header for the move to integration?

Collaborator Author

@pplantinga is there any specific format I need to follow here?

filename,
save_path=None,
config="config.yaml",
checkpoint="ckpt_00190000.pth",
Collaborator

Please remove this hardcoded model; let's avoid having any forced model in the arguments. If it's mandatory for some reason, this must be justified and explained in the docstring.

Collaborator Author
@poonehmousavi poonehmousavi Dec 28, 2024

I removed the hardcoded one.

emb, emb_quant, x = self.scalar_codec.inference(wav)
return x.detach().cpu().squeeze(0)

@property
Collaborator

What is this? Return True? This does not look good.

Collaborator Author

It will return the quantized signal (x).

Collaborator

This function does nothing in that case, as it's not even based on a class attribute that may change. I propose to just remove this function (it's always true, so there is no point in having it).



class WavTokenizer(nn.Module):
# """A wrapper for the WavTokenizer model
Collaborator

Remove this line.

Collaborator Author

done

Comment thread speechbrain/lobes/models/discrete/wavTokenzier.py


class Mimi(HFTransformersInterface):
# """An wrapper for the HuggingFace Mimi model
Collaborator

remove line

Collaborator Author

done

whether the model will be frozen (e.g. not trainable if used
as part of training another model)
num_codebooks : int (default: 8)
Number of qunatizer. It could be [2,3,4,5,6,7,8]
Collaborator

Dunno what a qunatizer is.

@poonehmousavi
Collaborator Author

@pplantinga I'm adding new tokenizers in this PR, and I believe all of them need to be transferred to the integration folder. I’ve added you to this PR to ensure that we’re aligned on the settings and the steps required for the transfer. This will help us converge the effort efficiently.

@pplantinga
Collaborator

@pplantinga I'm adding new tokenizers in this PR, and I believe all of them need to be transferred to the integration folder. I’ve added you to this PR to ensure that we’re aligned on the settings and the steps required for the transfer. This will help us converge the effort efficiently.

From the perspective of the integrations folder, I have no objections to this PR moving forward as-is. Once it's merged, they can be moved to the integrations folder without too much trouble.

@poonehmousavi
Collaborator Author

@TParcollet

I have addressed most of your points. Regarding the unit tests and the suggestion to use the SB module instead of the provided module: as I mentioned in the comments, I'm currently working on another recipe that incorporates several quantization techniques. This requires re-implementing many of these techniques in SB from scratch.

Once that work is complete, I’ll focus on modifying these models accordingly. For now, since these models are in the integration folder, there shouldn’t be any issues.

@TParcollet TParcollet left a comment

Ignore this comment and review; GitHub completely bugged out and I couldn't see the updated version of the code. I'll do another review with the correct code.

print(f"File downloaded, extracted to '{save_path}', and ZIP file removed.")


def decimal_to_ternary_matrix(decimals, D):
Collaborator

@poonehmousavi I still don't see it, even though it is marked as 'done'?

@@ -0,0 +1,157 @@
"""This lobe enables the integration of pretrained WavTokenizer.
Collaborator

wavTokenzier.py --> wavTokenizer.py. Also @pplantinga, maybe we want to have a look at unifying the naming in lobes as part of the integration folder? Like camel case for all models?



class WavTokenizer(nn.Module):
"""An wrapper for the WavTokenizer model
Collaborator

Don't see it fixed.

@TParcollet TParcollet left a comment

@poonehmousavi thanks for your work! This is a conditional approval, as a few important comments have been pushed to another PR. If that other PR does not come in, we will not be able to release this PR into the master branch (i.e. it will be reverted).

emb, emb_quant, x = self.scalar_codec.inference(wav)
return x.detach().cpu().squeeze(0)

@property
Collaborator

This function does nothing in that case, as it's not even based on a class attribute that may change. I propose to just remove this function (it's always true, so there is no point in having it).


class ScalarModel(nn.Module):
"""
A custom neural network model for encoding and decoding audio signals.
Collaborator

Sure, but a SpeechBrain user will not look at another class's docstring when looking at this one. Can we add the relevant reference to this class docstring as well?

return x


class CustomRoundingFunction(Function):
Collaborator

I'll trust you on this. It is critical, don't forget the unit test as it definitely is a function that could go yolo in the future. Please ping me on the upcoming PR so I can double check as well.

return output


class DownsampleLayer(nn.Module):
Collaborator

Well, @pplantinga is just supposed to transpose code to another folder (with the integration folder), not to re-develop modules -- especially ones that he does not master. I'll trust you again on that, but the unit tests and this are important issues; if we don't see a PR addressing them at some point, we will have to revert. It also becomes harder for the reviewers to track, because your new PR will certainly add many new features that require review, and we will forget about this -- so please, when creating the PR, refer back to this one and this comment.

remove_weight_norm(self.layer)


class UpsampleLayer(nn.Module):
Collaborator

Tag for later -- this must be verified in new PR.

return x


class Conv1d(nn.Conv1d):
Collaborator

That is starting to be a lot of things. If not done in next PR, we will have to revert.

return super(Conv1d, self).forward(x)


class ConvTranspose1d(nn.ConvTranspose1d):
Collaborator

tag for further review next PR

print(f"File downloaded, extracted to '{save_path}', and ZIP file removed.")


def decimal_to_ternary_matrix(decimals, D):
Collaborator

tag for newer PR

return ternary_matrix


def ternary_matrix_to_decimal(matrix):
Collaborator

tag for newer PR

@@ -0,0 +1,165 @@
"""This lobe enables the integration of pretrained WavTokenizer.
Collaborator

filename (wavTokenzier -> wavTokenizer)

@poonehmousavi
Collaborator Author

@TParcollet Since all the remaining points are related to SQ-Codec, I could remove the file and we merge only WavTokenizer and Mimi for now. Later I will open a new PR for SQ-Codec.

@TParcollet
Collaborator

@poonehmousavi sounds good to me!

@poonehmousavi
Collaborator Author

@TParcollet and @mravanelli I have removed the sq_codec for now. The other tokenizers are, I think, ready to merge.

@poonehmousavi
Collaborator Author

No idea why I get an error for MERT here; I didn't change anything, just removed a file... is there any new update to CI that could cause this?

@pplantinga
Collaborator

No idea why I get an error for MERT here; I didn't change anything, just removed a file... is there any new update to CI that could cause this?

Probably an error on the huggingface side. I have run the test locally on this branch and it passes. I restarted the test, hopefully it passes this time. And in the future, this will be part of the integrations, so failures here won't cause CI failures.

@poonehmousavi
Collaborator Author

Still the same problem. I remember it had the same issue before, but we fixed it then by adding a WARNING: ... line in the docstring to catch that error.

@pplantinga
Collaborator

Aha, this is actually due to a new version of transformers. I just upgraded my local version and got the same error.

@pplantinga
Collaborator

This test will be moved to integrations folder soon, so let's find a simple way to skip it for now

@poonehmousavi
Collaborator Author

Could we simply add a skip for the test for now?
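One simple way to skip the failing test for now is the stdlib unittest skip decorator (a hedged sketch: the test-case and method names below are hypothetical, and the actual SpeechBrain CI may use pytest markers or doctest directives instead):

```python
import unittest


class MERTIntegrationTest(unittest.TestCase):
    # Hypothetical test name, for illustration; skipped until the test
    # moves to the integrations folder and CI is decoupled from it.
    @unittest.skip("Fails with newer transformers; moving to integrations")
    def test_mert_inference(self):
        raise RuntimeError("would hit Hugging Face and fail on CI")
```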

@pplantinga pplantinga merged commit e1fe891 into speechbrain:develop Jan 10, 2025
@poonehmousavi poonehmousavi deleted the audiotokenizers branch February 5, 2025 01:34

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants