Skip to content

audio_utils improvements#21998

Merged
sgugger merged 39 commits intohuggingface:mainfrom
hollance:audio_utils
May 9, 2023
Merged

audio_utils improvements#21998
sgugger merged 39 commits intohuggingface:mainfrom
hollance:audio_utils

Conversation

@hollance
Copy link
Copy Markdown
Contributor

@hollance hollance commented Mar 7, 2023

What does this PR do?

Recently the audio_utils.py file was added to Transformers to provide shared functions for audio processing such as STFT. This PR aims to clean up the code and make the API more robust.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@HuggingFaceDocBuilderDev
Copy link
Copy Markdown

HuggingFaceDocBuilderDev commented Mar 7, 2023

The documentation is not available anymore as the PR was closed or merged.

@hollance
Copy link
Copy Markdown
Contributor Author

hollance commented Mar 7, 2023

I cleaned up hertz_to_mel and mel_to_hertz a bit:

  • more consistent doc comments
  • both support single float inputs as well as numpy arrays
  • simplified the formulas so it's not literally the same as the librosa code but also doesn't do pointless calculations

Since I think this implementation was based on librosa, we should also give them credit.

Copy link
Copy Markdown
Collaborator

@ArthurZucker ArthurZucker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking care of the support for numpy arrays. As long as the whisper and CLAP extraction tests pass (they are not slow so should be good!) should be good!

@hollance
Copy link
Copy Markdown
Contributor Author

I rewrote power_to_db and added amplitude_to_db. They still work like the librosa versions but with argument names that make more sense to me.

@hollance
Copy link
Copy Markdown
Contributor Author

Changed get_mel_filter_banks into mel_filter_bank. Mostly renamed arguments and variables and cleaned up the doc comments, so that the naming is more in line with the rest of Transformers, e.g. num_frequency_bins instead of nb_frequency_bins.

@hollance hollance force-pushed the audio_utils branch 2 times, most recently from 23a2c80 to c06a824 Compare April 13, 2023 13:08
@hollance
Copy link
Copy Markdown
Contributor Author

Pushed significant changes to the stft code.

  • Removed fram_wave; this is really an implementation detail that should happen inside the STFT.

  • The new stft gives the same results as librosa and torchaudio for the same options. It's 25% faster than the previous implementation, mostly due to using rfft instead of fft (since the input is always real-only, not complex).

  • librosa is still faster since they use a bunch of tricks under the hood to avoid memory copies etc; we can slowly work towards matching this speed (not super important to do this immediately since the new stft is already faster than what we had before)

  • No batching yet.

I will be replacing the other hand-rolled STFTs with this soon (also in this PR).

None of the changes I made are set in stone — feel free to discuss things like the argument names, the shapes of the returned tensors, and so on.

@hollance hollance force-pushed the audio_utils branch 2 times, most recently from 5794a00 to b382211 Compare April 18, 2023 11:04
@hollance
Copy link
Copy Markdown
Contributor Author

Replaced the hand-rolled STFT in the different models with the one from audio_utils:

  • CLAP
  • M-CTC-T
  • SpeechT5
  • TVLT
  • Whisper

Did not do audio_spectrogram_transformer and speech_to_text. These use ta_kaldi.fbank, which is simple enough and faster than audio_utils. If we want to get completely rid of torchaudio we could also replace these.

@hollance hollance marked this pull request as ready for review April 18, 2023 13:37
@hollance
Copy link
Copy Markdown
Contributor Author

@sanchit-gandhi @ArthurZucker I think this is ready for review now. Feel free to look at this with a critical eye!

The STFT code is currently written for ease of understanding and flexibility, not speed, although it does outperform the previous methods we were using.

Copy link
Copy Markdown
Collaborator

@ArthurZucker ArthurZucker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👏🏻 Awesome work! The utils look super nice, great work on using this on the other models and adding tests were it was missing!
I don't remember how many other models would benefit from this (maybe 2-3) but can definitely be in a follow up PR! Same with batching that would bring even more impact!
Great work here! 🚀

Comment thread docs/source/en/internal/audio_utils.mdx Outdated
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changes here are kind of breaking no? these functions were not private! Would just be in favor of warning

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

audio_utils isn't private per se, but it's also not really meant as a public API either. Adding a warning for get_mel_filter_banks is possible but then we'd also need to add a warning for stft and rename the new stft function since it uses very different arguments. I doubt that's worth it.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They are in the main documentation, I don't know how it gets more public than this.

Comment on lines 34 to 40
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice seing this disappear

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these are also part of another pr! Let's remove them once it is merged

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I can do a rebase and they should go away then.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice!

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cool that you added that test!

Comment thread src/transformers/audio_utils.py Outdated
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not 100% sure we need this one-liner

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a one-liner but the code is non-trivial.

Comment thread src/transformers/audio_utils.py Outdated
Comment on lines 220 to 203
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this does not seem to be used + not really a fan of one-liners!

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, I should actually call this function. ;-) It's a non-trivial calculation, so I think it's OK to have a function for it.

Comment thread src/transformers/audio_utils.py Outdated
Comment on lines 270 to 250
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ooo interesting that's why I had to do this with Whisper!

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cool I think it is important to have these kind of params in the feature extractor config to know what it special about each one when extracting mel.

Comment thread src/transformers/audio_utils.py Outdated
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think rfft also supports batching !

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does, and that's why librosa is currently faster. Even for a single input waveform they split it into batches.

Copy link
Copy Markdown
Contributor

@sanchit-gandhi sanchit-gandhi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice PR @hollance. The audio utils code is clear, comprehensive and easy to understand. Great to see so much feature extractor code being replaced by a simple one-function calls! Given the flexibility in the code, this will really simplify new feature extractor additions going forwards.

Comment thread src/transformers/audio_utils.py Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think this is maybe a bit too 'magic' to define here and import in the other files? Wonder maybe if a more verbose docstring could help explain how the optimal FFT length is computed - think this would increase the chance a contributor would use this function in a new model addition

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point!

Comment thread src/transformers/audio_utils.py Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice!

Comment thread src/transformers/audio_utils.py Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good for me to leave this un-batched over the STFT frames for the time being

Comment thread src/transformers/audio_utils.py Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Whether to pad the waveform so that so that frame `t` is centered around time `t * hop_length`. If `False`,
Whether to pad the waveform so that frame `t` is centered around time `t * hop_length`. If `False`,

Comment thread src/transformers/audio_utils.py Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Worth also maybe summarising "edge" and "reflect"?

Comment thread src/transformers/audio_utils.py Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not faster to force dtype=np.float32 with the fft?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I see this is not possible with np.fft and is used for a faster, compiled implementation (https://numpy.org/doc/stable/reference/routines.fft.html#type-promotion)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

E.g. here I'm not sure if the user would immediately reach for optimal_fft_length or just write this themselves based on the current docstring!

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(nit) Is it clearer to transpose here or in the __call__? Just wonder whether we can have consistency across our feature extractors by always returning the non-transposed version?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've used the shapes as returned by librosa and torchaudio, which is (bins, length) but not every model uses it in that order. For existing models I didn't want to change the shape of tensors returned by the feature extractors.

To me it makes most sense to have the feature extractor return the spectrogram in the shape the model intends to use it.

(BTW, right now none of the feature extractors actually documents its return values.)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good to me!

Comment thread tests/utils/test_audio_utils.py Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very comprehensive! Wonder whether we can slim it down to 3-4 tests that have max coverage (e.g. that would tell us if we break something)? Or as Arthur suggested run it as a nightly test only (cc @ydshieh)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a lot of combinations to test, so I wouldn't want to remove any of these tests.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you run test against this test file on a CPU machine, and see how long it takes?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ydshieh On my 2019 Intel iMac it takes 14 seconds to run these tests.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be fine to run all these on CircleCI, I assume either your hardware doesn't have GPU, or these tests don't use real checkpoints.

I can run them on CircleCI runners to be sure this afternoon.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It just loads the "hf-internal-testing/librispeech_asr_dummy" dataset (like the tests of most of the audio models do) but no checkpoints.

@ArthurZucker ArthurZucker mentioned this pull request Apr 26, 2023
5 tasks
@hollance
Copy link
Copy Markdown
Contributor Author

hollance commented May 1, 2023

@sanchit-gandhi @ArthurZucker Are you OK with the PR in its current state? Then I can ask a core maintainer for a final review.

@sanchit-gandhi
Copy link
Copy Markdown
Contributor

Took a second look through and the changes LGTM @hollance!

@hollance hollance requested a review from sgugger May 3, 2023 09:42
Copy link
Copy Markdown
Collaborator

@sgugger sgugger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this and cleaning those up. I'm afraid renaming documented functions without taking care is not an option however, as it is breaking the public API. So either need to revert the naming (get_mel_filter_banks->mel_filter_banks is purely cosmetic so not worth doing IMO) or leave the functions with a deprecation warning (for fram_wave maybe?)

Comment thread docs/source/en/internal/audio_utils.mdx Outdated
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They are in the main documentation, I don't know how it gets more public than this.

Comment thread src/transformers/audio_utils.py Outdated
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't rename a documented function without at least a deprecation cycle. Since those are utils, let's maybe avoid all of this and not do the rename?

Comment thread src/transformers/audio_utils.py Outdated
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as before on the renaming.

Copy link
Copy Markdown
Contributor Author

@hollance hollance May 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want fram_wave gone or throwing an error. It should never have been exposed and keeping it "just because" achieves the opposite of what I wanted to do with this PR (clean up the code and make it solid).

Edit: Sorry if that sounded aggressive but I find it annoying that you're asking me to put bad code back after I spent a bunch of effort improving it. (And I disagree that choosing clear names is purely cosmetic.)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hollance As @sgugger mentioned, we can't just remove/change code without a deprecation cycle when it will break things. Even if it's bad code, it's exposed and users use them. We can't just think about the clean code without considering what our users will face.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sorry but backward-compatibility is not something we can compromise on. I can understand if the function from_wave should never have been exposed, but the harm is now done, and we need proper deprecation for at least two minor releases before removing it entirely.

Likewise for get_mel_filter_banks if you feel strongly about the renaming.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added the deprecated functions back with a warning. There's a failing test now but it seems unrelated.

Copy link
Copy Markdown
Collaborator

@sgugger sgugger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@hollance
Copy link
Copy Markdown
Contributor Author

hollance commented May 9, 2023

If everyone's happy with it, feel free to merge (I don't have rights).

@sgugger sgugger merged commit 7f91950 into huggingface:main May 9, 2023
gojiteji pushed a commit to gojiteji/transformers that referenced this pull request Jun 5, 2023
* silly change to allow making a PR

* clean up doc comments

* simplify hertz_to_mel and mel_to_hertz

* fixup

* clean up power_to_db

* also add amplitude_to_db

* move functions

* clean up mel_filter_bank

* fixup

* credit librosa & torchaudio authors

* add unit tests

* tests for power_to_db and amplitude_to_db

* add mel_filter_bank tests

* rewrite STFT

* add convenience spectrogram function

* missing transpose

* fewer transposes

* add integration test to M-CTC-T

* frame length can be either window or FFT length

* rewrite stft API

* add preemphasis coefficient

* move argument

* add log option to spectrogram

* replace M-CTC-T feature extractor

* fix api thing

* replace whisper STFT

* replace whisper mel filters

* replace tvlt's stft

* allow alternate window names

* replace speecht5 stft

* fixup

* fix integration tests

* fix doc comments

* remove manual FFT length calculation

* fix docs

* go away, deprecation warnings

* combine everything into spectrogram function

* add deprecated functions back

* fixup
novice03 pushed a commit to novice03/transformers that referenced this pull request Jun 23, 2023
* silly change to allow making a PR

* clean up doc comments

* simplify hertz_to_mel and mel_to_hertz

* fixup

* clean up power_to_db

* also add amplitude_to_db

* move functions

* clean up mel_filter_bank

* fixup

* credit librosa & torchaudio authors

* add unit tests

* tests for power_to_db and amplitude_to_db

* add mel_filter_bank tests

* rewrite STFT

* add convenience spectrogram function

* missing transpose

* fewer transposes

* add integration test to M-CTC-T

* frame length can be either window or FFT length

* rewrite stft API

* add preemphasis coefficient

* move argument

* add log option to spectrogram

* replace M-CTC-T feature extractor

* fix api thing

* replace whisper STFT

* replace whisper mel filters

* replace tvlt's stft

* allow alternate window names

* replace speecht5 stft

* fixup

* fix integration tests

* fix doc comments

* remove manual FFT length calculation

* fix docs

* go away, deprecation warnings

* combine everything into spectrogram function

* add deprecated functions back

* fixup
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants