audio_utils improvements by hollance · Pull Request #21998 · huggingface/transformers

hollance · 2023-03-07T14:27:19Z

What does this PR do?

Recently the audio_utils.py file was added to Transformers to provide shared functions for audio processing such as STFT. This PR aims to clean up the code and make the API more robust.

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline, Pull Request section?
Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.

HuggingFaceDocBuilderDev · 2023-03-07T14:57:21Z

The documentation is not available anymore as the PR was closed or merged.

hollance · 2023-03-07T15:43:25Z

I cleaned up hertz_to_mel and mel_to_hertz a bit:

more consistent doc comments
both support single float inputs as well as numpy arrays
simplified the formulas so it's not literally the same as the librosa code but also doesn't do pointless calculations

Since I think this implementation was based on librosa, we should also give them credit.
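For readers following along, here is a minimal sketch of a conversion pair that accepts both single floats and NumPy arrays. It assumes the Slaney-style mel scale that librosa uses by default (linear below 1000 Hz, logarithmic above). The function names and constants are illustrative and are not copied from the PR's code.

```python
import numpy as np

# Constants of the Slaney-style mel scale (assumed here, as in librosa's default).
MIN_LOG_HERTZ = 1000.0            # frequency where the scale switches to logarithmic
MIN_LOG_MEL = 15.0                # mel value at that switch point
LOG_STEP = 27.0 / np.log(6.4)     # mels per natural-log unit of frequency ratio


def hertz_to_mel_sketch(freq):
    """Convert Hz to mels; accepts a float or a NumPy array."""
    freq_arr = np.asarray(freq, dtype=np.float64)
    linear = 3.0 * freq_arr / 200.0
    # Clamp inside the log so values below 1000 Hz do not produce warnings.
    log_part = MIN_LOG_MEL + np.log(np.maximum(freq_arr, MIN_LOG_HERTZ) / MIN_LOG_HERTZ) * LOG_STEP
    mels = np.where(freq_arr >= MIN_LOG_HERTZ, log_part, linear)
    return float(mels) if np.ndim(freq) == 0 else mels


def mel_to_hertz_sketch(mels):
    """Convert mels to Hz; accepts a float or a NumPy array."""
    mels_arr = np.asarray(mels, dtype=np.float64)
    linear = 200.0 * mels_arr / 3.0
    log_part = MIN_LOG_HERTZ * np.exp((mels_arr - MIN_LOG_MEL) / LOG_STEP)
    freq = np.where(mels_arr >= MIN_LOG_MEL, log_part, linear)
    return float(freq) if np.ndim(mels) == 0 else freq
```

Scalar inputs come back as plain floats and array inputs as arrays, which is the behaviour described in the list above.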

ArthurZucker

Thanks for taking care of the support for numpy arrays. As long as the Whisper and CLAP extraction tests pass (they are not slow, so that should be easy to check), this should be good!

hollance · 2023-03-28T14:41:19Z

I rewrote power_to_db and added amplitude_to_db. They still work like the librosa versions but with argument names that make more sense to me.
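As a rough illustration of what such a pair of functions computes, here is a sketch under stated assumptions: the argument names `reference`, `min_value` and `db_range` are my own choice for this example, and the thread does not spell out the exact signature used in the PR.

```python
import numpy as np


def power_to_db_sketch(spectrogram, reference=1.0, min_value=1e-10, db_range=None):
    """Convert a power spectrogram to decibels: 10 * log10(S / reference)."""
    spectrogram = np.clip(np.asarray(spectrogram, dtype=np.float64), a_min=min_value, a_max=None)
    db = 10.0 * (np.log10(spectrogram) - np.log10(max(min_value, reference)))
    if db_range is not None:
        # Limit the dynamic range to `db_range` dB below the peak.
        db = np.maximum(db, db.max() - db_range)
    return db


def amplitude_to_db_sketch(spectrogram, reference=1.0, min_value=1e-5, db_range=None):
    """Convert an amplitude spectrogram to decibels: 20 * log10(S / reference)."""
    spectrogram = np.clip(np.asarray(spectrogram, dtype=np.float64), a_min=min_value, a_max=None)
    db = 20.0 * (np.log10(spectrogram) - np.log10(max(min_value, reference)))
    if db_range is not None:
        db = np.maximum(db, db.max() - db_range)
    return db
```

The only difference between the two is the factor of 10 versus 20, because power is the square of amplitude.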

hollance · 2023-03-29T11:42:12Z

Changed get_mel_filter_banks into mel_filter_bank. Mostly renamed arguments and variables and cleaned up the doc comments, so that the naming is more in line with the rest of Transformers, e.g. num_frequency_bins instead of nb_frequency_bins.
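To make the idea of a mel filter bank concrete, here is a small sketch that builds triangular filters. It assumes the HTK mel formula for brevity (the PR may support other scales). Only `num_frequency_bins` is a name taken from the comment above; every other name is illustrative.

```python
import numpy as np


def _hz_to_mel_htk(freq):
    return 2595.0 * np.log10(1.0 + freq / 700.0)


def _mel_to_hz_htk(mels):
    return 700.0 * (10.0 ** (mels / 2595.0) - 1.0)


def mel_filter_bank_sketch(num_frequency_bins, num_mel_filters, min_frequency, max_frequency, sampling_rate):
    """Return a (num_frequency_bins, num_mel_filters) matrix of triangular filters."""
    # Centre frequency of every FFT bin, from 0 Hz up to Nyquist.
    fft_freqs = np.linspace(0.0, sampling_rate / 2.0, num_frequency_bins)
    # Filter edges are equally spaced on the mel scale.
    mel_points = np.linspace(_hz_to_mel_htk(min_frequency), _hz_to_mel_htk(max_frequency), num_mel_filters + 2)
    filter_freqs = _mel_to_hz_htk(mel_points)

    edge_widths = np.diff(filter_freqs)                          # (num_mel_filters + 1,)
    distances = filter_freqs[None, :] - fft_freqs[:, None]       # (bins, num_mel_filters + 2)
    rising = -distances[:, :-2] / edge_widths[:-1]               # slope up to each filter's peak
    falling = distances[:, 2:] / edge_widths[1:]                 # slope down from each filter's peak
    return np.maximum(0.0, np.minimum(rising, falling))
```

Multiplying a power spectrogram of shape `(num_frames, num_frequency_bins)` by this matrix yields a mel spectrogram of shape `(num_frames, num_mel_filters)`.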

hollance · 2023-04-13T13:11:52Z

Pushed significant changes to the stft code.

Removed fram_wave; this is really an implementation detail that should happen inside the STFT.
The new stft gives the same results as librosa and torchaudio for the same options. It's 25% faster than the previous implementation, mostly due to using rfft instead of fft (since the input is always real-only, not complex).
librosa is still faster since they use a bunch of tricks under the hood to avoid memory copies etc; we can slowly work towards matching this speed (not super important to do this immediately since the new stft is already faster than what we had before)
No batching yet.
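The `rfft` point above can be sketched as follows. This is a simplified, unbatched STFT written for illustration only; the function name, the periodic Hann window and the reflect padding are assumptions of this example, not a copy of the PR's implementation.

```python
import numpy as np


def stft_sketch(waveform, frame_length, hop_length, center=True):
    """Return a complex spectrogram of shape (frame_length // 2 + 1, num_frames)."""
    window = np.hanning(frame_length + 1)[:-1]  # periodic Hann window
    if center:
        pad = frame_length // 2
        waveform = np.pad(waveform, (pad, pad), mode="reflect")

    num_frames = 1 + (len(waveform) - frame_length) // hop_length
    num_bins = frame_length // 2 + 1
    spectrogram = np.empty((num_frames, num_bins), dtype=np.complex128)

    for i in range(num_frames):
        start = i * hop_length
        frame = waveform[start : start + frame_length] * window
        # The input is real, so rfft returns only the non-redundant half of the spectrum.
        spectrogram[i] = np.fft.rfft(frame)

    return spectrogram.T
```

For real input, `np.fft.rfft` returns the first `frame_length // 2 + 1` bins of the full FFT, which is where the saving over `np.fft.fft` comes from.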

I will be replacing the other hand-rolled STFTs with this soon (also in this PR).

None of the changes I made are set in stone — feel free to discuss things like the argument names, the shapes of the returned tensors, and so on.

hollance · 2023-04-18T13:30:49Z

Replaced the hand-rolled STFT in the different models with the one from audio_utils:

CLAP
M-CTC-T
SpeechT5
TVLT
Whisper

Did not do audio_spectrogram_transformer and speech_to_text. These use ta_kaldi.fbank, which is simple enough and faster than audio_utils. If we want to get rid of torchaudio completely, we could also replace these.

hollance · 2023-04-18T13:40:40Z

@sanchit-gandhi @ArthurZucker I think this is ready for review now. Feel free to look at this with a critical eye!

The STFT code is currently written for ease of understanding and flexibility, not speed, although it does outperform the previous methods we were using.

ArthurZucker

👏🏻 Awesome work! The utils look super nice, great work on using this in the other models and adding tests where they were missing!
I don't remember how many other models would benefit from this (maybe 2-3), but that can definitely be done in a follow-up PR! Same with batching, which would bring even more impact!
Great work here! 🚀

ArthurZucker · 2023-04-21T12:27:21Z

Changes here are kind of breaking, no? These functions were not private! I would just be in favor of adding a warning.

audio_utils isn't private per se, but it's also not really meant as a public API either. Adding a warning for get_mel_filter_banks is possible but then we'd also need to add a warning for stft and rename the new stft function since it uses very different arguments. I doubt that's worth it.

They are in the main documentation, I don't know how it gets more public than this.

ArthurZucker · 2023-04-21T12:28:53Z

nice seeing this disappear

ArthurZucker · 2023-04-21T12:32:16Z

these are also part of another pr! Let's remove them once it is merged

Yeah I can do a rebase and they should go away then.

ArthurZucker · 2023-04-21T12:32:36Z

ArthurZucker · 2023-04-21T12:32:54Z

cool that you added that test!

ArthurZucker · 2023-04-21T12:57:57Z

not 100% sure we need this one-liner

It's a one-liner but the code is non-trivial.

ArthurZucker · 2023-04-21T12:58:27Z

this does not seem to be used + not really a fan of one-liners!

Good point, I should actually call this function. ;-) It's a non-trivial calculation, so I think it's OK to have a function for it.

ArthurZucker · 2023-04-21T12:59:36Z

Ooo interesting that's why I had to do this with Whisper!

ArthurZucker · 2023-04-21T13:04:26Z

Cool, I think it is important to have these kinds of params in the feature extractor config, to know what is special about each one when extracting mel features.

ArthurZucker · 2023-04-21T13:05:20Z

I think rfft also supports batching !

It does, and that's why librosa is currently faster. Even for a single input waveform they split it into batches.
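As an illustration of the batching idea, one way to run a single `rfft` call over all frames at once is to build a strided view of the frames first. This is a sketch of the general technique, not a description of what librosa does internally; the function name is hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def batched_rfft_frames(waveform, frame_length, hop_length):
    """Window all frames and transform them with one rfft call.

    Returns an array of shape (num_frames, frame_length // 2 + 1).
    """
    # View of every possible frame without copying, then keep every hop_length-th one.
    frames = sliding_window_view(waveform, frame_length)[::hop_length]
    window = np.hanning(frame_length)
    # rfft operates along the last axis, so the leading axis acts as a batch.
    return np.fft.rfft(frames * window, axis=-1)
```

The multiplication by the window still materialises a copy of the frames, so this trades memory for fewer Python-level loop iterations.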

sanchit-gandhi

Very nice PR @hollance. The audio utils code is clear, comprehensive and easy to understand. Great to see so much feature extractor code being replaced by simple one-function calls! Given the flexibility in the code, this will really simplify new feature extractor additions going forwards.

sanchit-gandhi · 2023-04-24T16:37:01Z

Do you think this is maybe a bit too 'magic' to define here and import in the other files? Wonder maybe if a more verbose docstring could help explain how the optimal FFT length is computed - think this would increase the chance a contributor would use this function in a new model addition

Good point!
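For context, a helper like this typically rounds the window length up to the next power of two, since FFTs of power-of-two length tend to be fastest. The sketch below assumes that behaviour; the name carries a `_sketch` suffix to make clear it is not the function from the PR.

```python
def optimal_fft_length_sketch(window_length: int) -> int:
    """Smallest power of two that is greater than or equal to window_length."""
    if window_length < 1:
        raise ValueError("window_length must be a positive integer")
    # (n - 1).bit_length() is the exponent of the next power of two for n >= 1.
    return 1 << (window_length - 1).bit_length()
```

For example, a 400-sample window would be zero-padded to an FFT length of 512 under this rule.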

sanchit-gandhi · 2023-04-24T16:39:29Z

sanchit-gandhi · 2023-04-24T16:40:07Z

Good for me to leave this un-batched over the STFT frames for the time being

sanchit-gandhi · 2023-04-24T16:43:18Z

Suggested change

Whether to pad the waveform so that so that frame `t` is centered around time `t * hop_length`. If `False`,

Whether to pad the waveform so that frame `t` is centered around time `t * hop_length`. If `False`,

sanchit-gandhi · 2023-04-24T16:44:22Z

Worth also maybe summarising "edge" and "reflect"?
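A quick way to see the difference between these padding modes is to run them through `np.pad` on a tiny array. The values below follow NumPy's documented behaviour; that the docstring's mode names map onto `np.pad` modes is an assumption of this example.

```python
import numpy as np

samples = np.array([1, 2, 3, 4])

# "reflect": mirror the signal around the first and last sample (edge sample not repeated).
reflect_padded = np.pad(samples, 2, mode="reflect")    # [3 2 1 2 3 4 3 2]

# "edge": repeat the first and last sample.
edge_padded = np.pad(samples, 2, mode="edge")          # [1 1 1 2 3 4 4 4]

# "constant": pad with zeros.
constant_padded = np.pad(samples, 2, mode="constant")  # [0 0 1 2 3 4 0 0]
```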

sanchit-gandhi · 2023-04-24T16:49:01Z

It's not faster to force dtype=np.float32 with the fft?

Ah I see this is not possible with np.fft and is used for a faster, compiled implementation (https://numpy.org/doc/stable/reference/routines.fft.html#type-promotion)

sanchit-gandhi · 2023-04-24T16:56:09Z

E.g. here I'm not sure if the user would immediately reach for optimal_fft_length or just write this themselves based on the current docstring!

sanchit-gandhi · 2023-04-24T17:02:46Z

(nit) Is it clearer to transpose here or in the __call__? Just wonder whether we can have consistency across our feature extractors by always returning the non-transposed version?

I've used the shapes as returned by librosa and torchaudio, which is (bins, length) but not every model uses it in that order. For existing models I didn't want to change the shape of tensors returned by the feature extractors.

To me it makes most sense to have the feature extractor return the spectrogram in the shape the model intends to use it.

(BTW, right now none of the feature extractors actually documents its return values.)

Sounds good to me!

sanchit-gandhi · 2023-04-24T17:03:25Z

Very comprehensive! Wonder whether we can slim it down to 3-4 tests that have max coverage (e.g. that would tell us if we break something)? Or as Arthur suggested run it as a nightly test only (cc @ydshieh)

There are a lot of combinations to test, so I wouldn't want to remove any of these tests.

Could you run this test file on a CPU machine, and see how long it takes?

@ydshieh On my 2019 Intel iMac it takes 14 seconds to run these tests.

I think it would be fine to run all these on CircleCI, I assume either your hardware doesn't have GPU, or these tests don't use real checkpoints.

I can run them on CircleCI runners to be sure this afternoon.

It just loads the "hf-internal-testing/librispeech_asr_dummy" dataset (like the tests of most of the audio models do) but no checkpoints.

hollance · 2023-05-01T09:42:22Z

@sanchit-gandhi @ArthurZucker Are you OK with the PR in its current state? Then I can ask a core maintainer for a final review.

sanchit-gandhi · 2023-05-02T16:50:47Z

Took a second look through and the changes LGTM @hollance!

sgugger

Thanks for working on this and cleaning those up. I'm afraid renaming documented functions without taking care is not an option however, as it is breaking the public API. So we either need to revert the naming (get_mel_filter_banks->mel_filter_bank is purely cosmetic so not worth doing IMO) or leave the functions with a deprecation warning (for fram_wave maybe?)

sgugger · 2023-05-03T13:00:18Z

They are in the main documentation, I don't know how it gets more public than this.

sgugger · 2023-05-03T13:01:40Z

Can't rename a documented function without at least a deprecation cycle. Since those are utils, let's maybe avoid all of this and not do the rename?

sgugger · 2023-05-03T13:03:04Z

Same comment as before on the renaming.

I want fram_wave gone or throwing an error. It should never have been exposed and keeping it "just because" achieves the opposite of what I wanted to do with this PR (clean up the code and make it solid).

Edit: Sorry if that sounded aggressive but I find it annoying that you're asking me to put bad code back after I spent a bunch of effort improving it. (And I disagree that choosing clear names is purely cosmetic.)

@hollance As @sgugger mentioned, we can't just remove/change code without a deprecation cycle when it will break things. Even if it's bad code, it's exposed and users use them. We can't just think about the clean code without considering what our users will face.

I'm sorry but backward-compatibility is not something we can compromise on. I can understand if the function fram_wave should never have been exposed, but the harm is now done, and we need proper deprecation for at least two minor releases before removing it entirely.

Likewise for get_mel_filter_banks if you feel strongly about the renaming.

Added the deprecated functions back with a warning. There's a failing test now but it seems unrelated.
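The general pattern for keeping an old name alive during a deprecation cycle looks roughly like the sketch below. The function bodies and the `nb_mel_filters` / `num_mel_filters` argument names are hypothetical placeholders; only the two function names and the `nb_frequency_bins` to `num_frequency_bins` rename come from this thread.

```python
import warnings


def mel_filter_bank(num_frequency_bins, num_mel_filters):
    # Placeholder body standing in for the new implementation.
    return (num_frequency_bins, num_mel_filters)


def get_mel_filter_banks(nb_frequency_bins, nb_mel_filters):
    """Deprecated alias that warns and forwards to the new function."""
    warnings.warn(
        "`get_mel_filter_banks` is deprecated and will be removed in a future version; "
        "use `mel_filter_bank` instead.",
        FutureWarning,
        stacklevel=2,  # point the warning at the caller, not at this wrapper
    )
    return mel_filter_bank(num_frequency_bins=nb_frequency_bins, num_mel_filters=nb_mel_filters)
```

Existing callers keep working and see a warning, which gives them time to migrate before the old name is removed.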

sgugger

Thanks!

hollance · 2023-05-09T08:34:19Z

If everyone's happy with it, feel free to merge (I don't have rights).

* silly change to allow making a PR
* clean up doc comments
* simplify hertz_to_mel and mel_to_hertz
* fixup
* clean up power_to_db
* also add amplitude_to_db
* move functions
* clean up mel_filter_bank
* fixup
* credit librosa & torchaudio authors
* add unit tests
* tests for power_to_db and amplitude_to_db
* add mel_filter_bank tests
* rewrite STFT
* add convenience spectrogram function
* missing transpose
* fewer transposes
* add integration test to M-CTC-T
* frame length can be either window or FFT length
* rewrite stft API
* add preemphasis coefficient
* move argument
* add log option to spectrogram
* replace M-CTC-T feature extractor
* fix api thing
* replace whisper STFT
* replace whisper mel filters
* replace tvlt's stft
* allow alternate window names
* replace speecht5 stft
* fixup
* fix integration tests
* fix doc comments
* remove manual FFT length calculation
* fix docs
* go away, deprecation warnings
* combine everything into spectrogram function
* add deprecated functions back
* fixup

ArthurZucker approved these changes Mar 27, 2023

hollance force-pushed the audio_utils branch from eeddef2 to e2bed8e Compare March 28, 2023 12:48

hollance force-pushed the audio_utils branch from 1a9fcd1 to b54874f Compare March 29, 2023 11:48

hollance force-pushed the audio_utils branch 2 times, most recently from 23a2c80 to c06a824 Compare April 13, 2023 13:08

hollance force-pushed the audio_utils branch 2 times, most recently from 5794a00 to b382211 Compare April 18, 2023 11:04

hollance marked this pull request as ready for review April 18, 2023 13:37

hollance requested review from ArthurZucker and sanchit-gandhi April 18, 2023 13:40

ArthurZucker approved these changes Apr 21, 2023

hollance force-pushed the audio_utils branch from c911ed6 to b01f53b Compare April 24, 2023 09:14

sanchit-gandhi approved these changes Apr 24, 2023

hollance force-pushed the audio_utils branch from b01f53b to e55d59d Compare April 25, 2023 09:05

ArthurZucker mentioned this pull request Apr 26, 2023

Add Pop2Piano #21785

Merged

5 tasks

hollance requested a review from sgugger May 3, 2023 09:42

sgugger reviewed May 3, 2023

hollance added 4 commits May 8, 2023 11:09

silly change to allow making a PR

ff3f407

clean up doc comments

1050de5

simplify hertz_to_mel and mel_to_hertz

fc590c7

fixup

5c27568

hollance added 20 commits May 8, 2023 11:09

rewrite stft API

bcb6c79

add preemphasis coefficient

41b8501

move argument

e8663b0

add log option to spectrogram

4eedc5c

replace M-CTC-T feature extractor

71656b8

fix api thing

a3120c7

replace whisper STFT

99c1ce6

replace whisper mel filters

f896650

replace tvlt's stft

890fa72

allow alternate window names

1b2026e

replace speecht5 stft

a4680c9

fixup

067c87c

fix integration tests

22608a0

fix doc comments

534c07a

remove manual FFT length calculation

31f30fd

fix docs

d3144c5

go away, deprecation warnings

7151ba0

combine everything into spectrogram function

dd1046b

add deprecated functions back

9582720

fixup

b3eba99

hollance force-pushed the audio_utils branch from e55d59d to b3eba99 Compare May 8, 2023 11:52

sgugger approved these changes May 8, 2023

hollance mentioned this pull request May 8, 2023

Whisper feature extraction: tiny condition check error #23203

Closed

sgugger merged commit 7f91950 into huggingface:main May 9, 2023

hollance mentioned this pull request May 15, 2023

Flaky Whisper PT-TF & PT-Flax Equivalence Test #23258

Open

4 tasks

poonehmousavi mentioned this pull request Jun 22, 2023

Change needed in Whisper fine-tuning recipe to accommodate transformers4.30.0 speechbrain/speechbrain#2016

Merged

sanchit-gandhi mentioned this pull request Jul 3, 2023

Fix audio feature extractor deps #24636

Merged
