Conversation
|
The documentation is not available anymore as the PR was closed or merged. |
|
I cleaned up
Since I think this implementation was based on librosa, we should also give them credit. |
ArthurZucker
left a comment
There was a problem hiding this comment.
Thanks for taking care of the support for numpy arrays. As long as the whisper and CLAP extraction tests pass (they are not slow so should be good!) should be good!
|
I rewrote |
|
Changed |
23a2c80 to
c06a824
Compare
|
Pushed significant changes to the
I will be replacing the other hand-rolled STFTs with this soon (also in this PR). None of the changes I made are set in stone — feel free to discuss things like the argument names, the shapes of the returned tensors, and so on. |
5794a00 to
b382211
Compare
|
Replaced the hand-rolled STFT in the different models with the one from
Did not do |
|
@sanchit-gandhi @ArthurZucker I think this is ready for review now. Feel free to look at this with a critical eye! The STFT code is currently written for ease of understanding and flexibility, not speed, although it does outperform the previous methods we were using. |
ArthurZucker
left a comment
There was a problem hiding this comment.
👏🏻 Awesome work! The utils look super nice, great work on using this on the other models and adding tests were it was missing!
I don't remember how many other models would benefit from this (maybe 2-3) but can definitely be in a follow up PR! Same with batching that would bring even more impact!
Great work here! 🚀
There was a problem hiding this comment.
changes here are kind of breaking no? these functions were not private! Would just be in favor of warning
There was a problem hiding this comment.
audio_utils isn't private per se, but it's also not really meant as a public API either. Adding a warning for get_mel_filter_banks is possible but then we'd also need to add a warning for stft and rename the new stft function since it uses very different arguments. I doubt that's worth it.
There was a problem hiding this comment.
They are in the main documentation, I don't know how it gets more public than this.
There was a problem hiding this comment.
nice seing this disappear
There was a problem hiding this comment.
these are also part of another pr! Let's remove them once it is merged
There was a problem hiding this comment.
Yeah I can do a rebase and they should go away then.
There was a problem hiding this comment.
cool that you added that test!
There was a problem hiding this comment.
not 100% sure we need this one-liner
There was a problem hiding this comment.
It's a one-liner but the code is non-trivial.
There was a problem hiding this comment.
this does not seem to be used + not really a fan of one-liners!
There was a problem hiding this comment.
Good point, I should actually call this function. ;-) It's a non-trivial calculation, so I think it's OK to have a function for it.
There was a problem hiding this comment.
Ooo interesting that's why I had to do this with Whisper!
There was a problem hiding this comment.
cool I think it is important to have these kind of params in the feature extractor config to know what it special about each one when extracting mel.
There was a problem hiding this comment.
I think rfft also supports batching !
There was a problem hiding this comment.
It does, and that's why librosa is currently faster. Even for a single input waveform they split it into batches.
sanchit-gandhi
left a comment
There was a problem hiding this comment.
Very nice PR @hollance. The audio utils code is clear, comprehensive and easy to understand. Great to see so much feature extractor code being replaced by a simple one-function calls! Given the flexibility in the code, this will really simplify new feature extractor additions going forwards.
There was a problem hiding this comment.
Do you think this is maybe a bit too 'magic' to define here and import in the other files? Wonder maybe if a more verbose docstring could help explain how the optimal FFT length is computed - think this would increase the chance a contributor would use this function in a new model addition
There was a problem hiding this comment.
Good for me to leave this un-batched over the STFT frames for the time being
There was a problem hiding this comment.
| Whether to pad the waveform so that so that frame `t` is centered around time `t * hop_length`. If `False`, | |
| Whether to pad the waveform so that frame `t` is centered around time `t * hop_length`. If `False`, |
There was a problem hiding this comment.
Worth also maybe summarising "edge" and "reflect"?
There was a problem hiding this comment.
It's not faster to force dtype=np.float32 with the fft?
There was a problem hiding this comment.
Ah I see this is not possible with np.fft and is used for a faster, compiled implementation (https://numpy.org/doc/stable/reference/routines.fft.html#type-promotion)
There was a problem hiding this comment.
E.g. here I'm not sure if the user would immediately reach for optimal_fft_length or just write this themselves based on the current docstring!
There was a problem hiding this comment.
(nit) Is it clearer to transpose here or in the __call__? Just wonder whether we can have consistency across our feature extractors by always returning the non-transposed version?
There was a problem hiding this comment.
I've used the shapes as returned by librosa and torchaudio, which is (bins, length) but not every model uses it in that order. For existing models I didn't want to change the shape of tensors returned by the feature extractors.
To me it makes most sense to have the feature extractor return the spectrogram in the shape the model intends to use it.
(BTW, right now none of the feature extractors actually documents its return values.)
There was a problem hiding this comment.
Very comprehensive! Wonder whether we can slim it down to 3-4 tests that have max coverage (e.g. that would tell us if we break something)? Or as Arthur suggested run it as a nightly test only (cc @ydshieh)
There was a problem hiding this comment.
There are a lot of combinations to test, so I wouldn't want to remove any of these tests.
There was a problem hiding this comment.
Could you run test against this test file on a CPU machine, and see how long it takes?
There was a problem hiding this comment.
@ydshieh On my 2019 Intel iMac it takes 14 seconds to run these tests.
There was a problem hiding this comment.
I think it would be fine to run all these on CircleCI, I assume either your hardware doesn't have GPU, or these tests don't use real checkpoints.
I can run them on CircleCI runners to be sure this afternoon.
There was a problem hiding this comment.
It just loads the "hf-internal-testing/librispeech_asr_dummy" dataset (like the tests of most of the audio models do) but no checkpoints.
|
@sanchit-gandhi @ArthurZucker Are you OK with the PR in its current state? Then I can ask a core maintainer for a final review. |
|
Took a second look through and the changes LGTM @hollance! |
sgugger
left a comment
There was a problem hiding this comment.
Thanks for working on this and cleaning those up. I'm afraid renaming documented functions without taking care is not an option however, as it is breaking the public API. So either need to revert the naming (get_mel_filter_banks->mel_filter_banks is purely cosmetic so not worth doing IMO) or leave the functions with a deprecation warning (for fram_wave maybe?)
There was a problem hiding this comment.
They are in the main documentation, I don't know how it gets more public than this.
There was a problem hiding this comment.
Can't rename a documented function without at least a deprecation cycle. Since those are utils, let's maybe avoid all of this and not do the rename?
There was a problem hiding this comment.
Same comment as before on the renaming.
There was a problem hiding this comment.
I want fram_wave gone or throwing an error. It should never have been exposed and keeping it "just because" achieves the opposite of what I wanted to do with this PR (clean up the code and make it solid).
Edit: Sorry if that sounded aggressive but I find it annoying that you're asking me to put bad code back after I spent a bunch of effort improving it. (And I disagree that choosing clear names is purely cosmetic.)
There was a problem hiding this comment.
There was a problem hiding this comment.
I'm sorry but backward-compatibility is not something we can compromise on. I can understand if the function from_wave should never have been exposed, but the harm is now done, and we need proper deprecation for at least two minor releases before removing it entirely.
Likewise for get_mel_filter_banks if you feel strongly about the renaming.
There was a problem hiding this comment.
Added the deprecated functions back with a warning. There's a failing test now but it seems unrelated.
|
If everyone's happy with it, feel free to merge (I don't have rights). |
* silly change to allow making a PR * clean up doc comments * simplify hertz_to_mel and mel_to_hertz * fixup * clean up power_to_db * also add amplitude_to_db * move functions * clean up mel_filter_bank * fixup * credit librosa & torchaudio authors * add unit tests * tests for power_to_db and amplitude_to_db * add mel_filter_bank tests * rewrite STFT * add convenience spectrogram function * missing transpose * fewer transposes * add integration test to M-CTC-T * frame length can be either window or FFT length * rewrite stft API * add preemphasis coefficient * move argument * add log option to spectrogram * replace M-CTC-T feature extractor * fix api thing * replace whisper STFT * replace whisper mel filters * replace tvlt's stft * allow alternate window names * replace speecht5 stft * fixup * fix integration tests * fix doc comments * remove manual FFT length calculation * fix docs * go away, deprecation warnings * combine everything into spectrogram function * add deprecated functions back * fixup
* silly change to allow making a PR * clean up doc comments * simplify hertz_to_mel and mel_to_hertz * fixup * clean up power_to_db * also add amplitude_to_db * move functions * clean up mel_filter_bank * fixup * credit librosa & torchaudio authors * add unit tests * tests for power_to_db and amplitude_to_db * add mel_filter_bank tests * rewrite STFT * add convenience spectrogram function * missing transpose * fewer transposes * add integration test to M-CTC-T * frame length can be either window or FFT length * rewrite stft API * add preemphasis coefficient * move argument * add log option to spectrogram * replace M-CTC-T feature extractor * fix api thing * replace whisper STFT * replace whisper mel filters * replace tvlt's stft * allow alternate window names * replace speecht5 stft * fixup * fix integration tests * fix doc comments * remove manual FFT length calculation * fix docs * go away, deprecation warnings * combine everything into spectrogram function * add deprecated functions back * fixup
What does this PR do?
Recently the
audio_utils.pyfile was added to Transformers to provide shared functions for audio processing such as STFT. This PR aims to clean up the code and make the API more robust.Before submitting
Pull Request section?
to it if that's the case.
documentation guidelines, and
here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.