
Commit f9f21c6

asumagic and mravanelli authored
Streaming ASR interfaces (#2377)
* Implemented high level streaming interfaces More WIP Bunch of filter properties impl More WIP interfaces stuff Fix type annotation in filter_analysis more wip interfaces wip Fix wrong context var set wip thoughts Implement file transcription
* Renames and fixes
* Add transcribe_file_streaming
* Formatting fixes
* Revert accidentally introduced change to max_batch_len
* Reworking interface naming and docstring
* Use the searcher directly, forwarding extra args
* More docstrings
* Fix parameter order
* Add WIP StreamingTransducerASR example
* Formatting
* More docstrings and renames for interfaces
* Docstrings for context
* Merge the unnecessary wrapper
* Rename StreamingTransducerASR to StreamingASR
* Fix precommit
* Remove unused fea_extractor field
* Fix test error by commenting out inference stuff
* Add some docstrings to streamingfeaturewrapper
* More docs
* Formatting
* Feature extraction streaming wrapper docstrings
* Add missing file docstring for filter_analysis
* Tentative fix for docs gen error
* Fix some missing docstring args in ASR
* Allow using ffmpeg streaming with StreamingASR
* Extract stream logic into _get_audio_stream
* Docstring for _get_audio_stream
* Formatting
* Move out some streaming tokenizer logic
* Accept stupid suggestions from formatter
* Somewhat more generic StreamingASR
* Tokenizer-agnostic StreamingASR
* Add commented out tokenizer streaming hparams
* Add missing docstring
* Remove unused import from ASR
* CI and configuration fixes; use python 3.9 in CI
* Fix doctest using inconsistent left context size
* Clarify on tokenizer_context init
* Update HPARAMS_NEEDED for StreamingASR
* Improve transducer forward docs for extra args
* Fix code blocks in filter_analysis
* Linting
* fix broken indent in filter_analysis examples...
* Update author lists
* Remove currently unused has_overlap
* Clarify on fea_streaming_extractor properties
* Fix ASRStreamingContext doc wording
* Improve docstring for `get_chunk_size_frames`
* wip test
* Streaming feature wrapper test + better docs
* Improve StreamingFeatureWraper docstring
* Improve docstring and comments on spm streaming decode
* Fixed accidentally duplicated docstring
* Fix very stupid typo
* Add notice for trained streaming ASR inference
* Use LengthsCapableSequential instead of custom wrapper
* Precommit fix
* Added mechanism to inject zero chunks at the end to fix trunc
* Simplify apply in YAML
* Add decoding_function abstraction for StreamingASR
* Fix partial apply shenanigans

---------

Co-authored-by: Mirco Ravanelli <mirco.ravanelli@gmail.com>
1 parent f9b5473 commit f9f21c6

18 files changed: 1081 additions & 31 deletions


.github/workflows/pre-commit.yml

Lines changed: 1 addition & 1 deletion

@@ -12,5 +12,5 @@ jobs:
     - uses: actions/checkout@v2
     - uses: actions/setup-python@v2
       with:
-        python-version: '3.8'
+        python-version: '3.9'
     - uses: pre-commit/action@v2.0.3

.github/workflows/release.yml

Lines changed: 1 addition & 1 deletion

@@ -17,7 +17,7 @@ jobs:
         ref: main
     - uses: actions/setup-python@v2
       with:
-        python-version: 3.8
+        python-version: 3.9
     - name: Install pypa/build
      run: python -m pip install build --user
    - name: Build binary wheel and source tarball

.github/workflows/verify-docs-gen.yml

Lines changed: 2 additions & 2 deletions

@@ -11,10 +11,10 @@ jobs:
     runs-on: ubuntu-latest
     steps:
       - uses: actions/checkout@v2
-      - name: Setup Python 3.8
+      - name: Setup Python 3.9
         uses: actions/setup-python@v2
         with:
-          python-version: '3.8'
+          python-version: '3.9'
       - name: Full dependencies
         run: |
           # up to k2 compatible torch version

docs/conf.py

Lines changed: 1 addition & 14 deletions

@@ -97,20 +97,6 @@ def run_apidoc(app):
     import better_apidoc

     better_apidoc.APP = app
-
-    better_apidoc.main(
-        [
-            "better-apidoc",
-            "-t",
-            "_apidoc_templates",
-            "--force",
-            "--no-toc",
-            "--separate",
-            "-o",
-            "API",
-            os.path.dirname(hyperpyyaml.__file__),
-        ]
-    )
     better_apidoc.main(
         [
             "better-apidoc",
@@ -122,6 +108,7 @@ def run_apidoc(app):
             "-o",
             "API",
             os.path.join("../", "speechbrain"),
+            os.path.dirname(hyperpyyaml.__file__),
         ]
     )

recipes/LibriSpeech/ASR/transducer/README.md

Lines changed: 12 additions & 0 deletions

@@ -64,6 +64,18 @@ may end up forming indirect dependencies to audio many seconds ago.
 | 4 | - | 3.12% | 3.13% | 3.37% | 3.51% | 3.80% |
 | 2 | - | 3.19% | 3.24% | 3.50% | 3.79% | 4.38% |

+### Inference
+
+Once your model is trained, a few manual steps are needed before you can use it with the high-level streaming interfaces (`speechbrain.inference.ASR.StreamingASR`):
+
+1. Create a new directory where you want to store the model.
+2. Copy `results/conformer_transducer/<seed>/lm.ckpt` (optional; rescoring LMs are currently unsupported for streaming) and `tokenizer.ckpt` to that directory.
+3. Copy `results/conformer_transducer/<seed>/save/CKPT+????/model.ckpt` and `normalizer.ckpt` to that directory.
+4. Copy your hyperparameters file to that directory. Uncomment the streaming-specific keys and remove any training-specific keys. Alternatively, grab the inference hyperparameters YAML for this model from HuggingFace and adapt it to any changes you may have made.
+5. You can now instantiate a `StreamingASR` for your model using `StreamingASR.from_hparams("/path/to/model/")`.
+
+The contents of that directory may be uploaded as a HuggingFace model, in which case the model source path can simply be specified as `youruser/yourmodel`.
+
 # **About SpeechBrain**
 - Website: https://speechbrain.github.io/
 - Code: https://github.com/speechbrain/speechbrain/
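The five numbered steps in the README addition above boil down to very little user code once the model directory exists. The sketch below illustrates the intended call shape with a toy stand-in class rather than the real `speechbrain.inference.ASR.StreamingASR` (so the yielded chunks and the exact `transcribe_file_streaming` signature are illustrative assumptions, not the actual API contract):

```python
from dataclasses import dataclass
from typing import Iterator

# Toy stand-in mimicking the call shape of StreamingASR; NOT the real
# speechbrain class. Real usage: StreamingASR.from_hparams("/path/to/model/")
# or a HuggingFace source such as "youruser/yourmodel".
@dataclass
class ToyStreamingASR:
    source: str

    @classmethod
    def from_hparams(cls, source: str) -> "ToyStreamingASR":
        # The real class would load hyperparams.yaml, model.ckpt,
        # tokenizer.ckpt and normalizer.ckpt from `source` here.
        return cls(source=source)

    def transcribe_file_streaming(self, path: str) -> Iterator[str]:
        # The real method decodes the file chunk by chunk and yields partial
        # transcripts; here we fake three chunks' worth of output.
        for piece in ["HELLO ", "WORLD ", "AGAIN"]:
            yield piece

asr = ToyStreamingASR.from_hparams("/path/to/model/")
transcript = "".join(asr.transcribe_file_streaming("test.wav"))
print(transcript)  # HELLO WORLD AGAIN
```

The point of the generator-style interface is that partial transcripts become available as soon as each chunk is decoded, instead of only after the whole file has been processed.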

recipes/LibriSpeech/ASR/transducer/hparams/conformer_transducer.yaml

Lines changed: 21 additions & 1 deletion

@@ -10,7 +10,7 @@

 # Seed needs to be set at top of yaml, before objects with parameters are made
 seed: 3407
-__set_seed: !!python/object/apply:torch.manual_seed [!ref <seed>]
+__set_seed: !apply:torch.manual_seed [!ref <seed>]
 output_folder: !ref results/conformer_transducer_large/<seed>
 output_wer_folder: !ref <output_folder>/
 save_folder: !ref <output_folder>/save
@@ -399,3 +399,23 @@ error_rate_computer: !name:speechbrain.utils.metric_stats.ErrorRateStats

 cer_computer: !name:speechbrain.utils.metric_stats.ErrorRateStats
     split_tokens: True
+
+# for the inference hparams, you will need to include and uncomment something like this:
+
+# make_tokenizer_streaming_context: !name:speechbrain.tokenizers.SentencePiece.SentencePieceDecoderStreamingContext
+# tokenizer_decode_streaming: !name:speechbrain.tokenizers.SentencePiece.spm_decode_preserve_leading_space
+
+# make_decoder_streaming_context: !name:speechbrain.decoders.transducer.TransducerGreedySearcherStreamingContext # default constructor
+# decoding_function: !name:speechbrain.decoders.transducer.TransducerBeamSearcher.transducer_greedy_decode_streaming
+#     - !ref <Greedysearcher> # self
+
+# fea_streaming_extractor: !new:speechbrain.lobes.features.StreamingFeatureWrapper
+#     module: !new:speechbrain.nnet.containers.LengthsCapableSequential
+#         - !ref <compute_features>
+#         - !ref <normalize>
+#         - !ref <CNN>
+#     # don't consider normalization as part of the input filter chain.
+#     # normalization will operate at chunk level, which mismatches training
+#     # somewhat, but does not appear to result in noticeable degradation.
+#     properties: !apply:speechbrain.utils.filter_analysis.stack_filter_properties
+#         - [!ref <compute_features>, !ref <CNN>]
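The commented-out `properties` key above stacks the filter properties (window size, stride) of `compute_features` and the `CNN` frontend, so the streaming wrapper knows how much raw audio one chunk of output frames depends on. The composition rule for two chained sliding-window filters is standard receptive-field arithmetic; here is a self-contained sketch of that rule (an illustration only, not the actual `speechbrain.utils.filter_analysis.stack_filter_properties` implementation):

```python
from dataclasses import dataclass

@dataclass
class FilterProps:
    """Window size and stride of a sliding-window filter, in input samples."""
    window_size: int
    stride: int

def stack(a: FilterProps, b: FilterProps) -> FilterProps:
    # When b consumes a's output: one output frame of b sees
    # (b.window_size - 1) strides of a plus one full window of a,
    # and advances by the product of both strides.
    return FilterProps(
        window_size=a.window_size + (b.window_size - 1) * a.stride,
        stride=a.stride * b.stride,
    )

# e.g. 25 ms / 10 ms fbanks at 16 kHz followed by a stride-2, width-3 conv:
fbank = FilterProps(window_size=400, stride=160)
conv = FilterProps(window_size=3, stride=2)
stacked = stack(fbank, conv)
print(stacked)  # FilterProps(window_size=720, stride=320)
```

This is also why the YAML comment excludes `normalize` from the stacked properties: per-chunk normalization has no window/stride footprint of its own, so only `compute_features` and the `CNN` contribute to the chunk-size computation.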

speechbrain/decoders/transducer.py

Lines changed: 35 additions & 0 deletions

@@ -5,7 +5,19 @@
 Sung-Lin Yeh 2020
 """
 import torch
+from dataclasses import dataclass
 from functools import partial
+from typing import Optional, Any
+
+
+@dataclass
+class TransducerGreedySearcherStreamingContext:
+    """Simple wrapper for the hidden state of the transducer greedy searcher.
+    Used by :meth:`~TransducerBeamSearcher.transducer_greedy_decode_streaming`.
+    """
+
+    hidden: Optional[Any] = None
+    """Hidden state; typically a tensor or a tuple of tensors."""


 class TransducerBeamSearcher(torch.nn.Module):
@@ -255,6 +267,29 @@ def transducer_greedy_decode(

         return ret

+    def transducer_greedy_decode_streaming(
+        self, x: torch.Tensor, context: TransducerGreedySearcherStreamingContext
+    ):
+        """Tiny wrapper for
+        :meth:`~TransducerBeamSearcher.transducer_greedy_decode` with an API
+        that makes it suitable to be passed as a `decoding_function` for
+        streaming.
+
+        Arguments
+        ---------
+        x : torch.Tensor
+            Outputs of the transcription network (equivalent to `tn_output`)
+        context : TransducerGreedySearcherStreamingContext
+            Mutable streaming context object, which must be specified and
+            reused across calls when streaming.
+            You can obtain an initial context by initializing a default object.
+        """
+        (hyp, _scores, _, _, hidden) = self.transducer_greedy_decode(
+            x, context.hidden, return_hidden=True
+        )
+        context.hidden = hidden
+        return hyp
+
     def transducer_beam_search_decode(self, tn_output):
         """Transducer beam search decoder is a beam search decoder over batch which applies the Transducer rules:
         1- for each utterance:
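The wrapper added in this diff follows a simple pattern: all decoder state lives in a small mutable context object that the caller creates once and passes back in on every chunk. A self-contained sketch of that pattern, with a toy running-sum "decoder" standing in for the actual transducer greedy search:

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class StreamingContext:
    # Mirrors TransducerGreedySearcherStreamingContext: the hidden state
    # starts as None and is threaded through successive chunk calls.
    hidden: Optional[int] = None

def decode_chunk(chunk: List[int], hidden: Optional[int]) -> Tuple[List[int], int]:
    # Stand-in for transducer_greedy_decode(..., return_hidden=True):
    # returns a "hypothesis" plus the updated hidden state.
    total = (hidden or 0) + sum(chunk)
    return [total], total

def decode_streaming(chunk: List[int], context: StreamingContext) -> List[int]:
    # Mirrors transducer_greedy_decode_streaming: call the stateless
    # decoder, stash the new hidden state in the context, return the hyp.
    hyp, hidden = decode_chunk(chunk, context.hidden)
    context.hidden = hidden
    return hyp

ctx = StreamingContext()
print(decode_streaming([1, 2], ctx))  # [3]
print(decode_streaming([4], ctx))     # [7] -- state carried across chunks
```

Because the context is just a default-constructible dataclass, hyperparameter files can produce fresh contexts via a bare `!name:` reference (the "default constructor" noted in the YAML), and resetting a stream is as simple as constructing a new context.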
