[espnet3-14.3] Add integration test for ASR#6392
[espnet3-14.3] Add integration test for ASR#6392Masao-Someki wants to merge 48 commits intoespnet:masterfrom
Conversation
There was a problem hiding this comment.
Code Review
This pull request adds integration tests for ASR in espnet3, introducing a new recipe structure with template configurations and a more robust optimization and inference pipeline. The changes are extensive and well-structured, improving flexibility and maintainability. A key improvement is the refactoring of the optimizer and scheduler handling in the LightningModule, moving to a manual optimization loop for multi-optimizer setups which is cleaner and more aligned with PyTorch Lightning's practices. The inference pipeline is also enhanced with better artifact handling. However, I found a significant performance issue in the new CI script.
| --training_config conf/training.yaml \ | ||
| --inference_config "${inference_config}" \ | ||
| --metrics_config conf/metrics.yaml | ||
| rm -rf exp data |
There was a problem hiding this comment.
The rm -rf exp data command inside the run_with_training_config function, which is called in a loop, will cause the dataset to be downloaded and prepared repeatedly for each training configuration. This is highly inefficient and will significantly slow down the CI process. The create_dataset and train_tokenizer stages are idempotent and can be run just once before the loop. To fix this, you should remove data from this rm command to persist the prepared data across test runs within the same CI job. The exp directory should still be cleaned up to ensure a fresh state for each training run.
| rm -rf exp data | |
| rm -rf exp |
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #6392 +/- ##
==========================================
+ Coverage 70.16% 70.25% +0.08%
==========================================
Files 787 791 +4
Lines 73367 73606 +239
==========================================
+ Hits 51477 51710 +233
- Misses 21890 21896 +6
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
…espnet into espnet3/integration_test_3
sw005320
left a comment
There was a problem hiding this comment.
There are many dataset-dependent names or paths in the config file.
Please avoid them as much as possible.
You should make them under the recipe directory and avoid specifying the dataset names.
I think the tokenizer part is well-designed.
Please follow this way
| exp_tag: ${load_yaml:training.yaml,exp_tag} | ||
| inference_dir: ${exp_dir}/inference | ||
| dataset_dir: ${data_dir}/mini_an4 |
There was a problem hiding this comment.
This would be an issue as we need to specify the dataset name mini_an4, and it would cause some errors.
How about creating the dataset_dir under egs3/mini_anr so that we don't need to specify it?
Is there another way to avoid it?
| create_dataset: | ||
| func: src.creating_dataset.create_dataset | ||
| dataset_dir: ${dataset_dir} | ||
| archive_path: ${recipe_dir}/../../egs2/mini_an4/asr1/downloads.tar.gz |
There was a problem hiding this comment.
Ditto
Instead of specifying the data in egs2, it is better to put it under the recipe directory.
Also, please avoid using egs2 files.
Make espnet3 recipes independent as much as possible.
| task: espnet3.systems.asr.task.ASRTask | ||
|
|
||
| exp_tag: train_asr_rnn_debug | ||
| dataset_dir: ${data_dir}/mini_an4 |
| ########################################################## | ||
| task: espnet3.systems.asr.task.ASRTask | ||
|
|
||
| exp_tag: train_asr_rnn_debug |
There was a problem hiding this comment.
Why _debug?
Please remove unnecessary names and avoid confusions
| train: | ||
| - name: train_nodev | ||
| dataset: | ||
| _target_: src.dataset.MiniAN4Dataset |
There was a problem hiding this comment.
ditto
This would bring some issues as we need to specify the dataset name
| create_dataset: | ||
| func: src.creating_dataset.create_dataset | ||
| dataset_dir: ${dataset_dir} | ||
| archive_path: ${recipe_dir}/../../egs2/mini_an4/asr1/downloads.tar.gz |
| train: | ||
| - name: train_nodev | ||
| dataset: | ||
| _target_: src.dataset.MiniAN4Dataset |
| valid: | ||
| - name: train_dev | ||
| dataset: | ||
| _target_: src.dataset.MiniAN4Dataset |
| dataset_dir = Path(dataset_dir) | ||
| logger = setup_logger(name="mini_an4.create_dataset") | ||
|
|
||
| archive = Path("../../../egs2/mini_an4/asr1/downloads.tar.gz") |
There was a problem hiding this comment.
avoid to specify the espnet2 directory
There was a problem hiding this comment.
This seems to be dataset independent.
It would be better to move this to the recipe common place (e.g., template)
…3/integration_test_3
…espnet into espnet3/integration_test_3
for more information, see https://pre-commit.ci
…espnet into espnet3/integration_test_3
…espnet into espnet3/integration_test_3
|
Can you even remove |
|
The change in c78db55 removes the Instead, experiment identity is handled at runner setup time:
Motivation
In the CI issue, the transducer failure was a concrete example of this.. What changed
|
|
@sw005320 If this looks okay to you, could you merge this PR? We can move to the next (3.14.5) bug fix PR! |
…3/integration_test_3
…espnet into espnet3/integration_test_3
for more information, see https://pre-commit.ci
…espnet into espnet3/integration_test_3
|
I have fixed the config name. I will handle other config-related items in the following PRs! |
| python-version: ["3.10", "3.12"] | ||
| pytorch-version: [2.9.1, 2.10.0, 2.11.0] | ||
| use-conda: [false] | ||
| chainer-version: [6.0.0] |
There was a problem hiding this comment.
Thanks for the comment!
chainer-version was not actually used in the espnet3's CI workflow, so I removed it in 2e58522
| dataset: | ||
| test: | ||
| - name: test | ||
| data_src_args: |
There was a problem hiding this comment.
Do you need args?
data_src or data_sources would be good enough?
Also, it is better to add similar explanation comments to the training data
##########################################################
DATASET DEFINITION
##########################################################
Dataset splits:
train:
- split="train" from the dataset.dataset.Minian4Dataset
valid:
- split="valid" from the dataset.dataset.Minian4Dataset
Notes:
- actual data sources are defined via ${recipe_dir}
There was a problem hiding this comment.
Thank you, and yes, I think it is good to have data_src_args.
This would be passed directly to the dataset class. For example, suppose we have a dataset class like:
class MyDataset(Dataset):
def __init__(self, split="train",):
self.data = self._build_data(split)
...
Then we can feed the data_src_args.split to the __init__().
We could remove data_src_args, but then we would need additional code to manually remove (pop) data_src from the config, and then feed the rest configs to __init__().
|
Please check my other comments |
…espnet into espnet3/integration_test_3
| # - uses the "test" split defined in the dataset.dataset.Minian4Dataset | ||
| dataset: | ||
| test: | ||
| - name: test |
There was a problem hiding this comment.
I believe this should also include valid
There was a problem hiding this comment.
@Masao-Someki, please check this comment
We usually do not have only the test data.
If it is intentional to reduce the computation cost, that is valid (but I think we do not have so many additions for the computation time)
| ########################################################## | ||
| # PATHS AND INPUTS # | ||
| ########################################################## | ||
| inference_dir: ${exp_dir}/inference |
There was a problem hiding this comment.
Where do we specify the data?
Is it included in this directory specification?
I think it should be commented.
There was a problem hiding this comment.
We can skip RNN related integration test.
This is not used anymore.
There was a problem hiding this comment.
ditto
We can skip RNN related integration test.
This is not used anymore.
There was a problem hiding this comment.
In this case, the data_aug part can be moved to the other config
| # - name: librispeech_test_clean | ||
| # data_src: librispeech/asr | ||
| # data_src_args: | ||
| # split: test_clean | ||
| # data_path: ${dataset_dir} | ||
| # - name: librispeech_test_other | ||
| # data_src: librispeech/asr | ||
| # data_src_args: | ||
| # split: test_other | ||
| # data_path: ${dataset_dir} |
There was a problem hiding this comment.
The dev sets should be included
There was a problem hiding this comment.
@Masao-Someki, please check this comment
We usually do not have only the test data.
If it is intentional to reduce the computation cost, that is valid (but I think we do not have so many additions for the computation time)
| - data_src: mini_an4/asr | ||
| data_src_args: |
There was a problem hiding this comment.
These lines look redundant.
Do we really need data_src_args:?
What did you change?
Added ASR integration test
Why did you make this change?
To test espnet3 pipeline
Is your PR small enough?
No, 28 files change and 2,069 addition.
Except for the following edits, the reviews are done in #6331.
Additional Context