Started to add a diarization recipe for ESPnet3 based on DiariZen #6364

Draft
popcornell wants to merge 120 commits into espnet:master from popcornell:espnet3/diarization

@popcornell popcornell commented Feb 12, 2026

@Masao-Someki I will change the base branch once you merge

This PR adds a diarization recipe for DiariZen-style diarization but built with ESPnet legacy components:

  • WavLM or XEUS as front-end
  • Any ESPnet2 speaker embedding model

The architecture just follows DiariZen [1].

Basically it is EEND-VC/Pyannote style:

  1. We use a fixed-length chunk (e.g. 20 seconds) and run powerset EEND within each chunk.
  2. We use the predicted activities to extract a speaker embedding for each speaker in each chunk, then cluster these embeddings to reassign global speaker IDs.
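
As a rough illustration of the "powerset" part of step 1, the sketch below enumerates powerset classes and maps per-frame class predictions back to per-speaker activities. The limits used here (4 local speakers, at most 2 overlapping) and the function names are assumptions for illustration only, not values or APIs taken from this PR.

```python
from itertools import combinations


def build_powerset_classes(num_speakers, max_overlap):
    """Enumerate every speaker subset with at most `max_overlap` members.

    Class 0 is the empty set (silence); single-speaker classes come next,
    followed by the overlapping-speaker classes.
    """
    classes = []
    for size in range(max_overlap + 1):
        classes.extend(combinations(range(num_speakers), size))
    return classes


def powerset_to_multilabel(class_ids, classes, num_speakers):
    """Convert per-frame powerset class indices to 0/1 speaker activities."""
    frames = []
    for class_id in class_ids:
        row = [0] * num_speakers
        for speaker in classes[class_id]:
            row[speaker] = 1
        frames.append(row)
    return frames


# Hypothetical setting: 4 local speakers per chunk, at most 2 overlapping.
classes = build_powerset_classes(num_speakers=4, max_overlap=2)
activities = powerset_to_multilabel([0, 1, 5], classes, num_speakers=4)
print(len(classes))
print(activities)
```

The resulting per-speaker activities are what step 2 would use to decide which frames feed each local speaker's embedding; the clustering step itself is not shown here.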

[1] Han, Jiangyu, et al. "Leveraging self-supervised learning for speaker diarization." ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2025.

Masao-Someki and others added 30 commits December 26, 2025 11:51
- This is to avoid using egs folder
- Assume hypothesis to be "" when hypothesis is blank
- Previously we asked developers to create a user-defined model, but this is now supported by default.
- Users can set `val_scheduler_criterion` as espnet2 to use this function.
- supported train/valid switching for preprocessor
- Add new default resolver to load external config file
1. Added the Python version as metadata.
2. Added a flag to generate requirements.txt for experiment-level environment logging.
3. Added a log rotation function for cases where the log file already exists (in espnet2, this was previously handled by a Perl script).

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a significant set of features for ESPnet3, including a new diarization recipe based on DiariZen, extensive documentation, and a new framework for creating and deploying demos with Gradio. The changes also include substantial improvements to the core infrastructure, such as enhanced logging, configuration handling, and a more robust parallel execution mechanism. Overall, the code is well-structured and the new features are a great addition. I've identified a few high-severity issues related to hardcoded paths in configuration files and an unimplemented feature that is set as default, which could cause recipes to fail. Addressing these will improve the robustness and usability of the new recipes.

debug_configs=train_asr_transformer_debug.yaml

echo "==== [ESPnet3] ASR Demo pack ===="
python -m pip install -e '.[asr]'

high

The script uses python directly here, but ${python} in other places (lines 32, 36). This inconsistency can lead to using the system's default Python interpreter instead of the one from the activated virtual environment, potentially causing the CI job to fail. For consistency and to ensure the correct interpreter is used, ${python} should be used here as well.

Suggested change
- python -m pip install -e '.[asr]'
+ ${python} -m pip install -e '.[asr]'

num_nodes: 1

# Path scaffold
recipe_dir: /Users/samco/Projects/ESPnet3/espnet/egs3/ami_diar/diar

high

The recipe_dir is hardcoded to a local user path. This will cause the recipe to fail for any other user or in any other environment (e.g., CI). This path should be made relative or be determined at runtime, similar to how it's handled in other configuration files where it's commented as being set automatically by run.py.

recipe_dir: .
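
One way to determine the path at runtime is sketched below: the entry script derives `recipe_dir` from its own location and overrides whatever the config file contains. This is only an assumption about how such a hook could look; the helper name and the location of `run.py` are hypothetical and not taken from this PR.

```python
from pathlib import Path


def resolve_recipe_dir(config, run_script):
    """Return a copy of `config` with recipe_dir set to the script's directory.

    `run_script` would typically be `__file__` of the recipe's entry script.
    """
    resolved = dict(config)
    resolved["recipe_dir"] = str(Path(run_script).resolve().parent)
    return resolved


# Hypothetical usage: the config ships with a relative placeholder.
config = {"num_nodes": 1, "recipe_dir": "."}
resolved = resolve_recipe_dir(config, "egs3/ami_diar/diar/run.py")
print(resolved["recipe_dir"])
```

With this approach the committed YAML never needs to contain a machine-specific absolute path.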

Comment on lines +326 to +342

        labels = clustering.fit_predict(embeddings)
        return labels

    def _cluster_vbx(
        self,
        embeddings: np.ndarray,
        num_speakers: int,
    ) -> np.ndarray:
        """Variational Bayes clustering (VBx).

        Note: This is a placeholder. For full VBx implementation,
        integrate with VBDiarization library or similar.

        Args:
            embeddings: (num_speakers, embedding_dim)
            num_speakers: Target number of clusters

high

The _cluster_vbx method raises a NotImplementedError, but the default configuration in egs3/ami_diar/diar/conf/inference.yaml and egs3/ami_diar/diar/conf/tuning/train_xeus_conformer_powerset.yaml sets clustering_backend: vbx. This will cause inference to fail with the default settings. The default in the config files should be changed to a supported backend like ahc, or this method should be implemented.
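
Independently of which fix is chosen, a config check at load time would surface the problem before inference starts. The sketch below is a minimal illustration, assuming that `ahc` is the only implemented backend and that the config is available as a plain dict; the function and constant names are hypothetical.

```python
# Assumption for illustration: only "ahc" is implemented, "vbx" is a placeholder.
KNOWN_BACKENDS = {"ahc", "vbx"}
IMPLEMENTED_BACKENDS = {"ahc"}


def validate_clustering_backend(config):
    """Fail fast if the configured clustering backend cannot be used."""
    backend = config.get("clustering_backend", "ahc")
    if backend not in KNOWN_BACKENDS:
        raise ValueError(f"Unknown clustering_backend: {backend!r}")
    if backend not in IMPLEMENTED_BACKENDS:
        raise NotImplementedError(
            f"clustering_backend {backend!r} is a placeholder; "
            f"choose one of {sorted(IMPLEMENTED_BACKENDS)}"
        )
    return backend


print(validate_clustering_backend({"clustering_backend": "ahc"}))
```

Calling the same function with `clustering_backend: vbx` raises `NotImplementedError` immediately, rather than after the EEND and embedding stages have already run.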

exp_dir: ${recipe_dir}/exp/${exp_tag}
stats_dir: ${recipe_dir}/exp/stats
decode_dir: ${exp_dir}/decode
dataset_dir: /path/to/LibriSpeech

high

The dataset_dir is hardcoded to /path/to/LibriSpeech. This will cause the recipe to fail on any machine where the dataset is not at this exact location. It's better to use a placeholder or an environment variable so that users can easily configure the path. For example, you could use an OmegaConf interpolation like ${oc.env:LIBRISPEECH,/path/to/LibriSpeech} to use an environment variable with a fallback.

dataset_dir: /path/to/your/LibriSpeech  # Or better, use an environment variable
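
For readers unfamiliar with the interpolation, the plain-Python sketch below mirrors the lookup that `${oc.env:LIBRISPEECH,/path/to/LibriSpeech}` is expected to perform: use the environment variable if it is set, otherwise fall back to the default. The helper name is hypothetical and the code does not call OmegaConf itself.

```python
import os


def resolve_dataset_dir(env_var="LIBRISPEECH", default="/path/to/LibriSpeech", environ=None):
    """Return the dataset path from the environment, or the fallback default."""
    environ = os.environ if environ is None else environ
    return environ.get(env_var, default)


# Explicit dicts stand in for the process environment in this example.
from_env = resolve_dataset_dir(environ={"LIBRISPEECH": "/data/LibriSpeech"})
fallback = resolve_dataset_dir(environ={})
print(from_env)
print(fallback)
```

A user would then run the recipe with `LIBRISPEECH` exported in the shell instead of editing the committed config.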

Comment on lines +1 to +17
"""Inference output helpers for ASR recipes."""


def output_fn(*, data, model_output, idx):
    """Build a dict of outputs for SCP writing."""
    uttid = data.get("uttid", str(idx))
    hyp = model_output[0][0]
    ref = data.get("text", "")
    return {"uttid": uttid, "hyp": hyp, "ref": ref}


def output_fn_transducer(*, data, model_output, idx):
    """Build a dict of outputs for transducer models."""
    uttid = data.get("uttid", str(idx))
    hyp = model_output[0]
    ref = data.get("text", "")
    return {"uttid": uttid, "hyp": hyp, "ref": ref}

high

This file appears to be a duplicate of egs3/mini_an4/asr/src/infer.py. Having duplicate code increases maintenance overhead and can lead to inconsistencies. It would be better to consolidate them into a single file and have all dependent configurations point to it.

mergify Bot commented Feb 17, 2026

This pull request is now in conflict :(

@mergify mergify Bot added the conflicts label Feb 17, 2026
@Fhrozen Fhrozen added this to the v.202601 milestone Feb 22, 2026
@Fhrozen Fhrozen modified the milestones: v.202604, v.202607 Apr 7, 2026