Improve TrainingDatasetMetadata and get_shuffle_engine for incomplete projects#3313

Open: deruyter92 wants to merge 5 commits into main from jaap/improve_training_dataset_metadata

Collaborator

deruyter92 commented May 6, 2026

Motivation
See #3312: get_shuffle_engine is currently broken for projects whose shuffles are not stored in the metadata under training-datasets. Making this metadata file the single source of truth can be a valid choice, but it currently breaks backwards compatibility: in older DLC versions, a model folder alone was sufficient for inference.

Proposed changes:
The current PR aims to implement small changes that

  • allow get_shuffle_engine to fall back to inferring the engine from the model folder. (This fixes the currently broken analyze_videos for projects that only have a model folder, but no training dataset)
  • improve error messages, so users can tell whether their project is incomplete or whether they simply specified a non-existent shuffle.
  • don't write an empty metadata file when the metadata was not found in the first place (this made the state of the project even more corrupt than it was)
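
To illustrate the fallback described above, here is a minimal sketch, not the code in this PR. It assumes that model folders live under `dlc-models` (TensorFlow) or `dlc-models-pytorch` (PyTorch) and that each shuffle's folder name ends in `shuffle<index>`. The `Engine` enum and `infer_engine_from_model_folders` are hypothetical names used only for this example.

```python
import re
from enum import Enum
from pathlib import Path


class Engine(Enum):
    """Hypothetical stand-in for the engine enum; values are assumed folder names."""

    TF = "dlc-models"
    PYTORCH = "dlc-models-pytorch"


def infer_engine_from_model_folders(project_dir, shuffle):
    """Guess the engine for `shuffle` by looking for a matching model folder."""
    # Assumed naming convention: a shuffle's model folder ends in "shuffle<index>".
    pattern = re.compile(rf"shuffle{int(shuffle)}$")
    candidates = []
    for engine in Engine:
        root = Path(project_dir) / engine.value
        if not root.is_dir():
            continue
        # Search all nested directories, e.g. iteration-0/<name>shuffle3.
        if any(p.is_dir() and pattern.search(p.name) for p in root.rglob("*")):
            candidates.append(engine)
    if len(candidates) == 1:
        return candidates[0]
    if not candidates:
        raise ValueError(
            f"No model folder found for shuffle {shuffle} in {project_dir}."
        )
    names = ", ".join(e.name for e in candidates)
    raise ValueError(
        f"Shuffle {shuffle} has model folders for several engines: {names}."
    )
```

Raising when more than one engine matches is a design choice of this sketch: guessing silently between engines would hide exactly the kind of ambiguity the improved error messages are meant to expose.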

Note that more refactors could be implemented to guarantee long-term maintainability of this code (e.g. separate responsibilities and clear contracts - again see #3312) but the scope of the current PR is just some easy wins.

  • improve error messages in TrainingDatasetMetadata
  • implement fallback for get_shuffle_engine based on folder structure
  • add tests

deruyter92 added 2 commits May 6, 2026 18:02
This commit addresses two issues:
- trainset_index can be out of bounds for TrainingFraction. That is an IndexError, but it surfaces as a confusing message about fractions.
- The shuffle is not in the metadata, but you have no idea whether that is because the metadata is empty or because the index is simply wrong.

More informative messages make the problem easier to diagnose.
- Don't save empty TrainingDatasetMetadata if not found
- Fallback: search in model folders for same shuffle index if metadata not found
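
The two failure modes named in the first commit message can be told apart with separate checks. The following is a rough sketch under assumptions, not the actual `TrainingDatasetMetadata` code: `lookup_shuffle` is a hypothetical helper, and the metadata is modelled as a plain dict keyed by `(train_fraction, shuffle_index)`.

```python
def lookup_shuffle(training_fractions, shuffles, trainset_index, index):
    """Return the entry for a shuffle, with a distinct error per failure mode.

    `shuffles` is a hypothetical dict mapping (train_fraction, shuffle_index)
    to whatever is stored for that shuffle (here, an engine name).
    """
    n = len(training_fractions)
    # Case 1: the trainset index itself is invalid.
    if not 0 <= trainset_index < n:
        raise IndexError(
            f"trainset_index={trainset_index} is out of bounds: TrainingFraction "
            f"has {n} entries ({training_fractions})."
        )
    fraction = training_fractions[trainset_index]
    # Case 2: the metadata holds no shuffles at all (incomplete project).
    if not shuffles:
        raise ValueError(
            "The metadata contains no shuffles; the project may be incomplete."
        )
    # Case 3: the metadata is populated, but this shuffle is not in it.
    key = (fraction, index)
    if key not in shuffles:
        known = ", ".join(
            f"(fraction={f}, shuffle={i})" for f, i in sorted(shuffles)
        )
        raise ValueError(
            f"Shuffle {index} with fraction {fraction} not found. Known: {known}"
        )
    return shuffles[key]
```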
Contributor

Copilot AI left a comment

Pull request overview

This PR improves robustness of training-dataset metadata handling so inference can work on “model-folder-only” / incomplete DeepLabCut projects, and provides clearer diagnostics when metadata or shuffle selection is invalid.

Changes:

  • Improve TrainingDatasetMetadata.get() error reporting (out-of-bounds trainset index, and listing known shuffles).
  • Avoid writing an empty metadata.yaml when no shuffles can be discovered.
  • Update get_shuffle_engine() to fall back to inferring engine from model folder structure when metadata is missing/incomplete.
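
The second change in the list, not writing an empty metadata file, amounts to a guard before the write. This is a simplified sketch and not the PR's implementation: `save_metadata_if_any` is a hypothetical helper, and it serializes to JSON only to stay within the standard library, whereas the real file is `metadata.yaml`.

```python
import json
from pathlib import Path


def save_metadata_if_any(path, shuffles):
    """Write the metadata file only when at least one shuffle was discovered.

    Returns True if a file was written, False if the write was skipped.
    """
    if not shuffles:
        # Nothing was discovered: leave the project untouched rather than
        # creating an empty file that later looks like valid metadata.
        return False
    # JSON is a stand-in here; the project's actual format is YAML.
    Path(path).write_text(json.dumps({"shuffles": shuffles}, indent=2))
    return True
```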

Comment threads on deeplabcut/generate_training_dataset/metadata.py: 5 (2 outdated)
deruyter92 marked this pull request as ready for review May 12, 2026 10:37
deruyter92 requested a review from C-Achard May 12, 2026 10:37
Collaborator

@C-Achard C-Achard left a comment

Very nice, I like the fact that this prevents writing partial metadata.

Labels

config (related to config.yaml, ruamel, YAML parsing, ...), pytorch

3 participants