
Fix matlab file not dropping likelihood column if present from machine labels #3323

Open
C-Achard wants to merge 3 commits into main from cy/fix-matlab-column-num

Conversation

@C-Achard (Collaborator) commented May 11, 2026

Issue

The latest napari-deeplabcut version retains the likelihood columns when refining machine annotations: if present, they are written to the CollectedData h5 file.
The MATLAB (.mat) file creation function always assumed that only x and y are present and did not filter out likelihood, which caused dataset creation to fail.
Note: this raises an error in the terminal, but could cause the GUI to hang instead, as reported in #3319.

Related

Fix

Improves the robustness of the training data formatting by ensuring that any "likelihood" columns present in the input DataFrame are dropped before .mat formatting, and adds a corresponding test to verify this behavior.

Data formatting fixes:

  • Updated format_training_data in trainingsetmanipulation.py to automatically detect and remove "likelihood" columns from the DataFrame before processing, ensuring only "x" and "y" coordinates are used. Added validation to require both "x" and "y" columns and to check that the number of coordinate values per row is even.
  • Improved error handling for malformed data, raising clear exceptions when required coordinate columns are missing or when the data shape is unexpected.
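
A minimal sketch of the filtering and validation described above, assuming DLC-style MultiIndex columns with a level named `coords`. The helper name `keep_xy_columns` and the exact error messages are hypothetical and are not taken from the PR diff:

```python
import logging

import pandas as pd

logger = logging.getLogger(__name__)


def keep_xy_columns(df: pd.DataFrame, coord_level: str = "coords") -> pd.DataFrame:
    """Hypothetical helper: keep only the x/y columns of a label DataFrame."""
    if not isinstance(df.columns, pd.MultiIndex):
        raise ValueError("Expected MultiIndex columns with a coordinate level.")

    coords = df.columns.get_level_values(coord_level)
    if not {"x", "y"}.issubset(set(coords)):
        raise ValueError("Both 'x' and 'y' coordinate columns are required.")

    mask = coords.isin(["x", "y"])
    if not mask.all():
        # Assumption: dropped columns (e.g. "likelihood") are only logged.
        logger.info("Dropping non-coordinate columns: %s", sorted(set(coords[~mask])))

    filtered = df.loc[:, mask]
    if filtered.shape[1] % 2 != 0:
        raise ValueError("Expected an even number of coordinate values per row.")
    return filtered


# Example: one image, two body parts, likelihood stored after each y.
columns = pd.MultiIndex.from_product(
    [["scorer"], ["nose", "tail"], ["x", "y", "likelihood"]],
    names=["scorer", "bodyparts", "coords"],
)
df = pd.DataFrame(
    [[1.0, 2.0, 0.9, 3.0, 4.0, 0.8]], index=["img0.png"], columns=columns
)
xy = keep_xy_columns(df)
```

Here `xy` keeps four columns (x and y for each body part), so the rest of the formatting can treat each row as a flat sequence of coordinate pairs.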

Testing:

  • Added a new test test_format_training_data_ignores_likelihood_columns in test_trainingsetmanipulation.py to verify that the presence of "likelihood" columns does not affect the output of format_training_data.
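
The shape of such a test might look like the sketch below. It is self-contained, so it does not call the real `format_training_data`; `xy_values` is a hypothetical stand-in for the formatter under test, and `make_xy_frame` and `add_likelihood_columns` are illustrative helpers, not code from the PR:

```python
import numpy as np
import pandas as pd


def make_xy_frame() -> pd.DataFrame:
    """Build a tiny x/y-only label frame with DLC-style column levels."""
    columns = pd.MultiIndex.from_product(
        [["scorer"], ["nose", "tail"], ["x", "y"]],
        names=["scorer", "bodyparts", "coords"],
    )
    return pd.DataFrame(
        [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]],
        index=["img0.png", "img1.png"],
        columns=columns,
    )


def add_likelihood_columns(df_xy: pd.DataFrame, value: float = 0.9) -> pd.DataFrame:
    """Return a copy with a likelihood column inserted after each y column."""
    cols, arrays = [], []
    for col in df_xy.columns:
        cols.append(col)
        arrays.append(df_xy[col].to_numpy())
        if col[-1] == "y":
            cols.append(col[:-1] + ("likelihood",))
            arrays.append(np.full(len(df_xy), value))
    return pd.DataFrame(
        np.column_stack(arrays),
        index=df_xy.index,
        columns=pd.MultiIndex.from_tuples(cols, names=df_xy.columns.names),
    )


def xy_values(df: pd.DataFrame) -> np.ndarray:
    """Stand-in for the formatter under test: drop non-coordinate columns."""
    coords = df.columns.get_level_values("coords")
    return df.loc[:, coords.isin(["x", "y"])].to_numpy()


def test_likelihood_columns_are_ignored():
    baseline = make_xy_frame()
    with_likelihood = add_likelihood_columns(baseline)
    # Each (x, y) pair gained one likelihood column.
    assert with_likelihood.shape[1] == baseline.shape[1] * 3 // 2
    np.testing.assert_array_equal(xy_values(with_likelihood), xy_values(baseline))
```

Comparing against a baseline built from the original x/y-only frame keeps the assertion independent of the exact output layout.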

C-Achard added 2 commits May 11, 2026 10:39
Fix format_training_data to handle MultiIndex columns with likelihoods: detect coord level, keep only x/y columns (logging when likelihoods are dropped), and raise if x/y are missing. Use the row values to ensure an even number of coordinates after dropping non-coord columns (error if odd), reshape into (N,2), filter NaNs, and clip out-of-image joints. Also skip images without labels.
Add a unit test to tests/test_trainingsetmanipulation.py that verifies format_training_data ignores 'likelihood' columns when formatting training data. The test monkeypatches read_image_shape_fast, constructs a DataFrame with inserted likelihood columns after each y coordinate, and compares the formatted outputs (image, size, joints) against a baseline produced from the original x/y-only DataFrame to ensure identical results.
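
The reshaping step mentioned in the first commit can be sketched as follows. This is an illustration under stated assumptions: `row_to_joints` is a hypothetical name, the joint index column is added for readability, and "clip" is interpreted here as discarding joints that fall outside the image bounds, which may differ from the actual implementation:

```python
import numpy as np


def row_to_joints(values, image_width: int, image_height: int) -> np.ndarray:
    """Turn a flat row [x0, y0, x1, y1, ...] into rows of [joint_id, x, y]."""
    coords = np.asarray(values, dtype=float)
    if coords.size % 2 != 0:
        raise ValueError("Expected an even number of coordinate values per row.")

    xy = coords.reshape(-1, 2)                     # shape (N, 2)
    joints = np.column_stack([np.arange(len(xy)), xy])

    # Drop joints that were not labeled (NaN in x or y).
    joints = joints[np.isfinite(joints[:, 1:]).all(axis=1)]

    # Assumption: joints outside the image are discarded.
    inside = (
        (joints[:, 1] >= 0)
        & (joints[:, 1] < image_width)
        & (joints[:, 2] >= 0)
        & (joints[:, 2] < image_height)
    )
    return joints[inside]


# Joint 0 is valid, joint 1 is unlabeled, joint 2 lies outside a 640x480 image.
joints = row_to_joints(
    [10.0, 20.0, np.nan, np.nan, 700.0, 30.0], image_width=640, image_height=480
)
```

An image whose resulting array is empty has no usable labels and would be skipped, in line with the commit message.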
@C-Achard (Collaborator, Author) commented

Note: after discussion with @deruyter92, we decided not to retain likelihood columns in CollectedData, as was the case previously. See DeepLabCut/napari-deeplabcut#204.

@deruyter92 (Collaborator) commented

Should napari-deeplabcut always discard likelihood when saving to CollectedData?

For future reference we concluded yes for the following reasons:

  • Anything reviewed by a human labeler should be considered a ground-truth label.
  • Keeping likelihood in CollectedData gives the wrong impression that the data are uncertain or unreliable; these scores apply only to the NN-based predictions, not to human labels.
  • Since the likelihood scores for machine labels are also stored separately, there is no need to keep them here, and we can safely remove them from CollectedData.

Labels

bug fix! fix for a real buggy one...

Development

Successfully merging this pull request may close these issues.

Shuffle creation failure

2 participants