FIX HDBSCAN cluster_selection_epsilon TypeError with tied distances #33630
Open
gambletan wants to merge 1 commit into scikit-learn:main from
In `traverse_upwards` (_tree.pyx), boolean-index lookups on the
condensed cluster tree can return a shape-(1,) array instead of a
scalar when tied distances cause duplicate child entries. NumPy 2.4+
removed the silent array-to-scalar conversion, surfacing this as:
TypeError: only 0-dimensional arrays can be converted to Python scalars
Fix: add explicit `[0]` indexing on both the `parent` and `parent_eps`
lookups so that the first (and logically equivalent) element is taken.
Adds a non-regression test using the precomputed distance matrix from
the issue report, which reliably triggers tied-distance rows in the
condensed tree.
Fixes scikit-learn#33219
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
What does this PR do?
Fixes a `TypeError` raised by `HDBSCAN` when `cluster_selection_epsilon` is set and the input data contains tied distances.
Fixes #33219
Root cause
In `traverse_upwards` (`sklearn/cluster/_hdbscan/_tree.pyx`), two lookups on the condensed cluster tree use boolean indexing. When tied distances exist in the data, the condensed tree can contain multiple rows with the same `child` id. The boolean-index lookups therefore return a shape-`(1,)` array instead of a scalar. NumPy 2.4+ removed the implicit array-to-scalar conversion that had silently papered over this, causing:
TypeError: only 0-dimensional arrays can be converted to Python scalars
Fix
Add explicit `[0]` indexing on both lookups. All duplicate rows share the same parent and lambda value, so taking the first element is correct. This matches the behaviour of the original `hdbscan` package from which scikit-learn's implementation was ported.
Tests
Added `test_hdbscan_cluster_selection_epsilon_tied_distances` in `sklearn/cluster/tests/test_hdbscan.py` using the exact precomputed distance matrix from the issue report. The test verifies that no `TypeError` is raised.
Full HDBSCAN test suite: 149 passed, 0 failed.
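For reviewers who want to see the indexing behaviour in isolation, here is a minimal sketch using a toy structured array. It is not the code from `_tree.pyx`: the dtype, the field names (`parent`, `child`, `value`), and the helpers `lookup_parent` and `lookup_value` are assumptions made up for illustration. It only shows that a boolean-mask lookup yields a 1-D array even for a single match, and that explicit `[0]` indexing gives a scalar that converts cleanly regardless of NumPy version.

```python
import numpy as np

# Toy stand-in for a condensed tree. The dtype and field names are
# illustrative assumptions, not scikit-learn's actual definitions.
toy_dtype = np.dtype(
    [("parent", np.intp), ("child", np.intp), ("value", np.float64)]
)
toy_tree = np.array(
    [(5, 3, 0.5), (5, 4, 0.5), (6, 5, 0.25)],
    dtype=toy_dtype,
)


def lookup_parent(tree, node):
    """Hypothetical helper: parent of `node`, taken with explicit [0]."""
    matches = tree["parent"][tree["child"] == node]  # always a 1-D array
    return int(matches[0])


def lookup_value(tree, node):
    """Hypothetical helper: value stored on the row whose child is `node`."""
    matches = tree["value"][tree["child"] == node]  # always a 1-D array
    return float(matches[0])


# A boolean-mask lookup with one matching row is still an array, not a scalar.
mask_result = toy_tree["parent"][toy_tree["child"] == 3]
print(mask_result.shape)           # (1,)
print(lookup_parent(toy_tree, 3))  # 5
print(lookup_value(toy_tree, 5))   # 0.25
```

The sketch deliberately avoids calling `int()` or `float()` directly on the mask result, since that conversion is what depends on the installed NumPy version.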
Note on triage
scikit-learn member `betatim` confirmed the fix approach in the issue thread and provided the exact diff, so this PR implements that confirmed fix with a tested, clean implementation.