FIX HDBSCAN cluster_selection_epsilon TypeError with tied distances #33630

Open
gambletan wants to merge 1 commit into scikit-learn:main from gambletan:fix/hdbscan-cluster-selection-epsilon-33219
@gambletan

What does this PR do?

Fixes a TypeError raised by HDBSCAN when cluster_selection_epsilon is set and the input data contains tied distances.

Fixes #33219

Root cause

In traverse_upwards (sklearn/cluster/_hdbscan/_tree.pyx), two lookups on the condensed cluster tree use boolean indexing:

parent     = cluster_tree[cluster_tree['child'] == leaf]['parent']
parent_eps = 1 / cluster_tree[cluster_tree['child'] == parent]['value']

When tied distances exist in the data, the condensed tree can contain multiple rows with the same child id. In any case, a boolean-index lookup returns a 1-D array, never a scalar: shape (1,) for a single match, longer when rows are duplicated. Older NumPy versions implicitly converted a size-1 array to a scalar, which silently papered over this. NumPy 2.4+ removed that conversion, causing:

TypeError: only 0-dimensional arrays can be converted to Python scalars
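
The shape behaviour behind this error can be reproduced with NumPy alone. The snippet below is a minimal sketch on a toy structured array. The field names follow the lookups quoted above, but the dtype, the rows, and the variable names `single` and `duplicated` are assumptions made for illustration and are not taken from scikit-learn.

```python
import numpy as np

# Toy stand-in for the condensed cluster tree. Field names mirror the PR
# text; the dtype and the rows are invented for this illustration.
tree_dtype = np.dtype(
    [("parent", np.intp), ("child", np.intp), ("value", np.float64)]
)
cluster_tree = np.array(
    [
        (5, 6, 0.5),
        (5, 7, 0.5),
        (6, 8, 2.0),
        (6, 8, 2.0),  # duplicated child id, as tied distances can produce
    ],
    dtype=tree_dtype,
)

# Boolean indexing always yields a 1-D array, even for a single match.
single = cluster_tree[cluster_tree["child"] == 6]["parent"]
duplicated = cluster_tree[cluster_tree["child"] == 8]["parent"]

print(single.shape)      # one matching row
print(duplicated.shape)  # two matching rows

# Explicit [0] indexing yields a NumPy scalar in both cases.
print(single[0], duplicated[0])
```

Neither lookup result is 0-dimensional, so assigning either one where a scalar is expected relies on a conversion that, per the description above, NumPy 2.4+ no longer performs.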

Fix

Add explicit [0] indexing on both lookups. All duplicate rows share the same parent and lambda value, so taking the first element is correct:

-    parent = cluster_tree[cluster_tree['child'] == leaf]['parent']
+    parent = cluster_tree[cluster_tree['child'] == leaf]['parent'][0]
 ...
-    parent_eps = 1 / cluster_tree[cluster_tree['child'] == parent]['value']
+    parent_eps = 1 / cluster_tree[cluster_tree['child'] == parent]['value'][0]

This matches the behaviour of the original hdbscan package from which scikit-learn's implementation was ported.
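
To see the `[0]` lookups in context, here is a simplified pure-Python sketch of an upward walk over a toy tree. It is not the Cython code in `_tree.pyx`: the function name `walk_up`, the toy rows, and the omission of any single-cluster handling are assumptions made for illustration.

```python
import numpy as np

TREE_DTYPE = np.dtype(
    [("parent", np.intp), ("child", np.intp), ("value", np.float64)]
)

# Invented toy tree: node 5 is the root, child 8 is listed twice.
toy_tree = np.array(
    [
        (5, 6, 0.5),
        (5, 7, 0.5),
        (6, 8, 2.0),
        (6, 8, 2.0),
        (6, 9, 2.0),
        (8, 10, 4.0),
        (8, 11, 4.0),
    ],
    dtype=TREE_DTYPE,
)


def walk_up(cluster_tree, epsilon, leaf):
    """Hypothetical, simplified upward walk using explicit [0] lookups."""
    root = cluster_tree["parent"].min()
    node = leaf
    while True:
        # [0] turns the 1-D lookup result into a scalar; duplicate rows are
        # assumed to carry the same parent, as stated in the PR.
        parent = cluster_tree[cluster_tree["child"] == node]["parent"][0]
        if parent == root:
            return node  # stop just below the root
        parent_eps = 1 / cluster_tree[cluster_tree["child"] == parent]["value"][0]
        if parent_eps > epsilon:
            return parent
        node = parent


print(walk_up(toy_tree, 0.4, 10))
print(walk_up(toy_tree, 1.0, 10))
```

The sketch walks from a leaf towards the root and stops at the first ancestor whose `1 / value` exceeds the threshold. With the `[0]` in place, the duplicated rows for child 8 do not affect the result.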

Tests

Added test_hdbscan_cluster_selection_epsilon_tied_distances in sklearn/cluster/tests/test_hdbscan.py using the exact precomputed distance matrix from the issue report. The test verifies:

  • No TypeError is raised
  • All non-outlier points are assigned to a single cluster

Full HDBSCAN test suite: 149 passed, 0 failed.

Note on triage

scikit-learn member betatim confirmed the fix approach in the issue thread and provided the exact diff. This PR implements that fix and adds a non-regression test.

In `traverse_upwards` (_tree.pyx), boolean-index lookups on the
condensed cluster tree can return a shape-(1,) array instead of a
scalar when tied distances cause duplicate child entries.  NumPy 2.4+
removed the silent array-to-scalar conversion, surfacing this as:

    TypeError: only 0-dimensional arrays can be converted to Python scalars

Fix: add explicit `[0]` indexing on both the `parent` and `parent_eps`
lookups so that the first (and logically equivalent) element is taken.

Adds a non-regression test using the precomputed distance matrix from
the issue report, which reliably triggers tied-distance rows in the
condensed tree.

Fixes scikit-learn#33219

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Development

Successfully merging this pull request may close these issues.

HDBSCAN fails when using cluster_selection_epsilon

1 participant