FIX HDBSCAN cluster_selection_epsilon TypeError with tied distances by gambletan · Pull Request #33630 · scikit-learn/scikit-learn

gambletan · 2026-03-25T09:11:16Z

What does this PR do?

Fixes a TypeError raised by HDBSCAN when cluster_selection_epsilon is set and the input data contains tied distances.

Fixes #33219

Root cause

In traverse_upwards (sklearn/cluster/_hdbscan/_tree.pyx), two lookups on the condensed cluster tree use boolean indexing:

parent     = cluster_tree[cluster_tree['child'] == leaf]['parent']
parent_eps = 1 / cluster_tree[cluster_tree['child'] == parent]['value']

When tied distances exist in the data, the condensed tree can contain multiple rows with the same child id. The boolean-index lookups return a 1-D array (shape (1,) when a single row matches) instead of a scalar. NumPy 2.4 removed the implicit array-to-scalar conversion that had silently papered over this, causing:

TypeError: only 0-dimensional arrays can be converted to Python scalars
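The array-versus-scalar distinction can be reproduced outside scikit-learn. The sketch below is only an illustration: the structured dtype and the toy rows are assumptions standing in for the real condensed tree, and only the field names `parent`, `child` and `value` come from the lookups quoted above.

```python
import numpy as np

# Hypothetical stand-in for the condensed cluster tree: a structured array
# with the 'parent', 'child' and 'value' fields used by the lookups above.
TREE_DTYPE = np.dtype(
    [("parent", np.intp), ("child", np.intp), ("value", np.float64)]
)

cluster_tree = np.array(
    [(5, 6, 0.5), (5, 7, 0.5), (6, 8, 1.0), (6, 9, 1.0)],
    dtype=TREE_DTYPE,
)

leaf = 8
lookup = cluster_tree[cluster_tree["child"] == leaf]["parent"]

# Boolean-mask indexing always yields a 1-D array, even for a single match.
print(type(lookup).__name__, lookup.shape, lookup.ndim)  # ndarray (1,) 1
```

Assigning `lookup` to a scalar-typed variable is the step that relies on the implicit conversion described above.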

Fix

Add explicit [0] indexing on both lookups. All duplicate rows share the same parent and lambda value, so taking the first element is correct:

-    parent = cluster_tree[cluster_tree['child'] == leaf]['parent']
+    parent = cluster_tree[cluster_tree['child'] == leaf]['parent'][0]
 ...
-    parent_eps = 1 / cluster_tree[cluster_tree['child'] == parent]['value']
+    parent_eps = 1 / cluster_tree[cluster_tree['child'] == parent]['value'][0]

This matches the behaviour of the original hdbscan package from which scikit-learn's implementation was ported.
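As a rough illustration of why `[0]` is safe when a child id is duplicated, the following sketch applies the two patched lookups to a toy tree. The dtype, the rows and the helper `lookup_parent_and_eps` are hypothetical and not part of scikit-learn. The helper does not handle the root of the tree, which has no row of its own.

```python
import numpy as np

# Hypothetical stand-in for the condensed tree (field names follow the PR text).
TREE_DTYPE = np.dtype(
    [("parent", np.intp), ("child", np.intp), ("value", np.float64)]
)

# Child 6 appears twice, as the PR describes for tied distances;
# both rows carry the same parent and lambda value.
cluster_tree = np.array(
    [(5, 6, 0.5), (5, 6, 0.5), (6, 8, 1.0)],
    dtype=TREE_DTYPE,
)


def lookup_parent_and_eps(tree, leaf):
    """Illustrative helper mirroring the two patched lookups."""
    parent = tree[tree["child"] == leaf]["parent"][0]
    parent_eps = 1 / tree[tree["child"] == parent]["value"][0]
    return parent, parent_eps


parent, parent_eps = lookup_parent_and_eps(cluster_tree, leaf=8)
dup_rows = cluster_tree[cluster_tree["child"] == parent]

# Both results are 0-dimensional scalars; the duplicate rows agree on 'value'.
print(int(parent), float(parent_eps), len(dup_rows))  # 6 2.0 2
```

Because the duplicate rows agree on `value`, picking index 0 or index 1 gives the same `parent_eps` in this toy case.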

Tests

Added test_hdbscan_cluster_selection_epsilon_tied_distances in sklearn/cluster/tests/test_hdbscan.py using the exact precomputed distance matrix from the issue report. The test verifies:

- No TypeError is raised
- All non-outlier points are assigned to a single cluster

Full HDBSCAN test suite: 149 passed, 0 failed.

Note on triage

scikit-learn member betatim confirmed the fix approach in the issue thread and provided the exact diff, so this PR implements that confirmed fix together with a non-regression test.

In `traverse_upwards` (_tree.pyx), boolean-index lookups on the condensed cluster tree can return a shape-(1,) array instead of a scalar when tied distances cause duplicate child entries. NumPy 2.4+ removed the silent array-to-scalar conversion, surfacing this as:

TypeError: only 0-dimensional arrays can be converted to Python scalars

Fix: add explicit `[0]` indexing on both the `parent` and `parent_eps` lookups so that the first (and logically equivalent) element is taken.

Adds a non-regression test using the precomputed distance matrix from the issue report, which reliably triggers tied-distance rows in the condensed tree.

Fixes scikit-learn#33219

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

gambletan wants to merge 1 commit into scikit-learn:main from gambletan:fix/hdbscan-cluster-selection-epsilon-33219