Add DBSCAN clustering algorithm in machine_learning/#14851
Conversation
There was a problem hiding this comment.
Pull request overview
This pull request adds a new machine_learning/dbscan.py module implementing the DBSCAN (Density-Based Spatial Clustering of Applications with Noise) clustering algorithm from scratch, providing a density-based alternative to existing clustering implementations in the repository.
Changes:
- Adds a from-scratch DBSCAN implementation (
dbscan) with input validation and-1noise labeling. - Adds supporting helpers (
euclidean_distance,get_neighbors) used by the clustering routine. - Includes doctests and reference links (Wikipedia + original paper) and a
__main__doctest runner.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| Point Types: | ||
| - Core point: Has at least `min_points` neighbors within `epsilon` distance | ||
| - Border point: Within `epsilon` of a core point, but has fewer than | ||
| `min_points` neighbors | ||
| - Noise point: Neither core nor border — labeled as -1 |
| def get_neighbors( | ||
| data: list[list[float]], point_index: int, epsilon: float | ||
| ) -> list[int]: | ||
| """ | ||
| Return indices of all points within epsilon distance of data[point_index]. | ||
|
|
||
| >>> data = [[0.0, 0.0], [0.1, 0.1], [5.0, 5.0]] | ||
| >>> get_neighbors(data, 0, 0.5) | ||
| [0, 1] | ||
| >>> get_neighbors(data, 2, 0.5) | ||
| [2] | ||
| >>> get_neighbors(data, 0, 10.0) | ||
| [0, 1, 2] | ||
| """ | ||
| return [ | ||
| index | ||
| for index, point in enumerate(data) | ||
| if euclidean_distance(data[point_index], point) <= epsilon | ||
| ] |
zain-cs
left a comment
There was a problem hiding this comment.
Good implementation overall! The docstrings, type hints, and doctests are comprehensive and well-written.
One observation in dbscan(): after reassigning a noise point to the current cluster here:
if labels[current_point] == -1:
labels[current_point] = current_cluster_idthe already_in_another check immediately below will never trigger for that point since it was just assigned. The guard is only meaningful for points that were previously assigned to another cluster — consider reordering the logic to make the intent clearer.
Otherwise looks solid! 👍
Describe your change:
Add DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
clustering algorithm implemented from scratch without any external ML libraries.
Checklist:
Fixes #10683