Skip to content

ENH: Descending sorts with built-in dtype implementations#31345

Open
MaanasArora wants to merge 51 commits into
numpy:mainfrom
MaanasArora:reverse-sorts
Open

ENH: Descending sorts with built-in dtype implementations#31345
MaanasArora wants to merge 51 commits into
numpy:mainfrom
MaanasArora:reverse-sorts

Conversation

@MaanasArora
Copy link
Copy Markdown
Contributor

This is a draft PR depending on #31278 that enables descending sorts on the Python side and adds support to built-in dtypes (bools, ints, floats, dates). It copies the comparison functions rather than the sorts, but the template wrappers are still a bit verbose. I added some tests for only sorting (not argsorting) but more are needed.

Note that SIMD sorts are not used for descending, nor is radixsort, as these would require more changes to the sort functions themselves (radixsort uses raw memory, for example).

AI Disclosure

Claude wrote the (arg)sort_*_descend functions (not the comparisons), and I looked over very briefly so far.

@MaanasArora MaanasArora marked this pull request as draft April 27, 2026 23:08
Comment thread numpy/_core/src/common/numpy_tag.h Outdated
static int less_equal(type const& a, type const& b) {
return !less(b, a);
}
static int greater(type const& a, type const& b) {
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Arguably we can just create new tags, but doing this seems nicer for consolidation.

@MaanasArora MaanasArora changed the title Reverse sorts ENH: Descending sorts with built-in dtype implementations Apr 27, 2026
Copy link
Copy Markdown
Member

@seberg seberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I guess it's worth looking if we can't just clean this up... and in the age of agents, maybe just with the help of an agent...

This whole tag business was always annoying. One small part that would help here, is to push the new sort loop creation into C++ so we only have a make_sort_loops_@name@(...) and everything else can use templating without having to export dozens of functions explicitly.

(And even if this tag business makes sense, I really don't get why it can't use a subclassed template to clean this up. Probably also a C++11 limitation?)

EDIT: I'll see if I can't make a gross simplification in a separate PR quickly... Done in gh-31352, probably more possible, but a minimal start that is probably sufficient for this...

Comment thread numpy/_core/src/common/numpy_tag.h Outdated
}
};

template <typename Tag>
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can just use bool reverse now, maybe that wasn't possible with C++11 (where all of this text monstrosity comes from).

@seberg
Copy link
Copy Markdown
Member

seberg commented Apr 28, 2026

One more thing: It seems to me that both highway and x86 sort should have a descending parameter. For highway, I didn't immediately find whether its NaN handling is what we need, but we can probably just try (if not, CI should fail as we'll have some test for it :)).

@MaanasArora
Copy link
Copy Markdown
Contributor Author

MaanasArora commented Apr 28, 2026

push the new sort loop creation into C++ so we only have a make_sort_loops_@name@(...)

Push into C++ or do you mean .c.src? But yes, if you mean including this in arraytypes.c.src, we can easily use templating for that I think, with name, tag name, and whether to use SIMD dispatch.

For highway, I didn't immediately find whether its NaN handling is what we need, but we can probably just try (if not, CI should fail as we'll have some test for it :)).

Will do this and above now! We actually already have a basic test for NaN handling for descending in this PR (though only floats I think)!

Edit: I have a draft for quicksort, but let me wait for the other two PRs to merge to rebase before pushing or this will be challenging to rebase :)

@MaanasArora
Copy link
Copy Markdown
Contributor Author

MaanasArora commented Apr 29, 2026

I rebased over the other PR, and moved the sort loops registration to C++, it really is much nicer!

I've pushed something broken... the issue is mainly, though, that I'm including .cpp files directly (!) because I'm a bit hesitant to rename/move to .hpp and then just take npy_*sort functions out. It might be hard to review the line-by-line changes otherwise. Not really sure how to move forward here, I guess if we're going to clean this up just moving the files would make sense, maybe after a brief review? Thanks!

Edit: I've moved the template code to header files, and everything works well! So if you like I can push after you have a brief pass at the line-by-line changes?

@MaanasArora
Copy link
Copy Markdown
Contributor Author

I've pushed what I have now - going to clean things up and add back the ArrFuncs.

@MaanasArora
Copy link
Copy Markdown
Contributor Author

MaanasArora commented Apr 30, 2026

I did a couple more iterations:

  • Added back sorting arrfuncs during registration
  • Fixed some critical bugs, especially with string and unicode sorting
  • Added a few tests for both sorts and argsorts, with randomization and nan tests
  • Made sure the stubs match
    • including masked arrays but following patterns for stable, raise on descending
  • Made sure linting passed

What is remaining is adding SIMD sorts (though not sure if that should be follow-up, it's not too invasive, but still requires some changes to x86_simd_sort), and cleaning up the Tag::template cmp (I'm having some trouble because it might be repetitive to keep creating a helper).

But I think this should be mostly ready now! Thanks.

Edit: oh, and need a release note, and some explanatory docs maybe.

@MaanasArora MaanasArora marked this pull request as ready for review April 30, 2026 06:34
@seberg seberg added the 56 - Needs Release Note. Needs an entry in doc/release/upcoming_changes label Apr 30, 2026
@MaanasArora
Copy link
Copy Markdown
Contributor Author

MaanasArora commented May 1, 2026

OK, I pushed SIMD sorts for ascending, still trying to check if descending can be included. Also added a release note. Aside from this, mainly just cleanup remaining, hopefully.

Edit: nevermind, some strange bug with the SIMD on some systems... (Seems fixed now.)

@MaanasArora
Copy link
Copy Markdown
Contributor Author

MaanasArora commented May 3, 2026

The SIMD bug was for np::half, seems solved now!

I was able to thread descending through SIMD, but as far as I can tell even x86 simd sort doesn't sort nans to the end on descending, so haven't pushed. Closer look, and it seems nans are replaced with positive infinity for both orders.

Copy link
Copy Markdown
Member

@seberg seberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, two more small nits, I'll have another look over the tests, the large moves and things later, but I think this is settling.

(One silly thing: I find the _generic name a bit awkward, because it's exactly the other file that implements the "generic" version that uses the cmp function. Not sure whats best, maybe _templates or even swapping the name -- anyway, just a nit. )

Comment thread numpy/_core/src/common/numpy_tag.h Outdated
Comment thread numpy/_core/src/multiarray/arraytypes.c.src Outdated
Comment thread numpy/_core/src/npysort/quicksort_generic.hpp Outdated
@MaanasArora
Copy link
Copy Markdown
Contributor Author

MaanasArora commented May 4, 2026

Thanks for reviewing! It seems the greater function for complex tag had another bug, which is that we didn't special case NaNs for the real part. We didn't do it for less either, but perhaps got away because nans are always less in c floats. I pushed a fix, if that looks correct, should we mirror it for less or keep it as is?

We do need better tests for complex nans, given three "types" of (heterogenous) nans! Will try to draft some soon.

I find the _generic name a bit awkward, because it's exactly the other file that implements the "generic" version

Ha, I was thinking of generics on types aka templates and somehow thought in the moment that would be clear - I'll probably just swap the names unless you have a preference for another name!

Copy link
Copy Markdown
Member

@seberg seberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I should go through once more (but I am not sure I'll get to it this week).

But I think this is ready now so if anyone else wants to take a look, please do, because otherwise I may just make a pass put it in (probably early next week).

r += make_sorts_<npy::timedelta_tag>(&PyArray_TimedeltaDType, "timedelta");
r += make_string_sorts_<npy::string_tag>(&PyArray_BytesDType, "string");
r += make_string_sorts_<npy::unicode_tag>(&PyArray_UnicodeDType, "unicode");
r += make_sorts_<npy::half_tag>(&PyArray_HalfDType, "half");
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm, we don't have error handling on the other side now. (Although, it probably doesn't matter much, if this fails some Python error will bubble up and that is all that matters. -- it may be slightly nicer to deal with the errors right away, with a && chain or just a loop, although that may need making some local struct array to loop.)

Copy link
Copy Markdown
Contributor Author

@MaanasArora MaanasArora May 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah right - I made a suggestion in another comment, though yeah, not sure if it's much better given it's not chained!

* Add sorting array methods for the new types.
*/

register_all_sorts();
Copy link
Copy Markdown
Contributor Author

@MaanasArora MaanasArora May 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
register_all_sorts();
if (register_all_sorts() < 0) {
return -1;
}

Perhaps this suffices, import will crash anyway this way?

@MaanasArora
Copy link
Copy Markdown
Contributor Author

Thanks for reviewing! I swapped the names for quicksort/timsort_generic. I'll try to look at better complex nan tests soon!

@jorenham jorenham requested a review from seiko2plus May 6, 2026 18:45
@MaanasArora
Copy link
Copy Markdown
Contributor Author

MaanasArora commented May 7, 2026

Added various complex tests (maybe too complex :)) and fixed a SIMD dispatch bug.

@charris brought up benchmarks in the community call, so I ran them against the latest commit.

Sort Benchmarks
| Change   | Before [12e16ab3] <main>   | After [ed45342a] <reverse-sorts~1>   |   Ratio | Benchmark (Parameter)                                                         |
|----------|----------------------------|--------------------------------------|---------|-------------------------------------------------------------------------------|
| +        | 8.25±0.05ms                | 76.9±0.7ms                           |    9.32 | bench_function_base.Sort.time_sort('heap', 'float64', ('random',))            |
| +        | 13.2±0.2ms                 | 71.4±0.2ms                           |    5.4  | bench_function_base.Sort.time_sort('heap', 'int64', ('random',))              |
| +        | 4.13±0.08ms                | 21.5±0.07ms                          |    5.21 | bench_function_base.Sort.time_sort('merge', 'int16', ('sorted_block', 100))   |
| +        | 4.73±0.02ms                | 18.8±0.1ms                           |    3.98 | bench_function_base.Sort.time_sort('heap', 'float32', ('sorted_block', 10))   |
| +        | 5.02±0.02ms                | 16.3±0.06ms                          |    3.24 | bench_function_base.Sort.time_sort('heap', 'uint32', ('sorted_block', 10))    |
| +        | 4.32±0.04ms                | 13.3±0.07ms                          |    3.08 | bench_function_base.Sort.time_sort('merge', 'int16', ('sorted_block', 1000))  |
| +        | 4.90±0.01ms                | 13.1±0.08ms                          |    2.68 | bench_function_base.Sort.time_sort('heap', 'float32', ('sorted_block', 100))  |
| +        | 5.24±0.04ms                | 12.5±0.04ms                          |    2.39 | bench_function_base.Sort.time_sort('heap', 'int32', ('sorted_block', 10))     |
| +        | 8.98±0.05ms                | 21.0±0.1ms                           |    2.33 | bench_function_base.Sort.time_sort('heap', 'float64', ('sorted_block', 10))   |
| +        | 6.45±3ms                   | 13.5±0.04ms                          |    2.09 | bench_function_base.Sort.time_sort('merge', 'int16', ('sorted_block', 10))    |
| +        | 4.75±0.2ms                 | 76.0±0.1ms                           |   16    | bench_function_base.Sort.time_sort('heap', 'float32', ('random',))            |
| +        | 4.92±0.05ms                | 69.1±0.2ms                           |   14.06 | bench_function_base.Sort.time_sort('heap', 'uint32', ('random',))             |
| +        | 4.55±0.4ms                 | 63.7±0.08ms                          |   13.99 | bench_function_base.Sort.time_sort('merge', 'int16', ('random',))             |
| +        | 5.10±0.09ms                | 69.5±0.5ms                           |   13.63 | bench_function_base.Sort.time_sort('heap', 'int32', ('random',))              |
| +        | 5.30±0.01ms                | 9.79±0.04ms                          |    1.85 | bench_function_base.Sort.time_sort('heap', 'float32', ('sorted_block', 1000)) |
| +        | 511±20μs                   | 844±2μs                              |    1.65 | bench_function_base.Sort.time_sort('heap', 'float32', ('uniform',))           |
| +        | 8.96±0.09ms                | 14.2±0.06ms                          |    1.59 | bench_function_base.Sort.time_sort('heap', 'float64', ('sorted_block', 100))  |
| +        | 5.32±0.03ms                | 7.67±0.02ms                          |    1.44 | bench_function_base.Sort.time_sort('heap', 'uint32', ('sorted_block', 100))   |
| +        | 5.50±0.06ms                | 7.83±0.07ms                          |    1.42 | bench_function_base.Sort.time_sort('heap', 'int32', ('sorted_block', 100))    |
| +        | 376±7μs                    | 532±40μs                             |    1.42 | bench_function_base.Sort.time_sort('heap', 'uint32', ('uniform',))            |
| +        | 378±30μs                   | 490±10μs                             |    1.29 | bench_function_base.Sort.time_sort('heap', 'int32', ('uniform',))             |
| +        | 9.20±0.1ms                 | 10.9±0.2ms                           |    1.18 | bench_function_base.Sort.time_sort('heap', 'float64', ('sorted_block', 1000)) |
| +        | 691±20μs                   | 786±30μs                             |    1.14 | bench_function_base.Sort.time_sort('heap', 'int64', ('uniform',))             |
| +        | 58.2±0.4ms                 | 63.7±0.3ms                           |    1.1  | bench_function_base.Sort.time_sort('heap', 'int16', ('random',))              |
| +        | 13.9±0.03ms                | 14.7±0.03ms                          |    1.05 | bench_function_base.Sort.time_sort('heap', 'int64', ('sorted_block', 10))     |
| -        | 20.1±2ms                   | 18.8±0.08ms                          |    0.94 | bench_function_base.Sort.time_sort('merge', 'float32', ('sorted_block', 10))  |
| -        | 49.8±9ms                   | 46.6±0.1ms                           |    0.93 | bench_function_base.Sort.time_sort('quick', 'int16', ('sorted_block', 10))    |
| -        | 58.8±4ms                   | 53.8±0.2ms                           |    0.92 | bench_function_base.Sort.time_sort('quick', 'int16', ('sorted_block', 100))   |
| -        | 83.6±10ms                  | 76.0±0.1ms                           |    0.91 | bench_function_base.Sort.time_sort('merge', 'float32', ('random',))           |
| -        | 13.9±2ms                   | 12.5±0.06ms                          |    0.9  | bench_function_base.Sort.time_sort('merge', 'int32', ('sorted_block', 10))    |
| -        | 23.2±0.9ms                 | 20.9±0.06ms                          |    0.9  | bench_function_base.Sort.time_sort('quick', 'int16', ('ordered',))            |
| -        | 85.7±8ms                   | 76.4±0.1ms                           |    0.89 | bench_function_base.Sort.time_sort('merge', 'float64', ('random',))           |
| -        | 17.1±3ms                   | 14.9±0.2ms                           |    0.87 | bench_function_base.Sort.time_sort('merge', 'int64', ('sorted_block', 10))    |
| -        | 8.87±1ms                   | 7.68±0.08ms                          |    0.87 | bench_function_base.Sort.time_sort('merge', 'uint32', ('sorted_block', 100))  |
| -        | 6.60±1ms                   | 5.72±0.05ms                          |    0.87 | bench_function_base.Sort.time_sort('merge', 'uint32', ('sorted_block', 1000)) |
| -        | 57.0±10ms                  | 49.1±0.1ms                           |    0.86 | bench_function_base.Sort.time_sort('quick', 'int16', ('sorted_block', 1000))  |
| -        | 9.43±2ms                   | 7.85±0.04ms                          |    0.83 | bench_function_base.Sort.time_sort('merge', 'int32', ('sorted_block', 100))   |
| -        | 586±60μs                   | 488±9μs                              |    0.83 | bench_function_base.Sort.time_sort('merge', 'uint32', ('uniform',))           |
| -        | 432±20μs                   | 356±20μs                             |    0.82 | bench_function_base.Sort.time_sort('quick', 'int32', ('uniform',))            |
| -        | 935±200μs                  | 701±20μs                             |    0.75 | bench_function_base.Sort.time_sort('merge', 'float32', ('reversed',))         |
| -        | 24.9±0.2ms                 | 18.3±0.2ms                           |    0.74 | bench_function_base.Sort.time_sort('heap', 'float16', ('random',))            |
| -        | 14.1±0.09ms                | 9.55±0.05ms                          |    0.68 | bench_function_base.Sort.time_sort('heap', 'int64', ('sorted_block', 100))    |
| -        | 34.9±10ms                  | 22.2±0.2ms                           |    0.64 | bench_function_base.Sort.time_sort('quick', 'float16', ('sorted_block', 10))  |
| -        | 8.95±3ms                   | 5.42±0.03ms                          |    0.61 | bench_function_base.Sort.time_sort('quick', 'int16', ('uniform',))            |
| -        | 14.4±0.07ms                | 7.36±0.06ms                          |    0.51 | bench_function_base.Sort.time_sort('heap', 'int64', ('sorted_block', 1000))   |
| -        | 22.6±0.05ms                | 11.0±0.3ms                           |    0.49 | bench_function_base.Sort.time_sort('heap', 'float16', ('sorted_block', 100))  |
| -        | 22.8±0.07ms                | 9.62±0.06ms                          |    0.42 | bench_function_base.Sort.time_sort('heap', 'float16', ('sorted_block', 10))   |
| -        | 20.8±0.03ms                | 8.27±0.3ms                           |    0.4  | bench_function_base.Sort.time_sort('heap', 'float16', ('sorted_block', 1000)) |
| -        | 54.9±0.2ms                 | 21.5±0.04ms                          |    0.39 | bench_function_base.Sort.time_sort('heap', 'int16', ('sorted_block', 100))    |
| -        | 7.09±0.2ms                 | 2.10±0.04ms                          |    0.3  | bench_function_base.Sort.time_sort('merge', 'int16', ('ordered',))            |
| -        | 47.9±0.2ms                 | 13.4±0.02ms                          |    0.28 | bench_function_base.Sort.time_sort('heap', 'int16', ('sorted_block', 10))     |
| -        | 50.4±0.1ms                 | 13.3±0.03ms                          |    0.26 | bench_function_base.Sort.time_sort('heap', 'int16', ('sorted_block', 1000))   |
| -        | 4.21±0.09ms                | 862±20μs                             |    0.2  | bench_function_base.Sort.time_sort('heap', 'float32', ('ordered',))           |
| -        | 4.24±0.03ms                | 731±40μs                             |    0.17 | bench_function_base.Sort.time_sort('heap', 'float32', ('reversed',))          |
| -        | 4.55±0.04ms                | 619±10μs                             |    0.14 | bench_function_base.Sort.time_sort('heap', 'uint32', ('reversed',))           |
| -        | 8.46±0.4ms                 | 1.07±0.04ms                          |    0.13 | bench_function_base.Sort.time_sort('heap', 'float64', ('ordered',))           |
| -        | 4.68±0.09ms                | 587±20μs                             |    0.13 | bench_function_base.Sort.time_sort('heap', 'int32', ('reversed',))            |
| -        | 8.40±0.1ms                 | 1.03±0.03ms                          |    0.12 | bench_function_base.Sort.time_sort('heap', 'float64', ('reversed',))          |
| -        | 4.66±0.05ms                | 490±9μs                              |    0.11 | bench_function_base.Sort.time_sort('heap', 'int32', ('ordered',))             |
| -        | 4.57±0.04ms                | 498±20μs                             |    0.11 | bench_function_base.Sort.time_sort('heap', 'uint32', ('ordered',))            |
| -        | 22.4±0.06ms                | 2.11±0.01ms                          |    0.09 | bench_function_base.Sort.time_sort('heap', 'int16', ('ordered',))             |
| -        | 13.2±0.04ms                | 1.07±0.08ms                          |    0.08 | bench_function_base.Sort.time_sort('heap', 'int64', ('reversed',))            |
| -        | 18.0±0.08ms                | 1.27±0ms                             |    0.07 | bench_function_base.Sort.time_sort('heap', 'float16', ('ordered',))           |
| -        | 14.8±0.1ms                 | 1.03±0ms                             |    0.07 | bench_function_base.Sort.time_sort('heap', 'float16', ('uniform',))           |
| -        | 5.60±0.1ms                 | 344±4μs                              |    0.06 | bench_function_base.Sort.time_sort('heap', 'int16', ('uniform',))             |
| -        | 13.3±0.08ms                | 823±8μs                              |    0.06 | bench_function_base.Sort.time_sort('heap', 'int64', ('ordered',))             |
Argsort Benchmarks
| Change   | Before [12e16ab3] <main>   | After [1cdac29d] <reverse-sorts>   |   Ratio | Benchmark (Parameter)                                                             |
|----------|----------------------------|------------------------------------|---------|-----------------------------------------------------------------------------------|
| +        | 11.0±0.4ms                 | 80.7±0.7ms                         |    7.36 | bench_function_base.Sort.time_argsort('merge', 'int16', ('random',))              |
| +        | 8.11±0.2ms                 | 28.8±0.2ms                         |    3.55 | bench_function_base.Sort.time_argsort('merge', 'int16', ('sorted_block', 100))    |
| +        | 32.9±0.3ms                 | 94.9±0.8ms                         |    2.88 | bench_function_base.Sort.time_argsort('heap', 'float32', ('random',))             |
| +        | 34.3±2ms                   | 96.2±2ms                           |    2.81 | bench_function_base.Sort.time_argsort('heap', 'float64', ('random',))             |
| +        | 32.6±0.3ms                 | 87.1±0.6ms                         |    2.67 | bench_function_base.Sort.time_argsort('heap', 'uint32', ('random',))              |
| +        | 42.2±0.8ms                 | 90.6±0.7ms                         |    2.15 | bench_function_base.Sort.time_argsort('heap', 'int64', ('random',))               |
| +        | 8.92±0.2ms                 | 18.5±0.2ms                         |    2.07 | bench_function_base.Sort.time_argsort('merge', 'int16', ('sorted_block', 1000))   |
| +        | 43.8±30ms                  | 89.1±1ms                           |    2.03 | bench_function_base.Sort.time_argsort('heap', 'int32', ('random',))               |
| +        | 9.43±0.2ms                 | 16.6±0.1ms                         |    1.76 | bench_function_base.Sort.time_argsort('merge', 'int16', ('sorted_block', 10))     |
| +        | 16.8±0.2ms                 | 20.7±0.8ms                         |    1.23 | bench_function_base.Sort.time_argsort('merge', 'uint32', ('sorted_block', 10))    |
| +        | 72.8±2ms                   | 80.1±0.4ms                         |    1.1  | bench_function_base.Sort.time_argsort('heap', 'int16', ('random',))               |
| +        | 5.96±0.1ms                 | 6.35±0.07ms                        |    1.07 | bench_function_base.Sort.time_argsort('quick', 'int16', ('uniform',))             |
| +        | 30.7±0.4ms                 | 32.6±0.1ms                         |    1.06 | bench_function_base.Sort.time_argsort('quick', 'int32', ('random',))              |
| -        | 24.4±0.1ms                 | 23.1±0.08ms                        |    0.95 | bench_function_base.Sort.time_argsort('quick', 'float16', ('sorted_block', 10))   |
| -        | 10.3±0.3ms                 | 9.60±0.2ms                         |    0.93 | bench_function_base.Sort.time_argsort('merge', 'float16', ('sorted_block', 1000)) |
| -        | 16.6±0.2ms                 | 15.3±0.2ms                         |    0.92 | bench_function_base.Sort.time_argsort('quick', 'float16', ('uniform',))           |
| -        | 26.1±0.6ms                 | 22.9±0.5ms                         |    0.88 | bench_function_base.Sort.time_argsort('heap', 'float32', ('sorted_block', 10))    |
| -        | 29.3±0.3ms                 | 24.2±0.5ms                         |    0.83 | bench_function_base.Sort.time_argsort('heap', 'float64', ('sorted_block', 10))    |
| -        | 26.7±1ms                   | 20.2±0.4ms                         |    0.76 | bench_function_base.Sort.time_argsort('heap', 'uint32', ('sorted_block', 10))     |
| -        | 1.39±0.03ms                | 967±20μs                           |    0.69 | bench_function_base.Sort.time_argsort('heap', 'float32', ('uniform',))            |
| -        | 1.41±0.07ms                | 944±30μs                           |    0.67 | bench_function_base.Sort.time_argsort('heap', 'float32', ('ordered',))            |
| -        | 1.54±0.2ms                 | 997±30μs                           |    0.65 | bench_function_base.Sort.time_argsort('heap', 'float64', ('ordered',))            |
| -        | 1.06±0.04ms                | 683±40μs                           |    0.65 | bench_function_base.Sort.time_argsort('heap', 'int64', ('ordered',))              |
| -        | 31.5±5ms                   | 20.2±0.2ms                         |    0.64 | bench_function_base.Sort.time_argsort('heap', 'float16', ('random',))             |
| -        | 974±80μs                   | 611±10μs                           |    0.63 | bench_function_base.Sort.time_argsort('heap', 'uint32', ('ordered',))             |
| -        | 1.63±0.09ms                | 992±10μs                           |    0.61 | bench_function_base.Sort.time_argsort('heap', 'float64', ('uniform',))            |
| -        | 29.0±4ms                   | 17.6±0.2ms                         |    0.61 | bench_function_base.Sort.time_argsort('heap', 'int32', ('sorted_block', 10))      |
| -        | 1.08±0.1ms                 | 643±40μs                           |    0.6  | bench_function_base.Sort.time_argsort('heap', 'int32', ('uniform',))              |
| -        | 27.9±0.3ms                 | 16.4±0.5ms                         |    0.59 | bench_function_base.Sort.time_argsort('heap', 'float64', ('sorted_block', 100))   |
| -        | 1.18±0.2ms                 | 684±80μs                           |    0.58 | bench_function_base.Sort.time_argsort('heap', 'int64', ('uniform',))              |
| -        | 29.1±0.2ms                 | 16.3±0.2ms                         |    0.56 | bench_function_base.Sort.time_argsort('heap', 'float32', ('sorted_block', 100))   |
| -        | 36.9±0.7ms                 | 19.3±0.9ms                         |    0.52 | bench_function_base.Sort.time_argsort('heap', 'int64', ('sorted_block', 10))      |
| -        | 27.5±3ms                   | 12.6±0.2ms                         |    0.46 | bench_function_base.Sort.time_argsort('heap', 'float16', ('sorted_block', 100))   |
| -        | 26.1±0.6ms                 | 11.9±0.4ms                         |    0.46 | bench_function_base.Sort.time_argsort('heap', 'float64', ('sorted_block', 1000))  |
| -        | 26.8±0.1ms                 | 11.6±0.1ms                         |    0.43 | bench_function_base.Sort.time_argsort('heap', 'float32', ('sorted_block', 1000))  |
| -        | 25.0±3ms                   | 10.2±0.4ms                         |    0.41 | bench_function_base.Sort.time_argsort('heap', 'float16', ('sorted_block', 1000))  |
| -        | 1.54±0.6ms                 | 630±20μs                           |    0.41 | bench_function_base.Sort.time_argsort('heap', 'uint32', ('uniform',))             |
| -        | 26.5±3ms                   | 10.4±0.06ms                        |    0.39 | bench_function_base.Sort.time_argsort('heap', 'float16', ('sorted_block', 10))    |
| -        | 31.3±2ms                   | 12.4±0.1ms                         |    0.39 | bench_function_base.Sort.time_argsort('heap', 'int32', ('sorted_block', 100))     |
| -        | 75.8±20ms                  | 27.8±0.5ms                         |    0.37 | bench_function_base.Sort.time_argsort('heap', 'int16', ('sorted_block', 100))     |
| -        | 35.9±0.2ms                 | 12.6±0.5ms                         |    0.35 | bench_function_base.Sort.time_argsort('heap', 'int64', ('sorted_block', 100))     |
| -        | 35.9±20ms                  | 11.8±0.4ms                         |    0.33 | bench_function_base.Sort.time_argsort('heap', 'uint32', ('sorted_block', 100))    |
| -        | 2.19±1ms                   | 670±10μs                           |    0.31 | bench_function_base.Sort.time_argsort('heap', 'int32', ('ordered',))              |
| -        | 30.2±2ms                   | 8.67±0.2ms                         |    0.29 | bench_function_base.Sort.time_argsort('heap', 'int32', ('sorted_block', 1000))    |
| -        | 59.1±4ms                   | 16.6±0.09ms                        |    0.28 | bench_function_base.Sort.time_argsort('heap', 'int16', ('sorted_block', 10))      |
| -        | 67.7±5ms                   | 19.1±0.6ms                         |    0.28 | bench_function_base.Sort.time_argsort('heap', 'int16', ('sorted_block', 1000))    |
| -        | 33.7±1ms                   | 8.88±0.1ms                         |    0.26 | bench_function_base.Sort.time_argsort('heap', 'int64', ('sorted_block', 1000))    |
| -        | 18.7±1ms                   | 4.49±0.5ms                         |    0.24 | bench_function_base.Sort.time_argsort('heap', 'float16', ('uniform',))            |
| -        | 15.5±0.3ms                 | 3.79±0.2ms                         |    0.24 | bench_function_base.Sort.time_argsort('merge', 'int16', ('ordered',))             |
| -        | 46.7±20ms                  | 8.34±0.3ms                         |    0.18 | bench_function_base.Sort.time_argsort('heap', 'uint32', ('sorted_block', 1000))   |
| -        | 20.6±1ms                   | 3.02±0.2ms                         |    0.15 | bench_function_base.Sort.time_argsort('heap', 'float16', ('ordered',))            |
| -        | 27.9±1ms                   | 3.47±0.09ms                        |    0.12 | bench_function_base.Sort.time_argsort('heap', 'int16', ('ordered',))              |
| -        | 7.67±3ms                   | 593±10μs                           |    0.08 | bench_function_base.Sort.time_argsort('heap', 'int16', ('uniform',))              |
| -        | 21.2±1ms                   | 986±20μs                           |    0.05 | bench_function_base.Sort.time_argsort('heap', 'float64', ('reversed',))           |
| -        | 22.3±0.6ms                 | 911±20μs                           |    0.04 | bench_function_base.Sort.time_argsort('heap', 'float32', ('reversed',))           |
| -        | 28.2±0.2ms                 | 886±50μs                           |    0.03 | bench_function_base.Sort.time_argsort('heap', 'int64', ('reversed',))             |
| -        | 40.5±20ms                  | 869±30μs                           |    0.02 | bench_function_base.Sort.time_argsort('heap', 'int32', ('reversed',))             |
| -        | 39.0±20ms                  | 832±10μs                           |    0.02 | bench_function_base.Sort.time_argsort('heap', 'uint32', ('reversed',))            |

Basically:

  • "heap" sort behavior is dramatically different, which makes sense because it's not being used anymore (timsort is being used) - not sure if we're actually planning to deprecate it for legacy like this, though?
  • the other regression is for stable sorts (i.e. "merge") on small integers - which is because we don't use radixsort anymore, as @charris suspected, so we should add it back.

Not sure if we should add either in this PR though, the diff may explode. Radixsort needs to be done sometime, as it's a significant regression, though (perhaps moreso for bytes and bools, which are not benchmarked)...

@seberg
Copy link
Copy Markdown
Member

seberg commented May 7, 2026

Yeah, checking benchmarks was still missing and clearly important!

The heapsort thing seems also like a simple bug: We weren't using heapsort before, but now you are mapping heap-sort to stable rather than unstable. The sort path should be clearing the HEAPSORT flag before setting it for the sort parameter flags (and then just delete it from the enum).
As for radix-sort. Yes, definitely add it back as well.

@MaanasArora
Copy link
Copy Markdown
Contributor Author

MaanasArora commented May 7, 2026

Thanks! Added back radixsort (just a rename for the file), and fixed the heapsort bug (you were right, it was in PyArray_(Arg)Sort)! Benchmarks look much better now:

Sort Benchmarks
spin bench -t Sort.time_sort --compare main

| Change   | Before [532a1c5d] <main>   | After [4b3a4815] <reverse-sorts>   |   Ratio | Benchmark (Parameter)                                                          |
|----------|----------------------------|------------------------------------|---------|--------------------------------------------------------------------------------|
| -        | 9.64±0.04ms                | 9.04±0.02ms                        |    0.94 | bench_function_base.Sort.time_sort('merge', 'float16', ('sorted_block', 10))   |
| -        | 20.1±0.04ms                | 18.7±0.1ms                         |    0.93 | bench_function_base.Sort.time_sort('merge', 'float32', ('sorted_block', 10))   |
| -        | 9.98±0.09ms                | 9.21±0.2ms                         |    0.92 | bench_function_base.Sort.time_sort('quick', 'float64', ('sorted_block', 100))  |
| -        | 10.9±0.8ms                 | 9.43±0.2ms                         |    0.87 | bench_function_base.Sort.time_sort('quick', 'float64', ('sorted_block', 1000)) |

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.
Argsort benchmarks
spin bench -t Sort.time_argsort --compare main

| Change   | Before [532a1c5d] <main>   | After [4b3a4815] <reverse-sorts>   |   Ratio | Benchmark (Parameter)                                                           |
|----------|----------------------------|------------------------------------|---------|---------------------------------------------------------------------------------|
| +        | 8.26±0.1ms                 | 9.08±0.09ms                        |    1.1  | bench_function_base.Sort.time_argsort('merge', 'int32', ('sorted_block', 1000)) |
| -        | 23.5±2ms                   | 21.8±0.5ms                         |    0.93 | bench_function_base.Sort.time_argsort('quick', 'float32', ('reversed',))        |
| -        | 12.4±0.3ms                 | 11.4±0.3ms                         |    0.92 | bench_function_base.Sort.time_argsort('merge', 'uint32', ('sorted_block', 100)) |
| -        | 39.0±2ms                   | 35.8±0.4ms                         |    0.92 | bench_function_base.Sort.time_argsort('quick', 'int64', ('sorted_block', 100))  |
| -        | 39.5±1ms                   | 36.1±0.5ms                         |    0.91 | bench_function_base.Sort.time_argsort('quick', 'int64', ('sorted_block', 10))   |
| -        | 1.55±0.09ms                | 1.37±0.03ms                        |    0.89 | bench_function_base.Sort.time_argsort('quick', 'float32', ('uniform',))         |
| -        | 20.4±0.4ms                 | 17.6±0.1ms                         |    0.86 | bench_function_base.Sort.time_argsort('merge', 'int32', ('sorted_block', 10))   |

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE DECREASED.

@seberg
Copy link
Copy Markdown
Member

seberg commented May 11, 2026

OK, to wrap the current state: I would like to push forward, I think. Right now the performance for descending sorts is lower if we should be using x86-simd sort, because it's NaN handling is incompatible.

That isn't ideal and it is unclear if bad performance for descending=True should be blocking. However, Raghuveer has a fix here: numpy/x86-simd-sort#235 and integrating that into NumPy should be straight-forward enough that I would be willing to risk backporting (as a performance bug-fix) even after the rc release or in a 2.5.1 release.

Otherwise, there are no performance changes (beyond small random things) for the existing sorts.

@seberg seberg removed the 56 - Needs Release Note. Needs an entry in doc/release/upcoming_changes label May 12, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants