Commit a8082fe
committed
ENH: Use Highway's VQSort on AArch64
Introduces Highway, a SIMD library from Google. Highway has it's own dispatch mechanism and SIMD flavour (which adds padding) which may be a good replacement for NumPy intrinsics. Highway also has its own set of vectorised Math routines we could bolt on.
For this patch, I've integrated the VQSort algorithm only on AArch64 so
as not to impact other architectures. The VQSort algorithms performance
is comparable to the AVX512 algorithms, according to this report:
https://github.com/Voultapher/sort-research-rs/blob/main/writeup/intel_avx512/text.md
Performance improvements in the NumPy benchmark suite are as follows:
```
before after ratio
[f4b52d53] [f08058d4]
<main-setuppy> <highway-sort>
+ 53.0±0.08μs 221±0.4μs 4.18 bench_function_base.Sort.time_sort('quick', 'int64', ('ordered',))
+ 84.8±0.3μs 222±0.3μs 2.62 bench_function_base.Sort.time_sort('quick', 'int64', ('reversed',))
+ 89.0±0.2μs 197±0.2μs 2.21 bench_function_base.Sort.time_sort('quick', 'float64', ('ordered',))
+ 51.5±0.01μs 80.0±0.1μs 1.55 bench_function_base.Sort.time_sort('quick', 'uint32', ('ordered',))
+ 52.0±0.07μs 80.3±0.2μs 1.55 bench_function_base.Sort.time_sort('quick', 'int32', ('ordered',))
+ 148±0.3μs 197±0.2μs 1.33 bench_function_base.Sort.time_sort('quick', 'float64', ('reversed',))
+ 471±0.2μs 598±0.5μs 1.27 bench_function_base.Sort.time_argsort('heap', 'float64', ('ordered',))
+ 536±0.7μs 644±0.8μs 1.20 bench_function_base.Sort.time_argsort('heap', 'float64', ('reversed',))
+ 39.8±0.02μs 47.4±0.05μs 1.19 bench_function_base.Sort.time_argsort('merge', 'int32', ('sorted_block', 1000))
+ 39.8±0.01μs 47.4±0.02μs 1.19 bench_function_base.Sort.time_argsort('merge', 'uint32', ('sorted_block', 1000))
+ 40.1±0.01μs 47.6±0.04μs 1.19 bench_function_base.Sort.time_argsort('merge', 'int64', ('sorted_block', 1000))
+ 27.9±0.03μs 32.6±0.03μs 1.17 bench_function_base.Sort.time_argsort('heap', 'uint32', ('uniform',))
+ 685±0.5μs 796±0.8μs 1.16 bench_function_base.Sort.time_argsort('heap', 'float64', ('sorted_block', 10))
+ 68.7±0.05μs 78.4±0.04μs 1.14 bench_function_base.Sort.time_argsort('merge', 'uint32', ('sorted_block', 100))
+ 68.7±0.05μs 78.3±0.1μs 1.14 bench_function_base.Sort.time_argsort('merge', 'int32', ('sorted_block', 100))
+ 69.4±0.03μs 78.8±0.05μs 1.14 bench_function_base.Sort.time_argsort('merge', 'int64', ('sorted_block', 100))
+ 22.0±0.01μs 24.7±0.01μs 1.12 bench_function_base.Sort.time_sort('heap', 'uint32', ('uniform',))
+ 674±0.7μs 743±0.4μs 1.10 bench_function_base.Sort.time_argsort('heap', 'float64', ('sorted_block', 1000))
+ 730±0.6μs 794±0.8μs 1.09 bench_function_base.Sort.time_argsort('heap', 'float64', ('sorted_block', 100))
+ 464±2μs 492±10μs 1.06 bench_function_base.Sort.time_argsort('heap', 'int32', ('reversed',))
+ 95.5±0.1μs 101±0.1μs 1.06 bench_function_base.Sort.time_argsort('quick', 'float64', ('ordered',))
+ 839±1μs 886±2μs 1.06 bench_function_base.Sort.time_argsort('heap', 'float64', ('random',))
+ 95.7±0.1μs 101±0.05μs 1.06 bench_function_base.Sort.time_argsort('quick', 'float32', ('ordered',))
- 516±0.6μs 488±0.9μs 0.95 bench_function_base.Sort.time_sort('heap', 'float32', ('ordered',))
- 532±1μs 499±5μs 0.94 bench_function_base.Sort.time_argsort('heap', 'int16', ('reversed',))
- 34.5±0.05μs 32.2±0.03μs 0.93 bench_function_base.Sort.time_sort('merge', 'int32', ('sorted_block', 1000))
- 715±1μs 666±0.4μs 0.93 bench_function_base.Sort.time_sort('heap', 'float64', ('random',))
- 254±2μs 231±0.2μs 0.91 bench_function_base.Sort.time_sort('quick', 'int64', ('sorted_block', 1000))
- 655±0.5μs 592±0.3μs 0.90 bench_function_base.Sort.time_sort('heap', 'float64', ('sorted_block', 100))
- 24.7±0.02μs 22.0±0.07μs 0.89 bench_function_base.Sort.time_sort('heap', 'int32', ('uniform',))
- 621±0.7μs 549±0.3μs 0.88 bench_function_base.Sort.time_sort('heap', 'float64', ('sorted_block', 1000))
- 659±0.9μs 555±0.5μs 0.84 bench_function_base.Sort.time_sort('heap', 'float64', ('sorted_block', 10))
- 545±0.4μs 444±0.5μs 0.81 bench_function_base.Sort.time_sort('heap', 'float64', ('reversed',))
- 511±0.5μs 396±0.4μs 0.78 bench_function_base.Sort.time_sort('heap', 'float64', ('ordered',))
- 316±0.6μs 225±0.2μs 0.71 bench_function_base.Sort.time_sort('quick', 'int64', ('sorted_block', 10))
- 333±0.4μs 233±0.3μs 0.70 bench_function_base.Sort.time_sort('quick', 'int64', ('sorted_block', 100))
- 323±0.2μs 206±0.2μs 0.64 bench_function_base.Sort.time_sort('quick', 'float64', ('sorted_block', 1000))
- 134±0.9μs 85.0±0.07μs 0.64 bench_function_base.Sort.time_sort('quick', 'float32', ('reversed',))
- 376±0.9μs 228±0.3μs 0.61 bench_function_base.Sort.time_sort('quick', 'int64', ('random',))
- 418±0.4μs 207±0.4μs 0.50 bench_function_base.Sort.time_sort('quick', 'float64', ('sorted_block', 100))
- 413±2μs 199±0.2μs 0.48 bench_function_base.Sort.time_sort('quick', 'float64', ('sorted_block', 10))
- 64.4±0.3ms 28.6±0.3ms 0.44 bench_function_base.Sort.time_sort_worst
- 501±2μs 204±0.3μs 0.41 bench_function_base.Sort.time_sort('quick', 'float64', ('random',))
- 248±1μs 86.9±0.2μs 0.35 bench_function_base.Sort.time_sort('quick', 'uint32', ('sorted_block', 1000))
- 251±0.7μs 87.0±0.2μs 0.35 bench_function_base.Sort.time_sort('quick', 'int32', ('sorted_block', 1000))
- 329±0.9μs 91.4±0.1μs 0.28 bench_function_base.Sort.time_sort('quick', 'float32', ('sorted_block', 1000))
- 329±1μs 85.2±0.06μs 0.26 bench_function_base.Sort.time_sort('quick', 'int32', ('sorted_block', 100))
- 329±0.7μs 85.0±0.06μs 0.26 bench_function_base.Sort.time_sort('quick', 'uint32', ('sorted_block', 100))
- 315±0.5μs 80.4±0.05μs 0.26 bench_function_base.Sort.time_sort('quick', 'int32', ('sorted_block', 10))
- 320±1μs 80.3±0.09μs 0.25 bench_function_base.Sort.time_sort('quick', 'uint32', ('sorted_block', 10))
- 372±0.4μs 82.1±0.06μs 0.22 bench_function_base.Sort.time_sort('quick', 'int32', ('random',))
- 414±0.4μs 89.6±0.1μs 0.22 bench_function_base.Sort.time_sort('quick', 'float32', ('sorted_block', 100))
- 384±1μs 81.8±0.02μs 0.21 bench_function_base.Sort.time_sort('quick', 'uint32', ('random',))
- 415±0.3μs 84.8±0.2μs 0.20 bench_function_base.Sort.time_sort('quick', 'float32', ('sorted_block', 10))
- 491±2μs 86.3±0.1μs 0.18 bench_function_base.Sort.time_sort('quick', 'float32', ('random',))
- 111±0.4μs 13.0±0.01μs 0.12 bench_function_base.Sort.time_sort('quick', 'float64', ('uniform',))
- 50.9±0.03μs 4.93±0.01μs 0.10 bench_function_base.Sort.time_sort('quick', 'int64', ('uniform',))
- 108±0.3μs 7.25±0.02μs 0.07 bench_function_base.Sort.time_sort('quick', 'float32', ('uniform',))
- 51.7±0.05μs 3.43±0.01μs 0.07 bench_function_base.Sort.time_sort('quick', 'int32', ('uniform',))
- 51.3±0.07μs 3.35±0.06μs 0.07 bench_function_base.Sort.time_sort('quick', 'uint32', ('uniform',))
```1 parent 7e77d36 commit a8082fe
8 files changed
Lines changed: 76 additions & 19 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
10 | 10 | | |
11 | 11 | | |
12 | 12 | | |
| 13 | + | |
| 14 | + | |
| 15 | + | |
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
84 | 84 | | |
85 | 85 | | |
86 | 86 | | |
87 | | - | |
| 87 | + | |
88 | 88 | | |
89 | 89 | | |
90 | 90 | | |
| |||
105 | 105 | | |
106 | 106 | | |
107 | 107 | | |
108 | | - | |
| 108 | + | |
109 | 109 | | |
110 | 110 | | |
111 | 111 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
| 1 | + | |
| 2 | + | |
| 3 | + | |
| 4 | + | |
| 5 | + | |
| 6 | + | |
| 7 | + | |
| 8 | + | |
| 9 | + | |
| 10 | + | |
| 11 | + | |
| 12 | + | |
| 13 | + | |
| 14 | + | |
| 15 | + | |
| 16 | + | |
| 17 | + | |
| 18 | + | |
| 19 | + | |
| 20 | + | |
| 21 | + | |
| 22 | + | |
| 23 | + | |
| 24 | + | |
| 25 | + | |
| 26 | + | |
| 27 | + | |
| 28 | + | |
| 29 | + | |
| 30 | + | |
| 31 | + | |
| 32 | + | |
| 33 | + | |
| 34 | + | |
| 35 | + | |
| 36 | + | |
| 37 | + | |
| 38 | + | |
| 39 | + | |
| 40 | + | |
| 41 | + | |
| 42 | + | |
| 43 | + | |
| 44 | + | |
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
1 | 1 | | |
2 | | - | |
| 2 | + | |
| 3 | + | |
| 4 | + | |
3 | 5 | | |
4 | 6 | | |
5 | 7 | | |
| |||
11 | 13 | | |
12 | 14 | | |
13 | 15 | | |
14 | | - | |
| 16 | + | |
| 17 | + | |
15 | 18 | | |
16 | 19 | | |
17 | 20 | | |
| |||
89 | 92 | | |
90 | 93 | | |
91 | 94 | | |
92 | | - | |
| 95 | + | |
| 96 | + | |
93 | 97 | | |
94 | | - | |
| 98 | + | |
95 | 99 | | |
96 | | - | |
| 100 | + | |
97 | 101 | | |
98 | | - | |
| 102 | + | |
99 | 103 | | |
100 | | - | |
| 104 | + | |
101 | 105 | | |
102 | | - | |
| 106 | + | |
103 | 107 | | |
104 | | - | |
| 108 | + | |
105 | 109 | | |
106 | | - | |
| 110 | + | |
107 | 111 | | |
108 | | - | |
| 112 | + | |
109 | 113 | | |
110 | | - | |
| 114 | + | |
111 | 115 | | |
112 | | - | |
| 116 | + | |
113 | 117 | | |
114 | | - | |
| 118 | + | |
115 | 119 | | |
116 | | - | |
| 120 | + | |
117 | 121 | | |
118 | 122 | | |
119 | 123 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
10 | 10 | | |
11 | 11 | | |
12 | 12 | | |
| 13 | + | |
| 14 | + | |
| 15 | + | |
| 16 | + | |
13 | 17 | | |
14 | 18 | | |
15 | 19 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
1 | 1 | | |
2 | | - | |
| 2 | + | |
| 3 | + | |
3 | 4 | | |
4 | 5 | | |
5 | 6 | | |
| |||
Submodule x86-simd-sort updated from b9f9340 to 85fbe7d
0 commit comments