Expand contiguous reduce specializations #6338
prsabahrami wants to merge 2 commits into modular:main
Conversation
Pull request overview
This PR updates the GPU reduction backend to improve contiguous-axis reduction performance and fixes a benchmark host buffer sizing issue.
Changes:
- Fix host-side result buffer sizing in bench_reduce.mojo.
- Collapse per-thread SIMD row accumulators to scalars before block_reduce.
- Add a warp-level contiguous-axis fast path (warp_reduce_kernel) and update kernel dispatch/SIMD selection in reduce_launch.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| mojo/stdlib/std/algorithm/backend/gpu/reduction.mojo | Adds warp-level contiguous reduction kernel, changes SIMD→scalar collapse before block reduce, and adjusts dispatch/SIMD width selection. |
| max/kernels/benchmarks/gpu/bench_reduce.mojo | Fixes host-side buffer allocation to match the copied device buffer size. |
```mojo
var tail_col = (row_size // VEC_STRIDE) * VEC_STRIDE + lid
while tail_col < row_size:
    row_coords[axis] = tail_col
    var v = input_fn[dtype, 1, rank](row_coords).cast[accum_type]()
    comptime for i in range(num_reductions):
        scalar_accum[i] = reduce_fn[accum_type, 1, i](scalar_accum[i], v)
    tail_col += WARP_SIZE
```
In warp_reduce_kernel, tail_col starts at (row_size // VEC_STRIDE) * VEC_STRIDE + lid, but the vector loop already covers all elements < vec_limit (including the remainder when row_size < VEC_STRIDE or simd_width == 1). This causes some elements to be reduced twice (e.g. row_size=64, simd_width=8; or row_size=65, simd_width=1), producing incorrect results. Compute the scalar-tail start from the number of full SIMD vectors instead (e.g. tail_start = (row_size // simd_width) * simd_width, then tail_col = tail_start + lid) so only the final row_size % simd_width elements are handled in the tail loop.
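To illustrate the overlap this comment describes, here is a small Python simulation. It is a sketch, not the kernel: `WARP_SIZE = 32`, the lane-to-column assignment in the vector loop, and `vec_limit = (row_size // simd_width) * simd_width` are all assumptions about the kernel's behavior inferred from the review text.

```python
WARP_SIZE = 32  # assumed warp size

def visit_counts(row_size, simd_width, fixed_tail):
    """Simulate one warp's per-element visit counts for the vector loop
    plus the scalar tail loop of a contiguous-axis row reduction."""
    VEC_STRIDE = WARP_SIZE * simd_width
    # assumption: the vector loop covers every full SIMD vector in the row
    vec_limit = (row_size // simd_width) * simd_width
    visits = [0] * row_size
    for lid in range(WARP_SIZE):
        # vector loop: lane lid loads simd_width elements per stride
        vec_col = lid * simd_width
        while vec_col < vec_limit:
            for j in range(simd_width):
                visits[vec_col + j] += 1
            vec_col += VEC_STRIDE
    # scalar tail loop: buggy start vs. the suggested fix
    tail_start = vec_limit if fixed_tail else (row_size // VEC_STRIDE) * VEC_STRIDE
    for lid in range(WARP_SIZE):
        tail_col = tail_start + lid
        while tail_col < row_size:
            visits[tail_col] += 1
            tail_col += WARP_SIZE
    return visits

# row_size=64, simd_width=8: the buggy tail start revisits every element
assert max(visit_counts(64, 8, fixed_tail=False)) == 2
# with tail_start = (row_size // simd_width) * simd_width, each element
# is reduced exactly once
assert visit_counts(64, 8, fixed_tail=True) == [1] * 64
```

With `row_size=65, simd_width=1` the same simulation shows element 64 visited twice under the buggy start, matching the second failure case cited above.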
```mojo
while vec_col < vec_limit:
    row_coords[axis] = vec_col
    var v = input_fn[dtype, simd_width, rank](row_coords).cast[accum_type]()
    comptime for i in range(num_reductions):
        accum[i] = reduce_fn[accum_type, simd_width, i](accum[i], v)
    vec_col += VEC_STRIDE
```
The per-thread accumulation in warp_reduce_kernel uses reduce_fn(accum, v) (accumulator as the first operand). Elsewhere in the GPU backend (and the CPU backend) the reduction wrapper is consistently applied as reduce_fn(val, acc) (new value first), so this flips operand order for non-commutative reductions and makes results kernel-dependent. Swap the argument order here (and similarly in the scalar tail / lane-collapse paths) to match the established reduce_fn(val, acc) convention.
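A quick Python sketch of why operand order matters for non-commutative reductions. String concatenation stands in for a custom `reduce_fn` that is associative but not commutative; the `fold` helper and names are hypothetical, used only to contrast the two calling conventions.

```python
def fold(op, xs, init, value_first):
    """Fold xs into an accumulator, applying op with either the new
    value first (op(val, acc)) or the accumulator first (op(acc, val))."""
    acc = init
    for v in xs:
        acc = op(v, acc) if value_first else op(acc, v)
    return acc

# concatenation: associative but non-commutative, like some custom reductions
concat = lambda a, b: a + b
xs = ["a", "b", "c"]

# the two conventions produce different results, so mixing them across
# kernels would make the output depend on which kernel was dispatched
assert fold(concat, xs, "", value_first=True) == "cba"
assert fold(concat, xs, "", value_first=False) == "abc"
```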
```mojo
var lane_accum = accum[i][0]
comptime for lane in range(1, simd_width):
    lane_accum = reduce_fn[accum_type, 1, i](lane_accum, accum[i][lane])
scalar_vals[i] = lane_accum
```
row_reduce now manually collapses the SIMD accumulator to a scalar before block_reduce, but the lane fold uses reduce_fn(lane_accum, accum[i][lane]) (accumulator first). Other call sites in this file reduce as reduce_fn(val, acc) (value first), and the std reduction wrapper contract uses (val, acc). To keep semantics consistent across kernels (especially for non-commutative reductions), flip the operand order in this lane-collapse fold.
!sync
```diff
  var in_host = alloc[Scalar[dtype]](cb_in.alloc_size())
- var res_host = alloc[Scalar[dtype]](out_size)
+ var res_host = alloc[Scalar[dtype]](in_size)
```
This doesn't seem right to me. The host-side results array should be of out_size. Shouldn't res_buffer instead be sized to out_size?
```mojo
comptime contig_simd = simd_width_of[dtype, get_gpu_target()]()
comptime for ax in range(rank):
    if axis == ax:
        comptime is_contig = (ax == rank - 1)
```
There's no need to recompute is_contig here. Just use reduce_contig_dim, and pull reduce_simd out of the for loop.
joeatodd left a comment:
Hi @prsabahrami, thanks for this PR. Did you get a chance to look over my review comments, and those from Copilot?
Apologies for the delay here, will address these tomorrow!
Summary

- Touched files: reduction.mojo, bench_reduce.mojo
- Adds 3072 fixed-turn ILP2 and 4096 warp0-epilogue paths for contiguous axis-2 reductions

Benchmark Notes

- 3072 path, 1x1024x3072 axis=2: +27.53% bf16 and +32.53% f16, versus the conservative control's +19.65% bf16 and +27.08% f16
- 4096 tail path, 1x256x4096 axis=2: 241.07 bf16 / 237.48 f16 GElems/s (+28.58% bf16 and +23.55% f16)
- 1x1024x3072 axis=2: 376.77 bf16 / 365.46 f16 GElems/s
- 1x256x4096 axis=2: 161.52 bf16 / 158.92 f16 GElems/s
- 32x1024x256x1024 axis=3: 1178.02 bf16 / 1385.40 f16 GElems/s