Expand contiguous reduce specializations#6338

Open
prsabahrami wants to merge 2 commits into modular:main from prsabahrami:prs/reduce-warp-contiguous-fastpath

Conversation

Contributor

@prsabahrami prsabahrami commented Apr 2, 2026

Summary

  • expand the contiguous reduction specializations in reduction.mojo
  • keep the host-side benchmark sizing fix in bench_reduce.mojo
  • retain the live 3072 fixed-turn ILP2 and 4096 warp0-epilogue paths for contiguous axis-2 reductions

Benchmark Notes

  • historical branch-note 3072 path:
    • 1x1024x3072 axis=2: +27.53% bf16 and +32.53% f16 versus the conservative control
    • even against the prior high-water controls, the same shape remains +19.65% bf16 and +27.08% f16
  • historical branch-note 4096 tail path:
    • 1x256x4096 axis=2: 241.07 bf16 / 237.48 f16 GElems/s
    • versus the kept R100 controls: +28.58% bf16 and +23.55% f16
  • fresh exact-head B200 rerun on the current PR head produced:
    • 1x1024x3072 axis=2: 376.77 bf16 / 365.46 f16 GElems/s
    • 1x256x4096 axis=2: 161.52 bf16 / 158.92 f16 GElems/s
    • 32x1024x256x1024 axis=3: 1178.02 bf16 / 1385.40 f16 GElems/s
  • the current exact-head rerun still succeeds on all 23 benchmark items, but it does not reproduce every earlier control-relative percentage or throughput note exactly; those earlier percentages should be read as branch-note campaign context

Copilot AI review requested due to automatic review settings April 2, 2026 01:37
@prsabahrami prsabahrami requested review from a team as code owners April 2, 2026 01:37
@github-actions github-actions bot added the mojo-stdlib and waiting-on-review labels Apr 2, 2026
Contributor

Copilot AI left a comment


Pull request overview

This PR updates the GPU reduction backend to improve contiguous-axis reduction performance and fixes a benchmark host buffer sizing issue.

Changes:

  • Fix host-side result buffer sizing in bench_reduce.mojo.
  • Collapse per-thread SIMD row accumulators to scalars before block_reduce.
  • Add a warp-level contiguous-axis fast path (warp_reduce_kernel) and update kernel dispatch/SIMD selection in reduce_launch.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File Description
mojo/stdlib/std/algorithm/backend/gpu/reduction.mojo Adds warp-level contiguous reduction kernel, changes SIMD→scalar collapse before block reduce, and adjusts dispatch/SIMD width selection.
max/kernels/benchmarks/gpu/bench_reduce.mojo Fixes host-side buffer allocation to match the copied device buffer size.


Comment on lines +640 to +646
var tail_col = (row_size // VEC_STRIDE) * VEC_STRIDE + lid
while tail_col < row_size:
    row_coords[axis] = tail_col
    var v = input_fn[dtype, 1, rank](row_coords).cast[accum_type]()
    comptime for i in range(num_reductions):
        scalar_accum[i] = reduce_fn[accum_type, 1, i](scalar_accum[i], v)
    tail_col += WARP_SIZE

Copilot AI Apr 2, 2026


In warp_reduce_kernel, tail_col starts at (row_size // VEC_STRIDE) * VEC_STRIDE + lid, but the vector loop already covers all elements < vec_limit (including the remainder when row_size < VEC_STRIDE or simd_width == 1). This causes some elements to be reduced twice (e.g. row_size=64, simd_width=8; or row_size=65, simd_width=1), producing incorrect results. Compute the scalar-tail start from the number of full SIMD vectors instead (e.g. tail_start = (row_size // simd_width) * simd_width, then tail_col = tail_start + lid) so only the final row_size % simd_width elements are handled in the tail loop.
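The double-count can be seen by comparing the two tail-start formulas directly. Below is a minimal Python sketch of the indexing arithmetic only (the helper names are hypothetical, and WARP_SIZE = 32 is assumed; this is not the Mojo kernel itself):

```python
WARP_SIZE = 32  # assumed warp width for illustration

def tail_start_buggy(row_size: int, simd_width: int) -> int:
    # As written in the kernel: tail start derived from the full
    # warp-wide vector stride, not from the SIMD width.
    vec_stride = WARP_SIZE * simd_width
    return (row_size // vec_stride) * vec_stride

def tail_start_fixed(row_size: int, simd_width: int) -> int:
    # Suggested fix: only the final row_size % simd_width elements
    # belong to the scalar tail.
    return (row_size // simd_width) * simd_width

# row_size=64, simd_width=8: buggy tail starts at 0, so the scalar tail
# revisits every element the vector loop already reduced; fixed tail
# starts at 64, leaving the tail loop empty.
print(tail_start_buggy(64, 8), tail_start_fixed(64, 8))   # 0 64

# row_size=65, simd_width=1: the vector loop is already scalar and covers
# everything, but the buggy formula re-reduces element 64.
print(tail_start_buggy(65, 1), tail_start_fixed(65, 1))   # 64 65
```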

Comment on lines +624 to +629
while vec_col < vec_limit:
    row_coords[axis] = vec_col
    var v = input_fn[dtype, simd_width, rank](row_coords).cast[accum_type]()
    comptime for i in range(num_reductions):
        accum[i] = reduce_fn[accum_type, simd_width, i](accum[i], v)
    vec_col += VEC_STRIDE

Copilot AI Apr 2, 2026


The per-thread accumulation in warp_reduce_kernel uses reduce_fn(accum, v) (accumulator as the first operand). Elsewhere in the GPU backend (and the CPU backend) the reduction wrapper is consistently applied as reduce_fn(val, acc) (new value first), so this flips operand order for non-commutative reductions and makes results kernel-dependent. Swap the argument order here (and similarly in the scalar tail / lane-collapse paths) to match the established reduce_fn(val, acc) convention.
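The operand-order hazard is easy to demonstrate with a deliberately non-commutative reducer. A Python sketch (not the Mojo wrapper itself) using an argmin over (value, index) pairs that keeps the accumulator on ties:

```python
def reduce_fn(val, acc):
    # Non-commutative on purpose: on a tie of values, the accumulator
    # wins, so the result depends on which operand is treated as "acc".
    return val if val[0] < acc[0] else acc

pairs = [(5, 0), (3, 1), (3, 2)]

# Established convention: reduce_fn(val, acc), new value first.
acc = pairs[0]
for p in pairs[1:]:
    acc = reduce_fn(p, acc)
# acc == (3, 1): the earliest occurrence of the minimum is kept.

# Flipped order, as in the kernel's accumulation loop.
acc_flipped = pairs[0]
for p in pairs[1:]:
    acc_flipped = reduce_fn(acc_flipped, p)
# acc_flipped == (3, 2): the later occurrence wins instead.
```

Both folds see the same elements; only the argument order differs, yet the results disagree, which is exactly what makes a mixed-convention backend kernel-dependent.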

Comment on lines +340 to +343
var lane_accum = accum[i][0]
comptime for lane in range(1, simd_width):
    lane_accum = reduce_fn[accum_type, 1, i](lane_accum, accum[i][lane])
scalar_vals[i] = lane_accum

Copilot AI Apr 2, 2026


row_reduce now manually collapses the SIMD accumulator to a scalar before block_reduce, but the lane fold uses reduce_fn(lane_accum, accum[i][lane]) (accumulator first). Other call sites in this file reduce as reduce_fn(val, acc) (value first), and the std reduction wrapper contract uses (val, acc). To keep semantics consistent across kernels (especially for non-commutative reductions), flip the operand order in this lane-collapse fold.

@abduld
Contributor

abduld commented Apr 2, 2026

!sync

@modularbot modularbot added the imported-internally Signals that a given pull request has been imported internally. label Apr 2, 2026

  var in_host = alloc[Scalar[dtype]](cb_in.alloc_size())
- var res_host = alloc[Scalar[dtype]](out_size)
+ var res_host = alloc[Scalar[dtype]](in_size)
Contributor


This doesn't seem right to me. The host-side results array should be of size out_size. Shouldn't res_buffer instead be sized to out_size?
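For reference, the expected host result size is the product of every dimension except the reduced axis. A Python sketch of that sizing (the function name is hypothetical, shapes taken from the benchmark notes above):

```python
import math

def out_size(shape, axis):
    # Result of reducing `shape` along `axis`: all other dims survive.
    return math.prod(d for i, d in enumerate(shape) if i != axis)

# Reducing 1x1024x3072 over axis=2 leaves 1 * 1024 = 1024 elements.
print(out_size((1, 1024, 3072), 2))           # 1024

# Reducing 32x1024x256x1024 over axis=3 leaves 32 * 1024 * 256 elements.
print(out_size((32, 1024, 256, 1024), 3))     # 8388608
```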

comptime contig_simd = simd_width_of[dtype, get_gpu_target()]()
comptime for ax in range(rank):
    if axis == ax:
        comptime is_contig = (ax == rank - 1)
Contributor


Don't need to recompute is_contig here. Just use reduce_contig_dim and pull reduce_simd out of the for loop.
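A minimal Python sketch of the suggested hoist, assuming reduce_contig_dim is a boolean already available outside the unrolled loop (the helper name and surrounding logic here are assumptions, not the Mojo code):

```python
def choose_reduce_simd(reduce_contig_dim: bool, contig_simd: int) -> int:
    # Hoisted out of the per-axis loop: the reduced axis is already
    # known, so contiguity need not be rederived from ax == rank - 1
    # on every unrolled iteration.
    return contig_simd if reduce_contig_dim else 1

print(choose_reduce_simd(True, 8))   # 8: contiguous axis, full SIMD width
print(choose_reduce_simd(False, 8))  # 1: strided axis, scalar loads
```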

@prsabahrami prsabahrami changed the title Add a warp-level fast path for contiguous reduce Expand contiguous reduce specializations Apr 6, 2026
Contributor

@joeatodd joeatodd left a comment


Hi @prsabahrami, thanks for this PR. Did you get a chance to look over my review comments, and those from Copilot?

@prsabahrami
Contributor Author

Apologies for the delay here, will address these tomorrow!


Labels

imported-internally (signals that a given pull request has been imported internally) · Kernels · mojo-stdlib (issues related to the standard library) · waiting-on-review
