Tags: pytorch/executorch
Arm backend: Fix rejection criteria of TRANSPOSE from VIEW

When delegating a VIEW for Ethos-U55, we were overly pessimistic about whether we could delegate the TRANSPOSE needed for the NHWC -> NCHW or NCHW -> NHWC permutation. As a result, some RESHAPEs were left on the CPU when they could actually have run on the NPU.

Signed-off-by: George Gekov <george.gekov@arm.com>
Change-Id: I34cc3b38cf0dbb0ceee32ac5d0044805c4e1f085
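For reference, the NHWC <-> NCHW permutations this TRANSPOSE implements can be shown with a short PyTorch snippet (illustrative only; not the backend's code):

```python
import torch

# NCHW -> NHWC uses perm (0, 2, 3, 1); NHWC -> NCHW uses perm (0, 3, 1, 2).
x_nchw = torch.randn(1, 3, 8, 8)        # (N, C, H, W)
x_nhwc = x_nchw.permute(0, 2, 3, 1)     # (N, H, W, C)
back = x_nhwc.permute(0, 3, 1, 2)       # round-trips back to NCHW
assert x_nhwc.shape == (1, 8, 8, 3)
assert torch.equal(back, x_nchw)
```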
Merge branch 'main' into add-fp8-placeholder-support-for-serialization
Replace tosa_dim_order with explicit NCHW↔NHWC permutes (#19015)

Summary:
Pull Request resolved: #19015

Replace implicit `tosa_dim_order`-based layout handling with explicit `permute_copy` ops around TOSA operators that require NHWC layout.

### Rewrite passes insert explicit NCHW↔NHWC permutes

`RewriteConvPass`, `RewriteAvgPool2dPass`, and `RewriteMaxPool2dPass` now insert `aten.permute_copy` nodes (NCHW→NHWC before the TOSA op, NHWC→NCHW after) instead of relying on `ToTosaMemoryFormatPass` for layout conversion. This makes layout transitions visible in the graph.

### Grouped conv decomposition in NHWC

`RewriteConvPass` decomposes grouped convolutions (non-depthwise) into per-group `TOSA.CONV2D` ops operating entirely in NHWC, with a single input/output permute pair wrapping the whole group. Supports the INT8 and INT16 (with and without bias) quantisation paths, including the full INT16+bias chain: CONV2D → RESCALE(INT48→INT32) → ADD(bias) → RESCALE(INT32→INT16).

### `ToTosaMemoryFormatPass` scoped down

Now only assigns non-identity dim_order to parameter/buffer placeholders (for weight serialisation) and graph I/O. Inserts `permute_copy` instead of `tosa.TRANSPOSE`. Skips users that already carry a matching permute (inserted by the rewrite passes).

### TOSA dialect op metas expect NHWC

All TOSA op meta functions (`CONV2D`, `CONV3D`, `DEPTHWISE_CONV2D`, `AVG_POOL2D`, `MAX_POOL2D`, `TRANSPOSE_CONV2D`) now assume NHWC input layout and produce NHWC output shapes.

### Removed `tosa_dim_order` shape remapping

`tosa_shape()` no longer reorders dimensions; it just resolves symints. `_get_matching_fake_tensor()` returns `node.meta["val"]` directly. Serialisation mapping always uses identity dim_order.

### Operator serialisation simplified

`op_amax`, `op_amin`, `op_any`, `op_cat`, `op_sum`, and `op_permute` no longer remap reduction/concat axes through `dim_order`, since tensors are already in the layout expected by TOSA.
### Permute optimisation passes added

Six shared passes from `executorch/backends/transforms/` are now run after TOSA lowering to fuse, cancel, and simplify the permutes introduced above:

- `RemovePermutesAroundElementwiseOps` (extended for `RESCALE`)
- `FuseTransposeOrPermuteOpPairsPass` (extended for `RESCALE`)
- `ReplaceNopTransposeOrPermuteWithViewPass`
- `PostponePermuteOpBelowSqueezeOrUnsqueezeLikeView`
- `FuseCascadedTransposeOrPermuteOps`
- `FuseCascadedViewOps`

### Removed passes

`DecomposeConvWithInt16ActivationPass` and `DecomposeGroupedConvPass` are removed; their logic is now handled inline by `RewriteConvPass`. `RewriteSlicePass` is repositioned after the permute optimisations.

### Ethos-U55 partitioner simplified

The dual NCHW/NHWC permute constraint check is removed, since tensors are always in the expected layout at partition time.

Differential Revision: D100712787
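The grouped-conv decomposition this PR describes can be sketched at the tensor level. This is a hypothetical illustration of the arithmetic only (layout permutes and quantisation omitted); the real `RewriteConvPass` performs this rewrite on the FX graph with per-group `TOSA.CONV2D` nodes:

```python
import torch
import torch.nn.functional as F

def grouped_conv_as_per_group(x, weight, bias, groups):
    # Split a grouped (non-depthwise) conv into per-group plain convs,
    # then concatenate the per-group outputs along the channel dim.
    n, c_in, _, _ = x.shape
    c_out = weight.shape[0]
    in_per_g, out_per_g = c_in // groups, c_out // groups
    outs = []
    for g in range(groups):
        xs = x[:, g * in_per_g:(g + 1) * in_per_g]
        ws = weight[g * out_per_g:(g + 1) * out_per_g]
        bs = None if bias is None else bias[g * out_per_g:(g + 1) * out_per_g]
        outs.append(F.conv2d(xs, ws, bs))
    return torch.cat(outs, dim=1)

x = torch.randn(1, 8, 5, 5)
w = torch.randn(12, 4, 3, 3)   # groups=2: 8 in-channels, 4 per group
b = torch.randn(12)
ref = F.conv2d(x, w, b, groups=2)
assert torch.allclose(grouped_conv_as_per_group(x, w, b, 2), ref, atol=1e-5)
```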
Add FuseConcatPass to eliminate redundant concat ops (#18827)

Summary:
Concat (`torch.cat`) in the Gen2 ExecuTorch ARM/Ethos-U stack is lowered to TOSA CONCAT, which Vela then converts to N x MemoryCopy operations, i.e. real DMA data movement on the NPU. This pass eliminates concat operations that can be proven unnecessary at the FX graph level, preventing Vela from generating MemoryCopy ops entirely.

Inspired by Espresso's concat elimination techniques (bolt/nn/espresso/transforms/remove_nops.py), three patterns are handled:

1. Single-input concat: `cat([x])` is a no-op, replaced with `x`.
2. Concat-then-slice: if every consumer of `cat([a, b, ...])` is a `slice_copy` that extracts exactly one original input, bypass both.
3. Slice-then-concat: if contiguous slices of the same tensor are concatenated back, the result is the original tensor.

Differential Revision: D97667069
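The three patterns can be verified at the tensor level with a short snippet (the actual pass matches them structurally on the FX graph, not by comparing tensor values):

```python
import torch

a, b = torch.randn(2, 3), torch.randn(4, 3)

# 1. Single-input concat: cat([x]) is exactly x.
assert torch.equal(torch.cat([a]), a)

# 2. Concat-then-slice: slicing cat([a, b]) at the original boundaries
#    recovers the inputs, so both ops can be bypassed.
cat_ab = torch.cat([a, b], dim=0)
assert torch.equal(cat_ab[0:2], a)
assert torch.equal(cat_ab[2:6], b)

# 3. Slice-then-concat: contiguous slices of one tensor, concatenated
#    back in order, reproduce the original tensor.
t = torch.randn(6, 3)
assert torch.equal(torch.cat([t[0:2], t[2:6]], dim=0), t)
```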
Add multi-reader tests for Add/Sub rescale fusion (#18758)

Summary:
Add AddMultiReader and SubMultiReader test models (conv2(conv1(x)) +/- conv3(conv1(x))) where conv1's output Rescale has two readers. These exercise the multi-reader per-consumer fusion loop. Covers TOSA INT, U55 INT, and U85 INT for both Add and Sub (6 new tests).

Reviewed By: digantdesai
Differential Revision: D99939008
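The test topology reads as follows in eager PyTorch. This is a sketch of the described shape only; channel counts and kernel sizes are assumptions, not the values from the actual test models:

```python
import torch
import torch.nn as nn

class AddMultiReader(nn.Module):
    # conv1's output feeds both conv2 and conv3, so after quantisation its
    # output Rescale has two readers, exercising per-consumer fusion.
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 8, 3, padding=1)  # assumed channel counts
        self.conv2 = nn.Conv2d(8, 8, 3, padding=1)
        self.conv3 = nn.Conv2d(8, 8, 3, padding=1)

    def forward(self, x):
        y = self.conv1(x)                 # single producer, two readers
        return self.conv2(y) + self.conv3(y)

m = AddMultiReader()
out = m(torch.randn(1, 3, 16, 16))
assert out.shape == (1, 8, 16, 16)
```

The Sub variant replaces the final `+` with `-`.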
Merge branch 'main' into per-weight-constant-cache