Tags: pytorch/executorch

ciflow/trunk/19044

Arm backend: Fix rejection criteria of TRANSPOSE from VIEW

When delegating a VIEW for Ethos-U55, we were overly pessimistic about
whether the TRANSPOSE needed for the NHWC -> NCHW or NCHW -> NHWC
permutation could be delegated. As a result, some RESHAPEs were left
on the CPU when they could have run on the NPU.

Signed-off-by: George Gekov <george.gekov@arm.com>
Change-Id: I34cc3b38cf0dbb0ceee32ac5d0044805c4e1f085

ciflow/trunk/19043

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature.
Merge branch 'main' into add-fp8-placeholder-support-for-serialization

ciflow/trunk/19015

Replace tosa_dim_order with explicit NCHW↔NHWC permutes (#19015)

Summary:
Pull Request resolved: #19015

Replace implicit `tosa_dim_order`-based layout handling with explicit
`permute_copy` ops around TOSA operators that require NHWC layout.

### Rewrite passes insert explicit NCHW↔NHWC permutes

`RewriteConvPass`, `RewriteAvgPool2dPass`, and `RewriteMaxPool2dPass`
now insert `aten.permute_copy` nodes (NCHW→NHWC before the TOSA op,
NHWC→NCHW after) instead of relying on `ToTosaMemoryFormatPass` for
layout conversion. This makes layout transitions visible in the graph.
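As an illustration of the permute pair described above, here is a minimal shape-level sketch in plain Python. It is not the pass implementation: the helper names (`permute_shape`, `max_pool2d_nhwc_shape`, `wrapped_pool_shape`) are hypothetical, and the pool is assumed to be unpadded with a square kernel.

```python
# Hypothetical shape-level model of "permute -> NHWC op -> permute back".
NCHW_TO_NHWC = (0, 2, 3, 1)
NHWC_TO_NCHW = (0, 3, 1, 2)


def permute_shape(shape, perm):
    """Shape produced by permuting `shape` with `perm`: out[i] = shape[perm[i]]."""
    return tuple(shape[p] for p in perm)


def max_pool2d_nhwc_shape(shape, kernel, stride):
    """Output shape of an unpadded, square-kernel 2-D pool on an NHWC tensor."""
    n, h, w, c = shape
    return (n, (h - kernel) // stride + 1, (w - kernel) // stride + 1, c)


def wrapped_pool_shape(nchw_shape, kernel, stride):
    """NCHW in, NCHW out, with the pool itself seeing only NHWC."""
    nhwc_in = permute_shape(nchw_shape, NCHW_TO_NHWC)    # permute before the op
    nhwc_out = max_pool2d_nhwc_shape(nhwc_in, kernel, stride)
    return permute_shape(nhwc_out, NHWC_TO_NCHW)         # permute after the op


print(wrapped_pool_shape((1, 8, 32, 32), kernel=2, stride=2))  # (1, 8, 16, 16)
```

Because both permutes are ordinary graph nodes in this model, a later pass can reason about them like any other op.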

### Grouped conv decomposition in NHWC

`RewriteConvPass` decomposes grouped convolutions (non-depthwise) into
per-group `TOSA.CONV2D` ops operating entirely in NHWC, with a single
input/output permute pair wrapping the whole group. Supports the INT8
and INT16 (with and without bias) quantisation paths, including the full
INT16+bias chain: CONV2D → RESCALE(INT48→INT32) → ADD(bias) →
RESCALE(INT32→INT16).
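The per-group split can be pictured as slicing the channel axis, which is the last axis in NHWC. The sketch below only computes the channel ranges each group would cover; `group_channel_slices` is a hypothetical helper, and the assumption that groups take equal, consecutive channel ranges follows the usual grouped-convolution definition rather than anything stated in this commit.

```python
def group_channel_slices(in_channels, out_channels, groups):
    """Return [(input_range, output_range), ...], one entry per group.

    Each range is a half-open (start, end) pair along the channel axis.
    """
    if groups <= 0 or in_channels % groups or out_channels % groups:
        raise ValueError("channel counts must be divisible by groups")
    in_per = in_channels // groups
    out_per = out_channels // groups
    return [
        ((g * in_per, (g + 1) * in_per), (g * out_per, (g + 1) * out_per))
        for g in range(groups)
    ]


# 8 input channels, 16 output channels, 4 groups: 2 in and 4 out per group.
for in_range, out_range in group_channel_slices(8, 16, 4):
    print(in_range, "->", out_range)
```

In this picture each group maps to one convolution over its input range, and the outputs are concatenated along the channel axis in group order.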

### `ToTosaMemoryFormatPass` scoped down

The pass now only assigns non-identity dim_order to parameter/buffer
placeholders (for weight serialisation) and graph I/O. It inserts
`permute_copy` instead of `tosa.TRANSPOSE` and skips users that already
carry a matching permute (inserted by the rewrite passes).

### TOSA dialect op metas expect NHWC

All TOSA op meta functions (`CONV2D`, `CONV3D`, `DEPTHWISE_CONV2D`,
`AVG_POOL2D`, `MAX_POOL2D`, `TRANSPOSE_CONV2D`) now assume NHWC
input layout and produce NHWC output shapes.

### Removed `tosa_dim_order` shape remapping

`tosa_shape()` no longer reorders dimensions; it only resolves symints.
`_get_matching_fake_tensor()` returns `node.meta["val"]` directly.
Serialisation mapping always uses identity dim_order.

### Operator serialisation simplified

`op_amax`, `op_amin`, `op_any`, `op_cat`, `op_sum`, and `op_permute`
no longer remap reduction/concat axes through `dim_order` since
tensors are already in the layout expected by TOSA.

### Permute optimisation passes added

Six shared passes from `executorch/backends/transforms/` are now run
after TOSA lowering to fuse, cancel, and simplify the permutes
introduced above:
- `RemovePermutesAroundElementwiseOps` (extended for `RESCALE`)
- `FuseTransposeOrPermuteOpPairsPass` (extended for `RESCALE`)
- `ReplaceNopTransposeOrPermuteWithViewPass`
- `PostponePermuteOpBelowSqueezeOrUnsqueezeLikeView`
- `FuseCascadedTransposeOrPermuteOps`
- `FuseCascadedViewOps`
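To show why back-to-back permutes can be fused or cancelled, the following sketch composes permutations in plain Python. It is an assumption-laden illustration, not the code of any pass listed above; `compose`, `is_identity`, and `fuse_permute_chain` are hypothetical names.

```python
def compose(first, second):
    """Permutation equivalent to applying `first`, then `second`.

    With out[i] = in[perm[i]], applying both gives out[j] = in[first[second[j]]].
    """
    return tuple(first[i] for i in second)


def is_identity(perm):
    return tuple(perm) == tuple(range(len(perm)))


def fuse_permute_chain(perms):
    """Fold a chain of permutes into one; return None if the chain cancels out."""
    if not perms:
        return None
    fused = tuple(range(len(perms[0])))
    for perm in perms:
        fused = compose(fused, perm)
    return None if is_identity(fused) else fused


NCHW_TO_NHWC = (0, 2, 3, 1)
NHWC_TO_NCHW = (0, 3, 1, 2)

# An NHWC op's trailing permute followed by the next op's leading permute cancels.
print(fuse_permute_chain([NHWC_TO_NCHW, NCHW_TO_NHWC]))  # None
# Two identical permutes collapse into a single equivalent one.
print(fuse_permute_chain([NCHW_TO_NHWC, NCHW_TO_NHWC]))  # (0, 3, 1, 2)
```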

### Removed passes

`DecomposeConvWithInt16ActivationPass` and `DecomposeGroupedConvPass`
are removed; their logic is now handled inline by `RewriteConvPass`.
`RewriteSlicePass` is repositioned after the permute optimisations.

### Ethos-U55 partitioner simplified

The dual NCHW/NHWC permute constraint check is removed since tensors
are always in the expected layout at partition time.

Differential Revision: D100712787

ciflow/trunk/18948

Verified

Merge branch 'main' into change-1241864

ciflow/trunk/18827

Add FuseConcatPass to eliminate redundant concat ops (#18827)

Summary:

Concat (torch.cat) in the Gen2 ExecuTorch Arm/Ethos-U stack is lowered to
TOSA CONCAT, which Vela then converts to N x MemoryCopy operations — real
DMA data movement on the NPU. This pass eliminates concat operations that
can be proven unnecessary at the FX graph level, preventing Vela from
generating MemoryCopy ops entirely.

Inspired by Espresso's concat elimination techniques
(bolt/nn/espresso/transforms/remove_nops.py), three patterns are handled:

1. Single-input concat: cat([x]) is a no-op, replaced with x.
2. Concat-then-slice: if every consumer of cat([a, b, ...]) is a
   slice_copy that extracts exactly one original input, bypass both.
3. Slice-then-concat: if contiguous slices of the same tensor are
   concatenated back, the result is the original tensor.
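The three patterns above can be sketched on a toy one-axis model in plain Python. This is not the FuseConcatPass code: tensors are represented only by name and extent along the concat axis, all names are hypothetical, and pattern 3 is assumed to require the slices to cover the whole source tensor.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Slice:
    source: str  # name of the tensor being sliced
    start: int   # inclusive offset along the concat axis
    end: int     # exclusive offset along the concat axis


def single_input_concat(inputs: List[str]) -> Optional[str]:
    """Pattern 1: cat([x]) is just x."""
    return inputs[0] if len(inputs) == 1 else None


def concat_then_slice(inputs: List[Tuple[str, int]], start: int, end: int) -> Optional[str]:
    """Pattern 2: a slice of cat(inputs) that extracts exactly one input.

    `inputs` holds (name, size along the concat axis). A caller would need
    every consumer of the concat to match before bypassing it.
    """
    offset = 0
    for name, size in inputs:
        if (start, end) == (offset, offset + size):
            return name
        offset += size
    return None


def slice_then_concat(slices: List[Slice], source_size: int) -> Optional[str]:
    """Pattern 3: contiguous slices of one tensor, concatenated back in order."""
    if not slices:
        return None
    source = slices[0].source
    expected_start = 0
    for s in slices:
        if s.source != source or s.start != expected_start or s.end <= s.start:
            return None
        expected_start = s.end
    return source if expected_start == source_size else None


print(single_input_concat(["x"]))                                 # x
print(concat_then_slice([("a", 4), ("b", 6)], 4, 10))             # b
print(slice_then_concat([Slice("t", 0, 3), Slice("t", 3, 8)], 8)) # t
```

Each helper returns the name of the tensor that could replace the concat result, or `None` when the pattern does not apply.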

Differential Revision: D97667069

ciflow/trunk/18758

Add multi-reader tests for Add/Sub rescale fusion (#18758)

Summary:

Add AddMultiReader and SubMultiReader test models (conv2(conv1(x)) +/- conv3(conv1(x)))
where conv1's output Rescale has two readers. These exercise the multi-reader
per-consumer fusion loop.

Covers TOSA INT, U55 INT, and U85 INT for both Add and Sub (6 new tests).

Reviewed By: digantdesai

Differential Revision: D99939008

ciflow/trunk/18285

Verified

Merge branch 'main' into change-1223321

ciflow/cuda/18903

remove extra tensor move action

ciflow/cuda/18901

Verified

Merge branch 'main' into per-weight-constant-cache

ciflow/cuda/18809

remove unused env var