Even though JEP 438 states that both x64 and AArch64 architectures should benefit from the new Vector API, the performance of simdjson-java on an M1 Mac is currently far worse than that of other parsers:
```
Benchmark                                                                 Mode  Cnt     Score    Error  Units
ParseAndSelectBenchmark.countUniqueUsersWithDefaultProfile_fastjson       thrpt    5  1229.991 ± 39.538  ops/s
ParseAndSelectBenchmark.countUniqueUsersWithDefaultProfile_jackson        thrpt    5  1099.877 ±  9.560  ops/s
ParseAndSelectBenchmark.countUniqueUsersWithDefaultProfile_jsoniter       thrpt    5   607.902 ± 10.469  ops/s
ParseAndSelectBenchmark.countUniqueUsersWithDefaultProfile_jsoniter_scala thrpt    5  1930.694 ± 41.766  ops/s
ParseAndSelectBenchmark.countUniqueUsersWithDefaultProfile_simdjson       thrpt    5    26.287 ±  0.295  ops/s
ParseAndSelectBenchmark.countUniqueUsersWithDefaultProfile_simdjsonPadded thrpt    5    26.516 ±  0.686  ops/s
```
This may be due to the use of 256-bit vectors. I have found a thread which states that:
> on AArch64 NEON, the max hardware vector size is 128 bits. So for 256-bits, we are not able to intrinsify to use SIMD directly, which will fall back to Java implementation of those APIs
When running the benchmark with `-XX:+UnlockDiagnosticVMOptions -XX:+PrintIntrinsics`, the following output can be observed, which supports this theory:
```
** not supported: arity=0 op=load vlen=32 etype=byte ismask=no
** not supported: arity=1 op=store vlen=32 etype=byte ismask=no
** not supported: arity=0 op=broadcast vlen=32 etype=byte ismask=0 bcast_mode=0
** not supported: arity=2 op=comp/0 vlen=32 etype=byte ismask=usestore
** not supported: arity=1 op=cast#438/3 vlen2=32 etype2=byte
** not supported: arity=0 op=broadcast vlen=32 etype=byte ismask=0 bcast_mode=0
** not supported: arity=2 op=comp/0 vlen=32 etype=byte ismask=usestore
** not supported: arity=1 op=cast#438/3 vlen2=32 etype2=byte
** not supported: arity=0 op=load vlen=32 etype=byte ismask=no
** not supported: arity=1 op=store vlen=32 etype=byte ismask=no
** not supported: arity=0 op=broadcast vlen=32 etype=byte ismask=0 bcast_mode=0
** not supported: arity=2 op=comp/0 vlen=32 etype=byte ismask=usestore
** not supported: arity=1 op=cast#438/3 vlen2=32 etype2=byte
** not supported: arity=0 op=broadcast vlen=32 etype=byte ismask=0 bcast_mode=0
** not supported: arity=2 op=comp/0 vlen=32 etype=byte ismask=usestore
** not supported: arity=1 op=cast#438/3 vlen2=32 etype2=byte
```
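For what it's worth, a quick way to confirm the 128-bit hardware limit locally is to query the platform-preferred species (this is just an illustrative snippet, run with `--add-modules jdk.incubator.vector`; the class name is made up):

```java
import jdk.incubator.vector.ByteVector;
import jdk.incubator.vector.VectorSpecies;

public class VectorSizeCheck {
    public static void main(String[] args) {
        // SPECIES_PREFERRED is the largest shape the hardware supports directly
        VectorSpecies<Byte> preferred = ByteVector.SPECIES_PREFERRED;
        System.out.println("Preferred byte species: " + preferred);
        System.out.println("Vector bit size: " + preferred.vectorBitSize());
        // Expected: 128 on AArch64 NEON, 256 on AVX2-capable x64
    }
}
```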
Obviously, AArch64 support is not as important as x64, but it may be worthwhile to make the implementation flexible enough to support both architectures. Perhaps the C++ implementation can be used as a reference.
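As a rough sketch of what "flexible" could mean: using `ByteVector.SPECIES_PREFERRED` instead of a hardcoded 256-bit species lets the same loop intrinsify on both AVX2 (256-bit) and NEON (128-bit). This is not simdjson-java's actual code, just a hypothetical example of a species-agnostic byte scan:

```java
import jdk.incubator.vector.ByteVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;
import java.nio.charset.StandardCharsets;

public class QuoteScan {
    // SPECIES_PREFERRED adapts to the hardware: 32 lanes on AVX2, 16 on NEON.
    private static final VectorSpecies<Byte> SPECIES = ByteVector.SPECIES_PREFERRED;

    static long countQuotes(byte[] buf) {
        long count = 0;
        int i = 0;
        int upper = SPECIES.loopBound(buf.length);
        // Vectorized main loop: compare a whole lane-width of bytes at once
        for (; i < upper; i += SPECIES.length()) {
            ByteVector chunk = ByteVector.fromArray(SPECIES, buf, i);
            count += chunk.compare(VectorOperators.EQ, (byte) '"').trueCount();
        }
        // Scalar tail for the remaining bytes
        for (; i < buf.length; i++) {
            if (buf[i] == '"') count++;
        }
        return count;
    }

    public static void main(String[] args) {
        byte[] json = "{\"name\":\"simdjson\"}".getBytes(StandardCharsets.UTF_8);
        System.out.println(countQuotes(json)); // 4
    }
}
```

The trade-off is that species-agnostic code cannot rely on lane-count-specific tricks (e.g. lookup tables sized for exactly 32 lanes), which is presumably why the current implementation pins the vector width.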
Anyway, great work so far on the Java port, the results on x64 are very impressive!