Even though JEP 438 states that both x64 and AArch64 architectures should benefit from the new Vector API, the performance of simdjson-java on an M1 Mac is currently far worse than that of other parsers:
```
Benchmark                                                                 Mode  Cnt     Score    Error  Units
ParseAndSelectBenchmark.countUniqueUsersWithDefaultProfile_fastjson       thrpt    5  1229.991 ± 39.538  ops/s
ParseAndSelectBenchmark.countUniqueUsersWithDefaultProfile_jackson        thrpt    5  1099.877 ±  9.560  ops/s
ParseAndSelectBenchmark.countUniqueUsersWithDefaultProfile_jsoniter       thrpt    5   607.902 ± 10.469  ops/s
ParseAndSelectBenchmark.countUniqueUsersWithDefaultProfile_jsoniter_scala thrpt    5  1930.694 ± 41.766  ops/s
ParseAndSelectBenchmark.countUniqueUsersWithDefaultProfile_simdjson       thrpt    5    26.287 ±  0.295  ops/s
ParseAndSelectBenchmark.countUniqueUsersWithDefaultProfile_simdjsonPadded thrpt    5    26.516 ±  0.686  ops/s
```
This may be due to the use of 256-bit vectors. I have found a thread which states that:
> on AArch64 NEON, the max hardware vector size is 128 bits. So for 256-bits, we are not able to intrinsify to use SIMD directly, which will fall back to Java implementation of those APIs
When running the benchmark with `-XX:+UnlockDiagnosticVMOptions -XX:+PrintIntrinsics`, the following output can be observed, which supports this theory:
```
** not supported: arity=0 op=load vlen=32 etype=byte ismask=no
** not supported: arity=1 op=store vlen=32 etype=byte ismask=no
** not supported: arity=0 op=broadcast vlen=32 etype=byte ismask=0 bcast_mode=0
** not supported: arity=2 op=comp/0 vlen=32 etype=byte ismask=usestore
** not supported: arity=1 op=cast#438/3 vlen2=32 etype2=byte
** not supported: arity=0 op=broadcast vlen=32 etype=byte ismask=0 bcast_mode=0
** not supported: arity=2 op=comp/0 vlen=32 etype=byte ismask=usestore
** not supported: arity=1 op=cast#438/3 vlen2=32 etype2=byte
** not supported: arity=0 op=load vlen=32 etype=byte ismask=no
** not supported: arity=1 op=store vlen=32 etype=byte ismask=no
** not supported: arity=0 op=broadcast vlen=32 etype=byte ismask=0 bcast_mode=0
** not supported: arity=2 op=comp/0 vlen=32 etype=byte ismask=usestore
** not supported: arity=1 op=cast#438/3 vlen2=32 etype2=byte
** not supported: arity=0 op=broadcast vlen=32 etype=byte ismask=0 bcast_mode=0
** not supported: arity=2 op=comp/0 vlen=32 etype=byte ismask=usestore
** not supported: arity=1 op=cast#438/3 vlen2=32 etype2=byte
```
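For what it's worth, a quick way to confirm the 128-bit hardware limit locally is to query the platform-preferred species (this is just an illustrative snippet, run with `--add-modules jdk.incubator.vector`; the class name is made up):

```java
import jdk.incubator.vector.ByteVector;
import jdk.incubator.vector.VectorSpecies;

public class VectorSizeCheck {
    public static void main(String[] args) {
        // SPECIES_PREFERRED is the largest shape the hardware supports directly
        VectorSpecies<Byte> preferred = ByteVector.SPECIES_PREFERRED;
        System.out.println("Preferred byte species: " + preferred);
        System.out.println("Vector bit size: " + preferred.vectorBitSize());
        // Expected: 128 on AArch64 NEON, 256 on AVX2-capable x64
    }
}
```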
Obviously, AArch64 support is not as important as x64, but it may be worthwhile to make the implementation flexible enough to support both architectures. Perhaps the C++ implementation can be used as a reference.
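As a rough sketch of what "flexible" could mean: using `ByteVector.SPECIES_PREFERRED` instead of a hardcoded 256-bit species lets the same loop intrinsify on both AVX2 (256-bit) and NEON (128-bit). This is not simdjson-java's actual code, just a hypothetical example of a species-agnostic byte scan:

```java
import jdk.incubator.vector.ByteVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;
import java.nio.charset.StandardCharsets;

public class QuoteScan {
    // SPECIES_PREFERRED adapts to the hardware: 32 lanes on AVX2, 16 on NEON.
    private static final VectorSpecies<Byte> SPECIES = ByteVector.SPECIES_PREFERRED;

    static long countQuotes(byte[] buf) {
        long count = 0;
        int i = 0;
        int upper = SPECIES.loopBound(buf.length);
        // Vectorized main loop: compare a whole lane-width of bytes at once
        for (; i < upper; i += SPECIES.length()) {
            ByteVector chunk = ByteVector.fromArray(SPECIES, buf, i);
            count += chunk.compare(VectorOperators.EQ, (byte) '"').trueCount();
        }
        // Scalar tail for the remaining bytes
        for (; i < buf.length; i++) {
            if (buf[i] == '"') count++;
        }
        return count;
    }

    public static void main(String[] args) {
        byte[] json = "{\"name\":\"simdjson\"}".getBytes(StandardCharsets.UTF_8);
        System.out.println(countQuotes(json)); // 4
    }
}
```

The trade-off is that species-agnostic code cannot rely on lane-count-specific tricks (e.g. lookup tables sized for exactly 32 lanes), which is presumably why the current implementation pins the vector width.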
Anyway, great work so far on the Java port, the results on x64 are very impressive!