Due to https://bugs.openjdk.org/browse/JDK-8180450, the chain of instanceof checks in HttpObjectEncoder competes to update the Klass::secondary_super_cache field, causing a scalability issue (very relevant with many cores and NUMA topologies).
Detecting this issue isn't easy, for 3 reasons:
- the performance hit won't show up in any Java code (ie there is no BCI - byte code index - to attribute it to)
- updates to the Klass::secondary_super_cache field cause false sharing with other fields sitting on the same cache line that are used by other parts of the instanceof logic (eg the length of Klass::secondary_supers, read by MacroAssembler::check_klass_subtype_slow_path), so other instructions pay a performance penalty without directly causing the issue themselves
- benchmarks (and micro-benchmarks) tend to use just a single HTTP message type, which lets the JIT skip the instanceof checks entirely and place an uncommon trap to guard against new types instead (ie it makes the issue not happen at all)
Reasons 1 and 2 make profilers very confused (see http://psy-lob-saw.blogspot.com/2018/07/how-inlined-code-confusing-profiles.html): other Java methods inlined near (forward or backward - the direction really depends on the profiler) the guilty JIT-emitted checks get all the blame. Only assembly-level analysis using tools like https://builds.shipilev.net/perfasm/ and perf c2c makes it evident which instructions are to blame.
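As a small illustration of how such an assembly-level analysis can be wired up, here is a minimal sketch that runs a JMH benchmark with the perfasm profiler (the HttpEncoderBenchmark class name is hypothetical; perf c2c would still be run separately against the benchmark process):

import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public class PerfAsmRunner {
    public static void main(String[] args) throws RunnerException {
        // Requires Linux perf; prints the hottest compiled regions with annotated assembly,
        // which is where the secondary_super_cache loads/stores become visible.
        Options opts = new OptionsBuilder()
                .include("HttpEncoderBenchmark") // hypothetical benchmark class
                .addProfiler("perfasm")
                .build();
        new Runner(opts).run();
    }
}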
The last reason means that validation benchmarks should consider polluting the JIT profiling data at the instanceof call-sites (see https://wiki.openjdk.org/display/HotSpot/MethodData) in order to have a synthetic way to provoke/detect it.
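A minimal sketch of what such pollution could look like (the nextMessage helper and the alternation scheme are illustrative assumptions, not existing Netty benchmark code): feeding the encoder's write path with at least two concrete message types, during warmup as well as measurement, keeps both instanceof branches alive.

import io.netty.handler.codec.http.DefaultFullHttpRequest;
import io.netty.handler.codec.http.DefaultHttpRequest;
import io.netty.handler.codec.http.HttpMethod;
import io.netty.handler.codec.http.HttpVersion;

public final class TypeProfilePollution {

    // Alternating concrete types at the same call-site pollutes the MethodData type
    // profile, so the JIT has to emit real subtype checks (which touch
    // Klass::secondary_super_cache) instead of a single speculative type test
    // guarded by an uncommon trap.
    static Object nextMessage(long i) {
        return (i & 1) == 0
                ? new DefaultFullHttpRequest(HttpVersion.HTTP_1_1, HttpMethod.GET, "/")
                : new DefaultHttpRequest(HttpVersion.HTTP_1_1, HttpMethod.GET, "/");
    }

    // In the benchmark, hand nextMessage(i) to whatever ends up calling
    // HttpObjectEncoder#write, from warmup onwards.
}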
Sadly, Netty 4.1 cannot rewrite the existing HTTP types (they are public API) so that the encoder stops relying on type information to drive its state changes, but Netty 5 can consider a different approach to solve this performance issue.
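One hypothetical direction (not an actual Netty 5 API; the interface and method names below are invented purely for illustration) is to let the message answer the encoder's questions through ordinary virtual calls:

// Hypothetical sketch: the message exposes what the encoder needs via virtual
// methods, so the encoder performs no instanceof/checkcast on shared Klass metadata.
interface EncoderFriendlyMessage {
    boolean isFullMessage();      // replaces: o instanceof FullHttpMessage
    void releaseIfRefCounted();   // replaces: o instanceof ReferenceCounted + cast + release()
}

void write(EncoderFriendlyMessage m) {
    try {
        if (m.isFullMessage()) {
            // ... full-message handling
        }
        // ... other logic
    } finally {
        m.releaseIfRefCounted();
    }
}

Virtual/interface dispatch goes through inline caches and vtables/itables rather than mutating a shared Klass field, so many threads can take this path concurrently without invalidating each other's cache lines.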
In an ideal world, where each interface type has a single implementor (and each implementor implements just one interface), the issue shouldn't happen.
To better understand (one flavour of) this issue, here is an example.
Let's consider DefaultFullHttpRequest; it implements FullHttpRequest and (transitively) ReferenceCounted too.
In code like this:
void write(Object o) {
    try {
        if (o instanceof FullHttpMessage) {
            foo(o);
        }
        // ... other logic
    } finally {
        if (o instanceof ReferenceCounted) {
            ((ReferenceCounted) o).release();
        }
    }
}
if both instanceof checks observe more than one concrete type, eg DefaultFullHttpRequest and DefaultHttpRequest,
then the checks will make use of the mechanism explained in https://bugs.openjdk.org/browse/JDK-8180450.
For example, if a single thread calls write(o) and o is a DefaultFullHttpRequest, the two checks keep overwriting that Klass's secondary_super_cache field (the first check sets it to FullHttpMessage, the second to ReferenceCounted), which is merely inefficient; but if many threads do it, they compete on the same Klass field, invalidate each other's cache lines, and cause false sharing on the surrounding Klass fields, eg the length of the Klass array used to perform the slow-path check.
In the existing code base this is the simplest form of Klass::secondary_super_cache invalidation, but the encoder's own logic can make it happen too.
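To make the contention concrete, here is a minimal multi-threaded sketch (plain Java, not Netty or encoder code; the structure mirrors the write(Object) example above) that keeps both instanceof branches alive from many threads, so the DefaultFullHttpRequest checks keep overwriting the same secondary_super_cache slot; scaling behaviour can then be observed with perf c2c while varying the thread count:

import java.util.concurrent.TimeUnit;

import io.netty.handler.codec.http.DefaultFullHttpRequest;
import io.netty.handler.codec.http.DefaultHttpRequest;
import io.netty.handler.codec.http.FullHttpMessage;
import io.netty.handler.codec.http.HttpMethod;
import io.netty.handler.codec.http.HttpVersion;
import io.netty.util.ReferenceCounted;

public class SecondarySuperCacheContention {

    static long blackhole;

    // Same shape as the write(Object) example above: two instanceof checks against
    // interfaces that DefaultFullHttpRequest implements, executed back to back.
    static void write(Object o) {
        try {
            if (o instanceof FullHttpMessage) {
                blackhole++;
            }
        } finally {
            if (o instanceof ReferenceCounted) {
                ((ReferenceCounted) o).release();
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        int threads = Runtime.getRuntime().availableProcessors();
        for (int t = 0; t < threads; t++) {
            Thread thread = new Thread(() -> {
                for (long i = 0; ; i++) {
                    // Alternate concrete types so the JIT cannot collapse the checks
                    // into a single speculative type test guarded by an uncommon trap.
                    Object o = (i & 1) == 0
                            ? new DefaultFullHttpRequest(HttpVersion.HTTP_1_1, HttpMethod.GET, "/")
                            : new DefaultHttpRequest(HttpVersion.HTTP_1_1, HttpMethod.GET, "/");
                    write(o);
                }
            });
            thread.setDaemon(true);
            thread.start();
        }
        TimeUnit.MINUTES.sleep(1);
    }
}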