improve performance for Java "block-to-interface" conversion #9401

kares wants to merge 10 commits
Conversation

headius left a comment
This is basically ok but there's a lot of duplicated code from elsewhere and we could consider using MethodHandle instead of reflection objects in some of these places. The logic seems sound; generate a class for Interface### to Proc dispatch and reuse it. I have some concerns about how these are being cached, for situations where many classes are encountered rarely or only once.
Overall I approve but we can chat through some improvements before merging.
I'd also like to see the benchmark code somewhere so we can continue to audit and profile those cases.
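On the caching concern above (interfaces encountered rarely or only once), one lock-free option is `ClassValue`, whose entries are tied to the lifetime of the key class. The sketch below is hypothetical, not JRuby's actual code: `ImplCache` and the `factory` callback are illustrative stand-ins for "generate an impl class for this interface once and reuse it".

```java
import java.util.function.Function;

// Hypothetical sketch (ImplCache and factory are illustrative, not JRuby's
// actual code): cache one generated artifact per interface via ClassValue,
// so an interface that is seen rarely or only once does not pin its
// generated value forever -- the entry can die with the key class.
public class ImplCache {
    private final ClassValue<Function<Object, Object>> cache;

    public ImplCache(Function<Class<?>, Function<Object, Object>> factory) {
        this.cache = new ClassValue<Function<Object, Object>>() {
            @Override
            protected Function<Object, Object> computeValue(Class<?> iface) {
                // computed on first use; even under a race, ClassValue
                // installs a single winning value per class
                return factory.apply(iface);
            }
        };
    }

    public Function<Object, Object> get(Class<?> iface) {
        return cache.get(iface);
    }
}
```

Unlike a plain `ConcurrentHashMap<Class, ...>`, this needs no explicit eviction and takes no global lock on the lookup path.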
implClass = defineImplClass(loader, interfaceType, implClassName);
}

constructor = (Constructor<? extends BlockInterfaceTemplate>) implClass.getConstructor(RubyProc.class);
Of course this stuff uses java.lang.reflect all over the place (both before and after this change), but we could be using MethodHandles for all of these and skip the overhead of a reflective invocation.
Tried unreflecting the constructor into a handle; no difference with the benchmarks mentioned in the PR.
The handle needs a static root of some kind: either a static final field (not possible here) or an interface implementation generated via LambdaMetafactory. Without one of those it is usually not much faster than reflection, which itself uses unrooted handles internally.
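To make the "static root" point concrete, here is a hypothetical sketch (the `Impl` class is illustrative): the same constructor reached through an unreflected handle held in a local versus a handle stored in a `static final` field. Only the latter is a JIT-visible constant that can inline down to a plain `new`; the former typically performs about like `Constructor.newInstance()`, matching the benchmark observation above.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.lang.reflect.Constructor;

// Hypothetical sketch (Impl is illustrative, not JRuby code).
public class HandleRootingDemo {
    public static class Impl {
        public final String tag;
        public Impl(String tag) { this.tag = tag; }
    }

    // static final root: the JVM can treat this handle as a constant
    static final MethodHandle ROOTED;
    static {
        try {
            ROOTED = MethodHandles.lookup().findConstructor(
                    Impl.class, MethodType.methodType(void.class, String.class));
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    public static Impl viaUnrootedHandle() throws Throwable {
        Constructor<Impl> ctor = Impl.class.getConstructor(String.class);
        MethodHandle unrooted = MethodHandles.lookup().unreflectConstructor(ctor);
        // correct, but without a constant root it is rarely inlined
        return (Impl) unrooted.invoke("unrooted");
    }

    public static Impl viaRootedHandle() throws Throwable {
        // constant handle: inlinable at the call site
        return (Impl) ROOTED.invokeExact("rooted");
    }
}
```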
/**
 * Loads {@code argIndex} from the Java arg slots and boxes it if primitive.
 */
private static void loadBoxedArg(GeneratorAdapter ga, int argIndex, Class<?> paramType) {
These three utility methods already exist in various forms throughout JRuby. Look at RealClassGenerator for some examples. We don't need to reimplement primitive boxing and unboxing and class literal loading again.
Force-pushed from 128b86d to 7fc73e2
@JIT
@SuppressWarnings("unused")
protected final IRubyObject __ruby_call(final Class<?> returnType) {
    return block.call(runtime.getCurrentContext());
While this seems a little abandoned (it could be part of the generated class), on occasion I wanted to be able to set a breakpoint around the "block-to-interface" execution, in Java; having the "template" super-class allows for that. No hard feelings if the preference is to simply get rid of this.
I get the desire and it's nice to be able to fall back on something debuggable. I wish there were a way to specify __FILE__ and __LINE__ in Java so you could provide the original generated code lines in the stack trace.
Force-pushed from 7fc73e2 to d1ff8e7
Force-pushed from d1ff8e7 to bfb3c62
Passing a Ruby block to a Java method that expects a SAM (single-abstract-method, "functional interface") type has evolved into having very bad performance, despite being one of the very useful JI features available in JRuby. Current behavior:

- creates a `RubyProc` (to wrap the block)
- creates a singleton class (`MetaClass`) for the proc object
- `singletonClass.include(<interfaceModule>)` - `synchronized(runtime.hierarchyLock)`
- the `.extended` callback re-invokes `singletonClass.include(<interfaceModule>)` - a no-op, but `includeModule` still calls `invalidateCacheDescendants` unconditionally, taking the same lock again
- `addMethod("method_missing")` - `synchronized(hierarchyLock)`
- `addMethod("<sam-method>")` - `synchronized(hierarchyLock)`

That is 4 acquisitions per task on a single monitor.
Under load with multiple threads, the lack of caching becomes very noticeable and prevents proper concurrent execution.
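A toy illustration of why this serializes under load (not JRuby code; the names `hierarchyLock` and `task` here only mirror the acquisitions listed in the description): every task takes the same global monitor four times, so additional threads mostly wait rather than make progress.

```java
// Toy illustration, not JRuby code: each task acquires one shared monitor
// four times, mirroring the four hierarchyLock acquisitions per
// block-to-interface conversion described above.
public class MonitorContention {
    static final Object hierarchyLock = new Object();
    static long counter = 0;

    static void task() {
        for (int i = 0; i < 4; i++) {          // 4 acquisitions per task
            synchronized (hierarchyLock) {
                counter++;                     // all threads serialize here
            }
        }
    }

    public static void runThreads(int threads, int tasksPerThread)
            throws InterruptedException {
        Thread[] ts = new Thread[threads];
        for (int t = 0; t < ts.length; t++) {
            ts[t] = new Thread(() -> {
                for (int i = 0; i < tasksPerThread; i++) task();
            });
            ts[t].start();
        }
        for (Thread t : ts) t.join();
    }
}
```

A lock-free cached path removes this shared monitor entirely, which is where the multi-threaded gains come from.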
The fix introduces a specialised, lock-free conversion path for blocks that originate at the Java-integration layer and are known to be used only for block-to-interface execution; existing semantics for user-land `RubyProc`s are preserved.

Performance
Numbers for the included micro benchmark: https://gist.github.com/kares/c4616956570fd58515ad6f0ffd822f8f
The improvement seems to be on the order of 20-30x, and with multi-threaded execution far beyond that (no more contention between threads).