Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Alternative method for single returns #359

Closed
bhelx opened this issue May 30, 2024 · 11 comments
Closed

Alternative method for single returns #359

bhelx opened this issue May 30, 2024 · 11 comments

Comments

@bhelx
Copy link
Contributor

bhelx commented May 30, 2024

I initially made the ExportFunction have a single apply method which represents the real life interface of wasm functions (so it supports multiple returns). However, 99% of the time I just pull the first element. How can we make this more ergonomic?

One idea I've had is to just add another method that returns the first element:

public interface ExportFunction {
    Value[] apply(Value... args) throws ChicoryException;
    Value applySingle(Value... args) throws ChicoryException;
}

Unsure what we'd call the method, applyOne or applySingle are my only ideas. But, perhaps there is a better way to solve this. Wanted to post this to get some suggestions.

@electrum
Copy link
Collaborator

Doing that would mean you can no longer implement the method using a lambda (it's no longer a @FunctionalInterface). One simple option is to add a new interface with an adapter method:

@FunctionalInterface
public interface ExportFunction {
    Value[] apply(Value... args) throws ChicoryException;

    static ExportFunction of(SimpleExportFunction function) {
        return args -> new Value[]{function.apply(args)};
    }

    @FunctionalInterface
    interface SimpleExportFunction {
        Value apply(Value... args) throws ChicoryException;
    }
}

But the real issue is that this interface is both inefficient and inconvenient. It's bad for performance because we box everything (AOT doesn't need boxing for arguments or single return), and it's annoying to implement because we have to adapt the arguments and return types.

We really need a low level interface based on MethodHandle that can be efficiently implemented by both users and the engine. Then we can have adapters for higher level interfaces based on annotations. For example:

@ExportFunction(args = {I32, I64, I32}, returns = {I32})
public static int myFunction(int a, long b, int c) {
    return (int) (a + b + c);
}

The JMH factorial benchmark shows a ~180x improvement for AOT vs interpreter on 1000 iterations, but only a ~60x improvement for 5 iterations. I haven't profiled it yet, but I suspect that the inefficiency of this interface is to blame.

(As I write this, I'm realizing that it would be helpful for the benchmark to have a Java implementation of the factorial function as a baseline for comparison.)

@bhelx
Copy link
Contributor Author

bhelx commented Jun 21, 2024

Thanks for the suggestions on fixing the broader problem. Initially I was thinking about addressing this as an API boundary problem and wasn't thinking so much about performance, but if we can create both a performant and ergonomic interface that works externally and internally then that's probably the way to go.

@electrum
Copy link
Collaborator

My example there is more appropriate for a host function, as the implementation of an export function will come from the engine. For export functions, one option is for the user to provide a custom interface:

public interface Factorial {
    int execute(int n);
}

Or use any appropriately typed existing interface such as IntUnaryOperator.

For the interpreter, these interfaces could be implemented using MethodHandleProxies. For AOT, we could implement these interfaces directly if desired for better performance.

We could also support interfaces with multiple methods, which could make it easier to implement larger APIs. These could be implemented using JDK proxies or code generation.

I extended the factorial JMH benchmark with a baseline and found that AOT has identical performance to Java on 1000 iterations, but is ~6x slower for 5 iterations due to boxing. This means that the export function interface is the current blocker for low-overhead calls into WASM (perhaps important for certain types of WASM-as-a-library use cases).

@andreaTP
Copy link
Collaborator

Do we have an alternative to the usage of MethodHandle ?

I might be missing something, but, maybe generating the code directly for the expected use cases is an option?
E.g. generating code for up to 20 params and 10 return values.
It's anyhow a low level interface.

All in all, I'm curious to see a strawman to better understand the invariants here!

@electrum
Copy link
Collaborator

electrum commented Jun 23, 2024

Do you have a specific objection to the usage of MethodHandle? It is a standard part of the Java platform since Java 7 and is used by the Java compiler for lambdas, string concatenation, etc. And it's even available on Android.

Generating specific interfaces is combinatorial with (4^N)*5 for a parameter arity of N with a result count of 0 or 1 (assume we will box for multi-return). So while that might be feasible for a small arity, ignoring the bloat to the distribution size, it doesn't seem like a good solution.

I can put together a prototype and see if we can get to zero overhead for AOT exports. This will also improve AOT performance for CALL_INDIRECT, which currently has to go through the boxed Machine.call() interface, even for other AOT methods.

@andreaTP
Copy link
Collaborator

So far, we have been very conscious about adding complexity to the interpreter codebase, and attempted to have 0 usage of reflection(even if modern).

It might be time to re-evaluate this decision when it provides a huge performance improvement; but let me play the devil's advocate a little more to be sure that's the best thing we can do to this codebase.

Please note that there are 0 reserves to use this technique in the AOT.

Thinking about it a little more, would a generative approach only for the most common use cases(up to 4 params and 0/1 returns) with a fallback for the remaining cases be a viable option?

@electrum
Copy link
Collaborator

The only complexity is in the adapter layer, which is used to create a nicer API for users. The interpreter could keep the same call() interface. The complexity right now is all on the user side (for both export and import functions). So this is both about performance and better APIs for users.

Up to 4 arity would be (4^4)*5 + (4^3)*5 + (4^2)*5 + (4^1)*5 + (4^0)*5 = 1705 combinations. But now that I think about it, I don't see how this would help, as you wouldn't have any way to call these from the interpreter (without generating that many call sites and a big switch or something).

and attempted to have 0 usage of reflection(even if modern)

I'd like to understand this more. Do you think it is somehow bad, or causes a problem for adoption or another goal? I don't understand why this would itself be a goal.

@bhelx
Copy link
Contributor Author

bhelx commented Jun 24, 2024

Let's pick this up in the meeting tomorrow. Might be easier to talk about in person. I don't want to speak for Andrea but I think the primary concern is on keeping the interpreter code-base straightforward and if possible, push more complexity into the AOT layer. I think it should be achievable while satisfying Andrea's concerns.

@andreaTP
Copy link
Collaborator

Sorry for the delay!

Do you think it is somehow bad, or causes a problem for adoption or another goal? I don't understand why this would itself be a goal.

Great question! And probably it's a good time to challenge those, but let me recap the reasons that we have been using so far:

  • complexity: at the moment the codebase of the interpreter is straightforward, it's easy to onboard people in just few hours, I do value simplicity. We know since long that we need to relax this constraint when performance kicks in, although, I would feel more comfortable doing it after we have a full 100% spec V1 passing (validation included, but we are not far!).
  • native-image compilation: currently, the interpreter runs with 0 configuration after being compiled with GraalVM native-image. This is great because reflection configuration is hard to set up and horrific to maintain. That said, frameworks like Quarkus might help here, and in many cases this part is handled automagically.
  • Android compatibility: we don't have a real way to test this so far ... we have just "been told" about issues. It would be great to progress a bit under this aspect, as I fully understand this argument doesn't really hold.

All in all I'm in favor of:

push more complexity into the AOT layer.

when appropriate, but, if the performance gain is huge (especially if backed by real-world examples), I'm happy to keep the conversation going and iterate over those points.

@andreaTP
Copy link
Collaborator

andreaTP commented Jun 26, 2024

If we exclude the performance discussion from this thread and we look only at the final user API I decided to dump my brain in a demo repository (since code is worth thousands of words).

This is my opinionated take on how I'm envisioning the user interaction with Chicory:
https://github.com/andreaTP/chicory-bindgen-poc

Happy to hear feedback!

@bhelx
Copy link
Contributor Author

bhelx commented Jul 23, 2024

We should probably pick up this dicscussion in #441

@bhelx bhelx closed this as completed Jul 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants