-
Notifications
You must be signed in to change notification settings - Fork 73
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Permanent Slow Casting #120
Comments
I think it could work as follows: |
@tebbi Unfortunately there are a number of assumptions in your suggestion that start to get why this is such a hard problem despite seeming straightforward.
Obviously this incurs overhead, but lets put that aside. Type-checking a virtual method (or an interface method or a closure call) at a low level like WebAssembly's is known to require existential types or something similar like type fields. A virtual-method call works by loading the method from the object's fields (or v-table), and then passing that object as the first argument to the method (i.e. as the value of the "this" pointer) along with the surface-level arguments. The issue is that a program using a low-level instruction set like WebAssembly's could pass some other value as the "this" pointer. In your example, a simple type system can only ensure that that value is some array, but not necessarily the array that the virtual method was meant to be paired with, and in particular not necessarily with the same "true array element type". That's why any simply typed MVP will necessarily have to superfluously cast the "this" pointer in all virtual/interface-method implementations - the type system cannot guarantee that the argument for "this" actually has the expected type. Existential types and type members fix this by letting values state that they have some unknown "exact" type, and by letting function pointers stored in objects demand that the "this" pointer have that same unknown "exact" type.
Here you're assuming that the Post-MVP's encoding of a type like Hopefully I managed to illustrate part of way this is so challenging, though be aware that this is still just the tip of the iceberg. That said, there are known techniques for addressing these problems in a way that is both easy to generate and easy to validate, but their first step is to switch to a more tractable nominal system as suggested in #119. And no, the utility of that suggestion does not apply to only surface languages with nominal type systems; the nominality describes the data structures used by the compiler/runtime more so than those used by the surface language, and as such applies to a wide variety of languages. |
And if Java (or any other language) were to be compiled to linear memory instead, it could use the full knowledge of any invariants its type system knows about, and thus omit unnecessary casts or range checks entirely (using only the memory range check, which is close to "free" on wasm32 in browser engines currently). Just needs some stack inspection, and host interop features instead. |
Nice test program to show the problem. |
@aardappel Theoretical such a linear memory access should be very fast. But a self written GC will not be very effizient. And such GC can not run on a different thread like any modern GC. There will breaks. The breaks will be larger if the memory grow. |
Why not? It can be custom to the language, so it has many ways in which it can be better than a generic one. It just needs a stack inspection feature to be competitive, like I mentioned.
Why not? Anything about the threads proposal that would not work for GC? |
@aardappel @Horcrux7 The last two comments seem to be on a separate topic. I think understanding casting performance is one of many items that will help inform a discussion on that topic, and I would like to focus on casting here so that we can first examine that item in more depth. |
Been mulling the micro-benchmark re type casting.
Haven't tried this, but WDYT? |
That's not how a Java cast to a (non-final) class works nor how an |
Depends. If you cast a Java object to an interface, then a search is required. You do not know at compile time which of an object's interfaces is the right one. |
That's true, which is why the microbenchmark only ever casts to classes. |
In any case, it might be better to have a 'pure' C benchmark that mimics potential behavior of WASM cast instructions than relying on actual behavior of cast in Java (say). |
I suspect a C benchmark would perform the same or worse. It would not incorporate the decades of engineering the JVM has benefited from (remember that every write to an |
Obviously some misunderstanding here. |
Just wondering what if we apply explicit inline cache mechanism. How this change benchmark results? |
Not sure. It's possible the JVM is already employing such a technique. But also, my understanding is that WebAssembly is not supposed to rely on inline caching for good performance. |
We have reached phase 3 and the design is stable. We also have good performance results in practice, although we're still much slower than the JVM on many benchmarks. Closing this since it is not actionable at this point beyond ongoing optimization work in tools and engines. |
A consequence of the issues raised in #116 is that, unless we somehow manage to solve a longstanding open research problem, we will never be able to eliminate "superfluous" casts even in extensions to the current MVP. For example, despite significant effort, no research team has been able to extend the current MVP's type system to guarantee that a
String[]
in Java contains onlyString
values, and as such a Java compiler to even future versions of WebAssembly is likely to always have to compile an access to aString[]
with a load followed by a cast toString
. Similarly, every virtual/interface method in Java/C#/Kotlin/Scala will have to cast thethis
pointer, and early every closure in OCaml/Haskell will have to cast its environment and inputs and will often have its output cast after the call returns.A consequence of the issues raised in #118 is that we will likely never be able to add more efficient/specialized casting mechanisms to the current MVP beyond
ref.cast_exact
as suggested by @tebbi here.While it's possible these issues will be addressed, they have been open issues (offline) for multiple years, and there's enough evidence that they're insurmountable that we should evaluate the current MVP with the expectation that these issues won't be resolved.
The obvious question then is what is the overhead of these casts that we can't expect to get rid of? Part of that question is hard to answer because it depends on how many superfluous casts there are relative to other computation, and that will vary by benchmark. But part of that question is easier to answer because the casting mechanism in the current MVP is the same as that for the JVM, and the JVM has already been heavily optimized for this casting mechanism. So we can at least get a sense of how much superfluous casts cost individually through a microbenchmark.
For that microbenchmark, I considered the following Java program:
where
Super
has a fieldint x
and two subclassesSub1
andSub2
(with no additional fields). (And wheresupers
is an array of 1000 distinct values, and wheremax
isInteger.MAX_VALUE
.)Although the JVM can guarantee that
supers
contains onlySuper
values and as such the access tox
needs no cast, the current proposal's type system cannot guarantee this and so the generated WebAssembly needs to insert a superfluous cast toSuper
(using its correspondingrtt
) in order to type-check. To measure this overhead, I changed the array to anObject[]
and inserted a cast toSuper
and compared the run times (across multiple iterations with warm up and so on). The inserted cast makes the program run 48% slower (with std. dev. of 1%).Now if
Super
were insteadfinal
, i.e. was guaranteed to have no subclasses, this cast can be optimized to just check that each object'srtt
array is the same as that forSuper
, rather than checking a particular element in this array. This optimization could be added to the MVP via aref.cast_exact
instruction. The JVM indeed supports this optimization. When I cast instead to afinal
class, the program ran 26% slower (with std. dev. of 1%).Then, in the spirit of #109, I encoded
Super
as anint[]
(the JVM does not support butterflies) and simply accessed the field as an array element (which the generator would know exists, though the WebAssembly engine will have to perform an array-bounds check). This version of the program ran only 16% slower (with std. dev. of 1%).I could not think of a way of measuring pointer-tagging casting mechanisms on the JVM.
Although this is not a direct measurement of casting times, it is enough to conclude that an
rtt.cast
with a statically knownrtt
likely takes at least 3x the time anassert_bounds 0 1
takes, and likely takes at least 2x the time anrtt.cast_exact
takes. Thus the current MVP is likely permanently sticking WebAssembly with the slowest-by-far casting mechanism, one which even the optimized variant of is still substantially slower than an array-bounds check.The microbenchmark I used can be found here. I ran this on my laptop a few times, after closing foreground and background processes and such. Benchmarking on the JVM is difficult, and this certainly is an imperfect experimental setup, but it did at least give consistent results across runs.
The text was updated successfully, but these errors were encountered: