memory.copy|fill semantics limit optimizations for short constant lengths #111
Hi, I implemented the bulk memory operations in LLVM. The way LLVM is set up now is that I have two knobs to turn: the max number of stores to use before switching to using `memory.copy`, … If we don't have reliable data to inform this tuning, I could also just restore the LLVM default of maxing out at 8 loads, or 4 loads when optimizing for size. Is it the case that executing …
The data that I've personally experienced, at least, is sourced from Firefox, and AFAIK Firefox has not, before now, attempted to optimize … In some sense, though, I don't think there's great data one way or the other on how to optimize codegen of …
One thing that I bounced off @rossberg at the last in-person meeting was a possible alternative semantics for copy/fill which makes it entirely non-deterministic which parts of the copy/fill complete in the case of an OOB trap. The memory model already makes it non-deterministic which parts of the copy/fill are visible to racing concurrent threads, so this would only affect the observable state of memory after a trap, and would make the possible observations in this case mirror the possible observations in the racing case. This has the advantage that it allows any copy/reverse-copy/ahead-of-time-check implementation to be conformant, and always allows copy/fill to be expressed in terms of Wasm loads/stores, which would help the issue here. It's also still deterministic in the non-OOB case. Disadvantages are that it is uglier to spec, and might introduce more non-determinism than is desirable if traps become easy to recover from in the future (e.g. if they become catchable). I'd definitely prefer not re-introducing guaranteed ahead-of-time OOB traps. If we end up observably exposing the copying order later on anyway (e.g. through …
Eagerly adding another source of non-determinism to wasm's very short list of non-determinisms seems unfortunate. Are there any other issues with ahead-of-time OOB traps other than racy-…? For these two racy cases, first, I'm a bit doubtful that they will in fact be added: …
But if they were added, then perhaps that is the only case where we would allow non-deterministic writes to have occurred, keeping all other cases fully deterministic?
The only other thing I could think of off-the-cuff would be a "thread kill" operation which promises to obey Wasm's formal semantics (i.e. interrupting in-between small-step reductions). But I'm not sure whether this could be implemented in the presence of reasonable compiler optimisations. Possibly some memory mapping shenanigans could be another way the order becomes observable? I don't have any concrete ideas on that front.
Way back during one of the CG discussions on this, I floated something similar as the "irregular solution", and it seemed to be instinctively disliked because you get all-or-nothing "transactionality" in most cases, except where you have a racing … I've been passionate about this in the past because I believed that …
Might an ahead-of-time trap sometimes cause the copy to be slower? Suppose you have a fast block-copy primitive a la REP MOVS, and that we know definitively whether src < dest or dest < src, so that the copy direction is known (could be determined at runtime); and say the copy length is known or known to be small (ditto). Then on a system using VM tricks for bounds checking, I expect that we could probably just emit the REP MOVS inline and let the memory protection take care of bounds checks. Requiring explicit checks up-front would be slower. Even if REP MOVS is not available, some unrolled and suitably scheduled load/store pairs would do the same trick (certainly for known length).

I don't mean these considerations to stand in the way of ahead-of-time OOB traps, since I think that for small known-length copies the compiler-to-wasm should just emit loads and stores anyway; for other cases, a little ahead-of-time overhead for memory.copy should be fine. (I also think …
What LLVM wants to express is "I know the size, I know the alignment, I know they don't overlap, and I just want good performance and small code size", and in reality the size is probably fairly small and divisible by the word size. We are tying ourselves in knots shoehorning this into memory.copy, which is too general. Thus another couple of stray thoughts.

One thought is that we keep saying that we'll eventually provide higher-level compression for bytecode precisely to capture (say) a sequence of loads and stores, so frankly bytecode size should not actually be much of an issue. (Though (a) that work seems to have stalled, and (b) it doesn't quite address the performance argument, since in principle some sequence of loads and stores can't always be compiled as cleverly as memcpy can be.)

Another thought is that for the desired semantics, we could provide load-multiple and store-multiple instructions (a la 68K and ARM). Load-multiple would load k units onto the eval stack; store-multiple would store k units from it; k is constant. If we provide ascending and descending indexing, the compiler can opt for the bounds-checking guarantees it needs. Pairs of these ops would be easy to recognize in any JIT, and good code could be generated both scheduling-wise and bounds-checking-wise on platforms with decent bounds checking. Bytecode size should be fine. If we want to be conservative, we provide only 32-bit variants initially, limit k to something smallish, and require word alignment of at least the store-multiple pointer (typically requires a run-time check, but a cheap one).
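To make the load-multiple/store-multiple idea concrete, here is a rough sketch in strawman text format. The instruction names and immediates are invented for illustration; nothing like this exists in any current proposal:

```wat
(module
  (memory 1)
  ;; hypothetical strawman instructions (these do NOT exist):
  ;; load k constant-size units ascending from src onto the stack, then
  ;; store them descending to dest, so the highest (first) store
  ;; bounds-checks the top of the destination range before any write.
  (func $copy32 (param $dest i32) (param $src i32)
    local.get $dest
    local.get $src
    i64.load_multiple 4          ;; pushes mem[src], mem[src+8], mem[src+16], mem[src+24]
    i64.store_multiple_desc 4))  ;; pops 4 values, storing to dest+24, dest+16, dest+8, dest
```

Pairing an ascending load group with a descending store group is what would let a partially OOB destination trap before any byte is written.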
I don't think we have this data. The hard part of gathering it is that our OOL `memory.copy` implementation is relatively slow and has some major room for improvement. Comparing unrolled Wasm loads/stores against it at this point is likely to have them outperform it for longer lengths than is reasonable. I'm also not sure at this point if our `memory.copy` is faster than `memcpy` compiled to Wasm. Certainly it is for large sizes, but there's some incidental overhead that we'd like to remove that could cause Wasm memcpy to be faster in some cases right now. This should be fixed in the medium term, of course.
I should try to clarify one point I didn't really explain well. Tweaking LLVM's heuristics sounds like it could definitely help this situation, but I generally don't see the `memory.copy` instruction (as it is specified today) as being a good choice for structure copies and other things with 'short' constant lengths. The reason for this is that we lose alignment and overlap info that we would have to recover at runtime, whereas LLVM could just emit the loads and stores using this information. `memory.copy` is still very useful, but it should focus on larger copies, where it's likely to outperform inline Wasm loads and stores.

There are a number of tweaks that could help here; I think @lukewagner's suggestion makes the most sense to me. If we could go with that, then we should be able to continue allowing LLVM to optimize for size by using memory.copy aggressively. @lars-t-hansen also has a good point that it seems like we may be trying to shoehorn this use-case into `memory.copy` when it doesn't make sense. I'm open to alternatives.
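As a concrete illustration of the information that gets lost, consider an 8-byte struct copy with known 4-byte alignment and no overlap (the function names here are made up for the example):

```wat
(module
  (memory 1)
  ;; what LLVM knows, expressed directly as loads/stores:
  (func $copy_struct (param $dest i32) (param $src i32)
    (i64.store align=4 (local.get $dest)
      (i64.load align=4 (local.get $src))))
  ;; what memory.copy forces the engine to assume: unknown alignment
  ;; and unknown overlap, both recovered at runtime.
  (func $copy_struct_bulk (param $dest i32) (param $src i32)
    (memory.copy (local.get $dest) (local.get $src) (i32.const 8))))
```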
@conrad-watt Ah, thanks for reminding me of the CG slides; I think jetlag interacts poorly with long-term memory creation :-P I think the new information here, not present in that discussion, but which is what we'd expect to hear during the Implementation phase, is that there are non-trivial performance/complexity implications of the different trapping-semantics choices. Also, in the (imho, unlikely) case we add …
Ok, it sounds like making … It sounds like it would be a reasonable expectation in the medium term that …
I would be fine with this. It has the advantage of being a reasonable balance between a neater, potentially less implementation-onerous semantics now, and a still-acceptable semantics in the (unlikely?) case where … If we ran with this idea, we'd effectively be reverting the "up to the last byte" OOB behaviour changes for instantiation and bulk memory.
Okay, I’d like to formally propose changing the trapping behavior here so that we do not partially write in the case of a range that is partially OOB. This would effectively revert the trapping behavior back to what it originally was.

To summarize this thread, this change is motivated by the discovered implementation complexity of trying to generate an inlined version of `memory.copy`. The major concern that motivated the change to allow partial writes was the interaction with a future `memory.protect` instruction. The semantics of …

This would effectively be partial reverts of #50 (for bulk-mem instructions) and #57 (for active segment initialization). If there are no further arguments against this change here, I would like to add this to the next CG meeting agenda.
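For concreteness, a tiny example of where the two behaviors diverge (a sketch; the function name is arbitrary):

```wat
(module
  (memory 1)  ;; one 64 KiB page: addresses [0, 65536) are valid
  (func $demo
    ;; fill [65532, 65540): the last 4 bytes are out of bounds
    (memory.fill (i32.const 65532) (i32.const 0xAB) (i32.const 8))))
```

Under the current byte-wise semantics, bytes 65532..65535 are written with 0xAB before the trap; under the proposed semantics, the bounds check happens up front and nothing is written.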
I'm not sure there's any reason to change the spec proposal for this issue. The issue is that code becomes slower when using bulk memory operations for small fixed-size copies. We can fix this the hard way, by changing the spec and doing a lot of optimization work in all the engines, or the easy way, by simply not using bulk memory operations for small fixed-size copies.
Yes, it's true that we can resolve the immediate issue by changing LLVM to not use the current bulk memory ops for small fixed-size copies. However, it seems like we would be missing a chance to improve the specification before it ships so that it is useful for this use case. I guess it depends on the potential code size and quality improvements we can achieve, but it would be nice to have an instruction that is performant for this use case.
IIUC, …
It should be possible to implement an efficient inline copy by just reverting the partial-write behavior. Apologies, I made a veiled reference to this in the OP but should have made it more explicit.

Essentially, you load all of the src bytes into registers (using SSE/vector registers there should be enough storage for most inline sizes we care about), then store the src bytes from high to low. If any src byte is OOB, you'll trap before writing. If any dest byte is OOB, you'll trap on the first write. This was taken from [1], minus the complicated signal-handling logic for handling partial writes.

Additionally, adding alignment information could help as well, but that's a different spec change that would need discussion.

[1] https://bugzilla.mozilla.org/show_bug.cgi?id=1570112#c23
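A minimal sketch of the same idea expressed at the Wasm level, for a 16-byte copy (the function name is hypothetical; a real engine would do this with vector registers in its JIT-emitted machine code rather than in Wasm):

```wat
(module
  (memory 1)
  (func $copy16 (param $dest i32) (param $src i32)
    (local $lo i64)
    (local $hi i64)
    ;; load the whole source range first: if any src byte is OOB,
    ;; one of these loads traps before anything has been written
    (local.set $lo (i64.load (local.get $src)))
    (local.set $hi (i64.load offset=8 (local.get $src)))
    ;; store high -> low: memory is only ever OOB at the high end, so a
    ;; partially OOB dest makes the first store trap with no writes
    (i64.store offset=8 (local.get $dest) (local.get $hi))
    (i64.store (local.get $dest) (local.get $lo))))
```

Because every source byte is read before any destination byte is written, overlap handling also comes for free.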
The main thing that stuck in my mind after the last CG was the following. Even if we commit to the byte-wise semantics now, I wouldn't be willing to bet that all of today's implementations would still conform to those semantics if …

To me, this suggests that there has to be a point at which we're ok with non-determinism in interrupted …

All this to say, I'm starting to feel that I made a mistake by advocating for the byte-wise semantics, and I'd support going back to the bounds-checked semantics, so long as people could be on board with the underlying copy/write order for bulk instructions being specified as non-deterministic if it ever became observable (for now, it would be sufficient to specify it in the natural byte-by-byte way, since it wouldn't be observable).
One (perhaps minor) disadvantage is that the proposed change destroys one of the properties we tried to establish with the simplifications discussed at the last F2F meeting, namely that every bulk operation is equivalent to a sequence or loop of individual loads/stores. In particular, a memory.copy where src > dst with potential overlap, such that it must copy low to high, can no longer be transformed that way, but requires at least an extra read. This also means that there will probably be a special case in the semantics.
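The equivalence in question, sketched as a Wasm loop (hypothetical helper, ascending direction only): under the byte-wise semantics, a memory.copy with src >= dst behaves exactly like this loop, whereas with up-front bounds checks the loop can partially write before trapping, so the transformation is no longer valid without, e.g., an extra read that touches the last byte first.

```wat
(module
  (memory 1)
  (func $copy_ascending (param $dest i32) (param $src i32) (param $len i32)
    (block $done
      (loop $next
        ;; stop when all bytes have been copied
        (br_if $done (i32.eqz (local.get $len)))
        ;; copy one byte, low to high
        (i32.store8 (local.get $dest) (i32.load8_u (local.get $src)))
        (local.set $dest (i32.add (local.get $dest) (i32.const 1)))
        (local.set $src (i32.add (local.get $src) (i32.const 1)))
        (local.set $len (i32.sub (local.get $len) (i32.const 1)))
        (br $next)))))
```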
Sorry, I was not aware of the prior discussions. I feel that byte-wise semantics would be tricky to implement even if alignment is known. @eqrion I took a quick look at …
I've come around to being in favour of the proposed change.
There is a middle ground that addresses @rossberg's concern while also enabling the proposed optimization. The semantics can be that a trap can occur only if, at some point during the copy, there is an address in the specified range that is not allowed to be written to (by the copying thread). In such a situation, there is no guarantee as to which writes were completed. (And the instruction can only complete successfully if all writes were completed successfully.)

The current proposal (i.e. the proposal before the CG meeting) and the proposed fix (during the CG meeting) are both special cases of this semantics, and both are efforts to determinize a corner case of the semantics. But if we take @conrad-watt's observation that non-determinism is inevitable in this situation, we see that the middle ground here is only non-deterministic (with respect to trapping) in the presence of racy changes to memory validity, and only non-deterministic (with respect to partial writes) in the presence of racy reads or if someone catches the trap. So the middle ground has some non-determinism, but it's essentially only visible to an outside observer.

I'm not particularly advocating for any of the three choices; I just wanted to provide a third option in case it was helpful.
@RossTate, AFAICS, it wouldn't be possible to unroll an instruction with such a semantics into an equivalent sequence of individual accesses (because that fully determines which writes occur before a trap). So I am not sure how this semantics would address the concern I brought up. I am also not sure how it could, technically, be defined in such a way that it actually is observably different from the completely non-deterministic semantics that @conrad-watt mentioned.
If there's no initial bounds check, I agree that @RossTate's suggestion sounds like (#111 (comment)). This would allow unrolling, because it would just be a narrowing of the non-deterministic possibilities (i.e., instead of being allowed to trap anywhere having written "anything", the program now follows the semantics of the unrolled writes). Parroting @lukewagner (#111 (comment)), having an initial bounds check allows the semantics to be deterministic in the vast majority of cases. We don't know what form …
Aside from describing writes with overlapping source and destination, what benefits do the "sequential" semantics have? In case of an out-of-bounds access there would be some data that did not get written, and recovering from it would not be fundamentally different under either sequential or bulk semantics. Non-determinism would not really make that easier, as it would require re-attempting the whole operation, just like in the "bulk" semantics. All three possible semantics would require some clarification for the shared-memory case anyway, with shared-memory semantics being at least somewhat non-deterministic. IMO, having the operation complete or fail as a whole makes the implementation much more transparent.
@penzn, one benefit is that a compiler can safely transform one form into the other and vice versa, e.g., for optimisation purposes, without having to prove the absence of (or account for) tricky corner cases. Being able to reduce a complex operation to a simple one also tends to simplify the language semantics and thus reasoning about programs.
… r=lth

This commit changes all bulk-memory instructions to perform up-front bounds checks and trap if any access would be out-of-bounds before writing. This affects:

* memory.init,copy,fill
* table.init,copy,fill
* data segment instantiation (reduces to memory.init)
* elem segment instantiation (reduces to table.init)

Spec issue: WebAssembly/bulk-memory-operations#111
Differential Revision: https://phabricator.services.mozilla.com/D51755
Spec issue: #111

This commit changes the semantics of bulk-memory instructions to perform an upfront bounds check and trap if any access would be out-of-bounds, without writing. This affects the following:

* memory.init/copy/fill
* table.init/copy (fill requires reftypes)
* data segment init (lowers to memory.init)
* elem segment init (lowers to table.init)
@eqrion, I think this can be closed.
The two primary changes involved are:

1. Removal of `assert_return_canonical_nan`/`assert_return_arithmetic_nan` in favor of special `nan:canonical`/`nan:arithmetic` constants that can only be used in test expectations. See: WebAssembly/spec#1104
2. New trapping behaviour for bulk memory operations. Range checks are now performed up front for operations such as memory.fill and memory.copy. See: WebAssembly/bulk-memory-operations#111 and WebAssembly/bulk-memory-operations#123. The old behaviour is still kept around to support table.fill, which is defined in the reference-types proposal and has yet to be updated.
3. nullref is now permitted in the text and binary format.
As discussed in #1, the expectation is that producers will use `memory.copy|fill` for short lengths in addition to long lengths. We’ve already seen this to be the case, and have been investigating a performance regression resulting from LLVM 9 using `memory.copy` for short constant lengths [1].

Part of that regression is in a sub-optimal OOL call to the system `memmove`, but to really get performance to par, we’d like to inline these short `memory.copy`s to loads and stores. This has turned out challenging to implement in a way that is better than or equal to the Wasm loads and stores that were emitted previously.
There are several problems resulting from the following:

1. We don’t know the alignment of `src`, `dest`.
2. A partially OOB range must write every byte up to the boundary before trapping.
3. We don’t know whether `src`, `dest` overlap.

(1) and (2) are related. Because we don’t know the alignment of `src` or `dest`, we cannot use transfers wider than a single byte at a time (e.g. 32-bit, 64-bit, or 128-bit), or else we’d be at risk of the final store being partially OOB and not writing all bytes up to the boundary due to misalignment of `src` or `dest`. The system `memmove` can work around this by aligning the `src`/`dest` pointers, using wide transfer widths, and fixing up slop afterwards. But this isn’t feasible for inlined code.

The problem with (3) is that we need to generate two sequences of loads and stores: one for when `src < dest`, where we need to copy from high -> low, and another for low -> high. This adds to code size and is a comparison that we didn’t need to do before. This could potentially be solved in a branchless way by using vector registers as a temporary buffer, but that still has difficulty with (1) and (2).

There seem to be several options that could improve this situation.
We could ask LLVM to not emit `memory.copy` in these situations. `memory.copy` is not equivalent to `memcpy`; it’s `memmove`, and has stricter semantics than LLVM requires. For example, with struct copies LLVM should know the alignment and that there is no overlap. Recovering this information at runtime is unfortunate. The downside to this is potential binary size increases, and limiting ourselves to the load and store widths that are defined in Wasm (e.g. no SIMD yet).

We could modify the behavior for partially OOB ranges to not write any bytes at all. This would allow us to load all `src` bytes into vector registers, then store them to `dest` from high -> low. If there is a trap, it will happen immediately and nothing will be written. This fixes (1) and (3) by changing (2). The uncertainty here is around whether this is possible with a future `memory.protect` instruction.

We could add an alignment hint similar to the one used for plain loads and stores. We could then emit loads and stores at the width of the alignment, along with a guard checking the alignment. If the guard fails, we’d need to fall back to a slow path. If the guard succeeds, we’d have a guarantee for (2). This approach still has the problem of (3), and it doesn’t seem like adding an overlap hint would be feasible, due to the complexity of the guard required.
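To illustrate problems (1) and (3) concretely, here is roughly what an inlined 4-byte `memory.copy` has to look like under the current semantics (a sketch with a hypothetical function name; only byte accesses are safe without alignment info, and the direction branch is the extra comparison mentioned above):

```wat
(module
  (memory 1)
  (func $copy4 (param $dest i32) (param $src i32)
    (if (i32.lt_u (local.get $src) (local.get $dest))
      (then ;; possible overlap with src < dest: copy high -> low
        (i32.store8 offset=3 (local.get $dest) (i32.load8_u offset=3 (local.get $src)))
        (i32.store8 offset=2 (local.get $dest) (i32.load8_u offset=2 (local.get $src)))
        (i32.store8 offset=1 (local.get $dest) (i32.load8_u offset=1 (local.get $src)))
        (i32.store8 (local.get $dest) (i32.load8_u (local.get $src))))
      (else ;; copy low -> high
        (i32.store8 (local.get $dest) (i32.load8_u (local.get $src)))
        (i32.store8 offset=1 (local.get $dest) (i32.load8_u offset=1 (local.get $src)))
        (i32.store8 offset=2 (local.get $dest) (i32.load8_u offset=2 (local.get $src)))
        (i32.store8 offset=3 (local.get $dest) (i32.load8_u offset=3 (local.get $src)))))))
```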
cc @lars-t-hansen @julian-seward1 @lukewagner
[1] https://bugzilla.mozilla.org/show_bug.cgi?id=1570112