Non-overlapping version of memory.copy? #121

penzn · 2019-10-08T22:55:21Z

C/C++ requires non-overlapping source and destination to memcpy. Rust supports both overlapping and non-overlapping copy operations. Would anybody mind a PR with non-overlapping variant of memory.copy?

The text was updated successfully, but these errors were encountered:

conrad-watt · 2019-10-08T23:28:07Z

C/C++ and Rust make the non-overlapping copy faster by specifying undefined behaviour in the case that there is actually an overlap. This doesn't really fit Wasm's design philosophy. Alternatively, if such an instruction were spec'd to trap immediately in the overlapping case, it would be hard to make it faster than the regular memory.copy (since you could just do the same check and go to a fast path instead of trapping).

AndrewScheidecker · 2019-10-09T00:38:22Z

I was thinking about this in the context of the other thread about inlining memory.copy, but the problem is that without a dynamic overlap check, it's possible to end up with a worse form of non-determinism than racing changes to memory bounds/protection: using a non-overlapping copy instruction with address ranges that actually overlap may behave differently on different runtimes, and WASM code that "works" in one runtime might not work in another.

It would be possible to avoid that inter-runtime non-determinism by defining such a non-overlapping memory.copy to be semantically equivalent to a one-byte-at-a-time copy, but that constrains the implementation to something that is inefficient for the small copies that benefit the most from getting rid of the dynamic direction check.

lars-t-hansen · 2019-10-24T13:22:19Z

C/C++ requires non-overlapping source and destination to memcpy. Rust supports both overlapping and non-overlapping copy operations. Would anybody mind a PR with non-overlapping variant of memory.copy?

What @conrad-watt said. I don't think this yields any benefits in the context of wasm, where we have to check. It's a bit like regular memory accesses: they may declare that they are properly aligned, but they could be lying (or mistaken) about that. Yet we have to make them work anyway.

binji · 2020-02-11T17:16:50Z

Seems like there's not much to be done here, closing.

laughinghan · 2020-05-21T19:52:05Z

Hi sorry, I don't understand the resolution to this thread.

@conrad-watt @lars-t-hansen My understanding is that a dynamic overlap check that traps is practically free because it's branch-predictor-friendly (because in practice, in hot code, it should never trap). Whereas dynamic overlap checks to decide whether or not to allow parallelized copy, and whether to forward- or reverse-copy, would likely see all cases and couldn't be optimized by the branch predictor.

The same consideration applies to alignment, too. I understand that for large bulk moves, branch predictor optimization would be negligible, but #1 seems to say we explicitly assume these operations will be used for small moves too. This was also discussed re alignment in #111, but that seems to have been closed after consensus was achieved on a change to bounds-checking.

Am I missing something?

conrad-watt · 2020-05-21T21:53:44Z

I have only amateur knowledge of branch prediction, but this would only be a potential issue in situations where the checks aren't inlined at the site, right? If the short moves are of constant length, I'd expect the checks to be inlined or unnecessary (based on the resolution of #111). So to see a real performance hit, you'd need a program performing a weird enough mixture of overlapping/non-overlapping small variable-size moves to defeat the branch predictor and have this be observable. My knee-jerk reaction is that such programs are unlikely to exist, and if they were common then engines would probably want to more aggressively inline anyway, to avoid paying the cost of calling an intrinsic for such small moves.

laughinghan · 2020-05-23T02:42:42Z

I am no expert either, but I don't think inlining renders the checks unnecessary, and isn't the point of this intrinsic to be faster than equivalent Wasm code, not to impose an additional cost that engines have to weigh?

I don't think programs have to be particularly weird to defeat the branch predictor (and related mechanisms like the cache prefetcher), but I'm not in a position to prove it either way. You could be right that inlining obviates the principle that a dynamic check to trap is free whereas a dynamic check that alters control flow and precludes a fast path is expensive. That wasn't the conclusion I drew from #111, where it seems like LLVM is explicitly choosing not to use memory.copy for short constant copies because of the loss of alignment and overlap information (#111 (comment)), but I should probably let implementation experts like @eqrion speak for themselves.

tlively · 2020-05-23T19:49:02Z

The resolution to this thread had nothing to do with branch predictors or inlining. Wasm cannot have a version of memory.copy that assumes the memory regions do not overlap because in practice the memory regions might overlap and WebAssembly has to define an exact behavior in that case. As @conrad-watt said in his first comment, languages that provide a version of memcpy that assumes non-overlapping memory regions make it undefined behavior to call that function with overlapping regions. That is not an option for WebAssembly because WebAssembly does not have undefined behavior.

conrad-watt · 2020-05-23T21:03:32Z

To be fair to @laughinghan, I think their comment posits that a non-overlapping version of memory.copy that does check for non-overlap and traps if that is not satisfied could still be faster than a general memory.copy that checks for non-overlap and chooses between fast path/slow path, because the latter could have problems with branch prediction. I'd still be surprised if this was an issue in real-world code.

CryZe · 2020-05-23T21:06:38Z

Well memcpy is really hot in Rust and the branch isn‘t super predictable as it gets called in so many different ways. So it might actually make a difference. I guess you would just have to define the behavior of what happens if it is indeed overlapping despite using the non-overlapping variant.

tlively · 2020-05-23T23:23:33Z

Oh, I see. Sorry for misunderstanding that. It would be interesting to get data on how common overlapping memcpy is in practice and whether a trap-on-overlap variant would be useful.

conrad-watt · 2020-05-24T00:04:41Z

If this did turn out to be a real perf concern, my preference would be that a non-overlapping-optimised variant appears in the semantics as a "hint" to memory.copy, similar to the alignment hints for load/store. So the Wasm-level semantics would still be the same, and there would still be a slow-path fallback, but an engine could see the hint and, if it wanted to optimise for it, inline the checks/use a different intrinsic/OOL.

Maybe such a "hinted variant" would have to be encoded as a separate instruction in the binary format either way, just because this proposal is so far along.

lars-t-hansen · 2020-05-26T07:17:47Z

I think that if we want a nonoverlapping variant of memory.copy, then we should have some data to back up the utility of it, and we should do this as a separate proposal, as we really need to ship this one. ie we should kick this issue up to the design repo, probably.

I probably agree with @conrad-watt that a hint is more useful than a trap-on-unaligned, and that hints could be overlapping/not + alignment of pointers. As usual, there is a memory index field in memory.copy that can be repurposed as a flag byte / indicator of additional information, and now that we are aware of this issue we can take that into account when we work on the multi-memory feature, which will need to use the memory index.

But if the purpose of this variant is performance, it would also be useful to determine (through experimentation) whether there's an edge to be had from trapping rather than hinting.

eqrion mentioned this issue Nov 27, 2019

Advance bulk-memory-operations to phase 4 #132

Closed

binji closed this as completed Feb 11, 2020

conrad-watt reopened this May 24, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Non-overlapping version of memory.copy? #121

Non-overlapping version of memory.copy? #121

penzn commented Oct 8, 2019

conrad-watt commented Oct 8, 2019

AndrewScheidecker commented Oct 9, 2019

lars-t-hansen commented Oct 24, 2019

binji commented Feb 11, 2020

laughinghan commented May 21, 2020

conrad-watt commented May 21, 2020 •

edited

Loading

laughinghan commented May 23, 2020 •

edited

Loading

tlively commented May 23, 2020

conrad-watt commented May 23, 2020 •

edited

Loading

CryZe commented May 23, 2020

tlively commented May 23, 2020

conrad-watt commented May 24, 2020 •

edited

Loading

lars-t-hansen commented May 26, 2020

Non-overlapping version of memory.copy? #121

Non-overlapping version of memory.copy? #121

Comments

penzn commented Oct 8, 2019

conrad-watt commented Oct 8, 2019

AndrewScheidecker commented Oct 9, 2019

lars-t-hansen commented Oct 24, 2019

binji commented Feb 11, 2020

laughinghan commented May 21, 2020

conrad-watt commented May 21, 2020 • edited Loading

laughinghan commented May 23, 2020 • edited Loading

tlively commented May 23, 2020

conrad-watt commented May 23, 2020 • edited Loading

CryZe commented May 23, 2020

tlively commented May 23, 2020

conrad-watt commented May 24, 2020 • edited Loading

lars-t-hansen commented May 26, 2020

conrad-watt commented May 21, 2020 •

edited

Loading

laughinghan commented May 23, 2020 •

edited

Loading

conrad-watt commented May 23, 2020 •

edited

Loading

conrad-watt commented May 24, 2020 •

edited

Loading