Skip to content
This repository was archived by the owner on Nov 3, 2021. It is now read-only.

Non-overlapping version of memory.copy? #121

Open
penzn opened this issue Oct 8, 2019 · 13 comments
Open

Non-overlapping version of memory.copy? #121

penzn opened this issue Oct 8, 2019 · 13 comments

Comments

@penzn
Copy link
Contributor

penzn commented Oct 8, 2019

C/C++ requires non-overlapping source and destination to memcpy. Rust supports both overlapping and non-overlapping copy operations. Would anybody mind a PR with non-overlapping variant of memory.copy?

@conrad-watt
Copy link
Contributor

C/C++ and Rust make the non-overlapping copy faster by specifying undefined behaviour in the case that there is actually an overlap. This doesn't really fit Wasm's design philosophy. Alternatively, if such an instruction were spec'd to trap immediately in the overlapping case, it would be hard to make it faster than the regular memory.copy (since you could just do the same check and go to a fast path instead of trapping).

@AndrewScheidecker
Copy link
Contributor

I was thinking about this in the context of the other thread about inlining memory.copy, but the problem is that without a dynamic overlap check, it's possible to end up with a worse form of non-determinism than racing changes to memory bounds/protection: using a non-overlapping copy instruction with address ranges that actually overlap may behave differently on different runtimes, and WASM code that "works" in one runtime might not work in another.

It would be possible to avoid that inter-runtime non-determinism by defining such a non-overlapping memory.copy to be semantically equivalent to a one-byte-at-a-time copy, but that constrains the implementation to something that is inefficient for the small copies that benefit the most from getting rid of the dynamic direction check.

@lars-t-hansen
Copy link
Contributor

C/C++ requires non-overlapping source and destination to memcpy. Rust supports both overlapping and non-overlapping copy operations. Would anybody mind a PR with non-overlapping variant of memory.copy?

What @conrad-watt said. I don't think this yields any benefits in the context of wasm, where we have to check. It's a bit like regular memory accesses: they may declare that they are properly aligned, but they could be lying (or mistaken) about that. Yet we have to make them work anyway.

@binji
Copy link
Member

binji commented Feb 11, 2020

Seems like there's not much to be done here, closing.

@binji binji closed this as completed Feb 11, 2020
@laughinghan
Copy link

Hi sorry, I don't understand the resolution to this thread.

@conrad-watt @lars-t-hansen My understanding is that a dynamic overlap check that traps is practically free because it's branch-predictor-friendly (because in practice, in hot code, it should never trap). Whereas dynamic overlap checks to decide whether or not to allow parallelized copy, and whether to forward- or reverse-copy, would likely see all cases and couldn't be optimized by the branch predictor.

The same consideration applies to alignment, too. I understand that for large bulk moves, branch predictor optimization would be negligible, but #1 seems to say we explicitly assume these operations will be used for small moves too. This was also discussed re alignment in #111, but that seems to have been closed after consensus was achieved on a change to bounds-checking.

Am I missing something?

@conrad-watt
Copy link
Contributor

conrad-watt commented May 21, 2020

I have only amateur knowledge of branch prediction, but this would only be a potential issue in situations where the checks aren't inlined at the site, right? If the short moves are of constant length, I'd expect the checks to be inlined or unnecessary (based on the resolution of #111). So to see a real performance hit, you'd need a program performing a weird enough mixture of overlapping/non-overlapping small variable-size moves to defeat the branch predictor and have this be observable. My knee-jerk reaction is that such programs are unlikely to exist, and if they were common then engines would probably want to more aggressively inline anyway, to avoid paying the cost of calling an intrinsic for such small moves.

@laughinghan
Copy link

laughinghan commented May 23, 2020

I am no expert either, but I don't think inlining renders the checks unnecessary, and isn't the point of this intrinsic to be faster than equivalent Wasm code, not to impose an additional cost that engines have to weigh?

I don't think programs have to be particularly weird to defeat the branch predictor (and related mechanisms like the cache prefetcher), but I'm not in a position to prove it either way. You could be right that inlining obviates the principle that a dynamic check to trap is free whereas a dynamic check that alters control flow and precludes a fast path is expensive. That wasn't the conclusion I drew from #111, where it seems like LLVM is explicitly choosing not to use memory.copy for short constant copies because of the loss of alignment and overlap information (#111 (comment)), but I should probably let implementation experts like @eqrion speak for themselves.

@tlively
Copy link
Member

tlively commented May 23, 2020

The resolution to this thread had nothing to do with branch predictors or inlining. Wasm cannot have a version of memory.copy that assumes the memory regions do not overlap because in practice the memory regions might overlap and WebAssembly has to define an exact behavior in that case. As @conrad-watt said in his first comment, languages that provide a version of memcpy that assumes non-overlapping memory regions make it undefined behavior to call that function with overlapping regions. That is not an option for WebAssembly because WebAssembly does not have undefined behavior.

@conrad-watt
Copy link
Contributor

conrad-watt commented May 23, 2020

To be fair to @laughinghan, I think their comment posits that a non-overlapping version of memory.copy that does check for non-overlap and traps if that is not satisfied could still be faster than a general memory.copy that checks for non-overlap and chooses between fast path/slow path, because the latter could have problems with branch prediction. I'd still be surprised if this was an issue in real-world code.

@CryZe
Copy link

CryZe commented May 23, 2020

Well memcpy is really hot in Rust and the branch isn‘t super predictable as it gets called in so many different ways. So it might actually make a difference. I guess you would just have to define the behavior of what happens if it is indeed overlapping despite using the non-overlapping variant.

@tlively
Copy link
Member

tlively commented May 23, 2020

Oh, I see. Sorry for misunderstanding that. It would be interesting to get data on how common overlapping memcpy is in practice and whether a trap-on-overlap variant would be useful.

@conrad-watt
Copy link
Contributor

conrad-watt commented May 24, 2020

If this did turn out to be a real perf concern, my preference would be that a non-overlapping-optimised variant appears in the semantics as a "hint" to memory.copy, similar to the alignment hints for load/store. So the Wasm-level semantics would still be the same, and there would still be a slow-path fallback, but an engine could see the hint and, if it wanted to optimise for it, inline the checks/use a different intrinsic/OOL.

Maybe such a "hinted variant" would have to be encoded as a separate instruction in the binary format either way, just because this proposal is so far along.

@conrad-watt conrad-watt reopened this May 24, 2020
@lars-t-hansen
Copy link
Contributor

I think that if we want a nonoverlapping variant of memory.copy, then we should have some data to back up the utility of it, and we should do this as a separate proposal, as we really need to ship this one. ie we should kick this issue up to the design repo, probably.

I probably agree with @conrad-watt that a hint is more useful than a trap-on-unaligned, and that hints could be overlapping/not + alignment of pointers. As usual, there is a memory index field in memory.copy that can be repurposed as a flag byte / indicator of additional information, and now that we are aware of this issue we can take that into account when we work on the multi-memory feature, which will need to use the memory index.

But if the purpose of this variant is performance, it would also be useful to determine (through experimentation) whether there's an edge to be had from trapping rather than hinting.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

8 participants