memory.copy|fill semantics limit optimizations for short constant lengths #111
Hi, I implemented the bulk memory operations in LLVM. The way LLVM is set up now is that I have two knobs to turn: the max number of stores to use before switching to using `memory.copy`, … If we don't have reliable data to inform this tuning, I could also just restore the LLVM default of maxing out at 8 loads, or 4 loads when optimizing for size. Is it the case that executing …
The data that I've personally experienced, at least, is sourced from Firefox, and AFAIK Firefox has not, before now, attempted to optimize … In some sense, though, I don't think there's great data one way or the other on how to optimize codegen of …
One thing that I bounced off @rossberg at the last in-person meeting was a possible alternative semantics for copy/fill which makes it entirely non-deterministic which parts of the copy/fill complete in the case of an OOB trap. The memory model already makes it non-deterministic which parts of the copy/fill are visible to racing concurrent threads, so this would only affect the observable state of memory after a trap, and would make the possible observations in this case mirror the possible observations in the racing case. This has the advantage that it allows any copy/reverse-copy/ahead-of-time-check implementation to be conformant, and always allows copy/fill to be expressed in terms of Wasm loads/stores, which would help the issue here. It's also still deterministic in the non-OOB case. Disadvantages are that it is uglier to spec, and might introduce more non-determinism than is desirable if traps become easy to recover from in the future (e.g. if they become catchable). I'd definitely prefer not re-introducing guaranteed ahead-of-time OOB traps. If we end up observably exposing the copying order later on anyway (e.g. through …
Eagerly adding another source of non-determinism to wasm's very short list of non-determinisms seems unfortunate. Are there any other issues with ahead-of-time OOB traps other than racy-…? For these two racy cases, first, I'm a bit doubtful that they will in fact be added: …
But if they were added, then perhaps that is the only case where we would allow non-deterministic writes to have occurred, keeping all other cases fully deterministic?
The only other thing I could think of off-the-cuff would be a "thread kill" operation which promises to obey Wasm's formal semantics (i.e. interrupting in-between small-step reductions). But I'm not sure whether this could be implemented in the presence of reasonable compiler optimisations. Possibly some memory mapping shenanigans could be another way the order becomes observable? I don't have any concrete ideas on that front.
Way back during one of the CG discussions on this, I floated something similar as the "irregular solution", and it seemed to be instinctively disliked because you get all-or-nothing "transactionality" in most cases, except where you have a racing … I've been passionate about this in the past because I believed that …
Might an ahead-of-time trap sometimes cause the copy to be slower? Suppose you have a fast block-copy primitive a la REP MOVS, and that we know definitively whether src < dest or dest < src, so that the copy direction is known (could be determined at runtime); and say the copy length is known or known to be small (ditto). Then on a system using VM tricks for bounds checking, I expect that we could probably just emit the REP MOVS inline and let the memory protection take care of bounds checks. Requiring explicit checks up-front would be slower. Even if REP MOVS is not available, some unrolled and suitably scheduled load/store pairs would do the same trick (certainly for known length).

I don't mean these considerations to stand in the way of ahead-of-time OOB traps, since I think that for small known-length copies the compiler-to-wasm should just emit loads and stores anyway; for other cases, a little ahead-of-time overhead for memory.copy should be fine. (I also think …
What LLVM wants to express is "I know the size, I know the alignment, I know they don't overlap, and I just want good performance and small code size", and in reality the size is probably fairly small and divisible by the word size. We are tying ourselves in knots shoehorning this into memory.copy, which is too general. Thus another couple of stray thoughts.

One thought is that we keep saying that we'll eventually provide higher-level compression for bytecode precisely to capture (say) a sequence of loads and stores, so frankly bytecode size should not actually be much of an issue. (Though (a) that work seems to have stalled, and (b) it doesn't quite address the performance argument, since in principle some sequence of loads and stores can't always be compiled as cleverly as memcpy can be.)

Another thought is that for the desired semantics, we could provide load-multiple and store-multiple instructions (a la 68K and ARM). Load-multiple would load k units onto the eval stack; store-multiple would store k units from it; k is constant. If we provide ascending and descending indexing, the compiler can opt for the bounds-checking guarantees it needs. Pairs of these ops would be easy to recognize in any JIT, and good code could be generated both scheduling-wise and bounds-checking-wise on platforms with decent bounds checking. Bytecode size should be fine. If we want to be conservative, we provide only 32-bit variants initially, limit k to something smallish, and require word alignment of at least the store-multiple pointer (typically requires a run-time check, but a cheap one).
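To make the load-multiple/store-multiple idea concrete, here is a rough sketch in strawman text format. The instruction names and immediates are invented for illustration; nothing like this exists in any current proposal:

```wat
(module
  (memory 1)
  ;; hypothetical strawman instructions (these do NOT exist):
  ;; load k constant-size units ascending from src onto the stack, then
  ;; store them descending to dest, so the highest (first) store
  ;; bounds-checks the top of the destination range before any write.
  (func $copy32 (param $dest i32) (param $src i32)
    local.get $dest
    local.get $src
    i64.load_multiple 4          ;; pushes mem[src], mem[src+8], mem[src+16], mem[src+24]
    i64.store_multiple_desc 4))  ;; pops 4 values, storing to dest+24, dest+16, dest+8, dest
```

Pairing an ascending load group with a descending store group is what would let a partially OOB destination trap before any byte is written.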
I don't think we have this data. The hard part of gathering it is that our OOL `memory.copy` implementation is relatively slow and has some major room for improvement. Comparing unrolled Wasm loads/stores against it at this point is likely to have them outperform it for longer lengths than is reasonable. I'm also not sure at this point if our `memory.copy` is faster than `memcpy` compiled to Wasm. Certainly it is for large sizes, but there's some incidental overhead that we'd like to remove that could cause Wasm memcpy to be faster in some cases right now. This should be fixed in the medium term, of course.
I should try to clarify one point I didn't really explain well. Tweaking LLVM's heuristics sounds like it could definitely help this situation, but I generally don't see the `memory.copy` instruction (as it is specified today) as being a good choice for structure copies and other things with 'short' constant lengths. The reason for this is that we lose alignment and overlap info that we would have to recover at runtime, whereas LLVM could just emit the loads and stores using this information. `memory.copy` is still very useful, but it should focus on larger copies, where it's likely to outperform inline Wasm loads and stores.

There are a number of tweaks that could help here; I think @lukewagner's suggestion makes the most sense to me. If we could go with that, then we should be able to continue allowing LLVM to optimize for size by using memory.copy aggressively. @lars-t-hansen also has a good point that it seems like we may be trying to shoehorn this use-case into `memory.copy` when it doesn't make sense. I'm open to alternatives.
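As a concrete illustration of the information that gets lost, consider an 8-byte struct copy with known 4-byte alignment and no overlap (the function names here are made up for the example):

```wat
(module
  (memory 1)
  ;; what LLVM knows, expressed directly as loads/stores:
  (func $copy_struct (param $dest i32) (param $src i32)
    (i64.store align=4 (local.get $dest)
      (i64.load align=4 (local.get $src))))
  ;; what memory.copy forces the engine to assume: unknown alignment
  ;; and unknown overlap, both recovered at runtime.
  (func $copy_struct_bulk (param $dest i32) (param $src i32)
    (memory.copy (local.get $dest) (local.get $src) (i32.const 8))))
```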
@conrad-watt Ah, thanks for reminding me of the CG slides; I think jetlag interacts poorly with long-term memory creation :-P I think the new information here, not present in that discussion, but which is what we'd expect to hear during the Implementation phase, is that there are non-trivial performance/complexity implications of the different trapping-semantics choices. Also, in the (imho, unlikely) case we add …
Ok, it sounds like making … It sounds like it would be a reasonable expectation in the medium term that …
I would be fine with this. It has the advantage of being a reasonable balance between a neater, potentially less implementation-onerous semantics now, and a still-acceptable semantics in the (unlikely?) case where … If we ran with this idea, we'd effectively be reverting the "up to the last byte" OOB behaviour changes for instantiation and bulk memory.
Okay, I’d like to formally propose changing the trapping behavior here so that we do not partially write in the case of a range that is partially OOB. This would effectively revert the trapping behavior back to what it originally was.

To summarize this thread, this change is motivated by the discovered implementation complexity of trying to generate an inlined version of `memory.copy`. The major concern that motivated the change to allow partial writes was the interaction with a future `memory.protect` instruction. The semantics of …

This would effectively be partial reverts of #50 (for bulk-mem instructions) and #57 (for active segment initialization). If there are no further arguments against this change here, I would like to add this to the next CG meeting agenda.
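For concreteness, a tiny example of where the two behaviors diverge (a sketch; the function name is arbitrary):

```wat
(module
  (memory 1)  ;; one 64 KiB page: addresses [0, 65536) are valid
  (func $demo
    ;; fill [65532, 65540): the last 4 bytes are out of bounds
    (memory.fill (i32.const 65532) (i32.const 0xAB) (i32.const 8))))
```

Under the current byte-wise semantics, bytes 65532..65535 are written with 0xAB before the trap; under the proposed semantics, the bounds check happens up front and nothing is written.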
I'm not sure there's any reason to change the spec proposal for this issue. The issue is that code becomes slower when using bulk memory operations for small fixed-size copies. We can fix this the hard way, by changing the spec and doing a lot of optimization work in all the engines, or the easy way, by simply not using bulk memory operations for small fixed-size copies.
Yes, it's true that we can resolve the immediate issue by changing LLVM to not use the current bulk memory ops for small fixed-size copies. However, it seems like we would be missing a chance to improve the specification before it ships so that it is useful for this use case. I guess it depends on the potential code size and quality improvements we can achieve, but it would be nice to have an instruction that is performant for this use case.
IIUC, …
It should be possible to implement an efficient inline copy by just reverting the partial-write behavior. Apologies, I made a veiled reference to this in the OP but should have made it more explicit.

Essentially, you load all of the src bytes into registers (using SSE/vector registers there should be enough storage for most inline sizes we care about), then store the src bytes from high to low. If any src byte is OOB, you'll trap before writing. If any dest byte is OOB, you'll trap on the first write. This was taken from [1], minus the complicated signal-handling logic for handling partial writes.

Additionally, adding alignment information could help as well, but that's a different spec change that would need discussion.

[1] https://bugzilla.mozilla.org/show_bug.cgi?id=1570112#c23
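A minimal sketch of the same idea expressed at the Wasm level, for a 16-byte copy (the function name is hypothetical; a real engine would do this with vector registers in its JIT-emitted machine code rather than in Wasm):

```wat
(module
  (memory 1)
  (func $copy16 (param $dest i32) (param $src i32)
    (local $lo i64)
    (local $hi i64)
    ;; load the whole source range first: if any src byte is OOB,
    ;; one of these loads traps before anything has been written
    (local.set $lo (i64.load (local.get $src)))
    (local.set $hi (i64.load offset=8 (local.get $src)))
    ;; store high -> low: memory is only ever OOB at the high end, so a
    ;; partially OOB dest makes the first store trap with no writes
    (i64.store offset=8 (local.get $dest) (local.get $hi))
    (i64.store (local.get $dest) (local.get $lo))))
```

Because every source byte is read before any destination byte is written, overlap handling also comes for free.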
The main thing that stuck in my mind after the last CG was the following. Even if we commit to the byte-wise semantics now, I wouldn't be willing to bet that all of today's implementations would still conform to those semantics if …

To me, this suggests that there has to be a point at which we're ok with non-determinism in interrupted …

All this to say, I'm starting to feel that I made a mistake by advocating for the byte-wise semantics, and I'd support going back to the bounds-checked semantics, so long as people could be on board with the underlying copy/write order for bulk instructions being specified as non-deterministic if it ever became observable (for now, it would be sufficient to specify it in the natural byte-by-byte way, since it wouldn't be observable).
One (perhaps minor) disadvantage is that the proposed change destroys one of the properties we tried to establish with the simplifications discussed at the last F2F meeting, namely that every bulk operation is equivalent to a sequence or loop of individual loads/stores. In particular, a memory.copy where src > dst with potential overlap, such that it must copy low to high, can no longer be transformed that way, but requires at least an extra read. This also means that there will probably be a special case in the semantics.
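The equivalence in question, sketched as a Wasm loop (hypothetical helper, ascending direction only): under the byte-wise semantics, a memory.copy with src >= dst behaves exactly like this loop, whereas with up-front bounds checks the loop can partially write before trapping, so the transformation is no longer valid without, e.g., an extra read that touches the last byte first.

```wat
(module
  (memory 1)
  (func $copy_ascending (param $dest i32) (param $src i32) (param $len i32)
    (block $done
      (loop $next
        ;; stop when all bytes have been copied
        (br_if $done (i32.eqz (local.get $len)))
        ;; copy one byte, low to high
        (i32.store8 (local.get $dest) (i32.load8_u (local.get $src)))
        (local.set $dest (i32.add (local.get $dest) (i32.const 1)))
        (local.set $src (i32.add (local.get $src) (i32.const 1)))
        (local.set $len (i32.sub (local.get $len) (i32.const 1)))
        (br $next)))))
```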
Sorry, I was not aware of the prior discussions. I feel that byte-wise semantics would be tricky to implement even if alignment is known. @eqrion I took a quick look at …
I've come around to being in favour of the proposed change.
There is a middle ground that addresses @rossberg's concern while also enabling the proposed optimization. The semantics can be that a trap can occur only if, at some point during the copy, there is an address in the specified range that is not allowed to be written to (by the copying thread). In such a situation, there is no guarantee as to which writes were completed. (And the instruction can only complete successfully if all writes were completed successfully.)

The current proposal (i.e. the proposal before the CG meeting) and the proposed fix (during the CG meeting) are both special cases of this semantics, and both are efforts to determinize a corner case of the semantics. But if we take @conrad-watt's observation that non-determinism is inevitable in this situation, we see that the middle ground here is only non-deterministic (with respect to trapping) in the presence of racy changes to memory validity, and only non-deterministic (with respect to partial writes) in the presence of racy reads or if someone catches the trap. So the middle ground has some non-determinism, but it's essentially only visible to an outside observer.

I'm not particularly advocating for any of the three choices; I just wanted to provide a third option in case it was helpful.
@RossTate, AFAICS, it wouldn't be possible to unroll an instruction with such a semantics into an equivalent sequence of individual accesses (because that fully determines which writes occur before a trap). So I am not sure how this semantics would address the concern I brought up. I am also not sure how it could, technically, be defined in such a way that it actually is observably different from the completely non-deterministic semantics that @conrad-watt mentioned.
If there's no initial bounds check, I agree that @RossTate's suggestion sounds like (#111 (comment)). This would allow unrolling, because it would just be a narrowing of the non-deterministic possibilities (i.e., instead of being allowed to trap anywhere having written "anything", the program now follows the semantics of the unrolled writes). Parroting @lukewagner (#111 (comment)), having an initial bounds check allows the semantics to be deterministic in the vast majority of cases. We don't know what form …
Aside from describing writes with overlapping source and destination, what benefits do the "sequential" semantics have? In case of an out-of-bounds access there would be some data that did not get written, and recovering from it would not be fundamentally different under either sequential or bulk semantics. Non-determinism would not really make that easier, as it would require re-attempting the whole operation, just like in the "bulk" semantics. All three possible semantics would require some clarification for the shared-memory case anyway, with shared-memory semantics being at least somewhat non-deterministic. IMO, having the operation complete or fail as a whole makes the implementation much more transparent.
@penzn, one benefit is that a compiler can safely transform one form into the other and vice versa, e.g., for optimisation purposes, without having to prove the absence of (or account for) tricky corner cases. Being able to reduce a complex operation to a simple one also tends to simplify the language semantics and thus reasoning about programs.
… r=lth

This commit changes all bulk-memory instructions to perform up-front bounds checks and trap if any access would be out-of-bounds before writing. This affects:

* memory.init,copy,fill
* table.init,copy,fill
* data segment instantiation (reduces to memory.init)
* elem segment instantiation (reduces to table.init)

Spec issue: WebAssembly/bulk-memory-operations#111
Differential Revision: https://phabricator.services.mozilla.com/D51755
Spec issue: #111

This commit changes the semantics of bulk-memory instructions to perform an upfront bounds check and trap if any access would be out-of-bounds, without writing. This affects the following:

* memory.init/copy/fill
* table.init/copy (fill requires reftypes)
* data segment init (lowers to memory.init)
* elem segment init (lowers to table.init)
@eqrion, I think this can be closed.
The two primary changes involved are:

1. Removal of `assert_return_canonical_nan`/`assert_return_arithmetic_nan` in favor of special `nan:canonical`/`nan:arithmetic` constants that can only be used in test expectations. See: WebAssembly/spec#1104
2. New trapping behaviour for bulk memory operations. Range checks are now performed up front for operations such as memory.fill and memory.copy. See: WebAssembly/bulk-memory-operations#111 and WebAssembly/bulk-memory-operations#123. The old behaviour is still kept around to support table.fill, which is defined in the reference-types proposal and has yet to be updated.
3. nullref is now permitted in the text and binary format.
As discussed in #1, the expectation is that producers will use `memory.copy|fill` for short lengths in addition to long lengths. We’ve already seen this to be the case, and have been investigating a performance regression resulting from LLVM 9 using `memory.copy` for short constant lengths [1].

Part of that regression is in a sub-optimal OOL call to the system `memmove`, but to really get performance to par, we’d like to inline these short `memory.copy`s to loads and stores. This has turned out challenging to implement in a way that is better than or equal to the Wasm loads and stores that were emitted previously.
There are several problems resulting from the following:

1. We don’t know the alignment of `src`, `dest`.
2. A partially OOB range must write every byte up to the boundary before trapping.
3. We don’t know whether `src`, `dest` overlap.

(1) and (2) are related. Because we don’t know the alignment of `src` or `dest`, we cannot use transfers wider than a single byte at a time (e.g. 32-bit, 64-bit, or 128-bit), or else we’d be at risk of the final store being partially OOB and not writing all bytes up to the boundary due to misalignment of `src` or `dest`. The system `memmove` can work around this by aligning the `src`/`dest` pointers, using wide transfer widths, and fixing up slop afterwards. But this isn’t feasible for inlined code.

The problem with (3) is that we need to generate two sequences of loads and stores: one for when `src < dest`, where we need to copy from high -> low, and another for low -> high. This adds to code size and is a comparison that we didn’t need to do before. This could potentially be solved in a branchless way by using vector registers as a temporary buffer, but that still has difficulty with (1) and (2).

There seem to be several options that could improve this situation.
We could ask LLVM to not emit `memory.copy` in these situations. `memory.copy` is not equivalent to `memcpy`; it’s `memmove`, and has stricter semantics than LLVM requires. For example, with struct copies LLVM should know the alignment and that there is no overlap. Recovering this information at runtime is unfortunate. The downside to this is potential binary size increases, and limiting ourselves to the load and store widths that are defined in Wasm (e.g. no SIMD yet).

We could modify the behavior for partially OOB ranges to not write any bytes at all. This would allow us to load all `src` bytes into vector registers, then store them to `dest` from high -> low. If there is a trap, it will happen immediately and nothing will be written. This fixes (1) and (3) by changing (2). The uncertainty here is around whether this is possible with a future `memory.protect` instruction.

We could add an alignment hint similar to the one used for plain loads and stores. We could then emit loads and stores at the width of the alignment, along with a guard checking the alignment. If the guard fails, we’d need to fall back to a slow path. If the guard succeeds, we’d have a guarantee for (2). This approach still has the problem of (3), and it doesn’t seem like adding an overlap hint would be feasible, due to the complexity of the guard required.
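To illustrate problems (1) and (3) concretely, here is roughly what an inlined 4-byte `memory.copy` has to look like under the current semantics (a sketch with a hypothetical function name; only byte accesses are safe without alignment info, and the direction branch is the extra comparison mentioned above):

```wat
(module
  (memory 1)
  (func $copy4 (param $dest i32) (param $src i32)
    (if (i32.lt_u (local.get $src) (local.get $dest))
      (then ;; possible overlap with src < dest: copy high -> low
        (i32.store8 offset=3 (local.get $dest) (i32.load8_u offset=3 (local.get $src)))
        (i32.store8 offset=2 (local.get $dest) (i32.load8_u offset=2 (local.get $src)))
        (i32.store8 offset=1 (local.get $dest) (i32.load8_u offset=1 (local.get $src)))
        (i32.store8 (local.get $dest) (i32.load8_u (local.get $src))))
      (else ;; copy low -> high
        (i32.store8 (local.get $dest) (i32.load8_u (local.get $src)))
        (i32.store8 offset=1 (local.get $dest) (i32.load8_u offset=1 (local.get $src)))
        (i32.store8 offset=2 (local.get $dest) (i32.load8_u offset=2 (local.get $src)))
        (i32.store8 offset=3 (local.get $dest) (i32.load8_u offset=3 (local.get $src)))))))
```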
cc @lars-t-hansen @julian-seward1 @lukewagner
[1] https://bugzilla.mozilla.org/show_bug.cgi?id=1570112