This repository was archived by the owner on Nov 3, 2021. It is now read-only.

Expected to be used for large sizes? #1

sunfishcode opened this issue Sep 8, 2017 · 14 comments
@sunfishcode (Member)

The tracking issue for this feature says

> We expect that WebAssembly producers will use these operations when the region
> size is known to be large, and will use loads/stores otherwise.

I don't see this mentioned in the Overview.md. Is this still an expectation?
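For concreteness, the quoted expectation distinguishes two lowerings a producer could emit, sketched below in WebAssembly text format using the `memory.copy` name the proposal later settled on (the locals and the 8-byte size are illustrative, not from the tracking issue):

```wat
;; Large or statically-unknown region: a single bulk instruction.
(memory.copy (local.get $dst) (local.get $src) (local.get $len))

;; Small region of known size (here 8 bytes): plain loads/stores instead.
(i64.store (local.get $dst) (i64.load (local.get $src)))
```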

@binji (Member)

binji commented Sep 8, 2017

My thought is that even if we tell folks to use it for large regions, they'll use it for small ones too, so we'll have to handle that anyway. I think @lukewagner originally suggested that the size have page units to prevent that. Is it worth it though? What's the cost to the VM to have to handle small regions?

@lukewagner (Member)

The benefit I see for clamping to page sizes is that we remove any expectation that the wasm engine might optimize move_memory/set_memory by doing either of:

- using constant-propagation to see if the size is constant and, if so, inlining something fast
- some sort of IC to make tiny cases super-fast (i.e., not calling out to libc)

which lets engines compile move_memory to a call to libc memmove and be done with it.

@jfbastien (Member)

Wouldn't that remove the binary size saving?

@lukewagner (Member)

That's an interesting point, but I wasn't aware that this feature was expected to reduce binary sizes by any significant amount in any case. It would certainly change the nature of the feature (and what engines needed to do) if move_memory was used aggressively for this purpose.

@titzer

titzer commented Sep 25, 2017

I think clamping to page sizes would cripple this feature and result in a proliferation of user code that tries to divide original requests into a page-multiple-sized chunk followed by cleanup code. That's a classic abstraction inversion.
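The chunk-plus-cleanup pattern being described might look like the following sketch, assuming a hypothetical page-quanta `move_memory` whose size operand counts 64 KiB pages (all names and the encoding are illustrative, not part of the proposal):

```wat
;; Bulk-copy the whole 64 KiB pages in one instruction...
(move_memory (local.get $dst) (local.get $src)
             (i32.shr_u (local.get $len) (i32.const 16)))
;; ...then the producer must emit its own cleanup loop for the remainder
;; (assume $dst/$src have already been advanced past the whole-page portion).
(local.set $tail (i32.and (local.get $len) (i32.const 0xFFFF)))
(block $done
  (loop $next
    (br_if $done (i32.eqz (local.get $tail)))
    (i32.store8 (local.get $dst) (i32.load8_u (local.get $src)))
    (local.set $dst (i32.add (local.get $dst) (i32.const 1)))
    (local.set $src (i32.add (local.get $src) (i32.const 1)))
    (local.set $tail (i32.sub (local.get $tail) (i32.const 1)))
    (br $next)))
```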

@lukewagner (Member)

Why would there not be a single implementation of memcpy in libc? In general, we haven't used "toolchains will have to implement" as an argument to include things in wasm (e.g., trig).

@binji (Member)

binji commented Oct 26, 2017

Just coming back to this...

It seems like the wasm page size is a bit too large a granularity -- the microbenchmark shows benefits for sizes < 64 KiB.

> In general, we haven't used "toolchains will have to implement" as an argument to include things in wasm (e.g., trig).

True, though we also seem to have assumed a mostly symbiotic relationship with producers, where they'll produce good code so the VM doesn't have to perform complex optimizations. I think it's reasonable to assume the same here -- if we give guidelines for the producer (TBD 😉) then can the VM assume that it isn't going to have to optimize a constant 4 byte memcpy that should have just been a load/store pair?
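The degenerate case in question, sketched in wat (illustrative names; `memory.copy` is the name the proposal later adopted):

```wat
;; A constant 4-byte copy expressed with the bulk operation...
(memory.copy (local.get $dst) (local.get $src) (i32.const 4))
;; ...versus the load/store pair the producer could have emitted directly:
(i32.store (local.get $dst) (i32.load (local.get $src)))
```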

@jfbastien (Member)

> if we give guidelines for the producer (TBD 😉) then can the VM assume that it isn't going to have to optimize a constant 4 byte memcpy that should have just been a load/store pair?

I'd hope so.

@julian-seward1

> I think clamping to page sizes would cripple this feature and result in a proliferation of user code that tries to divide original requests into a page-multiple-sized chunk followed by cleanup code.

I agree. I think it would be safer to leave it at the byte granularity, assume that these operations will get used at both big and small sizes, and leave it to implementations to decide how (if at all) they want to optimise the small-size cases.

@lukewagner (Member)

But in practice, if every wasm engine doesn't reliably optimize the small-constant-size case (which, from what I understand, is very commonly used) there will be a significant perf cliff which will require the toolchain (to provide reliable perf to its users) to do the lowering to loads anyway. With page-size-quanta, the responsibility for who does what is clear.

I don't see how this cripples the feature since this is an advanced optimization emitted only in special cases by compilers, not something anyone writes by hand in the source language.

@julian-seward1

Well, it will force producers to emit sequences that mix calls and inline code, which will be verbose and also inherently not optimised for more than one target processor (how do you make the unroll vs vectorise vs unroll-and-vectorise vs call-out tradeoffs if you don't know what you're running on?). I'm also not convinced that the small-constant-size case is uncommon: when profiling natively compiled Rust, I frequently see many small memcpy/memmove calls.

I do understand what you're getting at, though. Would it be feasible and/or helpful to add an advisory section to the spec that states a minimum set of copy/fill cases that an optimising Wasm implementation can reasonably be expected to do well, inline? That is to say, add some kind of quasi performance-guarantee to the contract?

@lukewagner (Member)

Yeah, I suppose a non-normative note that states the contract, even if informally, could effectively make it the browser's "fault" if they didn't optimize appropriately, so producers could feel confident in always emitting mem.copy/mem.set.

Also, thinking more about what a producer would need to do to optimally use a page-quanta mem.copy/mem.set, it does seem suboptimal. In particular, if we use the existing wasm 64 KiB page size, that means up to (128 KiB - 2) bytes of suboptimal copying per call, possibly significantly suboptimal if the producer doesn't do the extra work to use 64-bit copies (and, later, 128-bit). If we use a non-wasm-page-size (< 64 KiB) quantum, it'll feel rather arbitrary and probably look increasingly silly as CPUs evolve. Also, a fully-optimized memcpy wasm impl might cost a few hundred bytes, which adds to the fixed runtime overhead we'd generally like to avoid for webby use cases.

I'm fine with byte-granularity, then.

@jfbastien (Member)

@lukewagner would you rather also have an alignment hint, so you can do fancy stuff on top?

@lukewagner (Member)

If you're talking about "page-aligned" hint, I don't think it would help (the case browsers would have to specially-optimize is when the size was small and constant; for all others we'd just call out to the libc memmove).

Or perhaps you mean 1/2/4/8/16-byte alignment? When coupled with a constant size, such that the engine is inlining a straight sequence of load/stores, I guess I could see this being useful for the same reason that the alignment hint is present on scalar loads/stores, but that is a separate point from the one I made/rescinded above.
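For example, a constant-size copy carrying a hypothetical 8-byte alignment hint could let the engine inline aligned wide accesses, analogous to the `align` immediate on scalar loads/stores (sketch only; such a hint is not part of the proposal):

```wat
;; Inlined 16-byte copy, assuming both pointers are 8-byte aligned:
(i64.store align=8 (local.get $dst)
           (i64.load align=8 (local.get $src)))
(i64.store offset=8 align=8 (local.get $dst)
           (i64.load offset=8 align=8 (local.get $src)))
```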
