What about subnormals? #148
Comments
For @jfbastien searchability, I'll say the word "denormals" too :-). But IEEE 754-2008 is about 7 years old now, so it's time to be up to date :-). |
I would argue for IEEE 754 compliance from the beginning. The rationale is that hardware and software both tend toward the standard over time; I think we'd be uncorking a long-term bottle of annoyance and incompatibility by deviating. As for subnormals, the same cycle seems to have repeated multiple times in floating point history: it has been tempting for hardware to cut corners with FTZ or something else, and it has always tended back to full IEEE 754 compliance. Even GPUs are implementing full IEEE now.

Of our tier 1 platforms, the only one I am aware of where subnormals are not implemented at all is Float32x4 on Arm NEON (i.e. SIMD). Float64x2 on NEON is fully IEEE compliant, and scalar arithmetic on Arm is of course compliant. So the subnormal situation comes down to SIMD, specifically the Arm NEON case above.

Based on conversations with hardware designers, microarchitectures, even Arm cores, have gotten so much better at superscalar floating point that it might be acceptable to spec scalar Float32 as IEEE as well as Float32 SIMD operations as IEEE, and then simply not do Float32x4 as SIMD on Arm. But this is something we should measure and motivate when SIMD comes into WebAsm. At that point, if the performance really justifies weakening the spec, we can weaken it; otherwise, it would be hard to tighten up the spec later. |
My objection to that is: which developers care about denormals? Developers usually learn about them when they have stray denormals in their compute kernel and code goes orders of magnitude slower. I'm still looking for someone who wants denormal support. |
If we specify FTZ as a developer-controlled mode, we're not weakening the spec (as in, we're not making semantics any looser); we're giving strictly more power to developers, and this is something that they are specifically asking for (in our discussions with asm.js-using gamedevs). If we consider that wasm will always be run on emerging platforms (which often start w/ terrible denormal perf) and very old platforms, then this will be a consistent feature request, not one that will ever get definitively fixed. Furthermore, from my discussions with Intel, even with new, optimized denormals, they're not equivalent in speed to normal numbers and they're also not optimized for all ops. This is why setting FTZ is standard practice; it's just one less perf cliff to worry about. |
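For readers unfamiliar with what "setting FTZ" looks like in native code, here is a minimal sketch of the common x86 approach, where FTZ and DAZ are bits in the per-thread MXCSR register that applications typically set once at thread startup. This is illustrative only and not part of any wasm proposal; the intrinsics are the standard SSE ones from `<xmmintrin.h>`/`<pmmintrin.h>`.

```cpp
// Illustrative only: how a native x86 application typically enables
// FTZ/DAZ once per thread. MXCSR is per-thread state, so real code runs
// this on every thread that does floating-point work.
#include <xmmintrin.h>  // _MM_SET_FLUSH_ZERO_MODE
#include <pmmintrin.h>  // _MM_SET_DENORMALS_ZERO_MODE

void enable_ftz_daz() {
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);          // flush subnormal results to zero
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);  // treat subnormal inputs as zero
}
```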
Setting DAZ/FTZ should be a global property that's set once for the entire wasm application, though?
|
I would expect it to be a global flag on a wasm module that cannot be toggled and can thus be assumed for AOT/cached compilation. That still leaves questions w/ dynamic linking but I'd default to the simple option of: if you try to dynamically link a wasm module w/ a different flag than you, loading fails. |
I want to see data, and the burden of proof for violating IEEE 754 is very high. I think the phase where we gather sufficient data to motivate a digression is when SIMD comes into the picture. I think mandating FTZ is a no-go, since mode-switching seems to be really expensive.
|
To be clear, it's not nondeterminism that is being discussed: it's (deterministically) flushing denormals to zero. Also, we have had reports (e.g. and these guys iirc) specifically about people hitting denormal perf problems in JS. That setting FTZ (globally, not toggling dynamically) is standard practice for whole domains (games, signal processing) demonstrates that this is something developers expect. |
I agree with titzer. I don’t think there is much long-term value to specifying floating point behavior that is not IEEE. -Filip
|
I think the vast majority of developers that encounter denormals discover them while troubleshooting unexpected performance hits. In the DSP world it tends to be a very common performance gotcha, so I would advocate defaulting to FTZ with an optional global IEEE-compliant mode. |
My preference would be full IEEE 754 support, or barring that, DAZ. Undefined behavior would be a terrible decision in my opinion. Inconsistent semantics make it very difficult to implement algorithms from computational geometry, which require exactly computing things like the sign of a determinant. A common technique to speed up these calculations is to use a floating point filter as a quick check before falling back to a more expensive exact arithmetic test. If the floating point behavior is not specified, then it becomes much more difficult (and in some cases impossible) to construct such a filter. |
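As an aside, here is a minimal sketch of the kind of floating-point filter described above, for the sign of a 2x2 determinant. The error bound follows the style of Shewchuk's adaptive-precision predicates and assumes strict IEEE 754 double arithmetic with round-to-nearest; if an implementation may silently flush subnormals, the bound is no longer trustworthy, which is exactly the point being made. The `exact_det_sign` fallback is a placeholder name invented for this sketch.

```cpp
// Sketch of a floating-point filter for sign(a*d - b*c). Assumes strict
// IEEE 754 double arithmetic; the error-bound constant is Shewchuk-style
// and is not valid if subnormals may be flushed.
#include <cfloat>
#include <cmath>

// Placeholder: a real fallback would use exact arithmetic
// (e.g. floating-point expansions or rational numbers).
static int exact_det_sign(double a, double b, double c, double d) {
    (void)a; (void)b; (void)c; (void)d;
    return 0;
}

int det_sign_filtered(double a, double b, double c, double d) {
    const double u = DBL_EPSILON / 2.0;          // unit roundoff of double
    const double coeff = (3.0 + 16.0 * u) * u;   // Shewchuk-style filter coefficient

    double ad = a * d;
    double bc = b * c;
    double det = ad - bc;
    double errbound = coeff * (std::fabs(ad) + std::fabs(bc));

    if (det > errbound)  return 1;     // rounding error cannot have flipped the sign
    if (det < -errbound) return -1;
    return exact_det_sign(a, b, c, d); // too close to call: fall back to exact arithmetic
}
```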
Here is a discussable proposal which I believe gives most people what they want, though it makes some tradeoffs (as any proposal must): give each function body a floating-point mode attribute, either "standard" (full IEEE 754 subnormal behavior, and the default) or "maybe_flush" (implementations may flush subnormal values to zero).
The following questions seem interesting:

- Is "standard" the right default? Losing Float32x4 on 32-bit NEON by default is not pretty (though compilers and tools could help detect problems and guide developers to solutions). Compiler flags are a nuisance. However, abrupt underflow is also sometimes problematic, and it's non-IEEE, so it's a question of priorities and perhaps also short-term versus long-term.
- Is function-body the right scope for mode switching? It's somewhat fine-grained, but also gives implementations a natural optimization boundary, because dealing with mode changes in the middle of a function is awkward. Inlining can blur such boundaries, but optimizers would at least have the option of declining to do inlining (or other interprocedural optimizations) across boundaries where the modes differ. And implementations might be able to avoid the cost of mode switching across many function boundaries when the mode doesn't actually change.
- Is "maybe_flush" what we want, or would a straight "flush" be better? "maybe_flush" avoids requiring CPUs to have both DAZ and FTZ flags. And, some implementations may wish to stay in "standard" mode in some cases. But it does introduce nondeterminism which could lead to different problems. |
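To make the mode-switching cost discussion concrete, here is a sketch, purely illustrative and not part of the proposal, of how an x86 engine might honor a per-function mode attribute while only touching MXCSR when the caller's and callee's modes actually differ. `FpMode` and `call_with_mode` are names invented for this sketch.

```cpp
// Illustrative sketch: switch FTZ/DAZ at a call boundary only when the
// callee's declared mode differs from the caller's, and restore on return.
// FpMode and call_with_mode are hypothetical names, not proposed spec terms.
#include <xmmintrin.h>
#include <pmmintrin.h>

enum class FpMode { Standard, MaybeFlush };

static void apply_mode(FpMode m) {
    bool flush = (m == FpMode::MaybeFlush);
    _MM_SET_FLUSH_ZERO_MODE(flush ? _MM_FLUSH_ZERO_ON : _MM_FLUSH_ZERO_OFF);
    _MM_SET_DENORMALS_ZERO_MODE(flush ? _MM_DENORMALS_ZERO_ON : _MM_DENORMALS_ZERO_OFF);
}

double call_with_mode(FpMode caller, FpMode callee, double (*body)()) {
    if (caller == callee)
        return body();               // common case: no mode switch at all

    unsigned saved = _mm_getcsr();   // save the caller's control/status register
    apply_mode(callee);
    double result = body();
    _mm_setcsr(saved);               // restore the caller's mode on return
    return result;
}
```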
What is the advantage of hardcoding in the spec that mode switching must happen at function boundaries?
|
This makes perfect sense to me. -Fil
|
Structured mode switching, rather than just arbitrary dynamic mode switching, means that one can always statically determine the mode for any operation, which is an important property. Putting mode switches at function boundaries achieves this, though another option would be to have a mode-switch AST node which would be like a block node but would set the mode within its lexical extent. Between function attributes and AST nodes, I chose function attributes because it gives implementations a few more options for avoiding mode switching costs. However, AST nodes would give applications some more flexibility, so we can consider both choices here. |
This proposal sounds pretty reasonable to me. I prefer the "opt-in" to an opt-out.
|
Should it be an AST node, or a per-operation property? That won't cause code bloat because of the way we specify operations. |
I'd also be fine(r) if this was a SIMD-only feature, where vector operations are the ones allowed to flush subnormals.
|
That sounds OK, though I would stick with IEEE and DontCare. Having an FTZ attribute would mean mode switching in embedded scenarios - like if a native app uses JavaScriptCore.framework and then the JS code goes and loads some wasm (this is something that I think we’ll eventually want to support). The rest of the app will almost certainly be IEEE. -Filip
|
For use cases like games I think DontCare isn't sufficient: folks actually want Fastest. If the HW makes denorms free then great, but otherwise they want FTZ. |
Right, they want Fastest. FTZ won’t be Fastest if you have to mode-switch on every native API boundary. How about rename DontCare to Fastest? The point is: “I care less about the semantics of denorms than I care about how fast my code runs”. -Filip
|
This is sounding pretty convincing. Question: is the only available method of battling this slowdown to enable FTZ, or are there other tricks that people use also? -Filip
|
@pizlonator I was thinking about in-browser use cases that are fully wasm, with little to no JS glue around. Yes, out-of-browser is also a use case to which my argument applies, but I agree we can mostly ignore it for this discussion. |
If you can't enable FTZ (e.g. when doing DSP in a Java VM) there are several tricks you can use, such as testing for and flushing denormal values explicitly, or injecting inaudible noise into filters to keep them out of the denormal range. |
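For concreteness, here is a minimal sketch of the second trick mentioned above (keeping filter state out of the subnormal range by adding a tiny, inaudible offset). The filter and constant are illustrative, not taken from any particular codebase.

```cpp
// Illustrative anti-denormal trick from audio DSP: add a tiny, inaudible
// offset to the recursive state of a one-pole lowpass so that, when the
// input goes silent, the state settles well above the subnormal range
// instead of decaying into it.
#include <cstddef>

void one_pole_lowpass(float* buf, std::size_t n, float coeff = 0.995f) {
    const float kAntiDenormal = 1.0e-18f;  // inaudible, yet ~20 orders of magnitude above FLT_MIN
    float state = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        state = coeff * state + (1.0f - coeff) * buf[i] + kAntiDenormal;
        buf[i] = state;
    }
}
```

A constant offset works for lowpass-style structures; filters with highpass behavior often use an alternating-sign offset or low-level noise instead, since a pure DC component would otherwise be filtered back out.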
To be clear though, do you want a fast-math flag in the style of Java, a fast-math flag in the style of C compilers, or a fast-math flag that just means FTZ? -Filip
|
Thought about this argument more, and I’m no longer so happy with deterministic FTZ even if it’s module-wide. I expect that wasm users will modularize their code; this is the tendency in any language that supports modules: you create separate modules for separate things. A deterministic module-wide FTZ setting will make cross-module calls slower in cases where there is a settings mismatch.

I’m still not convinced about this FTZ thing. Another thing that occurred to me about empirically observed FTZ slow-downs is that they may be due to the presence of denormals changing the convergence characteristics of a numerical fixpoint - that is, the fixpoint may take longer to converge. Of course it’s sad when native code exhibits different behavior in wasm than it did natively, but that ship has already sailed.

I still don’t see evidence that the lack of FTZ prevents people from writing performant code; it feels like a nice-to-have. And having an FTZ setting that sometimes makes fine-grained cross-module calls slow seems broken. -Filip
|
I was thinking more in line with C compilers where IEEE compliance is not guaranteed and denormal support may be disabled. Thinking about it more however, it would probably need to be more explicit. |
What @davidsehr proposed is to formalize a full math model, with more than just control on denormals. If we're doing scoped attributes then allow developers to wrap regions where reassociation is OK, where FP contraction can be done, and so on. I suggest we discuss denormals in this issue, and figure out a wider math model in another issue, potentially punting to post-MVP: figuring out the denormal default matters IMO, but I think we can agree on other math behavior for MVP (essentially, not fast math). |
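To illustrate why reassociation and FP contraction would need to be explicitly scoped rather than applied silently, here is a small self-contained example; the values are chosen only to make the differences visible and imply nothing about proposed wasm semantics.

```cpp
// Both transformations below are value-changing under IEEE 754, which is
// why a wider "fast math" model would make them opt-in and scoped.
#include <cmath>
#include <cstdio>

int main() {
    // Reassociation: (a + b) + c vs. a + (b + c).
    double a = 1e16, b = -1e16, c = 1.0;
    std::printf("(a+b)+c = %g\n", (a + b) + c);   // prints 1
    std::printf("a+(b+c) = %g\n", a + (b + c));   // prints 0 (b+c rounds to -1e16)

    // Contraction: a fused multiply-add rounds once, mul-then-add rounds twice.
    double x = 100000001.0;          // 1e8 + 1, exactly representable
    double y = 10000000200000000.0;  // equals x*x - 1, also exactly representable
    volatile double p = x * x;       // force the product to be rounded separately
    std::printf("round(x*x) - y = %g\n", p - y);               // prints 0
    std::printf("fma(x, x, -y)  = %g\n", std::fma(x, x, -y));  // prints 1
    return 0;
}
```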
@pizlonator I see the theoretical multi-module app situation you're talking about (once we have dynamic linking, that is), but if we have a clear default mode (as a non-normative note in the spec and in llvm-wasm) then 99% of modules will all have that default mode. It would also make sense to issue a console warning when dynamically linking heterogeneous ftz-mode modules. |
I'd argue concerns about the cost of the FTZ switch at module boundaries are also less relevant since the cost of calling out of/into an asm.js module is already elevated in SpiderMonkey (or was the last time I checked, anyhow). The overhead there can be aggressively optimized over time, but you're still going to effectively be transitioning between runtime environments, which means argument values (unless they're ints or floats in registers) are being marshaled into/out of the heap and various other setup is happening. I suspect there will always be some overhead involved here, so the introduction of more in the case of FTZ state mismatch is reasonable given the upside (superior, predictable performance in applications that need FTZ). There are definitely scenarios where people will want to call into/out of wasm a lot, and in those cases we'll want to strongly discourage the use of FTZ. But the same is true for many existing native APIs - IIRC DirectX on Win32 is rather opinionated about x87 modes etc and it's just something game developers deal with. |
@jfbastien To address the need for a full math model, I created #260. |
I think it's silly to design wasm just for games, and to have a module mode flag that masquerades as a performance feature but could cause slowdowns due to mode switching. -Filip
|
How is that different from the myriad of existing performance techniques, though? Tools like PGO can reduce your performance or break your application if guided by bad data/configured incorrectly. You might opt to use a lookup table in a scenario where it's actually more expensive than the computation due to memory characteristics. You might hand-inline some logic into your JavaScript, pushing its size over a threshold and causing some JS engines not to optimize it (FWIW, this can happen in .NET too). In the bad old days on x86, MMX and x87 shared registers, so if you mixed those two you paid an enormous mode switch cost to bounce between them. Threading a performance-sensitive algorithm can reduce performance if it ends up highly contended on a lock or atomic.

There are very few optimizations you can make thoughtlessly that have no chance of hurting performance. Optimization is something that has to be an informed decision. FTZ is the same. AFAIK we're talking about an optional FTZ flag that defaults to off, so the vast majority of developers will be fine with the default and not turn it on. Many of those developers will be turning it on because their native application already had FTZ enabled, so they were paying that cost to begin with.

FWIW my SpiderMonkey example was not to imply that JSC is exactly like SpiderMonkey, but to imply that there will probably be some sort of overhead for JS<->WASM or Module<->Module transitions in most engines (eventually, if not right when the MVP is implemented). The design is already making various performance sacrifices for good reasons. We could always make FTZ an advisory flag so it's spec-compliant for JSC to ignore it, and then we'll find out whether users care or not :-) |
|
I think the larger issue here is that dealing with FTZ modes feels somewhat like a controversial performance feature that could fail to pan out, and the case for it so far is mostly hearsay.
|
Consistent complaints from people who work on realtime audio and multimedia software are not 'hearsay' and FTZ is only controversial if you're talking about wanting to leave it out of an environment to simplify things. Mind you, simplification is a noble goal. But please don't miscategorize an important feature for real-world workloads, heavily used in existing production applications, as a 'controversial performance feature that could fail to pan out.' If it's an advisory flag that people only use if they need it, the only way it would cause us grief is if applications ship with it for measurable real performance gains and then somehow we end up with architectural reasons to regret it later (like because we implemented it wrong). I'm not sure how badly we could mess up a module-wide FTZ flag. If we're concerned, we can punt with an explicit statement that we will 'do it right' post-MVP. |
Exactly. The course that would make me happiest is to do it right post-MVP, and not mention FTZ in the MVP. The downsides of adding FTZ to the MVP in the currently proposed forms are:

- Nondeterministic FTZ flag: it’s nondeterministic, which can lead to divergence between implementations. My own experience with FTZ is that some codes unexpectedly require either the presence of FTZ or the lack of it, because it influences how some numeric fixpoint converges.
- Deterministic FTZ flag: it cannot be polyfilled and we can’t ever kill it. It also raises the bar for how much work is needed to achieve a compliant implementation.

I think I understand your argument in favor of FTZ: it is something that is beneficial to enable per-process in native apps that do audio, and those who do it feel strongly about it. I take it as a given that they feel strongly about it because they know things about this that I don’t. But I also know that it’s not the only way to get good performance in such code - you can chop away the denormals yourself if you really care, and people sometimes do this. This makes me suspect that FTZ may be more of a convenience nice-to-have than a performance showstopper. Also, arguments about the performance of FTZ in native code aren’t directly transferable to wasm, given wasm’s early state.
-Filip |
I expect FTZ isn't something we're going to see broadly across benchmarks; it'll have 0 impact on 99.9% of apps and a 2x slowdown on the .1% of apps that happen to run into slow denormal ops on hot paths. But this is exactly the description of a post-MVP feature, so maybe that is the right path. Starting with less nondeterminism, one less thing to implement, and more polyfill fidelity in v.1 is a good consolation prize.

For now or post-MVP, I had one idea for a refinement on how to define FTZ: give modules a list of global options which are ignored if not known to the browser. Make "FTZ" an optional feature (one that could be permanently not-implemented while still being conforming). Engines which want "fastest" could unconditionally set-and-forget "FTZ". Codes that really want mandatory FTZ could feature-test and, if FTZ wasn't present, take a different code path that did explicit denormal flushing. I wonder if llvm-wasm could even include a flag that did all this automatically (scoped or globally). |
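Here is a sketch of the explicit-flushing fallback path mentioned above, for the case where a feature test reports no FTZ support. The helper names are invented for this sketch.

```cpp
// Illustrative software fallback when no FTZ option is available: scrub
// subnormals out of long-lived recursive state. Doing this once per
// processing block (not per sample) keeps the overhead negligible.
#include <cmath>
#include <cstddef>

inline float flush_subnormal(float x) {
    return (std::fpclassify(x) == FP_SUBNORMAL) ? 0.0f : x;
}

void flush_filter_state(float* state, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        state[i] = flush_subnormal(state[i]);
}
```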
For reference, I dug up a few instances where FTZ & denormals crossed the Web Audio mailing list: |
As discussed today: we'll wait for data before coming to a conclusion. Leave bug #148 open, don't change the FAQ with #260 just yet. Re-discuss when @titzer and @pizlonator can talk over a higher-throughput medium than GitHub issues. |
So until this morning I thought this conversation was about some mode for allowing FTZ, with subnormals by default. And I think we can get data and implement something good, so I wasn't too worried about this. But having FTZ by default would incur a nontrivial penalty to every call across the FFI. If you are very adamant about FTZ, then maybe we should move up the mode switching to an MVP issue, which could allow us to skirt the whole debate about defaults. |
I am currently proposing we fix this with #271. |
#271 is now merged. |
Forking from: #141 (comment)
Should denormal handling be:
- fully specified, following IEEE 754,
- flush-to-zero / denormals-are-zero (FTZ/DAZ), or
- unspecified?
We should probably let ourselves change this based on developer feedback, but I'd like to make some decision for the MVP. I suggested on esdiscuss that JavaScript go full unspecified, and just do DAZ/FTZ because it's often faster. Yes, x86 is better than it used to be, but that's not universal: it ignores other current hardware and doesn't look towards what new hardware will do. I like leaving the door open :-)
For @sunfishcode searchability, I'll use the word "subnormals" too :-)