Make dlmalloc and emmalloc align to max_align_t #10110

Akaricchi · 2019-12-25T14:19:53Z

emmalloc also had a small bug: it could return misaligned pointers if base alignment is greater than 8. I fixed that as well.

Akaricchi · 2019-12-25T15:05:08Z

Looks like the problem with emmalloc lies deeper than I thought.

kripken

In general this looks good, but we do need to estimate the cost of this change, which I'm not sure how best to do.

@juj is working on a large emmalloc refactor so we may want to wait for that to land first.

juj · 2020-01-07T19:24:54Z

Perhaps we can resolve this with expanding to -s MALLOC=dlmalloc-16bytes and -s MALLOC=emmalloc-16bytes options? Then developers can choose (I'd be happy for the default to be dlmalloc-16bytes if it's contentious otherwise).

sbc100

@juj do you have a use case in mind where using the wider/correct alignment would be a significant regression. I'd rather not add more options unless there is a real need.

sbc100 · 2020-01-07T19:52:33Z

tests/malloc_alignment.c

@@ -0,0 +1,21 @@
+/*


Put this file in tests/core?

Akaricchi · 2020-01-08T03:07:41Z

@juj is working on a large emmalloc refactor so we may want to wait for that to land first.

I may just revert the emmalloc changes here and fix dlmalloc only, until the new emmalloc is in.

Required by C and C++ standards. emmalloc change incoming.

kripken · 2020-07-13T23:03:52Z

The underlying issue came up again, so I remembered we didn't resolve this.

It might be simplest to just switch to 16-byte alignment for now, and if there are perf issues, to consider a malloc variation for it. But hopefully that's not an issue.

We should verify at least the havlak and lua benchmarks, which are very malloc heavy, that they don't regress. I can do that if you want @Akaricchi

juj · 2020-07-27T11:29:08Z

@juj do you have a use case in mind where using the wider/correct alignment would be a significant regression. I'd rather not add more options unless there is a real need.

I may be in the minority with the memory layouting concerns.. I can localmod this to 8 bytes in our project to avoid the trouble.

kripken · 2020-07-27T17:50:11Z

Benchmarking this, it regresses Havlak by 5%, and also 1-2% on a few others like lua-binarytrees and poppler. That makes sense I guess since Havlak does tons of tiny allocations, and this increases the effective sizes of those. Given that, I think we maybe want to add this as an option.

One possible way to go is to keep dlmalloc as it is as the default, and add a dlmalloc-16 option. Not sure what to do with emmalloc - as it is focused on code size anyhow and not speed, maybe the speed issue matters less and we can just make the change? On the other hand, keeping emmalloc as it is may be fine too, as if someone needs 16-byte alignment they can use dlmalloc.

trybka · 2020-08-11T17:10:03Z

I wouldn't be opposed to a separate dlmalloc, but having it just do the right thing would be better, IMO.

AFAICT this is blocking some UBSan usage as it triggers in lots of places.
If I patch this in and rebuild they go away.

kripken · 2020-08-12T22:03:05Z

@trybka I think the regression is big enough that we shouldn't do this by default. But if it blocks something like ubsan specifically then perhaps we can enable it by default there (those wouldn't be production builds anyhow). That would add some complexity though.

But we can figure out those details later. I think the first step is to land something for this that is off by default. Specifically, a new variation on dlmalloc, and with no changes to emmalloc for now.

@Akaricchi does that make sense? And would you have time to update this PR for that? (If not I can look into it.)

Akaricchi · 2020-08-14T09:15:45Z

@kripken if you ask me, no, that does not make sense. It does not make any sense to intentionally violate the standard by default in the name of performance. There is a reason why stuff like -ffast-math is not enabled by default (except maybe on braindead compilers like ICC) - it breaks perfectly valid programs. There should be an option, but the non-conformant behavior has to be opt-in, not opt-out. This follows the principle of least surprise, is consistent with other similar options, and still allows programs that don't care about standard compliance to benefit.

The real proper fix would've been to make long double not require a 16-byte alignment. It's completely unnecessary for WASM. But I guess pushing an ABI change onto LLVM is not easy.

But if it blocks something like ubsan specifically then perhaps we can enable it by default there (those wouldn't be production builds anyhow)

This is a horrible idea. By doing this, you're allowing a class of bugs to sneak past debug builds undetected, and possibly make it into production builds. UBSan is supposed to make debugging easier by exposing problems with your code, not harder by masking them. The fact that UBSan complains about the current behavior is in itself evidence that this behavior is wrong.

kripken · 2020-08-14T13:57:26Z

@Akaricchi

Fair points! Especially about UBSan.

The real proper fix would've been to make long double not require a 16-byte alignment.

I think we also need it for v128? If not, and if long double is the only reason, then I think this might be a good enough reason to reconsider our long double policy. We compromised there because we thought the only downside was some extra tooling work for printf of long doubles etc., but if it slows us down by up to 5% on malloc-heavy benchmarks, I think that's the wrong tradeoff.

(except maybe on braindead compilers like ICC)

I don't know that much about ICC, but please let's not be so negative about things. What I mean is, let's talk about other projects in a way that wouldn't insult them if they read what we say. (I think I might be insulted if things were reversed - not in a big way, but it's unnecessary.)

Akaricchi · 2020-08-14T14:19:26Z

I think we also need it for v128?

No, malloc only needs to guarantee sufficient alignment for any of the basic C scalar types. Non-standard stuff like SIMD types aren't taken into account; you're expected to use aligned_alloc or similar for those. I suppose SIMD-compatible malloc() alignment sounds like a nice property to have, but portable code shouldn't rely on such assumptions anyway.

Also, just to clarify (as there's been minor confusion over this in another discussion before): long double can keep the 16-byte size, it's just the alignment that needs to be reduced. There shouldn't be any problem with that, since it's not even a native wasm type. It's not unprecedented either, e.g. on the 32-bit x86 LLVM target long double is 12 bytes but aligned to only 4.

Akaricchi · 2020-08-14T14:41:18Z

In any case, aligning to max_align_t by default seems like the right thing to do, whether we keep max_align_t alignment 16 bytes or not. I'll see about updating this PR later today or maybe tomorrow. The dlmalloc change is trivial, but IIRC I didn't quite get emmalloc to work correctly, because of some intrinsic assumptions about 8-byte alignment in the code. I think emmalloc has been rewritten since I last touched this, though, so that might be fixed. I'll see what I can do.

kripken · 2020-08-14T15:26:35Z

@Akaricchi I see, thanks!

Given that, I think we should reduce the alignment of long doubles in the LLVM target. That seems like the best option by far. But let's see what LLVM people think, cc @dschuff @tlively @aheejin @sbc100

Separately from that, making them align to max_align_t sounds like the right thing to do, I agree.

tlively · 2020-08-14T18:11:09Z

Changing long double alignment to 8 sounds fine to me, but I'm not sure how we feel about changing our published ABI like that. @dschuff, what do you think? If we can't make this ABI change now, we could tack it on to a future multivalue ABI.

dschuff · 2020-08-14T18:50:41Z

I agree that we shouldn't keep malloc's alignment as less than max_align_t, for the stated reasons. I don't even particularly like the idea of offering a convenient option for users to ask the compiler to break them by opting into the current behavior (which seems to be what happened in #11544).

On the face of it, aligning long doubles to 8 seems ok, since they are software-implemented there's no particular need for them to be further aligned. As @tlively says, that would be an ABI break though.
What would the consequences be? Primarily it could change the layout of any struct containing long doubles. It wouldn't change the calling conventions. Anything else I'm missing?

As a side note, if we are going to be offering compiler flags, maybe it makes more sense to have one that changes the alignment of long double (e.g. like the x86 -malign-double flag). This is still in some sense an invitation to be broken by ABI differences, but at least it's more similar to flags that already exist.

sunfishcode · 2020-08-18T13:06:53Z

One of the considerations when this came up before was simd128. It doesn't require 16-byte alignment for correctness, but many hardware architectures have slowdowns when operating on misaligned data. Does Emscripten's benchmark suite include any SIMD benchmarks?

tlively · 2020-08-18T17:16:27Z

One of the considerations when this came up before was simd128. It doesn't require 16-byte alignment for correctness, but many hardware architectures have slowdowns when operating on misaligned data. Does Emscripten's benchmark suite include any SIMD benchmarks?

No, we don't have any real-world SIMD benchmarks in the repo. @Akaricchi was saying above that max_align_t only considers standard C types and that SIMD programmers should use aligned_alloc if they need to allocate space for vectors. I don't know how often that is done in practice, but I wouldn't have a problem making this change and sticking a note about it in the SIMD docs.

kripken · 2020-08-18T17:40:03Z

Documenting this for SIMD makes sense I think. If we go that route, we should probably have a (default-off) build option to increase the default alignment of malloc (so that people have a way that doesn't require changing lots of code).

tlively · 2020-08-18T18:34:21Z

Documenting this for SIMD makes sense I think. If we go that route, we should probably have a (default-off) build option to increase the default alignment of malloc (so that people have a way that doesn't require changing lots of code).

Considering this change is standards compliant, I don't think folks will have to change much code. And even if they do, I don't think this is worth adding an option for.

kripken · 2020-08-18T18:41:48Z

I don't feel strongly either way, but I am a little worried that while this would be standards-compliant, most platforms have max_align_t == 16 anyhow, so wasm would be the exception that hits SIMD unalignment issues.

Akaricchi · 2020-08-18T22:58:40Z

most platforms have max_align_t == 16 anyhow, so wasm would be the exception that hits SIMD unalignment issues.

max_align_t == 8 isn't that uncommon, especially for 32-bit targets. 32-bit x86 and ARM have it under GCC and clang. Also, emscripten has always had a 8-bytes aligning malloc, so nothing really changes for SIMD code here?

kripken · 2020-08-18T23:12:18Z

@Akaricchi Yeah, you're right, I wasn't being precise I guess. I was thinking of 64-bit platforms on the desktop mostly.

It's true nothing changes for SIMD code in emscripten, but I worry about new codebases being ported. But your point is valid that maybe that risk is low.

So maybe we'll never need an option for 16. We can change to 8, and wait to see if there's a need I guess.

Akaricchi · 2020-08-18T23:13:54Z

FWIW a comment about the alignment of malloc in the documentation definitely wouldn't hurt.

sbc100 · 2020-08-18T23:38:53Z

most platforms have max_align_t == 16 anyhow, so wasm would be the exception that hits SIMD unalignment issues.

max_align_t == 8 isn't that uncommon, especially for 32-bit targets. 32-bit x86 and ARM have it under GCC and clang. Also, emscripten has always had a 8-bytes aligning malloc, so nothing really changes for SIMD code here?

Hmm, are you sure? That contradicts my experiments, but maybe I am not measuring the correct thing:

$ clang -m32 -x c /dev/null -dM -E  -o - | grep ALIGN
#define __BIGGEST_ALIGNMENT__ 16
$ clang -m64 -x c /dev/null -dM -E  -o - | grep ALIGN
#define __BIGGEST_ALIGNMENT__ 16
gcc -m64 -x c /dev/null -dM -E  -o - | grep ALIGN
#define __BIGGEST_ALIGNMENT__ 16
gcc -m32 -x c /dev/null -dM -E  -o - | grep ALIGN
#define __BIGGEST_ALIGNMENT__ 16

Is this different from the alignment of max_align_t?

Akaricchi · 2020-08-19T08:33:23Z

@sbc100 I'm not sure what __BIGGEST_ALIGNMENT__ is for, but it is indeed different from alignof(max_align_t): https://godbolt.org/z/5scPa7

sbc100 · 2020-09-15T19:05:55Z

tests/test_core.py

+    self.set_setting('MALLOC', malloc)
+    self.emcc_args += ['-std=gnu11']
+    src = open(path_from_root('tests', 'core', 'test_malloc_alignment.c')).read()
+    self.do_run(src, '', basename='src.c')


These days you can just write self.do_runf(path_from_root('tests', 'core', 'test_malloc_alignment.c'))

aardappel · 2020-09-15T20:55:41Z

Just going to throw my uninformed opinion in here and say that allocators that align to <16 in the days of SIMD and 64-bit seems like a really bad idea.

In a lot of code it may be unpractical to guarantee all memory you use comes from aligned_alloc or whatever, with SIMD values inline in all sorts of objects. Having the allocator do this right by default would save a lot of trouble. I know everyone can use their own allocator anyway, but it be a good default to have.

I've spent waaaay too many unnecessary cycles on various game engines crashing on win 32-bit 8-byte aligned SIMD values in a past life. :)

dschuff · 2020-09-17T21:24:53Z

@aardappel Do you think that would still be an issue (e.g. in game engines) if your past life was now? Alignment issues aren't what they used to be; e.g. a very concrete example, on Intel the movdqa instruction (which faults on unaligned accesses) used to be noticeably faster than movdqu (the unaligned version) so it was in common use. On recent hardware there isn't a performance difference and the recommendation is just to always use movdqu. I know less about the situation on ARM, but when we originally discussed how wasm was supposed to treat alignment (i.e. unaligned accesses are always allowed) the reasoning was that on the vast majority of interesting hardware it wasn't a significant issue.
That reasoning mostly applied to 8-byte types of course, I wonder if 16-byte is different. How many commonly-used SIMD instructions on e.g. ARM/x86 will fault?
/cc @ngzhian

aardappel · 2020-09-17T22:17:52Z

@dschuff those exact issues wouldn't be a problem anymore since we wouldn't ship a 32-bit Windows binary anymore :) But if emscripten defaults to an 8-byte aligned allocator and a SIMD value on such a boundary is way slower (in the worst case a trap that has to be caught), then the problem remains the same.

In a game engine, it is common that things like positions, directions, colors, velocities of things are types that underneath become SIMD types. These kinds of values are put in all sorts of objects, and objects inside other objects etc, to the point that knowing which objects requires 16-bit alignment becomes very opaque. Mix in there are variety of allocators and ways objects are allocated, makes any 8-byte alignment a minefield.

In some sense it is worse in Wasm because some things just being slower is even harder to figure out. Your game could run 10% faster if you just changed the allocator, but you have no way of knowing, and you just shrug and assume it's normal.

ngzhian · 2020-09-18T02:47:47Z

Re instructions that will fault:
For x86, depends on whether the instruction is SSE or AVX.
SSE: instructions fault on unaligned memory access unless explicitly allowed (e.g. movdqu)
AVX: instructions can access unaligned, unless explicitly required (e.g. movdqa)
For more details see Intel SDM Vol. 1 Page 14-33, Table 14-21. to 14-23.

ARMv7 and ARMv8 (both 32 and 64) supports unaligned loads and stores of SIMD register (depends on SCTLR bit that can be controlled).

arasouli91 · 2021-06-15T03:45:34Z

Is there any hope that this can get finished? Been waiting a year unable to build my project.

@Akaricchi