Fix layout of non-power-of-two length vectors #422

calebzulawski · 2024-06-03T21:59:04Z

Fixes #63, fixes #319

programmerjake

nice! once the test failures are fixed, feel free to merge

calebzulawski · 2024-06-04T00:39:18Z

Is this failure a codegen problem? A handful of architectures work (unfortunately I can't replicate right now, I'm on mac where it passes)

programmerjake · 2024-06-04T01:46:07Z

looks like aarch64 cross failed due to OOM...

calebzulawski · 2024-06-04T01:48:56Z

I guess testing every length is excessive, I'll reduce it

programmerjake · 2024-06-04T01:52:08Z

> looks like aarch64 cross failed due to OOM...

it could be an aarch64 backend bug in llvm, since so far only aarch64 linux/mac has OOM-ed.

programmerjake · 2024-06-04T01:55:08Z

> I guess testing every length is excessive, I'll reduce it

iirc I tried to pick lengths that are around powers of 2 and around 3 * powers of 2...so like 15,16,17,23,24,25,31,32,33...
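A minimal sketch of that selection rule, assuming "around" means within one lane of 2^k or 3 * 2^k. The function name `interesting_lengths` and the `max` cutoff are hypothetical and not taken from the test suite:

```rust
/// Lane counts within one of a power of two or of three times a power of two,
/// limited to the range 1..=max. Illustrative helper, not code from this PR.
fn interesting_lengths(max: usize) -> Vec<usize> {
    let mut lengths = Vec::new();
    let mut pow = 1usize;
    while pow <= max {
        for center in [pow, 3 * pow] {
            // center is always >= 1, so center - 1 cannot underflow.
            for len in [center - 1, center, center + 1] {
                if len >= 1 && len <= max {
                    lengths.push(len);
                }
            }
        }
        pow *= 2;
    }
    // Neighbouring centers overlap (e.g. 2 + 1 == 3), so sort and dedup.
    lengths.sort_unstable();
    lengths.dedup();
    lengths
}
```

Calling `interesting_lengths(64)` yields a list that includes 15, 16, 17, 23, 24, 25, 31, 32 and 33 while skipping lengths such as 20 that sit far from both kinds of boundary.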

programmerjake · 2024-06-04T09:12:08Z

well, looks like powerpc-unknown-linux-gnu has a non-trivial failure (not OOM):
https://github.com/rust-lang/portable-simd/actions/runs/9359687687/job/25763803238?pr=422

programmerjake · 2024-06-05T21:15:28Z

if you can minimize the bugs you encountered with aarch64 and powerpc, I think submitting a bug report to LLVM would be good!

workingjubilee · 2024-06-06T01:25:08Z

The PowerPC errors are genuine.

workingjubilee · 2024-06-06T01:25:44Z

I don't know if they're actually incorrect, however, as they are likely to be an endianness problem.

calebzulawski · 2024-06-06T01:30:42Z

There are basically 3 different classes of errors here:

1. random crashes, I think this is OOM etc due to too many tests, reduced by the second commit
2. failures on most (but not all) architectures for bitmask vectors for non-powers-of-two. I'm not sure if this is llvm or rustc, but I worked around it by extending to powers of two
3. the powerpc bitmask vector error. this is with the workaround, looks like endianness, but the code does account for endianness.

All of these should be "fixed" now, since we've removed the bitmask vectors. I am curious what was causing the second error but not sure I'll get the chance to look into it yet

.github/workflows/ci.yml

programmerjake · 2024-06-06T02:47:44Z

(I attempted to quote reply, but accidentally edited your comment instead, sorry. replied this time)

> There are basically 3 different classes of errors here:
>
> random crashes, I think this is OOM etc due to too many tests, reduced by the second commit

we're still getting SIGKILL on aarch64 -- it could be too many tests, it could also be an excessive memory usage bug for non-excessive input code with weird vector lengths in the aarch64 llvm backend.

calebzulawski · 2024-06-06T02:56:01Z

I'm seeing that too. I can replicate it if I build for aarch64. It seems to be an infinite loop. Even building with --emit=llvm-ir I can't get it to complete (the tests that fail are cast, u8_ops, and i8_ops).
I do have a stack trace I was able to extract:

314.44 Gc  100.0%	-	 	rustc (22655)
314.44 Gc  100.0%	-	 	 thread_start
314.44 Gc  100.0%	-	 	  _pthread_start
314.44 Gc  100.0%	-	 	   std::sys::pal::unix::thread::Thread::new::thread_start::h3d442a96f4a94842
314.44 Gc  100.0%	-	 	    _RNSNvYNCINvMNtCsemj25UseQJj_3std6threadNtBa_7Builder16spawn_unchecked_NCINvXs0_CsfrAtrMCWRw_18rustc_codegen_llvmNtB1f_18LlvmCodegenBackendNtNtNtCsdjrb6H688DD_17rustc_codegen_ssa6traits7backend19ExtraBackendMethods18spawn_named_threadNCINvNtNtB2i_4back5write10spawn_workB1M_E0uE0uEs0_0INtNtNtCs9mSAhCB19GO_4core3ops8function6FnOnceuE9call_once6vtableB1f_
314.44 Gc  100.0%	-	 	     _RINvNtNtCsemj25UseQJj_3std10sys_common9backtrace28___rust_begin_short_backtraceNCINvXs0_CsfrAtrMCWRw_18rustc_codegen_llvmNtB1o_18LlvmCodegenBackendNtNtNtCsdjrb6H688DD_17rustc_codegen_ssa6traits7backend19ExtraBackendMethods18spawn_named_threadNCINvNtNtB2r_4back5write10spawn_workB1V_E0uE0uEB1o_
314.44 Gc  100.0%	-	 	      _RINvNtNtCsdjrb6H688DD_17rustc_codegen_ssa4back5write24finish_intra_module_workNtCsfrAtrMCWRw_18rustc_codegen_llvm18LlvmCodegenBackendEB1g_
314.44 Gc  100.0%	-	 	       _RNvNtNtCsfrAtrMCWRw_18rustc_codegen_llvm4back5write7codegen
314.44 Gc  100.0%	-	 	        _RNvNtNtCsfrAtrMCWRw_18rustc_codegen_llvm4back5write17write_output_file
314.44 Gc  100.0%	-	 	         LLVMRustWriteOutputFile
314.44 Gc  100.0%	-	 	          llvm::legacy::PassManagerImpl::run(llvm::Module&)
314.44 Gc  100.0%	-	 	           llvm::FPPassManager::runOnModule(llvm::Module&)
314.44 Gc  100.0%	-	 	            llvm::FPPassManager::runOnFunction(llvm::Function&)
314.44 Gc  100.0%	-	 	             llvm::MachineFunctionPass::runOnFunction(llvm::Function&)
314.44 Gc  100.0%	-	 	              llvm::Legalizer::runOnMachineFunction(llvm::MachineFunction&)
314.19 Gc   99.9%	7.12 Gc	 	               llvm::Legalizer::legalizeMachineFunction(llvm::MachineFunction&, llvm::LegalizerInfo const&, llvm::ArrayRef<llvm::GISelChangeObserver*>, llvm::LostDebugLocObserver&, llvm::MachineIRBuilder&, llvm::GISelKnownBits*)
114.05 Gc   36.2%	3.68 Gc	 	                llvm::LegalizerHelper::moreElementsVector(llvm::MachineInstr&, unsigned int, llvm::LLT)
107.85 Gc   34.2%	9.40 Gc	 	                llvm::LegalizationArtifactCombiner::tryCombineInstruction(llvm::MachineInstr&, llvm::SmallVectorImpl<llvm::MachineInstr*>&, llvm::GISelObserverWrapper&)
50.57 Gc   16.0%	2.71 Gc	 	                llvm::eraseInstrs(llvm::ArrayRef<llvm::MachineInstr*>, llvm::MachineRegisterInfo&, llvm::LostDebugLocObserver*)
17.07 Gc    5.4%	1.52 Gc	 	                llvm::LegalizerHelper::legalizeInstrStep(llvm::MachineInstr&, llvm::LostDebugLocObserver&)
13.21 Gc    4.2%	7.18 Gc	 	                llvm::isTriviallyDead(llvm::MachineInstr const&, llvm::MachineRegisterInfo const&)
1.82 Gc    0.5%	1.09 Gc	 	                llvm::LostDebugLocObserver::checkpoint(bool)
1.01 Gc    0.3%	1.01 Gc	 	                llvm::detail::DenseMapPair<llvm::MachineInstr*, unsigned int>* llvm::DenseMapBase<llvm::DenseMap<llvm::MachineInstr*, unsigned int, llvm::DenseMapInfo<llvm::MachineInstr*, void>, llvm::detail::DenseMapPair<llvm::MachineInstr*, unsigned int> >, llvm::MachineInstr*, unsigned int, llvm::DenseMapInfo<llvm::MachineInstr*, void>, llvm::detail::DenseMapPair<llvm::MachineInstr*, unsigned int> >::InsertIntoBucket<llvm::MachineInstr* const&, unsigned long>(llvm::detail::DenseMapPair<llvm::MachineInstr*, unsigned int>*, llvm::MachineInstr* const&, unsigned long&&)
537.93 Mc    0.1%	537.93 Mc	 	                free
408.48 Mc    0.1%	-	 	                0xfffffffffffffffe
363.53 Mc    0.1%	-	 	                llvm::SmallVectorBase<unsigned int>::grow_pod(void*, unsigned long, unsigned long)
124.90 Mc    0.0%	124.90 Mc	 	                default_zone_free_definite_size
31.51 Mc    0.0%	31.51 Mc	 	                llvm::saveUsesAndErase(llvm::MachineInstr&, llvm::MachineRegisterInfo&, llvm::LostDebugLocObserver*, llvm::GISelWorkList<4u>&)
10.62 Mc    0.0%	10.62 Mc	 	                llvm::allocate_buffer(unsigned long, unsigned long)
6.45 Mc    0.0%	6.45 Mc	 	                DYLD-STUB$$operator new(unsigned long)
6.20 Mc    0.0%	6.20 Mc	 	                std::__1::__tree<llvm::DebugLoc, std::__1::less<llvm::DebugLoc>, std::__1::allocator<llvm::DebugLoc> >::destroy(std::__1::__tree_node<llvm::DebugLoc, void*>*)
2.02 Mc    0.0%	2.02 Mc	 	                operator delete(void*)
1.07 Mc    0.0%	1.07 Mc	 	                DYLD-STUB$$free
31.51 Kc    0.0%	31.51 Kc	 	                DYLD-STUB$$free
15.15 Kc    0.0%	15.15 Kc	 	                DYLD-STUB$$operator delete(void*)
223.93 Mc    0.0%	-	 	               0xfffffffffffffffe
30.24 Mc    0.0%	30.24 Mc	 	               llvm::LegalizerHelper::legalizeInstrStep(llvm::MachineInstr&, llvm::LostDebugLocObserver&)

programmerjake · 2024-06-06T03:13:29Z

(apparently I can't click in the right spot today, since I edited your comment again)

> I'm seeing that too. I can replicate it if I build for aarch64. It seems to be an infinite loop. Even building with --emit=llvm-ir I can't get it to complete (the tests that fail are cast, u8_ops, and i8_ops).

can you get llvm-ir with --emit=llvm-ir -O -C no-prepopulate-passes -C codegen-units=1?
since once you have llvm ir, it may be easier to try to reduce it to a minimal llvm test case. rustc may even be generating invalid llvm ir.

calebzulawski · 2024-06-06T05:07:12Z

Looks like only -0 was necessary to get it to emit LLVM. I played around with using the pass arguments to opt but it doesn't seem to accept the flags rust is emitting from -Zprint-llvm-passes, I'm probably doing something wrong

programmerjake · 2024-06-06T05:24:04Z

> Looks like only -0 was necessary to get it to emit LLVM.

I think you meant -O (letter O, not zero)

> I played around with using the pass arguments to opt but it doesn't seem to accept the flags rust is emitting from -Zprint-llvm-passes, I'm probably doing something wrong

try running opt with just the input file and --verify, this will run LLVM's module verification pass which will tell you if you gave it invalid LLVM IR. if that passes, you can also try running opt with -O2 --verify-each, which runs opt's default optimization pipeline and verifies the LLVM IR after every pass.

if the llvm-ir is small enough, it would be great if you would put it in llvm.godbolt.org and share it here, which would let us try and figure it out without having to compile everything locally, plus Compiler Explorer has nice features for showing what changed in which passes (technically you can do that with the command line, but the website is much more friendly).

calebzulawski · 2024-06-06T13:13:14Z

So it looks like the problem is only with opt-level=0, 1+ works fine, so the problem is probably more related to lowering/isel than optimizations. This is the closest I've come to replicating it, though I'm not 100% sure it's the same cause: https://llvm.godbolt.org/z/c8xGsaqdn

programmerjake · 2024-06-06T22:13:29Z

ok, I reduced the problem: https://llvm.godbolt.org/z/1Ehsh97nP

programmerjake · 2024-06-06T22:43:34Z

so, maybe add a workaround for fp to int only on aarch64 that expands element count to the next power of two? idk if that will fix it, the backend bug may also occur for other ops.
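To illustrate the idea of widening to the next power of two, here is a rough sketch on plain arrays. It is not the workaround from this PR, which operates on `Simd` types. The name `cast_padded` and the explicit `P` parameter are assumptions made for the example:

```rust
/// Saturating f32 -> i32 conversion of N lanes, carried out on a buffer of P lanes.
/// P is expected to be N rounded up to a power of two; the extra lanes hold 0.0
/// and are discarded afterwards. Illustrative only.
fn cast_padded<const N: usize, const P: usize>(v: [f32; N]) -> [i32; N] {
    assert!(P >= N && P.is_power_of_two());
    let mut wide = [0.0f32; P];
    wide[..N].copy_from_slice(&v);
    // In Rust, `as` from f32 to i32 saturates at the integer bounds and maps NaN to 0.
    let cast: [i32; P] = wide.map(|x| x as i32);
    let mut out = [0i32; N];
    out.copy_from_slice(&cast[..N]);
    out
}
```

For a 3-lane input this would be called as `cast_padded::<3, 4>(v)`, so the conversion itself only ever sees a 4-lane value.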

calebzulawski · 2024-06-07T01:12:47Z

That did fix it, there was also another codegen bug--Rem has a failure only for i8 and u8, non-powers-of-two, when the second argument is all 0s (it works fine for not all zeros, only the rem_zero_panic test fails)

RalfJung · 2024-06-08T15:01:37Z

crates/core_simd/src/vector.rs

@@ -99,7 +99,7 @@ use crate::simd::{
 // directly constructing an instance of the type (i.e. `let vector = Simd(array)`) should be
 // avoided, as it will likely become illegal on `#[repr(simd)]` structs in the future. It also
 // causes rustc to emit illegal LLVM IR in some cases.
-#[repr(simd)]
+#[repr(simd, packed)]


What is the plan for simd without packed? Miri currently ICEs when such a type is used with a simd intrinsic and the size is not a power of 2. If portable-simd doesn't need support for that then do we need to have it at all? Can we just make simd itself have the behavior that simd, packed now has?

programmerjake · 2024-06-09

simd without packed is used by stdarch. portable-simd might use it in the future too, though imo probably won't.

RalfJung · 2024-06-09

AFAIK stdarch only uses power-of-2 vectors, where packed makes no difference?

RalfJung · 2024-06-08T15:14:37Z

crates/test_helpers/src/lib.rs

@@ -639,43 +627,30 @@ macro_rules! test_lanes_panic {
                    core_simd::simd::LaneCount<$lanes>: core_simd::simd::SupportedLaneCount,
                $body

+                // test some odd and even non-power-of-2 lengths on miri


This doesn't have any even non-power-of-2. Maybe replace 5 by 6?

(Though I am also not sure why odd vs even would be an interesting difference here.)

programmerjake · 2024-06-09

if you look down further it tests length 3 on miri, the idea is we want to catch bugs caused by repr(simd, packed) having alignment smaller than repr(simd) which only happens for non-power-of-2 sizes. even non-power-of-2 sizes cover where the alignment is in between the element alignment and the non-packed alignment. @calebzulawski can you add back in length 6 since that's the smallest length where that occurs?

RalfJung · 2024-06-09

Yeah 3 and 6 would probably be reasonable then. To cut down on CI times I'd remove 5.
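The alignment argument for lengths 3 and 6 can be checked with a small arithmetic model. This is a sketch under two assumptions that are not confirmed in this thread: the non-packed alignment is the total size rounded up to a power of two, and the packed alignment is the largest power of two dividing the total size. The function names are hypothetical, and real targets may cap alignment differently.

```rust
/// Model of a non-packed vector's alignment: total size rounded up to a power of two.
fn unpacked_align(elem_size: usize, lanes: usize) -> usize {
    (elem_size * lanes).next_power_of_two()
}

/// Model of a packed vector's alignment: largest power of two dividing the total size.
fn packed_align(elem_size: usize, lanes: usize) -> usize {
    let size = elem_size * lanes;
    assert!(size > 0, "zero-sized vectors are outside this model");
    1 << size.trailing_zeros()
}
```

Under this model, with 4-byte elements, length 3 gives a packed alignment of 4 (equal to the element alignment) against a non-packed alignment of 16, while length 6 gives 8 against 32, which is the in-between case described above. Power-of-two lengths give the same value from both functions.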

RalfJung · 2024-06-08T17:53:32Z

> That did fix it, there was also another codegen bug--Rem has a failure only for i8 and u8, non-powers-of-two, when the second argument is all 0s (it works fine for not all zeros, only the rem_zero_panic test fails)

Is there an issue for that?

> the powerpc bitmask vector error. this is with the workaround, looks like endianness, but the code does account for endianness.

I think the code did account for endianness in the wrong way, see rust-lang/rust#126171.

RalfJung · 2024-08-07T09:14:35Z

crates/core_simd/src/ops.rs

-            unsafe { core::intrinsics::simd::$simd_call($lhs, rhs) }
+
+            // aarch64 div fails for arbitrary `v % 0`, mod fails when rhs is MIN, for non-powers-of-two
+            // these operations aren't vectorized on aarch64 anyway
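The hunk above is cut off before the replacement code. Purely as an illustration of what performing the operation lane by lane could look like, and not the code from this PR, here is a sketch on plain arrays. The name `rem_lanes` is hypothetical, and the choice to panic on a zero divisor and to wrap on `i8::MIN % -1` is an assumption:

```rust
/// Per-lane scalar remainder for i8 lanes. Illustrative only.
fn rem_lanes<const N: usize>(lhs: [i8; N], rhs: [i8; N]) -> [i8; N] {
    // Reject zero divisors up front so the panic does not depend on any vector op.
    assert!(
        rhs.iter().all(|&d| d != 0),
        "attempt to calculate the remainder with a divisor of zero"
    );
    let mut out = [0i8; N];
    for i in 0..N {
        // wrapping_rem yields 0 for i8::MIN % -1 instead of overflowing.
        out[i] = lhs[i].wrapping_rem(rhs[i]);
    }
    out
}
```

Because each lane is computed with ordinary scalar arithmetic, the result does not depend on how a backend legalizes a 3-lane or 6-lane vector remainder.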


These are LLVM backend bugs, right? simd_div/simd_rem still should work the same on all targets?

That seems worth tracking somewhere, having subtly buggy intrinsics is no good.

programmerjake · 2024-08-07

also, theoretically LLVM should be able to generate SIMD code for division/remainder by a constant, by using the exact same fancy math as it would use for scalars (which it unfortunately currently does after scalarization of div ops for non-power-of-2 vectors), so once LLVM's bugs are fixed, I think we should switch back to generating SIMD ops.

https://clang.godbolt.org/z/MxK47TWGs
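For readers unfamiliar with the "fancy math" for division by a constant, here is a small sketch of the standard multiply-and-shift trick for unsigned division by 3, applied to every lane of a plain array. It is a hand-written illustration, not what LLVM emits for any particular input, and the name `div3_lanes` is hypothetical:

```rust
/// Divide every u32 lane by 3 without a division instruction.
/// Uses the identity x / 3 == (x * ceil(2^33 / 3)) >> 33 for all 32-bit x.
fn div3_lanes<const N: usize>(v: [u32; N]) -> [u32; N] {
    const MAGIC: u64 = 0xAAAA_AAAB; // ceil(2^33 / 3)
    // The product fits in u64 because both factors are below 2^32.
    v.map(|x| ((x as u64 * MAGIC) >> 33) as u32)
}
```

The same multiply and shift applies uniformly to every lane, which is why this form maps naturally onto SIMD instructions when the backend keeps the operation vectorized.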

Yes, these are definitely backend bugs

crates/test_helpers/src/lib.rs

Co-authored-by: Ralf Jung <post@ralfj.de>

calebzulawski · 2024-08-10T01:01:47Z

Any ideas why the proptest variable doesn't seem to make it into cross? Or maybe it is, but it's not the number of cases that make the tests slow?

programmerjake · 2024-08-10T03:08:30Z

using the github actions feature that shows a timestamp for each line of output (get to it by clicking the settings gear on that actions job page), it looks like running the debug tests is taking almost all of the time...

calebzulawski · 2024-08-10T05:21:07Z

That gave me an idea, turns out you can set the optimization level of all dependencies outside of the workspace. I suspected maybe proptest itself was the slow part. If we're okay with this change, it dramatically improves test times
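As a sketch of what such a setting can look like, Cargo's profile overrides accept a `"*"` package spec that matches every dependency that is not a workspace member. The profile name and the level below are assumptions, since the actual change is not shown in this thread:

```toml
# Optimize all non-workspace dependencies (such as proptest) in test builds,
# while workspace crates keep the profile's default opt-level.
[profile.test.package."*"]
opt-level = 2
```

This leaves the crates under test compiled as debug builds, so debug assertions and overflow checks in the workspace are unaffected.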

calebzulawski requested review from programmerjake and workingjubilee June 3, 2024 22:00

programmerjake approved these changes Jun 3, 2024

calebzulawski mentioned this pull request Jun 5, 2024

Implement special swizzles for masks and remove {to,from}_bitmask_vector #423

Merged

calebzulawski force-pushed the non-power-of-two-layout branch from dbe18a4 to e3dabf5 Compare June 6, 2024 01:23

programmerjake reviewed Jun 6, 2024

.github/workflows/ci.yml

programmerjake mentioned this pull request Jun 6, 2024

aarch64 backend OOM on -O0 when compiling llvm.fptosi.sat.v3i32.v3f32 llvm/llvm-project#94694

Closed

calebzulawski force-pushed the non-power-of-two-layout branch from d513647 to e8a56e4 Compare June 7, 2024 01:15

RalfJung reviewed Jun 8, 2024

RalfJung mentioned this pull request Jun 8, 2024

Test codegen for repr(packed,simd) -> repr(simd) rust-lang/rust#125904

Merged

Fix layout of non-power-of-two length vectors

227a9d9

calebzulawski force-pushed the non-power-of-two-layout branch 2 times, most recently from 6e03d63 to ce73c96 Compare June 23, 2024 19:23

calebzulawski added 3 commits August 7, 2024 01:14

Add aarch64 workarounds

f336406

Perform aarch64 div/rem as scalar op

9f7fec8

Swap lanes tested on miri

a49f77e

calebzulawski force-pushed the non-power-of-two-layout branch from 98f923e to a49f77e Compare August 7, 2024 05:24

RalfJung reviewed Aug 7, 2024

crates/test_helpers/src/lib.rs

Update crates/test_helpers/src/lib.rs

751c3b5

Co-authored-by: Ralf Jung <post@ralfj.de>

calebzulawski mentioned this pull request Aug 8, 2024

simd_div and simd_rem ICEs for non-power-of-two length vectors on Aarch64 #427

Open

calebzulawski added 2 commits August 7, 2024 23:17

Disable testing most lanes to improve CI times

7f6a981

Reduce proptest iterations

2a3b8ad

calebzulawski force-pushed the non-power-of-two-layout branch from 400e6e8 to 2a3b8ad Compare August 9, 2024 01:14

Build test dependencies with optimization

d7d060a

calebzulawski merged commit 283acf4 into master Aug 13, 2024
57 checks passed

jieyouxu mentioned this pull request Jan 24, 2025

Meta Tracking Issue for LLVM workarounds rust-lang/rust#135981

Open

programmerjake mentioned this pull request Jan 24, 2025

Workaround for llvm bug for f32x3 saturating fp->int opt-level=0 on aarch64 is fixed in LLVM 20 rust-lang/rust#135982

Open

programmerjake mentioned this pull request Mar 7, 2025

Divisibility is not properly optimized on aarch64 #453

Open
