Infinite loop #566
That is very peculiar. Are you sure that it's not just the allocation itself that's failing? I.e., does Since by default, |
The allocation itself succeeds (I can log the allocated pointer from kmalloc easily), so it's not that, but the part that actually fails is indeed the insert(). |
Is there any trick in hashbrown based on using unused pointer bits? The address range used in kernel space is different from the one in userspace AFAIR. Other than that, it should be a run-of-the-mill no_std environment. |
The stored pointer isn't at the beginning of the allocation, but other than that, it shouldn't affect things. The empty map is stored as static data, but since you got to the allocation, that shouldn't have been the source of the issue, since the code would have read the data before that point. Since you have the ability to log the pointer allocation, I'd honestly recommend just trying to build the crate yourself and injecting logging statements to debug things if you can, since that'd be the best way to figure out what the actual issue is. Besides that, there really isn't much to go on since this feels pretty specific to your issue. |
This is almost certainly due to the use of NEON instructions in hashbrown, which aren't normally allowed in the kernel. Are you using the aarch64-unknown-none-softfloat target which specifically disables the use of FP/SIMD instructions? |
I'll try to add some debug prints once I figure out how to make a local version of the crate.
It's using a custom target.json, derived from
So there should not be any NEON instruction in the binary. I haven't checked the disassembly exhaustively though, so maybe it could still sneak in via intrinsics (?) or EDIT: figured out how to build my version of hashbrown with debug prints, let's see what happens when I get the Pixel 6 back to experiment. |
I managed to find the problematic spot: when a print is added at the beginning of this loop, the issue disappears. So there probably is something unsound somewhere there that is exposed by optimizations (I'm using rust 1.81.0). EDIT: I also confirm that this is the loop that becomes infinite in the broken case. |
Hmm, that is very strange, since we've thoroughly tested this and run it through Miri without any undefined behaviour triggering. Since you're running on a target without NEON, you're almost certainly using the generic, non-SIMD implementation (which is also what we run under Miri), so this is very weird. |
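For readers unfamiliar with what a "generic, non-SIMD implementation" means here, below is a minimal sketch of the usual word-at-a-time (SWAR) byte match over 8 control bytes packed into a `u64`. It illustrates the general technique only; the names `repeat` and `match_byte` and the exact shape are assumptions for this example and are not copied from hashbrown's source.

```rust
/// Broadcast one byte into all 8 lanes of a u64 (hypothetical helper).
const fn repeat(byte: u8) -> u64 {
    u64::from_ne_bytes([byte; 8])
}

/// Return a mask with the high bit (0x80) set in each lane of `group`
/// that equals `byte`. This classic trick may report false positives in
/// lanes above a true match, so callers must re-check the actual key.
fn match_byte(group: u64, byte: u8) -> u64 {
    // Lanes equal to `byte` become 0x00 after the XOR.
    let cmp = group ^ repeat(byte);
    // A zero lane underflows on subtraction, which sets its high bit.
    cmp.wrapping_sub(repeat(0x01)) & !cmp & repeat(0x80)
}

fn main() {
    // One control byte 0x12 followed by seven 0xFF bytes.
    let group = u64::from_le_bytes([0x12, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF]);
    let mask = match_byte(group, 0x12);
    // The lowest set bit identifies the first matching lane.
    let first_lane = mask.trailing_zeros() / 8;
    assert_eq!(first_lane, 0);
    // A byte that is not present yields an empty mask.
    assert_eq!(match_byte(group, 0x34), 0);
}
```

If a lookup loop never exits, checking that such a mask is non-zero for the expected control byte is one cheap sanity check.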
To be honest, I'm not entirely a fan of the way we currently do the probe sequence, since, as you're experiencing, it doesn't have any guarantee to end. This means that a potentially malicious hash or a bad implementation could lead to an infinite loop. The one thing that immediately jumps to mind is that we have an explicit test for the target pointer width when factoring in the cache: if the pointer width is 32, then we take bits 31-25 of the hash for the 7-bit control code, whereas if the pointer width is 64, we take bits 63-57. It could be that the hash is still ending up as a 32-bit number and thus the control code is always zero, but since zero is a valid control code, that still feels unlikely. |
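To make the bit selection described above concrete, here is a small sketch of taking the top 7 bits of a hash as the control code. The function name `control_code` and the `hash_bits` parameter are hypothetical and only mirror the description in the comment (bits 31-25 for 32-bit, bits 63-57 for 64-bit); this is not hashbrown's actual code.

```rust
/// Take the top 7 bits of the "meaningful" part of the hash.
/// `hash_bits` is assumed to be 32 or 64 (hypothetical parameter).
fn control_code(hash: u64, hash_bits: u32) -> u8 {
    assert!(hash_bits == 32 || hash_bits == 64);
    ((hash >> (hash_bits - 7)) & 0x7f) as u8
}

fn main() {
    // A hash that only has its low 32 bits populated.
    let narrow_hash: u64 = 0xFFFF_FFFF;
    // Read as a 32-bit hash, bits 31-25 are all ones.
    assert_eq!(control_code(narrow_hash, 32), 0x7f);
    // Read as a 64-bit hash, bits 63-57 are all zero.
    assert_eq!(control_code(narrow_hash, 64), 0);
}
```

As noted in the comment, a control code of zero is still valid, so this alone would degrade filtering rather than explain a loop that never exits.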
Are you absolutely sure that the pointer returned by the allocator is correctly aligned? This is a problem that several people have encountered when using custom allocators. It might be worth dropping an assert there to be sure.
Because the table size is a power of 2, a quadratic probe sequence is guaranteed to visit each group exactly once before looping. The loop is then guaranteed to terminate as long as there is at least one empty bucket, which is guaranteed by the load factor. The hash only selects the starting position; all groups are still visited. |
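The claim that a quadratic (triangular) probe sequence over a power-of-two number of groups visits every group exactly once can be checked with a short sketch. The `ProbeSeq` type below is a simplified stand-in that counts in whole groups; it is an assumption for illustration and not hashbrown's actual type, which steps in units of buckets.

```rust
/// Simplified probe sequence: the stride grows by one group per step,
/// so positions are start + 0, 1, 3, 6, 10, ... modulo the group count.
struct ProbeSeq {
    pos: usize,
    stride: usize,
}

impl ProbeSeq {
    fn move_next(&mut self, mask: usize) {
        self.stride += 1;
        self.pos = (self.pos + self.stride) & mask;
    }
}

/// Returns true if `groups` consecutive probes starting at `start`
/// hit every group exactly once. `groups` must be a power of two.
fn visits_every_group(groups: usize, start: usize) -> bool {
    assert!(groups.is_power_of_two());
    let mask = groups - 1;
    let mut seen = vec![false; groups];
    let mut seq = ProbeSeq { pos: start & mask, stride: 0 };
    for _ in 0..groups {
        if seen[seq.pos] {
            return false; // a group was revisited too early
        }
        seen[seq.pos] = true;
        seq.move_next(mask);
    }
    seen.iter().all(|&s| s)
}

fn main() {
    for groups in [1usize, 2, 4, 8, 64, 1024] {
        for start in 0..groups {
            assert!(visits_every_group(groups, start));
        }
    }
    println!("every group visited exactly once");
}
```

Termination of a real lookup then only depends on at least one group containing an empty control byte, which is the part worth asserting on the device.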
I'll add some asserts. Currently the code looks like this, where I assume the address will be correctly aligned under this condition: align <= 8 || (size.is_power_of_two() && align <= size) This is based on the kernel doc:
(and on arm64, ARCH_KMALLOC_MINALIGN is 8).
I'm not sure what you mean here by "factoring in the cache" but those bits will be in use for kernel space pointers on arm64, as can be seen here. |
Assert added and it's still landing on the infinite loop without panicking first. |
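For reference, the kind of check discussed above could look roughly like the sketch below. The condition is the one quoted in the thread; the function names `kmalloc_alignment_assumed` and `is_aligned` are hypothetical, and the userspace `std::alloc` calls only stand in for the kernel allocator so that the example runs on its own.

```rust
use std::alloc::{alloc, dealloc, Layout};

/// Condition quoted in the thread under which the allocator wrapper
/// assumes the returned address is suitably aligned.
fn kmalloc_alignment_assumed(size: usize, align: usize) -> bool {
    align <= 8 || (size.is_power_of_two() && align <= size)
}

/// Check that `ptr` is a multiple of `align` (a power of two).
fn is_aligned(ptr: *const u8, align: usize) -> bool {
    assert!(align.is_power_of_two());
    (ptr as usize) & (align - 1) == 0
}

fn main() {
    let layout = Layout::from_size_align(64, 16).expect("valid layout");
    assert!(kmalloc_alignment_assumed(layout.size(), layout.align()));

    // SAFETY: the layout has a non-zero size.
    let ptr = unsafe { alloc(layout) };
    assert!(!ptr.is_null(), "allocation failed");
    // The assert the maintainers asked for: fail loudly on misalignment.
    assert!(is_aligned(ptr, layout.align()), "misaligned allocation");
    // SAFETY: `ptr` was returned by `alloc` with this same layout.
    unsafe { dealloc(ptr, layout) };
}
```

Placing the alignment assert on the value actually returned by the allocator, rather than trusting the size/align condition, is what rules this cause in or out.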
I'm honestly out of ideas at this point. I would recommend setting up kernel debugging to figure out exactly which loop it's getting stuck in. |
@Amanieu that's already figured out, see #566 (comment) |
I mean single-stepping that loop with a debugger to figure out why it's not exiting on the first iteration since there's only one element in the table. |
Single stepping is going to be really challenging. Even installing a debugger is going to be a problem (it's Android, not a regular distro with a gdb package). On top of that, the binary is mixed-language (it's a C kernel module with one object file coming from Rust). I can, however, apply any patch you want to hashbrown, so if there are things you want me to print or assert, that's straightforward. The only thing to keep in mind is that extra printing in the loop makes the issue disappear, as does compiling in debug mode, so it's quite likely there is something unsound quite "local" to that spot (or anything inlined there). The SAFETY comment of that snippet lists 4 points, maybe some of them can be turned into
|
I added the following obvious asserts according to SAFETY and they passed fine, with the infinite loop still happening:
assert_eq!(self.bucket_mask, self.buckets() - 1);
assert!(probe_seq.pos <= self.bucket_mask); |
Maybe you could try sprinkling some You could also add an assert to check that all bits are set in the bitmask returned by In fact can you also add an assert that the group load returns |
Was thinking the same so I tried with
I realized the issue is currently manifesting itself with this snippet:
let mut mymap = hashbrown::HashMap::new();
mymap.insert(left, right);
let val = mymap.get(&left).unwrap();
and it's looping in Considering the extra assert added in the allocator did not fire, it seems pretty likely the issue is somewhere in hashbrown. I'm happy to try alternative branches with extra asserts etc. to help figure out the issue, but for now I don't have any code actually depending on hashbrown (these were preliminary tests), so I'm going to go with the std BTreeMap since it's fine for the use case too. |
You should still be able to dump the contents of |
So I tried adding Note that I validated the print itself by printing successfully |
So there really is something going with the pointer obtained via |
Could this be because the kernel doesn't support unaligned memory access for some reason? |
Maybe, but I'd expect a fault leading to a panic. The only print that has worked is turning the ptr into a slice of length 8 and printing that:
|
Given that this only happens in release mode, not debug, and that added prints and such perturb the issue, it sounds like there could be a codegen bug here. |
Right, that's the expected result. The first control byte represents the item you are searching for and the rest represent If this really is a codegen bug then there isn't much we can do without directly looking at the generated assembly code. |
Actually I don't think so: printing the slice as-is worked, but calling that locks up:
So I doubt it's kernel-related. It looks like |
I'll see if I can de-inline the whole find_inner() function. Interestingly, I ran it again and got a different output, is that expected?
|
Here is the disassembly of
|
Can you remove all the |
@Amanieu I updated the asm in the previous comment and it's indeed much, much shorter. |
Can you try removing |
Still crashing, but the disassembly is ~30% shorter with
|
Also, I tried on the emulator platform, where everything seems to work (it never locked up), and the disassembly is identical (as expected, since it's the exact same toolchain). |
So I think inlining played some tricks again: the problem seems to now appear in insert() rather than get(). It may have been in insert() all along, and the adding/removing of get() that seemingly triggered the bug maybe just influenced the inlined blob, with the issue actually being in insert() ... This thing is a nightmare. That being said, it's not totally surprising since the loop we have been looking at seems to be almost copy/pasted in a few functions. |
The fact that this doesn't reproduce in QEMU despite being the exact same assembly code strongly suggests that the issue is not from hashbrown but rather from something specific to the kernel and/or hardware. Even if you can't get kgdb working, surely you should be able to get a register state or at least a PC from a kernel thread stuck in a loop? A register state should be able to give enough information to figure out what is going on. |
I don't think we can state it is the exact same assembly. The issue has been "moving" around due to inlining and I can't just disable it fully, otherwise the problem is hidden. So while I was looking at one function, the problem materialized somewhere else. When the lockup happens, the CPU gets stuck for 10s, then a watchdog detects it and the only info printed is a backtrace with a single entry. That entry is either a kernel function involved in syscalls or the scheduler. However, removing the snippet from the module fixes the issue, so it definitely has something to do with it (and it's even clearer there is a problem in the code or compiler given that it depends on the Rust optimization level). |
I think that it should be mostly possible to verify that it's the same assembly if you use the same custom target for both and ensure that |
A very simple snippet seems to land in an infinite loop in some environments:
This works everywhere except in a very specific place: in a kernel module of Android 6.1 kernel (Pixel 6 device), compiled in release mode:
Unfortunately, I was not able to get a useful backtrace out of the kernel for whatever reason, which is a real shame.
EDIT: added no_std