-
Notifications
You must be signed in to change notification settings - Fork 13k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
run-make-fulldeps/pgo-branch-weights fails on AArch64 Linux #78226
Comments
@rustbot ping arm |
Hey ARM Group! This bug has been identified as a good "ARM candidate". cc @JamieCunliffe @joaopaulocarreiro @raw-bin @Stammark @vigoux |
Looks like my rollup (#78212) has the cause. |
@JohnTitor: Thanks for the tip! We're looking into this. |
Based on that I reverted #77554 ( |
This seems to be fixed now, CI is passing again. |
Reopening this issue as it's hitting us (again) on the beta promotion PR, #86413. I've tried various things to debug but so far haven't really arrived at any conclusions. It doesn't seem immediately related to anything in the PR itself, which should in theory have minimal effect on runtime behavior here. @rustbot ping arm -- since this is only failing on the aarch64 builder, I am currently presuming some relationship, but it's not clear. The test does not fail on the only aarch64 machine I currently readily have access to when I tried to run it indirectly (i.e., copying the input files and simulating the make file via some local manual invocations)> |
Error: Parsing ping command in comment failed: ...'t ping arm' | error: expected end of command at >| ' -- since '... Please let |
@rustbot ping arm |
Hey ARM Group! This bug has been identified as a good "ARM candidate". cc @adamgemmell @hug-dev @jacobbramley @JamieCunliffe @joaopaulocarreiro @raw-bin @Stammark @vigoux |
@Mark-Simulacrum It's not obvious what's going on here, but we'll look into it. Thanks for the ping. |
I've disabled the test on #86413 so that we can get a beta out, but I would like to see a fix prior to the 1.54 release (in ~5 weeks), so P-critical and milestoned at 1.54. |
@jacobbramley are you willing to own this issue, so that we have some tracking of who is looking into it before the 1.54 release happens? |
Currently, @JamieCunliffe (also in the Arm team) is working on it so he's perhaps a better choice. |
@Mark-Simulacrum mentioned they couldn't reproduce this locally, I have managed to reproduce it locally with the following config which I got by running one of the CI scripts.
That fails on Changing the I'll continue investigating this though. |
Thanks for the bisection! I suspect the best way is to try to land a revert (of the revert? not sure how revert-y we've gotten there) on the beta branch, while enabling this test (reverting 793b005). It seems plausible at least that this is not an entirely correct bisection in terms of underlying root cause based on a cursory inspection of these PRs. My guess is something subtle within the linker or so is at play... but I'm OK closing this issue if we can get the test re-enabled on beta. However, I suspect we'll not want to backport #86143 to beta -- it's a pretty large PR. Checking whether it fixes things by temporary landing it on the beta branch (or with a try build on the aarch64 builder) seems OK though. We can use that information to determine what the best next step is. One thought on a potential cause is that if PGO is affected (and I'd guess it is) by the codegen-units and relative ordering of symbols, then #85891 certainly seems like it might've influenced both of those... cc @bjorn3 @michaelwoerister -- any thoughts on how #85891 might've caused this bug? |
It changes symbol names and probably the order of some hashmaps. This might have changed the codegened code just enough to expose a pre-existing bug I would guess. |
The linker certainly can be a source for PGO related errors, e.g. if the profiling runtime doesn't get linked in correctly. |
It does look like it's to do with the linker. It looks like the profile data section isn't being included correctly in the binary, the counts and other sections appear to be correct though. The data section size on a working one is almost the same as the difference we are seeing on the CI. When using gold as the linker for the test it passes but the default doesn't, which could explain why we aren't seeing this on x86. When using If that seems reasonable I can submit a PR for not using |
FWIW we try to use gold on the linker on non-x86 I think rust/src/test/run-make-fulldeps/pgo-branch-weights/Makefile Lines 24 to 28 in 0cd0709
I'm not sure if it's the right fix to always not gc-sections but I think it's likely a good idea for now -- seems pretty harmless, too; profile-generate isn't really deployed much AFAIK, only run on particular benchmarks or so. |
To be clear, hacking What's not clear to me is what the correct behaviour is, and which tool is misbehaving (if any). @Mark-Simulacrum That test is, confusingly, the other way around, and applies only to x86 targets. |
…-Simulacrum Don't use gc-sections with profile-generate. When building with profile-generate don't call gc_sections as this can can sometimes strip out profile data. This missing information in the prof files can then result in missing functions when using the profile information. rust-lang#78226 r? `@Mark-Simulacrum`
T-compiler declined beta backport of #87004, but given that we are currently of the opinion that the particular test failing is somewhat unlikely to be a "new" bug, I'm going to close this issue as fixed by #87004. In particular, it is my current understanding that this issue could have arisen on any past release with the right perturbations to the linker input (largely driven by hashes differing due to rustc version amongst other details), so is not a true regression. |
The most recent builds on AArch64 Linux are failing on this test:
The text was updated successfully, but these errors were encountered: