Test failure JIT\HardwareIntrinsics\General\Vector256\Vector256_r\Vector256_r.cmd #76280
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch

Issue Details

Run: runtime-coreclr jitstress 20220926.3
Failed test:
Error message:
The error is:
with:
@tannergooding, please check if this needs to be backported to 7.0.
The lowest 32 bits are being corrupted somehow.
Notably, this is only happening on the "upper half" of the Vector256.
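For context, "upper half" here means the second 128-bit lane (elements 4..7 for float, i.e. the upper two qwords). A minimal sketch, assuming only the standard System.Runtime.Intrinsics APIs, of how the halves map to elements:

```csharp
using System;
using System.Runtime.Intrinsics;

class UpperHalfDemo
{
    static void Main()
    {
        // Illustrative only: a Vector256<float> is two 128-bit lanes.
        // The "upper half" is elements 4..7 (the upper two qwords).
        Vector256<float> v = Vector256.Create(0f, 1f, 2f, 3f, 4f, 5f, 6f, 7f);

        Vector128<float> lower = v.GetLower(); // elements 0..3
        Vector128<float> upper = v.GetUpper(); // elements 4..7

        Console.WriteLine(lower); // <0, 1, 2, 3>
        Console.WriteLine(upper); // <4, 5, 6, 7>
    }
}
```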
I can't actually repro this, and it also isn't reproducing in CI anymore as of the latest run. Notably, there isn't anything "obvious" in the commit range (from the last failing CI run to the latest passing one) either: 6c2cfa4...789b420. If this reproduces again, I'll take another look.
@tannergooding A similar case failed again: coreclr windows x64 Checked zapdisable @ Windows.10.Amd64.Open
Could you remind me what … It's very odd this still isn't repro'ing locally. Perhaps it's some determinism issue with where the test runs in relation to the RNG it uses for inputs. It's interesting that it changed to a …
From the console log you can see:
I'm not sure if …
Still not able to repro this locally. I have tried 5 different full runs and over 20 different runs of just …
Actually looks like the testcase failing for my PR was different:
Copying from previous failure before logs go away:
I hit failures in another configuration. This one is ildasm/ilasm round-tripping for an ilasm change, so theoretically it could be the change, but that seems unlikely. This uses …
Failed again in: runtime-coreclr jitstress 20221022.1
Failed test:
Error message:
This seems to be failing only in …
Details from the latest hit (build 62054):
Failing test:
Expected:
Actual:
Environment:
Is there any way we can pull this specific machine from the pool to do manual testing on? Given we've only seen this for Vector256, and only with the upper half, my presumption is that there is either a bug in the upper-half save/restore logic (either in the JIT or in the thread save/resume logic) -or- it's something like the microcode patch that was called out above. We could also try to get more info out of … My current install has …
Yes, these details are masked out. Helix runs on Azure VMs, so you cannot even be sure that the test runs on the same physical machine the whole time; the running VM can be migrated to a different physical machine in the middle of the test. I have asked on the eng system support chat about an interactive session on a Helix VM.
@tannergooding You are cc'ed on the Teams discussion with the eng team about getting access to the Helix VM.
A recent job failed with newly added FailFast instrumentation:
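For context, a rough sketch of what per-element FailFast-style validation could look like; the method and names here are hypothetical, not the actual instrumentation that was added:

```csharp
using System;
using System.Runtime.Intrinsics;

static class FailFastValidationSketch
{
    // Hypothetical sketch: compare bit patterns element by element and fail fast
    // with the mismatching values in the message so they land in the crash dump.
    public static void ValidateResult(Vector256<double> expected, Vector256<double> actual)
    {
        for (int i = 0; i < Vector256<double>.Count; i++)
        {
            long e = BitConverter.DoubleToInt64Bits(expected.GetElement(i));
            long a = BitConverter.DoubleToInt64Bits(actual.GetElement(i));

            if (e != a)
            {
                Environment.FailFast(
                    $"Element {i} mismatch: expected 0x{e:X16}, actual 0x{a:X16}");
            }
        }
    }
}
```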
Here are the corrupted values from the last 4 crashes (…):
The consistent pattern is that the low 23 bits of the higher 2 qwords are zeroed out. 23 bits is an unusual number. Where can it come from?
I was thinking that it is half of the double mantissa, and then immediately realized that my math is wrong (2 * 23 != 52). Good point about the single-precision mantissa!
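To make that concrete: 23 bits is exactly the width of the IEEE 754 single-precision (float) mantissa. A minimal, self-contained illustration of what zeroing the low 23 bits of a qword looks like (the input value below is made up):

```csharp
using System;

class Mask23Demo
{
    static void Main()
    {
        const ulong low23Mask = (1UL << 23) - 1; // 0x7F_FFFF, the float mantissa width

        ulong original  = 0x4028_9ABC_DEF1_2345UL; // arbitrary made-up qword
        ulong corrupted = original & ~low23Mask;   // low 23 bits zeroed out

        Console.WriteLine($"original:  0x{original:X16}");
        Console.WriteLine($"corrupted: 0x{corrupted:X16}");
    }
}
```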
There is no good way to get remote access to the exact hardware that this is failing on. Azure uses multiple different processor models for the machine category used by Helix VMs, and creating a Helix VM of the same machine category tends to give you a different processor model (I have tried multiple times). If we need to gather more information about the machine config, the best way to do that is to add extra logging before the temporary FailFast in the Vector256 test.
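As a hedged sketch of what such extra logging could capture (Windows-only; the registry location and the availability of the Microsoft.Win32.Registry APIs are assumptions, not part of the existing test), something like the following could dump the processor identity and the loaded microcode revision:

```csharp
using System;
using Microsoft.Win32;

static class CpuInfoLoggingSketch
{
    // Hypothetical sketch of extra diagnostics to emit before the FailFast.
    // Assumes the standard Windows registry location for CPU info.
    public static void LogProcessorInfo()
    {
        using RegistryKey? key = Registry.LocalMachine.OpenSubKey(
            @"HARDWARE\DESCRIPTION\System\CentralProcessor\0");

        if (key is null)
            return;

        Console.WriteLine($"ProcessorNameString: {key.GetValue("ProcessorNameString")}");
        Console.WriteLine($"Identifier:          {key.GetValue("Identifier")}");

        // "Update Revision" holds the loaded microcode revision as raw bytes.
        if (key.GetValue("Update Revision") is byte[] rev)
            Console.WriteLine($"Update Revision:     {Convert.ToHexString(rev)}");
    }
}
```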
My current concern is that if this is a microcode issue, then no amount of logging will provide the required information. The only real way to validate is likely to get a machine, reliably repro, patch, and then try to repro again. |
What would you do if you got a VM that reproduces it semi-reliably? We should be able to do the same in CI; the feedback loop would just be slower. (I am running out of ideas on what to do to diagnose this further.)
I'd likely attach Intel VTune or AMD uProf and collect a system-wide trace that includes when the process yields its timeslice. From what we've seen in the dumps, all the disassembly looks correct, and the input values are being corrupted somewhere between when the correct result is computed and when the validation happens. So my guess is that it's either some state save/restore issue -or- something like what Egor linked above. Given this is effectively only happening in Vector256_r, I'd speculate that the combination of "debug" codegen (and therefore frequent spilling/loading) causes heavy enough Vector256 usage to trigger the issue Egor had found. In which case we'd patch the machine (install the latest Windows/Microsoft updates to start) and see if it continues reproing.
To reiterate, this is what we know:

This is the window where the corruption occurs: