Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[jitstress] HardwareIntrinsics_ro fails with "process cannot access the file" error #83298

Closed
jakobbotsch opened this issue Mar 11, 2023 · 16 comments
Assignees
Labels
area-Infrastructure-coreclr blocking-clean-ci-optional Blocking optional rolling runs Known Build Error Use this to report build issues in the .NET Helix tab
Milestone

Comments

@jakobbotsch
Copy link
Member

jakobbotsch commented Mar 11, 2023

Seems to be failing widely.

01:45:37.661 Passed test: global::JIT.HardwareIntrinsics.X86._Ssse3.Program.SignInt32()
01:45:37.661 Running test: JIT/HardwareIntrinsics/General/Vector256_1/Vector256_1_ro/Vector256_1_ro.cmd
App Exit Code: 259
Expected: 100
Actual: 259
END EXECUTION - FAILED
FAILED
Unhandled exception. System.IO.IOException: The process cannot access the file 'C:\h\w\AE5609E1\w\B7AD09EA\e\JIT\HardwareIntrinsics\HardwareIntrinsics_ro\HardwareIntrinsics_ro.testStats.csv' because it is being used by another process.
   at Microsoft.Win32.SafeHandles.SafeFileHandle.CreateFile(String fullPath, FileMode mode, FileAccess access, FileShare share, FileOptions options)
   at Microsoft.Win32.SafeHandles.SafeFileHandle.Open(String fullPath, FileMode mode, FileAccess access, FileShare share, FileOptions options, Int64 preallocationSize, Nullable`1 unixCreateMode)
   at System.IO.Strategies.OSFileStreamStrategy..ctor(String path, FileMode mode, FileAccess access, FileShare share, FileOptions options, Int64 preallocationSize, Nullable`1 unixCreateMode)
   at System.IO.Strategies.FileStreamHelpers.ChooseStrategyCore(String path, FileMode mode, FileAccess access, FileShare share, FileOptions options, Int64 preallocationSize, Nullable`1 unixCreateMode)
   at System.IO.StreamReader.ValidateArgsAndOpenPath(String path, Encoding encoding, Int32 bufferSize)
   at System.IO.File.ReadLines(String path)
   at XUnitLogChecker.Main(String[] args) in D:\a\_work\1\s\src\tests\Common\XUnitLogChecker\XUnitLogChecker.cs:line 77

@ivdiazsa Could this be related to #78742? Could also likely be #83261.

Build Information

Build: https://dev.azure.com/dnceng-public/cbb18261-c48f-4abb-8651-8cdcb5474649/_build/results?buildId=201133
Build error leg or test failing: HardwareIntrinsics_ro.WorkItemExecution
Pull request: #83005

Error Message

Fill the error message using known issues guidance.

{
  "ErrorMessage": "csv' because it is being used by another process",
  "ExcludeConsoleLog": false
}

Report

Build Definition Step Name Console log Pull Request
204788 dotnet/runtime Send tests to Helix (Unix) Log #83427
204784 dotnet/runtime Send tests to Helix (Unix) Log #83050
204543 dotnet/runtime coreclr Pri0 Runtime Tests Run osx x64 checked Log #82970
204345 dotnet/runtime Send tests to Helix (Unix) Log
204427 dotnet/runtime Send tests to Helix (Unix) Log #83413
204428 dotnet/runtime Send tests to Helix (Unix) Log
204376 dotnet/runtime Send tests to Helix (Unix) Log #83412
204340 dotnet/runtime Send tests to Helix (Unix) Log #83411
204044 dotnet/runtime Send tests to Helix (Unix) Log #83354
203973 dotnet/runtime coreclr Pri0 Runtime Tests Run osx arm64 checked Log #79283
203855 dotnet/runtime Send tests to Helix (Unix) Log #83050
203876 dotnet/runtime osx-x64 Release AllSubsets_Mono_Interpreter_RuntimeTests monointerpreter Log #83337
203547 dotnet/runtime Send tests to Helix (Unix) Log #83380
203800 dotnet/runtime coreclr Pri0 Runtime Tests Run linux arm checked Log #83388
203765 dotnet/runtime Send tests to Helix (Unix) Log #83299
203621 dotnet/runtime coreclr Pri0 Runtime Tests Run linux arm checked Log #80154
203583 dotnet/runtime coreclr Pri1 Runtime Tests Run linux x64 release Log
203537 dotnet/runtime Send tests to Helix (Unix) Log #82827
203341 dotnet/runtime Send tests to Helix (Unix) Log #83354
203507 dotnet/runtime Send tests to Helix (Unix) Log #82954
203473 dotnet/runtime coreclr Pri0 Runtime Tests Run osx x64 checked Log #80154
203478 dotnet/runtime Send tests to Helix (Unix) Log #83371
203421 dotnet/runtime Send tests to Helix (Unix) Log
203382 dotnet/runtime coreclr Pri1 Runtime Tests Run linux x64 release Log
203384 dotnet/runtime coreclr Pri0 Runtime Tests Run Composite osx x64 checked Log
203348 dotnet/runtime Send tests to Helix (Unix) Log
203352 dotnet/runtime coreclr Pri0 Runtime Tests Run linux arm checked Log #80154
203328 dotnet/runtime coreclr Pri0 Runtime Tests Run linux arm64 checked Log #79334
203275 dotnet/runtime Send tests to Helix (Unix) Log #82045
203108 dotnet/runtime Send tests to Helix (Unix) Log #82779
202951 dotnet/runtime Build Tests Log #83356
202862 dotnet/runtime Send tests to Helix (Unix) Log #82045
202821 dotnet/runtime coreclr Pri0 Runtime Tests Run osx x64 checked Log #80154
202706 dotnet/runtime Send tests to Helix (Unix) Log
202648 dotnet/runtime Send tests to Helix (Unix) Log #83334
202622 dotnet/runtime coreclr Pri0 Runtime Tests Run osx arm64 checked Log #82045
202588 dotnet/runtime coreclr Pri0 Runtime Tests Run osx arm64 checked Log #79283
202432 dotnet/runtime Build Tests Log #79169
202394 dotnet/runtime Send tests to Helix (Unix) Log #79169
202373 dotnet/runtime Send tests to Helix (Unix) Log #83335
202281 dotnet/runtime Send tests to Helix (Unix) Log #83300
202216 dotnet/runtime Send tests to Helix (Unix) Log #83253
202211 dotnet/runtime Send tests to Helix (Unix) Log #82049
202132 dotnet/runtime Send tests to Helix (Unix) Log #83094
202065 dotnet/runtime Send tests to Helix (Unix) Log
202081 dotnet/runtime Send tests to Helix (Unix) Log
202049 dotnet/runtime Send tests to Helix (Windows) Log
202048 dotnet/runtime coreclr Pri1 Runtime Tests Run linux arm64 release Log
202046 dotnet/runtime coreclr Pri0 Runtime Tests Run Composite osx arm64 checked Log
202034 dotnet/runtime Send tests to Helix (Unix) Log
Build Definition Test Pull Request
205173 dotnet/runtime HardwareIntrinsics_ro.WorkItemExecution
205061 dotnet/runtime HardwareIntrinsics_ro.WorkItemExecution
205036 dotnet/runtime HardwareIntrinsics_ro.WorkItemExecution
204901 dotnet/runtime HardwareIntrinsics_ro.WorkItemExecution
204829 dotnet/runtime HardwareIntrinsics_ro.WorkItemExecution #83402
204444 dotnet/runtime HardwareIntrinsics_ro.WorkItemExecution #83413
204050 dotnet/runtime HardwareIntrinsics_ro.WorkItemExecution #83354
204232 dotnet/runtime HardwareIntrinsics_ro.WorkItemExecution
204227 dotnet/runtime HardwareIntrinsics_ro.WorkItemExecution #83402
203486 dotnet/runtime HardwareIntrinsics_ro.WorkItemExecution
203487 dotnet/runtime HardwareIntrinsics_r.WorkItemExecution
203369 dotnet/runtime HardwareIntrinsics_ro.WorkItemExecution #83372
203349 dotnet/runtime HardwareIntrinsics_ro.WorkItemExecution
203347 dotnet/runtime HardwareIntrinsics_ro.WorkItemExecution #83354
203308 dotnet/runtime HardwareIntrinsics_ro.WorkItemExecution
203201 dotnet/runtime HardwareIntrinsics_ro.WorkItemExecution
203096 dotnet/runtime HardwareIntrinsics_ro.WorkItemExecution #83354
203089 dotnet/runtime global::JIT.HardwareIntrinsics.X86._Avx512F.Program.XorInt64() #83354
202848 dotnet/runtime HardwareIntrinsics_r.WorkItemExecution
202653 dotnet/runtime HardwareIntrinsics_ro.WorkItemExecution
202124 dotnet/runtime HardwareIntrinsics_ro.WorkItemExecution
202123 dotnet/runtime HardwareIntrinsics_r.WorkItemExecution
202035 dotnet/runtime HardwareIntrinsics_ro.WorkItemExecution
202030 dotnet/runtime HardwareIntrinsics_ro.WorkItemExecution
201969 dotnet/runtime HardwareIntrinsics_ro.WorkItemExecution
201955 dotnet/runtime HardwareIntrinsics_ro.WorkItemExecution
201953 dotnet/runtime HardwareIntrinsics_ro.WorkItemExecution
201952 dotnet/runtime HardwareIntrinsics_r.WorkItemExecution #83250
201883 dotnet/runtime HardwareIntrinsics_r.WorkItemExecution #82744

Summary

24-Hour Hit Count 7-Day Hit Count 1-Month Count
18 79 79
@jakobbotsch jakobbotsch added the tenet-performance Performance related issue label Mar 11, 2023
@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Mar 11, 2023
@ghost ghost added the untriaged New issue has not been triaged by the area owner label Mar 11, 2023
@ghost
Copy link

ghost commented Mar 11, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch, @kunalspathak
See info in area-owners.md if you want to be subscribed.

Issue Details

This seems to be failing widely in jitsress runs now: https://dev.azure.com/dnceng-public/public/_build/results?buildId=201133&view=ms.vss-test-web.build-test-results-tab

Author: jakobbotsch
Assignees: -
Labels:

tenet-performance, area-CodeGen-coreclr, untriaged

Milestone: -

@ghost
Copy link

ghost commented Mar 11, 2023

Tagging subscribers to this area: @hoyosjs
See info in area-owners.md if you want to be subscribed.

Issue Details

Seems to be failing widely.

Build Information

Build: https://dev.azure.com/dnceng-public/cbb18261-c48f-4abb-8651-8cdcb5474649/_build/results?buildId=201133
Build error leg or test failing: HardwareIntrinsics_ro.WorkItemExecution
Pull request: #83005

Error Message

Fill the error message using known issues guidance.

{
  "ErrorMessage": "HardwareIntrinsics_",
  "BuildRetry": false,
  "ErrorPattern": "",
  "ExcludeConsoleLog": false
}
01:45:37.661 Passed test: global::JIT.HardwareIntrinsics.X86._Ssse3.Program.SignInt32()
01:45:37.661 Running test: JIT/HardwareIntrinsics/General/Vector256_1/Vector256_1_ro/Vector256_1_ro.cmd
App Exit Code: 259
Expected: 100
Actual: 259
END EXECUTION - FAILED
FAILED
Unhandled exception. System.IO.IOException: The process cannot access the file 'C:\h\w\AE5609E1\w\B7AD09EA\e\JIT\HardwareIntrinsics\HardwareIntrinsics_ro\HardwareIntrinsics_ro.testStats.csv' because it is being used by another process.
   at Microsoft.Win32.SafeHandles.SafeFileHandle.CreateFile(String fullPath, FileMode mode, FileAccess access, FileShare share, FileOptions options)
   at Microsoft.Win32.SafeHandles.SafeFileHandle.Open(String fullPath, FileMode mode, FileAccess access, FileShare share, FileOptions options, Int64 preallocationSize, Nullable`1 unixCreateMode)
   at System.IO.Strategies.OSFileStreamStrategy..ctor(String path, FileMode mode, FileAccess access, FileShare share, FileOptions options, Int64 preallocationSize, Nullable`1 unixCreateMode)
   at System.IO.Strategies.FileStreamHelpers.ChooseStrategyCore(String path, FileMode mode, FileAccess access, FileShare share, FileOptions options, Int64 preallocationSize, Nullable`1 unixCreateMode)
   at System.IO.StreamReader.ValidateArgsAndOpenPath(String path, Encoding encoding, Int32 bufferSize)
   at System.IO.File.ReadLines(String path)
   at XUnitLogChecker.Main(String[] args) in D:\a\_work\1\s\src\tests\Common\XUnitLogChecker\XUnitLogChecker.cs:line 77

@ivdiazsa Could this be related to #78742?

Author: jakobbotsch
Assignees: -
Labels:

tenet-performance, area-CodeGen-coreclr, area-Infrastructure-coreclr, untriaged

Milestone: -

@jakobbotsch jakobbotsch added blocking-clean-ci-optional Blocking optional rolling runs and removed area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI untriaged New issue has not been triaged by the area owner tenet-performance Performance related issue labels Mar 11, 2023
@jakobbotsch jakobbotsch changed the title [jitstress] HardwareIntrinsics_ro fails [jitstress] HardwareIntrinsics_ro fails with "process cannot access the file" error Mar 11, 2023
@EgorBo
Copy link
Member

EgorBo commented Mar 13, 2023

@ivdiazsa could you please take a look if that was your PR, this broke many outerloop pipelines

@ivdiazsa
Copy link
Member

I just checked the failures and the Linux ones seem to be because of the test taking too long. Do we expect these tests to last more than 5 minutes? If yes, then that's an adjustment to make to the watcher. If not, then we can confirm the watcher is doing its job of preventing potential freezes.

As for the Windows ones, I have no idea. Seems like potentially a problem with Windows liking to hold on to resources for whatever reason. However, it seems to be working on my machine so far. I will continue trying to repro it.

@JulieLeeMSFT
Copy link
Member

CC @trylek @hoyosjs PTAL.

@trylek
Copy link
Member

trylek commented Mar 13, 2023

I think that we have two different issues here:

  1. In general 5 minutes sounds like not a lot especially in JIT stress mode; IIRC @davidwrighton originally implemented these tests and we actually recently discussed their performance on another issue thread but I think 5 minutes is just too drastic. In the YAML pipelines there used to be this property TimeoutPerTestInMinutes that ultimately translates to the environment variable __TestTimeout in Helix, ideally we should fix the watcher to use that instead of hard-coding the timeout as @BruceForstall pointed out in the PR thread [Merge-on-Red] - Implement Test Process Watcher #78742.

  2. While the test timeout may have been caused by too aggressive setting of the timeout value, the infrastructure should be still capable of dealing with such real-life situations. It seems to me that the error message quoted in the title of this issue got hit due to the fact that, as the app was forcibly terminated due to the presumed timeout, the *.testStats.csv file wasn't properly closed at the test side and subsequently the XUnit log checker failed to open it for reading. My initial thinking is that we should either allow read-share for the file write in the Roslyn-generated wrapper assuming incomplete reads from the file are treated as benign or introduce some back-off logic that would wait for the file handle to be released in a couple of seconds as part of "process GC" in Windows somewhat akin to directory deletion which is also asynchronous in Win32.

Hope that helps

Tomas

@ivdiazsa
Copy link
Member

ivdiazsa commented Mar 13, 2023

I've been trying to repro this all day, but to no avail. I think @trylek's explanation makes total sense so what I will do next is to take his suggestion of making the watcher use that environment variable to determine the timeouts if set.

As for the problem with actual timeouts:

back-off logic that would wait for the file handle to be released in a couple of seconds as part of "process GC" in Windows

Could you give more details regarding this suggestion Tomas? It sounds a reasonable way of fixing it. Do you mean like, have the log checker wait a couple seconds before executing on Windows?

@BruceForstall
Copy link
Member

In the merged tests case, like HardwareIntrinsics_ro, all the tests run under a single invocation of corerun.exe, so it's not surprising at all that the total time would exceed 5 minutes.

Do you mean like, have the log checker wait a couple seconds before executing on Windows?

That make sense to me: if the log read fails, then wait a few seconds, try again, wait a few seconds, try again. Maybe wait as much as a minute or more?

@ivdiazsa
Copy link
Member

In the merged tests case, like HardwareIntrinsics_ro, all the tests run under a single invocation of corerun.exe, so it's not surprising at all that the total time would exceed 5 minutes.

Do you mean like, have the log checker wait a couple seconds before executing on Windows?

That make sense to me: if the log read fails, then wait a few seconds, try again, wait a few seconds, try again. Maybe wait as much as a minute or more?

Got it! Yes, I also think 1 minute of retrying every couple seconds or so to read the log is reasonable. If we still can't read it after that then maybe it's safe to assume something went terribly wrong that would require a developer to look at.

@trylek
Copy link
Member

trylek commented Mar 13, 2023

@ivdiazsa - for debugging the locked csv file issue I bet it should suffice to set the watcher timeout to something ridiculous like 1 second, then you should be surely able to hit the timeout and forced app termination and I assume you'll immediately also see the locked file issue, I'm hitting it locally all the time on my Windows machine whenever I forcibly kill an app, all files it held on to remain locked for something like10~30 more seconds (subjective guess, I haven't actually measured it).

@trylek
Copy link
Member

trylek commented Mar 13, 2023

For the directory deletion example I mentioned, this is how I implemented it in R2RTest (by porting my previous fix for .NET Native where it also used to crash lab runs about 8 years back):

private static async Task<bool> DeleteSubtree(this string folder)

@trylek
Copy link
Member

trylek commented Mar 13, 2023

(It's a tad overcomplicated due to using the async logic to delete subtrees concurrently, that's something you thankfully don't need.)

@jakobbotsch jakobbotsch added the Known Build Error Use this to report build issues in the .NET Helix tab label Mar 13, 2023
@JulieLeeMSFT JulieLeeMSFT added this to the 8.0.0 milestone Mar 14, 2023
@BruceForstall
Copy link
Member

@ivdiazsa @trylek @hoyosjs Given that this is causing all "outerloop" runs to fail, and many other pipelines as well, can we immediately revert the watcher change (or disable it, if that is trivial to do)?

@hoyosjs
Copy link
Member

hoyosjs commented Mar 14, 2023

Revert PR is up - it doesn't make sense to block others.

@hoyosjs
Copy link
Member

hoyosjs commented Mar 15, 2023

I updated the issue - it was running against a pretty broad string.

@ivdiazsa
Copy link
Member

Closing after being addressed by PR #83427.

@ghost ghost locked as resolved and limited conversation to collaborators Apr 14, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-Infrastructure-coreclr blocking-clean-ci-optional Blocking optional rolling runs Known Build Error Use this to report build issues in the .NET Helix tab
Projects
None yet
Development

No branches or pull requests

7 participants