[RISC-V] Disable robust mutexes in PAL on Qemu #106123

Closed
wants to merge 1 commit

Conversation

yurai007 (Contributor) commented Aug 8, 2024

Before this change, due to https://gitlab.com/qemu-project/qemu/-/issues/2446, we get a lot of System.Threading.Tests/System.IO.IsolatedStorage.Tests errors like:

System.Threading.Tests.MutexTests.AbandonExisting(name: "Local\\0908829282ce40db90b073ccd04b4c62", waitType: WaitAny, waitCount: 1, notAbandonedWaitIndex: 0, isNotAbandonedWaitObjectSignaled: False, abandonDuringWait: False) [FAIL]
      System.AggregateException : One or more errors occurred. (Not enough storage is available to process this command. : 'Local\0908829282ce40db90b073ccd04b4c62'. One or more system calls failed: pthread_mutex_init(...) == ENOTSUP/EOPNOTSUPP;)
      ---- System.IO.IOException : Not enough storage is available to process this command. : 'Local\0908829282ce40db90b073ccd04b4c62'. One or more system calls failed: pthread_mutex_init(...) == ENOTSUP/EOPNOTSUPP;

In this patch we force the PAL to use file locks, which fixes System.Threading.Tests/System.IO.IsolatedStorage.Tests on Qemu.

Part of #84834, cc @dotnet/samsung
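
(For reference, the failing initialization can be reproduced outside the runtime with a minimal stand-alone sketch like the one below. This is a hypothetical repro, not PAL code; it sets up a process-shared, recursive, robust mutex roughly the way the PAL's named-mutex path does, and under the affected QEMU versions pthread_mutex_init is reported to fail with ENOTSUP/EOPNOTSUPP.)

```cpp
// Hypothetical stand-alone repro of the failure seen under qemu-user on riscv64.
#include <pthread.h>
#include <cstdio>
#include <cstring>

int main()
{
    pthread_mutexattr_t attrs;
    pthread_mutexattr_init(&attrs);
    pthread_mutexattr_setpshared(&attrs, PTHREAD_PROCESS_SHARED);
    pthread_mutexattr_settype(&attrs, PTHREAD_MUTEX_RECURSIVE);
    pthread_mutexattr_setrobust(&attrs, PTHREAD_MUTEX_ROBUST);

    pthread_mutex_t mutex;
    int error = pthread_mutex_init(&mutex, &attrs);
    pthread_mutexattr_destroy(&attrs);

    if (error != 0)
    {
        // Expected under the affected QEMU versions: ENOTSUP/EOPNOTSUPP.
        std::fprintf(stderr, "pthread_mutex_init failed: %s\n", std::strerror(error));
        return 1;
    }

    pthread_mutex_destroy(&mutex);
    std::puts("robust process-shared mutex initialized successfully");
    return 0;
}
```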

dotnet-issue-labeler bot added the needs-area-label label on Aug 8, 2024
dotnet-policy-service bot added the community-contribution label on Aug 8, 2024
jkotas added the area-PAL-coreclr label and removed the needs-area-label label on Aug 8, 2024
jkotas (Member) commented Aug 8, 2024

This change means that the runtime build for QEMU is going to be different from the regular runtime build. I am not sure we want to take this path. It is different from the earlier uses of DOTNET_RUNNING_UNDER_QEMU that just disabled specific tests on QEMU. It would eventually mean that QEMU needs its own RID (runtime identifier). Should this be a runtime check instead so that the same binary works both on QEMU and non-QEMU?

jkotas requested a review from kouvel on August 8, 2024
yurai007 (Contributor, author) commented Aug 8, 2024

> This change means that the runtime build for QEMU is going to be different from the regular runtime build. I am not sure we want to take this path. It is different from the earlier uses of DOTNET_RUNNING_UNDER_QEMU that just disabled specific tests on QEMU. It would eventually mean that QEMU needs its own RID (runtime identifier). Should this be a runtime check instead so that the same binary works both on QEMU and non-QEMU?

As an alternative solution, we could special-case only RISC-V and force cmake to perform the robust-mutex runtime checks at configure time (instead of using the hardcoded values from tryrun.cmake). Then I believe relying on Qemu should no longer be needed.

kouvel (Member) commented Aug 8, 2024

> As an alternative solution, we could special-case only RISC-V and force cmake to perform the robust-mutex runtime checks at configure time (instead of using the hardcoded values from tryrun.cmake). Then I believe relying on Qemu should no longer be needed.

Wouldn't that mean a build produced under QEMU would always use file locks instead of pthread robust mutexes? I wonder if we should include both the pthread robust mutex and file lock implementations into the build and choose one at runtime based on the not-supported error. Then regardless of how the build is produced it would have consistent behavior when running under QEMU or not.
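
(A rough sketch of that run-time selection is below. The names RobustProcessSharedMutexesWork, InitializeNamedMutexSupport, and g_useFileLocks are hypothetical stand-ins rather than actual PAL APIs: the idea is to probe once whether a robust, process-shared mutex can be initialized and to fall back to the file-lock implementation when the probe fails with the not-supported error.)

```cpp
// Hypothetical sketch (not actual PAL code) of choosing the named-mutex backend at run time.
#include <pthread.h>
#include <atomic>
#include <cstdio>

static std::atomic<bool> g_useFileLocks{false};   // hypothetical process-wide fallback flag

// Probe once whether a robust, process-shared, recursive mutex can be initialized.
// Under the QEMU issue above, pthread_mutex_init fails here with ENOTSUP/EOPNOTSUPP.
static bool RobustProcessSharedMutexesWork()
{
    pthread_mutexattr_t attrs;
    if (pthread_mutexattr_init(&attrs) != 0)
        return false;
    pthread_mutexattr_setpshared(&attrs, PTHREAD_PROCESS_SHARED);
    pthread_mutexattr_settype(&attrs, PTHREAD_MUTEX_RECURSIVE);
    pthread_mutexattr_setrobust(&attrs, PTHREAD_MUTEX_ROBUST);

    pthread_mutex_t probe;
    int error = pthread_mutex_init(&probe, &attrs);
    pthread_mutexattr_destroy(&attrs);
    if (error != 0)
        return false;                              // e.g. ENOTSUP under QEMU: use the fallback

    pthread_mutex_destroy(&probe);
    return true;
}

// Hypothetical init hook: decide once per process which implementation to use.
static void InitializeNamedMutexSupport()
{
    g_useFileLocks.store(!RobustProcessSharedMutexesWork(), std::memory_order_relaxed);
}

int main()
{
    InitializeNamedMutexSupport();
    std::printf("named mutexes will use %s\n",
                g_useFileLocks.load() ? "file locks" : "pthread robust mutexes");
    return 0;
}
```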

clamp03 added the arch-riscv label on Aug 8, 2024
yurai007 (Contributor, author) commented Aug 9, 2024

> Wouldn't that mean a build produced under QEMU would always use file locks instead of pthread robust mutexes?

Given that https://gitlab.com/qemu-project/qemu/-/issues/2424 is really a "won't fix", and assuming the glibc pthread_mutex_init implementation won't change much in the near future, I think the answer is "yes" :)

> I wonder if we should include both the pthread robust mutex and file lock implementations into the build and choose one at runtime based on the not-supported error. Then regardless of how the build is produced it would have consistent behavior when running under QEMU or not.

If I understand your idea correctly, it would require adjusting the PAL mutex code, I think at least PAL_CreateMutexW. I can explore this direction and check whether that's enough or requires more work.

sirntar (Member) left a comment

LGTM

yurai007 (Contributor, author) commented:

In mutex.cpp the pthread robust mutex code and the file-lock-specific code are scattered across the file. Choosing one at runtime instead of via the compile-time NAMED_MUTEX_USE_PTHREAD_MUTEX macro needs more adjustment than just InitializeProcessSharedRobustRecursiveMutex/NamedMutexSharedData. I'm not convinced it's a good idea.
On the other hand, the change I meant previously would be something as simple as yurai007@ca004b1, which would force cmake to perform the runtime checks on RISC-V (excluding Tizen). It should have the same effect as the original PR change and doesn't need any information about whether we are on Qemu.

risc-vv commented Aug 13, 2024

RISC-V Release-CLR-QEMU: 9389 / 9390 (99.99%)
=======================
      passed: 9389
      failed: 1
     skipped: 108
      killed: 0
------------------------
  TOTAL libs: 9498
 TOTAL tests: 9498
   REAL time: 48min 15s 333ms
=======================

Release-CLR-QEMU.md, Release-CLR-QEMU.xml, testclr_output.tar.gz

RISC-V Release-FX-QEMU: 703807 / 713575 (98.63%)
=======================
      passed: 703807
      failed: 401
     skipped: 1661
      killed: 9367
------------------------
  TOTAL libs: 253
 TOTAL tests: 715236
   REAL time: 2h 35min 20s 849ms
=======================

Release-FX-QEMU.md, Release-FX-QEMU.xml, testfx_output.tar.gz

Build information and links

GIT: f28142d61efb233688e8c0496cb2441cfdcdf4e2
CI: 45830772e7f8774d6f0963cd88c4d1cb1454299a
REPO: dotnet/runtime
BRANCH: main
CONFIG: Release
LIB_CONFIG: Release

# CORE_LIBS_BUILD_CMD
runtime/build.sh --arch riscv64 --cross -c Release -s libs /p:EnableSourceLink=false
# CORE_BUILD_CMD
runtime/build.sh --arch riscv64 --cross -c Release -s clr+libs+host /p:EnableSourceLink=false

# TESTCLR_BUILD_CMD
runtime/src/tests/build.sh -riscv64 -cross -Release -priority1 -p:UseLocalAppHostPack=true
# TESTCLR_CMD
python3 riscv-CI/goci/agent/TestRunner/run.py --core_root ./coreclr.Release/Tests/Core_Root --testhost ./testhost.Release --atest ./coreclr.Release --test ./ --log_dir ./logs  --timeout 2700 --log_level DEBUG
# TESTCLR_RUN
/godata/pipelines/Release-CLR-QEMU/logs/run_tests.log
cd "/_PATH_/_WITH_/_TEST_" && ROOTFS_DIR=/crossrootfs/riscv64 QEMU_LD_PREFIX=/crossrootfs/riscv64 __TestDotNetCmd=/godata/pipelines/Release-CLR-QEMU/testhost.Release/dotnet CORE_ROOT=/godata/pipelines/Release-CLR-QEMU/coreclr.Release/Tests/Core_Root  /usr/bin/time -f "exec_time: %e" ./_TEST_BINARY_

# TESTFX_BUILD_CMD
runtime/build.sh --arch riscv64 --cross -c Release -rc Release -hc Release -lc Release -s libs.tests --testscope innerloop /p:EnableSourceLink=false /p:UseLocalAppHostPack=true
# TESTFX_CMD
python3 riscv-CI/goci/agent/TestRunner/run.py --corefx --testhost ./testhost.Release --atest ./corefx.Release --log_dir ./logs  --timeout 6000 --memlimit 4096 --jobs 16 --log_level DEBUG
# TESTFX_RUN
/go-agent/pipelines/Release-FX-QEMU/logs/run_tests.log
cd "/_PATH_/_WITH_/_TEST_" && ROOTFS_DIR=/crossrootfs/riscv64 QEMU_LD_PREFIX=/crossrootfs/riscv64 __TestDotNetCmd=/go-agent/pipelines/Release-FX-QEMU/testhost.Release/dotnet  /usr/bin/time -f "exec_time: %e" /go-agent/pipelines/Release-FX-QEMU/testhost.Release/dotnet exec   xunit.console.dll _TEST_BINARY_ -nologo -nocolor -notrait category=failing

kouvel (Member) commented Aug 14, 2024

> In mutex.cpp the pthread robust mutex code and the file-lock-specific code are scattered across the file. Choosing one at runtime instead of via the compile-time NAMED_MUTEX_USE_PTHREAD_MUTEX macro needs more adjustment than just InitializeProcessSharedRobustRecursiveMutex/NamedMutexSharedData.

Yes, it would involve switching a number of compile-time checks to run-time checks such that both implementations are included in the build on Linux.
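
(Illustratively, each such switch could take roughly the shape below; all names are hypothetical stand-ins for the real mutex.cpp helpers, assuming a one-time probe caches whether robust mutexes work on the current system.)

```cpp
// Hypothetical illustration of turning a compile-time NAMED_MUTEX_USE_PTHREAD_MUTEX
// branch into a run-time branch; none of these helpers are the actual PAL functions.
static bool UsePThreadMutexes() { return true; }   // stand-in: cached result of a one-time probe
static void LockWithRobustMutex() {}               // stand-in: current pthread robust mutex path
static void LockWithFileLock() {}                  // stand-in: current file-lock path

// Today: the implementation is chosen when the runtime is built.
static void AcquireNamedMutexLock_CompileTime()
{
#ifdef NAMED_MUTEX_USE_PTHREAD_MUTEX
    LockWithRobustMutex();
#else
    LockWithFileLock();
#endif
}

// Proposed: both implementations are compiled in and the choice is made at run time.
static void AcquireNamedMutexLock_RunTime()
{
    if (UsePThreadMutexes())
        LockWithRobustMutex();
    else
        LockWithFileLock();
}

int main()
{
    AcquireNamedMutexLock_CompileTime();
    AcquireNamedMutexLock_RunTime();
    return 0;
}
```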

> On the other hand, the change I meant previously would be something as simple as yurai007@ca004b1, which would force cmake to perform the runtime checks on RISC-V (excluding Tizen). It should have the same effect as the original PR change and doesn't need any information about whether we are on Qemu.

With that it sounds like a build produced under QEMU would differ in behavior from a build produced on similar non-emulated hardware, as the former would use the file lock implementation and the latter would use pthread robust mutex. Ideally, the produced build should have consistent behavior. For instance, it would be problematic to use QEMU to cross-build a patch and apply it on non-emulated hardware, if the shipped binaries were produced on non-emulated hardware. Maybe this kind of issue already exists to some degree, but perhaps it should be fixed.

A safer approach may be to always use the file lock implementation based on arch, but that would be unfortunate, as it's less efficient. It's also very difficult to change the behavior between releases for compat reasons. I'm still favoring run-time checks over build-time checks.

yurai007 (Contributor, author) commented:

> With that it sounds like a build produced under QEMU would differ in behavior from a build produced on similar non-emulated hardware, as the former would use the file lock implementation and the latter would use pthread robust mutex. Ideally, the produced build should have consistent behavior. For instance, it would be problematic to use QEMU to cross-build a patch and apply it on non-emulated hardware, if the shipped binaries were produced on non-emulated hardware. Maybe this kind of issue already exists to some degree, but perhaps it should be fixed.

Thanks, with this explanation the solution based on run-time checks now makes more sense to me.

BruceForstall (Member) commented:

@kouvel @yurai007 Based on the above conversation, can this PR be closed? (It sounds like a new, different, dynamic checking implementation would be preferred.)

yurai007 closed this on Sep 23, 2024
github-actions bot locked and limited conversation to collaborators on Oct 23, 2024
Labels: arch-riscv, area-PAL-coreclr, community-contribution
8 participants