membarrier(REGISTER_PRIVATE_EXPEDITED) waits through an unnecessary RCU grace period during Linux process startup #106722

Closed
harisokanovic opened this issue Aug 20, 2024 · 0 comments · Fixed by #106724 or #107100
Labels
area-PAL-coreclr in-pr There is an active PR which will close this issue when it is merged tenet-performance Performance related issue

harisokanovic commented Aug 20, 2024

The .NET runtime uses membarrier() syscalls in the Linux implementation of FlushProcessWriteBuffers(). The initialization call, membarrier(MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED), can run substantially longer in a process with more than one thread: when mm->mm_users > 1, it bypasses the kernel's single-threaded fast path and waits for an RCU grace period.

PAL_InitializeCoreCLR() hits the slow path because it initializes membarrier() after launching the sync manager worker thread. Startup time can be improved by moving membarrier initialization ahead of thread creation.

Potential fix in runtime PR 106724.
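
For orientation, here is a minimal sketch of the register-then-use pattern that the description above refers to. It is not the runtime's code: the helper names membarrier_call(), mb_register_early(), and mb_flush() are hypothetical, and the only assumption is that the kernel headers define the private expedited commands. The idea is that mb_register_early() runs while the process is still single-threaded, and mb_flush() is what a FlushProcessWriteBuffers()-style function would call later.

```c
#define _GNU_SOURCE
#include <linux/membarrier.h>
#include <stdbool.h>
#include <sys/syscall.h>
#include <unistd.h>

// Thin wrapper around the raw syscall (hypothetical helper name).
static int membarrier_call(int cmd, unsigned int flags) {
  return (int)syscall(SYS_membarrier, cmd, flags, 0);
}

// Probe for support, then register. Intended to run before any thread is
// created, so that registration can take the single-threaded fast path.
// Returns false if membarrier() is missing, filtered, or lacks the
// private expedited commands.
static bool mb_register_early(void) {
  int supported = membarrier_call(MEMBARRIER_CMD_QUERY, 0);
  if (supported < 0) {
    return false;  // e.g. ENOSYS on old kernels or a seccomp filter
  }
  if ((supported & MEMBARRIER_CMD_PRIVATE_EXPEDITED) == 0 ||
      (supported & MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED) == 0) {
    return false;
  }
  return membarrier_call(MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED, 0) == 0;
}

// Issue a barrier on all running threads of this process.
// Only valid after a successful mb_register_early().
// Returns 0 on success, -1 with errno set on failure.
static int mb_flush(void) {
  return membarrier_call(MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0);
}
```

A caller would invoke mb_register_early() once at the top of its initialization and use a fallback mechanism whenever it returns false.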


The issue can be demonstrated in this simple C program:

// membarrier(REGISTER_PRIVATE_EXPEDITED) init demo
// 1) Install tools: sudo apt install gcc libc6-dev hyperfine
// 2) Build test program: gcc -o mbdemo mbdemo.c -lpthread
// 3) Slow: hyperfine --style basic --time-unit millisecond "./mbdemo n"
// 4) Fast: hyperfine --style basic --time-unit millisecond "./mbdemo y"

#include <linux/membarrier.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <pthread.h>
#include <assert.h>
#include <stdio.h>

static void* worker_funct(void* param) {
  printf("worker done\n");
  return param;
}

int main(int argc, const char** argv) {
  if (argc >= 2 && argv[1][0] == 'y') {
    // init before thread
    assert(syscall(SYS_membarrier, MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED, 0, 0) == 0);
  }

  pthread_t worker_thread = {0};
  assert(pthread_create(&worker_thread, NULL, &worker_funct, NULL) == 0);

  // init after thread
  assert(syscall(SYS_membarrier, MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED, 0, 0) == 0);

  assert(pthread_join(worker_thread, NULL) == 0);

  printf("main done\n");
  return 0;
}

~11ms difference on a 16-core arm64 system (AWS r7g.4xlarge):

$ hyperfine --style basic --time-unit millisecond "./mbdemo n"
Benchmark 1: ./mbdemo n
  Time (mean ± σ):      11.5 ms ±   3.0 ms    [User: 0.8 ms, System: 0.0 ms]
  Range (min … max):     5.5 ms …  23.5 ms    496 runs

$ hyperfine --style basic --time-unit millisecond "./mbdemo y"
Benchmark 1: ./mbdemo y
  Time (mean ± σ):       0.5 ms ±   0.0 ms    [User: 0.5 ms, System: 0.3 ms]
  Range (min … max):     0.5 ms …   0.7 ms    2992 runs

~8ms difference on 16-core x86_64 (AWS r6i.4xlarge):

ubuntu@ip-172-31-41-194:~$ hyperfine --style basic --time-unit millisecond "./mbdemo n"
Benchmark 1: ./mbdemo n
  Time (mean ± σ):       8.7 ms ±   2.0 ms    [User: 0.5 ms, System: 0.0 ms]
  Range (min … max):     5.5 ms …  16.5 ms    335 runs

ubuntu@ip-172-31-41-194:~$ hyperfine --style basic --time-unit millisecond "./mbdemo y"
Benchmark 1: ./mbdemo y
  Time (mean ± σ):       0.5 ms ±   0.0 ms    [User: 0.3 ms, System: 0.2 ms]
  Range (min … max):     0.4 ms …   1.5 ms    3157 runs
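
The hyperfine numbers above measure whole-process time. As a rough complement, the registration call can be timed in-process with clock_gettime(). The sketch below is an illustration, not part of the original demo: time_register_ms() and parked_worker() are hypothetical names, and the worker is parked on a mutex so it is guaranteed to be alive while the syscall runs. Because a repeated registration in an already registered process is expected to return quickly, the function should be called at most once per process, with each mode in a fresh process.

```c
#define _GNU_SOURCE
#include <linux/membarrier.h>
#include <pthread.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

static pthread_mutex_t gate = PTHREAD_MUTEX_INITIALIZER;

// Worker blocks on the gate until main releases it, keeping the process
// multi-threaded for the duration of the measurement.
static void* parked_worker(void* param) {
  pthread_mutex_lock(&gate);
  pthread_mutex_unlock(&gate);
  return param;
}

static double elapsed_ms(const struct timespec* a, const struct timespec* b) {
  return (double)(b->tv_sec - a->tv_sec) * 1e3 +
         (double)(b->tv_nsec - a->tv_nsec) / 1e6;
}

// Times one REGISTER_PRIVATE_EXPEDITED call. If with_thread is nonzero, a
// second thread exists while the syscall runs. Returns elapsed
// milliseconds, or -1.0 if thread creation or the syscall failed.
static double time_register_ms(int with_thread) {
  pthread_t worker;
  struct timespec start, end;

  if (with_thread) {
    pthread_mutex_lock(&gate);
    if (pthread_create(&worker, NULL, parked_worker, NULL) != 0) {
      pthread_mutex_unlock(&gate);
      return -1.0;
    }
  }

  clock_gettime(CLOCK_MONOTONIC, &start);
  long rc = syscall(SYS_membarrier,
                    MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED, 0, 0);
  clock_gettime(CLOCK_MONOTONIC, &end);

  if (with_thread) {
    pthread_mutex_unlock(&gate);  // let the worker finish
    pthread_join(worker, NULL);
  }
  return rc == 0 ? elapsed_ms(&start, &end) : -1.0;
}
```

Calling time_register_ms(1) in one run and time_register_ms(0) in another, and printing the results, should show whether the gap seen in the hyperfine output comes from the registration call itself.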

A workaround can be implemented with an LD_PRELOAD shared library calling membarrier(REGISTER_PRIVATE_EXPEDITED) before the first dotnet thread:

// LD_PRELOAD hack to run `membarrier(MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED)` at process startup, before the first thread is created.
// 1) Build: gcc -shared -o mbhack.so mbhack.c -lpthread
// 2) Run: LD_PRELOAD=/path/to/mbhack.so some-dotnet-binary

#include <linux/membarrier.h>
#include <sys/syscall.h>
#include <unistd.h>

__attribute__((constructor))
static void mbhack_init()
{
  syscall(SYS_membarrier, MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED, 0, 0);
}

~16ms difference on a 16-core arm64 system (AWS r7g.4xlarge):

$ hyperfine --style basic --time-unit millisecond ./bin/Release/net9.0/hello-world-dotnet9 
Benchmark 1: ./bin/Release/net9.0/hello-world-dotnet9
  Time (mean ± σ):      46.3 ms ±   4.1 ms    [User: 22.9 ms, System: 7.9 ms]
  Range (min … max):    35.3 ms …  55.3 ms    62 runs

$ LD_PRELOAD=$HOME/mbhack.so hyperfine --style basic --time-unit millisecond ./bin/Release/net9.0/hello-world-dotnet9 
Benchmark 1: ./bin/Release/net9.0/hello-world-dotnet9
  Time (mean ± σ):      30.1 ms ±   0.3 ms    [User: 22.9 ms, System: 8.0 ms]
  Range (min … max):    29.5 ms …  30.9 ms    95 runs

~10ms difference on 16-core x86_64 (AWS r6i.4xlarge):

ubuntu@ip-172-31-41-194:~/hello-world-dotnet9$ hyperfine --style basic --time-unit millisecond ./bin/Release/net9.0/hello-world-dotnet9
Benchmark 1: ./bin/Release/net9.0/hello-world-dotnet9
  Time (mean ± σ):      36.6 ms ±   2.0 ms    [User: 19.3 ms, System: 6.4 ms]
  Range (min … max):    31.4 ms …  42.6 ms    82 runs

ubuntu@ip-172-31-41-194:~/hello-world-dotnet9$ LD_PRELOAD=$HOME/mbhack.so hyperfine --style basic --time-unit millisecond ./bin/Release/net9.0/hello-world-dotnet9
Benchmark 1: ./bin/Release/net9.0/hello-world-dotnet9
  Time (mean ± σ):      26.5 ms ±   0.9 ms    [User: 20.4 ms, System: 6.2 ms]
  Range (min … max):    25.5 ms …  29.2 ms    109 runs
@harisokanovic harisokanovic added the tenet-performance Performance related issue label Aug 20, 2024
@dotnet-issue-labeler dotnet-issue-labeler bot added the needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners label Aug 20, 2024
@dotnet-policy-service dotnet-policy-service bot added the untriaged New issue has not been triaged by the area owner label Aug 20, 2024
harisokanovic pushed a commit to harisokanovic/dotnet_runtime that referenced this issue Aug 20, 2024
Refactor InitializeFlushProcessWriteBuffers(): Split membarrier()
initialization into a new InitializeMembarrier() helper function.

Call InitializeMembarrier() earlier, before the first thread is created,
to improve process start time on Linux. More details can be found in
issue 106722.

Fixes dotnet#106722
@dotnet-policy-service dotnet-policy-service bot added the in-pr There is an active PR which will close this issue when it is merged label Aug 20, 2024
@filipnavara filipnavara added area-PAL-coreclr and removed needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners labels Aug 20, 2024
harisokanovic pushed a commit to harisokanovic/dotnet_runtime that referenced this issue Aug 21, 2024
… improve start time

InitializeFlushProcessWriteBuffers() initializes expedited membarrier()
syscall on Linux, which is much slower when called in a multi-thread
process. Move this init earlier to improve dotnet process start time.
A detailed explanation can be found in issue 106722.

Fixes dotnet#106722
@jkotas jkotas closed this as completed in 27ee590 Aug 22, 2024
@dotnet-policy-service dotnet-policy-service bot removed the untriaged New issue has not been triaged by the area owner label Aug 22, 2024
github-actions bot pushed a commit that referenced this issue Aug 22, 2024
… improve start time

InitializeFlushProcessWriteBuffers() initializes expedited membarrier()
syscall on Linux, which is much slower when called in a multi-thread
process. Move this init earlier to improve dotnet process start time.
A detailed explanation can be found in issue 106722.

Fixes #106722
jkotas pushed a commit that referenced this issue Aug 22, 2024
… improve start time (#106836)

InitializeFlushProcessWriteBuffers() initializes expedited membarrier()
syscall on Linux, which is much slower when called in a multi-thread
process. Move this init earlier to improve dotnet process start time.
A detailed explanation can be found in issue 106722.

Fixes #106722

Co-authored-by: Haris Okanovic <harisokn@amazon.com>
harisokanovic pushed a commit to harisokanovic/dotnet_runtime that referenced this issue Aug 28, 2024
…ialize()

A fixup of commit 27ee590 that's broken on platforms which don't
support membarrier() syscall: GetVirtualPageSize() is called in the
fallback path of InitializeFlushProcessWriteBuffers() and attempts to
mmap() zero bytes.

Move InitializeFlushProcessWriteBuffers() after VIRTUALInitialize() but
before the first thread is created.

Fixes dotnet#106892
Fixes dotnet#106722
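
The failure mode named in this commit message, a zero-length mmap(), can be reproduced in isolation. The sketch below is only an illustration of that Linux behavior, not the runtime's fallback code; mmap_len_errno() is a hypothetical helper. It shows why a page-size value that has not been initialized yet (and is therefore 0) makes the mapping fail.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

// Tries an anonymous private mapping of `len` bytes.
// Returns 0 on success (and unmaps again), or the errno value on failure.
static int mmap_len_errno(size_t len) {
  errno = 0;
  void* p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (p == MAP_FAILED) {
    return errno;  // EINVAL when len is 0 on Linux
  }
  munmap(p, len);
  return 0;
}
```

On Linux, mmap_len_errno(0) yields EINVAL, while a non-zero length such as 4096 succeeds, which is consistent with moving the initialization after the code that sets the page size.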
janvorli pushed a commit that referenced this issue Aug 28, 2024
…ialize() (#107100)

A fixup of commit 27ee590 that's broken on platforms which don't
support membarrier() syscall: GetVirtualPageSize() is called in the
fallback path of InitializeFlushProcessWriteBuffers() and attempts to
mmap() zero bytes.

Move InitializeFlushProcessWriteBuffers() after VIRTUALInitialize() but
before the first thread is created.

Fixes #106892
Fixes #106722

Co-authored-by: Haris Okanovic <harisokn@amazon.com>
github-actions bot pushed a commit that referenced this issue Aug 28, 2024
…ialize()

A fixup of commit 27ee590 that's broken on platforms which don't
support membarrier() syscall: GetVirtualPageSize() is called in the
fallback path of InitializeFlushProcessWriteBuffers() and attempts to
mmap() zero bytes.

Move InitializeFlushProcessWriteBuffers() after VIRTUALInitialize() but
before the first thread is created.

Fixes #106892
Fixes #106722
jkotas pushed a commit that referenced this issue Aug 29, 2024
…ialize() (#107114)

A fixup of commit 27ee590 that's broken on platforms which don't
support membarrier() syscall: GetVirtualPageSize() is called in the
fallback path of InitializeFlushProcessWriteBuffers() and attempts to
mmap() zero bytes.

Move InitializeFlushProcessWriteBuffers() after VIRTUALInitialize() but
before the first thread is created.

Fixes #106892
Fixes #106722

Co-authored-by: Haris Okanovic <harisokn@amazon.com>
jtschuster pushed a commit to jtschuster/runtime that referenced this issue Sep 17, 2024
…ialize() (dotnet#107100)

A fixup of commit 27ee590 that's broken on platforms which don't
support membarrier() syscall: GetVirtualPageSize() is called in the
fallback path of InitializeFlushProcessWriteBuffers() and attempts to
mmap() zero bytes.

Move InitializeFlushProcessWriteBuffers() after VIRTUALInitialize() but
before the first thread is created.

Fixes dotnet#106892
Fixes dotnet#106722

Co-authored-by: Haris Okanovic <harisokn@amazon.com>
@github-actions github-actions bot locked and limited conversation to collaborators Sep 28, 2024
mikelle-rogers pushed a commit to mikelle-rogers/runtime that referenced this issue Dec 10, 2024
… improve start time (dotnet#106724)

InitializeFlushProcessWriteBuffers() initializes expedited membarrier()
syscall on Linux, which is much slower when called in a multi-thread
process. Move this init earlier to improve dotnet process start time.
A detailed explanation can be found in issue 106722.

Fixes dotnet#106722

Co-authored-by: Haris Okanovic <harisokn@amazon.com>
mikelle-rogers pushed a commit to mikelle-rogers/runtime that referenced this issue Dec 10, 2024
…ialize() (dotnet#107100)

A fixup of commit 27ee590 that's broken on platforms which don't
support membarrier() syscall: GetVirtualPageSize() is called in the
fallback path of InitializeFlushProcessWriteBuffers() and attempts to
mmap() zero bytes.

Move InitializeFlushProcessWriteBuffers() after VIRTUALInitialize() but
before the first thread is created.

Fixes dotnet#106892
Fixes dotnet#106722

Co-authored-by: Haris Okanovic <harisokn@amazon.com>