high overhead on blue gene #85
Comments
Happens on mira/cetus (blue gene) with WALLCLOCK@8500 and … Does NOT happen on theta (x86, Cray XYZ) with REALTIME@8500. Does NOT happen on biou (power7) with REALTIME@8500, only 2% overhead. I could try poman (power 8), but if it doesn't happen on biou, then …
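For context, WALLCLOCK and REALTIME are hpcrun sample sources driven by interval timers. The sketch below is not hpcrun's code, just an illustration of the kind of signal-driven trigger involved, assuming the 8500 period is in microseconds:

#include <signal.h>
#include <stdio.h>
#include <sys/time.h>

static volatile sig_atomic_t samples = 0;

static void on_sample(int sig) {
    (void) sig;
    /* a real profiler would unwind the call stack and record a sample here;
     * how long this handler takes is exactly the overhead in question */
    samples++;
}

int main(void) {
    struct sigaction sa;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    sa.sa_handler = on_sample;
    sigaction(SIGALRM, &sa, NULL);

    /* fire every 8500 microseconds, i.e. about 118 times per second */
    struct itimerval it;
    it.it_interval.tv_sec = 0;
    it.it_interval.tv_usec = 8500;
    it.it_value = it.it_interval;
    setitimer(ITIMER_REAL, &it, NULL);

    /* spin for about one second of wall time as a stand-in for real work */
    struct timeval t0, t1;
    gettimeofday(&t0, NULL);
    do {
        gettimeofday(&t1, NULL);
    } while ((t1.tv_sec - t0.tv_sec) * 1000000L
             + (t1.tv_usec - t0.tv_usec) < 1000000L);

    printf("samples taken: %d\n", (int) samples);  /* expect roughly 118 */
    return 0;
}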
Now fixed in commit ccb6bf7, at least for blue gene and powerpc. Turns out that inside the unwinder, libunwind was calling mmap(), which … But I can tell there is a similar problem to a smaller degree on theta …
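The preview above is cut off, so it does not show what libunwind's mmap() call broke or what commit ccb6bf7 actually changes. As general context only, one common way to keep mmap()/malloc() out of a signal-handler unwind path is to hand out memory from a pool reserved ahead of time; the sketch below is hypothetical (pool_alloc and the pool size are made up), not the fix from that commit:

#include <stddef.h>

/* Hypothetical per-thread bump allocator. The pool is reserved up front,
 * outside the signal handler, so the unwind path never has to call mmap()
 * or malloc() while handling a sample. */
#define POOL_BYTES (64 * 1024)

static __thread unsigned char sample_pool[POOL_BYTES];
static __thread size_t pool_used = 0;

/* Callable from the sample handler: no locks, no system calls. */
static void *pool_alloc(size_t bytes) {
    bytes = (bytes + 15u) & ~(size_t) 15;   /* keep 16-byte alignment */
    if (pool_used + bytes > POOL_BYTES)
        return NULL;                        /* pool exhausted: drop the sample */
    void *p = sample_pool + pool_used;
    pool_used += bytes;
    return p;
}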
Please remove me from this list.
Thank you.
Dung X Nguyen
On Jun 7, 2018, at 2:01 PM, Mark W. Krentel ***@***.*** wrote:
I tried inserting PAPI interrupts into openmp regions directly,
outside of hpcrun. With PAPI_TOT_CYC at 8,000,000 (200/sec) and
16 threads, I get 1-2% overhead.
So, it's not interrupts breaking something in the MPI or openmp
synchronization.
This suggests that there's something inside the hpcrun interrupt
handler that's taking too long. Maybe something in the concurrent
skip list, maybe something that synchronizes between threads.
But the big mystery remains why this happens on blue gene but not on
power7 and not on KNL. Some key part of the bug must be blue gene
specific.
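For reference, the PAPI experiment described in the quoted comment can be set up roughly as below. This is a minimal sketch, assuming a Linux-style build against libpapi with OpenMP enabled; the handler body and the work section are placeholders, not the actual test code:

/* build (assumption): gcc -fopenmp papi_sketch.c -lpapi */
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <papi.h>

/* Overflow handler: called on each PAPI_TOT_CYC interrupt. A real profiler
 * would unwind the call stack here; this sketch leaves the body empty. */
static void on_overflow(int event_set, void *address,
                        long long overflow_vector, void *context) {
    (void) event_set; (void) address; (void) overflow_vector; (void) context;
}

int main(void) {
    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) {
        fprintf(stderr, "PAPI init failed\n");
        exit(1);
    }
    /* let PAPI tell threads apart */
    PAPI_thread_init((unsigned long (*)(void)) pthread_self);

    #pragma omp parallel num_threads(16)
    {
        int es = PAPI_NULL;
        long long count = 0;

        PAPI_register_thread();
        PAPI_create_eventset(&es);
        PAPI_add_event(es, PAPI_TOT_CYC);
        /* interrupt every 8,000,000 cycles, about 200/sec per the comment */
        PAPI_overflow(es, PAPI_TOT_CYC, 8000000, 0, on_overflow);
        PAPI_start(es);

        /* ... the openmp work (e.g. the AMG solve loop) would run here ... */

        PAPI_stop(es, &count);
        PAPI_unregister_thread();
    }
    return 0;
}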
Hpcrun seems to add a high overhead on Blue Gene. Master adds more
than 2x for the openmp solve phase in amg2006. The ompt-tr4 branch
with llvm libomp runtime adds even more.
This is with AMG 2006 on mira/cetus at ANL, 8 nodes, 8 MPI ranks,
16 openmp threads, problem size (-r) 16,16,16. AMG compiled with gnu,
flags '-g -O2', run with WALLCLOCK at 8500 (118 samples/sec).
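(Assuming the 8500 is a period in microseconds, 1,000,000 / 8500 ≈ 117.6, which matches the 118 samples/sec figure.)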
AMG 2006 native, no toolkit.
wall clock time = 13.350482 seconds
wall clock time = 205.818907 seconds
wall clock time = 16.934752 seconds
Toolkit master, regular libgomp.
wall clock time = 31.799200 seconds
wall clock time = 241.473654 seconds
wall clock time = 43.120992 seconds
Branch ompt-tr4 with llvm libomp runtime and OMP_IDLE.
wall clock time = 35.795240 seconds
wall clock time = 247.430433 seconds
wall clock time = 72.394108 seconds
That's about 2.5x for phases 1 and 3 with master and over 4x for ompt.
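(Working the ratios from the numbers above: with master, phase 1 is 31.80 / 13.35 ≈ 2.4x and phase 3 is 43.12 / 16.93 ≈ 2.5x; with ompt-tr4, phase 3 is 72.39 / 16.93 ≈ 4.3x.)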