Make nvToolsExt conditional on WITH_CUDA_PROFILING #428

haampie · 2021-02-05T08:09:18Z

As pointed out by @hfp in #425

alazzaro · 2021-02-05T08:12:22Z

retest this please

jenkins-cscs · 2021-02-05T08:13:04Z

Can one of the admins verify this patch?

codecov · 2021-02-05T08:13:19Z

Codecov Report

Merging #428 (054f4c6) into develop (f3f60cf) will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           develop    #428   +/-   ##
=======================================
  Coverage     63.1%   63.1%           
=======================================
  Files           86      86           
  Lines        25625   25625           
=======================================
  Hits         16190   16190           
  Misses        9435    9435

Flag	Coverage Δ
unittests	`63.1% <ø> (ø)`
with-blas	`63.1% <ø> (ø)`
with-libxsmm	`63.2% <ø> (ø)`
with-mpi	`63.6% <ø> (ø)`
with-openmp	`62.3% <ø> (ø)`
without-mpi	`59.4% <ø> (ø)`
without-openmp	`62.3% <ø> (ø)`

Flags with carried forward coverage won't be shown. Click here to find out more.

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f3f60cf...9e210b6. Read the comment docs.

haampie · 2021-02-05T09:00:21Z

GNU test by hand:

100% tests passed, 0 tests failed out of 23

Total Test time (real) = 2254.39 sec

That's +6% up from https://object.cscs.ch/v1/AUTH_40b5d92b316940098ceb15cf46fb815e/dbcsr-artifacts/logs/build-679/gnu.test.out, not sure if significant

haampie · 2021-02-05T09:57:26Z

From CI: 2783.59s 😕 +30%

haampie · 2021-02-05T10:01:18Z

Curious, cause I ran it like this:

$ WORKSPACE=/apps/daint/SSL/hstoppel/dbcsr/dbcsr BUILD_TAG=dbcsr STAGE_NAME=dbcsr_build sbatch -Acsstaff /apps/daint/SSL/hstoppel/dbcsr/dbcsr/.ci/daint.cscs.ch/gnu.build.sh 
$ ... wait ... 
$ WORKSPACE=/apps/daint/SSL/hstoppel/dbcsr/dbcsr BUILD_TAG=dbcsr STAGE_NAME=dbcsr_build sbatch -Acsstaff /apps/daint/SSL/hstoppel/dbcsr/dbcsr/.ci/daint.cscs.ch/gnu.test.sh

and

$ cd /apps/daint/SSL/hstoppel/dbcsr/dbcsr
$ git log
commit a14e7b18df4a7e535b58ea77e65345884e2d6e55 (HEAD -> fix-nvtools-link, origin/fix-nvtools-link)

seems to be on the right commit.

alazzaro · 2021-02-05T10:02:54Z

From CI: 2783.59s 😕 +30%

Indeed, but can we remove now the output and see if it goes down? (Hans trick)
At this point, it is important to keep this number to the minimum. With the #427 we can think to parallelize the test on multiple ranks...

alazzaro · 2021-02-05T10:11:44Z

Wait a second, all tests are now slower. Comparison at

old: https://object.cscs.ch/v1/AUTH_40b5d92b316940098ceb15cf46fb815e/dbcsr-artifacts/logs/build-679/gnu.test.out
new: https://object.cscs.ch/v1/AUTH_40b5d92b316940098ceb15cf46fb815e/dbcsr-artifacts/logs/build-689/gnu.test.out

For example, the dbcsr_perf:inputs/test_rect1_dense.perf takes >15x more...

haampie · 2021-02-05T12:24:53Z

Can you trigger CI on an old commit (e.g. #679) @alazzaro just to be sure it's not an issue with daint itself (unlikely... but possible)

alazzaro · 2021-02-05T12:32:08Z

No, I can't... But I start to think it is a Daint issue too...
What I suggest is to run manually the two versions of the code on the same daint allocation only for a representative test. This will tell us if there is a performance regression...

haampie · 2021-02-05T12:44:12Z

What comes to mind is a recent bug I've seen in slurm after the latest Daint upgrade (fixed on Dom, but not yet on Daint) about thread pinning when using --exclusive and --hint=nomultithread together. It'll run processes on the same CPUs:

$ srun -Acsstaff -Cgpu --exclusive --hint=nomultithread -n4 -c3 bash -c 'taskset -pc $$'
pid 14458's current affinity list: 0-2
pid 14459's current affinity list: 3-5
pid 14460's current affinity list: 0,6,7
pid 14461's current affinity list: 1-3

However, I couldn't trigger that bug with --exclusive --ntasks-per-core=1 like we do here in CI:

$ srun -Acsstaff -Cgpu --exclusive --ntasks-per-core=1 -n4 -c3 bash -c 'taskset -pc $$'
srun: job 29139366 queued and waiting for resources
srun: job 29139366 has been allocated resources
pid 30342's current affinity list: 0,1,12,13
pid 30343's current affinity list: 2,3,14,15
pid 30344's current affinity list: 4,5,16,17
pid 30345's current affinity list: 6,7,18,19

alazzaro · 2021-02-05T13:54:04Z

Well, this is an interesting consideration (affinity). We actually tested it years ago...
I've just retested with

#!/bin/bash -l

#SBATCH --export=ALL
#SBATCH --exclusive
#SBATCH --constraint="gpu"
#SBATCH -A cray
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=3
#SBATCH --ntasks-per-core=1 # 1=no HT, 2=HT

export OMP_PROC_BIND=TRUE
export OMP_NUM_THREADS=3

srun -n 4 ./xthi

(xthi does print the affinity mask)

and I get:

Hello from rank 0, thread 0, on nid03431. (core affinity = 0)
Hello from rank 0, thread 1, on nid03431. (core affinity = 12)
Hello from rank 0, thread 2, on nid03431. (core affinity = 1)
Hello from rank 1, thread 0, on nid03431. (core affinity = 2)
Hello from rank 1, thread 1, on nid03431. (core affinity = 14)
Hello from rank 1, thread 2, on nid03431. (core affinity = 3)
Hello from rank 2, thread 0, on nid03431. (core affinity = 4)
Hello from rank 2, thread 1, on nid03431. (core affinity = 16)
Hello from rank 2, thread 2, on nid03431. (core affinity = 5)
Hello from rank 3, thread 0, on nid03431. (core affinity = 6)
Hello from rank 3, thread 1, on nid03431. (core affinity = 18)
Hello from rank 3, thread 2, on nid03431. (core affinity = 7)

This is wrong! We run on the HT cores.
Thanks for suggesting this problem!
We can fix it with this change:

#SBATCH --cpus-per-task=6

Instead of 3.

export OMP_PROC_BIND=close
export OMP_PLACES=cores

instead of export OMP_PROC_BIND=TRUE

Could you do that? Let's see if we can go faster...
BTW, since you are there, can you address this comment on -Werror?

haampie · 2021-02-05T14:40:19Z

Ok, I'm back at this again.

We run on the HT cores.

How do you know which affinity numbers correspond to which physical core?

alazzaro · 2021-02-05T14:47:06Z

Ok, I'm back at this again.

We run on the HT cores.

How do you know which affinity numbers correspond to which physical core?

0-11 are the physical core, 12-23 are the HT cores. Therefore, 0 and 12 are sitting on the same core.
You can check this via:

srun -n1 --constraint="gpu" lstopo

where you get something like:

L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
      PU L#0 (P#0)
      PU L#1 (P#12)

Sor for Core L#0 you have P#0 and P#12...

alazzaro · 2021-02-05T14:47:55Z

The code I use to check the affinity (xthi) is at https://github.com/olcf/XC30-Training/blob/master/affinity/Xthi.c

haampie · 2021-02-05T14:53:29Z

Right. I've checked on Dom which runs slurm 20.11.3 and contains the fix for my bug report, and this is what I get:

$ srun -Acsstaff -Cgpu --exclusive --hint=nomultithread -n4 -c3 bash -c 'taskset -pc $$'
pid 17933's current affinity list: 0-2
pid 17934's current affinity list: 3-5
pid 17935's current affinity list: 6-8
pid 17936's current affinity list: 9-11

indeed no multithreading. But

$ srun -Acsstaff -Cgpu --exclusive --ntasks-per-core=1 -n4 -c3 bash -c 'taskset -pc $$'
pid 19565's current affinity list: 0,1,12,13
pid 19566's current affinity list: 2,3,14,15
pid 19567's current affinity list: 4,5,16,17
pid 19568's current affinity list: 6,7,18,19

is not the way to do it apparently.

On Daint I can do this:

$ srun -Acsstaff -Cgpu --hint=nomultithread -n4 -c3 bash -c 'taskset -pc $$'
pid 7998's current affinity list: 0-2
pid 7999's current affinity list: 3-5
pid 8001's current affinity list: 9-11
pid 8000's current affinity list: 6-8

and it behaves as expected (dropping --exclusive because of the bug).

I think we can just drop --exclusive and replace --ntasks-per-core=1 with --hint=nomultithread. The tests are executed sequentially, so no need for --exclusive.

alazzaro · 2021-02-05T15:03:38Z

retest this please

haampie · 2021-02-05T15:07:39Z

Should I open another issue with cray / slurm people about --ntasks-per-core=1 hyperthreading? This is a bug too, right?

This reverts commit a6fa6a8.

alazzaro · 2021-02-05T15:16:05Z

retest this please

alazzaro · 2021-02-05T15:16:38Z

Should I open another issue with cray / slurm people about --ntasks-per-core=1 hyperthreading? This is a bug too, right?

This is a known problem, sorry for that...

… up with cp2k#428.

haampie · 2021-02-05T16:12:07Z

2377.19 sec for GNU now, that looks better, but it's still +12% w.r.t. https://object.cscs.ch/v1/AUTH_40b5d92b316940098ceb15cf46fb815e/dbcsr-artifacts/logs/build-679/gnu.test.out...

haampie · 2021-02-05T16:14:14Z

Let me try 21dae0f by hand with the current state of daint for refence.

haampie · 2021-02-05T16:58:59Z

To wrap this up: --ntasks-per-core=1 doesn't have anything to do with hyperthreading since --threads-per-core must exist for a reason (note that --threads-per-core=1 is just to filter cpus, it's not to disable hyperthreading for the job), --hint=nomultithread is the way to disable hyperthreading. Also the cscsci partition is configured to be exclusive across different jobs, the only reason you'd want to use --exclusive it is if you run multiple srun steps in a single job asynchronously.

I've run the CI script once more by hand on the latest commit from this PR for GNU, and I'm getting 2155s, which is in 2% of the previous CI runs, so seems fine.

And the same for the old commit 21dae0f with the current state of Daint gives me 2235s with some tests that are just flaky (maybe bad thread pinning then)

alazzaro · 2021-02-08T08:02:30Z

As a conclusion, the execution of the CI on Daint is still slower than before.
For example:

Old: 1/23 Test #1: dbcsr_perf:inputs/test_H2O.perf ....................... Passed 4.67 sec
New: 1/23 Test #1: dbcsr_perf:inputs/test_H2O.perf ....................... Passed 7.31 sec

I've opened #429 to track the issue. I will investigate...

Make nvToolsExt conditional on WITH_CUDA_PROFILING

a14e7b1

haampie force-pushed the fix-nvtools-link branch from d2bb331 to a14e7b1 Compare February 5, 2021 08:11

haampie added 4 commits February 5, 2021 15:58

no need for these flags on mc partition w/o calls to srun

a6fa6a8

Use --hint=nomultithread and work around --exclusive bug

d317ff2

make mkdir consistent with other test.sh files

4081e8d

Bring back -Werror in Release

34e0283

haampie added 2 commits February 5, 2021 16:13

Revert "no need for these flags on mc partition w/o calls to srun"

5f75549

This reverts commit a6fa6a8.

Make the build flags similar to the run flags again

9e210b6

hfp added a commit to hfp/dbcsr that referenced this pull request Feb 5, 2021

Try to avoid MPS issue (temporarily) testing with only one rank. Sync…

b5cb129

… up with cp2k#428.

hfp merged commit dcbc5f6 into cp2k:develop Feb 8, 2021

hfp added a commit to hfp/dbcsr that referenced this pull request Feb 8, 2021

Incorporated cp2k#428.

d17f03f

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make nvToolsExt conditional on WITH_CUDA_PROFILING #428

Make nvToolsExt conditional on WITH_CUDA_PROFILING #428

haampie commented Feb 5, 2021

alazzaro commented Feb 5, 2021

jenkins-cscs commented Feb 5, 2021

codecov bot commented Feb 5, 2021 •

edited

Loading

haampie commented Feb 5, 2021

haampie commented Feb 5, 2021 •

edited

Loading

haampie commented Feb 5, 2021 •

edited

Loading

alazzaro commented Feb 5, 2021

alazzaro commented Feb 5, 2021

haampie commented Feb 5, 2021

alazzaro commented Feb 5, 2021

haampie commented Feb 5, 2021 •

edited

Loading

alazzaro commented Feb 5, 2021

haampie commented Feb 5, 2021

alazzaro commented Feb 5, 2021

alazzaro commented Feb 5, 2021

haampie commented Feb 5, 2021

alazzaro commented Feb 5, 2021

haampie commented Feb 5, 2021 •

edited

Loading

alazzaro commented Feb 5, 2021

alazzaro commented Feb 5, 2021

haampie commented Feb 5, 2021

haampie commented Feb 5, 2021 •

edited

Loading

haampie commented Feb 5, 2021 •

edited

Loading

alazzaro commented Feb 8, 2021

Make nvToolsExt conditional on WITH_CUDA_PROFILING #428

Make nvToolsExt conditional on WITH_CUDA_PROFILING #428

Conversation

haampie commented Feb 5, 2021

alazzaro commented Feb 5, 2021

jenkins-cscs commented Feb 5, 2021

codecov bot commented Feb 5, 2021 • edited Loading

Codecov Report

haampie commented Feb 5, 2021

haampie commented Feb 5, 2021 • edited Loading

haampie commented Feb 5, 2021 • edited Loading

alazzaro commented Feb 5, 2021

alazzaro commented Feb 5, 2021

haampie commented Feb 5, 2021

alazzaro commented Feb 5, 2021

haampie commented Feb 5, 2021 • edited Loading

alazzaro commented Feb 5, 2021

haampie commented Feb 5, 2021

alazzaro commented Feb 5, 2021

alazzaro commented Feb 5, 2021

haampie commented Feb 5, 2021

alazzaro commented Feb 5, 2021

haampie commented Feb 5, 2021 • edited Loading

alazzaro commented Feb 5, 2021

alazzaro commented Feb 5, 2021

haampie commented Feb 5, 2021

haampie commented Feb 5, 2021 • edited Loading

haampie commented Feb 5, 2021 • edited Loading

alazzaro commented Feb 8, 2021

codecov bot commented Feb 5, 2021 •

edited

Loading

haampie commented Feb 5, 2021 •

edited

Loading

haampie commented Feb 5, 2021 •

edited

Loading

haampie commented Feb 5, 2021 •

edited

Loading

haampie commented Feb 5, 2021 •

edited

Loading

haampie commented Feb 5, 2021 •

edited

Loading

haampie commented Feb 5, 2021 •

edited

Loading