
Make nvToolsExt conditional on WITH_CUDA_PROFILING #428

Merged: 7 commits into cp2k:develop on Feb 8, 2021

Conversation

@haampie (Contributor) commented Feb 5, 2021

As pointed out by @hfp in #425
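
For reference, a rough sketch of the intent from the configure side (an illustration only: apart from WITH_CUDA_PROFILING, which is named in the title, the option spellings below are assumptions, so check the project's CMakeLists.txt for the real toggles):

# illustration only; everything except WITH_CUDA_PROFILING is an assumption
$ cmake -S . -B build -DWITH_CUDA_PROFILING=OFF   # libnvToolsExt should no longer be required at link time
$ cmake -S . -B build -DWITH_CUDA_PROFILING=ON    # opt in to NVTX profiling and link against nvToolsExt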

@alazzaro (Member) commented Feb 5, 2021

retest this please

@jenkins-cscs (Collaborator) commented

Can one of the admins verify this patch?

@codecov (bot) commented Feb 5, 2021

Codecov Report

Merging #428 (054f4c6) into develop (f3f60cf) will not change coverage.
The diff coverage is n/a.


@@           Coverage Diff           @@
##           develop    #428   +/-   ##
=======================================
  Coverage     63.1%   63.1%           
=======================================
  Files           86      86           
  Lines        25625   25625           
=======================================
  Hits         16190   16190           
  Misses        9435    9435           
Flag            Coverage Δ
unittests       63.1% <ø> (ø)
with-blas       63.1% <ø> (ø)
with-libxsmm    63.2% <ø> (ø)
with-mpi        63.6% <ø> (ø)
with-openmp     62.3% <ø> (ø)
without-mpi     59.4% <ø> (ø)
without-openmp  62.3% <ø> (ø)

Flags with carried forward coverage won't be shown.


Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f3f60cf...9e210b6.

@haampie (Contributor, Author) commented Feb 5, 2021

GNU test by hand:

100% tests passed, 0 tests failed out of 23

Total Test time (real) = 2254.39 sec

That's +6% compared to https://object.cscs.ch/v1/AUTH_40b5d92b316940098ceb15cf46fb815e/dbcsr-artifacts/logs/build-679/gnu.test.out, not sure if that's significant.

@haampie (Contributor, Author) commented Feb 5, 2021

From CI: 2783.59s 😕 +30%

@haampie (Contributor, Author) commented Feb 5, 2021

Curious, because I ran it like this:

$ WORKSPACE=/apps/daint/SSL/hstoppel/dbcsr/dbcsr BUILD_TAG=dbcsr STAGE_NAME=dbcsr_build sbatch -Acsstaff /apps/daint/SSL/hstoppel/dbcsr/dbcsr/.ci/daint.cscs.ch/gnu.build.sh 
$ ... wait ... 
$ WORKSPACE=/apps/daint/SSL/hstoppel/dbcsr/dbcsr BUILD_TAG=dbcsr STAGE_NAME=dbcsr_build sbatch -Acsstaff /apps/daint/SSL/hstoppel/dbcsr/dbcsr/.ci/daint.cscs.ch/gnu.test.sh

and

$ cd /apps/daint/SSL/hstoppel/dbcsr/dbcsr
$ git log
commit a14e7b18df4a7e535b58ea77e65345884e2d6e55 (HEAD -> fix-nvtools-link, origin/fix-nvtools-link)

So it seems to be on the right commit.

@alazzaro (Member) commented Feb 5, 2021

From CI: 2783.59s 😕 +30%

Indeed, but can we now remove the output and see if it goes down? (Hans' trick)
At this point it is important to keep this number to a minimum. With #427 we can think about parallelizing the tests over multiple ranks...

@alazzaro (Member) commented Feb 5, 2021

Wait a second, all tests are now slower. Comparison:

old: https://object.cscs.ch/v1/AUTH_40b5d92b316940098ceb15cf46fb815e/dbcsr-artifacts/logs/build-679/gnu.test.out
new: https://object.cscs.ch/v1/AUTH_40b5d92b316940098ceb15cf46fb815e/dbcsr-artifacts/logs/build-689/gnu.test.out

For example, dbcsr_perf:inputs/test_rect1_dense.perf takes >15x longer...

@haampie (Contributor, Author) commented Feb 5, 2021

Can you trigger CI on an old commit (e.g. #679), @alazzaro, just to be sure it's not an issue with Daint itself (unlikely... but possible)?

@alazzaro (Member) commented Feb 5, 2021

No, I can't... But I'm starting to think it is a Daint issue too...
What I suggest is to manually run the two versions of the code in the same Daint allocation, for just one representative test. That will tell us whether there is a performance regression...

@haampie (Contributor, Author) commented Feb 5, 2021

What comes to mind is a recent bug I've seen in Slurm after the latest Daint upgrade (fixed on Dom, but not yet on Daint) about thread pinning when using --exclusive and --hint=nomultithread together. It ends up running processes on the same CPUs:

$ srun -Acsstaff -Cgpu --exclusive --hint=nomultithread -n4 -c3 bash -c 'taskset -pc $$'
pid 14458's current affinity list: 0-2
pid 14459's current affinity list: 3-5
pid 14460's current affinity list: 0,6,7
pid 14461's current affinity list: 1-3

However, I couldn't trigger that bug with --exclusive --ntasks-per-core=1 like we do here in CI:

$ srun -Acsstaff -Cgpu --exclusive --ntasks-per-core=1 -n4 -c3 bash -c 'taskset -pc $$'
srun: job 29139366 queued and waiting for resources
srun: job 29139366 has been allocated resources
pid 30342's current affinity list: 0,1,12,13
pid 30343's current affinity list: 2,3,14,15
pid 30344's current affinity list: 4,5,16,17
pid 30345's current affinity list: 6,7,18,19

@alazzaro (Member) commented Feb 5, 2021

Well, this is an interesting consideration (affinity). We actually tested it years ago...
I've just retested with

#!/bin/bash -l

#SBATCH --export=ALL
#SBATCH --exclusive
#SBATCH --constraint="gpu"
#SBATCH -A cray
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=3
#SBATCH --ntasks-per-core=1 # 1=no HT, 2=HT

export OMP_PROC_BIND=TRUE
export OMP_NUM_THREADS=3

srun -n 4 ./xthi

(xthi prints the affinity mask)

and I get:

Hello from rank 0, thread 0, on nid03431. (core affinity = 0)
Hello from rank 0, thread 1, on nid03431. (core affinity = 12)
Hello from rank 0, thread 2, on nid03431. (core affinity = 1)
Hello from rank 1, thread 0, on nid03431. (core affinity = 2)
Hello from rank 1, thread 1, on nid03431. (core affinity = 14)
Hello from rank 1, thread 2, on nid03431. (core affinity = 3)
Hello from rank 2, thread 0, on nid03431. (core affinity = 4)
Hello from rank 2, thread 1, on nid03431. (core affinity = 16)
Hello from rank 2, thread 2, on nid03431. (core affinity = 5)
Hello from rank 3, thread 0, on nid03431. (core affinity = 6)
Hello from rank 3, thread 1, on nid03431. (core affinity = 18)
Hello from rank 3, thread 2, on nid03431. (core affinity = 7)

This is wrong! We are running on the HT cores.
Thanks for pointing out this problem!
We can fix it with this change:

#SBATCH --cpus-per-task=6

Instead of 3.

export OMP_PROC_BIND=close
export OMP_PLACES=cores

instead of export OMP_PROC_BIND=TRUE
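
Putting the two changes together, a sketch of the adjusted script (the #SBATCH lines not shown here stay exactly as in the script quoted above):

#!/bin/bash -l
#SBATCH --constraint="gpu"
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=6        # was 3: reserve both PUs of each of the 3 physical cores per rank
#SBATCH --ntasks-per-core=1

export OMP_PROC_BIND=close       # was TRUE
export OMP_PLACES=cores          # bind threads to physical cores rather than hardware threads
export OMP_NUM_THREADS=3

srun -n 4 ./xthi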

Could you do that? Let's see if we can go faster...
BTW, since you're at it, can you also address this comment about -Werror?

@haampie (Contributor, Author) commented Feb 5, 2021

Ok, I'm back at this again.

We run on the HT cores.

How do you know which affinity numbers correspond to which physical core?

@alazzaro (Member) commented Feb 5, 2021

Ok, I'm back at this again.

We run on the HT cores.

How do you know which affinity numbers correspond to which physical core?

0-11 are the physical cores, 12-23 are the HT cores. Therefore, 0 and 12 sit on the same physical core.
You can check this via:

srun -n1 --constraint="gpu" lstopo

where you get something like:

L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
      PU L#0 (P#0)
      PU L#1 (P#12)

So for Core L#0 you have P#0 and P#12...
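
Another quick way to cross-check which PUs share a physical core, assuming the usual Linux sysfs layout is available on the compute node:

$ srun -n1 --constraint="gpu" cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list   # expect 0,12 on this layout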

@alazzaro (Member) commented Feb 5, 2021

The code I use to check the affinity (xthi) is at https://github.com/olcf/XC30-Training/blob/master/affinity/Xthi.c

@haampie (Contributor, Author) commented Feb 5, 2021

Right. I've checked on Dom, which runs Slurm 20.11.3 and contains the fix for my bug report, and this is what I get:

$ srun -Acsstaff -Cgpu --exclusive --hint=nomultithread -n4 -c3 bash -c 'taskset -pc $$'
pid 17933's current affinity list: 0-2
pid 17934's current affinity list: 3-5
pid 17935's current affinity list: 6-8
pid 17936's current affinity list: 9-11

indeed no multithreading. But

$ srun -Acsstaff -Cgpu --exclusive --ntasks-per-core=1 -n4 -c3 bash -c 'taskset -pc $$'
pid 19565's current affinity list: 0,1,12,13
pid 19566's current affinity list: 2,3,14,15
pid 19567's current affinity list: 4,5,16,17
pid 19568's current affinity list: 6,7,18,19

is not the way to do it apparently.

On Daint I can do this:

$ srun -Acsstaff -Cgpu --hint=nomultithread -n4 -c3 bash -c 'taskset -pc $$'
pid 7998's current affinity list: 0-2
pid 7999's current affinity list: 3-5
pid 8001's current affinity list: 9-11
pid 8000's current affinity list: 6-8

and it behaves as expected (dropping --exclusive because of the bug).

I think we can just drop --exclusive and replace --ntasks-per-core=1 with --hint=nomultithread. The tests are executed sequentially, so no need for --exclusive.
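
As a sketch, the change to the CI srun options would look like this (<ranks> and <cpus-per-task> are placeholders; the rest of the command line stays as it is):

# before (current CI):
$ srun --exclusive --ntasks-per-core=1 -n <ranks> -c <cpus-per-task> ...
# after (proposed):
$ srun --hint=nomultithread -n <ranks> -c <cpus-per-task> ...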

@alazzaro (Member) commented Feb 5, 2021

retest this please

@haampie (Contributor, Author) commented Feb 5, 2021

Should I open another issue with the Cray / Slurm people about --ntasks-per-core=1 and hyperthreading? This is a bug too, right?

@alazzaro (Member) commented Feb 5, 2021

retest this please

@alazzaro (Member) commented Feb 5, 2021

Should I open another issue with cray / slurm people about --ntasks-per-core=1 hyperthreading? This is a bug too, right?

This is a known problem, sorry for that...

hfp added a commit to hfp/dbcsr that referenced this pull request Feb 5, 2021
@haampie (Contributor, Author) commented Feb 5, 2021

2377.19 sec for GNU now. That looks better, but it's still +12% w.r.t. https://object.cscs.ch/v1/AUTH_40b5d92b316940098ceb15cf46fb815e/dbcsr-artifacts/logs/build-679/gnu.test.out

@haampie (Contributor, Author) commented Feb 5, 2021

Let me try 21dae0f by hand with the current state of Daint for reference.

@haampie (Contributor, Author) commented Feb 5, 2021

To wrap this up: --ntasks-per-core=1 doesn't have anything to do with hyperthreading, since --threads-per-core exists for a reason (note that --threads-per-core=1 just filters CPUs; it does not disable hyperthreading for the job), and --hint=nomultithread is the way to disable hyperthreading. Also, the cscsci partition is configured to be exclusive across different jobs, so the only reason you'd want to use --exclusive is if you run multiple srun steps in a single job asynchronously.
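
For completeness, the one pattern where --exclusive would still help is a sketch like this (hypothetical step binaries), where several steps run asynchronously inside one allocation and each needs its own CPUs:

# inside a single job allocation: run two steps concurrently on disjoint sets of CPUs
srun --exclusive -n 2 -c 3 ./step_a &
srun --exclusive -n 2 -c 3 ./step_b &
wait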

I've run the CI script once more by hand on the latest commit from this PR for GNU, and I'm getting 2155s, which is within 2% of the previous CI runs, so that seems fine.

And running the old commit 21dae0f with the current state of Daint gives me 2235s, with some tests that are just flaky (maybe bad thread pinning, then).

@hfp merged commit dcbc5f6 into cp2k:develop on Feb 8, 2021
@alazzaro (Member) commented Feb 8, 2021

In conclusion, the execution of the CI on Daint is still slower than before.
For example:

Old: 1/23 Test #1: dbcsr_perf:inputs/test_H2O.perf ....................... Passed 4.67 sec
New: 1/23 Test #1: dbcsr_perf:inputs/test_H2O.perf ....................... Passed 7.31 sec

I've opened #429 to track the issue. I will investigate...

hfp added a commit to hfp/dbcsr that referenced this pull request Feb 8, 2021