
Analyse, tune and debug 'launch' tests (automatic comparison scripts; use fewer events; etc...) #711

Open
valassi opened this issue Jun 16, 2023 · 6 comments

valassi commented Jun 16, 2023

Hi @oliviermattelaer @roiser @hageboeck @zeniheisser @Jooorgen, I have finally run a few systematic 'launch' tests (using the lauX.sh scripts of #683). These effectively run ./bin/generate_events.

No time to analyse the details now, but the files are in WIP MR #709.

There are some icolamp crashes, see #710.

And then there is the analysis of physics results and of timing performance, which I will do here.

First impressions:

  • up to ggttg everything looks ok: same cross sections; timing performance difficult to tell
  • from ggttgg onwards: some crashes, different cross sections, and FORTRAN clearly much slower
@valassi valassi self-assigned this Jun 16, 2023

valassi commented Jun 16, 2023

Looking at timing performance:

[avalassi@itscrd80 gcc11.2/cvmfs] /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tlau> for f in ./logs_*/*txt; do echo $f; egrep '^(Thu|Fri)' $f; done 
./logs_ggtt_CPP/output.txt
Thu Jun 15 07:46:35 CEST 2023
Thu Jun 15 07:46:56 CEST 2023
./logs_ggtt_CUDA/output.txt
Thu Jun 15 07:45:46 CEST 2023
Thu Jun 15 07:46:10 CEST 2023
./logs_ggtt_FORTRAN/output.txt
Thu Jun 15 07:46:11 CEST 2023
Thu Jun 15 07:46:34 CEST 2023
./logs_ggttg_CPP/output.txt
Thu Jun 15 07:48:08 CEST 2023
Thu Jun 15 07:48:42 CEST 2023
./logs_ggttg_CUDA/output.txt
Thu Jun 15 07:46:58 CEST 2023
Thu Jun 15 07:47:32 CEST 2023
./logs_ggttg_FORTRAN/output.txt
Thu Jun 15 07:47:33 CEST 2023
Thu Jun 15 07:48:07 CEST 2023
./logs_ggttgg_CPP/output.txt
Thu Jun 15 08:02:42 CEST 2023
Thu Jun 15 08:07:21 CEST 2023
./logs_ggttgg_CUDA/output.txt
Thu Jun 15 07:48:43 CEST 2023
Thu Jun 15 07:50:39 CEST 2023
./logs_ggttgg_FORTRAN/output.txt
Thu Jun 15 07:50:40 CEST 2023
Thu Jun 15 08:02:41 CEST 2023
./logs_ggttggg_CPP/output.txt
Thu Jun 15 21:10:27 CEST 2023
Fri Jun 16 01:52:35 CEST 2023
./logs_ggttggg_CUDA/output.txt
Thu Jun 15 08:07:22 CEST 2023
Thu Jun 15 08:52:27 CEST 2023
./logs_ggttggg_FORTRAN/output.txt
Thu Jun 15 08:52:28 CEST 2023
Thu Jun 15 21:10:26 CEST 2023

Focusing on ggttgg and ggttggg, even though these are the processes where Fortran crashes (#710):

For ggttgg

  • CPP 5m21
  • CUDA 1m56
  • FORTRAN 12m01

For ggttggg

  • CPP 4h41m
  • CUDA 45m
  • FORTRAN 12h18m

So overall the speedup over FORTRAN is 2x to 3x for CPP and around 6x to 15x for CUDA, which is not bad.

Note that here the CUDA speedup is measured with respect to all 4 cores on the CPU, not a single core.
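The start/end timestamps above could be turned into elapsed times automatically. A minimal sketch (a hypothetical helper, not one of the existing lauX.sh scripts) that parses two `date`-style lines from the logs and returns the difference in seconds:

```python
# Hypothetical helper to compute elapsed wall-clock time of a run from the
# start/end 'date' lines printed in the output logs (assumes both timestamps
# use the same timezone and an English locale).
from datetime import datetime

def elapsed_seconds(start_line: str, end_line: str) -> float:
    """Parse two lines like 'Thu Jun 15 07:46:35 CEST 2023' and return
    the difference in seconds."""
    fmt = "%a %b %d %H:%M:%S %Y"
    def parse(line: str) -> datetime:
        parts = line.split()
        # Drop the timezone field (index 4): %Z parsing is unreliable.
        return datetime.strptime(" ".join(parts[:4] + parts[5:]), fmt)
    return (parse(end_line) - parse(start_line)).total_seconds()

# Example with the ggttgg FORTRAN timestamps above: 07:50:40 -> 08:02:41
print(elapsed_seconds("Thu Jun 15 07:50:40 CEST 2023",
                      "Thu Jun 15 08:02:41 CEST 2023"))  # 721.0 (i.e. 12m01)
```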


valassi commented Jun 16, 2023

Looking at the events in the LHE files - note that the two FORTRAN runs (ggttgg and ggttggg) crashed as per #710, hence their missing files below:

[avalassi@itscrd80 gcc11.2/cvmfs] /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tlau> ls -l ./logs_*/Events/run_01/unweighted_events.lhe
-rw-r--r--. 1 avalassi zg  9059846 Jun 16 17:53 ./logs_ggtt_CPP/Events/run_01/unweighted_events.lhe
-rw-r--r--. 1 avalassi zg  9059847 Jun 16 17:53 ./logs_ggtt_CUDA/Events/run_01/unweighted_events.lhe
-rw-r--r--. 1 avalassi zg  9059850 Jun 16 17:53 ./logs_ggtt_FORTRAN/Events/run_01/unweighted_events.lhe
-rw-r--r--. 1 avalassi zg 10494742 Jun 16 17:53 ./logs_ggttg_CPP/Events/run_01/unweighted_events.lhe
-rw-r--r--. 1 avalassi zg 10494742 Jun 16 17:53 ./logs_ggttg_CUDA/Events/run_01/unweighted_events.lhe
-rw-r--r--. 1 avalassi zg 10494742 Jun 16 17:53 ./logs_ggttg_FORTRAN/Events/run_01/unweighted_events.lhe
-rw-r--r--. 1 avalassi zg 11931512 Jun 16 17:53 ./logs_ggttgg_CPP/Events/run_01/unweighted_events.lhe
-rw-r--r--. 1 avalassi zg 11931402 Jun 16 17:53 ./logs_ggttgg_CUDA/Events/run_01/unweighted_events.lhe
-rw-r--r--. 1 avalassi zg 13366778 Jun 16 17:53 ./logs_ggttggg_CPP/Events/run_01/unweighted_events.lhe
-rw-r--r--. 1 avalassi zg 13366543 Jun 16 17:53 ./logs_ggttggg_CUDA/Events/run_01/unweighted_events.lhe
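As a first step towards the automatic comparison scripts mentioned in the title of this issue, one could compare the number of events produced per backend rather than raw file sizes. A minimal sketch (hypothetical helpers, assuming uncompressed .lhe files):

```python
# Hypothetical helpers to compare event counts across backends
# (CUDA/CPP/FORTRAN) from their unweighted_events.lhe files.
def count_events(path):
    """Count '<event>' opening tags in an (uncompressed) LHE file."""
    n = 0
    with open(path) as f:
        for line in f:
            if line.strip() == "<event>":
                n += 1
    return n

def compare_event_counts(paths):
    """Print the event count per backend and flag any that differ.

    `paths` maps a backend name (e.g. 'FORTRAN') to an LHE file path."""
    counts = {backend: count_events(p) for backend, p in paths.items()}
    ref = next(iter(counts.values()))
    for backend, n in counts.items():
        flag = "" if n == ref else "  <-- differs!"
        print(f"{backend:8s} {n} events{flag}")
```

For example, `compare_event_counts({"CPP": "./logs_ggtt_CPP/Events/run_01/unweighted_events.lhe", "CUDA": "./logs_ggtt_CUDA/Events/run_01/unweighted_events.lhe"})` would flag any backend producing a different number of unweighted events.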


valassi commented Jun 16, 2023

Concerning the physics results:

[avalassi@itscrd80 gcc11.2/cvmfs] /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tlau> for f in `ls -tr ./logs_*/*txt`; do echo $f; egrep '(Cross-section|Luminosity)' $f; done 
./logs_ggtt_CPP/output.txt
INFO: Effective Luminosity 27.237751523611728 pb^-1 
     Cross-section :   440.5 +- 0.3252 pb
./logs_ggtt_CUDA/output.txt
INFO: Effective Luminosity 27.237751523611728 pb^-1 
     Cross-section :   440.5 +- 0.3252 pb
./logs_ggtt_FORTRAN/output.txt
INFO: Effective Luminosity 27.237751523611728 pb^-1 
     Cross-section :   440.5 +- 0.3252 pb
./logs_ggttg_CPP/output.txt
INFO: Effective Luminosity 28.973330329507217 pb^-1 
     Cross-section :   414.2 +- 0.7847 pb
./logs_ggttg_CUDA/output.txt
INFO: Effective Luminosity 28.973330329507217 pb^-1 
     Cross-section :   414.2 +- 0.7847 pb
./logs_ggttg_FORTRAN/output.txt
INFO: Effective Luminosity 28.973330329507217 pb^-1 
     Cross-section :   414.2 +- 0.7847 pb
./logs_ggttgg_CPP/output.txt
INFO: Effective Luminosity 47.57523835538049 pb^-1 
     Cross-section :   252.3 +- 0.3483 pb
./logs_ggttgg_FORTRAN/output.txt
./logs_ggttgg_CUDA/output.txt
INFO: Effective Luminosity 47.60279369991978 pb^-1 
     Cross-section :   252.3 +- 0.3624 pb
./logs_ggttggg_CPP/output.txt
INFO: Effective Luminosity 95.2140713761495 pb^-1 
     Cross-section :   126 +- 0.1757 pb
./logs_ggttggg_FORTRAN/output.txt
./logs_ggttggg_CUDA/output.txt
INFO: Effective Luminosity 95.14995234074645 pb^-1 
     Cross-section :   126.1 +- 0.1732 pb

This is interesting: the results are identical for CUDA, CPP and FORTRAN for ggtt and ggttg.

But for ggttgg and ggttggg there is a tiny difference between CUDA and CPP (why?). And the FORTRAN runs fail as per #710.
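Comparing cross sections across backends could also be automated. A minimal sketch (hypothetical helpers) that parses the 'Cross-section' lines above and checks agreement within the combined statistical uncertainties:

```python
# Hypothetical helpers to extract 'Cross-section : X +- E pb' values from the
# logs and check that two backends agree within combined statistical errors.
import re

XSEC_RE = re.compile(r"Cross-section\s*:\s*([\d.]+)\s*\+-\s*([\d.]+)\s*pb")

def parse_xsec(text):
    """Return (value, error) in pb, or raise if absent (e.g. a crashed run)."""
    m = XSEC_RE.search(text)
    if m is None:
        raise ValueError("no cross-section line found (crashed run?)")
    return float(m.group(1)), float(m.group(2))

def consistent(a, b, n_sigma=3.0):
    """True if two (value, error) pairs agree within n_sigma combined errors."""
    (xa, ea), (xb, eb) = a, b
    return abs(xa - xb) <= n_sigma * (ea**2 + eb**2) ** 0.5

cuda = parse_xsec("Cross-section :   252.3 +- 0.3624 pb")
cpp = parse_xsec("Cross-section :   252.3 +- 0.3483 pb")
print(consistent(cuda, cpp))  # True
```

Note that an agreement within errors would not catch the "identical vs tiny difference" distinction discussed above; for that, a bitwise comparison of the printed values would be needed.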

valassi added a commit to valassi/madgraph4gpu that referenced this issue Jun 18, 2023
…raph5#710 with Olivier's select_color change

The price to pay is the tmad failures in ggttgg madgraph5#655

Add to the git repo the two ggttggg FORTRAN logs that were previously failing

The duration of these tests needs some tuning, the ggttggg take too long madgraph5#711

ls -ltr tlau/logs_ggtt*/*txt
-rw-r--r--. 1 avalassi zg 3590 Jun 17 03:41 tlau/logs_ggtt_CUDA/output.txt
-rw-r--r--. 1 avalassi zg 3588 Jun 17 03:41 tlau/logs_ggtt_FORTRAN/output.txt
-rw-r--r--. 1 avalassi zg 3580 Jun 17 03:42 tlau/logs_ggtt_CPP/output.txt
-rw-r--r--. 1 avalassi zg 3462 Jun 17 03:42 tlau/logs_ggttg_CUDA/output.txt
-rw-r--r--. 1 avalassi zg 3571 Jun 17 03:43 tlau/logs_ggttg_FORTRAN/output.txt
-rw-r--r--. 1 avalassi zg 3515 Jun 17 03:44 tlau/logs_ggttg_CPP/output.txt
-rw-r--r--. 1 avalassi zg 4106 Jun 17 03:46 tlau/logs_ggttgg_CUDA/output.txt
-rw-r--r--. 1 avalassi zg 4425 Jun 17 04:00 tlau/logs_ggttgg_FORTRAN/output.txt
-rw-r--r--. 1 avalassi zg 4349 Jun 17 04:05 tlau/logs_ggttgg_CPP/output.txt
-rw-r--r--. 1 avalassi zg 6766 Jun 17 04:50 tlau/logs_ggttggg_CUDA/output.txt
-rw-r--r--. 1 avalassi zg 7069 Jun 17 20:45 tlau/logs_ggttggg_FORTRAN/output.txt
-rw-r--r--. 1 avalassi zg 6967 Jun 18 01:29 tlau/logs_ggttggg_CPP/output.txt
valassi added a commit to valassi/madgraph4gpu that referenced this issue Jun 18, 2023
…adgraph5#711

Revert "[launch] in lauX.sh go back to 10000 unweighted events..."
This reverts commit 7021bc6.

I realised that unweighted event generation also takes a very long time in these tests.
valassi added a commit to valassi/madgraph4gpu that referenced this issue Jun 18, 2023
…r survey and refine (8192 events, 1 iteration)

Note: cmd.opts['accuracy'] comes from cmd._survey_options where cmd is in madevent_interface.py

valassi commented Jun 18, 2023

In the latest commits I have rerun the tests after fixing #710 (using @oliviermattelaer's select_color patch ... which however reintroduces #655, which will need to be fixed in turn).

The latest timings are as follows

grep ELAPSED `ls -tr tlau/logs_ggtt*/*txt`
tlau/logs_ggtt_CUDA/output.txt:ELAPSED: 24 seconds
tlau/logs_ggtt_FORTRAN/output.txt:ELAPSED: 23 seconds
tlau/logs_ggtt_CPP/output.txt:ELAPSED: 22 seconds
tlau/logs_ggttg_CUDA/output.txt:ELAPSED: 35 seconds
tlau/logs_ggttg_FORTRAN/output.txt:ELAPSED: 49 seconds
tlau/logs_ggttg_CPP/output.txt:ELAPSED: 36 seconds
tlau/logs_ggttgg_CUDA/output.txt:ELAPSED: 116 seconds
tlau/logs_ggttgg_FORTRAN/output.txt:ELAPSED: 857 seconds
tlau/logs_ggttgg_CPP/output.txt:ELAPSED: 280 seconds
tlau/logs_ggttggg_CUDA/output.txt:ELAPSED: 2705 seconds
tlau/logs_ggttggg_FORTRAN/output.txt:ELAPSED: 57322 seconds
tlau/logs_ggttggg_CPP/output.txt:ELAPSED: 17034 seconds

This includes everything, in particular all build overheads. It was run with the default survey/refine/generate settings.

The most interesting speedups, as usual, are for ggttggg - which however I will try to make shorter, as these tests are really very long. In practice:

  • CPP (512y here) is a factor 3.4 faster than FORTRAN overall (17k vs 57k seconds)
  • CUDA is a factor 21 faster than FORTRAN on 4 CPU cores (2.7k vs 57k seconds)... this might indicate a factor of around 80 over a single core, but not necessarily (CUDA also benefits from running alongside 4 cores, as the Fortran overhead is spread out over them, I imagine)
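The speedup factors quoted above follow directly from the ELAPSED lines. A minimal sketch (hypothetical helper, with the ggttggg numbers taken from the grep output above) of how a comparison script could compute them:

```python
# Hypothetical helper computing elapsed-time speedup ratios per backend,
# relative to a baseline backend (FORTRAN by default).
def speedups(elapsed, baseline="FORTRAN"):
    """`elapsed` maps backend name -> elapsed seconds; returns
    baseline/backend ratios rounded to one decimal."""
    ref = elapsed[baseline]
    return {k: round(ref / v, 1) for k, v in elapsed.items()}

# ggttggg ELAPSED values from the grep output above
ggttggg = {"CUDA": 2705, "FORTRAN": 57322, "CPP": 17034}
print(speedups(ggttggg))  # {'CUDA': 21.2, 'FORTRAN': 1.0, 'CPP': 3.4}
```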

The cross sections are very similar but with a few small differences

for f in `ls -tr tlau/logs_*/*txt`; do echo $f; egrep '(Cross-section|Luminosity)' $f; done 
tlau/logs_ggtt_CUDA/output.txt
INFO: Effective Luminosity 27.237751523611728 pb^-1 
     Cross-section :   440.5 +- 0.3252 pb
tlau/logs_ggtt_FORTRAN/output.txt
INFO: Effective Luminosity 27.237751523611728 pb^-1 
     Cross-section :   440.5 +- 0.3252 pb
tlau/logs_ggtt_CPP/output.txt
INFO: Effective Luminosity 27.237751523611728 pb^-1 
     Cross-section :   440.5 +- 0.3252 pb
tlau/logs_ggttg_CUDA/output.txt
INFO: Effective Luminosity 28.973330329507217 pb^-1 
     Cross-section :   414.2 +- 0.7847 pb
tlau/logs_ggttg_FORTRAN/output.txt
INFO: Effective Luminosity 28.97333746486687 pb^-1 
     Cross-section :   414.2 +- 0.7847 pb
tlau/logs_ggttg_CPP/output.txt
INFO: Effective Luminosity 28.973330399461705 pb^-1 
     Cross-section :   414.2 +- 0.7846 pb
tlau/logs_ggttgg_CUDA/output.txt
INFO: Effective Luminosity 47.60279369991978 pb^-1 
     Cross-section :   252.3 +- 0.3624 pb
tlau/logs_ggttgg_FORTRAN/output.txt
INFO: Effective Luminosity 47.5680525374908 pb^-1 
     Cross-section :   252.4 +- 0.3528 pb
tlau/logs_ggttgg_CPP/output.txt
INFO: Effective Luminosity 47.57523835538049 pb^-1 
     Cross-section :   252.3 +- 0.3483 pb
tlau/logs_ggttggg_CUDA/output.txt
INFO: Effective Luminosity 95.14995234074645 pb^-1 
     Cross-section :   126.1 +- 0.1732 pb
tlau/logs_ggttggg_FORTRAN/output.txt
INFO: Effective Luminosity 95.24990591754717 pb^-1 
     Cross-section :   125.9 +- 0.1767 pb
tlau/logs_ggttggg_CPP/output.txt
INFO: Effective Luminosity 95.2140713761495 pb^-1 
     Cross-section :   126 +- 0.1757 pb

I think this is because the numbers of events are not multiples of 2, so effectively CUDA/CPP process a different number of events than scalar FORTRAN. I will try to tune this too.
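The rounding effect described above can be illustrated as follows (a sketch with assumed batch widths for illustration only, not the actual cudacpp values):

```python
# Illustration (hypothetical batch widths): if a backend processes events in
# batches of `width` (SIMD vector length or GPU grid size), a requested event
# count that is not a multiple of `width` is effectively rounded up, so
# different backends see different event counts.
def effective_nevents(requested, width):
    """Round the requested event count up to a multiple of the batch width."""
    return -(-requested // width) * width  # ceiling division

print(effective_nevents(8193, 1))     # scalar FORTRAN: 8193
print(effective_nevents(8193, 4))     # e.g. 4-wide SIMD CPP: 8196
print(effective_nevents(8193, 8192))  # e.g. a GPU grid of 8192: 16384
```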

I also need to check the LHE files including color and helicity...


valassi commented Jul 26, 2023

This remains one of the highest priorities in my opinion. One part of this is being able to configure the use of fewer events in launch, to make tests faster for development (fewer events are clearly a no-go in production, but they are essential for developer tests).

@valassi valassi changed the title Analyse, tune and debug 'launch' tests Analyse, tune and debug 'launch' tests (automatic comparison scripts; use fewer events; etc...) Jul 26, 2023

valassi commented Jun 3, 2024

As discussed in #855 and #852, I remain convinced that making it possible to tune the machinery to run generate_events with reduced precision and fewer events is a priority, to enable QUICK and SYSTEMATIC tests of all processes, all fptype combinations, etc. While reduced precision is not what users will use in production, it is what developers need for unit tests and integration tests. To be discussed...
