
Analyse, tune and debug 'launch' tests (automatic comparison scripts; use fewer events; etc...) #711

Open
valassi opened this issue Jun 16, 2023 · 6 comments

valassi commented Jun 16, 2023

Hi @oliviermattelaer @roiser @hageboeck @zeniheisser @Jooorgen, I have finally run a few systematic 'launch' tests (using the lauX.sh scripts of #683). These effectively run ./bin/generate_events.

No time to analyse the details now, but the files are in WIP MR #709.

There are some icolamp crashes, see #710.

And then there is the analysis of physics results and of timing performance, which I will do here.

First impressions:

  • up to ggttg everything looks ok: same cross sections; timing performance difficult to tell
  • from ggttgg onwards: some crashes, different cross sections, and FORTRAN clearly much slower
@valassi valassi self-assigned this Jun 16, 2023

valassi commented Jun 16, 2023

Looking at timing performance:

[avalassi@itscrd80 gcc11.2/cvmfs] /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tlau> for f in ./logs_*/*txt; do echo $f; egrep '^(Thu|Fri)' $f; done 
./logs_ggtt_CPP/output.txt
Thu Jun 15 07:46:35 CEST 2023
Thu Jun 15 07:46:56 CEST 2023
./logs_ggtt_CUDA/output.txt
Thu Jun 15 07:45:46 CEST 2023
Thu Jun 15 07:46:10 CEST 2023
./logs_ggtt_FORTRAN/output.txt
Thu Jun 15 07:46:11 CEST 2023
Thu Jun 15 07:46:34 CEST 2023
./logs_ggttg_CPP/output.txt
Thu Jun 15 07:48:08 CEST 2023
Thu Jun 15 07:48:42 CEST 2023
./logs_ggttg_CUDA/output.txt
Thu Jun 15 07:46:58 CEST 2023
Thu Jun 15 07:47:32 CEST 2023
./logs_ggttg_FORTRAN/output.txt
Thu Jun 15 07:47:33 CEST 2023
Thu Jun 15 07:48:07 CEST 2023
./logs_ggttgg_CPP/output.txt
Thu Jun 15 08:02:42 CEST 2023
Thu Jun 15 08:07:21 CEST 2023
./logs_ggttgg_CUDA/output.txt
Thu Jun 15 07:48:43 CEST 2023
Thu Jun 15 07:50:39 CEST 2023
./logs_ggttgg_FORTRAN/output.txt
Thu Jun 15 07:50:40 CEST 2023
Thu Jun 15 08:02:41 CEST 2023
./logs_ggttggg_CPP/output.txt
Thu Jun 15 21:10:27 CEST 2023
Fri Jun 16 01:52:35 CEST 2023
./logs_ggttggg_CUDA/output.txt
Thu Jun 15 08:07:22 CEST 2023
Thu Jun 15 08:52:27 CEST 2023
./logs_ggttggg_FORTRAN/output.txt
Thu Jun 15 08:52:28 CEST 2023
Thu Jun 15 21:10:26 CEST 2023

Focusing on ggttgg and ggttggg, even though these are the processes where Fortran crashes (#710):

For ggttgg

  • CPP 5m21
  • CUDA 1m56
  • FORTRAN 12m01

For ggttggg

  • CPP 4h41m
  • CUDA 45m
  • FORTRAN 12h18m

So overall the speedup over FORTRAN is 2x to 3x for CPP and around 6x to 15x for CUDA, which is not bad.

Note that here the CUDA speedup is measured with respect to all 4 cores on the CPU, not a single core.
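The start/end timestamps above could be turned into elapsed times automatically. A minimal sketch (a hypothetical helper, not one of the existing lauX.sh scripts) that parses two `date`-style lines from the logs and returns the difference in seconds:

```python
# Hypothetical helper to compute elapsed wall-clock time of a run from the
# start/end 'date' lines printed in the output logs (assumes both timestamps
# use the same timezone and an English locale).
from datetime import datetime

def elapsed_seconds(start_line: str, end_line: str) -> float:
    """Parse two lines like 'Thu Jun 15 07:46:35 CEST 2023' and return
    the difference in seconds."""
    fmt = "%a %b %d %H:%M:%S %Y"
    def parse(line: str) -> datetime:
        parts = line.split()
        # Drop the timezone field (index 4): %Z parsing is unreliable.
        return datetime.strptime(" ".join(parts[:4] + parts[5:]), fmt)
    return (parse(end_line) - parse(start_line)).total_seconds()

# Example with the ggttgg FORTRAN timestamps above: 07:50:40 -> 08:02:41
print(elapsed_seconds("Thu Jun 15 07:50:40 CEST 2023",
                      "Thu Jun 15 08:02:41 CEST 2023"))  # 721.0 (i.e. 12m01)
```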


valassi commented Jun 16, 2023

Looking at the events in the LHE files - note that the two FORTRAN runs (ggttgg and ggttggg) crashed as per #710, hence their missing files below:

[avalassi@itscrd80 gcc11.2/cvmfs] /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tlau> ls -l ./logs_*/Events/run_01/unweighted_events.lhe
-rw-r--r--. 1 avalassi zg  9059846 Jun 16 17:53 ./logs_ggtt_CPP/Events/run_01/unweighted_events.lhe
-rw-r--r--. 1 avalassi zg  9059847 Jun 16 17:53 ./logs_ggtt_CUDA/Events/run_01/unweighted_events.lhe
-rw-r--r--. 1 avalassi zg  9059850 Jun 16 17:53 ./logs_ggtt_FORTRAN/Events/run_01/unweighted_events.lhe
-rw-r--r--. 1 avalassi zg 10494742 Jun 16 17:53 ./logs_ggttg_CPP/Events/run_01/unweighted_events.lhe
-rw-r--r--. 1 avalassi zg 10494742 Jun 16 17:53 ./logs_ggttg_CUDA/Events/run_01/unweighted_events.lhe
-rw-r--r--. 1 avalassi zg 10494742 Jun 16 17:53 ./logs_ggttg_FORTRAN/Events/run_01/unweighted_events.lhe
-rw-r--r--. 1 avalassi zg 11931512 Jun 16 17:53 ./logs_ggttgg_CPP/Events/run_01/unweighted_events.lhe
-rw-r--r--. 1 avalassi zg 11931402 Jun 16 17:53 ./logs_ggttgg_CUDA/Events/run_01/unweighted_events.lhe
-rw-r--r--. 1 avalassi zg 13366778 Jun 16 17:53 ./logs_ggttggg_CPP/Events/run_01/unweighted_events.lhe
-rw-r--r--. 1 avalassi zg 13366543 Jun 16 17:53 ./logs_ggttggg_CUDA/Events/run_01/unweighted_events.lhe
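As a first step towards the automatic comparison scripts mentioned in the title of this issue, one could compare the number of events produced per backend rather than raw file sizes. A minimal sketch (hypothetical helpers, assuming uncompressed .lhe files):

```python
# Hypothetical helpers to compare event counts across backends
# (CUDA/CPP/FORTRAN) from their unweighted_events.lhe files.
def count_events(path):
    """Count '<event>' opening tags in an (uncompressed) LHE file."""
    n = 0
    with open(path) as f:
        for line in f:
            if line.strip() == "<event>":
                n += 1
    return n

def compare_event_counts(paths):
    """Print the event count per backend and flag any that differ.

    `paths` maps a backend name (e.g. 'FORTRAN') to an LHE file path."""
    counts = {backend: count_events(p) for backend, p in paths.items()}
    ref = next(iter(counts.values()))
    for backend, n in counts.items():
        flag = "" if n == ref else "  <-- differs!"
        print(f"{backend:8s} {n} events{flag}")
```

For example, `compare_event_counts({"CPP": "./logs_ggtt_CPP/Events/run_01/unweighted_events.lhe", "CUDA": "./logs_ggtt_CUDA/Events/run_01/unweighted_events.lhe"})` would flag any backend producing a different number of unweighted events.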


valassi commented Jun 16, 2023

Concerning the physics results:

[avalassi@itscrd80 gcc11.2/cvmfs] /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tlau> for f in `ls -tr ./logs_*/*txt`; do echo $f; egrep '(Cross-section|Luminosity)' $f; done 
./logs_ggtt_CPP/output.txt
INFO: Effective Luminosity 27.237751523611728 pb^-1 
     Cross-section :   440.5 +- 0.3252 pb
./logs_ggtt_CUDA/output.txt
INFO: Effective Luminosity 27.237751523611728 pb^-1 
     Cross-section :   440.5 +- 0.3252 pb
./logs_ggtt_FORTRAN/output.txt
INFO: Effective Luminosity 27.237751523611728 pb^-1 
     Cross-section :   440.5 +- 0.3252 pb
./logs_ggttg_CPP/output.txt
INFO: Effective Luminosity 28.973330329507217 pb^-1 
     Cross-section :   414.2 +- 0.7847 pb
./logs_ggttg_CUDA/output.txt
INFO: Effective Luminosity 28.973330329507217 pb^-1 
     Cross-section :   414.2 +- 0.7847 pb
./logs_ggttg_FORTRAN/output.txt
INFO: Effective Luminosity 28.973330329507217 pb^-1 
     Cross-section :   414.2 +- 0.7847 pb
./logs_ggttgg_CPP/output.txt
INFO: Effective Luminosity 47.57523835538049 pb^-1 
     Cross-section :   252.3 +- 0.3483 pb
./logs_ggttgg_FORTRAN/output.txt
./logs_ggttgg_CUDA/output.txt
INFO: Effective Luminosity 47.60279369991978 pb^-1 
     Cross-section :   252.3 +- 0.3624 pb
./logs_ggttggg_CPP/output.txt
INFO: Effective Luminosity 95.2140713761495 pb^-1 
     Cross-section :   126 +- 0.1757 pb
./logs_ggttggg_FORTRAN/output.txt
./logs_ggttggg_CUDA/output.txt
INFO: Effective Luminosity 95.14995234074645 pb^-1 
     Cross-section :   126.1 +- 0.1732 pb

This is interesting: the results are identical for CUDA, CPP and FORTRAN for ggtt and ggttg.

But for ggttgg and ggttggg there is a tiny difference between CUDA and CPP (why?). And the FORTRAN runs fail as per #710.
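Comparing cross sections across backends could also be automated. A minimal sketch (hypothetical helpers) that parses the 'Cross-section' lines above and checks agreement within the combined statistical uncertainties:

```python
# Hypothetical helpers to extract 'Cross-section : X +- E pb' values from the
# logs and check that two backends agree within combined statistical errors.
import re

XSEC_RE = re.compile(r"Cross-section\s*:\s*([\d.]+)\s*\+-\s*([\d.]+)\s*pb")

def parse_xsec(text):
    """Return (value, error) in pb, or raise if absent (e.g. a crashed run)."""
    m = XSEC_RE.search(text)
    if m is None:
        raise ValueError("no cross-section line found (crashed run?)")
    return float(m.group(1)), float(m.group(2))

def consistent(a, b, n_sigma=3.0):
    """True if two (value, error) pairs agree within n_sigma combined errors."""
    (xa, ea), (xb, eb) = a, b
    return abs(xa - xb) <= n_sigma * (ea**2 + eb**2) ** 0.5

cuda = parse_xsec("Cross-section :   252.3 +- 0.3624 pb")
cpp = parse_xsec("Cross-section :   252.3 +- 0.3483 pb")
print(consistent(cuda, cpp))  # True
```

Note that an agreement within errors would not catch the "identical vs tiny difference" distinction discussed above; for that, a bitwise comparison of the printed values would be needed.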

valassi added a commit to valassi/madgraph4gpu that referenced this issue Jun 18, 2023
…raph5#710 with Olivier's select_color change

The price to pay is the tmad failures in ggttgg madgraph5#655

Add to the git repo the two ggttggg FORTRAN logs that were previously failing

The duration of these tests needs some tuning, the ggttggg take too long madgraph5#711

ls -ltr tlau/logs_ggtt*/*txt
-rw-r--r--. 1 avalassi zg 3590 Jun 17 03:41 tlau/logs_ggtt_CUDA/output.txt
-rw-r--r--. 1 avalassi zg 3588 Jun 17 03:41 tlau/logs_ggtt_FORTRAN/output.txt
-rw-r--r--. 1 avalassi zg 3580 Jun 17 03:42 tlau/logs_ggtt_CPP/output.txt
-rw-r--r--. 1 avalassi zg 3462 Jun 17 03:42 tlau/logs_ggttg_CUDA/output.txt
-rw-r--r--. 1 avalassi zg 3571 Jun 17 03:43 tlau/logs_ggttg_FORTRAN/output.txt
-rw-r--r--. 1 avalassi zg 3515 Jun 17 03:44 tlau/logs_ggttg_CPP/output.txt
-rw-r--r--. 1 avalassi zg 4106 Jun 17 03:46 tlau/logs_ggttgg_CUDA/output.txt
-rw-r--r--. 1 avalassi zg 4425 Jun 17 04:00 tlau/logs_ggttgg_FORTRAN/output.txt
-rw-r--r--. 1 avalassi zg 4349 Jun 17 04:05 tlau/logs_ggttgg_CPP/output.txt
-rw-r--r--. 1 avalassi zg 6766 Jun 17 04:50 tlau/logs_ggttggg_CUDA/output.txt
-rw-r--r--. 1 avalassi zg 7069 Jun 17 20:45 tlau/logs_ggttggg_FORTRAN/output.txt
-rw-r--r--. 1 avalassi zg 6967 Jun 18 01:29 tlau/logs_ggttggg_CPP/output.txt
valassi added a commit to valassi/madgraph4gpu that referenced this issue Jun 18, 2023
…adgraph5#711

Revert "[launch] in lauX.sh go back to 10000 unweighted events..."
This reverts commit 7021bc6.

I realised that unweighted event generation also takes a very long time in these tests.
valassi added a commit to valassi/madgraph4gpu that referenced this issue Jun 18, 2023
…r survey and refine (8192 events, 1 iteration)

Note: cmd.opts['accuracy'] comes from cmd._survey_options where cmd is in madevent_interface.py

valassi commented Jun 18, 2023

In the latest commits I have rerun the tests after fixing #710 (using @oliviermattelaer's select_color patch ... which however reintroduces #655, which will need to be fixed in turn).

The latest timings are as follows

grep ELAPSED `ls -tr tlau/logs_ggtt*/*txt`
tlau/logs_ggtt_CUDA/output.txt:ELAPSED: 24 seconds
tlau/logs_ggtt_FORTRAN/output.txt:ELAPSED: 23 seconds
tlau/logs_ggtt_CPP/output.txt:ELAPSED: 22 seconds
tlau/logs_ggttg_CUDA/output.txt:ELAPSED: 35 seconds
tlau/logs_ggttg_FORTRAN/output.txt:ELAPSED: 49 seconds
tlau/logs_ggttg_CPP/output.txt:ELAPSED: 36 seconds
tlau/logs_ggttgg_CUDA/output.txt:ELAPSED: 116 seconds
tlau/logs_ggttgg_FORTRAN/output.txt:ELAPSED: 857 seconds
tlau/logs_ggttgg_CPP/output.txt:ELAPSED: 280 seconds
tlau/logs_ggttggg_CUDA/output.txt:ELAPSED: 2705 seconds
tlau/logs_ggttggg_FORTRAN/output.txt:ELAPSED: 57322 seconds
tlau/logs_ggttggg_CPP/output.txt:ELAPSED: 17034 seconds

This includes everything, in particular all build overheads. It was run with the default survey/refine/generate settings.

The most interesting speedups, as usual, are for ggttggg - which however I will try to make shorter, as these tests are really very long. In practice:

  • CPP (512y here) is a factor 3.4 faster than FORTRAN overall (17k vs 57k seconds)
  • CUDA is a factor 21 faster than FORTRAN on 4 CPU cores (2.7k vs 57k seconds)... this might indicate a factor of around 80 over a single core, but not necessarily (CUDA also benefits from running alongside 4 cores, as the Fortran overhead is spread out over them, I imagine)
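The speedup factors quoted above follow directly from the ELAPSED lines. A minimal sketch (hypothetical helper, with the ggttggg numbers taken from the grep output above) of how a comparison script could compute them:

```python
# Hypothetical helper computing elapsed-time speedup ratios per backend,
# relative to a baseline backend (FORTRAN by default).
def speedups(elapsed, baseline="FORTRAN"):
    """`elapsed` maps backend name -> elapsed seconds; returns
    baseline/backend ratios rounded to one decimal."""
    ref = elapsed[baseline]
    return {k: round(ref / v, 1) for k, v in elapsed.items()}

# ggttggg ELAPSED values from the grep output above
ggttggg = {"CUDA": 2705, "FORTRAN": 57322, "CPP": 17034}
print(speedups(ggttggg))  # {'CUDA': 21.2, 'FORTRAN': 1.0, 'CPP': 3.4}
```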

The cross sections are very similar but with a few small differences

for f in `ls -tr tlau/logs_*/*txt`; do echo $f; egrep '(Cross-section|Luminosity)' $f; done 
tlau/logs_ggtt_CUDA/output.txt
INFO: Effective Luminosity 27.237751523611728 pb^-1 
     Cross-section :   440.5 +- 0.3252 pb
tlau/logs_ggtt_FORTRAN/output.txt
INFO: Effective Luminosity 27.237751523611728 pb^-1 
     Cross-section :   440.5 +- 0.3252 pb
tlau/logs_ggtt_CPP/output.txt
INFO: Effective Luminosity 27.237751523611728 pb^-1 
     Cross-section :   440.5 +- 0.3252 pb
tlau/logs_ggttg_CUDA/output.txt
INFO: Effective Luminosity 28.973330329507217 pb^-1 
     Cross-section :   414.2 +- 0.7847 pb
tlau/logs_ggttg_FORTRAN/output.txt
INFO: Effective Luminosity 28.97333746486687 pb^-1 
     Cross-section :   414.2 +- 0.7847 pb
tlau/logs_ggttg_CPP/output.txt
INFO: Effective Luminosity 28.973330399461705 pb^-1 
     Cross-section :   414.2 +- 0.7846 pb
tlau/logs_ggttgg_CUDA/output.txt
INFO: Effective Luminosity 47.60279369991978 pb^-1 
     Cross-section :   252.3 +- 0.3624 pb
tlau/logs_ggttgg_FORTRAN/output.txt
INFO: Effective Luminosity 47.5680525374908 pb^-1 
     Cross-section :   252.4 +- 0.3528 pb
tlau/logs_ggttgg_CPP/output.txt
INFO: Effective Luminosity 47.57523835538049 pb^-1 
     Cross-section :   252.3 +- 0.3483 pb
tlau/logs_ggttggg_CUDA/output.txt
INFO: Effective Luminosity 95.14995234074645 pb^-1 
     Cross-section :   126.1 +- 0.1732 pb
tlau/logs_ggttggg_FORTRAN/output.txt
INFO: Effective Luminosity 95.24990591754717 pb^-1 
     Cross-section :   125.9 +- 0.1767 pb
tlau/logs_ggttggg_CPP/output.txt
INFO: Effective Luminosity 95.2140713761495 pb^-1 
     Cross-section :   126 +- 0.1757 pb

I think this is because the numbers of events are not multiples of 2, so effectively CUDA/CPP process a different number of events than scalar FORTRAN. I will try to tune this too.
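The rounding effect described above can be illustrated as follows (a sketch with assumed batch widths for illustration only, not the actual cudacpp values):

```python
# Illustration (hypothetical batch widths): if a backend processes events in
# batches of `width` (SIMD vector length or GPU grid size), a requested event
# count that is not a multiple of `width` is effectively rounded up, so
# different backends see different event counts.
def effective_nevents(requested, width):
    """Round the requested event count up to a multiple of the batch width."""
    return -(-requested // width) * width  # ceiling division

print(effective_nevents(8193, 1))     # scalar FORTRAN: 8193
print(effective_nevents(8193, 4))     # e.g. 4-wide SIMD CPP: 8196
print(effective_nevents(8193, 8192))  # e.g. a GPU grid of 8192: 16384
```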

I also need to check the LHE files including color and helicity...


valassi commented Jul 26, 2023

This remains one of the highest priorities in my opinion. One part of this is being able to configure the use of fewer events in launch, to make tests faster for development (fewer events are clearly a no-go in production, but they are essential for developer tests).

@valassi valassi changed the title Analyse, tune and debug 'launch' tests Analyse, tune and debug 'launch' tests (automatic comparison scripts; use fewer events; etc...) Jul 26, 2023

valassi commented Jun 3, 2024

As discussed in #855 and #852, I remain convinced that making it possible to tune the machinery to run generate_events with reduced precision and fewer events is a priority, to enable QUICK and SYSTEMATIC tests of all processes, all fptype combinations, etc. While reduced precision is not what users will use in production, it is what developers need for unit tests and integration tests. To be discussed...
