# Unexpected memory usage when scaling up via `num_generation_jobs` (#8)
Hi @khurtado, thanks for the debugging efforts 😄

In principle, that seems like an undesired behaviour. I said "in principle" because I do not fully understand how Pythia and Delphes make use of the scaling factor (a.k.a. the number of jobs the workflow needs * N) internally. In theory, each parallel job computes a set of values for a given benchmark. If you could confirm that this internal parallelization of benchmark-based computed values makes sense, and that it is done correctly, then we could start debugging the memory consumption of each job.

As an initial hint, I always found this particular code snippet a bit funny. Bear in mind it is a rewrite of its older version, which generated a similar list. Maybe @irinaespejo knows where this code snippet comes from.
Hi @khurtado, thanks for the update. I think @Sinclert's intuition is right: we need to investigate how to parallelize the jobs that share the same benchmark within a single Pythia+Delphes step (so 6 jobs in total) instead of calling Pythia+Delphes 6 * n_jobs times. I'm looking into the snippet. Luckily, Delphes is on GitHub (delphes/delphes) and so is Pythia (alisw/pythia8), so we can ask the developer teams.
Hi @khurtado, is there a way we can access the cluster you are using, for debugging purposes? Thank you!
@irinaespejo Yes, let's discuss via Slack.
Hi all, @Sinclert and I discussed a solution offline and I'll write it here for the record.

**Proposed solution**

The problem behind this issue is that the madminer-workflow, particularly its Pythia and Delphes steps, does not scale well. Instead, we propose a subtle change in the architecture of the workflow: the number of arrows (jobs) leaving the generate step will be the number of benchmarks (6) rather than 6 * n_jobs, with the per-benchmark parallelization handled inside each Pythia+Delphes job (a sketch of the two fan-out strategies follows this comment).

Changes to make (please do not hesitate to update the to-do list in the comments below):

**Unresolved questions about the proposed solution**
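To make the fan-out change above concrete, here is a minimal sketch (plain bash, with made-up numbers) of how the two strategies differ in the number of Pythia+Delphes containers they spawn. The variable names and values are illustrative only and are not taken from the workflow code:

```bash
#!/usr/bin/env bash
# Illustrative numbers only: 6 benchmarks, num_generation_jobs = 10.
N_BENCHMARKS=6
NUM_GENERATION_JOBS=10

# Current strategy: one Pythia+Delphes container per (benchmark, job) pair.
echo "current:  $(( N_BENCHMARKS * NUM_GENERATION_JOBS )) containers"

# Proposed strategy: one container per benchmark, with the extra parallelism
# handled internally (e.g. MadGraph's multi-core mode) inside each container.
echo "proposed: ${N_BENCHMARKS} containers, each using ${NUM_GENERATION_JOBS} internal processes"
```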
This makes sense and sounds good to me! Please let me know once the changes are done and I will be happy to test (or if I can help with anything besides testing).
After a bit of research, it seems that MadGraph (the pseudo-engine used to run Pythia and Delphes) has an optional argument that could be used to specify:

Sadly, I could not find an official reference to this argument, so I am not sure whether the accepted values have changed in modern versions of MadGraph.
Wow, that's interesting. Maybe just assigning that argument would be enough.
@Sinclert the options for that argument are documented here: https://bazaar.launchpad.net/~madteam/mg5amcnlo/3.x/view/head:/Template/LO/README#L80
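Just to sketch how this could look in practice: the two `me5_configuration.txt` options I believe that README points at are `run_mode` (where `2` selects multi-core running) and `nb_core`. The option names, default lines and values below are my assumptions and should be double-checked against the README and the MadGraph version we ship:

```bash
#!/usr/bin/env bash
# Sketch only: switch a generated MadGraph run card to multi-core mode.
# The path, option names and values are assumptions, not taken from the workflow code.
CONFIG="madminer/cards/me5_configuration_0.txt"

sed -i \
    -e "s/^#\{0,1\} *run_mode *=.*/run_mode = 2/" \
    -e "s/^#\{0,1\} *nb_core *=.*/nb_core = 4/" \
    "${CONFIG}"
```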
I have created a new branch, mg_process_parallelization, to implement the changes we discussed. In principle, the Docker image built from that branch already includes them. In a nutshell:

Bear in mind that the number of processes per benchmark currently uses a default value. Let me know if fine-tuning the number of processes per benchmark is something of interest. Please run the sub-workflow with the new Docker image and let me know how it goes.
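In case it helps when testing, pulling the branch image would look roughly like this. The tag is an assumption on my side, since it depends on how the CI publishes branch builds, so please check Docker Hub for the actual one:

```bash
# Hypothetical tag: verify the exact tag pushed for the mg_process_parallelization branch.
docker pull madminertool/madminer-workflow-ph:mg_process_parallelization
```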
@Sinclert wow, nice! I was also working on this without success.

Regarding point 3: I uncommented it, and I think the easiest way to check whether we are really running on multiple cores is to run the branch on the cluster and look at the resource usage.
Actually, since I have access to the cluster, I'm going to run the branch.
Hi everyone,

The results from running the mg_process_parallelization branch are the following.

Sanity checks:
Other checks:

The command
Now, scalability tests? Answering Sinclert's question: yes, we are interested in fine-tuning the number of processes per benchmark.
Memory usage results from running the branch (screenshots of an example Delphes job and an example Pythia job).
Hi @Sinclert, I've been testing the mg-process-parallelization branch on the cluster and ran into an error on one of the files.
This was solved by making the following changes:
The workflow finishes successfully now without any further errors. |
Hi @irinaespejo, I included those changes. According to this StackOverflow post, we could achieve this by using `sed`:

```bash
sed -i \
    -e "s/${default_spec}/${custom_spec}/" \
    "${SIGNAL_ABS_PATH}/madminer/cards/me5_configuration_${i}.txt"
```
I just tested the snippet you posted and it runs successfully ✔️ (my upload internet connection is pretty slow) |
The PR changing the parallelization strategy (#11) has been merged. We should be in a better spot to test the total time + memory consumption of each benchmark job. |
Hi @khurtado and @irinaespejo, Is there anything else to discuss within this issue? Have you tried the latest version of the workflow? |
The latest version of the workflow ran successfully after Kenyi fixed some of the cluster permissions.
@irinaespejo Yes, the cluster should have workers to work with. I still need to fix the website certs; I will do that tomorrow.
Hi, I am closing this issue for now. For future reports of performance issues, configuration tweaks, etc., please open a separate issue.
---

**Original issue description:**

Hello,

This was discussed via Slack at some point, so I just wanted to open an issue so it is not forgotten.

When scaling up a workflow via `num_generation_jobs`, the number of jobs in the physics stage increases properly, but the memory usage per job also increases considerably. E.g.: if `num_generation_jobs` is increased by a factor of 10 (from `6` to `60`), the memory usage per Delphes job goes from ~700 MB to ~7 GB.

For example, with `num_generation_jobs: 60`:

```
122 7325 /reana/users/00000000-0000-0000-0000-000000000000/workflows/8aa9df1b-168b-4622-981e-01be73344b90 madminertool/madminer-workflow-ph:0.3.0 sh -c '/madminer/scripts/4_delphes.sh -p /madminer -m software/MG5_aMC_v2_9_3 -c /reana/users/00000000-0000-0000-0000-000000000000/workflows/8aa9df1b-168b-4622-981e-01be73344b90/workflow_ph/configure/data/madminer_config.h5 -i /reana/users/00000000-0000-0000-0000-000000000000/workflows/8aa9df1b-168b-4622-981e-01be73344b90/ph/input.yml -e /reana/users/00000000-0000-0000-0000-000000000000/workflows/8aa9df1b-168b-4622-981e-01be73344b90/workflow_ph/pythia_33/events/Events.tar.gz -o /reana/users/00000000-0000-0000-0000-000000000000/workflows/8aa9df1b-168b-4622-981e-01be73344b90/workflow_ph/delphes_33'
```