
Unexpected memory usage when scaling up via num_generation_jobs #8

Closed

khurtado opened this issue Jul 27, 2021 · 22 comments

@khurtado

Hello,

This was discussed via slack at some point, so I just wanted to open an issue so this is not forgotten.
When scaling up a workflow via num_generation_jobs, the number of jobs in the physics stage increases properly, but the memory usage per job also increases considerably.

For example, if num_generation_jobs is increased by a factor of 10 (from 6 to 60), the memory usage per Delphes job goes from ~700 MB to ~7 GB:

num_generation_jobs: 6

```
12   733   /reana/users/00000000-0000-0000-0000-000000000000/workflows/653743d7-6878-4e6a-b991-72529e19aeed madminertool/madminer-workflow-ph:0.3.0 sh -c '/madminer/scripts/4_delphes.sh -p /madminer -m software/MG5_aMC_v2_9_3 -c /reana/users/00000000-0000-0000-0000-000000000000/workflows/653743d7-6878-4e6a-b991-72529e19aeed/workflow_ph/configure/data/madminer_config.h5 -i /reana/users/00000000-0000-0000-0000-000000000000/workflows/653743d7-6878-4e6a-b991-72529e19aeed/ph/input.yml -e /reana/users/00000000-0000-0000-0000-000000000000/workflows/653743d7-6878-4e6a-b991-72529e19aeed/workflow_ph/pythia_0/events/Events.tar.gz -o /reana/users/00000000-0000-0000-0000-000000000000/workflows/653743d7-6878-4e6a-b991-72529e19aeed/workflow_ph/delphes_0'
```

num_generation_jobs: 60

```
122  7325  /reana/users/00000000-0000-0000-0000-000000000000/workflows/8aa9df1b-168b-4622-981e-01be73344b90 madminertool/madminer-workflow-ph:0.3.0 sh -c '/madminer/scripts/4_delphes.sh -p /madminer -m software/MG5_aMC_v2_9_3 -c /reana/users/00000000-0000-0000-0000-000000000000/workflows/8aa9df1b-168b-4622-981e-01be73344b90/workflow_ph/configure/data/madminer_config.h5 -i /reana/users/00000000-0000-0000-0000-000000000000/workflows/8aa9df1b-168b-4622-981e-01be73344b90/ph/input.yml -e /reana/users/00000000-0000-0000-0000-000000000000/workflows/8aa9df1b-168b-4622-981e-01be73344b90/workflow_ph/pythia_33/events/Events.tar.gz -o /reana/users/00000000-0000-0000-0000-000000000000/workflows/8aa9df1b-168b-4622-981e-01be73344b90/workflow_ph/delphes_33'
```
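These columns match HTCondor job attributes (they are labelled ClusterId, MemoryUsage, Args further down in this thread, with MemoryUsage in MB). As an assumption about how such numbers can be collected, not necessarily the exact command used here, a query of this shape would produce that output:

```sh
# Illustrative HTCondor query: one line per job with its cluster id,
# current memory usage (MB) and the job arguments
condor_q -autoformat ClusterId MemoryUsage Args
```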


This puts a limit on how far the madminer workflow can be scaled up, since the scaling factor becomes tied to the memory available per worker in the cluster we are submitting to.

Is this behaviour understood, or does it need some investigation?
Does it require a fix?
@Sinclert
Member

Hi @khurtado ,

Thanks for the debugging efforts 😄

In principle, that seems like undesired behaviour. I say "in principle" because I do not fully understand how Pythia and Delphes make use of the scaling factor (i.e. the number of jobs the workflow needs × N) internally.

In theory, each parallel job computes a set of values for a given benchmark (sm, w, ...), so it makes sense to compute them in parallel. In this scenario, increasing the number of jobs so that there is more than one job per benchmark is a way to parallelize the computation of every single benchmark on its own. I am unsure whether Pythia / Delphes are prepared to handle this, and if so, how it is done.


If you could confirm that this internal parallelization of benchmark-based computed values makes sense, and that it is done correctly, then we could start debugging the memory consumption of each job.

As an initial hint, I always found this particular code snippet a bit funny. Bear in mind it is a rewrite of its older version, which generated a similar list. Maybe @irinaespejo knows where this code snippet comes from.

@irinaespejo
Member

irinaespejo commented Aug 2, 2021

Hi @khurtado,

Thanks for the update. I think @Sinclert's intuition is right. We need to investigate how to parallelize the jobs that have the same benchmark within a Pythia+Delphes step (so 6 times) instead of calling Pythia+Delphes 6*n_jobs times. I'm looking into the snippet. Luckily, Delphes is on GitHub (delphes/delphes) and so is Pythia (alisw/pythia8), so we can ask the developer teams.

@irinaespejo
Member

Hi @khurtado, is there a way we can access the cluster you are using, for debugging purposes? Thank you!

@khurtado
Author

khurtado commented Aug 3, 2021

@irinaespejo Yes, let's discuss via slack

@irinaespejo
Member

Hi all,

@Sinclert and I discussed a solution offline and I'll write it here for the record:

Proposed solution

The problem behind this issue is that the madminer-workflow, particularly its Pythia and Delphes steps, does not scale well.
Right now, we control the number of jobs with an external parameter called num_generation_jobs (here), i.e. the number of arrows (or jobs) leaving the generate step in the current architecture is num_generation_jobs. Each arrow leaving the generate step performs computations according to the distribution of the benchmarks, which is controlled by this snippet. This means a Pythia and a Delphes instance are called num_generation_jobs times, which could be a cause of the bad scalability.

Instead, we propose a subtle change in the architecture of the workflow. The number of arrows (jobs) leaving the generate step will be num_benchmarks and not num_generation_jobs. Then each arrow will pass num_jobs to the Pythia and Delphes step. We hope that Delphes and Pythia know how to internally parallelize a big chunk of jobs. Maybe @khurtado can comment on this Delphes/Pythia internal parallelization.

The num_benchmarks depends on the user-specified benchmarks here and on the morphing max_overall_power.
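A rough sketch of the proposed fan-out, purely illustrative: the benchmark names below are the ones that appear in the generate logs later in this thread, and num_jobs_per_benchmark is a hypothetical knob, not an existing workflow parameter.

```sh
# Illustrative only: one generation arrow per benchmark instead of
# num_generation_jobs arrows, each arrow carrying a per-benchmark job count.
num_jobs_per_benchmark=10   # hypothetical parameter

for benchmark in sm w morphing_basis_vector_2 morphing_basis_vector_3 \
                 morphing_basis_vector_4 morphing_basis_vector_5; do
    echo "generate arrow -> benchmark=${benchmark}, num_jobs=${num_jobs_per_benchmark}"
    # downstream: a single Pythia+Delphes step per benchmark, which would
    # parallelize the num_jobs internally
done
```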

Changes to make:

  • In 2_generate.sh, change num_generation_jobs to num_benchmarks
  • In generate.py, remove the snippet and replace it with a distribution of jobs for each arrow (benchmark). How to do this is still unclear.

(please do not hesitate to update the to-do list in the comments below)

Non-solved questions about the proposed solution

  • Crank up scalability using events instead of jobs

@khurtado
Author

This makes sense and sounds good to me!
I don't know much about the internal parallelization details on Delphes/Pythia unfortunately, so I can't comment on that.

Please let me know once the changes are done and I will be happy to test (or help with anything besides testing).

@Sinclert
Member

After a bit of research, it seems that MadGraph (the pseudo-engine used to run Pythia and Delphes) has an optional argument called run_mode (MadGraph forum comment).

This could be used to specify:

  • 👎🏻 run_mode=0: single core (no parallelization).
  • 👎🏻 run_mode=1: cluster mode (not useful, as we are relying on REANA to deal with back-ends).
  • run_mode=2: multi-core (process-based parallelization).

Sadly, I could not find an official reference to this argument, so I am not sure whether the accepted values have changed in modern versions of MadGraph (2.9.X and 3.X.X). In any case, this would be the "last piece" to migrate:

  • From: split num_jobs among M benchmarks.
  • To: assign num_jobs to each of the M benchmarks.
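For reference, these settings live in MadGraph's me5_configuration.txt card. The excerpt below is only an illustration, using the values adopted later in this thread:

```
# me5_configuration.txt (illustrative excerpt)
run_mode = 2     # 0 = single core, 1 = cluster, 2 = multi-core
nb_core = None   # None lets MadGraph use all detected cores
```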

@irinaespejo
Member

Wow, that's interesting. Maybe just setting run_mode=2 with the current architecture is enough to scale. I'll try it and get back to you tomorrow.

@khurtado
Author

@Sinclert the options for run_mode seem to be the same in modern versions of MadGraph:

https://bazaar.launchpad.net/~madteam/mg5amcnlo/3.x/view/head:/Template/LO/README#L80

@Sinclert
Member

Sinclert commented Sep 1, 2021

@khurtado @irinaespejo

I have created a new branch, mg_process_parallelization, to implement the changes we discussed. In principle, the Docker image coming from that branch (madminer-workflow-ph:0.5.0-test) should be able to parallelize the MadGraph steps of each benchmark.

In a nutshell:

Bear in mind that the num_generation_jobs workflow-level parameter has not been removed, but it is currently unused, as we are setting the number of parallel processes per benchmark to the maximum possible (using nb_core=None).

Let me know if fine-tuning the number of processes per benchmark is something of interest.
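For example, fine-tuning could amount to overriding nb_core in the generated card before launching. This is illustrative only: the card path is the one shown in later comments, and the value 4 is arbitrary.

```sh
# Hypothetical override: pin each benchmark job to 4 cores
# instead of letting MadGraph use all detected cores
sed -i -e "s/nb_core = None/nb_core = 4/" \
    mg_processes/signal/Cards/me5_configuration.txt
```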


Please, run the sub-workflow with the new Docker image (0.5.0-test), and compare the results with the old one (0.4.0).

@irinaespejo
Member

irinaespejo commented Sep 1, 2021

@Sinclert wow, nice. I was also working on this, without success. Regarding point 3:

> A me5_configuration.txt file has been added to the set of cards, with options:
> run_mode=2: to run in multi-core mode.
> nb_core=None: to assign as many processes as cores detected.

When I uncommented # run_mode=2 and ran the workflow with yadage-run, I saw that cards still containing the commented-out # run_mode=2 line were being created in the generate step and transmitted to the pythia step.

I think the easiest way to check whether we are really running with run_mode=2 is for @khurtado to run the mg-process-parallelization branch on the VT3 cluster and let us know if the scalability issue is solved. @khurtado let us know right away if you run into trouble. Thank you!!

@irinaespejo
Member

Actually, since I have access to the cluster, I'm going to run the mg-process-parallelization branch of the workflow now.

@irinaespejo
Member

Hi everyone,

The results from running scailfin/madminer-workflow-ph (mg-process-parallelization) on VC3:

Sanity checks:

  1. The workflow finishes successfully (the status is still "running", but all the step files are there).

  2. The physics workflow indeed uses the branch code.

[screenshot: 2021-09-01 14-33-18]

Other checks:

  3. The pythia stage preserves the run_mode = 2 change introduced in the docker image here.

[screenshot: 2021-09-01 14-34-53]

The command grep -R "run_mode = 2" indeed shows that:

```
./pythia_3/mg_processes/signal/Cards/me5_configuration.txt:run_mode = 2
./pythia_3/mg_processes/signal/madminer/cards/me5_configuration_0.txt:run_mode = 2
./delphes_3/extract/madminer/cards/me5_configuration_3.txt:run_mode = 2
```

(and the same for all the other pythia and delphes steps)

All good!

Now, on to scalability tests. Answering Sinclert: yes, we are interested in fine-tuning the number of processes per benchmark.

@irinaespejo
Member

Memory usage results from running the mg_process_parallelization branch:

Example of Delphes:

```
ClusterId MemoryUsage Args
217 318 /reana/users/00000000-0000-0000-0000-000000000000/workflows/e23b2ff1-8625-45da-8c13-0c05665dd6e2 madminertool/madminer-workflow-ph:0.5.0-test sh -c '/madminer/scripts/4_delphes.sh -p /madminer -m software/MG5_aMC_v2_9_4 -c /reana/users/00000000-0000-0000-0000-000000000000/workflows/e23b2ff1-8625-45da-8c13-0c05665dd6e2/workflow_ph/configure/data/madminer_config.h5 -i /reana/users/00000000-0000-0000-0000-000000000000/workflows/e23b2ff1-8625-45da-8c13-0c05665dd6e2/ph/input.yml -e /reana/users/00000000-0000-0000-0000-000000000000/workflows/e23b2ff1-8625-45da-8c13-0c05665dd6e2/workflow_ph/pythia_4/events/Events.tar.gz -o /reana/users/00000000-0000-0000-0000-000000000000/workflows/e23b2ff1-8625-45da-8c13-0c05665dd6e2/workflow_ph/delphes_4'
```

Example of Pythia:

```
210 196 /reana/users/00000000-0000-0000-0000-000000000000/workflows/e23b2ff1-8625-45da-8c13-0c05665dd6e2 madminertool/madminer-workflow-ph:0.5.0-test sh -c '/madminer/scripts/3_pythia.sh -p /madminer -m software/MG5_aMC_v2_9_4 -z /reana/users/00000000-0000-0000-0000-000000000000/workflows/e23b2ff1-8625-45da-8c13-0c05665dd6e2/workflow_ph/generate/folder_0.tar.gz -o /reana/users/00000000-0000-0000-0000-000000000000/workflows/e23b2ff1-8625-45da-8c13-0c05665dd6e2/workflow_ph/pythia_3'
```

@irinaespejo
Member

irinaespejo commented Sep 7, 2021

Hi @Sinclert, I've been testing the mg-process-parallelization branch on scailfin/workflow-madminer-ph. When running make yadage-run I found the following error.

In the file .yadage/workflow_ph/generate/_packtivity/generate.run.log there is:

```
2021-09-07 09:10:38,583 | pack.generate.run | INFO | starting file logging for topic: run
2021-09-07 09:11:03,362 | pack.generate.run | INFO | b'Benchmark: 0 sm'
2021-09-07 09:11:03,362 | pack.generate.run | INFO | b'Benchmark: 1 w'
2021-09-07 09:11:03,362 | pack.generate.run | INFO | b'Benchmark: 2 morphing_basis_vector_2'
2021-09-07 09:11:03,362 | pack.generate.run | INFO | b'Benchmark: 3 morphing_basis_vector_3'
2021-09-07 09:11:03,362 | pack.generate.run | INFO | b'Benchmark: 4 morphing_basis_vector_4'
2021-09-07 09:11:03,362 | pack.generate.run | INFO | b'Benchmark: 5 morphing_basis_vector_5'
2021-09-07 09:11:03,610 | pack.generate.run | INFO | b"sed: can't read s/nb_core = None/nb_core = 1/: No such file or directory"
```

This was solved by doing the following changes:

  1. Remove the " " from this line in scripts/2_generate.sh. The new line should look like sed -i \
  2. Re-build the madminer-workflow-ph image; in this case I named it madminertool/madminer-workflow-ph:0.5.0-test-2
  3. Update the image tag in steps.yml here
  4. make yadage-run

The workflow finishes successfully now without any further errors.
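For context on why the original quoting failed on Linux: GNU sed does not accept a separate argument as the -i backup suffix, so the empty string is parsed as the sed script and the real expression is then treated as a file name. An illustrative reproduction (the card name is just an example):

```sh
# Works on macOS (BSD sed, where -i takes a backup suffix), but on GNU sed the ""
# becomes the (empty) script and the s/// expression is read as an input file,
# giving "sed: can't read ...: No such file or directory".
sed -i "" "s/nb_core = None/nb_core = 1/" me5_configuration_0.txt

# Dropping the "" (or using -e, as in the snippet proposed below) works on GNU sed:
sed -i "s/nb_core = None/nb_core = 1/" me5_configuration_0.txt
```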

@Sinclert
Member

Sinclert commented Sep 7, 2021

Hi @irinaespejo ,

I included "" because of macOS compatibility. I thought it was a quick fix to make the script runnable both in macOS and Linux. It seems it did not work.

According to this StackOverflow post, we could achieve this by using the -e flag instead. Could you try the following snippet and confirm that it runs on Linux?

```sh
sed -i \
    -e "s/${default_spec}/${custom_spec}/" \
    "${SIGNAL_ABS_PATH}/madminer/cards/me5_configuration_${i}.txt"
```

@irinaespejo
Member

I just tested the snippet you posted and it runs successfully ✔️ (my upload internet connection is pretty slow)

@Sinclert
Member

Sinclert commented Sep 9, 2021

The PR changing the parallelization strategy (#11) has been merged.

We should be in a better spot to test the total time + memory consumption of each benchmark job.

@Sinclert changed the title from "Scaling up via num_generation_jobs increases the memory usage per job by a somewhat similar factor" to "Unexpected memory usage when scaling up via num_generation_jobs" on Oct 5, 2021
@Sinclert
Member

Sinclert commented Oct 5, 2021

Hi @khurtado and @irinaespejo,

Is there anything else to discuss within this issue? Have you tried the latest version of the workflow?

@irinaespejo
Member

The last version of the workflow ran successfully after Kenyi did some fixing of the cluster permissions.
@khurtado what is the situation on the cluster for submitting computationally intensive workflows? Can we just try? Thanks!!

@khurtado
Author

khurtado commented Oct 7, 2021

@irinaespejo Yes, the cluster should have workers to work with. I still need to fix the website certs; I will do that tomorrow.

@Sinclert
Member

Hi. I am closing this issue for now.

For future reports of performance issues, configuration tweaks, etc., please open a separate issue.
