Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for GPU parameters at ReReco spec level #10388

Closed
amaltaro opened this issue Mar 30, 2021 · 17 comments · Fixed by #10799
Closed

Add support for GPU parameters at ReReco spec level #10388

amaltaro opened this issue Mar 30, 2021 · 17 comments · Fixed by #10799

Comments

@amaltaro
Copy link
Contributor

amaltaro commented Mar 30, 2021

Impact of the new feature
ReqMgr2

Is your feature request related to a problem? Please describe.
It's a new project to support GPU processing in central production workflows, where WM system will be the bridge between Offline/CMSSW and the grid resources made available to CMS (through the glideinWMS layer).

There will be a series of such tickets, such that we can try to break work down in even parts.

Describe the solution you'd like
In short, we need to support new workflow spec attributes for the ReReco/DataProcessing spec modules.

Now expanding on the important details:

  • create a new spec attribute named RequiresGPU, with one of the possible 3 string values:
    • forbidden: must not use GPUs (default)
    • optional: can use GPU, if possible
    • required: must use GPUs
  • create another spec attribute named GPUParams, which will contain all the necessary information to match/leverage GPU resources. It's meant to be JSON encoded python dictionary (thus of string type). There are 3 required and 3 optional parameters. The required parameters are:
    • GPUMemory: integer with the amount of memory, in Megabytes (MB). Validate as > 0. E.g.: 8000
    • CUDACapabilities: a list of short strings (could be real though). Validation should ensure at least one item in the list and matching this regex: r"^\d+.\d$". E.g.: ["7.5", "8.0"]
    • CUDARuntime: a short string with the runtime version (could be real though). Validated against this regex: r"^\d+.\d+$". E.g.: "11.2"
      The list of optional parameters are:
    • GPUName: a string with the GPU name. Validate against <= 100 chars. E.g. "Tesla T4", "Quadro RTX 6000";
    • CUDADriverVersion: a string with the CUDA driver version. Validated against this regex: r"^\d+.\d+\d+$"E.g. "460.32.03"
    • CUDARuntimeVersion: a string with the CUDA runtime version. Validated against this regex: r"^\d+.\d+\d+$"E.g. "11.2.152"
  • another validation is, if RequiresGPU is optional or required, then we should enforce GPUParams to contain the 3 required parameters with correct values;
  • same way if RequiresGPU is forbidden, then GPUParams should be an empty string (FIXME: not sure we should enforce it(?))
  • all this validation should happen only at workflow creation, which means those two new attributes are NOT allowed to change after the workflow is created.

Just as an example, here is how GPUParams required parameters would look like:

'{"GPUMemory": 8000, "CUDARuntime": "11.2", "CUDACapabilities": ["7.5", "8.0"]}'

and this is an example with the full set of parameters:

'{"GPUMemory": 8000, "CUDARuntime": "11.2", "CUDACapabilities": ["7.5", "8.0"], "GPUName": "Tesla T4", "CUDADriverVersion": "460.32.03", "CUDARuntimeVersion": "11.2.152"}'

Describe alternatives you've considered
Not to validate the optional parameters.

Additional context
Major discussion happened here: cms-sw/cmssw#33057
and description of these parameters: cms-sw/cmssw#33057 (comment)

@justinasr
Copy link

Hi,
Small technical question about GPUParams string/dict - does ReqMgr2 already expects a string from us or it should be a dict that you can validate parameter by parameter and then stringify it somewhere on your side?
Types: all values in GPUParams are strings except for memory which is int and CUDACapabilities is list of strings, right? I am double checking because your example has floats.

@mrceyhun
Copy link
Contributor

As a complement to Alan's comment, I would like to say that CUDACapabilities should be string.

@amaltaro
Copy link
Contributor Author

Hi @justinasr, it's good to make sure we are talking about the same things.
On the McM side, you would provide a string (JSON object) to the GPUParams parameter, so McM could do something like:

In [15]: GPUParams = {"GPUMemory": 8000, "CUDARuntime": "11.2", "CUDACapabilities": ["7.5", "8.0"]}
In [16]: GPUParams = json.dumps(GPUParams)

and inject it into ReqMgr2.

On the ReqMgr2 side, we would have to load that string (JSON object) as a python dictionary, so:

In [18]: gpuParams = json.loads(GPUParams)

and then the whole validation would happen on top of this python dictionary.
As we discussed in the past, such approach is less memory hungry than a python dictionary, so we better store it as such in central CouchDB.

Types: all values in GPUParams are strings except for memory which is int and CUDACapabilities is list of strings, right?

That's correct. I have fixed the data structure example in the initial description. I also added an example with the full set of parameters.

As a complement to Alan's comment, I would like to say that CUDACapabilities should be string.

Thanks Ceyhun, I have also fixed the CUDACapability vs CUDACapabilities inconsistency in the initial description.

@amaltaro
Copy link
Contributor Author

And here is a ticket to address the job creation and submission to HTCondor: #10393

@fwyzard
Copy link

fwyzard commented Mar 31, 2021

@amaltaro are these parameters

GPUMemory
CUDACapabilities
CUDARuntime
GPUName
CUDADriverVersion
CUDARuntimeVersion

the ones that will be published by the nodes, or the ones that will be requested by the jobs ?

@amaltaro
Copy link
Contributor Author

@fwyzard Andrea, these are the parameters to be published by the jobs, so the actual GPU job requirements.
The job matchmaking expression - containing the node parameters - is defined by the Submission Infrastructure team, they might clarify what those are in the other issue #10393

@fwyzard
Copy link

fwyzard commented Mar 31, 2021

Hi Alan - OK, thanks, then the list looks correct, as far as I can say.

By the way, I've posted a brief description of the fields here: cms-sw/cmssw#33057 (comment) .

@amaltaro
Copy link
Contributor Author

amaltaro commented Mar 31, 2021

Thanks for providing a description for all those parameters, Andrea. I have linked them at the top description of this GH issue, but it might be better to expand those in a wiki under this project. Thanks again!

@amaltaro amaltaro changed the title Add support to GPU parameters at ReReco spec level Add support for GPU parameters at ReReco spec level Mar 31, 2021
@amaltaro
Copy link
Contributor Author

amaltaro commented Apr 26, 2021

As an outcome of the GPU discussion we had today, I have just updated the initial description of this PR, mostly concerned with the RequiresGPU parameter with a different set of possible values (and data type different than the initial description).

@srimanob
Copy link

PR for runTheMatrix is updated,
Master: cms-sw/cmssw#33538
11_3: cms-sw/cmssw#33057

Here is how it looks like to step/task that runs GPU.

'Task3': {'AcquisitionEra': 'CMSSW_11_3_X_2021-04-26-2300',
           'ConfigCacheID': 1042,
           'EventStreams': 0,
           'GPUParams': {'CUDACapabilities': ['7.5'],
                         'CUDADriverVersion': '',
                         'CUDARuntime': '11.2',
                         'CUDARuntimeVersion': '',
                         'GPUMemory': '8000',
                         'GPUName': ''},
           'GlobalTag': u'113X_mcRun3_2021_realistic_v10',
           'InputFromOutputModule': u'FEVTDEBUGHLToutput',
           'InputTask': 'Digi_2021',
           'KeepOutput': True,
           'LumisPerJob': 10,
           'Memory': 3000,
           'Multicore': 1,
           'ProcessingString': u'113X_mcRun3_2021_realistic_v10',
           'RequiresGPU': 'required',
           'SplittingAlgo': 'LumiBased',
           'TaskName': 'Reco_Patatrack_PixelOnlyGPU_2021'},

@srimanob
Copy link

In addition, there is a comment on CUDARuntime in
cms-sw/cmssw#33057 (comment)

Since it matches with node's CUDACompatibleRuntimes, should we just change it to be the same? Or it will make the meaning off from what we consider.

@fwyzard
Copy link

fwyzard commented Apr 27, 2021

I don't think we should change it:

  • a job's CUDARuntime is the CUDA runtime version that a job uses;
  • a node's CUDACompatibleRuntimes is a list of runtimes supported by the node;
  • a node also has a CUDARuntime, which is the native version of the CUDA runtime installed on the node.

@amaltaro
Copy link
Contributor Author

@srimanob the GPUParams has to be provided as a json object (json.dumps). Please have a look at the discussion Justinas and I had in this issue.

Unfortunately we have not yet started working on this issue on the WMCore side, so we still do not have anything for you to test on. However, if your CMSSW pull requests can wait for a week more (maybe two...), we might be able to get the basic changes in ReqMgr2 such that you could test in testbed.

@amaltaro amaltaro self-assigned this Sep 8, 2021
@amaltaro
Copy link
Contributor Author

amaltaro commented Sep 8, 2021

@justinasr @fwyzard @srimanob @mrceyhun Hi everyone, I just wanted to let you know that I'm starting to work on this issue, based on the initial specification in the first comment of this GH issue.

In short, new workflow/spec parameters are going to be:

  • RequiresGPU: with these allowed values forbidden, or optional or required
  • GPUParams: mandatory if RequiresGPU is optional | required, otherwise optional (default to empty string)
    • and it has 3 mandatory parameters, + 3 optional ones.

Please let us know in case you see any inconsistency, or if there are new updates relevant to GPU support.

@srimanob
Copy link

srimanob commented Sep 8, 2021

Thanks @amaltaro
I will re-add them to runTheMatrix.

@amaltaro
Copy link
Contributor Author

@srimanob Phat, as you can see in the PR providing these features (#10799), the GPUMemory parameter has been renamed to GPUMemoryMB, such that it is very explicitly and clear which unit we are supposed to have there. If you see this change to be too problematic, we can discuss it during the day.

@srimanob
Copy link

@amaltaro GPUMemoryMB is fine for me. I've added the PR at cms-sw/cmssw#35263. Could you please have a look? Thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants