parallel execution: GPU resources #1

Closed
bertsky opened this issue Nov 8, 2019 · 0 comments

bertsky commented Nov 8, 2019

When executing workflows with --jobs, CPU resources are employed in parallel (up to the requested number of jobs, or as determined automatically via the load factor). This parallelism applies when recursing into workspaces, whereas the steps within each individual workspace still run sequentially.
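
To illustrate that execution model, here is a minimal Python sketch (not the actual implementation; the workspace names, processor steps, and the plain echo call are made up for illustration):

    from concurrent.futures import ProcessPoolExecutor
    import subprocess

    WORKSPACES = ["ws1", "ws2", "ws3"]   # hypothetical workspace directories
    JOBS = 4                             # corresponds to --jobs

    def run_workflow(workspace):
        # within one workspace, the processors run strictly one after another
        for step in ["processor-a", "processor-b"]:   # hypothetical processor steps
            subprocess.run(["echo", step, workspace], check=True)   # stand-in for the real call

    if __name__ == "__main__":
        # up to JOBS workspaces are processed concurrently
        with ProcessPoolExecutor(max_workers=JOBS) as pool:
            list(pool.map(run_workflow, WORKSPACES))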

But this fails when some processors in the workflow require GPU resources that cannot be shared, at least not at the same number of parallel jobs. Those processors will then sporadically run into out-of-memory errors like this...

CUDA runtime implicit initialization on GPU:0 failed. Status: out of memory

...or like this...

OOM when allocating tensor with shape[1475200] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc

Therefore, we need a mechanism to

  • know which processors allocate GPU resources
  • know how many such processors can be run in parallel
  • either reduce the parallel execution to that number in general, or
  • queue only those processors specifically (creating a local bottleneck for each workspace); a sketch of this option follows below
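
As a first approximation of that last option, a minimal Python sketch (merely to illustrate the idea; GPU_SLOTS, the set of GPU processors, the workspaces, and the plain echo call are assumptions, not actual configuration):

    import subprocess
    import threading
    from concurrent.futures import ThreadPoolExecutor

    JOBS = 4                                       # overall parallelism, as with --jobs
    GPU_SLOTS = 1                                  # how many GPU processors may run at once (assumption)
    GPU_PROCESSORS = {"ocrd-calamari-recognize"}   # processors known to allocate GPU memory (example)
    gpu_slots = threading.BoundedSemaphore(GPU_SLOTS)

    def run_step(processor, workspace):
        # stand-in for invoking the actual processor CLI on the workspace
        call = ["echo", processor, workspace]
        if processor in GPU_PROCESSORS:
            # local bottleneck: at most GPU_SLOTS GPU-bound steps at any time
            with gpu_slots:
                subprocess.run(call, check=True)
        else:
            # CPU-bound steps keep the full parallelism
            subprocess.run(call, check=True)

    with ThreadPoolExecutor(max_workers=JOBS) as pool:
        for ws in ["ws1", "ws2", "ws3", "ws4"]:    # hypothetical workspaces
            pool.submit(run_step, "ocrd-calamari-recognize", ws)

The first option would instead simply lower the overall parallelism (i.e. --jobs) to GPU_SLOTS for the whole workflow.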