parallel execution: GPU resources #1

Closed
bertsky opened this issue Nov 8, 2019 · 0 comments

bertsky commented Nov 8, 2019

When executing workflows with --jobs, CPU resources are employed in parallel (up to the requested number of jobs, or as determined automatically via the load factor). This parallelism applies when recursing into workspaces, whereas the steps within each individual workspace still run sequentially.
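
To illustrate that execution model, here is a minimal Python sketch (not the actual implementation; the workspace names, processor steps, and the plain echo call are made up for illustration):

    from concurrent.futures import ProcessPoolExecutor
    import subprocess

    WORKSPACES = ["ws1", "ws2", "ws3"]   # hypothetical workspace directories
    JOBS = 4                             # corresponds to --jobs

    def run_workflow(workspace):
        # within one workspace, the processors run strictly one after another
        for step in ["processor-a", "processor-b"]:   # hypothetical processor steps
            subprocess.run(["echo", step, workspace], check=True)   # stand-in for the real call

    if __name__ == "__main__":
        # up to JOBS workspaces are processed concurrently
        with ProcessPoolExecutor(max_workers=JOBS) as pool:
            list(pool.map(run_workflow, WORKSPACES))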

But this fails when some processors in the workflow require GPU resources that cannot be shared, at least not at the same number of parallel jobs. Those processors will then sporadically run into out-of-memory errors like this...

CUDA runtime implicit initialization on GPU:0 failed. Status: out of memory

...or like this...

OOM when allocating tensor with shape[1475200] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc

Therefore, we need a mechanism to

  • know which processors allocate GPU resources
  • know how many such processors can be run in parallel
  • either reduce the parallel execution to that number in general, or
  • queue only those processors specifically (creating a local bottleneck for each workspace); a sketch of this option follows below
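
As a first approximation of that last option, a minimal Python sketch (merely to illustrate the idea; GPU_SLOTS, the set of GPU processors, the workspaces, and the plain echo call are assumptions, not actual configuration):

    import subprocess
    import threading
    from concurrent.futures import ThreadPoolExecutor

    JOBS = 4                                       # overall parallelism, as with --jobs
    GPU_SLOTS = 1                                  # how many GPU processors may run at once (assumption)
    GPU_PROCESSORS = {"ocrd-calamari-recognize"}   # processors known to allocate GPU memory (example)
    gpu_slots = threading.BoundedSemaphore(GPU_SLOTS)

    def run_step(processor, workspace):
        # stand-in for invoking the actual processor CLI on the workspace
        call = ["echo", processor, workspace]
        if processor in GPU_PROCESSORS:
            # local bottleneck: at most GPU_SLOTS GPU-bound steps at any time
            with gpu_slots:
                subprocess.run(call, check=True)
        else:
            # CPU-bound steps keep the full parallelism
            subprocess.run(call, check=True)

    with ThreadPoolExecutor(max_workers=JOBS) as pool:
        for ws in ["ws1", "ws2", "ws3", "ws4"]:    # hypothetical workspaces
            pool.submit(run_step, "ocrd-calamari-recognize", ws)

The first option would instead simply lower the overall parallelism (i.e. --jobs) to GPU_SLOTS for the whole workflow.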