I want to run a Ray program with many 2-CPU actors on a single m5.24xlarge instance on AWS to avoid network communication delays, but Ray gets horribly slow when PyTorch calls are executed concurrently by multiple actors on the same machine. I tested this on my local machine and on 2 remote Ubuntu machines, and it seems to be true for all of them.
In the System Monitor, I can see all CPUs shooting up to close to 100% even when the actor is limited to 1 CPU (that is, when running just 1 actor, of course!).
I am not sure whether this is a Ray or a PyTorch problem, but I hope someone can help.
Note: across many separate AWS m5.large instances (each has 2 CPUs, i.e. one actor per machine), my program scales very well, so that is not the cause.
Here is toy code that, when run on a single multi-CPU machine, runs slower when the work is split among 5 actors than when a single actor does all of it:
import time

import ray
import torch


class NeuralNet(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.l = torch.nn.Linear(1000, 2048)
        self.l2 = torch.nn.Linear(2048, 2)

    def forward(self, x):
        return self.l2(self.l(x))


@ray.remote(num_cpus=1)
class TestActor:
    def __init__(self):
        self.net = NeuralNet()
        self.crit = torch.nn.MSELoss()

    def do_torch_stuff(self, batch_size):
        p = self.net(torch.rand((batch_size, 1000)))


def _parallel_on_5_actors():
    t0 = time.time()
    ray.init()
    acs = [TestActor.remote() for _ in range(5)]
    for _ in range(1000):
        ray.get([ac.do_torch_stuff.remote(10) for ac in acs])
    print("With 5 actors: ", time.time() - t0)


def _all_on_1_actor():
    t0 = time.time()
    ray.init()
    ac = TestActor.remote()
    for _ in range(5000):
        ray.get(ac.do_torch_stuff.remote(10))
    print("With 1 actor: ", time.time() - t0)


if __name__ == '__main__':
    _all_on_1_actor()  # ~10 sec on my machine
    # _parallel_on_5_actors()  # -> ~18 sec on my machine. Should be 2?!?!?
PyTorch already parallelizes internally using multiple threads. When you have multiple processes on the same machine, this can cause excessive thrashing from context switching.
It looks like PyTorch doesn't let you set the thread count explicitly: pytorch/pytorch#975
However, setting OMP_NUM_THREADS=1 prior to starting Ray should work.
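For reference, a minimal sketch of that workaround applied to the reproduction script above (not a verified fix; the assumption is that the variable must be set before torch creates its thread pool, so it goes at the very top of the driver script, or is exported in the shell before launching):

import os
os.environ["OMP_NUM_THREADS"] = "1"  # set before importing torch so OpenMP picks up the limit

import ray
import torch  # imported after the env var

# The env var should be inherited by the worker processes Ray starts, so each
# 1-CPU actor sticks to a single intra-op thread instead of grabbing all cores.
ray.init()

# ... define NeuralNet / TestActor and run _parallel_on_5_actors() as above ...

With that change, five 1-CPU actors should no longer oversubscribe the machine, so the 5-actor run should scale instead of regressing.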