
Multiple ray Actors on a single machine fight for CPUs with PyTorch #3609

Closed
@EricSteinberger

Description

System information

Describe the problem

I want to run a ray program with many 2-CPU actors on a single m5.24xlarge instance on AWS to avoid network communication delays, but ray gets horribly slow when multiple actors on the same machine execute PyTorch calls concurrently. I tested this on my local machine and on 2 remote Ubuntu machines, and it holds for all of them.

In the System Monitor, I can see all CPUs shoot up to close to 100% even when the actor is limited to 1 CPU (that is, when running just 1 actor, of course!).
I am not sure whether this is a ray or a PyTorch problem, but I hope someone can help.

Note: across many separate AWS m5.large instances (each has 2 CPUs, i.e. one actor per machine), my program scales very well, so the workload itself is not the problem.
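My suspicion (an assumption on my part, not something I have confirmed): PyTorch's intra-op thread pool defaults to the machine's core count, regardless of an actor's num_cpus. A quick check like the following sketch should show it; ThreadCheckActor is a hypothetical name I made up for this snippet:

import ray
import torch


@ray.remote(num_cpus=1)
class ThreadCheckActor:
    def thread_count(self):
        # Reports the size of PyTorch's intra-op thread pool inside
        # this actor's process.
        return torch.get_num_threads()


if __name__ == '__main__':
    ray.init()
    actors = [ThreadCheckActor.remote() for _ in range(5)]
    # If each 1-CPU actor reports a pool sized to all cores of the
    # machine, 5 actors together oversubscribe the CPUs.
    print(ray.get([a.thread_count.remote() for a in actors]))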

Here is toy code that, when run on a single multi-CPU machine, is slower when the jobs are split among 5 actors than when a single actor does all of them:

import time

import ray
import torch


class NeuralNet(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.l = torch.nn.Linear(1000, 2048)
        self.l2 = torch.nn.Linear(2048, 2)

    def forward(self, x):
        return self.l2(self.l(x))


@ray.remote(num_cpus=1)
class TestActor:
    def __init__(self):
        self.net = NeuralNet()

    def do_torch_stuff(self, batch_size):
        # Forward pass only; the output is discarded.
        self.net(torch.rand(batch_size, 1000))


def _parallel_on_5_actors():
    # 5 actors x 1000 rounds = 5000 forward passes in total.
    t0 = time.time()
    acs = [TestActor.remote() for _ in range(5)]
    for _ in range(1000):
        ray.get([ac.do_torch_stuff.remote(10) for ac in acs])
    print("With 5 actors: ", time.time() - t0)


def _all_on_1_actor():
    # 1 actor x 5000 calls = 5000 forward passes in total.
    t0 = time.time()
    ac = TestActor.remote()
    for _ in range(5000):
        ray.get(ac.do_torch_stuff.remote(10))
    print("With 1 actor: ", time.time() - t0)


if __name__ == '__main__':
    ray.init()  # called once here so both benchmarks can run in one process
    _all_on_1_actor()  # ~10 sec on my machine
    # _parallel_on_5_actors()  # ~18 sec on my machine. Should be ~2 sec?!
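For what it's worth, here is a workaround sketch I would try, assuming the cause really is thread oversubscription and not something in ray itself (untested speculation; it reuses NeuralNet from the repro above): pin each actor's PyTorch thread pool to its CPU share with torch.set_num_threads before building the network.

@ray.remote(num_cpus=1)
class PinnedTestActor:
    def __init__(self):
        # Limit PyTorch's intra-op parallelism to the 1 CPU this actor
        # was given, so 5 actors no longer compete for every core.
        torch.set_num_threads(1)
        self.net = NeuralNet()

    def do_torch_stuff(self, batch_size):
        self.net(torch.rand(batch_size, 1000))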

Metadata
Labels

question: Just a question :)
