BUG: matmul faces OOM issue #718

Open
luweizheng opened this issue Sep 25, 2023 · 1 comment
Labels
bug Something isn't working
Milestone

Comments

@luweizheng
Collaborator

Describe the bug

matmul fails with an out-of-memory (OOM) error on a large matrix. Dask can run a matrix of the same size.

To Reproduce

git clone git@github.com:xorbitsai/benchmarks.git
cd benchmarks/array/xorbits

Run the matmul workload. To target an Xorbits cluster, pass its endpoint as ${address}.

python workloads.py --endpoint ${address} \
    --workloads matmul \
    --size xl

Expected behavior

Xorbits should be able to run matmul on large matrices.

@XprobeBot XprobeBot added the bug Something isn't working label Sep 25, 2023
@XprobeBot XprobeBot modified the milestones: v0.6.3, v0.7.0 Sep 25, 2023
@luweizheng
Collaborator Author

When the matrix is large, say 100_000 × 100_000, using more nodes may help.
I am now using 10 nodes, each with 512 GB of memory and 256 GB of /dev/shm. The computation does return the actual results, but I also get the following error. It seems that the actors cannot shut down properly.

2023-09-25 17:11:07,579 xorbits._mars.services.web.core 17333 ERROR    ActorNotExist when handling request with LifecycleWebAPIHandler.decref_tileables
Traceback (most recent call last):
  File "/fs/fast/u20200002/envs/ds/lib/python3.10/site-packages/xorbits/_mars/services/web/core.py", line 69, in wrapped
    res = await func(self, *args, **kwargs)
  File "/fs/fast/u20200002/envs/ds/lib/python3.10/site-packages/xorbits/_mars/services/lifecycle/api/web.py", line 39, in decref_tileables
    await oscar_api.decref_tileables(tileable_keys, counts=counts)
  File "/fs/fast/u20200002/envs/ds/lib/python3.10/site-packages/xorbits/_mars/services/lifecycle/api/oscar.py", line 108, in decref_tileables
    return await self._lifecycle_tracker_ref.decref_tileables(tileable_keys)
  File "/fs/fast/u20200002/envs/ds/lib/python3.10/site-packages/xoscar/backends/context.py", line 227, in send
    return self._process_result_message(result)
  File "/fs/fast/u20200002/envs/ds/lib/python3.10/site-packages/xoscar/backends/context.py", line 102, in _process_result_message
    raise message.as_instanceof_cause()
  File "/fs/fast/u20200002/envs/ds/lib/python3.10/site-packages/xoscar/backends/pool.py", line 657, in send
    result = await self._run_coro(message.message_id, coro)
  File "/fs/fast/u20200002/envs/ds/lib/python3.10/site-packages/xoscar/backends/pool.py", line 368, in _run_coro
    return await coro
  File "/fs/fast/u20200002/envs/ds/lib/python3.10/site-packages/xoscar/api.py", line 306, in __on_receive__
    return await super().__on_receive__(message)  # type: ignore
  File "xoscar/core.pyx", line 558, in __on_receive__
    raise ex
  File "xoscar/core.pyx", line 550, in xoscar.core._BaseActor.__on_receive__
    return await self._handle_actor_result(result)
  File "xoscar/core.pyx", line 422, in _handle_actor_result
    task_result = await coros[0]
  File "xoscar/core.pyx", line 465, in xoscar.core._BaseActor._run_actor_async_generator
    async with self._lock:
  File "xoscar/core.pyx", line 466, in xoscar.core._BaseActor._run_actor_async_generator
    with debug_async_timeout('actor_lock_timeout',
  File "xoscar/core.pyx", line 471, in xoscar.core._BaseActor._run_actor_async_generator
    res = await gen.athrow(*res)
  File "/fs/fast/u20200002/envs/ds/lib/python3.10/site-packages/xorbits/_mars/services/lifecycle/supervisor/tracker.py", line 255, in decref_tileables
    yield asyncio.gather(*coros)
  File "xoscar/core.pyx", line 476, in xoscar.core._BaseActor._run_actor_async_generator
    res = await self._handle_actor_result(res)
  File "xoscar/core.pyx", line 396, in _handle_actor_result
    result = await result
  File "/fs/fast/u20200002/envs/ds/lib/python3.10/site-packages/xorbits/_mars/services/lifecycle/supervisor/tracker.py", line 174, in _remove_chunks
    await self._meta_api.del_chunk_meta.batch(*delete_metas)
  File "/fs/fast/u20200002/envs/ds/lib/python3.10/site-packages/xoscar/batch.py", line 151, in _async_batch
    return await self.batch_func(args_list, kwargs_list)
  File "/fs/fast/u20200002/envs/ds/lib/python3.10/site-packages/xorbits/_mars/services/meta/api/oscar.py", line 204, in batch_del_chunk_meta
    del_chunk_metas.append(self._meta_store.del_meta.delay(*args, **kwargs))
  File "xoscar/core.pyx", line 259, in xoscar.core.LocalActorRef.__getattr__
    raise ActorNotExist(f"Actor {self.uid} does not exist") from None
xoscar.errors.ActorNotExist: [address=cpu64c-3:39783, pid=17469] Actor b'rorsdfNHuVApDnQuJePe8bJR_meta' does not exist
2023-09-25 17:11:07,586 tornado.access 17333 ERROR    500 POST /api/session/rorsdfNHuVApDnQuJePe8bJR/lifecycle?action=decref_tileables (192.168.0.77) 867.54ms
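
For a sense of scale, here is a rough back-of-the-envelope estimate of the memory a 100_000 × 100_000 matmul needs. This is only a sketch: the element type is not stated above, so float64 (8 bytes per element) is an assumption, as is the shape of the computation (two distinct inputs A and B plus one output C). The helper name dense_matrix_bytes is made up for illustration, and the estimate ignores any intermediate chunks or copies the scheduler may create.

```python
def dense_matrix_bytes(rows: int, cols: int, itemsize: int = 8) -> int:
    """Bytes needed to hold one dense rows x cols matrix.

    itemsize defaults to 8, i.e. float64; this is an assumption,
    the issue does not state the dtype.
    """
    return rows * cols * itemsize


n = 100_000
one_matrix = dense_matrix_bytes(n, n)

# Assumed computation: C = A @ B with two inputs and one output,
# not counting intermediate results or serialization copies.
total = 3 * one_matrix

print(f"one matrix: {one_matrix / 1e9:.0f} GB ({one_matrix / 2**30:.1f} GiB)")
print(f"A, B and C: {total / 1e9:.0f} GB ({total / 2**30:.1f} GiB)")
```

Under these assumptions a single matrix is about 80 GB, so the inputs and output together come to roughly 240 GB before any intermediates are counted.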

@XprobeBot XprobeBot modified the milestones: v0.7.0, v0.7.1 Oct 23, 2023
@XprobeBot XprobeBot modified the milestones: v0.7.1, v0.7.2 Nov 21, 2023
@XprobeBot XprobeBot modified the milestones: v0.7.2, v0.7.3 Jan 5, 2024
@XprobeBot XprobeBot modified the milestones: v0.7.3, v0.7.4 Aug 22, 2024
Projects
None yet
Development

No branches or pull requests

2 participants