
[CI] linux://doc/source/ray-air/examples:gptj_serving is failing/flaky on master. #41491

Closed
justinvyu opened this issue Nov 29, 2023 · 2 comments · Fixed by #42243
Assignees
matthewdeng

Labels
flaky-tracker (Issue created via Flaky Test Tracker https://flaky-tests.ray.io/), P0 (Issues that should be fixed in short order), train (Ray Train Related Issue)

Comments

@justinvyu
Contributor

....
Generated from flaky test tracker. Please do not edit the signature in this section.
DataCaseName-linux://doc/source/ray-air/examples:gptj_serving-END
....

@justinvyu justinvyu added triage Needs triage (eg: priority, bug/not-bug, and owning component) serve Ray Serve Related Issue train Ray Train Related Issue flaky-tracker Issue created via Flaky Test Tracker https://flaky-tests.ray.io/ labels Nov 29, 2023
can-anyscale pushed a commit that referenced this issue Nov 30, 2023
This example is consistently flaky, so we should make it non-blocking: #41491

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
@anyscalesam anyscalesam removed the serve Ray Serve Related Issue label Dec 4, 2023
@justinvyu justinvyu self-assigned this Dec 7, 2023
@justinvyu justinvyu added the P1 Issue that should be fixed within a few weeks label Dec 7, 2023
@anyscalesam anyscalesam removed the triage Needs triage (eg: priority, bug/not-bug, and owning component) label Dec 11, 2023
@anyscalesam
Contributor

@justinvyu can we upgrade this to P0? Given it's a Linux release test, it would block the Ray 2.10 release, correct?

@justinvyu
Contributor Author

@anyscalesam Yes, let's fix it. I think it's caused by some external change (a GPU out-of-memory error without any code change on our side):

(ServeReplica:default:PredictDeployment pid=12306)     attn_outputs = self.attn(
(ServeReplica:default:PredictDeployment pid=12306)   File "/opt/miniconda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
(ServeReplica:default:PredictDeployment pid=12306)     return forward_call(*args, **kwargs)
(ServeReplica:default:PredictDeployment pid=12306)   File "/opt/miniconda/lib/python3.8/site-packages/accelerate/hooks.py", line 165, in new_forward
(ServeReplica:default:PredictDeployment pid=12306)     output = old_forward(*args, **kwargs)
(ServeReplica:default:PredictDeployment pid=12306)   File "/tmp/ray/session_2023-11-29_11-26-47_341554_8580/runtime_resources/pip/9fc91d65ff21920e7899b7316ff121f0058d7e36/virtualenv/lib/python3.8/site-packages/transformers/models/gptj/modeling_gptj.py", line 254, in forward
(ServeReplica:default:PredictDeployment pid=12306)     present = (key.to(hidden_states.dtype), value)
(ServeReplica:default:PredictDeployment pid=12306) torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 MiB (GPU 0; 14.56 GiB total capacity; 11.48 GiB already allocated; 1.50 MiB free; 11.54 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
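
For context, the allocator's own hint in that message is to set max_split_size_mb via PYTORCH_CUDA_ALLOC_CONF. Below is a minimal sketch of how that could be wired into the GPT-J Serve deployment; only the deployment name PredictDeployment comes from the logs above, while the env var value, the fp16 load, and the request handling are illustrative assumptions, not the fix that landed in #42243.

import torch
from ray import serve
from transformers import pipeline


@serve.deployment(
    ray_actor_options={
        "num_gpus": 1,
        # Follow the error message's suggestion to limit allocator fragmentation.
        # The 128 MiB split size is an example value, not a tuned setting.
        "runtime_env": {
            "env_vars": {"PYTORCH_CUDA_ALLOC_CONF": "max_split_size_mb:128"}
        },
    }
)
class PredictDeployment:
    def __init__(self):
        # Loading GPT-J in fp16 roughly halves the weight memory on a ~16 GiB GPU.
        self.pipe = pipeline(
            "text-generation",
            model="EleutherAI/gpt-j-6b",
            torch_dtype=torch.float16,
            device=0,
        )

    async def __call__(self, request) -> str:
        # Expects a JSON body like {"prompt": "..."} (assumed request schema).
        prompt = (await request.json())["prompt"]
        return self.pipe(prompt, max_new_tokens=64)[0]["generated_text"]


app = PredictDeployment.bind()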

@anyscalesam anyscalesam added P0 Issues that should be fixed in short order and removed P1 Issue that should be fixed within a few weeks labels Jan 5, 2024
@matthewdeng matthewdeng assigned matthewdeng and unassigned justinvyu Jan 8, 2024