
[CI] linux://doc/source/ray-air/examples:gptj_serving is failing/flaky on master. #41491

Closed
justinvyu opened this issue Nov 29, 2023 · 2 comments · Fixed by #42243
Assignees
matthewdeng

Labels
flaky-tracker (Issue created via Flaky Test Tracker https://flaky-tests.ray.io/), P0 (Issues that should be fixed in short order), train (Ray Train Related Issue)

Comments

@justinvyu
Contributor

....
Generated from flaky test tracker. Please do not edit the signature in this section.
DataCaseName-linux://doc/source/ray-air/examples:gptj_serving-END
....

@justinvyu justinvyu added triage Needs triage (eg: priority, bug/not-bug, and owning component) serve Ray Serve Related Issue train Ray Train Related Issue flaky-tracker Issue created via Flaky Test Tracker https://flaky-tests.ray.io/ labels Nov 29, 2023
can-anyscale pushed a commit that referenced this issue Nov 30, 2023
This example is consistently flaky, so we should make it non-blocking: #41491

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
@anyscalesam anyscalesam removed the serve Ray Serve Related Issue label Dec 4, 2023
@justinvyu justinvyu self-assigned this Dec 7, 2023
@justinvyu justinvyu added the P1 Issue that should be fixed within a few weeks label Dec 7, 2023
@anyscalesam anyscalesam removed the triage Needs triage (eg: priority, bug/not-bug, and owning component) label Dec 11, 2023
@anyscalesam
Contributor

@justinvyu can we upgrade this to P0? Given it's a Linux release test, it would block the Ray 2.10 release, correct?

@justinvyu
Contributor Author

@anyscalesam Yes, let's fix it. I think it's caused by some external change (a GPU out-of-memory error without any code change on our side):

(ServeReplica:default:PredictDeployment pid=12306)     attn_outputs = self.attn(
(ServeReplica:default:PredictDeployment pid=12306)   File "/opt/miniconda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
(ServeReplica:default:PredictDeployment pid=12306)     return forward_call(*args, **kwargs)
(ServeReplica:default:PredictDeployment pid=12306)   File "/opt/miniconda/lib/python3.8/site-packages/accelerate/hooks.py", line 165, in new_forward
(ServeReplica:default:PredictDeployment pid=12306)     output = old_forward(*args, **kwargs)
(ServeReplica:default:PredictDeployment pid=12306)   File "/tmp/ray/session_2023-11-29_11-26-47_341554_8580/runtime_resources/pip/9fc91d65ff21920e7899b7316ff121f0058d7e36/virtualenv/lib/python3.8/site-packages/transformers/models/gptj/modeling_gptj.py", line 254, in forward
(ServeReplica:default:PredictDeployment pid=12306)     present = (key.to(hidden_states.dtype), value)
(ServeReplica:default:PredictDeployment pid=12306) torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 MiB (GPU 0; 14.56 GiB total capacity; 11.48 GiB already allocated; 1.50 MiB free; 11.54 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
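
For context, the allocator's own hint in that message is to set max_split_size_mb via PYTORCH_CUDA_ALLOC_CONF. Below is a minimal sketch of how that could be wired into the GPT-J Serve deployment; only the deployment name PredictDeployment comes from the logs above, while the env var value, the fp16 load, and the request handling are illustrative assumptions, not the fix that landed in #42243.

import torch
from ray import serve
from transformers import pipeline


@serve.deployment(
    ray_actor_options={
        "num_gpus": 1,
        # Follow the error message's suggestion to limit allocator fragmentation.
        # The 128 MiB split size is an example value, not a tuned setting.
        "runtime_env": {
            "env_vars": {"PYTORCH_CUDA_ALLOC_CONF": "max_split_size_mb:128"}
        },
    }
)
class PredictDeployment:
    def __init__(self):
        # Loading GPT-J in fp16 roughly halves the weight memory on a ~16 GiB GPU.
        self.pipe = pipeline(
            "text-generation",
            model="EleutherAI/gpt-j-6b",
            torch_dtype=torch.float16,
            device=0,
        )

    async def __call__(self, request) -> str:
        # Expects a JSON body like {"prompt": "..."} (assumed request schema).
        prompt = (await request.json())["prompt"]
        return self.pipe(prompt, max_new_tokens=64)[0]["generated_text"]


app = PredictDeployment.bind()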

@anyscalesam anyscalesam added P0 Issues that should be fixed in short order and removed P1 Issue that should be fixed within a few weeks labels Jan 5, 2024
@matthewdeng matthewdeng assigned matthewdeng and unassigned justinvyu Jan 8, 2024