Fix memory leak issue in torch_fx tests #18547
Conversation
The documentation is not available anymore as the PR was closed or merged.
When using the new process approach, in some cases an additional setting is needed; otherwise, we might get the following error:

tests/models/bart/test_modeling_bart.py::BartModelTest::test_torch_fx

Traceback (most recent call last):
  File "/usr/lib/python3.9/multiprocessing/queues.py", line 245, in _feed
  File "/usr/lib/python3.9/multiprocessing/reduction.py", line 51, in dumps
  File "/home/yih_dar_huggingface_co/.local/lib/python3.9/site-packages/torch/multiprocessing/reductions.py", line 358, in reduce_storage
RuntimeError: unable to open shared memory object </torch_46201_690006289_939> in read-write mode: Too many open files (24)

More details:

tests/test_modeling_common.py:769:
tests/test_modeling_common.py:866: in _create_and_check_torch_fx_tracing
/usr/lib/python3.9/multiprocessing/process.py:121: in start
/usr/lib/python3.9/multiprocessing/context.py:277: in _Popen
/usr/lib/python3.9/multiprocessing/popen_fork.py:19: in __init__
self = <multiprocessing.popen_fork.Popen object at 0x7fa12a499820>, process_obj = <ForkProcess name='ForkProcess-10' parent=46201 initial>
E   OSError: [Errno 24] Too many open files
/usr/lib/python3.9/multiprocessing/popen_fork.py:64: OSError

This seems to relate to torch multiprocessing: https://discuss.pytorch.org/t/runtimeerror-unable-to-open-shared-memory-object-depending-on-the-model/116090
Another related issue (not torch): lava-nc/lava#71
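For reference, a commonly suggested workaround for this error is to switch PyTorch's tensor sharing strategy away from file descriptors. A minimal sketch (not part of this PR):

import torch.multiprocessing as mp

# The default "file_descriptor" strategy keeps one open fd per shared tensor,
# which can hit the process's open-file limit; "file_system" avoids that.
print(mp.get_all_sharing_strategies())   # on Linux: {'file_descriptor', 'file_system'}
mp.set_sharing_strategy("file_system")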
With GPU, we have to use `spawn`.
tests/test_modeling_common.py
Outdated
# Looks like `MKL_NUM_THREADS > 1` with `fork` will hang if the traced/scripted models call inputs.
# Let's use `spawn` to have a new clean process.
# (we can even use `spawn` on scheduled CI but `fork` on CircleCI if necessary)
ctx = multiprocessing.get_context("fork")
Not working with GPU/CUDA: it needs to be `spawn` in this case.
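A minimal sketch of the device-dependent choice being discussed (illustrative, not the exact test code):

import multiprocessing

import torch

# `fork` is fast but unreliable once CUDA is initialized in the parent process,
# so fall back to `spawn` whenever a GPU is involved.
start_method = "spawn" if torch.cuda.is_available() else "fork"
ctx = multiprocessing.get_context(start_method)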
tests/test_modeling_common.py
Outdated
process = ctx.Process(target=_run_torch_jit, args=(input_queue, output_queue))
process.start()
traced_model, traced_output, scripted_output, error = output_queue.get(timeout=30)
This still hangs on a GPU VM, even with `spawn`.
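One hedged way to keep the parent test from hanging forever in that situation (illustrative only; `output_queue` and `process` are the names from the snippet above):

import queue

try:
    # Bound the wait so a hang in the child shows up as a test failure
    # instead of blocking the whole suite.
    results = output_queue.get(timeout=30)
except queue.Empty:
    process.terminate()
    raise RuntimeError("child process did not return within 30 seconds")
finally:
    process.join(timeout=5)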
tests/test_modeling_common.py
Outdated
# To avoid the child process hanging on the line `traced_model = symbolic_trace(model, input_names)`.
# We will run `model.to(torch_device)` in the child process instead.
if torch_device != "cpu":
    model.to("cpu")
I don't know the exact reason, but passing a model on CUDA through a Queue causes a tracing issue. Let's pass it on CPU and send it to the CUDA device in the subprocess.
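A sketch of the pattern described here, with hypothetical names (not the exact test code): the model crosses the process boundary on CPU and is only moved to the device inside the child.

import torch


def _child_worker(in_queue, out_queue, device):
    # Hypothetical worker: receive the model on CPU, move it to the device here,
    # run it, and send results back as CPU tensors so no CUDA storage hits the queue.
    # Assumes a dict-like model output, as transformers models return.
    model, example_inputs = in_queue.get()
    model.to(device)
    with torch.no_grad():
        output = model(**example_inputs)
    out_queue.put({name: t.cpu() for name, t in output.items() if torch.is_tensor(t)})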
You cannot really use multiprocessing with CUDA; they don't play well together.
@michaelbenayoun would be the best person to review this :-)
I think it's safe to only run those tests on CPU.
Left a few comments.
Also be careful, because some models have their own "custom" implementation of the test.
tests/test_modeling_common.py
Outdated
@@ -138,6 +138,34 @@ def _config_zero_init(config):
TINY_BERT_FOR_TOKEN_CLASSIFICATION = "hf-internal-testing/tiny-bert-for-token-classification"


def _run_torch_jit(in_queue, out_queue): |
What I don't "like" here is that now a test would fail for torchscripting before failing for output mismatch.
I agree. I could even run the comparison in this function, return the matching results to the parent process, and fail the test there, if you prefer.
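A hedged sketch of that idea, with a hypothetical helper name (not the code that was merged): the child does the tracing and the comparison, and only a picklable verdict travels back, so the parent can still fail the test with a clear message.

import traceback

import torch


def _trace_and_compare(in_queue, out_queue):
    # Hypothetical worker: all heavy work happens here; only an error string
    # (or None on success) is sent back to the parent process.
    error = None
    try:
        model, example_inputs, expected = in_queue.get()
        traced = torch.jit.trace(model, example_inputs)
        torch.testing.assert_close(traced(*example_inputs), expected)
    except Exception:
        error = traceback.format_exc()
    out_queue.put(error)


# In the parent (inside the test):
#     error = out_queue.get(timeout=30)
#     assert error is None, error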
I think it's safe to only run those tests on CPU. Also, when running locally it takes ~1 min (although I agree my machine might be more powerful).
Did you run against my branch?
Yes, that would be perfect!
I did not run anything on your branch; I was speaking in general: those tests are not that long and never fail, but my machine is most likely better than the one running the CI!
It is indeed fast. This PR is not addressing the time issue, but the memory issue. Currently, each call to test_torch_fx will increase the memory usage by ~15MB.
The time issue comes after the fix, as we create new processes to run (part of) the code. Using `fork` is fine, but `spawn` will be quite slow. Currently, though, this time issue is insignificant (and `spawn` is only used on the GPU CI running on our own runners, therefore not a real constraint).
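For context, the ~15MB figure can be checked with a rough measurement like this (a sketch assuming psutil is installed; `run_single_test` is a hypothetical stand-in for one invocation of the test body):

import os

import psutil


def rss_mb() -> float:
    # Resident set size of the current process, in megabytes.
    return psutil.Process(os.getpid()).memory_info().rss / 1024**2


before = rss_mb()
for _ in range(100):
    run_single_test()  # hypothetical: one run of the torch_fx test body
print(f"average growth per run: {(rss_mb() - before) / 100:.1f} MB")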
tests/test_modeling_common.py
Outdated
model_output = model(**filtered_inputs)

# Note: `MKL_NUM_THREADS > 1` with `fork` will hang if the traced/scripted models call inputs.
How does it behave locally?
On CircleCI, we have `MKL_NUM_THREADS=1` (and we are on CPU) --> no issue.
On the scheduled CI, we have `MKL_NUM_THREADS=8`, but the device is GPU, so we use `spawn` --> no issue.
Locally, it seems that if we don't set `MKL_NUM_THREADS` explicitly, it behaves like `MKL_NUM_THREADS > 1` and hangs when running on CPU. This is not desirable; I will see if I can set it to `1` temporarily inside this test.
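One hedged way to do that temporarily inside the test (illustrative; a library that has already initialized its thread pool in the parent process may ignore a late change, so this mainly helps a child that starts with a fresh interpreter):

import os
from unittest import mock

# Force single-threaded MKL only for the duration of this block, then restore
# the previous environment; the child process inherits the patched value.
with mock.patch.dict(os.environ, {"MKL_NUM_THREADS": "1"}):
    run_torch_fx_test()  # hypothetical: the part that starts the child process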
I moved (almost) the whole testing logic to the child process. One more advantage here is that the model is created in the child process, so we don't need to pass it between processes. Now, running 100 times, we have only (per run):
@michaelbenayoun You are right, some models overwrite this test.
I think it's okay now with the changes you've made!
Would love to have an approval from you, @michaelbenayoun.
Ready for @sgugger and/or @LysandreJik to have a final check 🚀
Ok! Thanks for working on this, @ydshieh
tests/test_modeling_common.py
Outdated
@@ -138,6 +138,159 @@ def _config_zero_init(config):
TINY_BERT_FOR_TOKEN_CLASSIFICATION = "hf-internal-testing/tiny-bert-for-token-classification"


def _run_torch_jit(in_queue, out_queue):
    import traceback
Should this be at the top of the file?
It has to be outside the class (i.e., it can't be a method of the class), otherwise multiprocessing has issues with pickling the object.
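A small illustration of that constraint, with hypothetical names: with `spawn`, the target is pickled by reference, so it must be importable at module level rather than defined on the test class.

import multiprocessing


def _worker(out_queue):
    # Module-level function: picklable by reference, so a spawned child can import it.
    out_queue.put("ok")


class SomeTest:
    def test_in_subprocess(self):
        ctx = multiprocessing.get_context("spawn")
        out_queue = ctx.Queue()
        # Using a method of the test class as the target would require pickling
        # the test instance itself, which generally fails; a module-level
        # function avoids the problem.
        process = ctx.Process(target=_worker, args=(out_queue,))
        process.start()
        assert out_queue.get(timeout=10) == "ok"
        process.join()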
sounds good
I will merge this afternoon, after adding a short comment in the test file.
Hi @michaelbenayoun, I just saw that I fixed a similar issue a few months ago (see transformers/tests/test_modeling_common.py, line 719 in fbf382c).
Changed the PR to simply call …
Fix memory leak issue in torch_fx tests
Co-authored-by: Lysandre Debut <hi@lysand.re>
Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
What does this PR do?

Question: On GPU VMs, we have to use `spawn`, see here. However, it still hangs with `spawn` (I can't figure this out yet). Should we have 2 branches: one using a new process for the CPU VM (on CircleCI), and another one using the original approach (no new process) for the GPU VM, like on the scheduled CI?
I might have a solution! --> Send the model to the child process on CPU and move it to the CUDA device there.
I am going to try `torch.multiprocessing` first. (Not working either.)

Run torch_fx tests in a spawn process to avoid the memory issue. Use `JoinableQueue` instead of `Queue` for the outputs: https://discuss.pytorch.org/t/using-torch-tensor-over-multiprocessing-queue-process-fails/2847
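A hedged sketch of the `JoinableQueue` pattern from that thread (illustrative names): the producer waits via `join()` until the consumer has marked the item done, so the child does not exit while its tensors are still in flight.

import multiprocessing

import torch


def _producer(out_queue):
    out_queue.put(torch.ones(2, 2))
    out_queue.join()  # block until the parent has called task_done() for this item


if __name__ == "__main__":
    ctx = multiprocessing.get_context("spawn")
    out_queue = ctx.JoinableQueue()
    process = ctx.Process(target=_producer, args=(out_queue,))
    process.start()
    result = out_queue.get(timeout=30)
    out_queue.task_done()  # releases the producer so it can exit cleanly
    process.join()
    print(result)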