test_handler_metrics_saver_dist error #3621

Closed
wyli opened this issue Jan 9, 2022 · 2 comments · Fixed by #3641 or #3673
wyli commented Jan 9, 2022

Describe the bug
A frequent error in the premerge tests, e.g.:

tests/test_handler_metrics_saver_dist.py
Process SpawnProcess-1:
Traceback (most recent call last):
  File "/__w/MONAI/MONAI/tests/test_handler_metrics_saver_dist.py", line 87, in _run
    for i, row in enumerate(f_csv):
OSError: [Errno 9] Bad file descriptor

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/__w/MONAI/MONAI/tests/utils.py", line 396, in run_process
    raise e
  File "/__w/MONAI/MONAI/tests/utils.py", line 387, in run_process
    func(*args, **kwargs)
  File "/__w/MONAI/MONAI/tests/utils.py", line 557, in _call_original_func
    return f(*args, **kwargs)
  File "/__w/MONAI/MONAI/tests/test_handler_metrics_saver_dist.py", line 31, in test_content
    self._run(tempdir)
  File "/__w/MONAI/MONAI/tests/test_handler_metrics_saver_dist.py", line 87, in _run
    for i, row in enumerate(f_csv):
OSError: [Errno 9] Bad file descriptor
F
======================================================================
FAIL: test_content (__main__.DistributedMetricsSaver)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/__w/MONAI/MONAI/tests/utils.py", line 432, in _wrapper
    assert results.get(), "Distributed call failed."
AssertionError: Distributed call failed.

----------------------------------------------------------------------
Ran 1 test in 10.636s
wyli added this to the Bug Fixes or Misc improvements milestone Jan 9, 2022
wyli self-assigned this Jan 9, 2022
wyli added this to MONAI 0.9 Jan 9, 2022
wyli commented Jan 11, 2022

The root cause is that the heavy lifting happens on rank 0, so rank 1 may exit early and clear the process rank information. After that exit and cleanup, get_rank() might return a wrong number:

if dist.get_rank() == 0:

wyli mentioned this issue Jan 11, 2022
wyli moved this to Done in MONAI 0.9 Jan 12, 2022
wyli commented Jan 18, 2022

This is still an issue: https://github.com/Project-MONAI/MONAI/runs/4852155300?check_suite_focus=true

I'll add a barrier, following the comment in #3641 (comment). cc @Nic-Ma
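
For readers unfamiliar with the pattern, here is a rough sketch of why a barrier helps. It is not the actual patch: it simulates two ranks with threads from the standard library, and `threading.Barrier` stands in for a collective call such as `torch.distributed.barrier()`. The names `worker`, `run_simulation`, and `events` are made up for illustration.

```python
import threading
import time

WORLD_SIZE = 2  # two simulated ranks, as in the failing test

# Stand-in for a collective barrier such as torch.distributed.barrier().
barrier = threading.Barrier(WORLD_SIZE)
events = []  # shared log of what happened, in order
lock = threading.Lock()


def worker(rank: int) -> None:
    """Simulated per-rank body of the test (hypothetical helper)."""
    if rank == 0:
        time.sleep(0.05)  # rank 0 does the heavy lifting, e.g. reading the CSV
        with lock:
            events.append("rank0_checked_output")
    # Every rank waits here, so rank 1 cannot tear down before rank 0 is done.
    barrier.wait()
    with lock:
        events.append(f"rank{rank}_teardown")


def run_simulation() -> list:
    """Run both simulated ranks and return the recorded event order."""
    events.clear()
    threads = [threading.Thread(target=worker, args=(r,)) for r in range(WORLD_SIZE)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return list(events)


order = run_simulation()
print(order[0])  # rank 0's check is always logged before any teardown
```

Without the `barrier.wait()` call, the rank 1 thread would log its teardown immediately, before rank 0 has finished its check. That is the ordering the barrier is meant to rule out.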

wyli reopened this Jan 18, 2022
wyli added a commit to wyli/MONAI that referenced this issue Jan 18, 2022
Signed-off-by: Wenqi Li <wenqil@nvidia.com>
wyli added a commit that referenced this issue Jan 25, 2022
Signed-off-by: Wenqi Li <wenqil@nvidia.com>
wyli added a commit to wyli/MONAI that referenced this issue Jan 25, 2022
Signed-off-by: Wenqi Li <wenqil@nvidia.com>
wyli added a commit to wyli/MONAI that referenced this issue Jan 26, 2022
Signed-off-by: Wenqi Li <wenqil@nvidia.com>
wyli added a commit that referenced this issue Feb 3, 2022
* try to fix #3621 (#3673)

wyli added a commit that referenced this issue Feb 4, 2022
* try to fix #3621 (#3673)

wyli added a commit that referenced this issue Feb 7, 2022
* try to fix #3621 (#3673)

wyli added a commit that referenced this issue Feb 8, 2022
* try to fix #3621 (#3673)
