Fix MpDeviceLoaderWrapper not having attribute batch_sampler #2242

Merged: 2 commits, Dec 13, 2023

vanbasten23
Contributor

@vanbasten23 vanbasten23 commented Dec 13, 2023

What does this PR do?

Fixes # (issue)

Currently, running accelerate test on a TPU fails with the following error:

stderr: concurrent.futures.process._RemoteTraceback:
stderr: """
stderr: Traceback (most recent call last):
stderr:   File "/usr/local/lib/python3.8/concurrent/futures/process.py", line 239, in _process_worker
stderr:     r = call_item.fn(*call_item.args, **call_item.kwargs)
stderr:   File "/usr/local/lib/python3.8/concurrent/futures/process.py", line 198, in _process_chunk
stderr:     return [fn(*args) for args in chunk]
stderr:   File "/usr/local/lib/python3.8/concurrent/futures/process.py", line 198, in <listcomp>
stderr:     return [fn(*args) for args in chunk]
stderr:   File "/ansible/.venv/lib/python3.8/site-packages/torch_xla/runtime.py", line 87, in wrapper
stderr:     return fn(*args, **kwargs)
stderr:   File "/ansible/.venv/lib/python3.8/site-packages/torch_xla/_internal/pjrt.py", line 77, in _run_thread_per_device
stderr:     replica_results = list(
stderr:   File "/usr/local/lib/python3.8/concurrent/futures/_base.py", line 619, in result_iterator
stderr:     yield fs.pop().result()
stderr:   File "/usr/local/lib/python3.8/concurrent/futures/_base.py", line 444, in result
stderr:     return self.__get_result()
stderr:   File "/usr/local/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
stderr:     raise self._exception
stderr:   File "/usr/local/lib/python3.8/concurrent/futures/thread.py", line 57, in run
stderr:     result = self.fn(*self.args, **self.kwargs)
stderr:   File "/ansible/.venv/lib/python3.8/site-packages/torch_xla/_internal/pjrt.py", line 70, in _thread_fn
stderr:     return fn()
stderr:   File "/ansible/.venv/lib/python3.8/site-packages/torch_xla/_internal/pjrt.py", line 176, in __call__
stderr:     self.fn(runtime.global_ordinal(), *self.args, **self.kwargs)
stderr:   File "/root/accelerate/src/accelerate/utils/launch.py", line 562, in __call__
stderr:     self.launcher(*args)
stderr:   File "/root/accelerate/src/accelerate/test_utils/scripts/test_script.py", line 656, in main
stderr:     custom_sampler_check()
stderr:   File "/root/accelerate/src/accelerate/test_utils/scripts/test_script.py", line 328, in custom_sampler_check
stderr:     if hasattr(dl.batch_sampler, "batch_sampler"):
stderr: AttributeError: 'MpDeviceLoaderWrapper' object has no attribute 'batch_sampler'
stderr: """
stderr:
stderr: The above exception was the direct cause of the following exception:
stderr:
stderr: Traceback (most recent call last):
stderr:   File "/ansible/.venv/bin/accelerate-launch", line 8, in <module>
stderr:     sys.exit(main())
stderr:   File "/root/accelerate/src/accelerate/commands/launch.py", line 1023, in main
stderr:     launch_command(args)
stderr:   File "/root/accelerate/src/accelerate/commands/launch.py", line 1013, in launch_command
stderr:     tpu_launcher(args)
stderr:   File "/root/accelerate/src/accelerate/commands/launch.py", line 756, in tpu_launcher
stderr:     xmp.spawn(PrepareForLaunch(main_function), args=(), nprocs=args.num_processes)
stderr:   File "/ansible/.venv/lib/python3.8/site-packages/torch_xla/runtime.py", line 87, in wrapper
stderr:     return fn(*args, **kwargs)
stderr:   File "/ansible/.venv/lib/python3.8/site-packages/torch_xla/distributed/xla_multiprocessing.py", line 38, in spawn
stderr:     return pjrt.spawn(fn, nprocs, start_method, args)
stderr:   File "/ansible/.venv/lib/python3.8/site-packages/torch_xla/_internal/pjrt.py", line 200, in spawn
stderr:     run_multiprocess(spawn_fn, start_method=start_method)
stderr:   File "/ansible/.venv/lib/python3.8/site-packages/torch_xla/runtime.py", line 87, in wrapper
stderr:     return fn(*args, **kwargs)
stderr:   File "/ansible/.venv/lib/python3.8/site-packages/torch_xla/_internal/pjrt.py", line 160, in run_multiprocess
stderr:     replica_results = list(
stderr:   File "/ansible/.venv/lib/python3.8/site-packages/torch_xla/_internal/pjrt.py", line 161, in <genexpr>
stderr:     itertools.chain.from_iterable(
stderr:   File "/usr/local/lib/python3.8/concurrent/futures/process.py", line 484, in _chain_from_iterable_of_lists
stderr:     for element in iterable:
stderr:   File "/usr/local/lib/python3.8/concurrent/futures/_base.py", line 619, in result_iterator
stderr:     yield fs.pop().result()
stderr:   File "/usr/local/lib/python3.8/concurrent/futures/_base.py", line 444, in result
stderr:     return self.__get_result()
stderr:   File "/usr/local/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
stderr:     raise self._exception
stderr: AttributeError: 'MpDeviceLoaderWrapper' object has no attribute 'batch_sampler'
Traceback (most recent call last):
  File "/ansible/.venv/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/root/accelerate/src/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/root/accelerate/src/accelerate/commands/test.py", line 54, in test_command
    result = execute_subprocess_async(cmd, env=os.environ.copy())
  File "/root/accelerate/src/accelerate/test_utils/testing.py", line 465, in execute_subprocess_async
    raise RuntimeError(
RuntimeError: 'accelerate-launch /root/accelerate/src/accelerate/test_utils/scripts/test_script.py' failed with returncode 1

The combined stderr from workers follows:
WARNING:root:Unsupported nprocs (4), ignoring...
E1212 23:53:37.682445343   35243 oauth2_credentials.cc:176]            Call to http server ended with error 400 [{
  "error": "invalid_grant",
  "error_description": "reauth related error (invalid_rapt)",
  "error_uri": "https://support.google.com/a/answer/9368756",
  "error_subtype": "invalid_rapt"
}].
E1212 23:53:37.684281636   36095 oauth2_credentials.cc:176]            Call to http server ended with error 400 [{
  "error": "invalid_grant",
  "error_description": "reauth related error (invalid_rapt)",
  "error_uri": "https://support.google.com/a/answer/9368756",
  "error_subtype": "invalid_rapt"
}].
E1212 23:53:37.712086736   35241 oauth2_credentials.cc:176]            Call to http server ended with error 400 [{
  "error": "invalid_grant",
  "error_description": "reauth related error (invalid_rapt)",
  "error_uri": "https://support.google.com/a/answer/9368756",
  "error_subtype": "invalid_rapt"
}].
E1212 23:53:37.780963362   36286 oauth2_credentials.cc:176]            Call to http server ended with error 400 [{
  "error": "invalid_grant",
  "error_description": "reauth related error (invalid_rapt)",
  "error_uri": "https://support.google.com/a/answer/9368756",
  "error_subtype": "invalid_rapt"
}].
concurrent.futures.process._RemoteTraceback:
"""
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/concurrent/futures/process.py", line 239, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/usr/local/lib/python3.8/concurrent/futures/process.py", line 198, in _process_chunk
    return [fn(*args) for args in chunk]
  File "/usr/local/lib/python3.8/concurrent/futures/process.py", line 198, in <listcomp>
    return [fn(*args) for args in chunk]
  File "/ansible/.venv/lib/python3.8/site-packages/torch_xla/runtime.py", line 87, in wrapper
    return fn(*args, **kwargs)
  File "/ansible/.venv/lib/python3.8/site-packages/torch_xla/_internal/pjrt.py", line 77, in _run_thread_per_device
    replica_results = list(
  File "/usr/local/lib/python3.8/concurrent/futures/_base.py", line 619, in result_iterator
    yield fs.pop().result()
  File "/usr/local/lib/python3.8/concurrent/futures/_base.py", line 444, in result
    return self.__get_result()
  File "/usr/local/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
    raise self._exception
  File "/usr/local/lib/python3.8/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/ansible/.venv/lib/python3.8/site-packages/torch_xla/_internal/pjrt.py", line 70, in _thread_fn
    return fn()
  File "/ansible/.venv/lib/python3.8/site-packages/torch_xla/_internal/pjrt.py", line 176, in __call__
    self.fn(runtime.global_ordinal(), *self.args, **self.kwargs)
  File "/root/accelerate/src/accelerate/utils/launch.py", line 562, in __call__
    self.launcher(*args)
  File "/root/accelerate/src/accelerate/test_utils/scripts/test_script.py", line 656, in main
    custom_sampler_check()
  File "/root/accelerate/src/accelerate/test_utils/scripts/test_script.py", line 328, in custom_sampler_check
    if hasattr(dl.batch_sampler, "batch_sampler"):
AttributeError: 'MpDeviceLoaderWrapper' object has no attribute 'batch_sampler'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/ansible/.venv/bin/accelerate-launch", line 8, in <module>
    sys.exit(main())
  File "/root/accelerate/src/accelerate/commands/launch.py", line 1023, in main
    launch_command(args)
  File "/root/accelerate/src/accelerate/commands/launch.py", line 1013, in launch_command
    tpu_launcher(args)
  File "/root/accelerate/src/accelerate/commands/launch.py", line 756, in tpu_launcher
    xmp.spawn(PrepareForLaunch(main_function), args=(), nprocs=args.num_processes)
  File "/ansible/.venv/lib/python3.8/site-packages/torch_xla/runtime.py", line 87, in wrapper
    return fn(*args, **kwargs)
  File "/ansible/.venv/lib/python3.8/site-packages/torch_xla/distributed/xla_multiprocessing.py", line 38, in spawn
    return pjrt.spawn(fn, nprocs, start_method, args)
  File "/ansible/.venv/lib/python3.8/site-packages/torch_xla/_internal/pjrt.py", line 200, in spawn
    run_multiprocess(spawn_fn, start_method=start_method)
  File "/ansible/.venv/lib/python3.8/site-packages/torch_xla/runtime.py", line 87, in wrapper
    return fn(*args, **kwargs)
  File "/ansible/.venv/lib/python3.8/site-packages/torch_xla/_internal/pjrt.py", line 160, in run_multiprocess
    replica_results = list(
  File "/ansible/.venv/lib/python3.8/site-packages/torch_xla/_internal/pjrt.py", line 161, in <genexpr>
    itertools.chain.from_iterable(
  File "/usr/local/lib/python3.8/concurrent/futures/process.py", line 484, in _chain_from_iterable_of_lists
    for element in iterable:
  File "/usr/local/lib/python3.8/concurrent/futures/_base.py", line 619, in result_iterator
    yield fs.pop().result()
  File "/usr/local/lib/python3.8/concurrent/futures/_base.py", line 444, in result
    return self.__get_result()
  File "/usr/local/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
    raise self._exception
AttributeError: 'MpDeviceLoaderWrapper' object has no attribute 'batch_sampler'

The full output is at https://gist.github.com/vanbasten23/521846344942fa07ae0ae681d18a23bf.
I'm not sure whether this was caused by https://github.com/huggingface/accelerate/pull/2097/files, but this PR is intended to fix the error. With the fix, the AttributeError goes away: https://gist.github.com/vanbasten23/004826dc3aa3d2408444306222a1fee2 (though the test then fails with a different error further down).
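
For readers who want to see the failure mode in isolation, here is a minimal sketch, assuming the wrapper keeps the wrapped loader in a private attribute and that forwarding batch_sampler through a property is enough to satisfy the check in custom_sampler_check. The classes FakeLoader, BareWrapper and ForwardingWrapper are hypothetical stand-ins written in plain Python; they are not the actual accelerate or torch_xla classes and not necessarily the change made in this PR.

```python
class FakeLoader:
    """Hypothetical stand-in for a DataLoader that exposes batch_sampler."""

    def __init__(self, batch_sampler):
        self.batch_sampler = batch_sampler

    def __iter__(self):
        return iter(self.batch_sampler)


class BareWrapper:
    """Wraps a loader and forwards only iteration (reproduces the failure)."""

    def __init__(self, loader):
        self._loader = loader  # assumed attribute name, for illustration only

    def __iter__(self):
        return iter(self._loader)


class ForwardingWrapper(BareWrapper):
    """Same wrapper, but exposes the wrapped loader's batch_sampler."""

    @property
    def batch_sampler(self):
        return self._loader.batch_sampler


loader = FakeLoader(batch_sampler=[[0, 1], [2, 3]])

# Without forwarding, attribute access fails as in the traceback above.
try:
    BareWrapper(loader).batch_sampler
    bare_error = None
except AttributeError as exc:
    bare_error = str(exc)

# With the property, checks like hasattr(dl.batch_sampler, "batch_sampler")
# can at least evaluate instead of raising.
wrapped = ForwardingWrapper(loader)
has_nested_sampler = hasattr(wrapped.batch_sampler, "batch_sampler")
```

A property is used here rather than copying the attribute in __init__ so that the wrapper always reflects the wrapped loader's current sampler.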

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR. @muellerzr

Collaborator

@muellerzr muellerzr left a comment

Thanks!

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@muellerzr muellerzr merged commit ad3a5bc into huggingface:main Dec 13, 2023
23 checks passed
@vanbasten23
Contributor Author

Thanks for the review!
