
device agnostic fsdp testing #27120

Merged
3 commits merged on Nov 1, 2023

Conversation

statelesshz
Contributor

@statelesshz commented Oct 28, 2023

What does this PR do?

Part of #25654 (comment)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

cc @ydshieh and @ArthurZucker

@amyeroberts
Collaborator

cc @ydshieh

Collaborator

@ydshieh left a comment

works for me :-)

@ydshieh
Collaborator

ydshieh commented Oct 30, 2023

@statelesshz How do these tests behave on multiple NPU devices? Do they pass or fail?

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

@statelesshz
Contributor Author

@ydshieh Only one test case failed, due to a missing communication operator, and I will upload the test log tomorrow :-)
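
For context, the failure shown later in the log is `RuntimeError: ProcessGroupHCCL does not support gather`, raised from `dist.gather_object` while `torch.distributed.checkpoint` plans the SHARDED_STATE_DICT save. Below is a minimal two-process sketch of that call plus an all_gather-based alternative; the file name `repro_gather.py`, the two-rank setup, and the assumption that `all_gather_object` works over HCCL are illustrative guesses, not part of this PR.

repro_gather.py

# Illustrative sketch only, not part of this PR.
# Assumes torch_npu is installed and two NPUs are visible; launch with e.g.
#   torchrun --nproc_per_node 2 repro_gather.py
import os

import torch
import torch.distributed as dist
import torch_npu  # noqa: F401  registers the NPU device and the HCCL backend


def main():
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.npu.set_device(local_rank)
    dist.init_process_group(backend="hccl")
    rank, world_size = dist.get_rank(), dist.get_world_size()

    payload = {"rank": rank}

    # torch.distributed.checkpoint gathers per-rank save plans onto rank 0 with
    # gather_object; on this setup that is the call which raises
    # "RuntimeError: ProcessGroupHCCL does not support gather".
    try:
        bucket = [None] * world_size if rank == 0 else None
        dist.gather_object(payload, bucket, dst=0)
        if rank == 0:
            print("gather_object result:", bucket)
    except RuntimeError as err:
        print(f"rank {rank}: gather_object failed: {err}")

    # all_gather_object is one conceivable workaround (assumption: the HCCL
    # backend supports all_gather); every rank ends up with the full list.
    gathered = [None] * world_size
    dist.all_gather_object(gathered, payload)
    if rank == 0:
        print("all_gather_object result:", gathered)

    dist.destroy_process_group()


if __name__ == "__main__":
    main()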

@statelesshz
Contributor Author

System info


(hf_test) [root@localhost transformers]# transformers-cli env
Fail to import hypothesis in common_utils, tests are not derandomized

Copy-and-paste the text below in your GitHub issue and FILL OUT the two last points.

- `transformers` version: 4.35.0.dev0
- Platform: Linux-4.19.90-vhulk2211.3.0.h1543.eulerosv2r10.aarch64-aarch64-with-glibc2.26
- Python version: 3.8.18
- Huggingface_hub version: 0.17.3
- Safetensors version: 0.4.0
- Accelerate version: 0.24.0
- Accelerate config: 	not found
- PyTorch version (GPU?): 2.1.0 (False)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: <fill in>
- Using distributed or parallel set-up in script?: <fill in>

(hf_test) [root@localhost transformers]# accelerate env
Fail to import hypothesis in common_utils, tests are not derandomized

Copy-and-paste the text below in your GitHub issue

- `Accelerate` version: 0.24.0
- Platform: Linux-4.19.90-vhulk2211.3.0.h1543.eulerosv2r10.aarch64-aarch64-with-glibc2.26
- Python version: 3.8.18
- Numpy version: 1.24.4
- PyTorch version (GPU?): 2.1.0 (False)
- PyTorch XPU available: False
- PyTorch NPU available: True
- System RAM: 755.10 GB
- `Accelerate` default config:
	Not found

test result

spec.py

import torch
import torch_npu
# !! Further additional imports can be added here !!
# Specify the device name (e.g. 'cuda', 'cpu', 'npu')
DEVICE_NAME = 'npu:0'
# Specify device-specific backends to dispatch to.
# If not specified, will fall back to 'default' in `testing_utils.py`
MANUAL_SEED_FN = torch.npu.manual_seed_all
EMPTY_CACHE_FN = torch.npu.empty_cache
DEVICE_COUNT_FN = torch.npu.device_count
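
This spec file is picked up through the `TRANSFORMERS_TEST_DEVICE_SPEC` environment variable used in the command below. As a rough sketch of the idea only (not the actual `testing_utils.py` code; the helper name `load_device_spec` is made up), dynamically loading such a module could look like this:

load_spec_sketch.py

# Rough sketch of the idea behind TRANSFORMERS_TEST_DEVICE_SPEC -- illustrative
# only; the real logic lives in src/transformers/testing_utils.py.
import importlib.util
import os
import sys


def load_device_spec():
    """Import the user-supplied spec module and pull out the device hooks."""
    spec_path = os.environ.get("TRANSFORMERS_TEST_DEVICE_SPEC")
    if spec_path is None:
        return None

    module_name = os.path.splitext(os.path.basename(spec_path))[0]
    spec = importlib.util.spec_from_file_location(module_name, spec_path)
    module = importlib.util.module_from_spec(spec)
    sys.modules[module_name] = module
    spec.loader.exec_module(module)

    # DEVICE_NAME is required; the *_FN hooks fall back to harmless defaults.
    device = getattr(module, "DEVICE_NAME")
    hooks = {
        "manual_seed_all": getattr(module, "MANUAL_SEED_FN", lambda seed: None),
        "empty_cache": getattr(module, "EMPTY_CACHE_FN", lambda: None),
        "device_count": getattr(module, "DEVICE_COUNT_FN", lambda: 1),
    }
    return device, hooks


if __name__ == "__main__":
    loaded = load_device_spec()
    print("no spec configured" if loaded is None else f"loaded spec for {loaded[0]}")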

Run the following command:

RUN_SLOW=1 TRANSFORMERS_TEST_BACKEND="torch_npu" TRANSFORMERS_TEST_DEVICE="npu:0" TRANSFORMERS_TEST_DEVICE_SPEC="spec.py" python -m pytest -v  tests/fsdp/

The output is as follows:

============================================================================================================= test session starts ==============================================================================================================
platform linux -- Python 3.8.18, pytest-7.4.3, pluggy-1.3.0 -- /data/anaconda/envs/hf_test/bin/python
cachedir: .pytest_cache
rootdir: /data/hf_test/transformers
configfile: setup.cfg
collected 12 items                                                                                                                                                                                                                             

tests/fsdp/test_fsdp.py::TrainerIntegrationFSDP::test_basic_run_full_shard_bf16 PASSED                                                                                                                                                   [  8%]
tests/fsdp/test_fsdp.py::TrainerIntegrationFSDP::test_basic_run_full_shard_fp16 PASSED                                                                                                                                                   [ 16%]
tests/fsdp/test_fsdp.py::TrainerIntegrationFSDP::test_basic_run_shard_grad_op_bf16 PASSED                                                                                                                                                [ 25%]
tests/fsdp/test_fsdp.py::TrainerIntegrationFSDP::test_basic_run_shard_grad_op_fp16 PASSED                                                                                                                                                [ 33%]
tests/fsdp/test_fsdp.py::TrainerIntegrationFSDP::test_basic_run_with_cpu_offload_0_fp16 PASSED                                                                                                                                           [ 41%]
tests/fsdp/test_fsdp.py::TrainerIntegrationFSDP::test_basic_run_with_cpu_offload_1_bf16 PASSED                                                                                                                                           [ 50%]
tests/fsdp/test_fsdp.py::TrainerIntegrationFSDP::test_fsdp_config_full_shard_bf16 PASSED                                                                                                                                                 [ 58%]
tests/fsdp/test_fsdp.py::TrainerIntegrationFSDP::test_fsdp_config_full_shard_fp16 PASSED                                                                                                                                                 [ 66%]
tests/fsdp/test_fsdp.py::TrainerIntegrationFSDP::test_fsdp_config_shard_grad_op_bf16 PASSED                                                                                                                                              [ 75%]
tests/fsdp/test_fsdp.py::TrainerIntegrationFSDP::test_fsdp_config_shard_grad_op_fp16 PASSED                                                                                                                                              [ 83%]
tests/fsdp/test_fsdp.py::TrainerIntegrationFSDP::test_training_and_can_resume_normally_FULL_STATE_DICT PASSED                                                                                                                            [ 91%]
tests/fsdp/test_fsdp.py::TrainerIntegrationFSDP::test_training_and_can_resume_normally_SHARDED_STATE_DICT FAILED                                                                                                                         [100%]

=================================================================================================================== FAILURES ===================================================================================================================
_______________________________________________________________________________ TrainerIntegrationFSDP.test_training_and_can_resume_normally_SHARDED_STATE_DICT ________________________________________________________________________________

a = (<test_fsdp.TrainerIntegrationFSDP testMethod=test_training_and_can_resume_normally_SHARDED_STATE_DICT>,), kw = {}

    @wraps(func)
    def standalone_func(*a, **kw):
>       return func(*(a + p.args), **p.kwargs, **kw)

/data/anaconda/envs/hf_test/lib/python3.8/site-packages/parameterized/parameterized.py:620: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
tests/fsdp/test_fsdp.py:209: in test_training_and_can_resume_normally
    logs = self.run_cmd_and_get_logs(use_accelerate, sharding_strategy, launcher, script, args, output_dir)
tests/fsdp/test_fsdp.py:239: in run_cmd_and_get_logs
    execute_subprocess_async(cmd, env=self.get_env())
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

cmd = ['accelerate', 'launch', '--num_processes', '2', '--main_process_port', '10999', ...]
env = {'ASCEND_AICPU_PATH': '/data/hf_test/ascend-toolkit/latest', 'ASCEND_HOME_PATH': '/data/hf_test/ascend-toolkit/latest'...PP_PATH': '/data/hf_test/ascend-toolkit/latest/opp', 'ASCEND_TOOLKIT_HOME': '/data/hf_test/ascend-toolkit/latest', ...}
stdin = None, timeout = 180, quiet = False, echo = True

    def execute_subprocess_async(cmd, env=None, stdin=None, timeout=180, quiet=False, echo=True) -> _RunOutput:
        loop = asyncio.get_event_loop()
        result = loop.run_until_complete(
            _stream_subprocess(cmd, env=env, stdin=stdin, timeout=timeout, quiet=quiet, echo=echo)
        )
    
        cmd_str = " ".join(cmd)
        if result.returncode > 0:
            stderr = "\n".join(result.stderr)
>           raise RuntimeError(
                f"'{cmd_str}' failed with returncode {result.returncode}\n\n"
                f"The combined stderr from workers follows:\n{stderr}"
            )
E           RuntimeError: 'accelerate launch --num_processes 2 --main_process_port 10999 --use_fsdp --fsdp_auto_wrap_policy TRANSFORMER_BASED_WRAP --fsdp_state_dict_type SHARDED_STATE_DICT --fsdp_transformer_layer_cls_to_wrap BertLayer --fsdp_sharding_strategy 1 /data/hf_test/transformers/examples/pytorch/text-classification/run_glue.py --model_name_or_path /data/hf_test/bert-base-cased --task_name mrpc --output_dir ./xxx --overwrite_output_dir --do_train --max_seq_length 128 --per_device_train_batch_size 16 --learning_rate 5e-5 --num_train_epochs 2 --lr_scheduler_type cosine --logging_steps 25 --save_strategy epoch --do_eval --evaluation_strategy epoch --report_to none' failed with returncode 1
E           
E           The combined stderr from workers follows:
E           The following values were not passed to `accelerate launch` and had defaults used instead:
E           		More than one GPU was found, enabling multi-GPU training.
E           		If this was unintended please pass in `--num_processes=1`.
E           	`--num_machines` was set to a value of `1`
E           	`--mixed_precision` was set to a value of `'no'`
E           	`--dynamo_backend` was set to a value of `'no'`
E           To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
E           Using the latest cached version of the module from /root/.cache/huggingface/modules/datasets_modules/datasets/glue/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad (last modified on Thu Oct 26 17:35:09 2023) since it couldn't be found locally at glue., or remotely on the Hugging Face Hub.
E           Loading Dataset Infos from /root/.cache/huggingface/modules/datasets_modules/datasets/glue/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad
E           Overwrite dataset info from restored data version if exists.
E           Loading Dataset info from /root/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad
E           Found cached dataset glue (/root/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)
E           Loading Dataset info from /root/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad
E           [INFO|configuration_utils.py:714] 2023-10-31 09:38:07,303 >> loading configuration file /data/hf_test/bert-base-cased/config.json
E           [INFO|configuration_utils.py:776] 2023-10-31 09:38:07,314 >> Model config BertConfig {
E             "_name_or_path": "/data/hf_test/bert-base-cased",
E             "architectures": [
E               "BertForMaskedLM"
E             ],
E             "attention_probs_dropout_prob": 0.1,
E             "classifier_dropout": null,
E             "finetuning_task": "mrpc",
E             "gradient_checkpointing": false,
E             "hidden_act": "gelu",
E             "hidden_dropout_prob": 0.1,
E             "hidden_size": 768,
E             "initializer_range": 0.02,
E             "intermediate_size": 3072,
E             "layer_norm_eps": 1e-12,
E             "max_position_embeddings": 512,
E             "model_type": "bert",
E             "num_attention_heads": 12,
E             "num_hidden_layers": 12,
E             "pad_token_id": 0,
E             "position_embedding_type": "absolute",
E             "transformers_version": "4.35.0.dev0",
E             "type_vocab_size": 2,
E             "use_cache": true,
E             "vocab_size": 28996
E           }
E           
E           [INFO|configuration_utils.py:714] 2023-10-31 09:38:07,314 >> loading configuration file /data/hf_test/bert-base-cased/config.json
E           [INFO|configuration_utils.py:776] 2023-10-31 09:38:07,316 >> Model config BertConfig {
E             "_name_or_path": "/data/hf_test/bert-base-cased",
E             "architectures": [
E               "BertForMaskedLM"
E             ],
E             "attention_probs_dropout_prob": 0.1,
E             "classifier_dropout": null,
E             "gradient_checkpointing": false,
E             "hidden_act": "gelu",
E             "hidden_dropout_prob": 0.1,
E             "hidden_size": 768,
E             "initializer_range": 0.02,
E             "intermediate_size": 3072,
E             "layer_norm_eps": 1e-12,
E             "max_position_embeddings": 512,
E             "model_type": "bert",
E             "num_attention_heads": 12,
E             "num_hidden_layers": 12,
E             "pad_token_id": 0,
E             "position_embedding_type": "absolute",
E             "transformers_version": "4.35.0.dev0",
E             "type_vocab_size": 2,
E             "use_cache": true,
E             "vocab_size": 28996
E           }
E           
E           [INFO|tokenization_utils_base.py:2019] 2023-10-31 09:38:07,316 >> loading file vocab.txt
E           [INFO|tokenization_utils_base.py:2019] 2023-10-31 09:38:07,316 >> loading file tokenizer.json
E           [INFO|tokenization_utils_base.py:2019] 2023-10-31 09:38:07,317 >> loading file added_tokens.json
E           [INFO|tokenization_utils_base.py:2019] 2023-10-31 09:38:07,317 >> loading file special_tokens_map.json
E           [INFO|tokenization_utils_base.py:2019] 2023-10-31 09:38:07,317 >> loading file tokenizer_config.json
E           [INFO|configuration_utils.py:714] 2023-10-31 09:38:07,317 >> loading configuration file /data/hf_test/bert-base-cased/config.json
E           [INFO|configuration_utils.py:776] 2023-10-31 09:38:07,318 >> Model config BertConfig {
E             "_name_or_path": "/data/hf_test/bert-base-cased",
E             "architectures": [
E               "BertForMaskedLM"
E             ],
E             "attention_probs_dropout_prob": 0.1,
E             "classifier_dropout": null,
E             "gradient_checkpointing": false,
E             "hidden_act": "gelu",
E             "hidden_dropout_prob": 0.1,
E             "hidden_size": 768,
E             "initializer_range": 0.02,
E             "intermediate_size": 3072,
E             "layer_norm_eps": 1e-12,
E             "max_position_embeddings": 512,
E             "model_type": "bert",
E             "num_attention_heads": 12,
E             "num_hidden_layers": 12,
E             "pad_token_id": 0,
E             "position_embedding_type": "absolute",
E             "transformers_version": "4.35.0.dev0",
E             "type_vocab_size": 2,
E             "use_cache": true,
E             "vocab_size": 28996
E           }
E           
E           Using the latest cached version of the module from /root/.cache/huggingface/modules/datasets_modules/datasets/glue/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad (last modified on Thu Oct 26 17:35:09 2023) since it couldn't be found locally at glue., or remotely on the Hugging Face Hub.
E           [INFO|modeling_utils.py:3057] 2023-10-31 09:38:07,393 >> loading weights file /data/hf_test/bert-base-cased/pytorch_model.bin
E           [INFO|modeling_utils.py:3838] 2023-10-31 09:38:08,324 >> Some weights of the model checkpoint at /data/hf_test/bert-base-cased were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias']
E           - This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
E           - This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
E           [WARNING|modeling_utils.py:3850] 2023-10-31 09:38:08,324 >> Some weights of BertForSequenceClassification were not initialized from the model checkpoint at /data/hf_test/bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
E           You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
E           Loading cached processed dataset at /root/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-f2e61c34c9899b5a.arrow
E           Loading cached processed dataset at /root/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-fd9184904bb613ef.arrow
E           Loading cached processed dataset at /root/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-e2ab4fdde1bba06e.arrow
E           [WARNING|modeling_utils.py:3850] 2023-10-31 09:38:08,625 >> Some weights of BertForSequenceClassification were not initialized from the model checkpoint at /data/hf_test/bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
E           You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
E           [INFO|trainer.py:698] 2023-10-31 09:38:12,532 >> The following columns in the training set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: idx, sentence2, sentence1. If idx, sentence2, sentence1 are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
E           [INFO|trainer.py:1674] 2023-10-31 09:38:13,434 >> ***** Running training *****
E           [INFO|trainer.py:1675] 2023-10-31 09:38:13,435 >>   Num examples = 3,668
E           [INFO|trainer.py:1676] 2023-10-31 09:38:13,435 >>   Num Epochs = 2
E           [INFO|trainer.py:1677] 2023-10-31 09:38:13,435 >>   Instantaneous batch size per device = 16
E           [INFO|trainer.py:1680] 2023-10-31 09:38:13,435 >>   Total train batch size (w. parallel, distributed & accumulation) = 32
E           [INFO|trainer.py:1681] 2023-10-31 09:38:13,435 >>   Gradient Accumulation steps = 1
E           [INFO|trainer.py:1682] 2023-10-31 09:38:13,435 >>   Total optimization steps = 230
E           [INFO|trainer.py:1683] 2023-10-31 09:38:13,436 >>   Number of trainable parameters = 54,155,905
 50%|█████     | 115/230 [00:14<00:13,  8.55it/s][INFO|trainer.py:698] 2023-10-31 09:38:27,965 >> The following columns in the evaluation set don't have a corresponding argument in `FullyShardedDataParallel.forward` and have been ignored: idx, sentence2, sentence1. If idx, sentence2, sentence1 are not expected by `FullyShardedDataParallel.forward`,  you can safely ignore this message.
E           [INFO|trainer.py:3093] 2023-10-31 09:38:27,969 >> ***** Running Evaluation *****
E           [INFO|trainer.py:3095] 2023-10-31 09:38:27,969 >>   Num examples = 408
E           [INFO|trainer.py:3098] 2023-10-31 09:38:27,969 >>   Batch size = 8
 50%|█████     | 115/230 [00:15<00:13,  8.55it/[INFO|trainer.py:2816] 2023-10-31 09:38:29,156 >> Saving model checkpoint to ./xxx/checkpoint-115
E           [INFO|configuration_utils.py:461] 2023-10-31 09:38:29,158 >> Configuration saved in ./xxx/checkpoint-115/config.json
E           [INFO|modeling_utils.py:2168] 2023-10-31 09:38:29,159 >> Model weights saved in ./xxx/checkpoint-115/pytorch_model.bin
E           [INFO|tokenization_utils_base.py:2426] 2023-10-31 09:38:29,159 >> tokenizer config file saved in ./xxx/checkpoint-115/tokenizer_config.json
E           [INFO|tokenization_utils_base.py:2435] 2023-10-31 09:38:29,160 >> Special tokens file saved in ./xxx/checkpoint-115/special_tokens_map.json
E           /data/anaconda/envs/hf_test/lib/python3.8/site-packages/torch/distributed/_shard/sharded_tensor/api.py:1121: UserWarning: Please use DTensor instead and we are deprecating ShardedTensor.
E             warnings.warn(DEPRECATE_MSG)
E           /data/anaconda/envs/hf_test/lib/python3.8/site-packages/torch/distributed/_shard/sharded_tensor/api.py:1121: UserWarning: Please use DTensor instead and we are deprecating ShardedTensor.
E             warnings.warn(DEPRECATE_MSG)
E           Traceback (most recent call last):
E             File "/data/hf_test/transformers/examples/pytorch/text-classification/run_glue.py", line 649, in <module>
E               main()
E             File "/data/hf_test/transformers/examples/pytorch/text-classification/run_glue.py", line 557, in main
E               train_result = trainer.train(resume_from_checkpoint=checkpoint)
E             File "/data/hf_test/transformers/src/transformers/trainer.py", line 1511, in train
E               return inner_training_loop(
E             File "/data/hf_test/transformers/src/transformers/trainer.py", line 1894, in _inner_training_loop
E               self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
E             File "/data/hf_test/transformers/src/transformers/trainer.py", line 2234, in _maybe_log_save_evaluate
E               self._save_checkpoint(model, trial, metrics=metrics)
E             File "/data/hf_test/transformers/src/transformers/trainer.py", line 2291, in _save_checkpoint
E               self.save_model(output_dir, _internal_call=True)
E             File "/data/hf_test/transformers/src/transformers/trainer.py", line 2756, in save_model
E               save_fsdp_model(self.accelerator.state.fsdp_plugin, self.accelerator, self.model, output_dir)
E             File "/data/anaconda/envs/hf_test/lib/python3.8/site-packages/accelerate/utils/fsdp_utils.py", line 72, in save_fsdp_model
E               dist_cp.save_state_dict(
E             File "/data/anaconda/envs/hf_test/lib/python3.8/site-packages/torch/distributed/checkpoint/state_dict_saver.py", line 113, in save_state_dict
E               central_plan = distW.reduce_scatter("plan", local_step, global_step)
E             File "/data/anaconda/envs/hf_test/lib/python3.8/site-packages/torch/distributed/checkpoint/utils.py", line 177, in reduce_scatter
E               all_data = self.gather_object(local_data)
E             File "/data/anaconda/envs/hf_test/lib/python3.8/site-packages/torch/distributed/checkpoint/utils.py", line 108, in gather_object
E               dist.gather_object(
E             File "/data/anaconda/envs/hf_test/lib/python3.8/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
E               return func(*args, **kwargs)
E             File "/data/anaconda/envs/hf_test/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2509, in gather_object
E           Traceback (most recent call last):
E             File "/data/hf_test/transformers/examples/pytorch/text-classification/run_glue.py", line 649, in <module>
E               main()
E             File "/data/hf_test/transformers/examples/pytorch/text-classification/run_glue.py", line 557, in main
E               gather(
E             File "/data/anaconda/envs/hf_test/lib/python3.8/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
E               return func(*args, **kwargs)
E             File "/data/anaconda/envs/hf_test/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 3078, in gather
E               train_result = trainer.train(resume_from_checkpoint=checkpoint)
E             File "/data/hf_test/transformers/src/transformers/trainer.py", line 1511, in train
E               return inner_training_loop(
E             File "/data/hf_test/transformers/src/transformers/trainer.py", line 1894, in _inner_training_loop
E               work = default_pg.gather(output_tensors, input_tensors, opts)
E           RuntimeError: ProcessGroupHCCL does not support gather
E               self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
E             File "/data/hf_test/transformers/src/transformers/trainer.py", line 2234, in _maybe_log_save_evaluate
E               self._save_checkpoint(model, trial, metrics=metrics)
E             File "/data/hf_test/transformers/src/transformers/trainer.py", line 2291, in _save_checkpoint
E               self.save_model(output_dir, _internal_call=True)
E             File "/data/hf_test/transformers/src/transformers/trainer.py", line 2756, in save_model
E               save_fsdp_model(self.accelerator.state.fsdp_plugin, self.accelerator, self.model, output_dir)
E             File "/data/anaconda/envs/hf_test/lib/python3.8/site-packages/accelerate/utils/fsdp_utils.py", line 72, in save_fsdp_model
E               dist_cp.save_state_dict(
E             File "/data/anaconda/envs/hf_test/lib/python3.8/site-packages/torch/distributed/checkpoint/state_dict_saver.py", line 113, in save_state_dict
E               central_plan = distW.reduce_scatter("plan", local_step, global_step)
E             File "/data/anaconda/envs/hf_test/lib/python3.8/site-packages/torch/distributed/checkpoint/utils.py", line 177, in reduce_scatter
E               all_data = self.gather_object(local_data)
E             File "/data/anaconda/envs/hf_test/lib/python3.8/site-packages/torch/distributed/checkpoint/utils.py", line 108, in gather_object
E               dist.gather_object(
E             File "/data/anaconda/envs/hf_test/lib/python3.8/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
E               return func(*args, **kwargs)
E             File "/data/anaconda/envs/hf_test/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2509, in gather_object
E               gather(
E             File "/data/anaconda/envs/hf_test/lib/python3.8/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
E               return func(*args, **kwargs)
E             File "/data/anaconda/envs/hf_test/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 3078, in gather
E               work = default_pg.gather(output_tensors, input_tensors, opts)
E           RuntimeError: ProcessGroupHCCL does not support gather
E           /data/anaconda/envs/hf_test/lib/python3.8/tempfile.py:818: ResourceWarning: Implicitly cleaning up <TemporaryDirectory '/tmp/tmp9dfsya53'>
E             _warnings.warn(warn_message, ResourceWarning)
 50%|█████     | 115/230 [00:17<00:17,  6.63it/s]
E           /data/anaconda/envs/hf_test/lib/python3.8/tempfile.py:818: ResourceWarning: Implicitly cleaning up <TemporaryDirectory '/tmp/tmpzdu72fdr'>
E             _warnings.warn(warn_message, ResourceWarning)
E           [2023-10-31 09:38:36,223] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 3079461) of binary: /data/anaconda/envs/hf_test/bin/python
E           Traceback (most recent call last):
E             File "/data/anaconda/envs/hf_test/bin/accelerate", line 8, in <module>
E               sys.exit(main())
E             File "/data/anaconda/envs/hf_test/lib/python3.8/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
E               args.func(args)
E             File "/data/anaconda/envs/hf_test/lib/python3.8/site-packages/accelerate/commands/launch.py", line 981, in launch_command
E               multi_gpu_launcher(args)
E             File "/data/anaconda/envs/hf_test/lib/python3.8/site-packages/accelerate/commands/launch.py", line 654, in multi_gpu_launcher
E               distrib_run.run(args)
E             File "/data/anaconda/envs/hf_test/lib/python3.8/site-packages/torch/distributed/run.py", line 797, in run
E               elastic_launch(
E             File "/data/anaconda/envs/hf_test/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
E               return launch_agent(self._config, self._entrypoint, list(args))
E             File "/data/anaconda/envs/hf_test/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
E               raise ChildFailedError(
E           torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
E           ============================================================
E           /data/hf_test/transformers/examples/pytorch/text-classification/run_glue.py FAILED
E           ------------------------------------------------------------
E           Failures:
E           [1]:
E             time      : 2023-10-31_09:38:36
E             host      : localhost.localdomain
E             rank      : 1 (local_rank: 1)
E             exitcode  : 1 (pid: 3079463)
E             error_file: <N/A>
E             traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
E           ------------------------------------------------------------
E           Root Cause (first observed failure):
E           [0]:
E             time      : 2023-10-31_09:38:36
E             host      : localhost.localdomain
E             rank      : 0 (local_rank: 0)
E             exitcode  : 1 (pid: 3079461)
E             error_file: <N/A>
E             traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
E           ============================================================

src/transformers/testing_utils.py:1835: RuntimeError
------------------------------------------------------------------------------------------------------------- Captured stdout call -------------------------------------------------------------------------------------------------------------

Running:  accelerate launch --num_processes 2 --main_process_port 10999 --use_fsdp --fsdp_auto_wrap_policy TRANSFORMER_BASED_WRAP --fsdp_state_dict_type SHARDED_STATE_DICT --fsdp_transformer_layer_cls_to_wrap BertLayer --fsdp_sharding_strategy 1 /data/hf_test/transformers/examples/pytorch/text-classification/run_glue.py --model_name_or_path /data/hf_test/bert-base-cased --task_name mrpc --output_dir ./xxx --overwrite_output_dir --do_train --max_seq_length 128 --per_device_train_batch_size 16 --learning_rate 5e-5 --num_train_epochs 2 --lr_scheduler_type cosine --logging_steps 25 --save_strategy epoch --do_eval --evaluation_strategy epoch --report_to none
stdout: Fail to import hypothesis in common_utils, tests are not derandomized
stdout: Fail to import hypothesis in common_utils, tests are not derandomized
stdout: 10/31/2023 09:38:07 - WARNING - __main__ - Process rank: 0, device: npu:0, n_gpu: 1distributed training: True, 16-bits training: False
stdout: 10/31/2023 09:38:07 - INFO - __main__ - Training/evaluation parameters TrainingArguments(
stdout: _n_gpu=1,
stdout: adafactor=False,
stdout: adam_beta1=0.9,
stdout: adam_beta2=0.999,
stdout: adam_epsilon=1e-08,
stdout: auto_find_batch_size=False,
stdout: bf16=False,
stdout: bf16_full_eval=False,
stdout: data_seed=None,
stdout: dataloader_drop_last=False,
stdout: dataloader_num_workers=0,
stdout: dataloader_pin_memory=True,
stdout: ddp_backend=None,
stdout: ddp_broadcast_buffers=None,
stdout: ddp_bucket_cap_mb=None,
stdout: ddp_find_unused_parameters=None,
stdout: ddp_timeout=1800,
stdout: debug=[],
stdout: deepspeed=None,
stdout: disable_tqdm=False,
stdout: dispatch_batches=None,
stdout: do_eval=True,
stdout: do_predict=False,
stdout: do_train=True,
stdout: eval_accumulation_steps=None,
stdout: eval_delay=0,
stdout: eval_steps=None,
stdout: evaluation_strategy=epoch,
stdout: fp16=False,
stdout: fp16_backend=auto,
stdout: fp16_full_eval=False,
stdout: fp16_opt_level=O1,
stdout: fsdp=[],
stdout: fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False},
stdout: fsdp_min_num_params=0,
stdout: fsdp_transformer_layer_cls_to_wrap=None,
stdout: full_determinism=False,
stdout: gradient_accumulation_steps=1,
stdout: gradient_checkpointing=False,
stdout: greater_is_better=None,
stdout: group_by_length=False,
stdout: half_precision_backend=auto,
stdout: hub_always_push=False,
stdout: hub_model_id=None,
stdout: hub_private_repo=False,
stdout: hub_strategy=every_save,
stdout: hub_token=<HUB_TOKEN>,
stdout: ignore_data_skip=False,
stdout: include_inputs_for_metrics=False,
stdout: include_tokens_per_second=False,
stdout: jit_mode_eval=False,
stdout: label_names=None,
stdout: label_smoothing_factor=0.0,
stdout: learning_rate=5e-05,
stdout: length_column_name=length,
stdout: load_best_model_at_end=False,
stdout: local_rank=0,
stdout: log_level=passive,
stdout: log_level_replica=warning,
stdout: log_on_each_node=True,
stdout: logging_dir=./xxx/runs/Oct31_09-37-58_localhost.localdomain,
stdout: logging_first_step=False,
stdout: logging_nan_inf_filter=True,
stdout: logging_steps=25,
stdout: logging_strategy=steps,
stdout: lr_scheduler_type=cosine,
stdout: max_grad_norm=1.0,
stdout: max_steps=-1,
stdout: metric_for_best_model=None,
stdout: mp_parameters=,
stdout: no_cuda=False,
stdout: num_train_epochs=2.0,
stdout: optim=adamw_torch,
stdout: optim_args=None,
stdout: output_dir=./xxx,
stdout: overwrite_output_dir=True,
stdout: past_index=-1,
stdout: per_device_eval_batch_size=8,
stdout: per_device_train_batch_size=16,
stdout: prediction_loss_only=False,
stdout: push_to_hub=False,
stdout: push_to_hub_model_id=None,
stdout: push_to_hub_organization=None,
stdout: push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
stdout: ray_scope=last,
stdout: remove_unused_columns=True,
stdout: report_to=[],
stdout: resume_from_checkpoint=None,
stdout: run_name=./xxx,
stdout: save_on_each_node=False,
stdout: save_safetensors=False,
stdout: save_steps=500,
stdout: save_strategy=epoch,
stdout: save_total_limit=None,
stdout: seed=42,
stdout: skip_memory_metrics=True,
stdout: tf32=None,
stdout: torch_compile=False,
stdout: torch_compile_backend=None,
stdout: torch_compile_mode=None,
stdout: torchdynamo=None,
stdout: tpu_metrics_debug=False,
stdout: tpu_num_cores=None,
stdout: use_cpu=False,
stdout: use_ipex=False,
stdout: use_legacy_prediction_loop=False,
stdout: use_mps_device=False,
stdout: warmup_ratio=0.0,
stdout: warmup_steps=0,
stdout: weight_decay=0.0,
stdout: )
stdout: 10/31/2023 09:38:07 - WARNING - datasets.load - Using the latest cached version of the module from /root/.cache/huggingface/modules/datasets_modules/datasets/glue/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad (last modified on Thu Oct 26 17:35:09 2023) since it couldn't be found locally at glue., or remotely on the Hugging Face Hub.
stdout: 10/31/2023 09:38:07 - INFO - datasets.info - Loading Dataset Infos from /root/.cache/huggingface/modules/datasets_modules/datasets/glue/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad
stdout: 10/31/2023 09:38:07 - INFO - datasets.builder - Overwrite dataset info from restored data version if exists.
stdout: 10/31/2023 09:38:07 - INFO - datasets.info - Loading Dataset info from /root/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad
stdout: 10/31/2023 09:38:07 - INFO - datasets.builder - Found cached dataset glue (/root/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)
stdout: 10/31/2023 09:38:07 - INFO - datasets.info - Loading Dataset info from /root/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad
stdout: 10/31/2023 09:38:07 - WARNING - __main__ - Process rank: 1, device: npu:1, n_gpu: 1distributed training: True, 16-bits training: False
stdout: 10/31/2023 09:38:07 - WARNING - datasets.load - Using the latest cached version of the module from /root/.cache/huggingface/modules/datasets_modules/datasets/glue/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad (last modified on Thu Oct 26 17:35:09 2023) since it couldn't be found locally at glue., or remotely on the Hugging Face Hub.
stdout: Warning: since the loaded file is not a zipfile, only "torch.device" and "str" type parameters are currently supported for parameter types of map_locationIf parameter types of map_location is "Callable[[torch.Tensor, str], torch.Tensor]" or "Dict[str, str]", which is only support for zipfile,all tensors are currently loaded onto the CPU, which may introduce problems
stdout: Warning: since the loaded file is not a zipfile, only "torch.device" and "str" type parameters are currently supported for parameter types of map_locationIf parameter types of map_location is "Callable[[torch.Tensor, str], torch.Tensor]" or "Dict[str, str]", which is only support for zipfile,all tensors are currently loaded onto the CPU, which may introduce problems
stdout: 10/31/2023 09:38:08 - INFO - datasets.arrow_dataset - Loading cached processed dataset at /root/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-f2e61c34c9899b5a.arrow
stdout: 10/31/2023 09:38:08 - INFO - datasets.arrow_dataset - Loading cached processed dataset at /root/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-fd9184904bb613ef.arrow
stdout: 10/31/2023 09:38:08 - INFO - datasets.arrow_dataset - Loading cached processed dataset at /root/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-e2ab4fdde1bba06e.arrow
stdout: 10/31/2023 09:38:10 - INFO - __main__ - Sample 2619 of the training set: {'sentence1': 'The proceedings were taken up with prosecutors outlining their case against Amrozi , reading 33 pages of documents outlining allegations against him .', 'sentence2': 'Proceedings were taken up with prosecutors outlining their case against Amrozi , reading a 33-page accusation letter to the court .', 'label': 1, 'idx': 2916, 'input_ids': [101, 1109, 10830, 1127, 1678, 1146, 1114, 24987, 1149, 13260, 1147, 1692, 1222, 7277, 2180, 5303, 117, 3455, 3081, 5097, 1104, 4961, 1149, 13260, 9966, 1222, 1140, 119, 102, 20661, 1127, 1678, 1146, 1114, 24987, 1149, 13260, 1147, 1692, 1222, 7277, 2180, 5303, 117, 3455, 170, 3081, 118, 3674, 21100, 2998, 1106, 1103, 2175, 119, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}.
stdout: 10/31/2023 09:38:10 - INFO - __main__ - Sample 456 of the training set: {'sentence1': "Chechen officials working for the Moscow-backed government are a frequent target for rebels and tension is running high ahead of next Sunday 's presidential election in war-torn Chechnya .", 'sentence2': "Officials in Chechnya 's Moscow-backed government are a frequent target for rebels , and tension is running high ahead of Sunday 's presidential election in the war-ravaged region .", 'label': 1, 'idx': 509, 'input_ids': [101, 20394, 11252, 1424, 3878, 1684, 1111, 1103, 4116, 118, 5534, 1433, 1132, 170, 6539, 4010, 1111, 9283, 1105, 6646, 1110, 1919, 1344, 3075, 1104, 1397, 3625, 112, 188, 5200, 1728, 1107, 1594, 118, 7820, 20394, 11252, 15449, 119, 102, 9018, 1116, 1107, 20394, 11252, 15449, 112, 188, 4116, 118, 5534, 1433, 1132, 170, 6539, 4010, 1111, 9283, 117, 1105, 6646, 1110, 1919, 1344, 3075, 1104, 3625, 112, 188, 5200, 1728, 1107, 1103, 1594, 118, 187, 15677, 3660, 1805, 119, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}.
stdout: 10/31/2023 09:38:10 - INFO - __main__ - Sample 102 of the training set: {'sentence1': "Standard & Poor 's 500 stock index futures declined 4.40 points to 983.50 , while Nasdaq futures fell 6.5 points to 1,206.50 .", 'sentence2': "The Standard & Poor 's 500 Index was up 1.75 points , or 0.18 percent , to 977.68 .", 'label': 0, 'idx': 116, 'input_ids': [101, 6433, 111, 11767, 112, 188, 2260, 4482, 7448, 2174, 1116, 5799, 125, 119, 1969, 1827, 1106, 5103, 1495, 119, 1851, 117, 1229, 11896, 1116, 1810, 4426, 2174, 1116, 2204, 127, 119, 126, 1827, 1106, 122, 117, 20278, 119, 1851, 119, 102, 1109, 6433, 111, 11767, 112, 188, 2260, 10146, 1108, 1146, 122, 119, 3453, 1827, 117, 1137, 121, 119, 1407, 3029, 117, 1106, 5311, 1559, 119, 5599, 119, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}.
stdout: 10/31/2023 09:38:12 - WARNING - evaluate.loading - Using the latest cached version of the module from /root/.cache/huggingface/modules/evaluate_modules/metrics/evaluate-metric--glue/05234ba7acc44554edcca0978db5fa3bc600eeee66229abe79ff9887eacaf3ed (last modified on Fri Oct 27 14:19:02 2023) since it couldn't be found locally at evaluate-metric--glue, or remotely on the Hugging Face Hub.
stdout: 10/31/2023 09:38:12 - WARNING - accelerate.utils.other - Detected kernel version 4.19.90, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
stdout: 10/31/2023 09:38:12 - WARNING - evaluate.loading - Using the latest cached version of the module from /root/.cache/huggingface/modules/evaluate_modules/metrics/evaluate-metric--glue/05234ba7acc44554edcca0978db5fa3bc600eeee66229abe79ff9887eacaf3ed (last modified on Fri Oct 27 14:19:02 2023) since it couldn't be found locally at evaluate-metric--glue, or remotely on the Hugging Face Hub.
stdout: {'loss': 0.6273, 'learning_rate': 4.855652305297052e-05, 'epoch': 0.22}
stdout: {'loss': 0.6318, 'learning_rate': 4.43927822676105e-05, 'epoch': 0.43}
stdout: {'loss': 0.6359, 'learning_rate': 3.798959875088584e-05, 'epoch': 0.65}
stdout: {'loss': 0.6028, 'learning_rate': 3.008640032631585e-05, 'epoch': 0.87}
stdout: {'eval_loss': 0.5785336494445801, 'eval_accuracy': 0.7058823529411765, 'eval_f1': 0.8219584569732937, 'eval_combined_score': 0.7639204049572351, 'eval_runtime': 1.1857, 'eval_samples_per_second': 344.089, 'eval_steps_per_second': 21.927, 'epoch': 1.0}
stdout: Fail to import hypothesis in common_utils, tests are not derandomized
------------------------------------------------------------------------------------------------------------- Captured stderr call -------------------------------------------------------------------------------------------------------------
stderr: The following values were not passed to `accelerate launch` and had defaults used instead:
stderr: 		More than one GPU was found, enabling multi-GPU training.
stderr: 		If this was unintended please pass in `--num_processes=1`.
stderr: 	`--num_machines` was set to a value of `1`
stderr: 	`--mixed_precision` was set to a value of `'no'`
stderr: 	`--dynamo_backend` was set to a value of `'no'`
stderr: To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
stderr: Using the latest cached version of the module from /root/.cache/huggingface/modules/datasets_modules/datasets/glue/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad (last modified on Thu Oct 26 17:35:09 2023) since it couldn't be found locally at glue., or remotely on the Hugging Face Hub.
stderr: Loading Dataset Infos from /root/.cache/huggingface/modules/datasets_modules/datasets/glue/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad
stderr: Overwrite dataset info from restored data version if exists.
stderr: Loading Dataset info from /root/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad
stderr: Found cached dataset glue (/root/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)
stderr: Loading Dataset info from /root/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad
stderr: [INFO|configuration_utils.py:714] 2023-10-31 09:38:07,303 >> loading configuration file /data/hf_test/bert-base-cased/config.json
stderr: [INFO|configuration_utils.py:776] 2023-10-31 09:38:07,314 >> Model config BertConfig {
stderr:   "_name_or_path": "/data/hf_test/bert-base-cased",
stderr:   "architectures": [
stderr:     "BertForMaskedLM"
stderr:   ],
stderr:   "attention_probs_dropout_prob": 0.1,
stderr:   "classifier_dropout": null,
stderr:   "finetuning_task": "mrpc",
stderr:   "gradient_checkpointing": false,
stderr:   "hidden_act": "gelu",
stderr:   "hidden_dropout_prob": 0.1,
stderr:   "hidden_size": 768,
stderr:   "initializer_range": 0.02,
stderr:   "intermediate_size": 3072,
stderr:   "layer_norm_eps": 1e-12,
stderr:   "max_position_embeddings": 512,
stderr:   "model_type": "bert",
stderr:   "num_attention_heads": 12,
stderr:   "num_hidden_layers": 12,
stderr:   "pad_token_id": 0,
stderr:   "position_embedding_type": "absolute",
stderr:   "transformers_version": "4.35.0.dev0",
stderr:   "type_vocab_size": 2,
stderr:   "use_cache": true,
stderr:   "vocab_size": 28996
stderr: }
stderr: 
stderr: [INFO|configuration_utils.py:714] 2023-10-31 09:38:07,314 >> loading configuration file /data/hf_test/bert-base-cased/config.json
stderr: [INFO|configuration_utils.py:776] 2023-10-31 09:38:07,316 >> Model config BertConfig {
stderr:   "_name_or_path": "/data/hf_test/bert-base-cased",
stderr:   "architectures": [
stderr:     "BertForMaskedLM"
stderr:   ],
stderr:   "attention_probs_dropout_prob": 0.1,
stderr:   "classifier_dropout": null,
stderr:   "gradient_checkpointing": false,
stderr:   "hidden_act": "gelu",
stderr:   "hidden_dropout_prob": 0.1,
stderr:   "hidden_size": 768,
stderr:   "initializer_range": 0.02,
stderr:   "intermediate_size": 3072,
stderr:   "layer_norm_eps": 1e-12,
stderr:   "max_position_embeddings": 512,
stderr:   "model_type": "bert",
stderr:   "num_attention_heads": 12,
stderr:   "num_hidden_layers": 12,
stderr:   "pad_token_id": 0,
stderr:   "position_embedding_type": "absolute",
stderr:   "transformers_version": "4.35.0.dev0",
stderr:   "type_vocab_size": 2,
stderr:   "use_cache": true,
stderr:   "vocab_size": 28996
stderr: }
stderr: 
stderr: [INFO|tokenization_utils_base.py:2019] 2023-10-31 09:38:07,316 >> loading file vocab.txt
stderr: [INFO|tokenization_utils_base.py:2019] 2023-10-31 09:38:07,316 >> loading file tokenizer.json
stderr: [INFO|tokenization_utils_base.py:2019] 2023-10-31 09:38:07,317 >> loading file added_tokens.json
stderr: [INFO|tokenization_utils_base.py:2019] 2023-10-31 09:38:07,317 >> loading file special_tokens_map.json
stderr: [INFO|tokenization_utils_base.py:2019] 2023-10-31 09:38:07,317 >> loading file tokenizer_config.json
stderr: [INFO|configuration_utils.py:714] 2023-10-31 09:38:07,317 >> loading configuration file /data/hf_test/bert-base-cased/config.json
stderr: [INFO|configuration_utils.py:776] 2023-10-31 09:38:07,318 >> Model config BertConfig {
stderr:   "_name_or_path": "/data/hf_test/bert-base-cased",
stderr:   "architectures": [
stderr:     "BertForMaskedLM"
stderr:   ],
stderr:   "attention_probs_dropout_prob": 0.1,
stderr:   "classifier_dropout": null,
stderr:   "gradient_checkpointing": false,
stderr:   "hidden_act": "gelu",
stderr:   "hidden_dropout_prob": 0.1,
stderr:   "hidden_size": 768,
stderr:   "initializer_range": 0.02,
stderr:   "intermediate_size": 3072,
stderr:   "layer_norm_eps": 1e-12,
stderr:   "max_position_embeddings": 512,
stderr:   "model_type": "bert",
stderr:   "num_attention_heads": 12,
stderr:   "num_hidden_layers": 12,
stderr:   "pad_token_id": 0,
stderr:   "position_embedding_type": "absolute",
stderr:   "transformers_version": "4.35.0.dev0",
stderr:   "type_vocab_size": 2,
stderr:   "use_cache": true,
stderr:   "vocab_size": 28996
stderr: }
stderr: 
stderr: Using the latest cached version of the module from /root/.cache/huggingface/modules/datasets_modules/datasets/glue/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad (last modified on Thu Oct 26 17:35:09 2023) since it couldn't be found locally at glue., or remotely on the Hugging Face Hub.
stderr: [INFO|modeling_utils.py:3057] 2023-10-31 09:38:07,393 >> loading weights file /data/hf_test/bert-base-cased/pytorch_model.bin
stderr: [INFO|modeling_utils.py:3838] 2023-10-31 09:38:08,324 >> Some weights of the model checkpoint at /data/hf_test/bert-base-cased were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias']
stderr: - This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
stderr: - This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
stderr: [WARNING|modeling_utils.py:3850] 2023-10-31 09:38:08,324 >> Some weights of BertForSequenceClassification were not initialized from the model checkpoint at /data/hf_test/bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
stderr: You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
stderr: Loading cached processed dataset at /root/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-f2e61c34c9899b5a.arrow
stderr: Loading cached processed dataset at /root/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-fd9184904bb613ef.arrow
stderr: Loading cached processed dataset at /root/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-e2ab4fdde1bba06e.arrow
stderr: [WARNING|modeling_utils.py:3850] 2023-10-31 09:38:08,625 >> Some weights of BertForSequenceClassification were not initialized from the model checkpoint at /data/hf_test/bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
stderr: You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
stderr: [INFO|trainer.py:698] 2023-10-31 09:38:12,532 >> The following columns in the training set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: idx, sentence2, sentence1. If idx, sentence2, sentence1 are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
stderr: [INFO|trainer.py:1674] 2023-10-31 09:38:13,434 >> ***** Running training *****
stderr: [INFO|trainer.py:1675] 2023-10-31 09:38:13,435 >>   Num examples = 3,668
stderr: [INFO|trainer.py:1676] 2023-10-31 09:38:13,435 >>   Num Epochs = 2
stderr: [INFO|trainer.py:1677] 2023-10-31 09:38:13,435 >>   Instantaneous batch size per device = 16
stderr: [INFO|trainer.py:1680] 2023-10-31 09:38:13,435 >>   Total train batch size (w. parallel, distributed & accumulation) = 32
stderr: [INFO|trainer.py:1681] 2023-10-31 09:38:13,435 >>   Gradient Accumulation steps = 1
stderr: [INFO|trainer.py:1682] 2023-10-31 09:38:13,435 >>   Total optimization steps = 230
stderr: [INFO|trainer.py:1683] 2023-10-31 09:38:13,436 >>   Number of trainable parameters = 54,155,905
 50%|█████     | 115/230 [00:14<00:13,  8.55it/s][INFO|trainer.py:698] 2023-10-31 09:38:27,965 >> The following columns in the evaluation set don't have a corresponding argument in `FullyShardedDataParallel.forward` and have been ignored: idx, sentence2, sentence1. If idx, sentence2, sentence1 are not expected by `FullyShardedDataParallel.forward`,  you can safely ignore this message.
stderr: [INFO|trainer.py:3093] 2023-10-31 09:38:27,969 >> ***** Running Evaluation *****
stderr: [INFO|trainer.py:3095] 2023-10-31 09:38:27,969 >>   Num examples = 408
stderr: [INFO|trainer.py:3098] 2023-10-31 09:38:27,969 >>   Batch size = 8
 50%|█████     | 115/230 [00:15<00:13,  8.55it/[INFO|trainer.py:2816] 2023-10-31 09:38:29,156 >> Saving model checkpoint to ./xxx/checkpoint-115
stderr: [INFO|configuration_utils.py:461] 2023-10-31 09:38:29,158 >> Configuration saved in ./xxx/checkpoint-115/config.json
stderr: [INFO|modeling_utils.py:2168] 2023-10-31 09:38:29,159 >> Model weights saved in ./xxx/checkpoint-115/pytorch_model.bin
stderr: [INFO|tokenization_utils_base.py:2426] 2023-10-31 09:38:29,159 >> tokenizer config file saved in ./xxx/checkpoint-115/tokenizer_config.json
stderr: [INFO|tokenization_utils_base.py:2435] 2023-10-31 09:38:29,160 >> Special tokens file saved in ./xxx/checkpoint-115/special_tokens_map.json
stderr: /data/anaconda/envs/hf_test/lib/python3.8/site-packages/torch/distributed/_shard/sharded_tensor/api.py:1121: UserWarning: Please use DTensor instead and we are deprecating ShardedTensor.
stderr:   warnings.warn(DEPRECATE_MSG)
stderr: /data/anaconda/envs/hf_test/lib/python3.8/site-packages/torch/distributed/_shard/sharded_tensor/api.py:1121: UserWarning: Please use DTensor instead and we are deprecating ShardedTensor.
stderr:   warnings.warn(DEPRECATE_MSG)
stderr: Traceback (most recent call last):
stderr:   File "/data/hf_test/transformers/examples/pytorch/text-classification/run_glue.py", line 649, in <module>
stderr:     main()
stderr:   File "/data/hf_test/transformers/examples/pytorch/text-classification/run_glue.py", line 557, in main
stderr:     train_result = trainer.train(resume_from_checkpoint=checkpoint)
stderr:   File "/data/hf_test/transformers/src/transformers/trainer.py", line 1511, in train
stderr:     return inner_training_loop(
stderr:   File "/data/hf_test/transformers/src/transformers/trainer.py", line 1894, in _inner_training_loop
stderr:     self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
stderr:   File "/data/hf_test/transformers/src/transformers/trainer.py", line 2234, in _maybe_log_save_evaluate
stderr:     self._save_checkpoint(model, trial, metrics=metrics)
stderr:   File "/data/hf_test/transformers/src/transformers/trainer.py", line 2291, in _save_checkpoint
stderr:     self.save_model(output_dir, _internal_call=True)
stderr:   File "/data/hf_test/transformers/src/transformers/trainer.py", line 2756, in save_model
stderr:     save_fsdp_model(self.accelerator.state.fsdp_plugin, self.accelerator, self.model, output_dir)
stderr:   File "/data/anaconda/envs/hf_test/lib/python3.8/site-packages/accelerate/utils/fsdp_utils.py", line 72, in save_fsdp_model
stderr:     dist_cp.save_state_dict(
stderr:   File "/data/anaconda/envs/hf_test/lib/python3.8/site-packages/torch/distributed/checkpoint/state_dict_saver.py", line 113, in save_state_dict
stderr:     central_plan = distW.reduce_scatter("plan", local_step, global_step)
stderr:   File "/data/anaconda/envs/hf_test/lib/python3.8/site-packages/torch/distributed/checkpoint/utils.py", line 177, in reduce_scatter
stderr:     all_data = self.gather_object(local_data)
stderr:   File "/data/anaconda/envs/hf_test/lib/python3.8/site-packages/torch/distributed/checkpoint/utils.py", line 108, in gather_object
stderr:     dist.gather_object(
stderr:   File "/data/anaconda/envs/hf_test/lib/python3.8/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
stderr:     return func(*args, **kwargs)
stderr:   File "/data/anaconda/envs/hf_test/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2509, in gather_object
stderr: Traceback (most recent call last):
stderr:   File "/data/hf_test/transformers/examples/pytorch/text-classification/run_glue.py", line 649, in <module>
stderr:     main()
stderr:   File "/data/hf_test/transformers/examples/pytorch/text-classification/run_glue.py", line 557, in main
stderr:     gather(
stderr:   File "/data/anaconda/envs/hf_test/lib/python3.8/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
stderr:     return func(*args, **kwargs)
stderr:   File "/data/anaconda/envs/hf_test/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 3078, in gather
stderr:     train_result = trainer.train(resume_from_checkpoint=checkpoint)
stderr:   File "/data/hf_test/transformers/src/transformers/trainer.py", line 1511, in train
stderr:     return inner_training_loop(
stderr:   File "/data/hf_test/transformers/src/transformers/trainer.py", line 1894, in _inner_training_loop
stderr:     work = default_pg.gather(output_tensors, input_tensors, opts)
stderr: RuntimeError: ProcessGroupHCCL does not support gather
stderr:     self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
stderr:   File "/data/hf_test/transformers/src/transformers/trainer.py", line 2234, in _maybe_log_save_evaluate
stderr:     self._save_checkpoint(model, trial, metrics=metrics)
stderr:   File "/data/hf_test/transformers/src/transformers/trainer.py", line 2291, in _save_checkpoint
stderr:     self.save_model(output_dir, _internal_call=True)
stderr:   File "/data/hf_test/transformers/src/transformers/trainer.py", line 2756, in save_model
stderr:     save_fsdp_model(self.accelerator.state.fsdp_plugin, self.accelerator, self.model, output_dir)
stderr:   File "/data/anaconda/envs/hf_test/lib/python3.8/site-packages/accelerate/utils/fsdp_utils.py", line 72, in save_fsdp_model
stderr:     dist_cp.save_state_dict(
stderr:   File "/data/anaconda/envs/hf_test/lib/python3.8/site-packages/torch/distributed/checkpoint/state_dict_saver.py", line 113, in save_state_dict
stderr:     central_plan = distW.reduce_scatter("plan", local_step, global_step)
stderr:   File "/data/anaconda/envs/hf_test/lib/python3.8/site-packages/torch/distributed/checkpoint/utils.py", line 177, in reduce_scatter
stderr:     all_data = self.gather_object(local_data)
stderr:   File "/data/anaconda/envs/hf_test/lib/python3.8/site-packages/torch/distributed/checkpoint/utils.py", line 108, in gather_object
stderr:     dist.gather_object(
stderr:   File "/data/anaconda/envs/hf_test/lib/python3.8/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
stderr:     return func(*args, **kwargs)
stderr:   File "/data/anaconda/envs/hf_test/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2509, in gather_object
stderr:     gather(
stderr:   File "/data/anaconda/envs/hf_test/lib/python3.8/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
stderr:     return func(*args, **kwargs)
stderr:   File "/data/anaconda/envs/hf_test/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 3078, in gather
stderr:     work = default_pg.gather(output_tensors, input_tensors, opts)
stderr: RuntimeError: ProcessGroupHCCL does not support gather
stderr: /data/anaconda/envs/hf_test/lib/python3.8/tempfile.py:818: ResourceWarning: Implicitly cleaning up <TemporaryDirectory '/tmp/tmp9dfsya53'>
stderr:   _warnings.warn(warn_message, ResourceWarning)
 50%|█████     | 115/230 [00:17<00:17,  6.63it/s]
stderr: /data/anaconda/envs/hf_test/lib/python3.8/tempfile.py:818: ResourceWarning: Implicitly cleaning up <TemporaryDirectory '/tmp/tmpzdu72fdr'>
stderr:   _warnings.warn(warn_message, ResourceWarning)
stderr: [2023-10-31 09:38:36,223] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 3079461) of binary: /data/anaconda/envs/hf_test/bin/python
stderr: Traceback (most recent call last):
stderr:   File "/data/anaconda/envs/hf_test/bin/accelerate", line 8, in <module>
stderr:     sys.exit(main())
stderr:   File "/data/anaconda/envs/hf_test/lib/python3.8/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
stderr:     args.func(args)
stderr:   File "/data/anaconda/envs/hf_test/lib/python3.8/site-packages/accelerate/commands/launch.py", line 981, in launch_command
stderr:     multi_gpu_launcher(args)
stderr:   File "/data/anaconda/envs/hf_test/lib/python3.8/site-packages/accelerate/commands/launch.py", line 654, in multi_gpu_launcher
stderr:     distrib_run.run(args)
stderr:   File "/data/anaconda/envs/hf_test/lib/python3.8/site-packages/torch/distributed/run.py", line 797, in run
stderr:     elastic_launch(
stderr:   File "/data/anaconda/envs/hf_test/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
stderr:     return launch_agent(self._config, self._entrypoint, list(args))
stderr:   File "/data/anaconda/envs/hf_test/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
stderr:     raise ChildFailedError(
stderr: torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
stderr: ============================================================
stderr: /data/hf_test/transformers/examples/pytorch/text-classification/run_glue.py FAILED
stderr: ------------------------------------------------------------
stderr: Failures:
stderr: [1]:
stderr:   time      : 2023-10-31_09:38:36
stderr:   host      : localhost.localdomain
stderr:   rank      : 1 (local_rank: 1)
stderr:   exitcode  : 1 (pid: 3079463)
stderr:   error_file: <N/A>
stderr:   traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
stderr: ------------------------------------------------------------
stderr: Root Cause (first observed failure):
stderr: [0]:
stderr:   time      : 2023-10-31_09:38:36
stderr:   host      : localhost.localdomain
stderr:   rank      : 0 (local_rank: 0)
stderr:   exitcode  : 1 (pid: 3079461)
stderr:   error_file: <N/A>
stderr:   traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
stderr: ============================================================
=============================================================================================================== warnings summary ===============================================================================================================
../../anaconda/envs/hf_test/lib/python3.8/site-packages/_pytest/config/__init__.py:1373
  /data/anaconda/envs/hf_test/lib/python3.8/site-packages/_pytest/config/__init__.py:1373: PytestConfigWarning: Unknown config option: doctest_glob
  
    self._warn_or_fail_if_strict(f"Unknown config option: {key}\n")

tests/fsdp/test_fsdp.py::TrainerIntegrationFSDP::test_fsdp_config_full_shard_fp16
tests/fsdp/test_fsdp.py::TrainerIntegrationFSDP::test_fsdp_config_shard_grad_op_fp16
  /data/anaconda/envs/hf_test/lib/python3.8/site-packages/torch/cuda/amp/grad_scaler.py:125: UserWarning: torch.cuda.amp.GradScaler is enabled, but CUDA is not available.  Disabling.
    warnings.warn(

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
=========================================================================================================== short test summary info ============================================================================================================
FAILED tests/fsdp/test_fsdp.py::TrainerIntegrationFSDP::test_training_and_can_resume_normally_SHARDED_STATE_DICT - RuntimeError: 'accelerate launch --num_processes 2 --main_process_port 10999 --use_fsdp --fsdp_auto_wrap_policy TRANSFORMER_BASED_WRAP --fsdp_state_dict_type SHARDED_STATE_DICT --fsdp_transformer_layer_cls_to_wrap BertLayer --fsdp_shar...
============================================================================================= 1 failed, 11 passed, 3 warnings in 777.77s (0:12:57) ============================================================================================
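The single failure is the SHARDED_STATE_DICT case: save_fsdp_model goes through torch.distributed.checkpoint.save_state_dict, whose planning step gathers metadata with dist.gather_object, and the traceback shows that ProcessGroupHCCL does not implement gather. Below is a minimal sketch of how a test suite could skip that path on backends without gather support; the helper and the device set are illustrative assumptions and not part of this PR, only torch_device comes from transformers.testing_utils.

# Minimal sketch, not part of this PR: skip the SHARDED_STATE_DICT test on
# accelerators whose distributed backend lacks `gather` (HCCL on NPU, per the
# traceback above). The device set is an assumption for illustration only.
import unittest

from transformers.testing_utils import torch_device

_GATHER_UNSUPPORTED_DEVICES = {"npu"}  # assumed: HCCL provides no gather collective


def require_dist_gather_support(test_case):
    """Skip a test when the current accelerator's backend cannot run dist.gather."""
    if torch_device in _GATHER_UNSUPPORTED_DEVICES:
        return unittest.skip(f"dist.gather is not supported on {torch_device}")(test_case)
    return test_case


# Usage sketch inside tests/fsdp/test_fsdp.py:
#
# @require_dist_gather_support
# def test_training_and_can_resume_normally_SHARDED_STATE_DICT(self):
#     ...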

@statelesshz
Contributor Author

FYI #27120 (comment) 😄 @ydshieh

@statelesshz
Contributor Author

Is this PR still under review? Please inform me if any further revisions are required :-) @ydshieh and @amyeroberts

@ydshieh
Collaborator

ydshieh commented Oct 31, 2023

We don't expect all tests to run 100% without problems on other devices. My question was just to see the current results. It doesn't look bad running on NPU!

LGTM, but waiting for @amyeroberts to give her 👍 if everything looks good to her.

Collaborator

@amyeroberts amyeroberts left a comment

Thanks again for another great PR!

@ydshieh ydshieh merged commit 82c7e87 into huggingface:main Nov 1, 2023
18 checks passed
@statelesshz statelesshz deleted the test_fsdp branch November 1, 2023 06:19
EduardoPach pushed a commit to EduardoPach/transformers that referenced this pull request Nov 19, 2023
* make fsdp test cases device agnostic

* make style
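For context, the "make fsdp test cases device agnostic" commit above amounts to removing CUDA-only assumptions from the tests. A rough before/after sketch follows; helper names such as require_torch_multi_accelerator and backend_device_count are assumed from transformers.testing_utils around this PR's timeframe, not quoted from its diff.

# Rough sketch of a device-agnostic test, assuming require_torch_multi_accelerator,
# backend_device_count, and torch_device exist in transformers.testing_utils
# (as they did around late 2023); illustrative only, not this PR's actual diff.
import unittest

from transformers.testing_utils import (
    backend_device_count,
    require_torch_multi_accelerator,
    torch_device,
)


class FSDPSetupTest(unittest.TestCase):
    @require_torch_multi_accelerator  # previously @require_torch_multi_gpu
    def test_has_at_least_two_devices(self):
        # previously: torch.cuda.device_count()
        self.assertGreaterEqual(backend_device_count(torch_device), 2)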