OnlineDPO error when using DeepSpeed ZeRO-3 #7
System Info
`
transformers 4.47.0
triton 3.0.0
trl 0.12.1
trove-classifiers 2024.10.21.16
truststore 0.8.0
typer 0.14.0
types-dataclasses 0.6.6
typing_extensions 4.12.2
typing-inspect 0.9.0
tzdata 2024.2
tzlocal 5.2
ujson 5.10.0
urllib3 2.2.2
utils 1.0.2
uvicorn 0.32.1
uvloop 0.21.0
virtualenv 20.28.0
vllm 0.6.3
vllm-flash-attn 2.6.1
`
`trl env` output:
`Copy-paste the following information when reporting an issue:
Platform: Linux-5.4.143-2-velinux1-amd64-x86_64-with-glibc2.35
Python version: 3.11.9
PyTorch version: 2.4.0
CUDA device(s): NVIDIA A100-SXM4-80GB, NVIDIA A100-SXM4-80GB, NVIDIA A100-SXM4-80GB, NVIDIA A100-SXM4-80GB, NVIDIA A100-SXM4-80GB, NVIDIA A100-SXM4-80GB, NVIDIA A100-SXM4-80GB, NVIDIA A100-SXM4-80GB
Transformers version: 4.47.0
Accelerate version: 1.1.1
Accelerate config: not found
Datasets version: 3.1.0
HF Hub version: 0.26.3
TRL version: 0.12.1
bitsandbytes version: 0.45.0
DeepSpeed version: 0.16.1
Diffusers version: not installed
Liger-Kernel version: not installed
LLM-Blender version: 0.0.2
OpenAI version: 1.57.0
PEFT version: 0.13.2`
Reproduction
from torch.utils.data import Dataset  # assumed: Dataset is PyTorch's map-style dataset


class UnifiedDPODataset(Dataset):
    """
    Unified DPO dataset.
    """
    def __init__(self, file, tokenizer, max_seq_length, max_prompt_length, template,
                 maximum_es_score, minimum_es_score, bool_training: bool):
        self.tokenizer = tokenizer
        self.template_name = template.template_name
        # ==None
        self.system_format = template.system_format
        self.user_format = template.user_format
        self.assistant_format = template.assistant_format
        self.system = template.system


class UnifiedOnlineDPODataset(UnifiedDPODataset):
    def __init__(self, file, tokenizer, max_seq_length, template,
                 maximum_es_score, minimum_es_score, bool_training: bool):
        # Online DPO only tokenizes prompts, so the prompt may use the full sequence length.
        max_prompt_length = max_seq_length
        super().__init__(file=file, tokenizer=tokenizer, max_seq_length=max_seq_length,
                         max_prompt_length=max_prompt_length, template=template,
                         maximum_es_score=maximum_es_score, minimum_es_score=minimum_es_score,
                         bool_training=bool_training)

    def __getitem__(self, index):
        data = self.data_list[index]
        # build prompt
        # Build the system and history parts.
        # Check whether message 0 is the system message.
        # chosen = data['chosen']
        # if chosen[0]['role'] == 'system':
        #     system = chosen[0]['content'].strip()
        #     history = chosen[1:-1]  # preceding dialogue turns
        #     chosen = chosen[-1]
        # else:
        #     # user/assistant only; for single-turn data, history is empty
        #     system = None
        #     history = chosen[:-1]  # preceding dialogue turns
        #     # last turn of chosen/rejected: the assistant reply
        #     chosen = chosen[-1]
        # prompt_input_ids, prompt = self.build_prompt_input_ids(system, history)
Expected behavior
2024-12-30 10:53:44.559 [rank4]: Traceback (most recent call last):
2024-12-30 10:53:44.559 [rank4]:   File "/lpai-running/code/firefly-zyy-dev/339ecc/shells/../train_onlinedpo.py", line 251, in <module>
2024-12-30 10:53:44.559 [rank4]:     main()
2024-12-30 10:53:44.559 [rank4]:   File "/lpai-running/code/firefly-zyy-dev/339ecc/shells/../train_onlinedpo.py", line 195, in main
2024-12-30 10:53:44.559 [rank4]:     train_result=trainer.train()
2024-12-30 10:53:44.559 [rank4]:                  ^^^^^^^^^^^^^^^
2024-12-30 10:53:44.559 [rank4]:   File "/opt/conda/lib/python3.11/site-packages/transformers/trainer.py", line 2164, in train
2024-12-30 10:53:44.559 [rank4]:     return inner_training_loop(
2024-12-30 10:53:44.559 [rank4]:            ^^^^^^^^^^^^^^^^^^^^
2024-12-30 10:53:44.559 [rank4]:   File "/opt/conda/lib/python3.11/site-packages/transformers/trainer.py", line 2522, in _inner_training_loop
2024-12-30 10:53:44.559 [rank4]:     tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
2024-12-30 10:53:44.559 [rank4]:                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-12-30 10:53:44.559 [rank4]:   File "/lpai-running/code/firefly-zyy-dev/339ecc/models/online_dpo_trainer.py", line 480, in training_step
2024-12-30 10:53:44.559 [rank4]:     output = unwrapped_model.generate(
2024-12-30 10:53:44.559 [rank4]:              ^^^^^^^^^^^^^^^^^^^^^^^^^
2024-12-30 10:53:44.559 [rank4]:   File "/opt/conda/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
2024-12-30 10:53:44.559 [rank4]:     return func(*args, **kwargs)
2024-12-30 10:53:44.559 [rank4]:            ^^^^^^^^^^^^^^^^^^^^^
2024-12-30 10:53:44.559 [rank4]:   File "/opt/conda/lib/python3.11/site-packages/transformers/generation/utils.py", line 2252, in generate
2024-12-30 10:53:44.559 [rank4]:     result = self._sample(
2024-12-30 10:53:44.559 [rank4]:              ^^^^^^^^^^^^^
2024-12-30 10:53:44.559 [rank4]:   File "/opt/conda/lib/python3.11/site-packages/transformers/generation/utils.py", line 3254, in _sample
2024-12-30 10:53:44.559 [rank4]:     outputs = model_forward(**model_inputs, return_dict=True)
2024-12-30 10:53:44.559 [rank4]:               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-12-30 10:53:44.559 [rank4]:   File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
2024-12-30 10:53:44.559 [rank4]:     return self._call_impl(*args, **kwargs)
2024-12-30 10:53:44.559 [rank4]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-12-30 10:53:44.559 [rank4]:   File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1603, in _call_impl
2024-12-30 10:53:44.559 [rank4]:     result = forward_call(*args, **kwargs)
2024-12-30 10:53:44.559 [rank4]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-12-30 10:53:44.559 [rank4]:   File "/opt/conda/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py", line 1163, in forward
2024-12-30 10:53:44.559 [rank4]:     outputs = self.model(
2024-12-30 10:53:44.559 [rank4]:               ^^^^^^^^^^^
2024-12-30 10:53:44.559 [rank4]:   File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
2024-12-30 10:53:44.559 [rank4]:     return self._call_impl(*args, **kwargs)
2024-12-30 10:53:44.559 [rank4]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-12-30 10:53:44.559 [rank4]:   File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
2024-12-30 10:53:44.559 [rank4]:     return forward_call(*args, **kwargs)
2024-12-30 10:53:44.559 [rank4]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-12-30 10:53:44.559 [rank4]:   File "/opt/conda/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py", line 883, in forward
2024-12-30 10:53:44.559 [rank4]:     causal_mask = self._update_causal_mask(
2024-12-30 10:53:44.559 [rank4]:                   ^^^^^^^^^^^^^^^^^^^^^^^^^
2024-12-30 10:53:44.559 [rank4]:   File "/opt/conda/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py", line 993, in _update_causal_mask
2024-12-30 10:53:44.559 [rank4]:     causal_mask = self._prepare_4d_causal_attention_mask_with_cache_position(
2024-12-30 10:53:44.559 [rank4]:                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-12-30 10:53:44.559 [rank4]:   File "/opt/conda/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py", line 1060, in _prepare_4d_causal_attention_mask_with_cache_position
2024-12-30 10:53:44.559 [rank4]:     causal_mask *= torch.arange(target_length, device=device) > cache_position.reshape(-1, 1)
2024-12-30 10:53:44.559 [rank4]: RuntimeError: The size of tensor a (4137) must match the size of tensor b (4138) at non-singleton dimension 0
Fields
| app | task-multitask-rl-dev-26d7c8b2
| container | main
| filename | /var/log/pods/sc-ep_task-multitask-rl-dev-26d7c8b2-master-0_0dfb4ee9-631c-4e77-8070-46d72626f885/main/0.log
| job | sc-ep/task-multitask-rl-dev-26d7c8b2
| namespace | sc-ep
| node_name | 10.48.7.142
| pod | task-multitask-rl-dev-26d7c8b2-master-0
| stream | stderr
2024-12-30 10:53:44.559 [rank4]: Exception raised from infer_size_impl at /opt/conda/conda-bld/pytorch_1720538435607/work/aten/src/ATen/ExpandUtils.cpp:31 (most recent call first):
2024-12-30 10:53:44.559 [rank4]: C++ CapturedTraceback:
2024-12-30 10:53:44.559 [rank4]: #4 std::_Function_handler<std::shared_ptr<c10::LazyValue<std::string> const> (), c10::SetStackTraceFetcher(std::function<std::string ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0
2024-12-30 10:53:44.559 [rank4]: #40 do_call_core from /usr/local/src/conda/python-3.11.9/Python/ceval.c:7349
2024-12-30 10:53:44.559 [rank4]: #41 _PyEval_EvalFrame from /usr/local/src/conda/python-3.11.9/Include/internal/pycore_ceval.h:73
2024-12-30 10:53:44.559 [rank4]: #42 method_vectorcall from /usr/local/src/conda/python-3.11.9/Objects/classobject.c:59
2024-12-30 10:53:44.559 [rank4]: #43 _PyVectorcall_Call from /usr/local/src/conda/python-3.11.9/Objects/call.c:257
2024-12-30 10:53:44.559 [rank4]: #44 do_call_core from /usr/local/src/conda/python-3.11.9/Python/ceval.c:7349
2024-12-30 10:53:44.559 [rank4]: #45 _PyEval_EvalFrame from /usr/local/src/conda/python-3.11.9/Include/internal/pycore_ceval.h:73
2024-12-30 10:53:44.559 [rank4]: #46 method_vectorcall from /usr/local/src/conda/python-3.11.9/Objects/classobject.c:59
2024-12-30 10:53:44.559 [rank4]: #47 _PyVectorcall_Call from /usr/local/src/conda/python-3.11.9/Objects/call.c:257
2024-12-30 10:53:44.559 [rank4]: #48 do_call_core from /usr/local/src/conda/python-3.11.9/Python/ceval.c:7349
2024-12-30 10:53:44.559 [rank4]: #49 _PyEval_EvalFrame from /usr/local/src/conda/python-3.11.9/Include/internal/pycore_ceval.h:73
2024-12-30 10:53:44.559 [rank4]: #50 method_vectorcall from /usr/local/src/conda/python-3.11.9/Objects/classobject.c:59
2024-12-30 10:53:44.559 [rank4]: #51 _PyVectorcall_Call from /usr/local/src/conda/python-3.11.9/Objects/call.c:257
2024-12-30 10:53:44.559 [rank4]: #52 do_call_core from /usr/local/src/conda/python-3.11.9/Python/ceval.c:7349
2024-12-30 10:53:44.559 [rank4]: #53 _PyEval_EvalFrame from /usr/local/src/conda/python-3.11.9/Include/internal/pycore_ceval.h:73
2024-12-30 10:53:44.559 [rank4]: #54 _PyVectorcall_Call from /usr/local/src/conda/python-3.11.9/Objects/call.c:257
2024-12-30 10:53:44.559 [rank4]: #55 do_call_core from /usr/local/src/conda/python-3.11.9/Python/ceval.c:7349
2024-12-30 10:53:44.559 [rank4]: #56 _PyEval_EvalFrame from /usr/local/src/conda/python-3.11.9/Include/internal/pycore_ceval.h:73
2024-12-30 10:53:44.559 [rank4]: #57 method_vectorcall from /usr/local/src/conda/python-3.11.9/Objects/classobject.c:59
2024-12-30 10:53:44.559 [rank4]: #58 _PyVectorcall_Call from /usr/local/src/conda/python-3.11.9/Objects/call.c:257
2024-12-30 10:53:44.559 [rank4]: #59 partial_call from /usr/local/src/conda/python-3.11.9/Modules/_functoolsmodule.c:324
2024-12-30 10:53:44.559 [rank4]: #60 _PyObject_MakeTpCall from /usr/local/src/conda/python-3.11.9/Objects/call.c:214
2024-12-30 10:53:44.559 [rank4]: #61 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.11.9/Include/internal/pycore_call.h:92
2024-12-30 10:53:44.559 [rank4]: #62 _PyEval_EvalFrameDefault from /usr/local/src/conda/python-3.11.9/Python/ceval.c:4769
2024-12-30 10:53:44.559 [rank4]: #63 _PyEval_EvalFrame from /usr/local/src/conda/python-3.11.9/Include/internal/pycore_ceval.h:73
2024-12-30 10:53:44.559 [rank4]: #64 PyEval_EvalCode from /usr/local/src/conda/python-3.11.9/Python/ceval.c:1148
2024-12-30 10:53:44.559 [rank4]: #65 run_eval_code_obj from /usr/local/src/conda/python-3.11.9/Python/pythonrun.c:1741
2024-12-30 10:53:44.559 [rank4]: #66 run_mod from /usr/local/src/conda/python-3.11.9/Python/pythonrun.c:1762
2024-12-30 10:53:44.559 [rank4]: #67 pyrun_file from /usr/local/src/conda/python-3.11.9/Python/pythonrun.c:1657
2024-12-30 10:53:44.559 [rank4]: #68 _PyRun_SimpleFileObject from /usr/local/src/conda/python-3.11.9/Python/pythonrun.c:440
2024-12-30 10:53:44.559 [rank4]: #69 _PyRun_AnyFileObject from /usr/local/src/conda/python-3.11.9/Python/pythonrun.c:79
2024-12-30 10:53:44.559 [rank4]: #70 pymain_run_file_obj from /usr/local/src/conda/python-3.11.9/Modules/main.c:360
2024-12-30 10:53:44.559 [rank4]: #71 Py_BytesMain from /usr/local/src/conda/python-3.11.9/Modules/main.c:734
2024-12-30 10:53:44.559 [rank4]: #72 __libc_start_call_main from ./csu/../sysdeps/nptl/libc_start_call_main.h:58
2024-12-30 10:53:44.559 [rank4]: #73 __libc_start_main_impl from ./csu/../csu/libc-start.c:392
2024-12-30 10:53:44.559 [rank4]: #74 _start from ??:0
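For readers trying to interpret the `RuntimeError`: the failing statement multiplies `causal_mask` in place by the result of `torch.arange(target_length) > cache_position.reshape(-1, 1)`, and the message says the two operands disagree in dimension 0 (4137 vs 4138). The sketch below is a pure-Python imitation of the broadcasting check, not PyTorch code. It assumes `causal_mask` has one row per input token and that `cache_position` holds one more entry than that; `broadcast_shape` is a hypothetical helper, and `target_length = 4138` is an assumed value (only 4137 and 4138 appear in the log).

```python
def broadcast_shape(a, b):
    """Return the broadcast shape of two shape tuples (NumPy/PyTorch-style rules)."""
    ndim = max(len(a), len(b))
    # Left-pad the shorter shape with 1s so both shapes have the same rank.
    a = (1,) * (ndim - len(a)) + tuple(a)
    b = (1,) * (ndim - len(b)) + tuple(b)
    out = []
    for dim, (size_a, size_b) in enumerate(zip(a, b)):
        # Sizes are compatible only if they are equal or one of them is 1.
        if size_a != size_b and size_a != 1 and size_b != 1:
            raise RuntimeError(
                f"The size of tensor a ({size_a}) must match the size of "
                f"tensor b ({size_b}) at non-singleton dimension {dim}"
            )
        out.append(size_b if size_a == 1 else size_a)
    return tuple(out)


# Assumed values: 4137 and 4138 come from the log; target_length is a guess.
sequence_length = 4137       # rows of causal_mask (one per input token, assumed)
cache_position_len = 4138    # number of entries in cache_position (assumed)
target_length = 4138

causal_mask_shape = (sequence_length, target_length)
# Shape of: torch.arange(target_length) > cache_position.reshape(-1, 1)
comparison_shape = broadcast_shape((target_length,), (cache_position_len, 1))

try:
    broadcast_shape(causal_mask_shape, comparison_shape)
except RuntimeError as err:
    message = str(err)
    print(message)
```

Under these assumptions the sketch raises the same message as the log, which would suggest that the number of tokens passed to the forward call and the length of `cache_position` drifted apart by one during `generate`. Whether that is what actually happens under ZeRO-3 here is not confirmed by the traceback alone.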