PaddleX instance segmentation model reports out-of-memory when training on a custom dataset #2634
Comments
Try changing the batch_size in the yaml. Also, the error message says: "If yes, please stop them, or start PaddlePaddle on another GPU."
I changed epochs_iters from 40 to 10 and batch_size from 2 to 1; it still runs out of memory either way.
Is there another job running on GPU 0? Could you try an idle card?
No, I checked. The first card reaches about 10 GB of memory usage and then reports out of memory.
(base) root@iZuf6ivw3n7qat4vq7x0mdZ:~# nvidia-smi
Which Paddle version are you using?
I just tried training with PaddleX and Paddle 3.0b2 on the example dataset, and it trains normally with about 10 GB of GPU memory used. When you have time, please first follow the training steps in the instance segmentation module tutorial, download the example dataset, and train on it to confirm that works for you.
I followed the instance segmentation module tutorial, downloaded the example dataset, and trained on it; the whole flow works fine. The problem only appears with my custom dataset.
Custom dataset characteristics: image resolutions range from 4024–5440 × 3036–3648, each image contains 1–5 instances, and the dataset has 47 images in total.
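For reference, these statistics can be confirmed directly from the COCO annotation file with a short script like the one below. The annotation path is assumed from the dataset layout in the reproduction command further down; adjust it if your layout differs.

```python
import json
from collections import Counter

# Assumed path, matching the dataset_dir used in the reproduction command below.
ann_path = "../dataset/express_coco_instance_seg/annotations/instance_train.json"

with open(ann_path, "r", encoding="utf-8") as f:
    coco = json.load(f)

widths = [img["width"] for img in coco["images"]]
heights = [img["height"] for img in coco["images"]]
# Instances per annotated image (images without any annotation are not counted here).
per_image = Counter(ann["image_id"] for ann in coco["annotations"])

print(f"images: {len(coco['images'])}")
print(f"width range: {min(widths)}-{max(widths)}, height range: {min(heights)}-{max(heights)}")
print(f"instances per image: {min(per_image.values())}-{max(per_image.values())}")
```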
OK, I'll try resizing. One more question: after resizing, will the positions of the annotated instances be resized accordingly as well?
Please try modifying it first.
The original images in my custom dataset are all very large; none of them are below 1080p, so resizing is the best option. Is there any documentation on modifying the resize settings?
There is no documentation for that.
It may take a while; I will post my conclusions here once I have them.
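For reference, the full log below shows the OOM is raised in _mask_postprocess (F.sigmoid(mask_pred)) during evaluation, where the allocation appears to grow with the original image resolution, so downscaling the source images should reduce peak usage. If the images are downscaled offline rather than in the reader, the COCO annotations need to be rescaled by the same factors. Below is a minimal sketch of that, not part of PaddleX, assuming polygon segmentations, a flat images/ subdirectory, and a hypothetical target long side of 1920 pixels:

```python
import json
import os

from PIL import Image

# Placeholders: adjust to the real dataset layout and desired target size.
DATASET_DIR = "../dataset/express_coco_instance_seg"
ANN_FILE = os.path.join(DATASET_DIR, "annotations", "instance_train.json")
IMG_DIR = os.path.join(DATASET_DIR, "images")            # assumed image subdirectory
OUT_IMG_DIR = os.path.join(DATASET_DIR, "images_resized")
OUT_ANN_FILE = os.path.join(DATASET_DIR, "annotations", "instance_train_resized.json")
TARGET_LONG_SIDE = 1920  # hypothetical target for the longer image side

os.makedirs(OUT_IMG_DIR, exist_ok=True)

with open(ANN_FILE, "r", encoding="utf-8") as f:
    coco = json.load(f)

scales = {}  # image_id -> (sx, sy)
for info in coco["images"]:
    with Image.open(os.path.join(IMG_DIR, info["file_name"])) as im:
        scale = min(1.0, TARGET_LONG_SIDE / max(im.width, im.height))
        new_w, new_h = max(1, round(im.width * scale)), max(1, round(im.height * scale))
        resized = im.resize((new_w, new_h), Image.BILINEAR) if scale < 1.0 else im.copy()
        resized.save(os.path.join(OUT_IMG_DIR, info["file_name"]))
        scales[info["id"]] = (new_w / im.width, new_h / im.height)
    info["width"], info["height"] = new_w, new_h

for ann in coco["annotations"]:
    sx, sy = scales[ann["image_id"]]
    x, y, w, h = ann["bbox"]
    ann["bbox"] = [x * sx, y * sy, w * sx, h * sy]
    ann["area"] = ann["area"] * sx * sy  # approximate area rescale
    # Polygon segmentations only: [[x1, y1, x2, y2, ...], ...]; RLE masks are left untouched.
    if isinstance(ann.get("segmentation"), list):
        ann["segmentation"] = [
            [v * (sx if i % 2 == 0 else sy) for i, v in enumerate(poly)]
            for poly in ann["segmentation"]
        ]

with open(OUT_ANN_FILE, "w", encoding="utf-8") as f:
    json.dump(coco, f)
```

After swapping the resized images and annotations into a copy of the dataset, point Global.dataset_dir at it and rerun check_dataset before training again.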
Checklist:
Describe the problem
Training a custom dataset with PaddleX instance_segmentation reports OOM. Dataset size: 22 MB, 47 images in total.
Server configuration: 4 × NVIDIA Tesla V100 16 GB.
Training the instance_seg_coco_examples dataset on the server works normally.
check_dataset passes on the custom dataset before training.
Training fails with the following error:
Out of memory error on GPU 0. Cannot allocate 7.392883GB memory on GPU 0, 9.610779GB memory has been allocated and available memory is only 6.154968GB.
Please check whether there is any other process using GPU 0.
Reproduction
python main.py -c paddlex/configs/instance_segmentation/Mask-RT-DETR-L.yaml
-o Global.mode=train
-o Global.device=gpu:0,1,2,3
-o Global.dataset_dir=../dataset/express_coco_instance_seg
High-performance inference
Did you follow the high-performance inference documentation tutorial completely and get the workflow running?
Are you using offline activation or online activation?
Serving deployment
Did you follow the serving deployment documentation tutorial completely and get the workflow running?
Are you using the high-performance inference plugin in serving deployment? If so, are you using offline activation or online activation?
If the issue concerns calling from another language, please provide a calling example.
Edge deployment
Did you follow the edge deployment documentation tutorial completely and get the workflow running?
What edge device are you using? What are the corresponding PaddlePaddle and Paddle Lite versions?
Which model and dataset are you using?
Model: instance_segmentation
Dataset: custom dataset
Please provide the error messages and related logs
/root/miniconda3/envs/ocr/lib/python3.10/site-packages/paddle/utils/cpp_extension/extension_utils.py:686: UserWarning: No ccache found. Please be aware that recompiling all source files may be required. You can download and install ccache from: https://github.com/ccache/ccache/blob/master/doc/INSTALL.md
warnings.warn(warning_message)
LAUNCH INFO 2024-12-12 09:55:57,548 ----------- Configuration ----------------------
LAUNCH INFO 2024-12-12 09:55:57,548 auto_cluster_config: 0
LAUNCH INFO 2024-12-12 09:55:57,548 auto_parallel_config: None
LAUNCH INFO 2024-12-12 09:55:57,548 auto_tuner_json: None
LAUNCH INFO 2024-12-12 09:55:57,548 devices: 0,1,2,3
LAUNCH INFO 2024-12-12 09:55:57,548 elastic_level: -1
LAUNCH INFO 2024-12-12 09:55:57,548 elastic_timeout: 30
LAUNCH INFO 2024-12-12 09:55:57,548 enable_gpu_log: True
LAUNCH INFO 2024-12-12 09:55:57,548 gloo_port: 6767
LAUNCH INFO 2024-12-12 09:55:57,548 host: None
LAUNCH INFO 2024-12-12 09:55:57,548 ips: None
LAUNCH INFO 2024-12-12 09:55:57,548 job_id: default
LAUNCH INFO 2024-12-12 09:55:57,548 legacy: False
LAUNCH INFO 2024-12-12 09:55:57,548 log_dir: /root/PaddleX/output/distributed_train_logs
LAUNCH INFO 2024-12-12 09:55:57,548 log_level: INFO
LAUNCH INFO 2024-12-12 09:55:57,548 log_overwrite: False
LAUNCH INFO 2024-12-12 09:55:57,548 master: None
LAUNCH INFO 2024-12-12 09:55:57,548 max_restart: 3
LAUNCH INFO 2024-12-12 09:55:57,548 nnodes: 1
LAUNCH INFO 2024-12-12 09:55:57,549 nproc_per_node: None
LAUNCH INFO 2024-12-12 09:55:57,549 rank: -1
LAUNCH INFO 2024-12-12 09:55:57,549 run_mode: collective
LAUNCH INFO 2024-12-12 09:55:57,549 server_num: None
LAUNCH INFO 2024-12-12 09:55:57,549 servers:
LAUNCH INFO 2024-12-12 09:55:57,549 sort_ip: False
LAUNCH INFO 2024-12-12 09:55:57,549 start_port: 6070
LAUNCH INFO 2024-12-12 09:55:57,549 trainer_num: None
LAUNCH INFO 2024-12-12 09:55:57,549 trainers:
LAUNCH INFO 2024-12-12 09:55:57,549 training_script: tools/train.py
LAUNCH INFO 2024-12-12 09:55:57,549 training_script_args: ['--eval', '--config', '/root/.paddlex/tmpnzorflxb/instancesegmodel_Mask-RT-DETR-L.yml', '--use_vdl', 'True', '--vdl_log_dir', '/root/PaddleX/output']
LAUNCH INFO 2024-12-12 09:55:57,549 with_gloo: 1
LAUNCH INFO 2024-12-12 09:55:57,549 --------------------------------------------------
LAUNCH INFO 2024-12-12 09:55:57,549 Job: default, mode collective, replicas 1[1:1], elastic False
LAUNCH INFO 2024-12-12 09:55:57,552 Run Pod: vlsjqt, replicas 4, status ready
LAUNCH INFO 2024-12-12 09:55:57,651 Watching Pod: vlsjqt, replicas 4, status running
======================= Modified FLAGS detected =======================
FLAGS(name='FLAGS_cudnn_dir', current_value='/root/miniconda3/envs/ocr/lib/python3.10/site-packages/paddle/../nvidia/cudnn/lib', default_value='')
FLAGS(name='FLAGS_nccl_dir', current_value='/root/miniconda3/envs/ocr/lib/python3.10/site-packages/paddle/../nvidia/nccl/lib', default_value='')
FLAGS(name='FLAGS_cusparse_dir', current_value='/root/miniconda3/envs/ocr/lib/python3.10/site-packages/paddle/../nvidia/cusparse/lib', default_value='')
FLAGS(name='FLAGS_cusolver_dir', current_value='/root/miniconda3/envs/ocr/lib/python3.10/site-packages/paddle/../nvidia/cusolver/lib', default_value='')
FLAGS(name='FLAGS_cublas_dir', current_value='/root/miniconda3/envs/ocr/lib/python3.10/site-packages/paddle/../nvidia/cublas/lib', default_value='')
FLAGS(name='FLAGS_selected_gpus', current_value='0', default_value='')
FLAGS(name='FLAGS_enable_pir_api', current_value=False, default_value=True)
FLAGS(name='FLAGS_curand_dir', current_value='/root/miniconda3/envs/ocr/lib/python3.10/site-packages/paddle/../nvidia/curand/lib', default_value='')
FLAGS(name='FLAGS_nvidia_package_dir', current_value='/root/miniconda3/envs/ocr/lib/python3.10/site-packages/paddle/../nvidia', default_value='')
FLAGS(name='FLAGS_cupti_dir', current_value='/root/miniconda3/envs/ocr/lib/python3.10/site-packages/paddle/../nvidia/cuda_cupti/lib', default_value='')
=======================================================================
I1212 09:56:01.623795 331699 tcp_utils.cc:181] The server starts to listen on IP_ANY:50581
I1212 09:56:01.624085 331699 tcp_utils.cc:130] Successfully connected to 10.11.32.133:50581
I1212 09:56:04.711324 331699 process_group_nccl.cc:150] ProcessGroupNCCL pg_timeout_ 1800000
I1212 09:56:04.711365 331699 process_group_nccl.cc:151] ProcessGroupNCCL nccl_comm_init_option_ 0
loading annotations into memory...
Done (t=0.00s)
creating index...
index created!
[12/12 09:56:05] ppdet.data.source.coco INFO: Load [37 samples valid, 0 samples invalid] in file /root/dataset/express_coco_instance_seg/annotations/instance_train.json.
W1212 09:56:05.518914 331699 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 12.4, Runtime API Version: 12.3
W1212 09:56:05.520212 331699 gpu_resources.cc:164] device: 0, cuDNN Version: 9.0.
[12/12 09:56:07] ppdet.utils.checkpoint INFO: The shape [80, 256] in pretrained weight transformer.denoising_class_embed.weight is unmatched with the shape [2, 256] in model transformer.denoising_class_embed.weight. And the weight transformer.denoising_class_embed.weight will not be loaded
[12/12 09:56:07] ppdet.utils.checkpoint INFO: The shape [80] in pretrained weight transformer.score_head.bias is unmatched with the shape [2] in model transformer.score_head.bias. And the weight transformer.score_head.bias will not be loaded
[12/12 09:56:07] ppdet.utils.checkpoint INFO: The shape [256, 80] in pretrained weight transformer.score_head.weight is unmatched with the shape [256, 2] in model transformer.score_head.weight. And the weight transformer.score_head.weight will not be loaded
[12/12 09:56:07] ppdet.utils.checkpoint INFO: Finish loading model weights: /root/.cache/paddle/weights/Mask-RT-DETR-L_pretrained.pdparams
W1212 09:56:12.050602 331699 reducer.cc:733] All parameters are involved in the backward pass. It is recommended to set find_unused_parameters to False to improve performance. However, if unused parameters appear in subsequent iterative training, then an error will occur. Please make it clear that in the subsequent training, there will be no parameters that are not used in the backward pass, and then set find_unused_parameters
[12/12 09:56:12] ppdet.engine.callbacks INFO: Epoch: [0] [ 0/10] learning_rate: 0.000000 loss_class: 0.016525 loss_bbox: 1.174297 loss_giou: 3.149053 loss_mask: 0.405251 loss_dice: 4.808298 loss_class_aux: 1.177263 loss_bbox_aux: 8.399309 loss_giou_aux: 18.635376 loss_mask_aux: 26.235355 loss_dice_aux: 34.775093 loss_class_dn: 7.234075 loss_bbox_dn: 0.185098 loss_giou_dn: 0.602603 loss_mask_dn: 0.201542 loss_dice_dn: 1.712869 loss_class_aux_dn: 41.473408 loss_bbox_aux_dn: 2.603938 loss_giou_aux_dn: 6.139773 loss_mask_aux_dn: 1.318109 loss_dice_aux_dn: 13.829336 loss: 174.076584 eta: 0:06:40 batch_cost: 4.0012 data_cost: 0.1661 ips: 0.2499 images/s, max_mem_reserved: 2028 MB, max_mem_allocated: 1939 MB
[12/12 09:56:22] ppdet.utils.checkpoint INFO: Save checkpoint: /root/PaddleX/output/0
loading annotations into memory...
Done (t=0.00s)
creating index...
index created!
[12/12 09:56:22] ppdet.engine INFO: Export inference config file to /root/PaddleX/output/0/inference/inference.yml
I1212 09:56:37.835078 331699 program_interpreter.cc:242] New Executor is Running.
[12/12 09:56:38] ppdet.engine INFO: Export model and saved in /root/PaddleX/output/0/inference
loading annotations into memory...
Done (t=0.00s)
creating index...
index created!
[12/12 09:56:39] ppdet.data.source.coco INFO: Load [9 samples valid, 0 samples invalid] in file /root/dataset/express_coco_instance_seg/annotations/instance_val.json.
loading annotations into memory...
Done (t=0.00s)
creating index...
index created!
Traceback (most recent call last):
File "/root/PaddleX/paddlex/repo_manager/repos/PaddleDetection/tools/train.py", line 212, in
main()
File "/root/PaddleX/paddlex/repo_manager/repos/PaddleDetection/tools/train.py", line 208, in main
run(FLAGS, cfg)
File "/root/PaddleX/paddlex/repo_manager/repos/PaddleDetection/tools/train.py", line 161, in run
trainer.train(FLAGS.eval)
File "/root/PaddleX/paddlex/repo_manager/repos/PaddleDetection/ppdet/engine/trainer.py", line 685, in train
self._eval_with_loader(self._eval_loader)
File "/root/PaddleX/paddlex/repo_manager/repos/PaddleDetection/ppdet/engine/trainer.py", line 718, in _eval_with_loader
outs = self.model(data)
File "/root/miniconda3/envs/ocr/lib/python3.10/site-packages/paddle/nn/layer/layers.py", line 1532, in call
return self.forward(*inputs, **kwargs)
File "/root/PaddleX/paddlex/repo_manager/repos/PaddleDetection/ppdet/modeling/architectures/meta_arch.py", line 76, in forward
outs.append(self.get_pred())
File "/root/PaddleX/paddlex/repo_manager/repos/PaddleDetection/ppdet/modeling/architectures/detr.py", line 118, in get_pred
return self._forward()
File "/root/PaddleX/paddlex/repo_manager/repos/PaddleDetection/ppdet/modeling/architectures/detr.py", line 105, in _forward
bbox, bbox_num, mask = self.post_process(
File "/root/PaddleX/paddlex/repo_manager/repos/PaddleDetection/ppdet/modeling/post_process.py", line 574, in call
mask_pred, scores = self._mask_postprocess(masks, scores)
File "/root/PaddleX/paddlex/repo_manager/repos/PaddleDetection/ppdet/modeling/post_process.py", line 481, in _mask_postprocess
mask_score = F.sigmoid(mask_pred)
File "/root/miniconda3/envs/ocr/lib/python3.10/site-packages/paddle/tensor/ops.py", line 815, in sigmoid
return _C_ops.sigmoid(x)
MemoryError:
C++ Traceback (most recent call last):
0 paddle::pybind::eager_api_sigmoid(_object*, _object*, _object*)
1 sigmoid_ad_func(paddle::Tensor const&)
2 paddle::experimental::sigmoid(paddle::Tensor const&)
3 phi::KernelImpl<void (*)(phi::GPUContext const&, phi::DenseTensor const&, phi::DenseTensor*), &(void phi::SigmoidKernel<float, phi::GPUContext>(phi::GPUContext const&, phi::DenseTensor const&, phi::DenseTensor*))>::VariadicCompute(phi::DeviceContext const&, phi::DenseTensor const&, phi::DenseTensor*)
4 void phi::ActivationGPUImpl<float, phi::GPUContext, phi::funcs::CudaSigmoidFunctor<float>>(phi::GPUContext const&, phi::DenseTensor const&, phi::DenseTensor*, phi::funcs::CudaSigmoidFunctor<float> const&)
5 float* phi::DeviceContext::Alloc(phi::TensorBase*, unsigned long, bool) const
6 phi::DenseTensor::AllocateFrom(phi::Allocator*, phi::DataType, unsigned long, bool)
7 paddle::memory::allocation::Allocator::Allocate(unsigned long)
8 paddle::memory::allocation::StatAllocator::AllocateImpl(unsigned long)
LAUNCH INFO 2024-12-12 09:56:52,708 Pod failed
LAUNCH ERROR 2024-12-12 09:56:52,709 Container failed !!!
Container rank 0 status failed cmd ['/root/miniconda3/envs/ocr/bin/python', '-u', 'tools/train.py', '--eval', '--config', '/root/.paddlex/tmpnzorflxb/instancesegmodel_Mask-RT-DETR-L.yml', '--use_vdl', 'True', '--vdl_log_dir', '/root/PaddleX/output'] code 1 log /root/PaddleX/output/distributed_train_logs/workerlog.0
LAUNCH INFO 2024-12-12 09:56:52,709 ------------------------- ERROR LOG DETAIL -------------------------
LAUNCH INFO 2024-12-12 09:56:54,512 Exit code 1
9 paddle::memory::allocation::Allocator::Allocate(unsigned long)
10 paddle::memory::allocation::Allocator::Allocate(unsigned long)
11 std::string phi::enforce::GetCompleteTraceBackString<std::string >(std::string&&, char const*, int)
12 common::enforce::GetCurrentTraceBackString[abi:cxx11]
Error Message Summary:
ResourceExhaustedError:
Out of memory error on GPU 0. Cannot allocate 7.392883GB memory on GPU 0, 9.610779GB memory has been allocated and available memory is only 6.154968GB.
Please check whether there is any other process using GPU 0.
(at ../paddle/phi/core/memory/allocation/cuda_allocator.cc:71)
I1212 09:56:51.151332 331699 process_group_nccl.cc:155] ProcessGroupNCCL destruct
I1212 09:56:51.404326 331753 tcp_store.cc:290] receive shutdown event and so quit from MasterDaemon run loop
rch.py", line 76, in forward
outs.append(self.get_pred())
File "/root/PaddleX/paddlex/repo_manager/repos/PaddleDetection/ppdet/modeling/architectures/detr.py", line 118, in get_pred
return self._forward()
File "/root/PaddleX/paddlex/repo_manager/repos/PaddleDetection/ppdet/modeling/architectures/detr.py", line 105, in _forward
bbox, bbox_num, mask = self.post_process(
File "/root/PaddleX/paddlex/repo_manager/repos/PaddleDetection/ppdet/modeling/post_process.py", line 574, in call
mask_pred, scores = self._mask_postprocess(masks, scores)
File "/root/PaddleX/paddlex/repo_manager/repos/PaddleDetection/ppdet/modeling/post_process.py", line 481, in _mask_postprocess
mask_score = F.sigmoid(mask_pred)
File "/root/miniconda3/envs/ocr/lib/python3.10/site-packages/paddle/tensor/ops.py", line 815, in sigmoid
return _C_ops.sigmoid(x)
MemoryError:
C++ Traceback (most recent call last):
0 paddle::pybind::eager_api_sigmoid(_object*, _object*, _object*)
1 sigmoid_ad_func(paddle::Tensor const&)
2 paddle::experimental::sigmoid(paddle::Tensor const&)
3 phi::KernelImpl<void (*)(phi::GPUContext const&, phi::DenseTensor const&, phi::DenseTensor*), &(void phi::SigmoidKernel<float, phi::GPUContext>(phi::GPUContext const&, phi::DenseTensor const&, phi::DenseTensor*))>::VariadicCompute(phi::DeviceContext const&, phi::DenseTensor const&, phi::DenseTensor*)
4 void phi::ActivationGPUImpl<float, phi::GPUContext, phi::funcs::CudaSigmoidFunctor<float>>(phi::GPUContext const&, phi::DenseTensor const&, phi::DenseTensor*, phi::funcs::CudaSigmoidFunctor<float> const&)
5 float* phi::DeviceContext::Alloc(phi::TensorBase*, unsigned long, bool) const
6 phi::DenseTensor::AllocateFrom(phi::Allocator*, phi::DataType, unsigned long, bool)
7 paddle::memory::allocation::Allocator::Allocate(unsigned long)
8 paddle::memory::allocation::StatAllocator::AllocateImpl(unsigned long)
9 paddle::memory::allocation::Allocator::Allocate(unsigned long)
10 paddle::memory::allocation::Allocator::Allocate(unsigned long)
11 std::string phi::enforce::GetCompleteTraceBackString<std::string >(std::string&&, char const*, int)
12 common::enforce::GetCurrentTraceBackString[abi:cxx11]
Error Message Summary:
ResourceExhaustedError:
Out of memory error on GPU 0. Cannot allocate 7.392883GB memory on GPU 0, 9.610779GB memory has been allocated and available memory is only 6.154968GB.
Please check whether there is any other process using GPU 0.
(at ../paddle/phi/core/memory/allocation/cuda_allocator.cc:71)
I1212 09:56:51.151332 331699 process_group_nccl.cc:155] ProcessGroupNCCL destruct
I1212 09:56:51.404326 331753 tcp_store.cc:290] receive shutdown event and so quit from MasterDaemon run loop
Traceback (most recent call last):
File "/root/PaddleX/paddlex/utils/result_saver.py", line 29, in wrap
result = func(self, *args, **kwargs)
File "/root/PaddleX/paddlex/engine.py", line 41, in run
self._model.train()
File "/root/PaddleX/paddlex/model.py", line 94, in train
trainer.train()
File "/root/PaddleX/paddlex/modules/base/trainer.py", line 71, in train
train_result = self.pdx_model.train(**train_args)
File "/root/PaddleX/paddlex/repo_apis/PaddleDetection_api/instance_seg/model.py", line 137, in train
return self.runner.train(
File "/root/PaddleX/paddlex/repo_apis/PaddleDetection_api/instance_seg/runner.py", line 55, in train
return self.run_cmd(
File "/root/PaddleX/paddlex/repo_apis/base/runner.py", line 355, in run_cmd
raise CalledProcessError(
paddlex.utils.errors.others.CalledProcessError: Command ['/root/miniconda3/envs/ocr/bin/python', '-m', 'paddle.distributed.launch', '--devices', '0,1,2,3', '--log_dir', '/root/PaddleX/output/distributed_train_logs', 'tools/train.py', '--eval', '--config', '/root/.paddlex/tmpnzorflxb/instancesegmodel_Mask-RT-DETR-L.yml', '--use_vdl', 'True', '--vdl_log_dir', '/root/PaddleX/output'] returned non-zero exit status 1.
Environment
Please provide the PaddlePaddle version, PaddleX version, and Python version you are using
PaddlePaddle and PaddleX version: 3.0.0b2; Python version: 3.10
Please provide your operating system information, e.g. Linux/Windows/macOS
Linux, Ubuntu 22.04
Which CUDA/cuDNN versions are you using?
CUDA 12.3