PaddleX instance segmentation model reports out-of-memory when training on a custom dataset #2634

Open

yzyMichael opened this issue Dec 12, 2024 · 19 comments

@yzyMichael

yzyMichael commented Dec 12, 2024

Checklist:

Describe the problem

Training a custom dataset with PaddleX instance_segmentation reports OOM. Dataset size: 22 MB, 47 images in total.
Server configuration: 4 × NVIDIA Tesla V100 16 GB
Training the instance_seg_coco_examples dataset on the same server works fine.
check_dataset on the custom dataset passes before training.

Training fails with the following error:
Out of memory error on GPU 0. Cannot allocate 7.392883GB memory on GPU 0, 9.610779GB memory has been allocated and available memory is only 6.154968GB.

Please check whether there is any other process using GPU 0.

Reproduction

python main.py -c paddlex/configs/instance_segmentation/Mask-RT-DETR-L.yaml \
    -o Global.mode=train \
    -o Global.device=gpu:0,1,2,3 \
    -o Global.dataset_dir=../dataset/express_coco_instance_seg

  1. High-performance inference

  2. Serving deployment

    • Did you follow the serving deployment documentation end to end?

    • Did you use the high-performance inference plugin for serving deployment? If so, did you use offline activation or online activation?

    • If the issue concerns calling from another language, please provide a calling example.

  3. Edge deployment

    • Did you follow the edge deployment documentation end to end?

    • Which edge device are you using? What are the corresponding PaddlePaddle and PaddleLite versions?

  4. Which model and dataset are you using?
    Model: instance_segmentation
    Dataset: custom dataset

  5. Please provide the error message and relevant logs
    /root/miniconda3/envs/ocr/lib/python3.10/site-packages/paddle/utils/cpp_extension/extension_utils.py:686: UserWarning: No ccache found. Please be aware that recompiling all source files may be required. You can download and install ccache from: https://github.com/ccache/ccache/blob/master/doc/INSTALL.md
    warnings.warn(warning_message)
    LAUNCH INFO 2024-12-12 09:55:57,548 ----------- Configuration ----------------------
    LAUNCH INFO 2024-12-12 09:55:57,548 auto_cluster_config: 0
    LAUNCH INFO 2024-12-12 09:55:57,548 auto_parallel_config: None
    LAUNCH INFO 2024-12-12 09:55:57,548 auto_tuner_json: None
    LAUNCH INFO 2024-12-12 09:55:57,548 devices: 0,1,2,3
    LAUNCH INFO 2024-12-12 09:55:57,548 elastic_level: -1
    LAUNCH INFO 2024-12-12 09:55:57,548 elastic_timeout: 30
    LAUNCH INFO 2024-12-12 09:55:57,548 enable_gpu_log: True
    LAUNCH INFO 2024-12-12 09:55:57,548 gloo_port: 6767
    LAUNCH INFO 2024-12-12 09:55:57,548 host: None
    LAUNCH INFO 2024-12-12 09:55:57,548 ips: None
    LAUNCH INFO 2024-12-12 09:55:57,548 job_id: default
    LAUNCH INFO 2024-12-12 09:55:57,548 legacy: False
    LAUNCH INFO 2024-12-12 09:55:57,548 log_dir: /root/PaddleX/output/distributed_train_logs
    LAUNCH INFO 2024-12-12 09:55:57,548 log_level: INFO
    LAUNCH INFO 2024-12-12 09:55:57,548 log_overwrite: False
    LAUNCH INFO 2024-12-12 09:55:57,548 master: None
    LAUNCH INFO 2024-12-12 09:55:57,548 max_restart: 3
    LAUNCH INFO 2024-12-12 09:55:57,548 nnodes: 1
    LAUNCH INFO 2024-12-12 09:55:57,549 nproc_per_node: None
    LAUNCH INFO 2024-12-12 09:55:57,549 rank: -1
    LAUNCH INFO 2024-12-12 09:55:57,549 run_mode: collective
    LAUNCH INFO 2024-12-12 09:55:57,549 server_num: None
    LAUNCH INFO 2024-12-12 09:55:57,549 servers:
    LAUNCH INFO 2024-12-12 09:55:57,549 sort_ip: False
    LAUNCH INFO 2024-12-12 09:55:57,549 start_port: 6070
    LAUNCH INFO 2024-12-12 09:55:57,549 trainer_num: None
    LAUNCH INFO 2024-12-12 09:55:57,549 trainers:
    LAUNCH INFO 2024-12-12 09:55:57,549 training_script: tools/train.py
    LAUNCH INFO 2024-12-12 09:55:57,549 training_script_args: ['--eval', '--config', '/root/.paddlex/tmpnzorflxb/instancesegmodel_Mask-RT-DETR-L.yml', '--use_vdl', 'True', '--vdl_log_dir', '/root/PaddleX/output']
    LAUNCH INFO 2024-12-12 09:55:57,549 with_gloo: 1
    LAUNCH INFO 2024-12-12 09:55:57,549 --------------------------------------------------
    LAUNCH INFO 2024-12-12 09:55:57,549 Job: default, mode collective, replicas 1[1:1], elastic False
    LAUNCH INFO 2024-12-12 09:55:57,552 Run Pod: vlsjqt, replicas 4, status ready
    LAUNCH INFO 2024-12-12 09:55:57,651 Watching Pod: vlsjqt, replicas 4, status running
    ======================= Modified FLAGS detected =======================
    FLAGS(name='FLAGS_cudnn_dir', current_value='/root/miniconda3/envs/ocr/lib/python3.10/site-packages/paddle/../nvidia/cudnn/lib', default_value='')
    FLAGS(name='FLAGS_nccl_dir', current_value='/root/miniconda3/envs/ocr/lib/python3.10/site-packages/paddle/../nvidia/nccl/lib', default_value='')
    FLAGS(name='FLAGS_cusparse_dir', current_value='/root/miniconda3/envs/ocr/lib/python3.10/site-packages/paddle/../nvidia/cusparse/lib', default_value='')
    FLAGS(name='FLAGS_cusolver_dir', current_value='/root/miniconda3/envs/ocr/lib/python3.10/site-packages/paddle/../nvidia/cusolver/lib', default_value='')
    FLAGS(name='FLAGS_cublas_dir', current_value='/root/miniconda3/envs/ocr/lib/python3.10/site-packages/paddle/../nvidia/cublas/lib', default_value='')
    FLAGS(name='FLAGS_selected_gpus', current_value='0', default_value='')
    FLAGS(name='FLAGS_enable_pir_api', current_value=False, default_value=True)
    FLAGS(name='FLAGS_curand_dir', current_value='/root/miniconda3/envs/ocr/lib/python3.10/site-packages/paddle/../nvidia/curand/lib', default_value='')
    FLAGS(name='FLAGS_nvidia_package_dir', current_value='/root/miniconda3/envs/ocr/lib/python3.10/site-packages/paddle/../nvidia', default_value='')
    FLAGS(name='FLAGS_cupti_dir', current_value='/root/miniconda3/envs/ocr/lib/python3.10/site-packages/paddle/../nvidia/cuda_cupti/lib', default_value='')
    =======================================================================
    I1212 09:56:01.623795 331699 tcp_utils.cc:181] The server starts to listen on IP_ANY:50581
    I1212 09:56:01.624085 331699 tcp_utils.cc:130] Successfully connected to 10.11.32.133:50581
    I1212 09:56:04.711324 331699 process_group_nccl.cc:150] ProcessGroupNCCL pg_timeout_ 1800000
    I1212 09:56:04.711365 331699 process_group_nccl.cc:151] ProcessGroupNCCL nccl_comm_init_option_ 0
    loading annotations into memory...
    Done (t=0.00s)
    creating index...
    index created!
    [12/12 09:56:05] ppdet.data.source.coco INFO: Load [37 samples valid, 0 samples invalid] in file /root/dataset/express_coco_instance_seg/annotations/instance_train.json.
    W1212 09:56:05.518914 331699 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 12.4, Runtime API Version: 12.3
    W1212 09:56:05.520212 331699 gpu_resources.cc:164] device: 0, cuDNN Version: 9.0.
    [12/12 09:56:07] ppdet.utils.checkpoint INFO: The shape [80, 256] in pretrained weight transformer.denoising_class_embed.weight is unmatched with the shape [2, 256] in model transformer.denoising_class_embed.weight. And the weight transformer.denoising_class_embed.weight will not be loaded
    [12/12 09:56:07] ppdet.utils.checkpoint INFO: The shape [80] in pretrained weight transformer.score_head.bias is unmatched with the shape [2] in model transformer.score_head.bias. And the weight transformer.score_head.bias will not be loaded
    [12/12 09:56:07] ppdet.utils.checkpoint INFO: The shape [256, 80] in pretrained weight transformer.score_head.weight is unmatched with the shape [256, 2] in model transformer.score_head.weight. And the weight transformer.score_head.weight will not be loaded
    [12/12 09:56:07] ppdet.utils.checkpoint INFO: Finish loading model weights: /root/.cache/paddle/weights/Mask-RT-DETR-L_pretrained.pdparams
    W1212 09:56:12.050602 331699 reducer.cc:733] All parameters are involved in the backward pass. It is recommended to set find_unused_parameters to False to improve performance. However, if unused parameters appear in subsequent iterative training, then an error will occur. Please make it clear that in the subsequent training, there will be no parameters that are not used in the backward pass, and then set find_unused_parameters
    [12/12 09:56:12] ppdet.engine.callbacks INFO: Epoch: [0] [ 0/10] learning_rate: 0.000000 loss_class: 0.016525 loss_bbox: 1.174297 loss_giou: 3.149053 loss_mask: 0.405251 loss_dice: 4.808298 loss_class_aux: 1.177263 loss_bbox_aux: 8.399309 loss_giou_aux: 18.635376 loss_mask_aux: 26.235355 loss_dice_aux: 34.775093 loss_class_dn: 7.234075 loss_bbox_dn: 0.185098 loss_giou_dn: 0.602603 loss_mask_dn: 0.201542 loss_dice_dn: 1.712869 loss_class_aux_dn: 41.473408 loss_bbox_aux_dn: 2.603938 loss_giou_aux_dn: 6.139773 loss_mask_aux_dn: 1.318109 loss_dice_aux_dn: 13.829336 loss: 174.076584 eta: 0:06:40 batch_cost: 4.0012 data_cost: 0.1661 ips: 0.2499 images/s, max_mem_reserved: 2028 MB, max_mem_allocated: 1939 MB
    [12/12 09:56:22] ppdet.utils.checkpoint INFO: Save checkpoint: /root/PaddleX/output/0
    loading annotations into memory...
    Done (t=0.00s)
    creating index...
    index created!
    [12/12 09:56:22] ppdet.engine INFO: Export inference config file to /root/PaddleX/output/0/inference/inference.yml
    I1212 09:56:37.835078 331699 program_interpreter.cc:242] New Executor is Running.
    [12/12 09:56:38] ppdet.engine INFO: Export model and saved in /root/PaddleX/output/0/inference
    loading annotations into memory...
    Done (t=0.00s)
    creating index...
    index created!
    [12/12 09:56:39] ppdet.data.source.coco INFO: Load [9 samples valid, 0 samples invalid] in file /root/dataset/express_coco_instance_seg/annotations/instance_val.json.
    loading annotations into memory...
    Done (t=0.00s)
    creating index...
    index created!
    Traceback (most recent call last):
      File "/root/PaddleX/paddlex/repo_manager/repos/PaddleDetection/tools/train.py", line 212, in <module>
        main()
      File "/root/PaddleX/paddlex/repo_manager/repos/PaddleDetection/tools/train.py", line 208, in main
        run(FLAGS, cfg)
      File "/root/PaddleX/paddlex/repo_manager/repos/PaddleDetection/tools/train.py", line 161, in run
        trainer.train(FLAGS.eval)
      File "/root/PaddleX/paddlex/repo_manager/repos/PaddleDetection/ppdet/engine/trainer.py", line 685, in train
        self._eval_with_loader(self._eval_loader)
      File "/root/PaddleX/paddlex/repo_manager/repos/PaddleDetection/ppdet/engine/trainer.py", line 718, in _eval_with_loader
        outs = self.model(data)
      File "/root/miniconda3/envs/ocr/lib/python3.10/site-packages/paddle/nn/layer/layers.py", line 1532, in __call__
        return self.forward(*inputs, **kwargs)
      File "/root/PaddleX/paddlex/repo_manager/repos/PaddleDetection/ppdet/modeling/architectures/meta_arch.py", line 76, in forward
        outs.append(self.get_pred())
      File "/root/PaddleX/paddlex/repo_manager/repos/PaddleDetection/ppdet/modeling/architectures/detr.py", line 118, in get_pred
        return self._forward()
      File "/root/PaddleX/paddlex/repo_manager/repos/PaddleDetection/ppdet/modeling/architectures/detr.py", line 105, in _forward
        bbox, bbox_num, mask = self.post_process(
      File "/root/PaddleX/paddlex/repo_manager/repos/PaddleDetection/ppdet/modeling/post_process.py", line 574, in __call__
        mask_pred, scores = self._mask_postprocess(masks, scores)
      File "/root/PaddleX/paddlex/repo_manager/repos/PaddleDetection/ppdet/modeling/post_process.py", line 481, in _mask_postprocess
        mask_score = F.sigmoid(mask_pred)
      File "/root/miniconda3/envs/ocr/lib/python3.10/site-packages/paddle/tensor/ops.py", line 815, in sigmoid
        return _C_ops.sigmoid(x)
    MemoryError:


C++ Traceback (most recent call last):

0 paddle::pybind::eager_api_sigmoid(_object*, _object*, _object*)
1 sigmoid_ad_func(paddle::Tensor const&)
2 paddle::experimental::sigmoid(paddle::Tensor const&)
3 phi::KernelImpl<void ()(phi::GPUContext const&, phi::DenseTensor const&, phi::DenseTensor), &(void phi::SigmoidKernel<float, phi::GPUContext>(phi::GPUContext const&, phi::DenseTensor const&, phi::DenseTensor*))>::VariadicCompute(phi::DeviceContext const&, phi::DenseTensor const&, phi::DenseTensor*)
4 void phi::ActivationGPUImpl<float, phi::GPUContext, phi::funcs::CudaSigmoidFunctor >(phi::GPUContext const&, phi::DenseTensor const&, phi::DenseTensor*, phi::funcs::CudaSigmoidFunctor const&)
5 float* phi::DeviceContext::Alloc(phi::TensorBase*, unsigned long, bool) const
6 phi::DenseTensor::AllocateFrom(phi::Allocator*, phi::DataType, unsigned long, bool)
7 paddle::memory::allocation::Allocator::Allocate(unsigned long)
8 paddle::memory::allocation::StatAllocator::AllocateImpl(unsigned long)
9 paddle::memory::allocation::Allocator::Allocate(unsigned long)
10 paddle::memory::allocation::Allocator::Allocate(unsigned long)
11 std::string phi::enforce::GetCompleteTraceBackString<std::string >(std::string&&, char const*, int)
12 common::enforce::GetCurrentTraceBackStringabi:cxx11

Error Message Summary:

ResourceExhaustedError:

Out of memory error on GPU 0. Cannot allocate 7.392883GB memory on GPU 0, 9.610779GB memory has been allocated and available memory is only 6.154968GB.

Please check whether there is any other process using GPU 0.

  1. If yes, please stop them, or start PaddlePaddle on another GPU.
  2. If no, please decrease the batch size of your model.
    (at ../paddle/phi/core/memory/allocation/cuda_allocator.cc:71)

I1212 09:56:51.151332 331699 process_group_nccl.cc:155] ProcessGroupNCCL destruct
I1212 09:56:51.404326 331753 tcp_store.cc:290] receive shutdown event and so quit from MasterDaemon run loop
LAUNCH INFO 2024-12-12 09:56:52,708 Pod failed
LAUNCH ERROR 2024-12-12 09:56:52,709 Container failed !!!
Container rank 0 status failed cmd ['/root/miniconda3/envs/ocr/bin/python', '-u', 'tools/train.py', '--eval', '--config', '/root/.paddlex/tmpnzorflxb/instancesegmodel_Mask-RT-DETR-L.yml', '--use_vdl', 'True', '--vdl_log_dir', '/root/PaddleX/output'] code 1 log /root/PaddleX/output/distributed_train_logs/workerlog.0
LAUNCH INFO 2024-12-12 09:56:52,709 ------------------------- ERROR LOG DETAIL -------------------------
LAUNCH INFO 2024-12-12 09:56:54,512 Exit code 1

Traceback (most recent call last):
  File "/root/PaddleX/paddlex/utils/result_saver.py", line 29, in wrap
    result = func(self, *args, **kwargs)
  File "/root/PaddleX/paddlex/engine.py", line 41, in run
    self._model.train()
  File "/root/PaddleX/paddlex/model.py", line 94, in train
    trainer.train()
  File "/root/PaddleX/paddlex/modules/base/trainer.py", line 71, in train
    train_result = self.pdx_model.train(**train_args)
  File "/root/PaddleX/paddlex/repo_apis/PaddleDetection_api/instance_seg/model.py", line 137, in train
    return self.runner.train(
  File "/root/PaddleX/paddlex/repo_apis/PaddleDetection_api/instance_seg/runner.py", line 55, in train
    return self.run_cmd(
  File "/root/PaddleX/paddlex/repo_apis/base/runner.py", line 355, in run_cmd
    raise CalledProcessError(
paddlex.utils.errors.others.CalledProcessError: Command ['/root/miniconda3/envs/ocr/bin/python', '-m', 'paddle.distributed.launch', '--devices', '0,1,2,3', '--log_dir', '/root/PaddleX/output/distributed_train_logs', 'tools/train.py', '--eval', '--config', '/root/.paddlex/tmpnzorflxb/instancesegmodel_Mask-RT-DETR-L.yml', '--use_vdl', 'True', '--vdl_log_dir', '/root/PaddleX/output'] returned non-zero exit status 1.

Environment

  1. Please provide the PaddlePaddle version, PaddleX version, and Python version you are using
    PaddlePaddle and PaddleX version: 3.0.0b2; Python version: 3.10

  2. Please provide your operating system (Linux/Windows/macOS)
    Linux, Ubuntu 22.04

  3. Which CUDA/cuDNN versions are you using?
    CUDA 12.3

@188080501

188080501 commented Dec 12, 2024

Try changing batch_size in the yaml.

The error message does say:
Please check whether there is any other process using GPU 0.

If yes, please stop them, or start PaddlePaddle on another GPU.
If no, please decrease the batch size of your model.

@yzyMichael
Author

I changed epochs_iters from 40 to 10 and batch_size from 2 to 1; it still runs out of memory either way.

@zhang-prog
Collaborator

Is there another job running on GPU 0? Could you try an idle card?

@yzyMichael
Author

No, I checked. The first card's memory usage reaches about 10 GB and then it reports out of memory.

@yzyMichael
Author

(base) root@iZuf6ivw3n7qat4vq7x0mdZ:~# nvidia-smi
Thu Dec 12 13:13:49 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.120 Driver Version: 550.120 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 Tesla V100-SXM2-16GB Off | 00000000:00:07.0 Off | 0 |
| N/A 39C P0 66W / 300W | 9842MiB / 16384MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 Tesla V100-SXM2-16GB Off | 00000000:00:08.0 Off | 0 |
| N/A 35C P0 68W / 300W | 5728MiB / 16384MiB | 100% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 2 Tesla V100-SXM2-16GB Off | 00000000:00:09.0 Off | 0 |
| N/A 38C P0 69W / 300W | 4998MiB / 16384MiB | 100% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 3 Tesla V100-SXM2-16GB Off | 00000000:00:0A.0 Off | 0 |
| N/A 38C P0 68W / 300W | 5872MiB / 16384MiB | 100% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 333417 C /root/miniconda3/envs/ocr/bin/python 9838MiB |
| 1 N/A N/A 333419 C /root/miniconda3/envs/ocr/bin/python 5724MiB |
| 2 N/A N/A 333422 C /root/miniconda3/envs/ocr/bin/python 4994MiB |
| 3 N/A N/A 333424 C /root/miniconda3/envs/ocr/bin/python 5868MiB |
+-----------------------------------------------------------------------------------------+
(base) root@iZuf6ivw3n7qat4vq7x0mdZ:~# nvidia-smi
Thu Dec 12 13:13:51 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.120 Driver Version: 550.120 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 Tesla V100-SXM2-16GB Off | 00000000:00:07.0 Off | 0 |
| N/A 40C P0 71W / 300W | 192MiB / 16384MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 Tesla V100-SXM2-16GB Off | 00000000:00:08.0 Off | 0 |
| N/A 35C P0 68W / 300W | 5728MiB / 16384MiB | 100% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 2 Tesla V100-SXM2-16GB Off | 00000000:00:09.0 Off | 0 |
| N/A 38C P0 69W / 300W | 4998MiB / 16384MiB | 100% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 3 Tesla V100-SXM2-16GB Off | 00000000:00:0A.0 Off | 0 |
| N/A 38C P0 68W / 300W | 5872MiB / 16384MiB | 100% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 1 N/A N/A 333419 C /root/miniconda3/envs/ocr/bin/python 5724MiB |
| 2 N/A N/A 333422 C /root/miniconda3/envs/ocr/bin/python 4994MiB |
| 3 N/A N/A 333424 C /root/miniconda3/envs/ocr/bin/python 5868MiB |
+-----------------------------------------------------------------------------------------+
(base) root@iZuf6ivw3n7qat4vq7x0mdZ:~# nvidia-smi
Thu Dec 12 13:13:53 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.120 Driver Version: 550.120 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 Tesla V100-SXM2-16GB Off | 00000000:00:07.0 Off | 0 |
| N/A 40C P0 72W / 300W | 76MiB / 16384MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 Tesla V100-SXM2-16GB Off | 00000000:00:08.0 Off | 0 |
| N/A 35C P0 67W / 300W | 56MiB / 16384MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 2 Tesla V100-SXM2-16GB Off | 00000000:00:09.0 Off | 0 |
| N/A 38C P0 69W / 300W | 72MiB / 16384MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 3 Tesla V100-SXM2-16GB Off | 00000000:00:0A.0 Off | 0 |
| N/A 38C P0 68W / 300W | 5872MiB / 16384MiB | 100% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
+-----------------------------------------------------------------------------------------+

@yzyMichael
Author

[screenshot]

@yzyMichael
Author

[screenshots]

@cuicheng01
Collaborator

Which Paddle version are you using?

@yzyMichael
Author

[screenshot]

@zhang-prog
Collaborator

I just tried training with PaddleX and Paddle 3.0b2 on the example dataset; it trains normally and uses about 10 GB of GPU memory:

[screenshots]

When you have time, please follow the training steps in the instance segmentation module tutorial, download the example dataset, and train on it to confirm that works normally.

@yzyMichael
Author

I followed the instance segmentation module tutorial, downloaded the example dataset, and trained on it; the whole flow works fine. The problem only appears when I use my custom dataset.

@yzyMichael
Author

yzyMichael commented Dec 16, 2024

Characteristics of the custom dataset: each image's resolution is between 4024–5440 × 3036–3648, each image contains 1–5 instances, and the dataset has 47 images in total.

@zhang-prog
Collaborator

The input images may be too large. You could resize the custom dataset, or try modifying BatchRandomResize and removing the target sizes larger than 640.

[screenshot]
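
For reference, a minimal sketch of that edit, assuming the training reader config keeps the transform under TrainReader -> batch_transforms as a BatchRandomResize entry with a target_size list (the usual layout in PaddleDetection RT-DETR reader configs); the config path is a placeholder and should point at the yml your run actually uses:

# Hedged sketch: drop BatchRandomResize target sizes above 640 in the training
# reader config. The key layout below is an assumption; inspect the generated
# yml first and adjust accordingly.
import yaml

CONFIG = "path/to/reader_config.yml"  # placeholder, point at the actual yml

with open(CONFIG, "r", encoding="utf-8") as f:
    cfg = yaml.safe_load(f)

for transform in cfg.get("TrainReader", {}).get("batch_transforms", []):
    if "BatchRandomResize" in transform:
        sizes = transform["BatchRandomResize"].get("target_size", [])
        transform["BatchRandomResize"]["target_size"] = [s for s in sizes if s <= 640]

with open(CONFIG, "w", encoding="utf-8") as f:
    yaml.safe_dump(cfg, f, sort_keys=False)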

@yzyMichael
Author

OK, I will try resizing. One more question: after resizing, will the positions of the annotated instances be resized accordingly?

@zhang-prog
Collaborator

Try modifying BatchRandomResize first and see whether that works; resizing the dataset yourself also means resizing the annotations, which is a bit of a hassle.

@yzyMichael
Author

The original images in my custom dataset are all very large; none are below 1080p, so resizing is the best option. Is there any documentation on modifying the resize settings?

@zhang-prog
Collaborator

There is no documentation for that.
Give me a moment; I will first check whether BatchRandomResize is failing to take effect so that the full-size images are being fed into training, since GPU memory usage now seems to scale with the input image size.

@zhang-prog
Collaborator

This may take a while; I will post the conclusion here once I have one.

@zhang-prog
Collaborator

BatchRandomResize is working correctly, and training itself is fine.
The problem occurs in post-processing when the masks are produced, which uses the original image resolution. Because your images are so large, this leads to the OOM.

[screenshot]
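
A rough back-of-envelope check shows the scale involved (the 5440 × 3648 resolution comes from the dataset description above; the number of kept masks and the float32 dtype are assumptions):

# Rough estimate only: assumes ~100 predicted masks are upsampled to the
# original image size and held in float32 when sigmoid is applied.
h, w = 3648, 5440            # largest image size reported for the custom dataset
num_masks = 100              # hypothetical number of predictions kept per image
bytes_per_float32 = 4
print(num_masks * h * w * bytes_per_float32 / 1024**3)  # ~7.39 GiB for one tensor

That lines up with the 7.392883GB allocation reported in the log, which is consistent with the failure being driven by the original image resolution rather than by the training batch size.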

Also, training directly on such ultra-high-resolution images is not very reasonable, so here are a few options for reference:

  1. Resize the dataset together with its annotations; you need to handle this yourself (a rough sketch is given after this list).
  2. Use a smaller (lower-resolution) dataset.
  3. Use a GPU with more memory.
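
For option 1, a minimal sketch of downscaling the images and the COCO annotations together, assuming polygon-style segmentations and the standard COCO keys; all paths, the MAX_SIDE value, and the interpolation choice are placeholders to adapt:

# Hedged sketch: downscale a COCO instance-segmentation dataset together with
# its annotations. Assumes polygon segmentations (not RLE) and standard COCO keys.
import json
from pathlib import Path

from PIL import Image

SRC_IMG_DIR = Path("images")             # original images (placeholder)
DST_IMG_DIR = Path("images_resized")     # resized images (placeholder)
ANN_IN = "annotations/instance_train.json"
ANN_OUT = "annotations/instance_train_resized.json"
MAX_SIDE = 1333                          # longest side after resizing

DST_IMG_DIR.mkdir(parents=True, exist_ok=True)
with open(ANN_IN, "r", encoding="utf-8") as f:
    coco = json.load(f)

# Resize the images and record the per-image scale factor.
scale_by_image = {}
for img in coco["images"]:
    scale = min(1.0, MAX_SIDE / max(img["width"], img["height"]))
    scale_by_image[img["id"]] = scale
    new_w, new_h = round(img["width"] * scale), round(img["height"] * scale)
    Image.open(SRC_IMG_DIR / img["file_name"]).resize(
        (new_w, new_h), Image.BILINEAR
    ).save(DST_IMG_DIR / img["file_name"])
    img["width"], img["height"] = new_w, new_h

# Scale the annotations by the same factor (uniform scaling keeps x and y consistent).
for ann in coco["annotations"]:
    s = scale_by_image[ann["image_id"]]
    ann["bbox"] = [v * s for v in ann["bbox"]]   # [x, y, w, h]
    ann["area"] = ann["area"] * s * s
    ann["segmentation"] = [[v * s for v in poly] for poly in ann["segmentation"]]

with open(ANN_OUT, "w", encoding="utf-8") as f:
    json.dump(coco, f)

After resizing, point Global.dataset_dir at the resized copy and rerun check_dataset before training.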
