Error during multi-GPU training #2712

Open
CashBai opened this issue Dec 23, 2024 · 1 comment

CashBai commented Dec 23, 2024

Platform: Windows 10.
The cmd output recorded during training is as follows:

LAUNCH INFO 2024-12-23 13:55:59,670 -----------  Configuration  ----------------------
LAUNCH INFO 2024-12-23 13:55:59,671 auto_parallel_config: None
LAUNCH INFO 2024-12-23 13:55:59,671 auto_tuner_json: None
LAUNCH INFO 2024-12-23 13:55:59,671 devices: 0,1
LAUNCH INFO 2024-12-23 13:55:59,671 elastic_level: -1
LAUNCH INFO 2024-12-23 13:55:59,671 elastic_timeout: 30
LAUNCH INFO 2024-12-23 13:55:59,671 enable_gpu_log: True
LAUNCH INFO 2024-12-23 13:55:59,671 gloo_port: 6767
LAUNCH INFO 2024-12-23 13:55:59,671 host: None
LAUNCH INFO 2024-12-23 13:55:59,671 ips: None
LAUNCH INFO 2024-12-23 13:55:59,671 job_id: default
LAUNCH INFO 2024-12-23 13:55:59,671 legacy: False
LAUNCH INFO 2024-12-23 13:55:59,671 log_dir: D:\model\ccd2-1\distributed_train_logs
LAUNCH INFO 2024-12-23 13:55:59,671 log_level: INFO
LAUNCH INFO 2024-12-23 13:55:59,671 log_overwrite: False
LAUNCH INFO 2024-12-23 13:55:59,671 master: None
LAUNCH INFO 2024-12-23 13:55:59,671 max_restart: 3
LAUNCH INFO 2024-12-23 13:55:59,671 nnodes: 1
LAUNCH INFO 2024-12-23 13:55:59,671 nproc_per_node: None
LAUNCH INFO 2024-12-23 13:55:59,671 rank: -1
LAUNCH INFO 2024-12-23 13:55:59,671 run_mode: collective
LAUNCH INFO 2024-12-23 13:55:59,671 server_num: None
LAUNCH INFO 2024-12-23 13:55:59,671 servers:
LAUNCH INFO 2024-12-23 13:55:59,671 sort_ip: False
LAUNCH INFO 2024-12-23 13:55:59,671 start_port: 6070
LAUNCH INFO 2024-12-23 13:55:59,671 trainer_num: None
LAUNCH INFO 2024-12-23 13:55:59,671 trainers:
LAUNCH INFO 2024-12-23 13:55:59,671 training_script: tools/train.py
LAUNCH INFO 2024-12-23 13:55:59,671 training_script_args: ['--do_eval', '--config', 'C:\\Users\\user\\.paddlex\\tmp2_33b798\\segmodel_Deeplabv3_Plus-R50.yml', '--batch_size', '2', '--learning_rate', '0.001', '--iters', '88000', '--device', 'gpu', '--use_vdl', '--save_dir', 'D:\\model\\ccd2-1', '--save_interval', '1100', '--log_iters', '10']
LAUNCH INFO 2024-12-23 13:55:59,671 with_gloo: 1
LAUNCH INFO 2024-12-23 13:55:59,671 --------------------------------------------------
LAUNCH INFO 2024-12-23 13:55:59,672 Job: default, mode collective, replicas 1[1:1], elastic False
LAUNCH INFO 2024-12-23 13:55:59,673 Run Pod: fqfbaa, replicas 2, status ready
LAUNCH INFO 2024-12-23 13:55:59,679 Watching Pod: fqfbaa, replicas 2, status running
LAUNCH WARNING 2024-12-23 13:55:59,779 save gpu info failed
LAUNCH INFO 2024-12-23 13:56:02,684 Pod failed
LAUNCH ERROR 2024-12-23 13:56:02,684 Container failed !!!
Container rank 0 status failed cmd ['C:\\ProgramData\\anaconda3\\envs\\paddlex_det\\python.exe', '-u', 'tools/train.py', '--do_eval', '--config', 'C:\\Users\\user\\.paddlex\\tmp2_33b798\\segmodel_Deeplabv3_Plus-R50.yml', '--batch_size', '2', '--learning_rate', '0.001', '--iters', '88000', '--device', 'gpu', '--use_vdl', '--save_dir', 'D:\\model\\ccd2-1', '--save_interval', '1100', '--log_iters', '10'] code 1 log D:\model\ccd2-1\distributed_train_logs\workerlog.0
env {'ALLUSERSPROFILE': 'C:\\ProgramData', 'APPDATA': 'C:\\Users\\user\\AppData\\Roaming', 'COMMONPROGRAMFILES': 'C:\\Program Files\\Common Files', 'COMMONPROGRAMFILES(X86)': 'C:\\Program Files (x86)\\Common Files', 'COMMONPROGRAMW6432': 'C:\\Program Files\\Common Files', 'COMPUTERNAME': 'AI2', 'COMSPEC': 'C:\\Windows\\system32\\cmd.exe', 'CONDA_DEFAULT_ENV': 'paddlex_det', 'CONDA_EXE': 'C:\\ProgramData\\anaconda3\\Scripts\\conda.exe', 'CONDA_PREFIX': 'C:\\ProgramData\\anaconda3\\envs\\paddlex_det', 'CONDA_PREFIX_1': 'C:\\ProgramData\\anaconda3', 'CONDA_PROMPT_MODIFIER': '(paddlex_det) ', 'CONDA_PYTHON_EXE': 'C:\\ProgramData\\anaconda3\\python.exe', 'CONDA_SHLVL': '2', 'CUDA_PATH': 'C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v11.8', 'CUDA_PATH_V11_8': 'C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v11.8', 'DRIVERDATA': 'C:\\Windows\\System32\\Drivers\\DriverData', 'FLAGS_ENABLE_PIR_API': '0', 'FLAGS_JSON_FORMAT_MODEL': '0', 'FPS_BROWSER_APP_PROFILE_STRING': 'Internet Explorer', 'FPS_BROWSER_USER_PROFILE_STRING': 'Default', 'GENICAM_CACHE_V2_4': 'C:\\Program Files\\Cognex\\Common\\genicam\\cache', 'GENICAM_GENTL32_PATH': 'C:\\Program Files (x86)\\Common Files\\MVS\\Runtime\\Win32_i86', 'GENICAM_GENTL64_PATH': 'C:\\Program Files (x86)\\Common Files\\MVS\\Runtime\\Win64_x64', 'GENICAM_ROOT_V2_4': 'C:\\Program Files\\Cognex\\Common\\genicam', 'HOMEDRIVE': 'C:', 'HOMEPATH': '\\Users\\user', 'IGCCSVC_DB': 'AQAAANCMnd8BFdERjHoAwE/Cl+sBAAAAesNkBH7uNkqtZYc2tEuS3QQAAAACAAAAAAAQZgAAAAEAACAAAABo85rIEbFnFvjA5JNLZ0BMuiP6JFD6HB5/d5wa6rBDGAAAAAAOgAAAAAIAACAAAADDIHQo9IH+cKwwt9BzLQO+g2/PZFgmDYlb5ros7gqIW2AAAACtZzl3taFQ0VaWDnwYAIoK0OB4qQqRLygJjYoOnAkaVQAdMaba7tSy/UVM7Y+oXrxw4QY5EJiboqFLxn1hSVr7kf6eEt1KKPg/2dGzxPKSj8NxEZhkIQuhfsDr8yCKAY5AAAAA3Ep71lUUHDlozpmxFD+49X0eDX4eXF35ADoX93nccTpy3hWWXVZEredANb55n3iVV9SH+DuEem6+JJUbA43png==', 'KMP_DUPLICATE_LIB_OK': 'True', 'KMP_INIT_AT_FORK': 'FALSE', 'LOCALAPPDATA': 'C:\\Users\\user\\AppData\\Local', 'LOGONSERVER': 
'\\\\AI2', 'MVCAM_COMMON_RUNENV': 'C:\\Program Files (x86)\\MVS\\Development', 'MVCAM_GENICAM_CLPROTOCOL': 'C:\\Program Files (x86)\\Common Files\\MVS\\Runtime\\CLProtocol', 'MVCAM_GIGE_DEBUG_HEARTBEAT': '60000', 'NIEXTCCOMPILERSUPP': 'C:\\Program Files (x86)\\National Instruments\\Shared\\ExternalCompilerSupport\\C\\', 'NI_MO_INSTALL_PATH': 'C:\\Users\\Public\\Documents\\National Instruments\\model_optimizer\\', 'NUMBER_OF_PROCESSORS': '32', 'NVTOOLSEXT_PATH': 'C:\\Program Files\\NVIDIA Corporation\\NvToolsExt\\', 'OMP_NUM_THREADS': '1', 'ONEDRIVE': 'C:\\Users\\user\\OneDrive', 'OS': 'Windows_NT', 'OV_MO_INSTALL_PATH': 'C:\\Users\\Public\\Documents\\National Instruments\\intel_model_optimizer\\', 'OV_NI_PLUGIN_DIR': 'C:\\Program Files\\National Instruments\\Shared\\OpenVINO\\', 'PADDLE_PDX_PADDLECLAS_PATH': 'C:\\Paddle\\PaddleX-release-3.0-beta2\\paddlex\\repo_manager\\repos\\PaddleClas', 'PADDLE_PDX_PADDLEDETECTION_PATH': 'C:\\Paddle\\PaddleX-release-3.0-beta2\\paddlex\\repo_manager\\repos\\PaddleDetection', 'PADDLE_PDX_PADDLENLP_PATH': 'C:\\Paddle\\PaddleX-release-3.0-beta2\\paddlex\\repo_manager\\repos\\PaddleNLP', 'PADDLE_PDX_PADDLEOCR_PATH': 'C:\\Paddle\\PaddleX-release-3.0-beta2\\paddlex\\repo_manager\\repos\\PaddleOCR', 'PADDLE_PDX_PADDLESEG_PATH': 'C:\\Paddle\\PaddleX-release-3.0-beta2\\paddlex\\repo_manager\\repos\\PaddleSeg', 'PADDLE_PDX_PADDLETS_PATH': 'C:\\Paddle\\PaddleX-release-3.0-beta2\\paddlex\\repo_manager\\repos\\PaddleTS', 'PATH': 
'C:\\ProgramData\\anaconda3\\envs\\paddlex_det\\Lib\\site-packages\\cv2\\../../x64/vc14/bin;C:\\ProgramData\\anaconda3\\envs\\paddlex_det\\lib\\site-packages\\paddle\\base;C:\\ProgramData\\anaconda3\\envs\\paddlex_det\\lib\\site-packages\\paddle\\base\\..\\libs;C:\\ProgramData\\anaconda3\\envs\\paddlex_det\\lib\\site-packages\\paddle\\base;C:\\ProgramData\\anaconda3\\envs\\paddlex_det\\lib\\site-packages\\paddle\\base\\..\\libs;C:\\ProgramData\\anaconda3\\envs\\paddlex_det\\Lib\\site-packages\\cv2\\../../x64/vc14/bin;C:\\ProgramData\\anaconda3\\envs\\paddlex_det;C:\\ProgramData\\anaconda3\\envs\\paddlex_det\\Library\\mingw-w64\\bin;C:\\ProgramData\\anaconda3\\envs\\paddlex_det\\Library\\usr\\bin;C:\\ProgramData\\anaconda3\\envs\\paddlex_det\\Library\\bin;C:\\ProgramData\\anaconda3\\envs\\paddlex_det\\Scripts;C:\\ProgramData\\anaconda3\\envs\\paddlex_det\\bin;C:\\ProgramData\\anaconda3\\condabin;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v11.8\\bin;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v11.8\\libnvvp;.;C:\\Program Files\\National Instruments\\Shared\\OpenVINO;C:\\Program Files (x86)\\Common Files\\MVS\\Runtime\\Win32_i86;C:\\Program Files (x86)\\Common Files\\MVS\\Runtime\\Win64_x64;C:\\Windows\\system32;C:\\Windows;C:\\Windows\\System32\\Wbem;C:\\Windows\\System32\\WindowsPowerShell\\v1.0;C:\\Windows\\System32\\OpenSSH;C:\\Program Files\\dotnet;C:\\Program Files (x86)\\National Instruments\\Shared\\LabVIEW CLI;C:\\Program Files\\Cognex\\VisionPro\\bin;C:\\Program Files\\Common Files\\Pleora\\eBUS SDK;C:\\Program Files\\NVIDIA Corporation\\Nsight Compute 2022.3.0;C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v11.8\\include;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v11.8\\TensorRT-8.5.1.7\\lib;C:\\Program Files\\Microsoft SQL Server\\150\\Tools\\Binn;C:\\Program Files\\Microsoft SQL Server\\Client SDK\\ODBC\\170\\Tools\\Binn;C:\\Program Files (x86)\\Windows 
Kits\\10\\Windows Performance Toolkit;C:\\Program Files\\NVIDIA Corporation\\NVIDIA NvDLISR;C:\\Users\\user\\AppData\\Local\\Microsoft\\WindowsApps;C:\\Users\\user\\.dotnet\\tools', 'PATHEXT': '.COM;.EXE;.BAT;.CMD;.VBS;.VBE;.JS;.JSE;.WSF;.WSH;.MSC', 'PROCESSOR_ARCHITECTURE': 'AMD64', 'PROCESSOR_IDENTIFIER': 'Intel64 Family 6 Model 183 Stepping 1, GenuineIntel', 'PROCESSOR_LEVEL': '6', 'PROCESSOR_REVISION': 'b701', 'PROGRAMDATA': 'C:\\ProgramData', 'PROGRAMFILES': 'C:\\Program Files', 'PROGRAMFILES(X86)': 'C:\\Program Files (x86)', 'PROGRAMW6432': 'C:\\Program Files', 'PROMPT': '(paddlex_det) $P$G', 'PSMODULEPATH': 'C:\\Program Files\\WindowsPowerShell\\Modules;C:\\Windows\\system32\\WindowsPowerShell\\v1.0\\Modules', 'PUBLIC': 'C:\\Users\\Public', 'SESSIONNAME': 'Console', 'SSL_CERT_DIR': 'C:\\ProgramData\\anaconda3\\envs\\paddlex_det\\Library\\ssl\\certs', 'SSL_CERT_FILE': 'C:\\ProgramData\\anaconda3\\Library\\ssl\\cacert.pem', 'SYSTEMDRIVE': 'C:', 'SYSTEMROOT': 'C:\\Windows', 'TEMP': 'C:\\Users\\user\\AppData\\Local\\Temp', 'TMP': 'C:\\Users\\user\\AppData\\Local\\Temp', 'USERDOMAIN': 'AI2', 'USERDOMAIN_ROAMINGPROFILE': 'AI2', 'USERNAME': 'user', 'USERPROFILE': 'C:\\Users\\user', 'VPRO32_ROOT': 'C:\\Program Files (x86)\\Cognex\\VisionPro', 'VPRO_ROOT': 'C:\\Program Files\\Cognex\\VisionPro', 'WINDIR': 'C:\\Windows', 'ZES_ENABLE_SYSMAN': '1', '__CONDA_OPENSLL_CERT_FILE_SET': '"1"', '__CONDA_OPENSSL_CERT_DIR_SET': '"1"', 'CUSTOM_DEVICE_ROOT': '', 'POD_NAME': 'fqfbaa', 'PADDLE_MASTER': '127.0.0.1:53817', 'PADDLE_GLOBAL_SIZE': '2', 'PADDLE_LOCAL_SIZE': '2', 'PADDLE_GLOBAL_RANK': '0', 'PADDLE_LOCAL_RANK': '0', 'PADDLE_NNODES': '1', 'PADDLE_CURRENT_ENDPOINT': '127.0.0.1:53818', 'PADDLE_TRAINER_ID': '0', 'PADDLE_TRAINERS_NUM': '2', 'PADDLE_RANK_IN_NODE': '0', 'PADDLE_TRAINER_ENDPOINTS': '127.0.0.1:53818,127.0.0.1:53819', 'FLAGS_selected_gpus': '0', 'PADDLE_LOG_DIR': 'D:\\model\\ccd2-1\\distributed_train_logs'}
LAUNCH INFO 2024-12-23 13:56:02,684 ------------------------- ERROR LOG DETAIL -------------------------
LAUNCH INFO 2024-12-23 13:56:02,684 Exit code 1
[2024/12/23 13:56:01] INFO:
------------Environment Information-------------
platform: Windows-10-10.0.19045-SP0
Python: 3.9.20 | packaged by conda-forge | (main, Sep 30 2024, 17:43:23) [MSC v.1929 64 bit (AMD64)]
Paddle compiled with cuda: True
NVCC: Build cuda_11.8.r11.8/compiler.31833905_0
cudnn: 8.9
GPUs used: 2
CUDA_VISIBLE_DEVICES: None
GPU: ['GPU 0: NVIDIA GeForce', 'GPU 1: NVIDIA GeForce']
PaddleSeg: 0.0.0.dev0
PaddlePaddle: 3.0.0-beta1
OpenCV: 4.5.5
------------------------------------------------
[2024/12/23 13:56:01] INFO:
---------------Config Information---------------
batch_size: 2
iters: 88000
train_dataset:
  dataset_root: D:\VB\CCD2
  mode: train
  num_classes: 15
  train_path: D:\VB\CCD2\train.txt
  transforms:
  - max_scale_factor: 1
    min_scale_factor: 1
    scale_step_size: 0.25
    type: ResizeStepScaling
  - crop_size:
    - 1600
    - 400
    type: RandomPaddingCrop
  - type: RandomHorizontalFlip
  - type: Normalize
  type: SegDataset
val_dataset:
  dataset_root: D:\VB\CCD2
  mode: val
  num_classes: 15
  transforms:
  - type: Normalize
  type: SegDataset
  val_path: D:\VB\CCD2\val.txt
optimizer:
  momentum: 0.9
  type: SGD
  weight_decay: 4.0e-05
lr_scheduler:
  end_lr: 0
  learning_rate: 0.001
  power: 0.9
  type: PolynomialDecay
loss:
  coef:
  - 1
  types:
  - type: CrossEntropyLoss
model:
  align_corners: false
  aspp_out_channels: 256
  aspp_ratios:
  - 1
  - 12
  - 24
  - 36
  backbone:
    multi_grid:
    - 1
    - 2
    - 4
    output_stride: 8
    pretrained: https://paddle-model-ecology.bj.bcebos.com/paddlex/official_pretrained_model/Deeplabv3_Plus-R50_backbone_imagenet_pretrained.pdparams
    type: ResNet50_vd
  backbone_indices:
  - 0
  - 3
  num_classes: 15
  pretrained: null
  type: DeepLabV3P
pdx_model_name: Deeplabv3_Plus-R50
uniform_output_enabled: true
------------------------------------------------

[2024/12/23 13:56:01] INFO: Set device: gpu
[2024/12/23 13:56:01] INFO: Use the following config to build model
model:
  align_corners: false
  aspp_out_channels: 256
  aspp_ratios:
  - 1
  - 12
  - 24
  - 36
  backbone:
    multi_grid:
    - 1
    - 2
    - 4
    output_stride: 8
    pretrained: https://paddle-model-ecology.bj.bcebos.com/paddlex/official_pretrained_model/Deeplabv3_Plus-R50_backbone_imagenet_pretrained.pdparams
    type: ResNet50_vd
  backbone_indices:
  - 0
  - 3
  num_classes: 15
  pretrained: null
  type: DeepLabV3P
W1223 13:56:01.486145  5624 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 8.6, Driver API Version: 12.7, Runtime API Version: 11.8
W1223 13:56:01.486145  5624 gpu_resources.cc:164] device: 0, cuDNN Version: 8.9.
[2024/12/23 13:56:01] INFO: Loading pretrained model from https://paddle-model-ecology.bj.bcebos.com/paddlex/official_pretrained_model/Deeplabv3_Plus-R50_backbone_imagenet_pretrained.pdparams
[2024/12/23 13:56:02] INFO: There are 275/275 variables loaded into ResNet_vd.
[2024/12/23 13:56:02] INFO: Convert bn to sync_bn
[2024/12/23 13:56:02] INFO: Use the following config to build train_dataset
train_dataset:
  dataset_root: D:\VB\CCD2
  mode: train
  num_classes: 15
  train_path: D:\VB\CCD2\train.txt
  transforms:
  - max_scale_factor: 1
    min_scale_factor: 1
    scale_step_size: 0.25
    type: ResizeStepScaling
  - crop_size:
    - 1600
    - 400
    type: RandomPaddingCrop
  - type: RandomHorizontalFlip
  - type: Normalize
  type: SegDataset
[2024/12/23 13:56:02] INFO: Use the following config to build val_dataset
val_dataset:
  dataset_root: D:\VB\CCD2
  mode: val
  num_classes: 15
  transforms:
  - type: Normalize
  type: SegDataset
  val_path: D:\VB\CCD2\val.txt
[2024/12/23 13:56:02] INFO: If the type is SGD and momentum in optimizer config, the type is changed to Momentum.
[2024/12/23 13:56:02] INFO: Use the following config to build optimizer
optimizer:
  momentum: 0.9
  type: Momentum
  weight_decay: 4.0e-05
[2024/12/23 13:56:02] INFO: Use the following config to build loss
loss:
  coef:
  - 1
  types:
  - type: CrossEntropyLoss
[2024-12-23 13:56:02,304] [    INFO] distributed_strategy.py:214 - distributed strategy initialized
======================= Modified FLAGS detected =======================
FLAGS(name='FLAGS_selected_gpus', current_value='0', default_value='')
FLAGS(name='FLAGS_win_cuda_bin_dir', current_value='C:\\ProgramData\\anaconda3\\envs\\paddlex_det\\lib\\site-packages\\paddle\\..\\nvidia', default_value='')
=======================================================================
I1223 13:56:02.305222  5624 tcp_utils.cc:181] The server starts to listen on IP_ANY:53817
I1223 13:56:02.305222  5624 tcp_utils.cc:130] Successfully connected to 127.0.0.1:53817
Traceback (most recent call last):
  File "C:\Paddle\PaddleX-release-3.0-beta2\paddlex\repo_manager\repos\PaddleSeg\tools\train.py", line 252, in <module>
    main(args)
  File "C:\Paddle\PaddleX-release-3.0-beta2\paddlex\repo_manager\repos\PaddleSeg\tools\train.py", line 222, in main
    train(model,
  File "C:\ProgramData\anaconda3\envs\paddlex_det\lib\site-packages\paddleseg\core\train.py", line 155, in train
    paddle.distributed.fleet.init(is_collective=True)
  File "C:\ProgramData\anaconda3\envs\paddlex_det\lib\site-packages\paddle\distributed\fleet\fleet.py", line 283, in init
    paddle.distributed.init_parallel_env()
  File "C:\ProgramData\anaconda3\envs\paddlex_det\lib\site-packages\paddle\distributed\parallel.py", line 1103, in init_parallel_env
    pg = _new_process_group_impl(
  File "C:\ProgramData\anaconda3\envs\paddlex_det\lib\site-packages\paddle\distributed\collective.py", line 158, in _new_process_group_impl
    pg = core.ProcessGroupNCCL.create(
AttributeError: module 'paddle.base.libpaddle' has no attribute 'ProcessGroupNCCL'
Traceback (most recent call last):
  File "C:\Paddle\PaddleX-release-3.0-beta2\paddlex\utils\result_saver.py", line 29, in wrap
    result = func(self, *args, **kwargs)
  File "C:\Paddle\PaddleX-release-3.0-beta2\paddlex\engine.py", line 41, in run
    self._model.train()
  File "C:\Paddle\PaddleX-release-3.0-beta2\paddlex\model.py", line 94, in train
    trainer.train()
  File "C:\Paddle\PaddleX-release-3.0-beta2\paddlex\modules\base\trainer.py", line 71, in train
    train_result = self.pdx_model.train(**train_args)
  File "C:\Paddle\PaddleX-release-3.0-beta2\paddlex\repo_apis\PaddleSeg_api\seg\model.py", line 178, in train
    return self.runner.train(
  File "C:\Paddle\PaddleX-release-3.0-beta2\paddlex\repo_apis\PaddleSeg_api\seg\runner.py", line 55, in train
    return self.run_cmd(
  File "C:\Paddle\PaddleX-release-3.0-beta2\paddlex\repo_apis\base\runner.py", line 355, in run_cmd
    raise CalledProcessError(
paddlex.utils.errors.others.CalledProcessError: Command ['C:\\ProgramData\\anaconda3\\envs\\paddlex_det\\python.exe', '-m', 'paddle.distributed.launch', '--devices', '0,1', '--log_dir', 'D:\\model\\ccd2-1\\distributed_train_logs', 'tools/train.py', '--do_eval', '--config', 'C:\\Users\\user\\.paddlex\\tmp2_33b798\\segmodel_Deeplabv3_Plus-R50.yml', '--batch_size', '2', '--learning_rate', '0.001', '--iters', '88000', '--device', 'gpu', '--use_vdl', '--save_dir', 'D:\\model\\ccd2-1', '--save_interval', '1100', '--log_iters', '10'] returned non-zero exit status 1.

The training configuration file is as follows:
b34f3e84048361fed6cc6c65ff63867

CashBai commented Dec 23, 2024

One more detail: PaddlePaddle was installed with conda:
conda install paddlepaddle-gpu==3.0.0b2 paddlepaddle-cuda=11.8 -c paddle -c nvidia
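
For anyone trying to reproduce this, here is a minimal sketch of a check for whether the installed build exposes the `ProcessGroupNCCL` attribute that the traceback reports as missing. The module path `paddle.base.libpaddle` and the attribute name are taken from the traceback above; the helper name `has_nccl_process_group` is hypothetical, and the check only reports presence or absence, it does not explain why the attribute is missing.

```python
import importlib
from typing import Optional


def has_nccl_process_group(libpaddle: Optional[object] = None) -> Optional[bool]:
    """Report whether the given module exposes ProcessGroupNCCL.

    Returns True or False when a module is available, and None when
    paddle.base.libpaddle cannot be imported in this environment.
    """
    if libpaddle is None:
        try:
            # Module path copied from the AttributeError in the traceback.
            libpaddle = importlib.import_module("paddle.base.libpaddle")
        except ImportError:
            return None
    return hasattr(libpaddle, "ProcessGroupNCCL")


if __name__ == "__main__":
    result = has_nccl_process_group()
    if result is None:
        print("paddle.base.libpaddle could not be imported")
    elif result:
        print("ProcessGroupNCCL is present in this build")
    else:
        print("ProcessGroupNCCL is missing from this build")
```

Run it inside the same conda environment (`paddlex_det` in the logs above) so that it inspects the same installation the launcher uses.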
