
About CUDA Version: 11.7 #45074

Closed
sam-johnson opened this issue Aug 11, 2022 · 18 comments
Assignees
Labels
status/developing (in development), type/build (build/install issue)

Comments

@sam-johnson

Feature Description

When will CUDA Version 11.7 be supported?

Alternatives

No response

@paddle-bot
paddle-bot bot commented Aug 11, 2022

Hi! We've received your issue; please be patient while we respond. We will arrange for technicians to answer your questions as soon as possible. Please make sure you have provided a clear problem description, reproduction code, environment & version information, and error messages. You may also look for an answer in the official API docs, the FAQ, the issue history, and the AI community. Have a nice day!

@Tlntin
Tlntin commented Aug 12, 2022

Just build it from source yourself.

@zhwesky2010
Contributor

Hi, building from source is supported now. We are testing locally and fixing some remaining issues; support is expected to ship in the 2.4 release, around October.

@Tlntin
Tlntin commented Aug 12, 2022

Could you also support Ubuntu 22.04, or upgrade gcc to 11?

@zhwesky2010
Contributor

Is the problem with building 11.7 from source?

@Tlntin
Tlntin commented Aug 12, 2022

No. On Ubuntu 22.04 the official prebuilt whl packages cannot be installed directly, so the only option is to build from source, which is a hassle.
Also, the build only works with gcc 8.x. Switching the system's default gcc would affect the system environment, so I used conda to install gcc 8.x instead.
Furthermore, a self-built paddle sometimes has bugs: the official 2.3.1 can run the RocketQA project, while my self-built 2.3.1 runs out of GPU memory on the exact same code.
So I hope the team can upgrade things so that paddle supports gcc 11 and Ubuntu 22.04 (installable directly via pip, without building from source).
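The conda-based gcc 8.x workaround mentioned above can be sketched roughly as below. This is an illustration, not the commenter's exact commands; the environment name and the conda-forge toolchain package names/versions are assumptions.

```shell
# Sketch: install gcc 8.x into an isolated conda environment instead of
# replacing the system compiler (package names/versions are assumptions,
# based on the conda-forge compiler toolchain packages).
conda create -n paddle-build -c conda-forge gcc_linux-64=8.5 gxx_linux-64=8.5 -y
conda activate paddle-build

# Point a subsequent Paddle CMake build at the conda-provided compilers
# rather than the system gcc.
export CC="$(which x86_64-conda-linux-gnu-gcc)"
export CXX="$(which x86_64-conda-linux-gnu-g++)"
```

This keeps the system toolchain untouched, which is the point of the workaround described in the comment.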

@zhwesky2010
Contributor

Understood. We will evaluate this request to upgrade the Ubuntu build and install support. Thanks for the suggestion.

@zhwesky2010 zhwesky2010 added the install label and removed the status/new-issue and type/feature-request labels Aug 12, 2022
@paddle-bot paddle-bot bot added the status/following-up label Aug 12, 2022
@jzhang533
Contributor

jzhang533 commented Aug 12, 2022

You can try the NGC paddlepaddle container:

https://catalog.ngc.nvidia.com/orgs/nvidia/containers/paddlepaddle

This image ships CUDA 11.7; see the release notes for details:

https://docs.nvidia.com/deeplearning/frameworks/paddle-paddle-release-notes/rel_22-07.html#rel_22-07

There is also an installation guide in Chinese:
https://www.paddlepaddle.org.cn/documentation/docs/zh/install/instalL_NGC_PaddlePaddle_ch.html
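Pulling and smoke-testing the NGC image suggested above might look like the following sketch. The 22.07-py3 tag comes from the linked release notes; the shared-memory flags follow NVIDIA's general guidance for NCCL inside containers, and `--gpus all` assumes the NVIDIA Container Toolkit is installed.

```shell
# Sketch: pull the NGC PaddlePaddle image (CUDA 11.7) and run Paddle's
# built-in sanity check inside it. Flags are illustrative, not an
# official recipe.
docker pull nvcr.io/nvidia/paddlepaddle:22.07-py3
docker run --gpus all --rm \
  --shm-size=1g --ulimit memlock=-1 \
  nvcr.io/nvidia/paddlepaddle:22.07-py3 \
  python -c 'import paddle; paddle.utils.run_check()'
```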

@jzhang533
Contributor

On Ubuntu 22.04, paddle most likely cannot be used yet, because of an incompatibility with glibc 2.34 and above. That should be resolved in the 2.4 release.
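Whether a given host is affected by the glibc constraint above is easy to check. A minimal sketch (`getconf GNU_LIBC_VERSION` is a standard glibc facility; the 2.34 threshold is the one quoted in the comment):

```shell
# Print the host's glibc version. Ubuntu 22.04 ships glibc 2.35, which is
# >= 2.34 and therefore hits the incompatibility described above.
getconf GNU_LIBC_VERSION
```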

@paddle-bot paddle-bot bot added the status/reviewing label and removed the status/following-up label Aug 12, 2022
@Tlntin
Tlntin commented Aug 16, 2022

Tested it; it does not work. Running paddle.utils.run_check() throws an error, so it is still better to stick with the official containers.

@zlsh80826
Collaborator

@Tlntin Could you share your docker run command and the error message? Does your docker command include the --shm-size=1g --ulimit memlock=-1 or --ipc=host parameters?

@Tlntin
Tlntin commented Aug 16, 2022

I did not add those parameters (I did not add them when using the official container either).
I will add them shortly and see whether it still errors out.

@zlsh80826
Collaborator

Note: In order to share data between ranks, NCCL may require shared system memory for IPC and pinned (page-locked) system memory resources. The operating system's limits on these resources may need to be increased accordingly. Refer to your system's documentation for details. In particular, Docker containers default to limited shared and pinned memory resources. When using NCCL inside a container, it is recommended that you increase these resources by issuing:

Without those parameters, python -c 'import paddle; paddle.utils.run_check()' can indeed fail depending on the environment. This is a known case.

@Tlntin
Tlntin commented Aug 16, 2022

Right, it is probably NCCL's fault; the error I saw yesterday mentioned an NCCL error.

@Tlntin
Tlntin commented Aug 16, 2022

It works after adding those parameters. Without them it errors out (although the official paddle 2.3.1 with CUDA 11.2 works fine without them).
The error message was:

>>> paddle.utils.run_check()
Running verify PaddlePaddle program ...
W0816 03:00:08.974228  1388 gpu_context.cc:278] Please NOTE: device: 0, GPU Compute Capability: 8.6, Driver API Version: 11.7, Runtime API Version: 11.7
W0816 03:00:08.979508  1388 gpu_context.cc:306] device: 0, cuDNN Version: 8.4.
PaddlePaddle works well on 1 GPU.
W0816 03:00:10.755481  1388 parallel_executor.cc:642] Cannot enable P2P access from 0 to 1
W0816 03:00:10.755497  1388 parallel_executor.cc:642] Cannot enable P2P access from 1 to 0
[1660618812.601298] [ff602c8d99c8:1388 :0]        spinlock.c:29   UCX  WARN  ucs_recursive_spinlock_destroy() failed: busy
[1660618812.601301] [ff602c8d99c8:1388 :1]           debug.c:1349 UCX  WARN  ucs_debug_disable_signal: signal 8 was not set in ucs
[ff602c8d99c8:1388 :0:1508] Caught signal 7 (Bus error: nonexistent physical address)
[ff602c8d99c8:1388 :1:1509] Caught signal 7 (Bus error: nonexistent physical address)
==== backtrace (tid:   1509) ====
 0 0x0000000000014420 __funlockfile()  ???:0
 1 0x000000000018bb41 __nss_database_lookup()  ???:0
 2 0x000000000006929c ncclGroupEnd()  ???:0
 3 0x000000000006b9ae ncclGroupEnd()  ???:0
 4 0x0000000000050853 ncclGetUniqueId()  ???:0
 5 0x00000000000417b4 ???()  /lib/x86_64-linux-gnu/libnccl.so:0
 6 0x0000000000042c4d ???()  /lib/x86_64-linux-gnu/libnccl.so:0
 7 0x0000000000058b37 ncclRedOpDestroy()  ???:0
 8 0x0000000000008609 start_thread()  ???:0
 9 0x000000000011f133 clone()  ???:0
=================================
==== backtrace (tid:   1508) ====
 0 0x0000000000014420 __funlockfile()  ???:0
 1 0x000000000018bb41 __nss_database_lookup()  ???:0
 2 0x000000000006929c ncclGroupEnd()  ???:0
 3 0x000000000006b9ae ncclGroupEnd()  ???:0
 4 0x0000000000050853 ncclGetUniqueId()  ???:0
 5 0x00000000000417b4 ???()  /lib/x86_64-linux-gnu/libnccl.so:0
 6 0x0000000000042c4d ???()  /lib/x86_64-linux-gnu/libnccl.so:0
 7 0x0000000000058b37 ncclRedOpDestroy()  ???:0
 8 0x0000000000008609 start_thread()  ???:0
 9 0x000000000011f133 clone()  ???:0
=================================


--------------------------------------
C++ Traceback (most recent call last):
--------------------------------------
No stack trace in paddle, may be caused by external reasons.

----------------------
Error Message Summary:
----------------------
FatalError: `Access to an undefined portion of a memory object` is detected by the operating system.
  [TimeInfo: *** Aborted at 1660618812 (unix time) try "date -d @1660618812" if you are using GNU date ***]
  [SignalInfo: *** SIGBUS (@0x56c) received by PID 1388 (TID 0x7fe6c5fff700) from PID 1388 ***]

The final working command:

nvidia-docker run -d --name nv_paddle_2.3.0 \
-p 16788:22 \
-p 28889:8888 \
--restart=always \
--shm-size=1g --ulimit memlock=-1 \
nvcr.io/nvidia/paddlepaddle:22.07-py3 \
bash -c "apt update && \
apt install openssh-server -y \
&& service ssh start  \
&& sleep 8640000"

onecatcn added a commit to onecatcn/docs that referenced this issue Aug 19, 2022
update the command to launch the container, in order to be in line with the NGC PaddlePaddle website. This editing was initiated by the feedback in PaddlePaddle/Paddle#45074
XieYunshen pushed a commit to PaddlePaddle/docs that referenced this issue Aug 19, 2022
* Update install_NGC_PaddlePaddle_ch.rst

update the command to launch the container, in order to be in line with the NGC PaddlePaddle website. This editing was initiated by the feedback in PaddlePaddle/Paddle#45074

* Update install_NGC_PaddlePaddle_en.rst

* Update install_NGC_PaddlePaddle_ch.rst

* Update install_NGC_PaddlePaddle_en.rst

* Update install_NGC_PaddlePaddle_ch.rst

* Update install_NGC_PaddlePaddle_en.rst

Co-authored-by: Dingjiawei <327396238@qq.com>
@paddle-bot paddle-bot bot added the status/developing label and removed the status/reviewing label Aug 23, 2022
@Tlntin
Tlntin commented Aug 30, 2022

After using the cuda-11.7 container more extensively, it does not seem to work well.
Code that runs fine in the official cuda-11.2 container fails in the cuda-11.7 one.
Same paddle 2.3.1, latest paddlenlp, same code; only the cuda-11.7 container errors out:

step 0 (NotFound) No Input(qo_kv_seqlen) found for MHA operator.
  [Hint: Expected ctx->HasInput("qo_kv_seqlen") == true, but received ctx->HasInput("qo_kv_seqlen"):0 != true:1.] (at /opt/paddle/paddle/paddle/fluid/operators/mha/mha_op.cc:32)
  [operator < mha > error]

The code is the demo bundled with paddlenlp, with the launched model changed from ernie-1.0 to ernie-3.0.
Code link
Run parameters:

export CUDA_VISIBLE_DEVICES=0
python run_pretrain.py \
		--model_type "ernie" \
		--model_name_or_path "ernie-3.0-base-zh" \
		--tokenizer_name_or_path "ernie-3.0-base-zh" \
		--continue_training true \
		--input_dir "./data1" \
		--output_dir "output/ernie-3.0-dp8-gb512" \
		--split 94,50,1 \
		--max_seq_len 512 \
		--micro_batch_size 32 \
		--use_amp true \
		--max_lr 0.0001 \
		--min_lr 0.00001 \
		--max_steps 500000 \
		--save_steps 50000 \
		--checkpoint_steps 5000 \
		--decay_steps 990000 \
		--weight_decay 0.01 \
		--warmup_rate 0.01 \
		--grad_clip 1.0 \
		--logging_freq 20 \
		--num_workers 10 \
		--eval_freq 1000 \
		--device "gpu" \
		--share_folder false

@zlsh80826
Collaborator

Hi @Tlntin,
the MHA op issue is a known problem; the fix will ship in next month's container release.

@Ligoml Ligoml added the status/developed label Sep 20, 2022
@paddle-bot paddle-bot bot closed this as completed Sep 20, 2022
@paddle-bot paddle-bot bot removed the status/developing label Sep 20, 2022
@Ligoml Ligoml added the status/developing and type/build labels and removed the install label Sep 20, 2022
@paddle-bot paddle-bot bot reopened this Sep 20, 2022
@paddle-bot paddle-bot bot removed the status/developed label Sep 20, 2022
@paddle-bot paddle-bot bot closed this as completed Sep 26, 2023
@paddle-bot
paddle-bot bot commented Sep 26, 2023

Since you haven't replied for more than a year, we have closed this issue/PR.
If the problem is not solved or there is a follow-up, please reopen it at any time and we will continue to follow up.

6 participants