
About CUDA Version: 11.7 #45074

Closed
sam-johnson opened this issue Aug 11, 2022 · 18 comments
Assignees
Labels
status/developing (in development), type/build (build/install issue)

Comments

@sam-johnson

Feature Description

When will CUDA Version 11.7 be supported?

Alternatives

No response

@paddle-bot
paddle-bot bot commented Aug 11, 2022

Hi! We've received your issue; please be patient while we respond. We will arrange for technicians to answer your questions as soon as possible. Please make sure you have provided a clear problem description, reproduction code, environment & version information, and error messages. You may also look for an answer in the official API docs, the FAQ, the issue history, and the AI community. Have a nice day!

@Tlntin
Tlntin commented Aug 12, 2022

Just build it from source yourself.

@zhwesky2010
Contributor

Hi, building from source is supported now. We are testing locally and fixing some remaining issues; support is expected to ship in the 2.4 release, around October.

@Tlntin
Tlntin commented Aug 12, 2022

Could you also support Ubuntu 22.04, or upgrade gcc to 11?

@zhwesky2010
Contributor

Is the problem with building 11.7 from source?

@Tlntin
Tlntin commented Aug 12, 2022

No. On Ubuntu 22.04 the official prebuilt whl packages cannot be installed directly, so the only option is to build from source, which is a hassle.
Also, the build only works with gcc 8.x. Switching the system's default gcc would affect the system environment, so I used conda to install gcc 8.x instead.
Furthermore, a self-built paddle sometimes has bugs: the official 2.3.1 can run the RocketQA project, while my self-built 2.3.1 runs out of GPU memory on the exact same code.
So I hope the team can upgrade things so that paddle supports gcc 11 and Ubuntu 22.04 (installable directly via pip, without building from source).
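The conda-based gcc 8.x workaround mentioned above can be sketched roughly as below. This is an illustration, not the commenter's exact commands; the environment name and the conda-forge toolchain package names/versions are assumptions.

```shell
# Sketch: install gcc 8.x into an isolated conda environment instead of
# replacing the system compiler (package names/versions are assumptions,
# based on the conda-forge compiler toolchain packages).
conda create -n paddle-build -c conda-forge gcc_linux-64=8.5 gxx_linux-64=8.5 -y
conda activate paddle-build

# Point a subsequent Paddle CMake build at the conda-provided compilers
# rather than the system gcc.
export CC="$(which x86_64-conda-linux-gnu-gcc)"
export CXX="$(which x86_64-conda-linux-gnu-g++)"
```

This keeps the system toolchain untouched, which is the point of the workaround described in the comment.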

@zhwesky2010
Contributor

Understood. We will evaluate this request to upgrade the Ubuntu build and install support. Thanks for the suggestion.

@zhwesky2010 zhwesky2010 added the install label and removed the status/new-issue and type/feature-request labels Aug 12, 2022
@paddle-bot paddle-bot bot added the status/following-up label Aug 12, 2022
@jzhang533
Contributor

jzhang533 commented Aug 12, 2022

You can try the NGC paddlepaddle container:

https://catalog.ngc.nvidia.com/orgs/nvidia/containers/paddlepaddle

This image ships CUDA 11.7; see the release notes for details:

https://docs.nvidia.com/deeplearning/frameworks/paddle-paddle-release-notes/rel_22-07.html#rel_22-07

There is also an installation guide in Chinese:
https://www.paddlepaddle.org.cn/documentation/docs/zh/install/instalL_NGC_PaddlePaddle_ch.html
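Pulling and smoke-testing the NGC image suggested above might look like the following sketch. The 22.07-py3 tag comes from the linked release notes; the shared-memory flags follow NVIDIA's general guidance for NCCL inside containers, and `--gpus all` assumes the NVIDIA Container Toolkit is installed.

```shell
# Sketch: pull the NGC PaddlePaddle image (CUDA 11.7) and run Paddle's
# built-in sanity check inside it. Flags are illustrative, not an
# official recipe.
docker pull nvcr.io/nvidia/paddlepaddle:22.07-py3
docker run --gpus all --rm \
  --shm-size=1g --ulimit memlock=-1 \
  nvcr.io/nvidia/paddlepaddle:22.07-py3 \
  python -c 'import paddle; paddle.utils.run_check()'
```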

@jzhang533
Contributor

On Ubuntu 22.04, paddle most likely cannot be used yet, because of an incompatibility with glibc 2.34 and above. That should be resolved in the 2.4 release.
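Whether a given host is affected by the glibc constraint above is easy to check. A minimal sketch (`getconf GNU_LIBC_VERSION` is a standard glibc facility; the 2.34 threshold is the one quoted in the comment):

```shell
# Print the host's glibc version. Ubuntu 22.04 ships glibc 2.35, which is
# >= 2.34 and therefore hits the incompatibility described above.
getconf GNU_LIBC_VERSION
```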

@paddle-bot paddle-bot bot added the status/reviewing label and removed the status/following-up label Aug 12, 2022
@Tlntin
Tlntin commented Aug 16, 2022

Tested it; it does not work. Running paddle.utils.run_check() throws an error, so it is still better to stick with the official containers.

@zlsh80826
Collaborator

@Tlntin Could you share your docker run command and the error message? Does your docker command include the --shm-size=1g --ulimit memlock=-1 or --ipc=host parameters?

@Tlntin
Tlntin commented Aug 16, 2022

I did not add those parameters (I did not add them when using the official container either).
I will add them shortly and see whether it still errors out.

@zlsh80826
Collaborator

Note: In order to share data between ranks, NCCL may require shared system memory for IPC and pinned (page-locked) system memory resources. The operating system's limits on these resources may need to be increased accordingly. Refer to your system's documentation for details. In particular, Docker containers default to limited shared and pinned memory resources. When using NCCL inside a container, it is recommended that you increase these resources by issuing:

Without those parameters, python -c 'import paddle; paddle.utils.run_check()' can indeed fail depending on the environment. This is a known case.

@Tlntin
Tlntin commented Aug 16, 2022

Right, it is probably NCCL's fault; the error I saw yesterday mentioned an NCCL error.

@Tlntin
Tlntin commented Aug 16, 2022

It works after adding those parameters. Without them it errors out (although the official paddle 2.3.1 with CUDA 11.2 works fine without them).
The error message was:

>>> paddle.utils.run_check()
Running verify PaddlePaddle program ...
W0816 03:00:08.974228  1388 gpu_context.cc:278] Please NOTE: device: 0, GPU Compute Capability: 8.6, Driver API Version: 11.7, Runtime API Version: 11.7
W0816 03:00:08.979508  1388 gpu_context.cc:306] device: 0, cuDNN Version: 8.4.
PaddlePaddle works well on 1 GPU.
W0816 03:00:10.755481  1388 parallel_executor.cc:642] Cannot enable P2P access from 0 to 1
W0816 03:00:10.755497  1388 parallel_executor.cc:642] Cannot enable P2P access from 1 to 0
[1660618812.601298] [ff602c8d99c8:1388 :0]        spinlock.c:29   UCX  WARN  ucs_recursive_spinlock_destroy() failed: busy
[1660618812.601301] [ff602c8d99c8:1388 :1]           debug.c:1349 UCX  WARN  ucs_debug_disable_signal: signal 8 was not set in ucs
[ff602c8d99c8:1388 :0:1508] Caught signal 7 (Bus error: nonexistent physical address)
[ff602c8d99c8:1388 :1:1509] Caught signal 7 (Bus error: nonexistent physical address)
==== backtrace (tid:   1509) ====
 0 0x0000000000014420 __funlockfile()  ???:0
 1 0x000000000018bb41 __nss_database_lookup()  ???:0
 2 0x000000000006929c ncclGroupEnd()  ???:0
 3 0x000000000006b9ae ncclGroupEnd()  ???:0
 4 0x0000000000050853 ncclGetUniqueId()  ???:0
 5 0x00000000000417b4 ???()  /lib/x86_64-linux-gnu/libnccl.so:0
 6 0x0000000000042c4d ???()  /lib/x86_64-linux-gnu/libnccl.so:0
 7 0x0000000000058b37 ncclRedOpDestroy()  ???:0
 8 0x0000000000008609 start_thread()  ???:0
 9 0x000000000011f133 clone()  ???:0
=================================
==== backtrace (tid:   1508) ====
 0 0x0000000000014420 __funlockfile()  ???:0
 1 0x000000000018bb41 __nss_database_lookup()  ???:0
 2 0x000000000006929c ncclGroupEnd()  ???:0
 3 0x000000000006b9ae ncclGroupEnd()  ???:0
 4 0x0000000000050853 ncclGetUniqueId()  ???:0
 5 0x00000000000417b4 ???()  /lib/x86_64-linux-gnu/libnccl.so:0
 6 0x0000000000042c4d ???()  /lib/x86_64-linux-gnu/libnccl.so:0
 7 0x0000000000058b37 ncclRedOpDestroy()  ???:0
 8 0x0000000000008609 start_thread()  ???:0
 9 0x000000000011f133 clone()  ???:0
=================================


--------------------------------------
C++ Traceback (most recent call last):
--------------------------------------
No stack trace in paddle, may be caused by external reasons.

----------------------
Error Message Summary:
----------------------
FatalError: `Access to an undefined portion of a memory object` is detected by the operating system.
  [TimeInfo: *** Aborted at 1660618812 (unix time) try "date -d @1660618812" if you are using GNU date ***]
  [SignalInfo: *** SIGBUS (@0x56c) received by PID 1388 (TID 0x7fe6c5fff700) from PID 1388 ***]

The final working command:

nvidia-docker run -d --name nv_paddle_2.3.0 \
-p 16788:22 \
-p 28889:8888 \
--restart=always \
--shm-size=1g --ulimit memlock=-1 \
nvcr.io/nvidia/paddlepaddle:22.07-py3 \
bash -c "apt update && \
apt install openssh-server -y \
&& service ssh start  \
&& sleep 8640000"

onecatcn added a commit to onecatcn/docs that referenced this issue Aug 19, 2022
update the command to launch the container, in order to be in line with the NGC PaddlePaddle website. This editing was initiated by the feedback in PaddlePaddle/Paddle#45074
XieYunshen pushed a commit to PaddlePaddle/docs that referenced this issue Aug 19, 2022
* Update install_NGC_PaddlePaddle_ch.rst

update the command to launch the container, in order to be in line with the NGC PaddlePaddle website. This editing was initiated by the feedback in PaddlePaddle/Paddle#45074

* Update install_NGC_PaddlePaddle_en.rst

* Update install_NGC_PaddlePaddle_ch.rst

* Update install_NGC_PaddlePaddle_en.rst

* Update install_NGC_PaddlePaddle_ch.rst

* Update install_NGC_PaddlePaddle_en.rst

Co-authored-by: Dingjiawei <327396238@qq.com>
@paddle-bot paddle-bot bot added the status/developing label and removed the status/reviewing label Aug 23, 2022
@Tlntin
Tlntin commented Aug 30, 2022

After using the cuda-11.7 container more extensively, it does not seem to work well.
Code that runs fine in the official cuda-11.2 container fails in the cuda-11.7 one.
Same paddle 2.3.1, latest paddlenlp, same code; only the cuda-11.7 container errors out:

step 0 (NotFound) No Input(qo_kv_seqlen) found for MHA operator.
  [Hint: Expected ctx->HasInput("qo_kv_seqlen") == true, but received ctx->HasInput("qo_kv_seqlen"):0 != true:1.] (at /opt/paddle/paddle/paddle/fluid/operators/mha/mha_op.cc:32)
  [operator < mha > error]

The code is the demo bundled with paddlenlp, with the launched model changed from ernie-1.0 to ernie-3.0.
Code link
Run parameters:

export CUDA_VISIBLE_DEVICES=0
python run_pretrain.py \
		--model_type "ernie" \
		--model_name_or_path "ernie-3.0-base-zh" \
		--tokenizer_name_or_path "ernie-3.0-base-zh" \
		--continue_training true \
		--input_dir "./data1" \
		--output_dir "output/ernie-3.0-dp8-gb512" \
		--split 94,50,1 \
		--max_seq_len 512 \
		--micro_batch_size 32 \
		--use_amp true \
		--max_lr 0.0001 \
		--min_lr 0.00001 \
		--max_steps 500000 \
		--save_steps 50000 \
		--checkpoint_steps 5000 \
		--decay_steps 990000 \
		--weight_decay 0.01 \
		--warmup_rate 0.01 \
		--grad_clip 1.0 \
		--logging_freq 20 \
		--num_workers 10 \
		--eval_freq 1000 \
		--device "gpu" \
		--share_folder false

@zlsh80826
Collaborator

Hi @Tlntin,
the MHA op issue is a known problem; the fix will ship in next month's container release.

@Ligoml Ligoml added the status/developed label Sep 20, 2022
@paddle-bot paddle-bot bot closed this as completed Sep 20, 2022
@paddle-bot paddle-bot bot removed the status/developing label Sep 20, 2022
@Ligoml Ligoml added the status/developing and type/build labels and removed the install label Sep 20, 2022
@paddle-bot paddle-bot bot reopened this Sep 20, 2022
@paddle-bot paddle-bot bot removed the status/developed label Sep 20, 2022
@paddle-bot paddle-bot bot closed this as completed Sep 26, 2023
@paddle-bot
paddle-bot bot commented Sep 26, 2023

Since you haven't replied for more than a year, we have closed this issue/PR.
If the problem is not solved or there is a follow-up, please reopen it at any time and we will continue to follow up.

6 participants