About CUDA Version: 11.7 #45074
Comments
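Hi! We've received your issue; please be patient while waiting for a response. We will arrange for technicians to answer your question as soon as possible. Please double-check that you have provided a clear problem description, reproduction code, environment and version information, and error messages. You may also check the official API documentation, the FAQ, past GitHub issues, and the AI community for answers. Have a nice day! |
Just build it yourself. |
Hello, building from source is currently supported. We are testing locally and fixing some issues; support is expected to be released with 2.4, around October. |
Could Ubuntu 22.04 be supported? Or could gcc be upgraded to 11? |
Is the problem with building against 11.7 from source? |
No. On Ubuntu 22.04 the officially prebuilt whl packages can't be installed directly, so the only option is to build from source, which is rather cumbersome. |
Understood. We will evaluate this request to upgrade Ubuntu build and install support. Thanks for the suggestion. |
You can try the NGC paddlepaddle container: https://catalog.ngc.nvidia.com/orgs/nvidia/containers/paddlepaddle This image ships CUDA 11.7; for details see: https://docs.nvidia.com/deeplearning/frameworks/paddle-paddle-release-notes/rel_22-07.html#rel_22-07 There is a Chinese installation guide here: |
On Ubuntu 22.04, paddle probably can't be used yet. |
I tested it and it doesn't work: running paddle.utils.run_check() raises an error. Better to stick with the official container. |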
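A crash inside a native library can take the whole Python interpreter down with it, so one option is to run the check in a child process and inspect how that process ended. The sketch below is only an illustration under stated assumptions: `run_isolated` is a hypothetical helper (not part of Paddle), and it assumes a POSIX system, where `subprocess` reports a child killed by a signal as a negative return code. Whether Paddle's own signal handling lets the raw signal through is not confirmed here.

```python
import signal
import subprocess
import sys


def run_isolated(code, timeout=300):
    """Run a Python snippet in a child process (hypothetical helper).

    Returns (ok, detail). On POSIX, a negative return code means the
    child was terminated by the signal with that number.
    """
    proc = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    if proc.returncode == 0:
        return True, "exited cleanly"
    if proc.returncode < 0:
        return False, "killed by " + signal.Signals(-proc.returncode).name
    return False, "exit code %d" % proc.returncode


# Example (assumes paddle is installed in the environment under test):
# ok, detail = run_isolated("import paddle; paddle.utils.run_check()")
# print(ok, detail)
```

Keeping the check in a separate process means the calling script survives and can report the outcome even if the check itself dies.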
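@Tlntin Could you share your docker run command and the error message? Did the docker command include
I didn't add those parameters (I didn't add them with the official container before either). |
If they are not added, then depending on the environment it will indeed, in |
Right. It's probably NCCL's fault; the error I saw yesterday mentioned an NCCL error. |
It works after adding the parameters. Without them it errors out (although the official paddle 2.3.1 CUDA 11.2 image works without them).
>>> paddle.utils.run_check()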
Running verify PaddlePaddle program ...
W0816 03:00:08.974228 1388 gpu_context.cc:278] Please NOTE: device: 0, GPU Compute Capability: 8.6, Driver API Version: 11.7, Runtime API Version: 11.7
W0816 03:00:08.979508 1388 gpu_context.cc:306] device: 0, cuDNN Version: 8.4.
PaddlePaddle works well on 1 GPU.
W0816 03:00:10.755481 1388 parallel_executor.cc:642] Cannot enable P2P access from 0 to 1
W0816 03:00:10.755497 1388 parallel_executor.cc:642] Cannot enable P2P access from 1 to 0
[1660618812.601298] [ff602c8d99c8:1388 :0] spinlock.c:29 UCX WARN ucs_recursive_spinlock_destroy() failed: busy
[1660618812.601301] [ff602c8d99c8:1388 :1] debug.c:1349 UCX WARN ucs_debug_disable_signal: signal 8 was not set in ucs
[ff602c8d99c8:1388 :0:1508] Caught signal 7 (Bus error: nonexistent physical address)
[ff602c8d99c8:1388 :1:1509] Caught signal 7 (Bus error: nonexistent physical address)
==== backtrace (tid: 1509) ====
0 0x0000000000014420 __funlockfile() ???:0
1 0x000000000018bb41 __nss_database_lookup() ???:0
2 0x000000000006929c ncclGroupEnd() ???:0
3 0x000000000006b9ae ncclGroupEnd() ???:0
4 0x0000000000050853 ncclGetUniqueId() ???:0
5 0x00000000000417b4 ???() /lib/x86_64-linux-gnu/libnccl.so:0
6 0x0000000000042c4d ???() /lib/x86_64-linux-gnu/libnccl.so:0
7 0x0000000000058b37 ncclRedOpDestroy() ???:0
8 0x0000000000008609 start_thread() ???:0
9 0x000000000011f133 clone() ???:0
=================================
==== backtrace (tid: 1508) ====
0 0x0000000000014420 __funlockfile() ???:0
1 0x000000000018bb41 __nss_database_lookup() ???:0
2 0x000000000006929c ncclGroupEnd() ???:0
3 0x000000000006b9ae ncclGroupEnd() ???:0
4 0x0000000000050853 ncclGetUniqueId() ???:0
5 0x00000000000417b4 ???() /lib/x86_64-linux-gnu/libnccl.so:0
6 0x0000000000042c4d ???() /lib/x86_64-linux-gnu/libnccl.so:0
7 0x0000000000058b37 ncclRedOpDestroy() ???:0
8 0x0000000000008609 start_thread() ???:0
9 0x000000000011f133 clone() ???:0
=================================
--------------------------------------
C++ Traceback (most recent call last):
--------------------------------------
No stack trace in paddle, may be caused by external reasons.
----------------------
Error Message Summary:
----------------------
FatalError: `Access to an undefined portion of a memory object` is detected by the operating system.
[TimeInfo: *** Aborted at 1660618812 (unix time) try "date -d @1660618812" if you are using GNU date ***]
[SignalInfo: *** SIGBUS (@0x56c) received by PID 1388 (TID 0x7fe6c5fff700) from PID 1388 ***]
The command I ended up running:
nvidia-docker run -d --name nv_paddle_2.3.0 \
-p 16788:22 \
-p 28889:8888 \
--restart=always \
--shm-size=1g --ulimit memlock=-1 \
nvcr.io/nvidia/paddlepaddle:22.07-py3 \
bash -c "apt update && \
apt install openssh-server -y \
&& service ssh start \
&& sleep 8640000" |
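The `--shm-size=1g --ulimit memlock=-1` flags in the command above raise the container's shared-memory size and remove its locked-memory limit. As a rough sketch, both can be checked from inside a container before running the multi-GPU check. `find_limit_problems`, `read_current_limits`, and the 1 GiB threshold are illustrative assumptions taken from that command, not an official Paddle or NCCL requirement, and the `resource` module is Unix-only.

```python
import resource
import shutil

GIB = 1 << 30


def find_limit_problems(shm_total_bytes, memlock_soft, min_shm_bytes=GIB):
    """Return warnings for limits below the assumed targets (empty if fine)."""
    problems = []
    if shm_total_bytes < min_shm_bytes:
        problems.append(
            "/dev/shm is %d MiB; consider --shm-size=1g" % (shm_total_bytes >> 20)
        )
    if memlock_soft != resource.RLIM_INFINITY:
        problems.append(
            "memlock limit is %d bytes; consider --ulimit memlock=-1" % memlock_soft
        )
    return problems


def read_current_limits(shm_path="/dev/shm"):
    """Read the shared-memory size and the soft locked-memory limit."""
    shm_total = shutil.disk_usage(shm_path).total
    memlock_soft, _hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    return shm_total, memlock_soft
```

Inside the container, `find_limit_problems(*read_current_limits())` should return an empty list when both flags were passed at launch.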
Update the command to launch the container, in order to be in line with the NGC PaddlePaddle website. This edit was initiated by the feedback in PaddlePaddle/Paddle#45074
* Update install_NGC_PaddlePaddle_ch.rst
  Update the command to launch the container, in order to be in line with the NGC PaddlePaddle website. This edit was initiated by the feedback in PaddlePaddle/Paddle#45074
* Update install_NGC_PaddlePaddle_en.rst
* Update install_NGC_PaddlePaddle_ch.rst
* Update install_NGC_PaddlePaddle_en.rst
* Update install_NGC_PaddlePaddle_ch.rst
* Update install_NGC_PaddlePaddle_en.rst
Co-authored-by: Dingjiawei <327396238@qq.com>
After using the CUDA 11.7 container more heavily, it doesn't seem to work very well.
step 0 (NotFound) No Input(qo_kv_seqlen) found for MHA operator.
[Hint: Expected ctx->HasInput("qo_kv_seqlen") == true, but received ctx->HasInput("qo_kv_seqlen"):0 != true:1.] (at /opt/paddle/paddle/paddle/fluid/operators/mha/mha_op.cc:32)
[operator < mha > error]
The code I ran is the demo bundled with PaddleNLP, with the launched ernie-1.0 changed to ernie-3.0:
export CUDA_VISIBLE_DEVICES=0
python run_pretrain.py \
--model_type "ernie" \
--model_name_or_path "ernie-3.0-base-zh" \
--tokenizer_name_or_path "ernie-3.0-base-zh" \
--continue_training true \
--input_dir "./data1" \
--output_dir "output/ernie-3.0-dp8-gb512" \
--split 94,50,1 \
--max_seq_len 512 \
--micro_batch_size 32 \
--use_amp true \
--max_lr 0.0001 \
--min_lr 0.00001 \
--max_steps 500000 \
--save_steps 50000 \
--checkpoint_steps 5000 \
--decay_steps 990000 \
--weight_decay 0.01 \
--warmup_rate 0.01 \
--grad_clip 1.0 \
--logging_freq 20 \
--num_workers 10 \
--eval_freq 1000 \
--device "gpu" \
--share_folder false |
Hi @Tlntin, |
Since you haven't replied for more than a year, we have closed this issue/pr. |
Feature Description
When will CUDA Version 11.7 be supported?
Alternatives
No response