NVIDIA / nccl Public

Notifications You must be signed in to change notification settings
Fork 835
Star 3.3k

Code
Issues 649
Pull requests 65
Actions
Security
Insights

Additional navigation options

Code
Issues
Pull requests
Actions
Security
Insights

Issues: NVIDIA/nccl

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

649 Open 689 Closed

Author

Filter by author

Label

Filter by label

Use alt + click/return to exclude labels

or ⇧ + click/return for logical OR

Projects

Filter by project

Milestones

Filter by milestone

Assignee

Filter by who’s assigned

Assigned to nobody

Sort

Sort by

Newest Oldest Most commented Least commented Recently updated Least recently updated Best match

Most reactions

Issues list

Explicit support for comments in nccl.conf

#1540 opened Dec 12, 2024 by sclarkson

NCCL 2.22.3 core dump when specify NCCL_IB_ROCE_VERSION_NUM

#1538 opened Dec 12, 2024 by nkflash

k8s nccl ib slow

#1536 opened Dec 8, 2024 by JunjieLl

Question about using nccl group

#1535 opened Dec 7, 2024 by chenhongyu2048

what's gdrcopy sync’s advantage?

#1533 opened Dec 6, 2024 by nope8

introduce USDT for the perf tracing

#1532 opened Dec 6, 2024 by gangxie112

When the number of nodes increases, the bandwidth performance of alltoall is unstable

#1531 opened Dec 5, 2024 by fj1425fj

Error Using Different GPUs for Two Containers on the Same Node

#1529 opened Dec 2, 2024 by cyberpunk-admin

Runtime ERROR: NCCL WARN Cuda failure 'named symbol not found. unhandled cuda error (run with NCCL_DEBUG=INFO for details)

#1528 opened Dec 2, 2024 by Seqaeon

cannot use two roce nics

#1527 opened Nov 29, 2024 by gongysh2004

NCCL error (vendor err) during multi-node training with mixed HCA vendors (Mellanox and Broadcom)

#1526 opened Nov 28, 2024 by asdfry

What memory is the acc variable located in?

#1525 opened Nov 27, 2024 by JuiceLemonLemon

local access violation work queue error when upgrade to v2.20.3-1

#1524 opened Nov 26, 2024 by gangxie112

Questions about the FIFO of simple protocol

#1523 opened Nov 25, 2024 by JK-Jiagn

Any possibility/plan to support fused kernels?

#1522 opened Nov 25, 2024 by dearsxx0918

Why group calls (ncclGroupStart() and ncclGroupEnd()) are invoked in ncclSend() and ncclRecv()

#1521 opened Nov 21, 2024 by ZhiyiHu1999

Is it safe or recommended to use multiple communicators for real distributed training

#1520 opened Nov 19, 2024 by ZhiyiHu1999

Unable to use multiple NICs

#1519 opened Nov 19, 2024 by thecodingwizard

torch.distributed.DistBackendError: NCCL error

#1517 opened Nov 18, 2024 by Chevolier

Could not enable P2P between devices

#1516 opened Nov 18, 2024 by ZhiyiHu1999

Nccl socketStartConnect: Connect to x.x.x.x<xxxx> failed : Software caused connection abort

#1515 opened Nov 16, 2024 by 913871734

torch.distributed.DistBackendError: NCCL error in ProcessGroupNCCL.cpp:1275

#1514 opened Nov 14, 2024 by shenshaowei

nccl capture error

#1513 opened Nov 14, 2024 by freshduer

nccl-test Indicates a performance problem

#1511 opened Nov 10, 2024 by yalbaba

GPU Direct RDMA Disabled for HCA

#1510 opened Nov 9, 2024 by hiennguyennq

Previous 1 2 3 4 5 … 25 26 Next

Previous Next

ProTip! Updated in the last three days: updated:>2024-12-09.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly