Completes bfloat16 dtype for collective api in eager mode #45844
Changes from all commits
@@ -996,6 +996,9 @@ void* GetPointerByOffset(void* raw_pointer,
   } else if (type == experimental::DataType::BOOL) {
     return reinterpret_cast<void*>(reinterpret_cast<bool*>(raw_pointer) +
                                    offset);
+  } else if (type == experimental::DataType::BFLOAT16) {
+    return reinterpret_cast<void*>(reinterpret_cast<uint16_t*>(raw_pointer) +
+                                   offset);
   } else {
     PADDLE_THROW(platform::errors::Unimplemented(
         "This datatype in nccl is not supported."));

Review comments on this hunk:

AllReduce uint16 data?

As the code below shows, they use uint16 — see paddle/phi/common/bfloat16.h, lines 74 to 79 in 75528ad.

And it seems that we cannot use …

This issue mentioned the uint16 problem, #34927
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
这样是不是有问题,gloo支持bfloat吗
Without this line it errors out; with this line it runs 🤔 Judging from yesterday's test results, it seems fine.
No — without this line it will definitely error out. My point is that Gloo doesn't seem to support bf16 internally, so I'm curious why this can pass the tests.
Could it be that Paddle's bf16 tensors actually hold uint16? All signs suggest that on the host it doesn't really use bf16 at all; since the underlying data is actually uint16, it runs fine.
Then is there any difference between the bfloat16 supported in NCCL and just transferring it as uint16?
NCCL seems to check the CUDA version to decide whether it can use bf16; Gloo probably just uses uint16 directly?