-
Notifications
You must be signed in to change notification settings - Fork 183
DEVICE/API: Update NIXL device API return status #881
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DEVICE/API: Update NIXL device API return status #881
Conversation
Signed-off-by: Michal Shalev <mshalev@nvidia.com>
|
👋 Hi michal-shalev! Thank you for contributing to ai-dynamo/nixl. Your PR reviewers will review your contribution then trigger the CI to test your changes. 🚀 |
|
/build |
| if (status == UCS_INPROGRESS) { | ||
| return NIXL_IN_PROG; | ||
| } | ||
| printf("UCX returned error: %d\n", status); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| printf("UCX returned error: %d\n", status); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't agree with you, we need an error log here. It doesn't add another branch, we already know that it's an error and NIXL_ERR_BACKEND is too generic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This logging method is not suitable in this context.
I think we don't need such logging in the device code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moreover, using naked printf as a logging method is generally not convenient, since the logging level and output are not controlled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is it not suitable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Printing the UCX error code does not provide useful information to the user, as it requires access to the UCX source code. The error information should be logged by UCX.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand the concern about device API logging. However, I'm keeping this printf for now for the following reasons:
- UCX doesn't consistently log errors in device code paths, leaving us with no visibility when things fail.
NIXL_ERR_BACKENDalone provides zero actionable information - at minimum we need the UCX error code to debug.- We're actively debugging integration issues with this code right now, and this printf has been essential for troubleshooting
- Proper logging infrastructure is tracked separately - I've opened a ticket to implement a proper logging macro for device code
This is a pragmatic debugging aid during urgent integration work. I'll address the logging infrastructure properly in the follow-up ticket, but for now this printf is necessary and helpful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's push the device debugging macro soon, for now leave this print and change it when macro is merged.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand the concern about device API logging. However, I'm keeping this printf for now for the following reasons:
- UCX doesn't consistently log errors in device code paths, leaving us with no visibility when things fail.
NIXL_ERR_BACKENDalone provides zero actionable information - at minimum we need the UCX error code to debug.- We're actively debugging integration issues with this code right now, and this printf has been essential for troubleshooting
- Proper logging infrastructure is tracked separately - I've opened a ticket to implement a proper logging macro for device code
This is a pragmatic debugging aid during urgent integration work. I'll address the logging infrastructure properly in the follow-up ticket, but for now this printf is necessary and helpful.
- This issue should be addressed on UCX level, and not as w/a in NIXL.
- As I previously mentioned, UCX error code without explanation only makes sense in conjunction with the UCX source code, and is almost useless in terms of more general NIXL usage.
- (4.) I cannot agree that it is acceptable to make such changes to the main branch. Such code can only be used in personal branches during development.
|
/build |
Signed-off-by: Michal Shalev <mshalev@nvidia.com>
Signed-off-by: Michal Shalev <mshalev@nvidia.com>
|
/build |
|
/build |
What?
Following openucx/ucx#10928
Update NIXL device API documentation to reflect that post operations return
NIXL_IN_PROG.Why?
Post operations now return
UCS_INPROGRESSfrom underlying UCX API, which maps toNIXL_IN_PROG.How?