@michal-shalev
Contributor

@michal-shalev michal-shalev commented Oct 8, 2025

What?

Following openucx/ucx#10928
Update NIXL device API documentation to reflect that post operations return NIXL_IN_PROG.

Why?

Post operations now return UCS_INPROGRESS from the underlying UCX API, which maps to NIXL_IN_PROG.
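
A minimal sketch of the mapping described above, for illustration only. The enum types and their numeric values are stand-ins so the snippet compiles on its own (the real definitions come from the UCX and NIXL headers), the helper names `to_nixl_status` and `post_accepted` are hypothetical, and the `UCS_OK` branch is an assumption because it is not part of the hunk quoted in this PR.

```cpp
// Stand-in status types. The numeric values are placeholders, not the
// values used by the real UCX or NIXL headers.
enum sketch_ucs_status_t { UCS_OK = 0, UCS_INPROGRESS = 1, UCS_ERR_IO_ERROR = -3 };
enum sketch_nixl_status_t { NIXL_SUCCESS = 0, NIXL_IN_PROG = 1, NIXL_ERR_BACKEND = -100 };

// Hypothetical helper: translate a UCX status into a NIXL status.
inline sketch_nixl_status_t to_nixl_status(sketch_ucs_status_t status) {
    if (status == UCS_OK) {
        return NIXL_SUCCESS; // assumption: completed-immediately case
    }
    if (status == UCS_INPROGRESS) {
        return NIXL_IN_PROG; // the post was accepted and is still running
    }
    return NIXL_ERR_BACKEND; // every other UCX status is reported as a backend error
}

// Hypothetical caller-side check: NIXL_IN_PROG is not a failure.
inline bool post_accepted(sketch_nixl_status_t status) {
    return status == NIXL_SUCCESS || status == NIXL_IN_PROG;
}
```

Under these assumptions, a caller that previously compared the result only against `NIXL_SUCCESS` would also need to accept `NIXL_IN_PROG` from a post call.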

How?

  • Updated docs for all post functions
  • Added an error log

Signed-off-by: Michal Shalev <mshalev@nvidia.com>
@github-actions

github-actions bot commented Oct 8, 2025

👋 Hi michal-shalev! Thank you for contributing to ai-dynamo/nixl.

Your PR reviewers will review your contribution then trigger the CI to test your changes.

🚀

@michal-shalev
Contributor Author

/build

ovidiusm previously approved these changes Oct 8, 2025
if (status == UCS_INPROGRESS) {
    return NIXL_IN_PROG;
}
printf("UCX returned error: %d\n", status);
Contributor

Suggested change
printf("UCX returned error: %d\n", status);

Contributor Author

I disagree; we need an error log here. It doesn't add another branch: at this point we already know it's an error, and NIXL_ERR_BACKEND alone is too generic.

Contributor

This logging method is not suitable in this context.
I think we don't need such logging in the device code.

Contributor

@rakhmets rakhmets Oct 8, 2025

Moreover, using a bare printf for logging is generally a poor fit, because neither the logging level nor the output destination can be controlled.
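
To illustrate what "controlled" level and output could mean, here is a host-side sketch of a level-gated logging helper. Every name in it (`sketch_log`, `sketch_log_config`, the `SKETCH_LOG_*` levels) is hypothetical and is not NIXL or UCX API. It also assumes host code; device code would need a different mechanism, which this thread does not specify.

```cpp
#include <cstdarg>
#include <cstdio>

// Hypothetical log levels, ordered from least to most verbose.
enum sketch_log_level { SKETCH_LOG_NONE = 0, SKETCH_LOG_ERROR = 1, SKETCH_LOG_DEBUG = 2 };

// Hypothetical runtime configuration: maximum enabled level and output stream.
struct sketch_logger {
    sketch_log_level level = SKETCH_LOG_ERROR;
    std::FILE *out = stderr;
};

// Single shared configuration object.
inline sketch_logger &sketch_log_config() {
    static sketch_logger cfg;
    return cfg;
}

// Writes the message only if `level` is enabled. Returns true when the
// message was written and false when it was filtered out.
inline bool sketch_log(sketch_log_level level, const char *fmt, ...) {
    sketch_logger &cfg = sketch_log_config();
    if (level == SKETCH_LOG_NONE || level > cfg.level || cfg.out == nullptr) {
        return false;
    }
    va_list args;
    va_start(args, fmt);
    std::vfprintf(cfg.out, fmt, args);
    va_end(args);
    return true;
}
```

With this shape, a call such as `sketch_log(SKETCH_LOG_ERROR, "UCX returned error: %d\n", status)` can be silenced or redirected by changing the configuration instead of editing the call site.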

Contributor Author

Why is it not suitable?

Contributor

Printing the UCX error code does not provide useful information to the user, as it requires access to the UCX source code. The error information should be logged by UCX.
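
One way to make a raw code readable without the UCX sources is to print a symbolic name next to the number. The sketch below uses a local lookup table with placeholder values and hypothetical helper names (`sketch_status_name`, `describe_status`); it is not taken from this PR. On the host side, UCX provides `ucs_status_string()` for this purpose, but whether an equivalent is usable from device code is not stated in this thread.

```cpp
#include <string>

// Hypothetical lookup table. The numeric values are placeholders chosen
// for the sketch, not an authoritative copy of the UCX status codes.
inline const char *sketch_status_name(int status) {
    switch (status) {
    case 0:
        return "UCS_OK";
    case 1:
        return "UCS_INPROGRESS";
    case -3:
        return "UCS_ERR_IO_ERROR";
    default:
        return "unknown status";
    }
}

// Builds a message that carries both the symbolic name and the raw code.
inline std::string describe_status(int status) {
    return std::string("UCX returned error: ") + sketch_status_name(status) +
           " (" + std::to_string(status) + ")";
}
```

For example, `describe_status(-3)` produces `UCX returned error: UCS_ERR_IO_ERROR (-3)` under the placeholder table above.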

Contributor Author

@michal-shalev michal-shalev Oct 8, 2025

I understand the concern about device API logging. However, I'm keeping this printf for now for the following reasons:

  1. UCX doesn't consistently log errors in device code paths, leaving us with no visibility when things fail.
  2. NIXL_ERR_BACKEND alone provides zero actionable information - at minimum we need the UCX error code to debug.
  3. We're actively debugging integration issues with this code right now, and this printf has been essential for troubleshooting.
  4. Proper logging infrastructure is tracked separately - I've opened a ticket to implement a proper logging macro for device code.

This is a pragmatic debugging aid during urgent integration work. I'll address the logging infrastructure properly in the follow-up ticket, but for now this printf is necessary and helpful.

Let's push the device debugging macro soon. For now, leave this print in place and change it once the macro is merged.

Contributor

> I understand the concern about device API logging. However, I'm keeping this printf for now for the following reasons:
>
>   1. UCX doesn't consistently log errors in device code paths, leaving us with no visibility when things fail.
>   2. NIXL_ERR_BACKEND alone provides zero actionable information - at minimum we need the UCX error code to debug.
>   3. We're actively debugging integration issues with this code right now, and this printf has been essential for troubleshooting
>   4. Proper logging infrastructure is tracked separately - I've opened a ticket to implement a proper logging macro for device code
>
> This is a pragmatic debugging aid during urgent integration work. I'll address the logging infrastructure properly in the follow-up ticket, but for now this printf is necessary and helpful.

  1. This issue should be addressed at the UCX level, not as a workaround in NIXL.
  2. As I mentioned previously, a UCX error code without an explanation only makes sense alongside the UCX source code, and is almost useless for general NIXL usage.
  3. (and 4.) I cannot agree that such changes are acceptable on the main branch. Code like this belongs only in personal branches during development.

@michal-shalev
Contributor Author

/build

Signed-off-by: Michal Shalev <mshalev@nvidia.com>
Signed-off-by: Michal Shalev <mshalev@nvidia.com>
Signed-off-by: Michal Shalev <mshalev@nvidia.com>
@michal-shalev
Contributor Author

/build

@michal-shalev michal-shalev enabled auto-merge (squash) October 8, 2025 18:25
@michal-shalev
Contributor Author

/build

@michal-shalev michal-shalev merged commit f7254f9 into ai-dynamo:main Oct 8, 2025
20 of 21 checks passed
@michal-shalev michal-shalev deleted the nixl-device-api-return-status-update branch October 8, 2025 20:48