fabtests: Disable fi_rdm_tagged_peek for cleanup failure for psm3 and ucx #10124

Merged
merged 2 commits into from
Jun 26, 2024

Contributor

@zachdworkin zachdworkin commented Jun 25, 2024

fi_rdm_tagged_peek fails during cleanup with "munmap_chunk(): invalid pointer" when trying to free hfi_nids in psm_ep.c:1161.
This test succeeds when FI_PROVIDER is unset and fails when it is set to "psm3" or "PSM3". There is an open issue in ofiwg/libfabric to track this bug. When it is resolved, we can re-enable this test.

Issue opened: #10123

With the ucx provider, fi_rdm_tagged_peek fails during cleanup with a segmentation fault when closing the endpoint.
This failure is a race condition, and there is no known way to reproduce it on every run.

Issue opened: #10126

fi_rdm_tagged_peek fails during cleanup with "munmap_chunk(): invalid pointer"
when trying to free hfi_nids in psm_ep.c:1161.
This test is successful when FI_PROVIDER is unset and fails when it is
set to "psm3" or "PSM3". There is an open issue in ofiwg/libfabric to track
this bug. When it is resolved we can re-enable this test.

Issue opened: 10123

Signed-off-by: Zach Dworkin <zachary.dworkin@intel.com>

fi_rdm_tagged_peek is failing on the cleanup path.
ft_free_res() -> ft_close_fids() -> fi_close() -> ucx_ep_close()
-> ucp_worker_destroy() -> ucp_worker_discard_uct_ep_progress()
-> ucp_ep_destroy_base() -> __funlockfile()

The reported error is:
"Segmentation fault: address not mapped to object at address 0x8"

This is a race condition and does not occur every time.
To reproduce run:
server: fi_rdm_tagged_peek -p ucx -E
client: fi_rdm_tagged_peek -p ucx -E server_address

Issue 10126 is tracking this bug. Re-enable this test when it is resolved.

Signed-off-by: Zach Dworkin <zachary.dworkin@intel.com>
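
Since the ucx failure is a race and does not occur on every run, one way to catch it is to repeat the reproduction until a run fails. The following is only a sketch: `run_until_fail` is a hypothetical helper, not part of fabtests, and it assumes the failing process exits with a non-zero status (a shell typically reports a segmentation fault as 139).

```shell
#!/bin/sh
# Hypothetical helper (not part of fabtests): rerun a command until it
# fails or until MAX_RUNS attempts have passed.
# usage: run_until_fail MAX_RUNS command [args...]
run_until_fail() {
    max_runs=$1
    shift
    i=1
    while [ "$i" -le "$max_runs" ]; do
        status=0
        # Capture the exit status without aborting under "set -e".
        "$@" || status=$?
        if [ "$status" -ne 0 ]; then
            echo "run $i failed with exit status $status"
            return "$status"
        fi
        i=$((i + 1))
    done
    echo "no failure in $max_runs runs"
    return 0
}
```

On the client this could be invoked as `run_until_fail 100 fi_rdm_tagged_peek -p ucx -E server_address`. Assuming the server process also exits after each run, it would need a matching loop, and the crash may show up on either side.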
@zachdworkin zachdworkin changed the title fabtests/psm3: Disable fi_rdm_tagged_peek for cleanup failure fabtests: Disable fi_rdm_tagged_peek for cleanup failure for psm3 and ucx Jun 25, 2024
@zachdworkin
Contributor Author

zachdworkin commented Jun 26, 2024

bot:aws:retest

@shijin-aws
Contributor

@zachdworkin AWS CI is currently broken due to a dependency issue. I will fix it shortly.

@zachdworkin
Contributor Author

@shijin-aws Thanks for the heads-up! Can you please replay this PR when it's fixed?

@shijin-aws
Contributor

Yep, will do

@zachdworkin
Contributor Author

@shijin-aws Since these changes are to the .exclude files for fabtests, do we need to wait for AWS CI?

@shijin-aws
Contributor

Yeah I think you can feel free to merge it.

@shijin-aws
Contributor

AWS CI doesn't run the psm3 and ucx tests.

@zachdworkin zachdworkin merged commit 8697853 into ofiwg:main Jun 26, 2024
12 of 13 checks passed