btl/usnic: move btl_usnic_lock initialization #12355

Merged (1 commit) on Feb 23, 2024

Contributor

@lrbison lrbison commented Feb 20, 2024

Previously the btl_usnic_lock initialization was in component_init, but after #12057 I noticed that MPI_Finalize would find an uninitialized object and segfault. This change moves the initialization to component_open, so that component_close() can rely on the opal object initialization having been completed.
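
To illustrate the ordering issue, here is a minimal, self-contained sketch. It is not the usnic code: `toy_mutex_t`, `toy_component_open()` and the other `toy_*` names are hypothetical stand-ins, the `constructed` flag plays the role of the OPAL object bookkeeping, and it assumes (as the reproducer suggests) that close can run without init ever having constructed the lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for an OPAL mutex object. */
typedef struct {
    bool constructed;          /* stands in for a non-NULL obj_class */
    pthread_mutex_t m_lock;
} toy_mutex_t;

/* Zero-initialized static, like btl_usnic_lock before construction. */
static toy_mutex_t toy_lock;

static void toy_construct(toy_mutex_t *m)
{
    pthread_mutex_init(&m->m_lock, NULL);
    m->constructed = true;
}

/* Returns 0 on success, -1 if the object was never constructed.
 * (The real destructor path crashes instead of returning an error.) */
static int toy_destruct(toy_mutex_t *m)
{
    if (!m->constructed) {
        return -1;
    }
    pthread_mutex_destroy(&m->m_lock);
    m->constructed = false;
    return 0;
}

/* open: the lock is constructed here, so close can always destroy it. */
int toy_component_open(void)
{
    toy_construct(&toy_lock);
    return 0;
}

/* init: may be skipped or return early; it no longer owns construction. */
int toy_component_init(void)
{
    assert(toy_lock.constructed);
    return 0;
}

/* close: destroys the lock unconditionally. */
int toy_component_close(void)
{
    return toy_destruct(&toy_lock);
}
```

With this layout, calling `toy_component_open()` followed directly by `toy_component_close()` succeeds even though `toy_component_init()` never ran, which is the sequence the reproducer appears to exercise.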


My reproducer on main was:

 mpirun -n 2 --mca pml cm  --mca pml_base_verbose 10 -- ./hello_world

which previously crashed with:

#0  opal_obj_run_destructors (object=0xffff86302dc8 <btl_usnic_lock>) at ../../../../opal/class/opal_object.h:470
#1  0x0000ffff862d87cc in usnic_component_close () at btl_usnic_component.c:215
#2  0x0000ffff8d7a042c in mca_base_component_close (component=0xffff86302788 <mca_btl_usnic_component>, output_id=-1) at mca_base_components_close.c:52
#3  0x0000ffff8d7a054c in mca_base_components_close (output_id=-1, components=0xffff8d845d20 <opal_btl_base_framework+80>, skip=0x0) at mca_base_components_close.c:89
#4  0x0000ffff8d7a04f0 in mca_base_framework_components_close (framework=0xffff8d845cd0 <opal_btl_base_framework>, skip=0x0) at mca_base_components_close.c:70
#5  0x0000ffff8d80293c in mca_btl_base_close () at base/btl_base_frame.c:231
#6  0x0000ffff8d7a23ec in mca_base_framework_close (framework=0xffff8d845cd0 <opal_btl_base_framework>) at mca_base_framework.c:252
#7  0x0000ffff8db7d7b8 in mca_bml_base_close () at base/bml_base_frame.c:130
#8  0x0000ffff8d7a23ec in mca_base_framework_close (framework=0xffff8dc45608 <ompi_bml_base_framework>) at mca_base_framework.c:252
#9  0x0000ffff8dadf754 in ompi_mpi_instance_finalize_common () at instance/instance.c:945
#10 0x0000ffff8dadf850 in ompi_mpi_instance_finalize (instance=0xffff8dc5a938 <ompi_mpi_instance_default>) at instance/instance.c:975
#11 0x0000ffff8dad1bfc in ompi_mpi_finalize () at runtime/ompi_mpi_finalize.c:294
#12 0x0000ffff8db1c480 in PMPI_Finalize () at finalize.c:52

(gdb) p btl_usnic_lock
$1 = {super = {obj_class = 0x0, obj_reference_count = 0}, m_lock = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __list = {__prev = 0x0, __next = 0x0}},
    __size = '\000' <repeats 47 times>, __align = 0}, m_lock_atomic = 0}
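
The `obj_class = 0x0` in the gdb output is the key detail: the object was never constructed, so there is no class to look up destructors from. The following is a toy model of that failure mode, not OPAL's actual implementation. All `mini_*` names are hypothetical, and the NULL check is added here only so the sketch can report the problem instead of crashing.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*mini_destruct_fn)(void *);

/* Hypothetical class descriptor: a NULL-terminated destructor list. */
typedef struct {
    mini_destruct_fn *destructors;
} mini_class_t;

/* Hypothetical object header, loosely mirroring the fields gdb printed. */
typedef struct {
    mini_class_t *obj_class;
    int obj_reference_count;
} mini_object_t;

int mini_destructor_calls = 0;

static void mini_count_destruct(void *obj)
{
    (void) obj;
    mini_destructor_calls++;
}

static mini_destruct_fn mini_dtors[] = { mini_count_destruct, NULL };
static mini_class_t mini_class = { mini_dtors };

/* Construction is what sets the class pointer. */
void mini_construct(mini_object_t *obj)
{
    obj->obj_class = &mini_class;
    obj->obj_reference_count = 1;
}

/* Returns the number of destructors run, or -1 if the object was never
 * constructed. Without the guard, the loop would dereference NULL. */
int mini_run_destructors(mini_object_t *obj)
{
    if (NULL == obj->obj_class) {
        return -1;
    }
    int n = 0;
    for (mini_destruct_fn *d = obj->obj_class->destructors; NULL != *d; ++d) {
        (*d)(obj);
        ++n;
    }
    return n;
}
```

A zero-filled `mini_object_t` takes the `-1` path, while one that went through `mini_construct()` runs its single destructor.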

My reproducer is now fixed; however, I wonder whether this is the correct fix, since #12057 removed the static init, and now we only have this object initialization.

In particular, I'm not familiar enough with MPI_Init_thread to know whether multiple threads may try to initialize the object at the same time, nor do I have a usnic test system.
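
For reference, if concurrent initialization did turn out to be possible, one generic way to guarantee a single construction is POSIX `pthread_once`. This PR does not do that, and nothing here says whether Open MPI needs it. The sketch below only shows the technique, with hypothetical `once_*` names.

```c
#include <assert.h>
#include <pthread.h>

#define ONCE_MAX_THREADS 8

static pthread_once_t once_ctl = PTHREAD_ONCE_INIT;
static pthread_mutex_t once_lock;
static int once_init_count = 0;

/* Runs exactly once, no matter how many threads race to call it. */
static void once_init(void)
{
    pthread_mutex_init(&once_lock, NULL);
    once_init_count++;
}

static void *once_worker(void *arg)
{
    (void) arg;
    pthread_once(&once_ctl, once_init);
    return NULL;
}

/* Starts up to ONCE_MAX_THREADS threads that all attempt the
 * initialization, waits for them, and returns how many times the
 * initializer actually ran. */
int once_demo(int nthreads)
{
    pthread_t tids[ONCE_MAX_THREADS];
    int started = 0;

    if (nthreads > ONCE_MAX_THREADS) {
        nthreads = ONCE_MAX_THREADS;
    }
    for (int i = 0; i < nthreads; ++i) {
        if (0 != pthread_create(&tids[started], NULL, once_worker, NULL)) {
            break;
        }
        ++started;
    }
    for (int i = 0; i < started; ++i) {
        pthread_join(tids[i], NULL);
    }
    return once_init_count;
}
```

The counter is read only after every thread has been joined, so the returned value reflects all initialization attempts.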

Signed-off-by: Luke Robison <lrbison@amazon.com>
Contributor

@devreal devreal left a comment

I'm not familiar with the usnic BTL, but it makes sense to always construct the lock, since it is destroyed unconditionally.

@jsquyres jsquyres self-requested a review February 20, 2024 21:48
Member

@lrbison Many thanks for this. I did not realize that #12057 touched usnic. I'll review this shortly.

Member

@jsquyres jsquyres left a comment

Thanks for submitting this fix!

@wenduwan wenduwan merged commit cb00772 into open-mpi:main Feb 23, 2024
10 of 11 checks passed
@lrbison lrbison deleted the usnic_mutex branch February 24, 2024 05:05