Lock release for write fails on Cori with more than 1 client #10

Closed
QiaoK opened this issue Mar 8, 2021 · 1 comment
Comments

@QiaoK
Contributor

QiaoK commented Mar 8, 2021

To reproduce, follow my steps to install PDC on Cori. Then go to the bin folder inside the install folder and run the following command. Both clients end in a segmentation fault.
./mpi_test.sh ./write_obj srun 1 2 o 1 int

qkt561@nid00009:/global/cscratch1/sd/qkt561/FS_1M_169/bin> ./mpi_test.sh ./write_obj srun 1 2 o 1 int
Input arguments are the followings
o 1 int
testing: ./write_obj
srun -n 1 ./pdc_server.exe &
srun -n 2 ./write_obj o 1 int
Writing a 1 MB object [o_0] with 2 clients.
Writing a 1 MB object [o_1] with 2 clients.
==PDC_CLIENT: PDC_DEBUG set to 0!
==PDC_CLIENT[0]: Config file from default location [./pdc_tmp/server.cfg] not available, waiting 1 seconds
==PDC_CLIENT[0]: Config file from default location [./pdc_tmp/server.cfg] not available, waiting 2 seconds
==PDC_CLIENT[0]: Config file from default location [./pdc_tmp/server.cfg] not available, waiting 4 seconds
==PDC_CLIENT[0]: Config file from default location [./pdc_tmp/server.cfg] not available, waiting 8 seconds
Unidentified node: Error detected by libibgni.so.  Subsequent operation may be unreliable.  IAA did not recognize this as an MPI process
Unidentified node: Error detected by libibgni.so.  Subsequent operation may be unreliable.  IAA did not recognize this as an MPI process

==PDC_SERVER[0]: using [./pdc_tmp/] as tmp dir. 0 OSTs per data file, 0% to BB
==PDC_SERVER[0]: using ofi+tcp
==PDC_SERVER[0]: without multi-thread!
==PDC_SERVER[0]: Read cache enabled!
==PDC_SERVER[0]: Successfully established connection to 0 other PDC servers
==PDC_SERVER[0]: Server ready!


==PDC_CLIENT[0]: Found 1 PDC Metadata servers, running with 2 PDC clients
==PDC_CLIENT: using ofi+tcp
Unidentified node: Error detected by libibgni.so.  Subsequent operation may be unreliable.  IAA did not recognize this as an MPI process
Unidentified node: Error detected by libibgni.so.  Subsequent operation may be unreliable.  IAA did not recognize this as an MPI process
==PDC_CLIENT[0]: Client lookup all servers at start time!
Unidentified node: Error detected by libibgni.so.  Subsequent operation may be unreliable.  IAA did not recognize this as an MPI process
Unidentified node: Error detected by libibgni.so.  Subsequent operation may be unreliable.  IAA did not recognize this as an MPI process
==PDC_CLIENT[0]: using [./pdc_tmp] as tmp dir, 2 clients per server
create a new pdc
create a new pdc
my_data_size at rank 0 is 524288
my_data_size at rank 1 is 524288
rank 0 offset = 0, length = 524288, unit size = 4
rank 1 offset = 524288, length = 524288, unit size = 4
Error in /global/homes/q/qkt561/test_install/pdc/src/api/pdc_client_server_common.c:1946
 # buf_map_region_release_bulk_transfer_cb(): Error in region_release_bulk_transfer_cb()
srun: error: nid00009: task 0: Segmentation fault
srun: Terminating job step 40407325.8
srun: error: nid00009: task 1: Segmentation fault
srun: Terminating job step 40407325.7
slurmstepd: error: *** STEP 40407325.7 ON nid00009 CANCELLED AT 2021-03-07T18:38:44 ***
srun: error: nid00009: task 0: Killed
srun: Force Terminated job step 40407325.7
==PDC_CLIENT: PDC_DEBUG set to 0!
==PDC_CLIENT[0]: Found 1 PDC Metadata servers, running with 1 PDC clients
==PDC_CLIENT: using ofi+tcp
Unidentified node: Error detected by libibgni.so.  Subsequent operation may be unreliable.  IAA did not recognize this as an MPI process
Unidentified node: Error detected by libibgni.so.  Subsequent operation may be unreliable.  IAA did not recognize this as an MPI process
==PDC_CLIENT[0]: Client lookup all servers at start time!
srun: error: nid00009: task 0: Segmentation fault
srun: Terminating job step 40407325.9
houjun added a commit that referenced this issue Mar 11, 2021
@houjun
Member

houjun commented Mar 11, 2021

Fixed the test code. The buffer mapping should use a local offset (0 on all ranks) and a global offset (different on each rank), and the lock and release should use "global_region" instead of "local_region".
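
As a rough illustration of the offsets described above, here is a minimal Python sketch. It is not the PDC API or the actual test code: `Region`, `regions_for_rank`, and `overlaps` are hypothetical names made up for this example, and the per-rank size is taken from the `my_data_size` value in the log. It assumes each rank owns one contiguous, equally sized slice of a 1-D object.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """Hypothetical 1-D region: a start offset and a size, in the same units."""
    offset: int
    size: int


def regions_for_rank(rank, per_rank_size):
    """Return (local_region, global_region) for one rank.

    local_region describes the rank's own buffer, so it always starts at 0.
    global_region describes where that buffer lands in the shared object,
    so it is shifted past the data owned by lower-numbered ranks.
    """
    local_region = Region(offset=0, size=per_rank_size)
    global_region = Region(offset=rank * per_rank_size, size=per_rank_size)
    return local_region, global_region


def overlaps(a, b):
    """True if two regions share at least one position."""
    return a.offset < b.offset + b.size and b.offset < a.offset + a.size


if __name__ == "__main__":
    per_rank_size = 524288  # my_data_size reported for each rank in the log
    for rank in range(2):
        local_region, global_region = regions_for_rank(rank, per_rank_size)
        # The buffer is mapped with the pair (local_region, global_region);
        # the lock and release would then name global_region, not local_region.
        print(rank, local_region, global_region)
```

In this sketch the local regions of the two ranks cover the same range, while their global regions are disjoint, which is the distinction the fix relies on when choosing the region to lock and release.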

QiaoK closed this as completed Mar 11, 2021