Corrupted data using parallel hdf5 #12718
Comments
@tpadioleau thank you for the bug report. What file system is this on? I will have a look in the next few days, and a reproducer is definitely super helpful.
We do have an optimization in the code specifically for cartesian communicators. I am wondering whether something in that logic is slightly off for 3 processes, which is a bit of an unusual number for cartesian communicators.
I have just edited the issue to add the missing archive. I am working on my laptop, no parallel filesystem. I can also mention that I was not able to reproduce the error on this supercomputer https://mesocentre.pages.centralesupelec.fr/user_doc/ruche/01_cluster_overview with Open MPI.
I could also try with 4 processes and it also gives corrupted results after a few attempts. |
I did some preliminary analysis of this issue, and I am not yet sure what to make of it. I ran the test code with 3 processes on my local workstation using the romio component as a reference, and all relevant collective components of ompio for comparison. According to h5diff, the output files are always identical, so either they are all wrong (including romio on Open MPI) or all correct. Please note that they do not match the sample h5 file that was in the tar file. I am not sure whether that file was supposed to be an example of the correct output or of an erroneous one.
Thank you for your time. The difficulty is that I also get the correct result from time to time. The sample h5 in the tar is an example of an erroneous result. If you are interested, I can try to generate a docker image to get closer to my environment.
Here is an archive environment.zip that contains a Dockerfile that was generated (slightly modified to create a toto user) from a spack.yaml environment. Inside the container and mounting the directory that contains the reproducer in
The expected result with 3 MPI processes should look like this:
HDF5 "distributed_array.h5" {
GROUP "/" {
DATASET "distributed_array" {
DATATYPE H5T_STD_I32LE
DATASPACE SIMPLE { ( 3, 5 ) / ( 3, 5 ) }
DATA {
(0,0): 1, 1, 1, 1, 1,
(1,0): 2, 2, 2, 2, 2,
(2,0): 3, 3, 3, 3, 3
}
}
}
} |
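The expected dump above can be reproduced without MPI or HDF5 by a small sketch of the data layout. This is only an illustration under an assumption that the thread does not state explicitly: each rank r owns exactly row r of the 3 x 5 array and writes the value r + 1 into it. The names `slab_t`, `slab_for_rank` and `fill_expected` are hypothetical and are not taken from the reproducer.

```c
#include <assert.h>
#include <stddef.h>

enum { NROWS = 3, NCOLS = 5 };

/* Hyperslab-style description of the part of the global array
 * owned by one rank (hypothetical helper type). */
typedef struct {
    size_t offset[2]; /* first (row, column) owned by the rank */
    size_t count[2];  /* number of rows and columns owned */
} slab_t;

/* Assumption: rank r owns exactly row r of the NROWS x NCOLS array. */
static slab_t slab_for_rank(int rank)
{
    assert(rank >= 0 && rank < NROWS);
    slab_t s = { { (size_t)rank, 0 }, { 1, NCOLS } };
    return s;
}

/* Fill the global array the way a correct collective write would:
 * every rank stores the value rank + 1 in each element of its slab. */
static void fill_expected(int out[NROWS][NCOLS])
{
    for (int rank = 0; rank < NROWS; ++rank) {
        slab_t s = slab_for_rank(rank);
        for (size_t i = 0; i < s.count[0]; ++i)
            for (size_t j = 0; j < s.count[1]; ++j)
                out[s.offset[0] + i][s.offset[1] + j] = rank + 1;
    }
}
```

A file whose rows differ from the array produced by `fill_expected` would count as corrupted in the sense of this issue.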
Ok, I can confirm that inside the docker image I can reproduce the issue with the fcoll/vulcan component. Using the other fcoll components (i.e. individual, dynamic_gen2) produces the correct output. So the question is why this is occurring. I noticed that the precompiled Open MPI library in the docker image is configured quite differently from what I usually do, and I am wondering whether one of these options is contributing to this (e.g. --disable-builtin-atomics, --without-cma, --enable-mpi1-compatibility). I will look into this later this week.
I know what is triggering the issue. I just need to decide whether an if-statement in the code is erroneous or whether I need to add some locking protection around a particular write operation. Both make the test pass reliably, but since the code that includes the if-statement in question was written many years ago, I don't remember all the details (which would be important to decide whether the if-statement is erroneous or not). Either way, it is a legitimate bug, not a fluke or configure option issue. |
That is good news, thank you! |
Luckily the commit message from 5 years ago was helpful: the if-statement is correct in that it does what it was supposed to do.
Do you know if the bug can affect other communication/write operations ? |
Yes, it could, but how likely it is depends on the file system. I will have a fix ready either later today or tomorrow, and I will backport it to both the 5.0.x and 4.1.x series.
The fs/ufs component by default disabled all file locking before read/write operations (except on NFS file systems). This was based on the assumption that the file system itself performs the required locking, and hence we don't have to add to it. This assumption is incorrect when using data sieving. In data sieving, the code 'ignores' small gaps when we write to a file and instead performs a read-modify-write sequence itself for performance reasons. The problem, however, is that even within a collective operation not all aggregators might want to use data sieving. Hence, enabling locking just for the data-sieving routines is insufficient; all processes have to perform the locking. Therefore, our two options are: a) disable write data-sieving by default, or b) enable range-locking by default. After some testing, I think enabling range-locking by default is the safer and better approach. It doesn't seem to show any significant performance impact on my test systems. Note that on Lustre file systems we can keep the default at no-locking as far as I can see, since the collective algorithm used by Lustre is unlikely to produce this pattern. I did, however, add an mca parameter that allows us to control the locking algorithm used by the Lustre component as well, in case we need to change that for a particular use-case or platform.
Fixes Issue open-mpi#12718
Signed-off-by: Edgar Gabriel <Edgar.Gabriel@amd.com>
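To make the range-locking idea in the commit message concrete, here is a minimal sketch of a read-modify-write sequence protected by a POSIX advisory byte-range lock (`fcntl` with `F_SETLKW`). It is not the Open MPI implementation; `range_lock`, `sieved_write` and `SIEVE_MAX` are hypothetical names chosen for illustration. Because advisory locks only protect against writers that also take them, the sketch is consistent with the point above that all processes have to perform the locking, not only the ones doing data sieving.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define SIEVE_MAX 4096 /* hypothetical upper bound for the sieving buffer */

/* Apply a POSIX advisory lock operation (F_WRLCK or F_UNLCK) to the
 * byte range [start, start + len) of fd, blocking until it is granted.
 * Returns 0 on success, -1 on error. */
static int range_lock(int fd, short type, off_t start, off_t len)
{
    struct flock fl;
    memset(&fl, 0, sizeof fl);
    fl.l_type = type;
    fl.l_whence = SEEK_SET;
    fl.l_start = start;
    fl.l_len = len;
    return fcntl(fd, F_SETLKW, &fl) == -1 ? -1 : 0;
}

/* Data-sieving style write: read the whole block, patch `len` bytes at
 * `off_in_block`, and write the whole block back (gaps included).
 * The block is write-locked for the duration of the sequence. */
static int sieved_write(int fd, off_t block_start, size_t block_len,
                        size_t off_in_block, const void *data, size_t len)
{
    char buf[SIEVE_MAX];
    assert(block_len > 0 && block_len <= SIEVE_MAX);
    assert(off_in_block + len <= block_len);

    if (range_lock(fd, F_WRLCK, block_start, (off_t)block_len) != 0)
        return -1;

    int rc = -1;
    ssize_t n = pread(fd, buf, block_len, block_start);
    if (n >= 0) {
        /* Bytes beyond the current end of file are treated as zero. */
        memset(buf + n, 0, block_len - (size_t)n);
        memcpy(buf + off_in_block, data, len);
        if (pwrite(fd, buf, block_len, block_start) == (ssize_t)block_len)
            rc = 0;
    }
    range_lock(fd, F_UNLCK, block_start, (off_t)block_len);
    return rc;
}
```

A writer that does not use data sieving would take the same lock around its plain `pwrite`, so that it cannot slip in between the read and the write-back of another process.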
@tpadioleau I filed a PR that fixes the issue. I spent quite some time thinking about the issue and the various options, and I am 99% sure that real application scenarios will not hit this problem. Part of the reason why you saw this error is that the data volume is so small that it all ended up in a single file system block, which caused the inconsistency. In a real-life application scenario, where data volumes are not this tiny, I don't think this issue would have occurred. That being said, we still want to fix it, hence the PR.
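The single-block inconsistency can be replayed deterministically without any file system. The following is a toy simulation, not the actual ompio code path: an in-memory array stands in for one file system block, writer A uses data sieving (read the whole block, patch, write the whole block back) and writer B writes directly into a gap that A's buffer covers. All names (`file_block_t`, `replay_unlocked`, `replay_serialized`) are hypothetical.

```c
#include <assert.h>
#include <string.h>

enum { BLOCK = 16 }; /* stand-in for one file system block */

typedef struct { char bytes[BLOCK]; } file_block_t;

/* Plain write of `len` bytes at offset `off`, no read-modify-write. */
static void direct_write(file_block_t *f, size_t off, const char *src, size_t len)
{
    assert(off + len <= BLOCK);
    memcpy(f->bytes + off, src, len);
}

/* Data sieving, step 1: snapshot the whole block into a private buffer. */
static void sieve_read(const file_block_t *f, char buf[BLOCK])
{
    memcpy(buf, f->bytes, BLOCK);
}

/* Data sieving, steps 2 and 3: patch the private buffer, then write the
 * whole buffer back, including the gaps that were only read. */
static void sieve_modify_write(file_block_t *f, char buf[BLOCK],
                               size_t off, const char *src, size_t len)
{
    assert(off + len <= BLOCK);
    memcpy(buf + off, src, len);
    memcpy(f->bytes, buf, BLOCK);
}

/* Unprotected interleaving: B writes between A's read and A's write-back.
 * Returns 1 if B's data is still in the block afterwards, 0 otherwise. */
static int replay_unlocked(void)
{
    file_block_t f;
    char buf[BLOCK];
    memset(f.bytes, '.', BLOCK);
    sieve_read(&f, buf);                       /* A snapshots the block  */
    direct_write(&f, 4, "BBBB", 4);            /* B writes into the gap  */
    sieve_modify_write(&f, buf, 0, "AAAA", 4); /* A writes back old gap  */
    return memcmp(f.bytes + 4, "BBBB", 4) == 0;
}

/* Same operations, but A's read-modify-write completes before B writes,
 * which is the ordering a range lock held by both writers would enforce. */
static int replay_serialized(void)
{
    file_block_t f;
    char buf[BLOCK];
    memset(f.bytes, '.', BLOCK);
    sieve_read(&f, buf);
    sieve_modify_write(&f, buf, 0, "AAAA", 4);
    direct_write(&f, 4, "BBBB", 4);
    return memcmp(f.bytes + 4, "BBBB", 4) == 0;
}
```

In the unprotected replay B's bytes are overwritten by A's stale copy of the gap, while in the serialized replay they survive. With larger data volumes the writers are less likely to share a block, which matches the remark that real applications are unlikely to hit this.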
The fs/ufs component by default disabled all file locking before read/write operations (except on NFS file systems). This was based on the assumption that the file system itself performs the required locking, and hence we don't have to add to it. This assumption is incorrect when using data sieving. In data sieving, the code 'ignores' small gaps when we write to a file and instead performs a read-modify-write sequence itself for performance reasons. The problem, however, is that even within a collective operation not all aggregators might want to use data sieving. Hence, enabling locking just for the data-sieving routines is insufficient; all processes have to perform the locking. Therefore, our two options are: a) disable write data-sieving by default, or b) enable range-locking by default. After some testing, I think enabling range-locking by default is the safer and better approach. It doesn't seem to show any significant performance impact on my test systems. Note that on Lustre file systems we can keep the default at no-locking as far as I can see, since the collective algorithm used by Lustre is unlikely to produce this pattern. I did, however, add an mca parameter that allows us to control the locking algorithm used by the Lustre component as well, in case we need to change that for a particular use-case or platform.
Fixes Issue open-mpi#12718
Signed-off-by: Edgar Gabriel <Edgar.Gabriel@amd.com>
(cherry picked from commit c697f28)
You may be right, I only noticed this issue when developing on my laptop hence with small test cases. Thank you again for your time! |
Background information
What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)
v5.0.3
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Using spack 0.22.1
Please describe the system on which you are running
Details of the problem
I am using parallel HDF5 to write a 2D distributed array. If I pass a cartesian communicator to HDF5, I sometimes notice that the dataset in the HDF5 file is corrupted when using 3 processes. You can find attached (hdf5_reproducer.tar.gz) a small reproducer in C (< 100 LOC) with an HDF5 file I got from running the reproducer. You will also find the result of the ompi_info command.
Without understanding the logic behind it, I also noticed different situations where I seem to never get corrupted data:
MPI_THREAD_MULTIPLE during MPI initialization,
Thank you,
Thomas