forked from open-mpi/ompi
-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
mpich additions for Romio341 refresh #5
Merged
ggouaillardet
merged 10 commits into
ggouaillardet:topic/romio341_refresh
from
markalle:romio341_refresh_mpich_additions
Apr 14, 2021
Merged
mpich additions for Romio341 refresh #5
ggouaillardet
merged 10 commits into
ggouaillardet:topic/romio341_refresh
from
markalle:romio341_refresh_mpich_additions
Apr 14, 2021
+131
−60
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
…ubtype creation > Pulled in from mpich romio, branch "main". > Their commit message is below. > > This is part of a batch of commits from the > following set of PRs: > * pmodels/mpich#4943 > -- darray fix which contains a flatten fix > 73a3eba > c4b5762 > * pmodels/mpich#4995 > -- write strided fix > bbb5210 > * pmodels/mpich#5100 > -- build fix for -Wpedantic > ad0e435 > * pmodels/mpich#5099 > -- build fix, they had let file-system=...gpfs bit rot > e1d42af > 313289a > 83bbb82 > * pmodels/mpich#5150 > -- build fix, configure-related _GNU_SOURCE > a712b56 > 5a036e7 > * pmodels/mpich#5184 > -- build fix, continuation of _GNU_SOURCE fix > d97c4ee When Type_create_darray() uses the block()/cyclic() function to construct a subtype for whatever dimension it's processing, it needs to resize the resulting type before it goes into the type construction as the input for processing the next dimension. The same problem is in the main path's type creation, and in romio's mirroring of it. Gist for a testcase: https://gist.github.com/markalle/940de93d64fd779e304ee124855b8e6a The darray_bug_romio.c testcase creates a darray using * 4 ranks in a 2x2 grid * looking at the type made for rank 0 * inner dimension: 4 ints distributed block over 2 ranks with 2 items each * outer dimension: 6 of the above distributed cyclic over 2 ranks with 2 items each The type created by processing the first dimension should look like this [ x x . . ] And then distributing those for the second dimension becomes [ x x x x ] [ . . . . ] [ . . . . ] [ . . . . ] [ x x x x ] [ . . . . ] Going to the MPI standard to justify why the first layout is right, it's where the definiton of the cyclic() function has a ub_marker of gsize*ex, eg 4 ints for that first type. Signed-off-by: Mark Allen <markalle@us.ibm.com>
> Pulled in from mpich romio, branch "main". > Their commit message is below. > > This is part of a batch of commits from the > following set of PRs: > * pmodels/mpich#4943 > -- darray fix which contains a flatten fix > 73a3eba > c4b5762 > * pmodels/mpich#4995 > -- write strided fix > bbb5210 > * pmodels/mpich#5100 > -- build fix for -Wpedantic > ad0e435 > * pmodels/mpich#5099 > -- build fix, they had let file-system=...gpfs bit rot > e1d42af > 313289a > 83bbb82 > * pmodels/mpich#5150 > -- build fix, configure-related _GNU_SOURCE > a712b56 > 5a036e7 > * pmodels/mpich#5184 > -- build fix, continuation of _GNU_SOURCE fix > d97c4ee ADIOI_Count_contiguous_blocks and ADIOI_Flatten are very similar functions, but they diverge a lot around the edges and in degenerate cases. In Spectrum MPI I spent some time making them consistent with each other but found that to be a losing battle. So the approach I used here is to not have Count() be as definitive, and rather let Flatten() have the last word on what the final flat->count really is. Eg Flatten's *curr_index is the real count. The changes I made are 1. Fix a couple places in Flatten where *curr_index was updated out of sync with what was actually being added to flat->indices[] and flat->blocklens[]. That's one of the important book-keeping rules Flatten should follow. There were a couple places (when counts[] arrays have 0s) where that wasn't the case (see the "nonzeroth" var in this commit). 2. The main change I made was to reset flat->count based on Flatten's curr_index. This is because the divergence between the two functions is too great to reliably fix. 3. A third change is just a safety valve, using a flatlist_node_grow() function just in case the Count function returns a smaller value than Flatten ends up trying to write into the array, and related to this I got rid of the various --(flat->count) lines, since that var now represents the allocated size of the array, until right after the Flatten function when it's reset to represent the data written to the array like it used to be. I don't think we even need ADIOI_Count_contiguous_blocks() anymore, but I didn't remove it. Signed-off-by: Mark Allen <markalle@us.ibm.com>
> Pulled in from mpich romio, branch "main". > Their commit message is below. > > This is part of a batch of commits from the > following set of PRs: > * pmodels/mpich#4943 > -- darray fix which contains a flatten fix > 73a3eba > c4b5762 > * pmodels/mpich#4995 > -- write strided fix > bbb5210 > * pmodels/mpich#5100 > -- build fix for -Wpedantic > ad0e435 > * pmodels/mpich#5099 > -- build fix, they had let file-system=...gpfs bit rot > e1d42af > 313289a > 83bbb82 > * pmodels/mpich#5150 > -- build fix, configure-related _GNU_SOURCE > a712b56 > 5a036e7 > * pmodels/mpich#5184 > -- build fix, continuation of _GNU_SOURCE fix > d97c4ee The ADIOI_GEN_WriteStrided funcion uses data sieving on non-contiguous types. That is, if it wants to write data at locations [W...X...Y...Z...] it reads the whole buffer [dddddddddddddddd] changes the locations it wants to write to [WdddXdddYdddZddd] then writes the whole thing back. It uses locks to make this safe, but the problem is this only protects against other parts of the product that are using locks. And without this PR a peer who is simultaneously making a simple non-contiguous write wouldn't have locked. A testcase to demonstrate the original problem is here: https://gist.github.com/markalle/d7da240c19e57f095c5d1b13240dae24 % mpicc -o x romio_write_timing.c % mpirun -np 4 ./x Note: you need to use a filesystem that uses ADIOI_GEN_WriteStrided to hit the bug. I was using GPFS. This commit is pulled from wkliao after discussing where to put the new lock. It adds locks to contiguous writes in independent write functions when data sieving write is not disabled Signed-off-by: Mark Allen <markalle@us.ibm.com>
…ments > Pulled in from mpich romio, branch "main". > Their commit message is below. > > This is part of a batch of commits from the > following set of PRs: > * pmodels/mpich#4943 > -- darray fix which contains a flatten fix > 73a3eba > c4b5762 > * pmodels/mpich#4995 > -- write strided fix > bbb5210 > * pmodels/mpich#5100 > -- build fix for -Wpedantic > ad0e435 > * pmodels/mpich#5099 > -- build fix, they had let file-system=...gpfs bit rot > e1d42af > 313289a > 83bbb82 > * pmodels/mpich#5150 > -- build fix, configure-related _GNU_SOURCE > a712b56 > 5a036e7 > * pmodels/mpich#5184 > -- build fix, continuation of _GNU_SOURCE fix > d97c4ee typecasting a (void*) due to compiler complaint when -Wpedantic is used Signed-off-by: Mark Allen <markalle@us.ibm.com>
> Pulled in from mpich romio, branch "main". > Their commit message is below. > > This is part of a batch of commits from the > following set of PRs: > * pmodels/mpich#4943 > -- darray fix which contains a flatten fix > 73a3eba > c4b5762 > * pmodels/mpich#4995 > -- write strided fix > bbb5210 > * pmodels/mpich#5100 > -- build fix for -Wpedantic > ad0e435 > * pmodels/mpich#5099 > -- build fix, they had let file-system=...gpfs bit rot > e1d42af > 313289a > 83bbb82 > * pmodels/mpich#5150 > -- build fix, configure-related _GNU_SOURCE > a712b56 > 5a036e7 > * pmodels/mpich#5184 > -- build fix, continuation of _GNU_SOURCE fix > d97c4ee Configuring mpich with --with-file-system=nfs+ufs+gpfs it wouldn't build because the last arg of ADIOI_OneSidedWriteAggregation() changed, so this PR adds an & to it Signed-off-by: Mark Allen <markalle@us.ibm.com>
> Pulled in from mpich romio, branch "main". > Their commit message is below. > > This is part of a batch of commits from the > following set of PRs: > * pmodels/mpich#4943 > -- darray fix which contains a flatten fix > 73a3eba > c4b5762 > * pmodels/mpich#4995 > -- write strided fix > bbb5210 > * pmodels/mpich#5100 > -- build fix for -Wpedantic > ad0e435 > * pmodels/mpich#5099 > -- build fix, they had let file-system=...gpfs bit rot > e1d42af > 313289a > 83bbb82 > * pmodels/mpich#5150 > -- build fix, configure-related _GNU_SOURCE > a712b56 > 5a036e7 > * pmodels/mpich#5184 > -- build fix, continuation of _GNU_SOURCE fix > d97c4ee Configuring mpich with --with-file-system=nfs+ufs+gpfs it wouldn't run because there are a couple free() calls that reported null pointers, so I changed a couple places to "if (x) { ADIOI_Free(x); }" Signed-off-by: Mark Allen <markalle@us.ibm.com>
> Pulled in from mpich romio, branch "main". > Their commit message is below. > > This is part of a batch of commits from the > following set of PRs: > * pmodels/mpich#4943 > -- darray fix which contains a flatten fix > 73a3eba > c4b5762 > * pmodels/mpich#4995 > -- write strided fix > bbb5210 > * pmodels/mpich#5100 > -- build fix for -Wpedantic > ad0e435 > * pmodels/mpich#5099 > -- build fix, they had let file-system=...gpfs bit rot > e1d42af > 313289a > 83bbb82 > * pmodels/mpich#5150 > -- build fix, configure-related _GNU_SOURCE > a712b56 > 5a036e7 > * pmodels/mpich#5184 > -- build fix, continuation of _GNU_SOURCE fix > d97c4ee Romio GPFS had a runtime problem due to a .mem_ptrs field that was uninitialized that it was potentially trying to free. Signed-off-by: Mark Allen <markalle@us.ibm.com>
> Pulled in from mpich romio, branch "main". > Their commit message is below. > > This is part of a batch of commits from the > following set of PRs: > * pmodels/mpich#4943 > -- darray fix which contains a flatten fix > 73a3eba > c4b5762 > * pmodels/mpich#4995 > -- write strided fix > bbb5210 > * pmodels/mpich#5100 > -- build fix for -Wpedantic > ad0e435 > * pmodels/mpich#5099 > -- build fix, they had let file-system=...gpfs bit rot > e1d42af > 313289a > 83bbb82 > * pmodels/mpich#5150 > -- build fix, configure-related _GNU_SOURCE > a712b56 > 5a036e7 > * pmodels/mpich#5184 > -- build fix, continuation of _GNU_SOURCE fix > d97c4ee adio.h must be included before any system headers so that config settings are taken into account. Fixes an issue where aligned_alloc may be used without a declaration. See similar fix in [4fcffbe695cb]. Signed-off-by: Mark Allen <markalle@us.ibm.com>
> Pulled in from mpich romio, branch "main". > Their commit message is below. > > This is part of a batch of commits from the > following set of PRs: > * pmodels/mpich#4943 > -- darray fix which contains a flatten fix > 73a3eba > c4b5762 > * pmodels/mpich#4995 > -- write strided fix > bbb5210 > * pmodels/mpich#5100 > -- build fix for -Wpedantic > ad0e435 > * pmodels/mpich#5099 > -- build fix, they had let file-system=...gpfs bit rot > e1d42af > 313289a > 83bbb82 > * pmodels/mpich#5150 > -- build fix, configure-related _GNU_SOURCE > a712b56 > 5a036e7 > * pmodels/mpich#5184 > -- build fix, continuation of _GNU_SOURCE fix > d97c4ee mpl.h must be included before any system headers so that config settings are taken into account. Fixes an issue where aligned_alloc may be used without a declaration. Signed-off-by: Mark Allen <markalle@us.ibm.com>
> Pulled in from mpich romio, branch "main". > Their commit message is below. > > This is part of a batch of commits from the > following set of PRs: > * pmodels/mpich#4943 > -- darray fix which contains a flatten fix > 73a3eba > c4b5762 > * pmodels/mpich#4995 > -- write strided fix > bbb5210 > * pmodels/mpich#5100 > -- build fix for -Wpedantic > ad0e435 > * pmodels/mpich#5099 > -- build fix, they had let file-system=...gpfs bit rot > e1d42af > 313289a > 83bbb82 > * pmodels/mpich#5150 > -- build fix, configure-related _GNU_SOURCE > a712b56 > 5a036e7 > * pmodels/mpich#5184 > -- build fix, continuation of _GNU_SOURCE fix > d97c4ee There was a problem with romio's use of MPL and the include order sometimes having system files before the config files generated by configure. So for example at configure time _GNU_SOURCE was used and thus configure concluded that aligned_memory() was already declared, but then later in romio sometimes stdlib.h or similar sys includes were coming before mplconfig.h and maybe even before romioconf.h. My understanding of the design described by hzhou is that universlaly putting the config files at the top is the best solution to this ordering. All over the romio code they seem to use adio.h as their top level include, which has romioconf.h as its first include. So in this PR I'm moved adio.h to the top of a few more files. And I also made romioconf.h include mplconfig.h as its first include. Otherwise what we have for romio's use of MPL is adio.h: includes romioconf.h <-- we're relying on _GNU_SOURCE being here too includes many things including system files includes adioi.h includes mpl.h includes mplconfig.h with _GNU_SOURCE so having _GNU_SOURCE down in mplconfig.h doesn't really ensure it's in front of the system files unless we can rely on it being there in romioconf.h (Maybe we can? I'm not sure). Anyway as long as the top level romioconf.h knows it's consuming MPL it can put mplconfig.h at its top and then all the rest of the romio code only needs to follow the rule of putting adio.h first and then both romio and mpl will get their config settings included at the top Signed-off-by: Mark Allen <markalle@us.ibm.com>
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
This is a batch of romio fixes from OMPI that we've contributed back to mpich main. It's from the time period after we stopped fixing things in our copy of romio and instead started fixing them over in actual mpich romio.
At the time of this PR the below fixes from us have all been accepted into mpich main, but haven't yet appeared in numbered releases like romio 3.4.1.
This is meant to be an add-on to the romio 3.4.1 refresh by @ggouaillardet
open-mpi#8740