
SST metadata aggregation does not scale above 2GB (PIConGPU: more than 7k nodes on Frontier) #3846

Open
franzpoeschel opened this issue Oct 17, 2023 · 13 comments

Comments

@franzpoeschel
Contributor

franzpoeschel commented Oct 17, 2023

Describe the bug
CP_consolidateDataToRankZero() in source/adios2/toolkit/sst/cp/cp_common.c gathers the metadata to rank 0 upon EndStep. In PIConGPU, a single rank's contribution is ~38948 bytes.

On 7000 Frontier nodes with 8 GPUs per node: 38948 B * 7000 * 8 ≈ 2080 MB

Looking into CP_consolidateDataToRankZero():

    if (Stream->Rank == 0)
    {
        int TotalLen = 0;
        Displs = malloc(Stream->CohortSize * sizeof(*Displs));

        Displs[0] = 0;
        TotalLen = (RecvCounts[0] + 7) & ~7;

        for (int i = 1; i < Stream->CohortSize; i++)
        {
            int RoundUp = (RecvCounts[i] + 7) & ~7;
            Displs[i] = TotalLen;
            TotalLen += RoundUp;
        }

        RecvBuffer = malloc(TotalLen * sizeof(char));
    }

    /*
     * Now we have the receive buffer, counts, and displacements, and
     * can gather the data
     */

    SMPI_Gatherv(Buffer, DataSize, SMPI_CHAR, RecvBuffer, RecvCounts, Displs, SMPI_CHAR, 0,
                 Stream->mpiComm);

Since Displs is a vector of int, the maximum supported destination buffer size for this method is 2 GB.
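
For concreteness, plugging the numbers from above into a quick standalone check (not ADIOS2 code) shows the total crossing INT_MAX:

#include <limits.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    const int64_t PerRank = 38948;  /* bytes of SST metadata per rank (PIConGPU) */
    const int64_t Ranks = 7000 * 8; /* 7000 Frontier nodes, 8 GPUs each */
    const int64_t Total = PerRank * Ranks;
    printf("total = %lld bytes, INT_MAX = %d\n", (long long)Total, INT_MAX);
    /* prints: total = 2181088000 bytes, INT_MAX = 2147483647 */
    return 0;
}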

To Reproduce
-- no reproducer --

Expected behavior
Some method to handle SST metadata aggregation at large scale


Additional context
I'm setting MarshalMethod = bp5 in SST


@eisenhauer
Member

No worries. Likely we just need to replicate BP5-file-engine-style techniques in SST.

@franzpoeschel
Contributor Author

No worries. Likely we just need to replicate BP5-file-engine-style techniques in SST.

Hey Greg, thank you for the fast reply. Is this something that can already be tested today by setting some hidden flag?
Otherwise, I might split the offending SMPI_Gatherv call into multiple smaller calls as a workaround for now.
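
For the record, a rough sketch of what such a split might look like (my own illustration, not an actual ADIOS2 patch; GatherMetadataBatched is a hypothetical helper, it assumes Buffer/DataSize hold this rank's serialized metadata, and it omits the 8-byte round-up that the real code applies):

#include <limits.h>
#include <mpi.h>
#include <stdlib.h>

/* Gather each rank's metadata to rank 0 in several Gatherv calls, each
 * covering a contiguous batch of ranks whose combined size fits into an
 * int, so the int-typed counts/displacements never overflow. */
char *GatherMetadataBatched(MPI_Comm Comm, char *Buffer, int DataSize,
                            size_t *TotalLenOut)
{
    int Rank, Size;
    MPI_Comm_rank(Comm, &Rank);
    MPI_Comm_size(Comm, &Size);

    /* Every rank learns every size so that all ranks agree on the batching. */
    int *Counts = malloc(Size * sizeof(int));
    MPI_Allgather(&DataSize, 1, MPI_INT, Counts, 1, MPI_INT, Comm);

    size_t TotalLen = 0;
    for (int i = 0; i < Size; i++)
        TotalLen += (size_t)Counts[i];

    char *RecvBuffer = (Rank == 0) ? malloc(TotalLen) : NULL;
    size_t BatchBase = 0; /* 64-bit offset of the current batch in RecvBuffer */
    int BatchStart = 0;

    while (BatchStart < Size)
    {
        /* Extend the batch while its total still fits into an int. */
        int BatchEnd = BatchStart;
        size_t BatchLen = 0;
        while (BatchEnd < Size &&
               BatchLen + (size_t)Counts[BatchEnd] <= (size_t)INT_MAX)
            BatchLen += (size_t)Counts[BatchEnd++];

        int *BatchCounts = calloc(Size, sizeof(int));
        int *BatchDispls = calloc(Size, sizeof(int));
        int Offset = 0;
        for (int i = BatchStart; i < BatchEnd; i++)
        {
            BatchCounts[i] = Counts[i];
            BatchDispls[i] = Offset;
            Offset += Counts[i];
        }

        /* Ranks outside the current batch contribute zero bytes. */
        int MyCount = (Rank >= BatchStart && Rank < BatchEnd) ? DataSize : 0;
        MPI_Gatherv(Buffer, MyCount, MPI_CHAR,
                    (Rank == 0) ? RecvBuffer + BatchBase : NULL, BatchCounts,
                    BatchDispls, MPI_CHAR, 0, Comm);

        free(BatchCounts);
        free(BatchDispls);
        BatchBase += BatchLen;
        BatchStart = BatchEnd;
    }

    free(Counts);
    if (TotalLenOut)
        *TotalLenOut = TotalLen;
    return RecvBuffer; /* non-NULL only on rank 0 */
}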

@franzpoeschel
Contributor Author

Also, does it make a difference that I'm using branch #3588 on Frontier? (I need that branch for a scalability fix of the MPI DP)

@eisenhauer
Member

No worries. Likely we just need to replicate BP5-file-engine-style techniques in SST.

Hey Greg, thank you for the fast reply. Is this something that can already be tested today by setting some hidden flag? Otherwise, I might split the offending SMPI_Gatherv call into multiple smaller calls as a workaround for now.

Unfortunately no, not yet. In BP5Writer.cpp there's code that starts with the comment "Two-step metadata aggregation" that implements this for BP5, but it hasn't been done yet for SST. Here we're exploiting some characteristics of BP5 metadata: in particular, many times multiple ranks have identical meta-metadata, and we can discard the duplicates, keeping only one unique copy. This reduces the overall metadata size dramatically, at the cost of having to do aggregation in multiple stages. Norbert implemented a fix for this in the BP5 writer, but it should probably be reworked so that it can be shared between engines that use BP5 serialization. Doing that right (so that we use a simple approach at small scale and only go to more complex measures when necessary) isn't wildly hard, but it's non-trivial (and something I probably can't get to this week).
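
To illustrate the idea, a heavily simplified sketch of deduplicating identical meta-metadata before gathering (my own illustration, not the BP5Writer.cpp code; GatherUniqueMetaMeta and Hash64 are hypothetical stand-ins, and Block/Len are assumed to hold this rank's serialized meta-metadata):

#include <mpi.h>
#include <stdint.h>
#include <stdlib.h>

/* FNV-1a, purely illustrative. */
static uint64_t Hash64(const void *Data, size_t Len)
{
    const unsigned char *p = Data;
    uint64_t h = 1469598103934665603ULL;
    for (size_t i = 0; i < Len; i++)
    {
        h ^= p[i];
        h *= 1099511628211ULL;
    }
    return h;
}

/* Gather only one copy of each distinct meta-metadata block to rank 0. */
void GatherUniqueMetaMeta(MPI_Comm Comm, char *Block, int Len)
{
    int Rank, Size;
    MPI_Comm_rank(Comm, &Rank);
    MPI_Comm_size(Comm, &Size);

    /* Step 1: gather the cheap 8-byte hashes instead of the full blocks. */
    uint64_t MyHash = Hash64(Block, (size_t)Len);
    uint64_t *Hashes = (Rank == 0) ? malloc(Size * sizeof(uint64_t)) : NULL;
    MPI_Gather(&MyHash, 1, MPI_UINT64_T, Hashes, 1, MPI_UINT64_T, 0, Comm);

    /* Step 2: rank 0 keeps one representative rank per distinct hash
     * (linear scan for clarity; a hash table would be used in practice). */
    int *Wanted = (Rank == 0) ? malloc(Size * sizeof(int)) : NULL;
    if (Rank == 0)
    {
        for (int i = 0; i < Size; i++)
        {
            Wanted[i] = 1;
            for (int j = 0; j < i; j++)
                if (Hashes[j] == Hashes[i])
                {
                    Wanted[i] = 0;
                    break;
                }
        }
    }
    int SendFull = 0;
    MPI_Scatter(Wanted, 1, MPI_INT, &SendFull, 1, MPI_INT, 0, Comm);

    /* Step 3: an ordinary Gatherv in which duplicates contribute 0 bytes,
     * so the gathered size shrinks dramatically when most ranks carry
     * identical meta-metadata. */
    int MyCount = SendFull ? Len : 0;
    int *RecvCounts = (Rank == 0) ? malloc(Size * sizeof(int)) : NULL;
    MPI_Gather(&MyCount, 1, MPI_INT, RecvCounts, 1, MPI_INT, 0, Comm);

    int *Displs = NULL;
    char *RecvBuffer = NULL;
    if (Rank == 0)
    {
        Displs = malloc(Size * sizeof(int));
        int Total = 0;
        for (int i = 0; i < Size; i++)
        {
            Displs[i] = Total;
            Total += RecvCounts[i];
        }
        RecvBuffer = malloc(Total > 0 ? Total : 1);
    }
    MPI_Gatherv(Block, MyCount, MPI_CHAR, RecvBuffer, RecvCounts, Displs,
                MPI_CHAR, 0, Comm);

    /* Rank 0 now holds one copy of each distinct block in RecvBuffer. */
    free(Hashes);
    free(Wanted);
    free(RecvCounts);
    free(Displs);
    free(RecvBuffer);
}

The point is only that the expensive gather sees one copy per distinct block; the real two-step scheme in BP5Writer.cpp is more elaborate than this.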

@eisenhauer
Member

Also, does it make a difference that I'm using branch #3588 on Frontier? (I need that branch for a scalability fix of the MPI DP)

No, this should be independent of those changes.

@franzpoeschel
Contributor Author

franzpoeschel commented Oct 17, 2023

In the meantime, I'll try whether using this as a workaround helps. It should fix the Gatherv call at the cost of slightly higher latency, but I don't know whether there is any 32-bit indexing going on later that will break things again.

@eisenhauer
Member

In the meantime, I'll try whether using this as a workaround helps

I'd think that would function as a workaround. As far as I know there's no 32-bit indexing, only the limits of MPI. Longer-term I'd like to implement something smarter, but if this gets you through, let me know.

@pnorbert
Contributor

pnorbert commented Oct 17, 2023 via email

I can't look at this right now, but note that the two-level aggregation did not help with the attributes, only with the meta-meta data. That is, if an attribute is defined on all processes, that blows up the aggregation size. If that is the reason you reach the limit, two-level aggregation does not decrease it.

@eisenhauer
Member

I can't look at this right now, but note that the two-level aggregation did not help with the attributes, only with the meta-meta data. That is, if an attribute is defined on all processes, that blows up the aggregation size. If that is the reason you reach the limit, two-level aggregation does not decrease it.

That is absolutely true...

@franzpoeschel
Contributor Author

but if this gets you through, let me know.

The job now ran through without crashing at 7168 nodes. I'll now try going full scale.

I can't look at this right now, but note that the two-level aggregation did not help with the attributes, only with the meta-meta data. That is, if an attribute is defined on all processes, that blows up the aggregation size. If that is the reason you reach the limit, two-level aggregation does not decrease it.

We were at some point thinking about optimizing parallel attribute writes, e.g. by just disabling them on every rank but rank 0. It looks like we should do this. (Even though that would only push the 2 GB limit out a bit further, the workaround that I'm now using avoids it entirely.)

@franzpoeschel
Contributor Author

Update: I've successfully run SST full-scale for the first time on Frontier with this (9126 nodes, i.e. quasi full-scale)

@eisenhauer
Member

Update: I've successfully run SST full-scale for the first time on Frontier with this (9126 nodes, i.e. quasi full-scale)

Excellent... Adding an issue (#3852) to address these things across the board.

@eisenhauer
Member

but if this gets you through, let me know.

The job now ran through without crashing at 7168 nodes. I'll now try going full scale.
Great

We were at some point thinking about optimizing parallel attribute writes, e.g. by just disabling them on every rank but rank 0. It looks like we should do this. (Even though that would only push the 2 GB limit out a bit further, the workaround that I'm now using avoids it entirely.)

Yes, you should absolutely do this. At least currently, all attributes from all ranks are stored and installed by the reader, with duplicates doing nothing. Setting the same attributes on all nodes just adds overhead.
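
A minimal sketch of that guard, written against the ADIOS2 C bindings as far as I recall them (the attribute name and value are made up):

#include <adios2_c.h>
#include <mpi.h>

/* Define attributes that are identical on every rank from rank 0 only, so
 * they show up once in the aggregated metadata instead of once per rank. */
void DefineGlobalAttributes(adios2_io *io, MPI_Comm comm)
{
    int rank;
    MPI_Comm_rank(comm, &rank);
    if (rank != 0)
        return; /* all other ranks simply skip the definition */

    const double gridSpacing = 0.1; /* illustrative value */
    adios2_define_attribute(io, "gridSpacing", adios2_type_double,
                            &gridSpacing);
}

The openPMD-api commits referenced below appear to implement the same idea one layer up, as an option to write attributes only from given ranks.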

franzpoeschel added a commit to franzpoeschel/openPMD-api that referenced this issue Oct 19, 2023
franzpoeschel added a commit to franzpoeschel/openPMD-api that referenced this issue Nov 16, 2023
franzpoeschel added a commit to franzpoeschel/openPMD-api that referenced this issue Nov 20, 2023
ax3l pushed a commit to openPMD/openPMD-api that referenced this issue Dec 5, 2023
* ADIOS2: Optionally write attributes only from given ranks

Ref. ornladios/ADIOS2#3846 (comment)

* ADIOS2 < v2.9 compatibility in tests
* Documentation