
SST metadata aggregation does not scale above 2GB (PIConGPU: more than 7k nodes on Frontier) #3846

Open
franzpoeschel opened this issue Oct 17, 2023 · 13 comments

Comments

@franzpoeschel
Contributor

franzpoeschel commented Oct 17, 2023

Describe the bug
CP_consolidateDataToRankZero() in source/adios2/toolkit/sst/cp/cp_common.c gathers the metadata to rank 0 upon EndStep. In PIConGPU, a single rank's contribution is ~38948 bytes.

On 7000 Frontier nodes with 8 GPUs per node: 38948 B * 7000 * 8 ≈ 2080 MB

Looking into CP_consolidateDataToRankZero():

    if (Stream->Rank == 0)
    {
        int TotalLen = 0;
        Displs = malloc(Stream->CohortSize * sizeof(*Displs));

        Displs[0] = 0;
        TotalLen = (RecvCounts[0] + 7) & ~7;

        for (int i = 1; i < Stream->CohortSize; i++)
        {
            int RoundUp = (RecvCounts[i] + 7) & ~7;
            Displs[i] = TotalLen;
            TotalLen += RoundUp;
        }

        RecvBuffer = malloc(TotalLen * sizeof(char));
    }

    /*
     * Now we have the receive buffer, counts, and displacements, and
     * can gather the data
     */

    SMPI_Gatherv(Buffer, DataSize, SMPI_CHAR, RecvBuffer, RecvCounts, Displs, SMPI_CHAR, 0,
                 Stream->mpiComm);

Since Displs is a vector of int, the maximum supported destination buffer size for this method is 2 GB.
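
For concreteness, plugging the numbers from above into a quick standalone check (not ADIOS2 code) shows the total crossing INT_MAX:

#include <limits.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    const int64_t PerRank = 38948;  /* bytes of SST metadata per rank (PIConGPU) */
    const int64_t Ranks = 7000 * 8; /* 7000 Frontier nodes, 8 GPUs each */
    const int64_t Total = PerRank * Ranks;
    printf("total = %lld bytes, INT_MAX = %d\n", (long long)Total, INT_MAX);
    /* prints: total = 2181088000 bytes, INT_MAX = 2147483647 */
    return 0;
}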

To Reproduce
-- no reproducer --

Expected behavior
Some method to handle SST metadata aggregation at large scale


Additional context
I'm setting MarshalMethod = bp5 in SST


@eisenhauer
Member

No worries. Likely we just need to replicate BP5-file-engine-style techniques in SST.

@franzpoeschel
Contributor Author

No worries. Likely we just need to replicate BP5-file-engine-style techniques in SST.

Hey Greg, thank you for the fast reply. Is this something that can already be tested today by setting some hidden flag?
Otherwise, I might split the offending SMPI_Gatherv call into multiple smaller calls as a workaround for now.
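
For the record, a rough sketch of what such a split might look like (my own illustration, not an actual ADIOS2 patch; GatherMetadataBatched is a hypothetical helper, it assumes Buffer/DataSize hold this rank's serialized metadata, and it omits the 8-byte round-up that the real code applies):

#include <limits.h>
#include <mpi.h>
#include <stdlib.h>

/* Gather each rank's metadata to rank 0 in several Gatherv calls, each
 * covering a contiguous batch of ranks whose combined size fits into an
 * int, so the int-typed counts/displacements never overflow. */
char *GatherMetadataBatched(MPI_Comm Comm, char *Buffer, int DataSize,
                            size_t *TotalLenOut)
{
    int Rank, Size;
    MPI_Comm_rank(Comm, &Rank);
    MPI_Comm_size(Comm, &Size);

    /* Every rank learns every size so that all ranks agree on the batching. */
    int *Counts = malloc(Size * sizeof(int));
    MPI_Allgather(&DataSize, 1, MPI_INT, Counts, 1, MPI_INT, Comm);

    size_t TotalLen = 0;
    for (int i = 0; i < Size; i++)
        TotalLen += (size_t)Counts[i];

    char *RecvBuffer = (Rank == 0) ? malloc(TotalLen) : NULL;
    size_t BatchBase = 0; /* 64-bit offset of the current batch in RecvBuffer */
    int BatchStart = 0;

    while (BatchStart < Size)
    {
        /* Extend the batch while its total still fits into an int. */
        int BatchEnd = BatchStart;
        size_t BatchLen = 0;
        while (BatchEnd < Size &&
               BatchLen + (size_t)Counts[BatchEnd] <= (size_t)INT_MAX)
            BatchLen += (size_t)Counts[BatchEnd++];

        int *BatchCounts = calloc(Size, sizeof(int));
        int *BatchDispls = calloc(Size, sizeof(int));
        int Offset = 0;
        for (int i = BatchStart; i < BatchEnd; i++)
        {
            BatchCounts[i] = Counts[i];
            BatchDispls[i] = Offset;
            Offset += Counts[i];
        }

        /* Ranks outside the current batch contribute zero bytes. */
        int MyCount = (Rank >= BatchStart && Rank < BatchEnd) ? DataSize : 0;
        MPI_Gatherv(Buffer, MyCount, MPI_CHAR,
                    (Rank == 0) ? RecvBuffer + BatchBase : NULL, BatchCounts,
                    BatchDispls, MPI_CHAR, 0, Comm);

        free(BatchCounts);
        free(BatchDispls);
        BatchBase += BatchLen;
        BatchStart = BatchEnd;
    }

    free(Counts);
    if (TotalLenOut)
        *TotalLenOut = TotalLen;
    return RecvBuffer; /* non-NULL only on rank 0 */
}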

@franzpoeschel
Contributor Author

Also, does it make a difference that I'm using branch #3588 on Frontier? (I need that branch for a scalability fix of the MPI DP)

@eisenhauer
Member

No worries. Likely we just need to replicate BP5-file-engine-style techniques in SST.

Hey Greg, thank you for the fast reply. Is this something that can already be tested today by setting some hidden flag? Otherwise, I might split the offending SMPI_Gatherv call into multiple smaller calls as a workaround for now.

Unfortunately no, not yet. In BP5Writer.cpp there's code that starts with the comment "Two-step metadata aggregation" that implements this for BP5, but it hasn't been done yet for SST. Here we're exploiting some characteristics of BP5 metadata: in particular, many times multiple ranks have identical meta-metadata, and we can discard the duplicates, keeping only one unique copy. This reduces the overall metadata size dramatically, at the cost of having to do aggregation in multiple stages. Norbert implemented a fix for this in the BP5 writer, but it should probably be reworked so that it can be shared between engines that use BP5 serialization. Doing that right (so that we use a simple approach at small scale and only go to more complex measures when necessary) isn't wildly hard, but it's non-trivial (and something I probably can't get to this week).
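
To illustrate the idea, a heavily simplified sketch of deduplicating identical meta-metadata before gathering (my own illustration, not the BP5Writer.cpp code; GatherUniqueMetaMeta and Hash64 are hypothetical stand-ins, and Block/Len are assumed to hold this rank's serialized meta-metadata):

#include <mpi.h>
#include <stdint.h>
#include <stdlib.h>

/* FNV-1a, purely illustrative. */
static uint64_t Hash64(const void *Data, size_t Len)
{
    const unsigned char *p = Data;
    uint64_t h = 1469598103934665603ULL;
    for (size_t i = 0; i < Len; i++)
    {
        h ^= p[i];
        h *= 1099511628211ULL;
    }
    return h;
}

/* Gather only one copy of each distinct meta-metadata block to rank 0. */
void GatherUniqueMetaMeta(MPI_Comm Comm, char *Block, int Len)
{
    int Rank, Size;
    MPI_Comm_rank(Comm, &Rank);
    MPI_Comm_size(Comm, &Size);

    /* Step 1: gather the cheap 8-byte hashes instead of the full blocks. */
    uint64_t MyHash = Hash64(Block, (size_t)Len);
    uint64_t *Hashes = (Rank == 0) ? malloc(Size * sizeof(uint64_t)) : NULL;
    MPI_Gather(&MyHash, 1, MPI_UINT64_T, Hashes, 1, MPI_UINT64_T, 0, Comm);

    /* Step 2: rank 0 keeps one representative rank per distinct hash
     * (linear scan for clarity; a hash table would be used in practice). */
    int *Wanted = (Rank == 0) ? malloc(Size * sizeof(int)) : NULL;
    if (Rank == 0)
    {
        for (int i = 0; i < Size; i++)
        {
            Wanted[i] = 1;
            for (int j = 0; j < i; j++)
                if (Hashes[j] == Hashes[i])
                {
                    Wanted[i] = 0;
                    break;
                }
        }
    }
    int SendFull = 0;
    MPI_Scatter(Wanted, 1, MPI_INT, &SendFull, 1, MPI_INT, 0, Comm);

    /* Step 3: an ordinary Gatherv in which duplicates contribute 0 bytes,
     * so the gathered size shrinks dramatically when most ranks carry
     * identical meta-metadata. */
    int MyCount = SendFull ? Len : 0;
    int *RecvCounts = (Rank == 0) ? malloc(Size * sizeof(int)) : NULL;
    MPI_Gather(&MyCount, 1, MPI_INT, RecvCounts, 1, MPI_INT, 0, Comm);

    int *Displs = NULL;
    char *RecvBuffer = NULL;
    if (Rank == 0)
    {
        Displs = malloc(Size * sizeof(int));
        int Total = 0;
        for (int i = 0; i < Size; i++)
        {
            Displs[i] = Total;
            Total += RecvCounts[i];
        }
        RecvBuffer = malloc(Total > 0 ? Total : 1);
    }
    MPI_Gatherv(Block, MyCount, MPI_CHAR, RecvBuffer, RecvCounts, Displs,
                MPI_CHAR, 0, Comm);

    /* Rank 0 now holds one copy of each distinct block in RecvBuffer. */
    free(Hashes);
    free(Wanted);
    free(RecvCounts);
    free(Displs);
    free(RecvBuffer);
}

The point is only that the expensive gather sees one copy per distinct block; the real two-step scheme in BP5Writer.cpp is more elaborate than this.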

@eisenhauer
Member

Also, does it make a difference that I'm using branch #3588 on Frontier? (I need that branch for a scalability fix of the MPI DP)

No, this should be independent of those changes.

@franzpoeschel
Contributor Author

franzpoeschel commented Oct 17, 2023

In the meantime, I'll try whether using this as a workaround helps. It should fix the Gatherv call at the cost of slightly higher latency, but I don't know whether there is any 32-bit indexing going on later that will break things again.

@eisenhauer
Member

In the meantime, I'll try whether using this as a workaround helps

I'd think that would function as a workaround. As far as I know there's no 32-bit indexing, only the limits of MPI. Longer-term I'd like to implement something smarter, but if this gets you through, let me know.

@pnorbert
Contributor

pnorbert commented Oct 17, 2023 via email

I can't look at this right now, but note that the two-level aggregation did not help with the attributes, only with the meta-meta data. That is, if an attribute is defined on all processes, that blows up the aggregation size. If that is the reason you reach the limit, two-level aggregation does not decrease it.

@eisenhauer
Member

I can't look at this right now, but note that the two-level aggregation did not help with the attributes, only with the meta-meta data. That is, if an attribute is defined on all processes, that blows up the aggregation size. If that is the reason you reach the limit, two-level aggregation does not decrease it.

That is absolutely true...

@franzpoeschel
Contributor Author

but if this gets you through, let me know.

The job now ran through without crashing at 7168 nodes. I'll now try going full scale.

I can't look at this right now, but note that the two-level aggregation did not help with the attributes, only with the meta-meta data. That is, if an attribute is defined on all processes, that blows up the aggregation size. If that is the reason you reach the limit, two-level aggregation does not decrease it.

We were at some point thinking about optimizing parallel attribute writes, e.g. by just disabling them on every rank but rank 0. It looks like we should do this. (Even though that would only push the 2 GB limit out a bit further, the workaround that I'm now using avoids it entirely.)

@franzpoeschel
Contributor Author

Update: I've successfully run SST full-scale for the first time on Frontier with this (9126 nodes, i.e. quasi full-scale)

@eisenhauer
Member

Update: I've successfully run SST full-scale for the first time on Frontier with this (9126 nodes, i.e. quasi full-scale)

Excellent... Adding an issue (#3852) to address these things across the board.

@eisenhauer
Member

but if this gets you through, let me know.

The job now ran through without crashing at 7168 nodes. I'll now try going full scale.
Great

We were at some point thinking about optimizing parallel attribute writes, e.g. by just disabling them on every rank but rank 0. It looks like we should do this. (Even though that would only push the 2 GB limit out a bit further, the workaround that I'm now using avoids it entirely.)

Yes, you should absolutely do this. At least currently, all attributes from all ranks are stored and installed by the reader, with duplicates doing nothing. Setting the same attributes on all nodes just adds overhead.
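
A minimal sketch of that guard, written against the ADIOS2 C bindings as far as I recall them (the attribute name and value are made up):

#include <adios2_c.h>
#include <mpi.h>

/* Define attributes that are identical on every rank from rank 0 only, so
 * they show up once in the aggregated metadata instead of once per rank. */
void DefineGlobalAttributes(adios2_io *io, MPI_Comm comm)
{
    int rank;
    MPI_Comm_rank(comm, &rank);
    if (rank != 0)
        return; /* all other ranks simply skip the definition */

    const double gridSpacing = 0.1; /* illustrative value */
    adios2_define_attribute(io, "gridSpacing", adios2_type_double,
                            &gridSpacing);
}

The openPMD-api commits referenced below appear to implement the same idea one layer up, as an option to write attributes only from given ranks.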

franzpoeschel added a commit to franzpoeschel/openPMD-api that referenced this issue Oct 19, 2023
franzpoeschel added a commit to franzpoeschel/openPMD-api that referenced this issue Nov 16, 2023
franzpoeschel added a commit to franzpoeschel/openPMD-api that referenced this issue Nov 20, 2023
ax3l pushed a commit to openPMD/openPMD-api that referenced this issue Dec 5, 2023
* ADIOS2: Optionally write attributes only from given ranks

Ref. ornladios/ADIOS2#3846 (comment)

* ADIOS2 < v2.9 compatibility in tests
* Documentation