
Aggregation Service scaling needs #62

Open
idtoufik opened this issue Jun 27, 2024 · 2 comments

@idtoufik

We have established a fully automated workflow that collects ARA production reports, processes them, and forwards them to the Aggregation Service. Additionally, we are considering using PAA to process bid request data in order to improve our bidding models by collecting data about lost auctions. However, this approach would substantially increase the workload on the Aggregation Service (AS), since the volume of bid requests far exceeds the data used for attribution. Specifically, we exceed the 400 million rows with 200 million domain buckets indicated in the Aggregation Service sizing guide.

We used one day of our production bidding data as input and modified the Aggregation Service tooling to generate valid PAA reports with debug mode enabled. In the end, we had approximately 2.23 billion reports with an associated domain file of 685 million distinct buckets.

We cannot batch PAA reports the same way we batch Attribution Reporting summary reports. For ARA summary reports, we group individual reports by shared_info.attribution_destination to create report batches; for each group, we create the associated domain by listing all the possible buckets of the advertiser identified by the shared_info.attribution_destination field.
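
For illustration, a minimal Python sketch of that grouping logic, assuming reports have already been decoded into dicts with a JSON-encoded shared_info string; the `bucket_registry` helper is a hypothetical placeholder for however the advertiser's buckets are tracked:

```python
import json
from collections import defaultdict

def batch_ara_reports(reports):
    """Group ARA summary reports by shared_info.attribution_destination."""
    batches = defaultdict(list)
    for report in reports:
        # shared_info is a JSON-encoded string inside each aggregatable report.
        shared_info = json.loads(report["shared_info"])
        batches[shared_info["attribution_destination"]].append(report)
    return batches

def domain_for_destination(destination, bucket_registry):
    """List every possible bucket for the advertiser identified by
    `destination`. `bucket_registry` is a hypothetical mapping from
    attribution_destination to the advertiser's known bucket keys."""
    return sorted(bucket_registry[destination])
```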

There is no such field in PAA, so the only remaining batching option is to batch by reception time, either daily or hourly. By design, the more data we aggregate, the less noise we get, so it is always better to run a daily aggregation. To stress test the AS, we first split our daily data into 100 batches and ran 100 different aggregations; we then tried to scale up and run a daily aggregation using the domain file for the whole day. The number 100 was picked arbitrarily.
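
A minimal sketch of reception-time batching along these lines, assuming a reception timestamp is available per report; the 100-way split is the arbitrary choice mentioned above:

```python
from datetime import datetime, timezone

def batch_index(reception_time: datetime, num_batches: int = 100) -> int:
    """Map a report's reception time to one of `num_batches` equal slices of
    the day; with num_batches=24 this becomes hourly batching."""
    midnight = reception_time.replace(hour=0, minute=0, second=0, microsecond=0)
    seconds_into_day = (reception_time - midnight).total_seconds()
    return int(seconds_into_day / 86_400 * num_batches)

# Example: a report received at 11:50 UTC falls into batch 49 of 100.
print(batch_index(datetime(2024, 6, 27, 11, 50, tzinfo=timezone.utc)))
```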

Attempt 1: Splitting both reports and domains

Daily data was split into 100 batches, resulting in approximately 22 million reports and 6.8 million domain buckets per batch, which falls within the instance recommendation matrix.

AWS setup

The configuration used m5.12xlarge EC2 instances with a maximum auto-scaling capacity of 15 instances. A custom script triggered all 100 aggregation jobs simultaneously, and the jobs were executed with debug mode enabled.
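
A hedged sketch of how such a trigger script could look, using the Aggregation Service createJob frontend endpoint; the gateway URL, bucket names, and prefixes are placeholders, and the exact required job_parameters may differ for PAA and for your deployment:

```python
import requests

# Placeholder endpoint of the Aggregation Service frontend API (createJob).
CREATE_JOB_URL = (
    "https://<frontend-api-id>.execute-api.<region>.amazonaws.com/stage/v1alpha/createJob"
)

def submit_batch(batch_id: str) -> None:
    """Submit one aggregation job for a report batch, with debug run enabled."""
    body = {
        "job_request_id": f"daily-load-test-{batch_id}",
        "input_data_blob_prefix": f"reports/batch-{batch_id}/",
        "input_data_bucket_name": "<report-bucket>",
        "output_data_blob_prefix": f"summaries/batch-{batch_id}/",
        "output_data_bucket_name": "<output-bucket>",
        "job_parameters": {
            "output_domain_blob_prefix": f"domains/batch-{batch_id}/",
            "output_domain_bucket_name": "<domain-bucket>",
            "debug_run": "true",
            # Other required job_parameters (e.g. reporting origin/site)
            # depend on the API being aggregated; see the AS API docs.
        },
    }
    requests.post(CREATE_JOB_URL, json=body, timeout=60).raise_for_status()

for i in range(100):
    submit_batch(f"{i:03d}")
```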

Results

Aggregation jobs were launched sequentially. All executions completed with the status DEBUG_SUCCESS_WITH_PRIVACY_BUDGET_EXHAUSTED, except for one batch, which finished with a SUCCESS status.

Each execution took about 30–35 minutes to finish, and the whole aggregation took approximately 4 hours to execute.
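
As a back-of-envelope check (our assumption about scheduling, not measured data), 100 jobs on at most 15 concurrent instances at roughly 33 minutes each drain in about 7 waves, which matches the observed ~4 hours:

```python
import math

# Back-of-envelope: 100 jobs, at most 15 running concurrently, ~33 min each.
jobs, instances, minutes_per_job = 100, 15, 33
waves = math.ceil(jobs / instances)           # 7 waves of jobs
total_hours = waves * minutes_per_job / 60    # ~3.85 h, close to the observed ~4 h
print(waves, round(total_hours, 2))
```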

Number of visible messages in AWS SQS

The graph shows the number of jobs remaining to be processed on AWS. The first job request was received at approximately 11:50 and the last job finished at 15:48.

Note:

Almost all of our executions completed with DEBUG_SUCCESS_WITH_PRIVACY_BUDGET_EXHAUSTED, so the batching strategy we used for this load test is not viable in production.

Attempt 2: 100 batches with a single domain file

Since we wanted to run the aggregation on the entire domain, we batched only the reports and kept the domain file as is. This resulted in 100 aggregation jobs, each with 22 million input reports and 684 million input domain buckets.

Job executions resulted in an INPUT_DATA_READ_FAILED error. The AS logs are not very explicit, but this seems to be related to the domain file being too large. We used m5.12xlarge and then m5.24xlarge instances; the results were the same. Below is the job stack trace from the AS.

com.google.aggregate.adtech.worker.exceptions.AggregationJobProcessException: Exception while reading domain input data.
com.google.aggregate.adtech.worker.aggregation.concurrent.ConcurrentAggregationProcessor.process(ConcurrentAggregationProcessor.java:305)
com.google.aggregate.adtech.worker.WorkerPullWorkService.run(WorkerPullWorkService.java:142)
com.google.common.util.concurrent.AbstractExecutionThreadService$1.lambda$doStart$1(AbstractExecutionThreadService.java:57)
The root cause is: software.amazon.awssdk.services.s3.model.S3Exception: The requested range is not satisfiable (Service: S3, Status Code: 416, Request ID: XXXXXXXXXXXXXXXX, Extended Request ID: XXXXXXXXXXXXXXXX)
software.amazon.awssdk.core.internal.http.CombinedResponseHandler.handleErrorResponse(CombinedResponseHandler.java:125)
software.amazon.awssdk.core.internal.http.CombinedResponseHandler.handleResponse(CombinedResponseHandler.java:82)
software.amazon.awssdk.core.internal.http.CombinedResponseHandler.handle(CombinedResponseHandler.java:60)
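
For context, S3 returns HTTP 416 when a ranged GET asks for bytes beyond the end of an object. The snippet below only reproduces that S3 error class in isolation (bucket and key are hypothetical); whether the AS issues such a ranged read on large domain files is our speculation, not confirmed:

```python
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
try:
    # A byte range entirely past the end of the object yields the same
    # "requested range is not satisfiable" (HTTP 416) error seen above.
    s3.get_object(
        Bucket="<domain-bucket>",                      # hypothetical bucket/key
        Key="domains/full-day/domain.avro",
        Range="bytes=999999999999-1000000000000",
    )
except ClientError as err:
    print(err.response["ResponseMetadata"]["HTTPStatusCode"])  # 416
    print(err.response["Error"]["Code"])                       # "InvalidRange"
```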

Load test conclusions

  1. When inputting report/domain volumes that fall within the instance recommendation matrix, aggregation finishes successfully in reasonable time frames that are not far from those shared here.

  2. When the input report/domain volume is above that mentioned in the sizing guidance, the Aggregation Service encounters failures even with the largest instances. This poses a problem for Criteo, as it puts our target use case for the Private Aggregation API at risk.

  3. The sizing shared in the doc is the true maximum of the AS; any use case requiring more domain buckets should be revisited.

  4. As we have seen before when investigating ARA issues, but even more pronounced here (because of new errors we had not encountered so far), the AS error messages are hard to understand and do not tell us why jobs are failing.

To address these issues and improve the feasibility of our use case, we recommend the following actions:

  • Review and improve the scalability of the Aggregation Service to handle higher volumes of data, both in terms of reports and especially domains.
  • Enhance the error messages and logs to provide more explicit information about the causes of failures. This will help in diagnosing and addressing issues more efficiently.
@nlrussell

Thank you for taking the time to provide this detailed feedback! We are currently working on both troubleshooting documentation and scaling the output domain, so we will provide an update as those efforts progress.

@nlrussell

@idtoufik Follow-up question: how are you splitting your daily batches into 100? PAA does not have additional fields to split batches on, so we would expect that a daily batch could be split into at most 24. If you're splitting into 100, you may have overlapping shared IDs across multiple jobs (which could lead to privacy-budget-exhausted issues).
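
For reference, a minimal sketch of the at-most-24 hourly split described above, keyed on the scheduled_report_time carried in each report's shared_info (assuming shared_info is a JSON string and scheduled_report_time is epoch seconds; adapt to the actual report format):

```python
import json
from datetime import datetime, timezone

def hourly_batch_key(report: dict) -> str:
    """Key a PAA report by the hour of shared_info.scheduled_report_time,
    yielding at most 24 batches per day."""
    shared_info = json.loads(report["shared_info"])
    ts = datetime.fromtimestamp(int(shared_info["scheduled_report_time"]), tz=timezone.utc)
    return ts.strftime("%Y-%m-%dT%H")
```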
