
Aggregation Service scaling needs #62

Open
idtoufik opened this issue Jun 27, 2024 · 2 comments

@idtoufik

We have established a fully automated workflow that collects ARA production reports, processes them, and forwards them to the Aggregation Service. Additionally, we are considering using PAA to process bid request data in order to improve our bidding models by collecting data about lost auctions. However, this approach would substantially increase the workload on the Aggregation Service (AS), since the volume of bid requests far exceeds the data used for attribution. Specifically, we exceed the 400 million rows with 200 million domain buckets indicated in the Aggregation Service sizing guide.

We used one day of our production bidding data as input and modified the Aggregation Service tooling to generate valid PAA reports with debug mode enabled. In the end, we had approximately 2.23 billion reports with an associated domain file of 685 million distinct buckets.

We cannot batch PAA reports the same way we batch Attribution Reporting summary reports. For ARA summary reports, we group individual reports by shared_info.attribution_destination to create report batches; for each group, we create the associated domain by listing all the possible buckets of the advertiser identified by the shared_info.attribution_destination field.
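
For illustration, a minimal Python sketch of that grouping logic, assuming reports have already been decoded into dicts with a JSON-encoded shared_info string; the `bucket_registry` helper is a hypothetical placeholder for however the advertiser's buckets are tracked:

```python
import json
from collections import defaultdict

def batch_ara_reports(reports):
    """Group ARA summary reports by shared_info.attribution_destination."""
    batches = defaultdict(list)
    for report in reports:
        # shared_info is a JSON-encoded string inside each aggregatable report.
        shared_info = json.loads(report["shared_info"])
        batches[shared_info["attribution_destination"]].append(report)
    return batches

def domain_for_destination(destination, bucket_registry):
    """List every possible bucket for the advertiser identified by
    `destination`. `bucket_registry` is a hypothetical mapping from
    attribution_destination to the advertiser's known bucket keys."""
    return sorted(bucket_registry[destination])
```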

There is no such field in PAA, so the only remaining batching option is to batch by reception time, either daily or hourly. By design, the more data we aggregate, the less noise we get, so it is always better to run a daily aggregation. To stress test the AS, we first split our daily data into 100 batches and ran 100 different aggregations; we then tried to scale up and run a daily aggregation using the domain file for the whole day. The number 100 was picked arbitrarily.
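
A minimal sketch of reception-time batching along these lines, assuming a reception timestamp is available per report; the 100-way split is the arbitrary choice mentioned above:

```python
from datetime import datetime, timezone

def batch_index(reception_time: datetime, num_batches: int = 100) -> int:
    """Map a report's reception time to one of `num_batches` equal slices of
    the day; with num_batches=24 this becomes hourly batching."""
    midnight = reception_time.replace(hour=0, minute=0, second=0, microsecond=0)
    seconds_into_day = (reception_time - midnight).total_seconds()
    return int(seconds_into_day / 86_400 * num_batches)

# Example: a report received at 11:50 UTC falls into batch 49 of 100.
print(batch_index(datetime(2024, 6, 27, 11, 50, tzinfo=timezone.utc)))
```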

Attempt 1: Splitting both reports and domains

Daily data was split into 100 batches, resulting in approximately 22 million reports and 6.8 million domain buckets per batch, which falls within the instance recommendation matrix.

AWS setup

The configuration used m5.12xlarge EC2 instances with a maximum auto-scaling capacity of 15 instances. A custom script triggered all 100 aggregation jobs simultaneously, and the jobs were executed with debug mode enabled.
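
A hedged sketch of how such a trigger script could look, using the Aggregation Service createJob frontend endpoint; the gateway URL, bucket names, and prefixes are placeholders, and the exact required job_parameters may differ for PAA and for your deployment:

```python
import requests

# Placeholder endpoint of the Aggregation Service frontend API (createJob).
CREATE_JOB_URL = (
    "https://<frontend-api-id>.execute-api.<region>.amazonaws.com/stage/v1alpha/createJob"
)

def submit_batch(batch_id: str) -> None:
    """Submit one aggregation job for a report batch, with debug run enabled."""
    body = {
        "job_request_id": f"daily-load-test-{batch_id}",
        "input_data_blob_prefix": f"reports/batch-{batch_id}/",
        "input_data_bucket_name": "<report-bucket>",
        "output_data_blob_prefix": f"summaries/batch-{batch_id}/",
        "output_data_bucket_name": "<output-bucket>",
        "job_parameters": {
            "output_domain_blob_prefix": f"domains/batch-{batch_id}/",
            "output_domain_bucket_name": "<domain-bucket>",
            "debug_run": "true",
            # Other required job_parameters (e.g. reporting origin/site)
            # depend on the API being aggregated; see the AS API docs.
        },
    }
    requests.post(CREATE_JOB_URL, json=body, timeout=60).raise_for_status()

for i in range(100):
    submit_batch(f"{i:03d}")
```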

Results

Aggregation jobs were launched sequentially. All executions completed with the status DEBUG_SUCCESS_WITH_PRIVACY_BUDGET_EXHAUSTED, except for one batch, which finished with a SUCCESS status.

Each execution took about 30–35 minutes to finish, and the whole aggregation took approximately 4 hours to execute.
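
As a back-of-envelope check (our assumption about scheduling, not measured data), 100 jobs on at most 15 concurrent instances at roughly 33 minutes each drain in about 7 waves, which matches the observed ~4 hours:

```python
import math

# Back-of-envelope: 100 jobs, at most 15 running concurrently, ~33 min each.
jobs, instances, minutes_per_job = 100, 15, 33
waves = math.ceil(jobs / instances)           # 7 waves of jobs
total_hours = waves * minutes_per_job / 60    # ~3.85 h, close to the observed ~4 h
print(waves, round(total_hours, 2))
```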

Number of visible messages in AWS SQS

The graph shows the number of jobs remaining to be processed on AWS. The first job request was received at approximately 11:50 and the last job finished at 15:48.

Note:

Almost all of our executions completed with DEBUG_SUCCESS_WITH_PRIVACY_BUDGET_EXHAUSTED, so the batching strategy we used for this load test is not viable in production.

Attempt 2: 100 batches with a single domain file

Since we wanted to run the aggregation on the entire domain, we batched only the reports and kept the domain file as is. This resulted in 100 aggregation jobs, each with 22 million input reports and 684 million input domain buckets.

Job executions resulted in an INPUT_DATA_READ_FAILED error. The AS logs are not very explicit, but this seems to be related to the domain file being too large. We used m5.12xlarge and then m5.24xlarge instances; the results were the same. Below is the job stack trace from the AS.

com.google.aggregate.adtech.worker.exceptions.AggregationJobProcessException: Exception while reading domain input data.
com.google.aggregate.adtech.worker.aggregation.concurrent.ConcurrentAggregationProcessor.process(ConcurrentAggregationProcessor.java:305)
com.google.aggregate.adtech.worker.WorkerPullWorkService.run(WorkerPullWorkService.java:142)
com.google.common.util.concurrent.AbstractExecutionThreadService$1.lambda$doStart$1(AbstractExecutionThreadService.java:57)
The root cause is: software.amazon.awssdk.services.s3.model.S3Exception: The requested range is not satisfiable (Service: S3, Status Code: 416, Request ID: XXXXXXXXXXXXXXXX, Extended Request ID: XXXXXXXXXXXXXXXX)
software.amazon.awssdk.core.internal.http.CombinedResponseHandler.handleErrorResponse(CombinedResponseHandler.java:125)
software.amazon.awssdk.core.internal.http.CombinedResponseHandler.handleResponse(CombinedResponseHandler.java:82)
software.amazon.awssdk.core.internal.http.CombinedResponseHandler.handle(CombinedResponseHandler.java:60)
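
For context, S3 returns HTTP 416 when a ranged GET asks for bytes beyond the end of an object. The snippet below only reproduces that S3 error class in isolation (bucket and key are hypothetical); whether the AS issues such a ranged read on large domain files is our speculation, not confirmed:

```python
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
try:
    # A byte range entirely past the end of the object yields the same
    # "requested range is not satisfiable" (HTTP 416) error seen above.
    s3.get_object(
        Bucket="<domain-bucket>",                      # hypothetical bucket/key
        Key="domains/full-day/domain.avro",
        Range="bytes=999999999999-1000000000000",
    )
except ClientError as err:
    print(err.response["ResponseMetadata"]["HTTPStatusCode"])  # 416
    print(err.response["Error"]["Code"])                       # "InvalidRange"
```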

Load test conclusions

  1. When inputting report/domain volumes that fall within the instance recommendation matrix, aggregation finishes successfully in reasonable time frames that are not far from those shared here.

  2. When the input report/domain volume is above that mentioned in the sizing guidance, the Aggregation Service encounters failures even with the largest instances. This poses a problem for Criteo, as it puts our target use case for the Private Aggregation API at risk.

  3. The sizing shared in the doc is the true maximum of the AS; any use case requiring more domain buckets should be revisited.

  4. As we have seen before when investigating ARA issues, but even more pronounced here (because of new errors we had not encountered so far), the AS error messages are hard to understand and do not tell us why jobs are failing.

To address these issues and improve the feasibility of our use case, we recommend the following actions:

  • Review and improve the scalability of the Aggregation Service to handle higher volumes of data, both in terms of reports and especially domains.
  • Enhance the error messages and logs to provide more explicit information about the causes of failures. This will help in diagnosing and addressing issues more efficiently.
@nlrussell

Thank you for taking the time to provide this detailed feedback! We are currently working on both troubleshooting documentation and scaling the output domain, so we will provide an update as those efforts progress.

@nlrussell

@idtoufik Follow-up question: how are you splitting your daily batches into 100? PAA does not have additional fields to split batches on, so we would expect that a daily batch could be split into at most 24. If you're splitting into 100, you may have overlapping shared IDs across multiple jobs (which could lead to privacy-budget-exhausted issues).
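
For reference, a minimal sketch of the at-most-24 hourly split described above, keyed on the scheduled_report_time carried in each report's shared_info (assuming shared_info is a JSON string and scheduled_report_time is epoch seconds; adapt to the actual report format):

```python
import json
from datetime import datetime, timezone

def hourly_batch_key(report: dict) -> str:
    """Key a PAA report by the hour of shared_info.scheduled_report_time,
    yielding at most 24 batches per day."""
    shared_info = json.loads(report["shared_info"])
    ts = datetime.fromtimestamp(int(shared_info["scheduled_report_time"]), tz=timezone.utc)
    return ts.strftime("%Y-%m-%dT%H")
```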
