Allow multiple prefixes in aggregation jobs #67

Open
wojtek-rybak opened this issue Jul 19, 2024 · 2 comments

@wojtek-rybak

According to the API specification, when creating a job, we must provide the bucket name and the prefix of the files with aggregatable reports to be included in the aggregation. Since I want to perform the aggregation every hour, it seems necessary to have a separate prefix for each hour. For example:

  • /data/2024-07-19/00/...
  • /data/2024-07-19/01/...
  • /data/2024-07-19/02/...
  • /data/2024-07-19/03/...
  • etc.

However, if I need to perform an aggregation over a 6-hour interval (using a different filtering ID), I run into a problem. The API only allows one prefix, which means I would need to copy the data to a new location. This approach seems impractical and inefficient.

It would be highly beneficial if the aggregation service could accept a list of prefixes. This change would allow more flexibility in specifying the data intervals for aggregation without needing to duplicate data.
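To make the request concrete, here is a minimal Python sketch of what the job request body looks like today with a single prefix, and what a multi-prefix variant could look like. The field names mirror the public createJob examples, and `input_data_blob_prefixes` in particular is a hypothetical field name used purely for illustration, not part of the current API.

```python
import json
import urllib.request

# Current shape: one bucket and ONE input prefix per createJob request,
# so each hourly folder needs its own job.
single_prefix_job = {
    "job_request_id": "hourly-2024-07-19-00",
    "input_data_bucket_name": "my-reports-bucket",
    "input_data_blob_prefix": "data/2024-07-19/00/",  # a single hour
    "output_data_bucket_name": "my-summaries-bucket",
    "output_data_blob_prefix": "summaries/2024-07-19/00/",
}

# Proposed shape (hypothetical): a LIST of prefixes, so a 6-hour aggregation
# can reference the existing hourly folders without copying any data.
multi_prefix_job = {
    "job_request_id": "six-hour-2024-07-19-00-05",
    "input_data_bucket_name": "my-reports-bucket",
    "input_data_blob_prefixes": [f"data/2024-07-19/{h:02d}/" for h in range(6)],
    "output_data_bucket_name": "my-summaries-bucket",
    "output_data_blob_prefix": "summaries/2024-07-19/six-hour/",
}


def create_job(frontend_url: str, body: dict) -> int:
    """POST a job request to the aggregation service frontend (URL is a placeholder)."""
    req = urllib.request.Request(
        f"{frontend_url}/v1alpha/createJob",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status
```

With a prefix list, the same hourly folders could feed both the hourly jobs and the 6-hour job (run with a different filtering ID), with no duplication of the underlying reports.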

@nlrussell

Hi @wojtek-rybak, thanks for providing this feedback. Can you say more about how costly this workaround is and how much of a blocker it is for you, so we can factor that into the priority of this request?

@wojtek-rybak (Author)

Hi @nlrussell,

At RTB House, we are currently in the testing phase, working with a small amount of data from a subset of users, which amounts to tens of gigabytes per day. At this stage, the issue described is only a minor inconvenience.

However, we plan to start working on the final, production-ready solution around early October. For that phase, we estimate processing tens of terabytes of data per day. It would be highly beneficial if support for multiple prefixes in the aggregation service could be added by then, as it would let us avoid copying data in our final design.
