Aggregation of execution context when partitioning [BATCH-53] #3524
Comments
Tommy C. Trang commented Shouldn't we keep this simple and just have the statistics for the current run, restart or not? If we need to know the overall value across multiple runs, then a report can be generated to aggregate the information. This keeps the architecture simple and avoids trying to be smart about statistics across multiple runs.
Dave Syer commented It doesn't actually require a partitioning container to encounter this problem. The SimpleBatchContainer can be used to process chunks in parallel by using a TaskExecutorRepeatTemplate as its stepOperations property. The problem is less severe there because the aggregation can in principle be done in process using standard concurrency techniques (e.g. synchronized access to the counters). But it might not actually be thread-safe in the current implementation.
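The in-process aggregation Dave describes can be sketched with atomic counters so that concurrent chunk workers never lose updates. This is a minimal illustrative example, not Spring Batch code; the `ChunkCounters` class and its fields are hypothetical names.

```java
// Hypothetical sketch: aggregating chunk statistics from parallel workers.
// AtomicLong makes the increments thread-safe without explicit locks.
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class ChunkCounters {
    private final AtomicLong readCount = new AtomicLong();
    private final AtomicLong skipCount = new AtomicLong();

    // Called by each worker thread after it finishes a chunk.
    void recordChunk(long reads, long skips) {
        readCount.addAndGet(reads);
        skipCount.addAndGet(skips);
    }

    public static void main(String[] args) throws Exception {
        ChunkCounters counters = new ChunkCounters();
        ExecutorService pool = Executors.newFixedThreadPool(4);
        // Simulate 100 concurrently processed chunks of 10 reads / 1 skip each.
        for (int i = 0; i < 100; i++) {
            pool.submit(() -> counters.recordChunk(10, 1));
        }
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
        System.out.println(counters.readCount.get() + " " + counters.skipCount.get());
    }
}
```

With plain (non-atomic) `long` fields the same program could print a smaller total, which is the thread-safety concern raised above.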
Wayne Lund commented Tommy's response is the same position that Lucas took in our discussion. I wanted Tsay to weigh in because I thought this was an issue for the harvested batch. Maybe if the queries are preset for aggregation of the results, which is easy enough to do, then this should suffice. I agree, it's simpler. Good point by Dave on the problem being more general. As soon as processing is split we need a way to aggregate the results, and the options are the same: (1) increment aggregate counters while processing, or (2) aggregate the results at the end of processing. Option 1 provides better real-time information at the cost of potential concurrency impacts; option 2 gives post-processing information at the end of the run. If that meets operational needs, I'm fine with the latter.
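Option 2 above (aggregate at the end of processing) amounts to a single-threaded pass over the per-partition results once all workers have finished. A minimal sketch, assuming a hypothetical `PartitionResult` holding each partition's persisted counts:

```java
// Illustrative sketch of option 2: post-processing aggregation.
// PartitionResult is a made-up type, not part of Spring Batch.
import java.util.List;

public class EndOfRunAggregator {
    record PartitionResult(String stepName, long readCount, long skipCount) {}

    // Single-threaded summation, so no concurrency control is needed.
    static long[] aggregate(List<PartitionResult> results) {
        long reads = 0, skips = 0;
        for (PartitionResult r : results) {
            reads += r.readCount();
            skips += r.skipCount();
        }
        return new long[] { reads, skips };
    }

    public static void main(String[] args) {
        List<PartitionResult> results = List.of(
            new PartitionResult("step:partition0", 500, 3),
            new PartitionResult("step:partition1", 480, 1),
            new PartitionResult("step:partition2", 520, 0));
        long[] totals = aggregate(results);
        System.out.println(totals[0] + " " + totals[1]);
    }
}
```

The trade-off is exactly as stated in the comment: the totals are only available after the run, but no counter contention occurs while partitions are executing.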
Dave Syer commented Aggregation has to be handled by a single-threaded or synchronized process. I think we understand the problem well enough to push it off until a later version now.
Lucas Ward commented I don't see how this could be addressed for 1.1; this seems like a 2.0 issue?
Dave Syer commented StepExecution aggregation is now a separate strategy in the prototype. We probably need to address the ExecutionContext as well.
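The "separate strategy" idea mentioned above can be sketched as a pluggable interface that folds per-partition step results into a single parent result. This is a simplified, self-contained illustration of the pattern, not the actual Spring Batch StepExecutionAggregator API; all names here are hypothetical.

```java
// Simplified sketch of an aggregation strategy: the aggregation policy
// is a pluggable interface, so different implementations can be swapped in.
import java.util.Collection;
import java.util.List;

public class AggregatorStrategyDemo {
    static class StepResult {
        long readCount;
        long skipCount;
    }

    // The strategy: how partition results are folded into the parent.
    interface StepResultAggregator {
        void aggregate(StepResult parent, Collection<StepResult> partitions);
    }

    // One concrete strategy: simple summation of counts.
    static final StepResultAggregator SUMMING = (parent, partitions) -> {
        for (StepResult p : partitions) {
            parent.readCount += p.readCount;
            parent.skipCount += p.skipCount;
        }
    };

    public static void main(String[] args) {
        StepResult a = new StepResult();
        a.readCount = 10; a.skipCount = 2;
        StepResult b = new StepResult();
        b.readCount = 15; b.skipCount = 1;
        StepResult parent = new StepResult();
        SUMMING.aggregate(parent, List.of(a, b));
        System.out.println(parent.readCount + " " + parent.skipCount);
    }
}
```

Making the aggregator a strategy is what lets the ExecutionContext question raised above be handled later by a different implementation without touching the partitioning code.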
Wayne Lund opened BATCH-53 and commented
Batch Statistics Aggregation - we had a discussion some months back about modeling our schema so that the parent tables stored the summary, aggregated data for the detail at the next level down. We are getting questioned by clients about how we are going to do this in our partitioning scenarios. Chong and Lucas discussed yesterday, and when Lucas and I reviewed, we realized that we were looking at the issue a little differently. I remember considerable noise at one of our clients because we didn't cover all of the scenarios correctly when aggregating statistics. For example, we had an issue getting statistics aggregated correctly on jobs that had been restarted multiple times, in order to provide the true number of records processed. Here are some thoughts on what I think we're aggregating.
a. Skips & Records processed for the job ::= statistics for all records processed by the job, regardless of how many skips. I remember in our early performance testing Mike Tsay kept a table tracking how many records we were processing per time unit. We had five steps running on the CDX batch jobs, each having their own set of record types they processed (e.g. participants, addresses, employers, court cases, etc.). POINT OF AGGREGATION
b. Skips & Records processed for the Partitioned Step ::= all records and skips aggregated across all partitioned steps. POINT OF AGGREGATION
c. Skips & Records processed for a Step ::= all records within a single step. Different from B because a single step is only concerned with its own records.
A & B need to produce the same results whether a job has been restarted or not; in other words, the statistics and summaries should be the same regardless of whether restarts were involved. Would we meet project needs if we didn't expose status and statistics at this level of detail?
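The three aggregation levels (a: job, b: partitioned step, c: single partition/step) can be computed from the same flat per-partition counts in one pass. A minimal sketch, with hypothetical step names and a made-up `StepCount` record:

```java
// Illustrative sketch of the three aggregation levels from flat counts.
// Level C is each record itself; level B groups by step name; level A
// sums across the whole job. All names here are hypothetical.
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class AggregationLevels {
    record StepCount(String stepName, String partition, long processed, long skips) {}

    public static void main(String[] args) {
        // Level C: each record already holds a single step's own counts.
        List<StepCount> counts = List.of(
            new StepCount("loadParticipants", "p0", 100, 2),
            new StepCount("loadParticipants", "p1", 120, 0),
            new StepCount("loadAddresses",    "p0", 80,  1));

        Map<String, Long> perStep = new TreeMap<>(); // Level B: per partitioned step
        long jobTotal = 0;                           // Level A: whole job
        for (StepCount c : counts) {
            perStep.merge(c.stepName(), c.processed(), Long::sum);
            jobTotal += c.processed();
        }
        System.out.println(perStep + " " + jobTotal);
    }
}
```

For the restart requirement in the paragraph above, the key design point is that these sums must be computed over the cumulative persisted counts, so the same query yields identical totals whether the job ran once or was restarted several times.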
This issue is a sub-task of BATCH-677