job/step setup should be separated from execution [BATCH-372] #3207
Comments
Robert Kasanicky commented Correction: the findOrCreateJobInstance repository method mentioned in the description no longer exists; it has been replaced by the (equally vague) createJobExecution, which hides the difference between an existing and a new job instance.
Dave Syer commented I agree that there is some awkwardness with the decision about whether we are re-starting or starting. But createJobExecution() has to be atomic, in case someone else is trying to make the same decision somewhere else, even on another physical machine sharing the same database. So the code you wrote above is probably already in the implementation of createJobExecution(). I think we have to stick with it (or maybe the logic can be tidied up?). Maybe you could take a look at the equivalent pattern with StepExecution, which doesn't need to be atomic for the same reason, but might for another. There are some methods in the Step interface that seem out of place, and they are only there so that a Job implementation can decide whether to run the Step based on some configuration. Maybe that could be more encapsulated?
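For readers following along, here is a minimal sketch of the check-and-create pattern Dave describes, using hypothetical simplified types rather than the real Spring Batch API; a database-backed implementation would rely on a transaction and row locking rather than `synchronized`:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified types -- not the real Spring Batch API.
class SketchJobInstance {
    final String jobName;
    final String parametersKey;
    SketchJobInstance(String jobName, String parametersKey) {
        this.jobName = jobName;
        this.parametersKey = parametersKey;
    }
}

class SketchJobExecution {
    final SketchJobInstance instance;
    final boolean restart;
    SketchJobExecution(SketchJobInstance instance, boolean restart) {
        this.instance = instance;
        this.restart = restart;
    }
}

public class SketchJobRepository {
    private final Map<String, SketchJobInstance> instances = new HashMap<>();

    // The whole method must be atomic: a database-backed repository relies on
    // a transaction plus row locking so that two launchers sharing the same
    // tables cannot both create an execution for the same Job + JobParameters.
    public synchronized SketchJobExecution createJobExecution(String jobName, String parametersKey) {
        String key = jobName + "|" + parametersKey;        // identity: Job + JobParameters
        SketchJobInstance instance = instances.get(key);
        boolean restart = (instance != null);              // existing instance => restart
        if (!restart) {
            instance = new SketchJobInstance(jobName, parametersKey);
            instances.put(key, instance);
        }
        return new SketchJobExecution(instance, restart);  // the start/restart decision is hidden here
    }
}
```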
Robert Kasanicky commented My main idea is that we should handle all restart-related concerns at once in something like a "job initialization phase" - once the job starts processing records it wouldn't need to look back anymore. Namely, steps wouldn't deal with getting restart data at all (why should they anyway? When ChunkedStep was implemented it had to duplicate this logic from TaskletStep). I think this translates to creating step executions (with status NOT_YET_RUN or whatever) together with the job execution, i.e. when we know whether we are creating a job instance or restarting. Maybe a useful point of view is this: we have an item processing framework with a persistent execution context => it allows us to build restart around it rather than into it. The same applies, e.g., to statistics.
Lucas Ward commented I think I agree with Dave: this should really be handled by the JobRepository. However, I think it needs to be a bit more of a change from the bottom up. The problem right now is that we need to check whether or not we've restarted to determine whether or not we call restoreFrom(ExecutionContext); I think this is a large part of the problem. I would go with Dave's suggestion of creating the StepExecution in the Repository like we do with the JobExecution. Further, I would give it the ExecutionContext from the last execution, assuming there was one; if not, I would put an empty one in it. I think we could then drop restoreFrom and just have open(ExecutionContext), which we call regardless. After all, why do we care? The ItemReader and ItemWriter can easily make this determination: if the ExecutionContext is empty, or the key it's looking for isn't there, it assumes it should start from 0. It also helps with another issue I've run into with clients. For example, if someone is reading from a database and writing to a file, and using the process-indicator approach on the read, our current approach has issues, because the 'saveExecutionContext' attribute is on the Step. It could just as easily be on the readers and writers themselves; if the value is false, the stream simply won't add its state to the context. In this way, the step doesn't need to care about determining whether it was a restart, since it always gives the context to the streams, leaving the repository to ensure that the correct context is attached in restart scenarios.
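A minimal sketch of the reader-side behavior Lucas describes, with hypothetical names (SketchItemReader, RESTART_KEY) and a plain Map standing in for ExecutionContext:

```java
import java.util.Map;

// Hypothetical reader illustrating the "always call open(EC)" idea: the step no
// longer decides whether this is a restart; the reader looks for its own key in
// whatever context the repository attached to the StepExecution.
public class SketchItemReader {

    private static final String RESTART_KEY = "sketch.reader.lineCount"; // illustrative key name
    private long lineCount = 0;

    // Called on every run, restart or not.
    public void open(Map<String, Object> executionContext) {
        if (executionContext.containsKey(RESTART_KEY)) {
            lineCount = (Long) executionContext.get(RESTART_KEY);  // resume from the saved position
        }
        // an empty context, or a missing key, simply means: start from 0
    }

    // Called so the repository can persist restart state; a reader configured not
    // to save state (saveState = false) just leaves the context untouched.
    public void update(Map<String, Object> executionContext, boolean saveState) {
        if (saveState) {
            executionContext.put(RESTART_KEY, lineCount);
        }
    }
}
```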
Robert Kasanicky commented I think one of the subtleties involved is that we should invert the relationship of JobExecution and StepExecution: a JobExecution consists of StepExecutions, while a StepExecution does not need to care which JobExecution it belongs to. This gives us a self-contained JobExecution that can be run. Currently JobExecution creates StepExecutions, which makes no sense to me. Concerning open(ExecutionContext), it's essentially the same as calling restoreFrom(..) under all circumstances. I think it would eventually blur the picture that is just beginning to become clear - that the first run (and maybe re-run) is a fundamentally different scenario from restart, and we need to reflect that in design and execution logic instead of having a single "main line" that adjusts on the fly. I'm also starting to like the idea of having separate launch methods for starting and restarting the job - it makes sense to me that the user should say what they want rather than the framework interpreting the same command differently according to circumstances. The current mechanism seems to be nonsense anyway: if a job is not restartable, a new job instance is always created, resulting in separate job instances that differ only in id - I have no sensible explanation of what a job instance represents here. On the other hand, you can't re-run a restartable job instance, you can only restart it - I think re-run is a sensible requirement, e.g. you lost the output data and want to recreate it (I guess isAllowStartIfComplete is an unfortunate ad-hoc fix for this).
Lucas Ward commented Robert, Wayne, Scott, Doug and I had a long discussion about this issue today and came up with the following relative consensus. The comments by Robert and me represent two distinct issues.

The first is how StepExecutions are made and how their execution contexts are loaded. Everyone seemed to agree that it makes the most sense for StepExecutions to be created by the Repository. JobExecutions are already handled this way, so symmetry would be restored; also, the Repository could then take responsibility for putting the correct ExecutionContext onto the StepExecution when it is created.

The second issue is how restart is handled from the JobInstance perspective. First, I'll summarize the present state. When someone launches a job, the JobParameters are compared with the JobParameters used for any previous instances of the Job, and if a match exists a new JobExecution is created, tied to the existing JobInstance. If the steps were all complete (and not set to allowStartIfComplete) then nothing will happen; otherwise, the job should restart from where it left off. This is all assuming restartable = true. If restartable is false, a new JobInstance will always be created. This means that, in theory, there could be an infinite number of JobInstances for the same job with the same JobParameters. Robert very correctly pointed out that this is a bit weird, since we've always maintained that Job + JobParameters = JobInstance, and that is broken by the above.

Robert's solution outlined above was generally not received well. Creating two execution paths from the Launcher 'smells funny'. Robert's basic premise is that you should be able to have a JobInstance with any number of executions, irrespective of whether or not they are successful. This would have to be managed by the client, as the client would be the only one who knows whether or not something is a restart. With this approach there is no way to determine, simply by looking at the meta-data, whether something is a restart, so everyone using the framework would have to make that determination. That is problematic because it breaks standard batch patterns. For example, there is no 'restart' or equivalent in JCL. Further, due to the 'headless' nature of batch, the thing launching the job may not really know whether or not it should be a restart, so manual intervention would be required in every case where a failure happens. Furthermore, it would require information about restart to percolate all the way up the call stack: the JobLauncher interface would need runForRestart, as would the Job, Step, and on down to the ItemStreams. That would be the only way to run the job and ensure everyone knows it's a restart as opposed to a normal launch. It's much easier to let the repository look back and pull data if it needs to, using the approach I outlined in an earlier comment. Having said all of that, there may still be merit in providing a manual method on the Launcher for restart, but its sole purpose would have to be indicating that a new Instance should be created.

This still leaves the problem of 'restartable' and how it breaks what a JobInstance is. It was mentioned that the use case restartable is trying to solve could just as easily be solved by tacking on another JobParameter. However, Doug surmised that having one 'top-level toggle' to always indicate that a new instance should be created is useful in many scenarios.

It was thought, then, that it would be just as easy for the Repository to take on the extra parameter in the case of restartable = false (or whatever it ends up being named). That way, each instance would still have a distinct set of parameters. All of this confusion essentially stems from a poor definition of the JobInstance. It represents a uniquely identifiable job run that could have many executions. One could look at the JobExecutions, see "fail, fail, interrupted, completed", and get a complete idea of the lifespan of a job. However, if you don't plan to hold restart data, it's just as good to have one instance per run.
Wayne Lund commented Nice summary by Lucas, but I wanted to summarize the actions for Robert and others:
Can everybody stack hands on that?
Douglas C. Kaminsky commented I agree. To maintain the JobName+JobParameters uniqueness invariant, the new job instance in the case where the job is not restartable should have an automatically injected parameter from the job repository. My hand's in, but mine is not really the important one here ;) On another note, should we keep a list of agreed-upon invariants to aid in development?
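As an illustration of the injected-parameter idea, a small hypothetical sketch; the parameter name "run.id" and the in-memory sequence counter are assumptions for illustration, not an agreed design:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the repository-injected discriminator: when a job is
// declared non-restartable, an extra parameter is appended so that every run
// still maps to a unique Job + JobParameters combination.
public class InstanceParametersFactory {

    private long runSequence = 0;

    public Map<String, String> instanceParameters(Map<String, String> userParameters, boolean restartable) {
        Map<String, String> parameters = new HashMap<>(userParameters);
        if (!restartable) {
            // "run.id" is an illustrative name; a real implementation would persist the counter
            parameters.put("run.id", String.valueOf(++runSequence));
        }
        return parameters;   // the JobInstance identity would be derived from this map
    }
}
```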
Robert Kasanicky commented I still feel uncomfortable about the proposed solution. Some thoughts to consider:
Robert Kasanicky commented Maybe we are missing the concept of a JobExecutor? When launching for restart, a RestartJobExecutor would be instantiated that finds the crashed step and restores its execution context (only for this single step!). It would ignore successfully completed steps, restart the crashed step, and run the following steps. A DefaultJobExecutor instantiated on an ordinary launch would just run the steps in order. For those who don't like having a separate restart launcher, we can always put a facade in front of the two launchers to make the decision on behalf of the user.
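A rough sketch of what that split might look like, with hypothetical types; the step names and console output merely stand in for real step execution:

```java
import java.util.List;

// Hypothetical types for the JobExecutor idea: one executor per scenario, plus a
// facade that makes the choice on behalf of callers who don't want to.
interface SketchJobExecutor {
    void run(List<String> stepNames);
}

class DefaultJobExecutor implements SketchJobExecutor {
    public void run(List<String> stepNames) {
        for (String step : stepNames) {
            System.out.println("execute " + step);                  // plain sequential run
        }
    }
}

class RestartJobExecutor implements SketchJobExecutor {
    private final String crashedStep;                                // known from the job metadata
    RestartJobExecutor(String crashedStep) { this.crashedStep = crashedStep; }

    public void run(List<String> stepNames) {
        boolean reachedCrashedStep = false;
        for (String step : stepNames) {
            if (!reachedCrashedStep && !step.equals(crashedStep)) {
                continue;                                            // skip steps that already completed
            }
            if (step.equals(crashedStep)) {
                reachedCrashedStep = true;
                System.out.println("restore execution context and restart " + step);
            } else {
                System.out.println("execute " + step);               // remaining steps run normally
            }
        }
    }
}

class JobExecutorFacade {
    // For those who prefer a single entry point, the facade decides which executor to use.
    SketchJobExecutor executorFor(boolean isRestart, String crashedStep) {
        return isRestart ? new RestartJobExecutor(crashedStep) : new DefaultJobExecutor();
    }
}
```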
Lucas Ward commented Robert, I have two points:
Robert Kasanicky commented Lucas, in answer to your two points:
nebhale commented (In reply to comment #9)
I'm totally in agreement here. We shouldn't need to be clever in this case, but rather very clear and explicit. I personally don't have a problem with Robert's described behavior here, but at the same time, I don't really have a strong opinion either way. The most important thing here is that the framework isn't being clever behind the scenes in a way that will have to be documented and understood by the users. I can only imagine what happens when a user tries to query for a Job Id and finds that we've mucked with it.
This is actually a huge concern of mine as well. I'm with Robert on the idea that a repository is for data access only, not to encapsulate any sort of business behavior.
Agreed! If we're worried about atomicity in a single VM, we can synchronize at a higher level of code, rather than trying to bundle it completely under the repository. If we're worried about atomicity between multiple VMs, then proper database locking is the answer here.
nebhale commented (In reply to comment #10)
I don't think that's quite necessary. I like the idea of not having multiple execution implementations (for this purpose, different environments are another question).
I don't think a separate launcher is necessary either. I see a situation where the sole launcher has some well defined semantics with the possibility for explicit override (--force equivalents). But inside the launcher it does the following:
In a situation like this the job is protected across different VMs, but it doesn't require us to push all of that behavior down into the repository, or split launchers/executions.
nebhale commented
I think this can be clarified a bit. I think there should be a single line of execution, but there should be a decision block very early that either avoids or gets the system to the same state that it would be in on a fresh start. Then the main-line execution can run.
I refute this fact! :) There is an intermediary between the client and the repository and that's the launcher. I think the launcher should be responsible for attempting to retrieve an existing instance and if one doesn't exist, creating a new one via a factory. As discussed earlier, this can be done in a nice atomic way all the way down to the database level.
Which is a great reason to move it out of the repository, which should be used for persistence behaviors.
The more I think about it, I think there should be a very simple default behavior (I'm still working out what I think would be best) with explicit alternate (simple) behaviors in the launcher. So for example, the launcher might always fail if an instance already exists unless the user puts in a --force argument. Then if the batch writer knew that they always wanted to restart existing instances (when they existed), they could just make sure that their cron did a --force.
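A minimal sketch of the default-with-override behavior described here, assuming a hypothetical --force flag mapped to a boolean and a simplified repository interface:

```java
// Hypothetical launcher sketch: fail fast if an instance already exists, unless
// the operator explicitly passes --force to say "yes, restart that instance".
public class SketchJobLauncher {

    // Minimal repository stand-in so the sketch is self-contained.
    interface InstanceStore {
        boolean instanceExists(String jobName, String parametersKey);
    }

    public void launch(String jobName, String parametersKey, boolean force, InstanceStore store) {
        boolean exists = store.instanceExists(jobName, parametersKey);
        if (exists && !force) {
            throw new IllegalStateException("An instance already exists for " + jobName
                    + " with parameters " + parametersKey + "; re-run with --force to restart it");
        }
        if (exists) {
            System.out.println("restarting existing instance");  // the --force path
        } else {
            System.out.println("starting new instance");
        }
    }
}
```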
Dave Syer commented I don't want to comment on everything that's been said on this issue, but one thing concerns me the most: the repository very explicitly has responsibility for atomically creating the JobExecution (across all clients sharing the database, relying on database locking as described in the JavaDocs). This probably cannot be delegated to a higher level without either breaking the atomicity, or extending the transaction boundary to cover more processing than necessary, making it more likely that a job will block and maybe time out unnecessarily.
Lucas Ward commented "I refute this fact! :) There is an intermediary between the client and the repository and that's the launcher. I think the launcher should be responsible for attempting to retrieve an existing instance and if one doesn't exist, creating a new one via a factory. As discussed earlier, this can be done in a nice atomic way all the way down to the database level." You actually didn't refute what I was saying. My point was that either we handle it somewhere, or the client (user) has to. This can be a scheduler or manual process. However, I agree with Dave, it's difficult to keep 'atomicity' if it's done in multiple steps through the launcher. But if there was a decent way to do it, I would have no issues with the Launcher having that logic. "The more I think about it, I think there should be a very simple default behavior (I'm still working out what I think would be best) with explicit alternate (simple) behaviors in the launcher. So for example, the launcher might always fail if an instance already exists unless the user puts in a --force argument. Then if the batch writer knew that they always wanted to restart existing instances (when they existed), they could just make sure that their cron did a --force." What's the difference between --force and job.restartable(volatile) = true? it looks about the same to me. I still don't like pushing these things to the users, which are inevitably batch scheduler and cron tabs. It's just generally not how batch is done. Sure, there are parameters for a job, and some are variable for business reasons, but having to provide an additional argument to a script because a job failed in order to make it restart is very different. However, I am okay with forcing an additional runtime parameter if you have a job that should normally restart from the position it was at when it failed, but you want to override that behavior and have it start from the beginning. |
Robert Kasanicky commented Concerning the "two execution paths" - maybe it is just a scary name for a completely innocent idea. What I'm really advocating is moving the restart concerns from step to job level. The job can coordinate steps in a meaningful way in a restart scenario, i.e. skip completed steps, restart just the single crashed step, and run the remaining steps in the ordinary fashion. Rather than forking the execution logic for each step, we would just fork it once at the job level. The effect would be:
It just makes sense, don't you think?
Lucas Ward commented I agree, the StepExecutor shouldn't have to think about it, but I think the solution is to load the StepExecution with an empty ExecutionContext if it's not a restart scenario, and with the context of the last execution if it is a restart. I'm absolutely okay with the Job making this determination instead of the repository, but I don't like having restoreFrom(EC) on the Step, or even executeForRestart(StepExecution).
nebhale commented What about simply never calling the Step at all if the job determines that the step shouldn't be restarted? That removes the need for the no-op call and any logic for handling empty arguments, etc.
Robert Kasanicky commented I would prefer having a very explicit execution flow, i.e. having restoreFrom(EC) on Step. In a restart scenario the job wouldn't run completed steps at all, would call restoreFrom + execute on the crashed step, and would just execute the remaining steps. I think it would be straightforward to implement and the semantics would make perfect sense. Using the open(EC) approach on readers and writers means making lots of redundant if-checks in the open(EC) method, and I don't really see what benefits it brings. I think it artificially binds together purely technical initialization with restoring from saved state (= semantically significant logic). Fundamentally it means pushing restart concerns from the Step down to the ItemStream, rather than up to the Job. I'd say it means cleaning up the step at the cost of polluting the ItemStream. Consequences for the number of "execution paths":
nebhale commented I definitely agree with number 1, Robert.
Lucas Ward commented Robert and Ben, I cannot disagree more. In number 3, you asserted a 'fork' in every ItemStream, which isn't really the case. Further, the Job can still handle restart even with the current approach, by determining whether the ExecutionContext should be loaded from a previous one; if it makes sense for that to live there rather than in the repository, that's fine. However, I disagree with the argument that 'the ItemStreams are forking' in the current approach. With what you're advocating, the Step and ItemStream interfaces would have to look like the following: Step gains restart(ExecutionContext executionContext), and ItemStream has open(ExecutionContext ec) plus a restart-specific variant. It should be noted that open would still be needed because of the recent change where there is one ExecutionContext per step, rather than trying to aggregate disparate ones (a separate issue, which I'm definitely convinced of the value of). Given that, the only advantage of having the two methods on ItemStream (plus the extra one on Step) is that open(EC) wouldn't need to do a contains-key check. I think this very trivial advantage is greatly outweighed by the disadvantages of the two extra methods. I just don't see why the ItemStream should care. To me, this is much, much more 'forked', and in a worse way: the API now has to know about the fork. Even the logic in both approaches gains nothing from the fork in the API: with one open, open(EC) starts with an if (containsRestartKey()) check; with the forked ItemStream there is both open(EC) and openForRestart(EC). To me the second approach is much more complicated and, even worse, adds the complication to the API as well.
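For clarity, a hedged reconstruction of the two variants being compared, written as plain interfaces with hypothetical names and a Map standing in for ExecutionContext:

```java
import java.util.Map;

// Reconstruction of the comparison above (hypothetical names, not the real API).
// Variant A ("one open"): a single open(EC) that does the key check itself.
interface OneOpenItemStream {
    // open(ec) { if (ec.containsKey(RESTART_KEY)) { restore saved state } else { start fresh } }
    void open(Map<String, Object> executionContext);
}

// Variant B ("forked"): the fork surfaces in the API -- two methods on the stream,
// plus an extra restart(EC) on Step, and the caller must pick the right one.
interface ForkedItemStream {
    void open(Map<String, Object> executionContext);            // fresh start
    void openForRestart(Map<String, Object> executionContext);  // restore saved state
}

interface ForkedStep {
    void execute();
    void restart(Map<String, Object> executionContext);         // restore, then execute
}
```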
Wayne Lund commented I was surprised to see this issue still getting so much commentary. I'm going to jump back to Robert's initial pushback here:

"1. Using technical tricks like automatically appending parameters to create a unique JobInstance makes alarm bells ring - we should not be trying to do clever arbitrary things in classes that form the foundation of the framework. If a job instance has a lifecycle that ends with the first successful execution we should refuse to start the same job instance ever after."

This is a very confusing response to me. We described before that there are times when operators, for whatever reason, want to restart a job with the same instance. It could be something as stupid as accidentally deleting the output file, so it's like getting a new copy. This isn't a violation of the job instance lifecycle; it respects it by requiring the user to specify through job parameters that they really intended to do this. If someone has a cleaner way to do it I'm certainly open to that, but I'm not hearing any alarm bells.

"2. The repository is quickly becoming a magic box with huge responsibility. My humble bet is Eric Evans has hardly ever seen a repository as capable as our SimpleJobRepository. My understanding is the repository is a data store abstraction - it should provide and persist data, i.e. do what it is told to do without thinking, rather than cooking magic recipes. Its purpose is to hide persistence specifics, not domain semantics."

The repository is already more of a service object than a DAO, so I agree with your comments about our capable repository. But the line between persistence specifics and domain semantics is very subjective, and I don't have a problem with some logic in the repository, keeping the transactional semantics as close to the data as possible.

"3. I understand job execution creation needs to be atomic, but I don't think it implies the whole logic around creating the job execution must be handled by the repository."
Dave Syer commented This has been a really interesting discussion. The consensus on the status call today was that the repository is part of the domain (not just a DAO), so it is appropriate for it to reason about domain objects. Hence we close this one as Won't Fix (there is another interesting thread on the side about whether ExecutionContext can be handled completely separately from the StepExecution at the repository level).
Robert Kasanicky opened BATCH-372 and commented
Currently, setup of the job/step execution (figuring out whether it should be run and whether it is being restarted) is part of the execute() method rather than being dealt with separately. This has some bad consequences: even when we create a fresh job instance, both the job and the step investigate their execution counts and last executions. This doesn't make sense, since it is clear the execution count must be zero and there is no last execution. Assuming a job instance is usually executed successfully and restart is an exceptional scenario makes the current state look even more awkward.
I imagine the basic execution flow should be something like:
if (findJobInstance(job, params) == null) {
    createJobInstance()
    setupVanillaExecution()
} else {
    restoreFromLastExecution()
}
execute() // no dealing with past here anymore, just straightforward execution
rather than the current:
1. findOrCreateJobInstance(..)
2. execute(..) // checking what scenario I'm in all the time
The proposed approach clearly points the finger at the DDD-hostile findOrCreateJobInstance(..) method, which should be broken in two, but it also moves towards removing the lastExecution and executionCount properties from JobInstance, as they become relevant only in the narrow context of restoreFromLastExecution().
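For illustration, a minimal sketch of the split repository contract described above, with hypothetical simplified types (String keys in place of JobParameters):

```java
// Hypothetical sketch of splitting the find-or-create method in two, so that the
// caller makes the new-versus-existing decision explicitly instead of the
// repository hiding it.
interface SplitJobRepository {
    // returns null if this Job + JobParameters combination has never been run
    SketchInstance findJobInstance(String jobName, String parametersKey);

    // fails if an instance already exists for this Job + JobParameters combination
    SketchInstance createJobInstance(String jobName, String parametersKey);
}

// Minimal stand-in type so the sketch is self-contained.
class SketchInstance {
    final String jobName;
    final String parametersKey;
    SketchInstance(String jobName, String parametersKey) {
        this.jobName = jobName;
        this.parametersKey = parametersKey;
    }
}
```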
Issue Links:
BATCH-375 separate optional features from core framework