job/step setup should be separated from execution [BATCH-372] #3207

Closed
spring-projects-issues opened this issue Feb 19, 2008 · 25 comments
Labels
status: declined Features that we don't intend to implement or Bug reports that are invalid or missing enough details type: bug
Milestone

Comments

Robert Kasanicky opened BATCH-372 and commented

Currently setup of the job/step execution (figuring out whether it should be run and whether it is being restarted) is part of the execute() method rather than being dealt with separately. This has some bad consequences - even when we create a fresh job instance, both job and step investigate their execution counts and last executions. It doesn't make sense as it is clear execution count must be zero and there is no last execution. Assuming a job instance is usually executed successfully and restart is an exceptional scenario makes the current state look even more awkward.

I imagine the basic execution flow should be something like:

if (findJobInstance(job, params) == null) {
    createJobInstance()
    setupVanillaExecution()
}
else {
    restoreFromLastExecution()
}
execute() // no dealing with the past here anymore, just straightforward execution

rather than current:

1. findOrCreateJobInstance(..)
2. execute(..) // checking what scenario I'm in all the time

The proposed approach clearly points the finger at the DDD-hostile findOrCreateJobInstance(..) method, which should be broken in two, but it also moves towards removing the lastExecution and executionCount properties from JobInstance, as they become relevant only in the narrow context of restoreFromLastExecution().
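The proposed flow can be sketched in plain Java. Everything below (the class name, the in-memory instance "table", the returned labels standing in for setupVanillaExecution/restoreFromLastExecution) is hypothetical and only illustrates the start-vs-restart split, not real framework API:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: the instance "table" is an in-memory map, and the
// setup/restore steps are modeled as returned labels rather than real calls.
class ProposedLaunchFlow {
    // Maps "jobName|params" to the number of executions so far.
    private final Map<String, Integer> instances = new HashMap<>();

    /** Returns which path was taken, for illustration only. */
    String launch(String jobName, String params) {
        String key = jobName + "|" + params;
        if (!instances.containsKey(key)) {
            instances.put(key, 1);  // createJobInstance() + setupVanillaExecution()
            return "fresh";         // execute() needs no knowledge of the past
        }
        instances.merge(key, 1, Integer::sum); // restoreFromLastExecution()
        return "restart";
    }
}
```

The fork happens exactly once, before execution; after that, execute() never asks which scenario it is in.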


Issue Links:

  • BATCH-375 separate optional features from core framework

Robert Kasanicky commented

Correction: the findOrCreateJobInstance repository method mentioned in the description no longer exists; it has been replaced by the (equally vague) createJobExecution, which hides the difference between an existing and a new job instance.

Dave Syer commented

I agree that there is some awkwardness with the decision about whether we are re-starting or starting. But createJobExecution() has to be atomic, in case someone else is trying to make the same decision somewhere else, even on another physical machine sharing the same database. So the code you wrote above is probably already in the implementation of createJobExecution(). I think we have to stick with it (or maybe the logic can be tidied up?).

Maybe you could take a look at the equivalent pattern with StepExecution, which doesn't need to be atomic for the same reason, but might for another. There are some methods in the Step interface that seem out of place, and they are only there so that a Job implementation can decide whether to run the Step based on some configuration. Maybe that could be more encapsulated?

Robert Kasanicky commented

My main idea is that we should handle all restart-related concerns at once, in something like a "job initialization phase" - once the job starts processing records it wouldn't need to look back anymore. Namely, steps wouldn't deal with getting restart data at all (why should they anyway? When ChunkedStep was implemented it had to duplicate this logic from TaskletStep). I think this translates to creating step executions (with status NOT_YET_RUN or whatever) together with the job execution, i.e. when we know whether we are creating a job instance or restarting.

Maybe a useful point of view is this: we have an item processing framework with persistent execution context => it allows us to build restart around it rather than into it. The same applies e.g. for statistics

Lucas Ward commented

I think I agree with Dave: this should really be handled by the JobRepository. However, I think it needs to be a bit more of a change from the bottom up. The problem right now is that we need to check whether or not we've restarted to determine whether or not we call restoreFrom(ExecutionContext), and I think this is a large part of the problem. I would go with Dave's suggestion of creating the StepExecution in the Repository like we do with the JobExecution. Further, I would give it the ExecutionContext from the last execution, assuming there was one. If not, I would put an empty one in it. I think we could then drop restoreFrom and just have open(ExecutionContext), which we'll call regardless.

After all, why do we care? The ItemReader and ItemWriter can easily make this determination. If the ExecutionContext is empty, or the key it's looking for isn't there, it assumes it should start from 0.

It also helps with another issue I've run into with clients. For example, if someone is reading from a database and writing to a file, and using the process-indicator approach on the read, our current approach has issues. This is because the 'saveExecutionContext' attribute is on the Step. It could just as easily be on the readers and writers themselves. If the value is false, the reader or writer simply won't add its state to the context. In this way, the step doesn't need to care about determining whether it was a restart, since it always gives the context to the streams, leaving the repository to ensure that the correct context is attached in restart scenarios.
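The "call open regardless" idea can be sketched as follows. ToyExecutionContext and the "reader.position" key are invented stand-ins for illustration, not the real Spring Batch types:

```java
import java.util.HashMap;
import java.util.Map;

// Invented stand-in for the real ExecutionContext, kept minimal on purpose.
class ToyExecutionContext {
    private final Map<String, Object> data = new HashMap<>();
    void putInt(String key, int value) { data.put(key, value); }
    boolean containsKey(String key) { return data.containsKey(key); }
    int getInt(String key) { return (Integer) data.get(key); }
}

// A reader that is handed a context on every start and decides for itself
// whether it is resuming: no restoreFrom method, no restart flag anywhere.
class PositionAwareReader {
    private static final String KEY = "reader.position"; // invented key name
    private int position;

    void open(ToyExecutionContext context) {
        // Empty context, or missing key: start from 0. Otherwise resume.
        position = context.containsKey(KEY) ? context.getInt(KEY) : 0;
    }

    int getPosition() { return position; }
}
```

The caller always invokes open(context); the fresh-start and restart cases differ only in what the repository put into the context.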

Robert Kasanicky commented

I think one of the subtleties involved is that we should invert the relationship between JobExecution and StepExecution - a JobExecution consists of StepExecutions, while a StepExecution does not need to care which JobExecution it belongs to. This gives us a self-contained JobExecution that can be run. Currently JobExecution creates StepExecutions, which makes no sense to me.

Concerning open(ExecutionContext): it's essentially the same as calling restoreFrom(..) under all circumstances. I think it would eventually blur the picture that is just beginning to become clear - that the first run (and maybe re-run) is a fundamentally different scenario from restart, and we need to reflect that in the design and execution logic, instead of having a single "main line" that adjusts on the fly.

I'm also starting to like the idea of having separate launch methods for starting and restarting a job - it makes sense to me that the user should say what he wants rather than the framework interpreting the same command differently according to circumstances. The current mechanism seems to be nonsense anyway - if a job is not restartable, a new job instance is always created, resulting in separate job instances that differ only in id - I have no sensible explanation of what a job instance represents here. On the other hand, you can't re-run a restartable job instance, you can only restart it - I think re-run is a sensible requirement, e.g. you lost the output data and want to recreate it (I guess isAllowStartIfComplete is an unfortunate ad-hoc fix for this).

Lucas Ward commented

Robert, Wayne, Scott, Doug and I had a long discussion about this issue today and came up with the following relative consensus:

The comments by Robert and me represent two distinct issues:

The first is how StepExecutions are created and how their execution contexts are loaded. Everyone seemed to agree that it makes the most sense for StepExecutions to be created by the Repository. JobExecutions are already handled this way, so symmetry would be restored; also, the Repository could then take responsibility for putting the correct ExecutionContext onto the StepExecution when it is created.

The second issue is that of how restart is handled from the JobInstance perspective. First, I'll summarize the present state:

When someone launches a job, the JobParameters are compared with the JobParameters that have been used for any previous instances of the Job, and if a match exists, a new JobExecution is created and tied to the existing JobInstance. If the steps were all complete (and not set to allowStartIfComplete), then nothing will happen. However, if not, the job should restart from where it left off. This all assumes restartable = true.

If restartable equals false, no matter what happens, a new JobInstance will be created. This means, in theory, there could be an infinite number of JobInstances for the same job with the same JobParameters. Robert very correctly pointed out that this is a bit weird, since we've always maintained that Job + JobParameters = JobInstance. This is broken by the above.

Robert's solution outlined above was generally not received well. Creating two execution paths from the Launcher 'smells funny'. This is because Robert's basic position is that you should be able to have a JobInstance with any number of executions, irrespective of whether or not they are successful. It would have to be managed by the client, as they would be the only one who knows whether or not something should be a restart. That is because, with this approach, there is no way to determine from simply looking at the meta-data whether or not something is a restart, so everyone using the framework would have to make the determination. This would be problematic because it breaks standard batch patterns. For example, there is no 'restart' or equivalent in JCL. Further, due to the 'headless' nature of batch, the thing launching the job may not really know whether or not it should be a restart. This would require manual intervention in every case where a failure happens. Furthermore, it would require information about restart to percolate all the way up the call stack. That is, the JobLauncher interface would need runForRestart, as would the Job, Step, and on down to the ItemStreams. That would be the only way to run the job and ensure everyone knows it's a restart as opposed to a normal launch.

It's much easier to let the repository look back and pull data if it needs to, using the approach I outlined in an earlier comment. Having said all of that, there may still be merit in providing a manual method on the Launcher for restart, but its sole purpose would have to be indicating that a new Instance should be created.

This still leaves the problem of 'restartable' and how it breaks what a JobInstance is. It was mentioned that the use case 'restartable' is trying to solve could easily be addressed by tacking on another JobParameter. However, Doug surmised that having one 'top-level toggle' to always indicate that a new instance should be created is useful in many scenarios. It was thought, then, that it would be just as easy for the Repository to tack on the extra parameter in the case of restartable=false (or whatever it ends up being named). That way, each instance would still have a different parameter.

All of this confusion essentially stems from a poor definition of the JobInstance. It represents a uniquely identifiable job run that could have many executions. One could look at the JobExecutions and see "fail, fail, interrupted, completed" and get a complete idea of the lifespan of a job. However, if you don't plan to hold restart data, it's just as good to have one instance per run.

Wayne Lund commented

Nice summary by Lucas, but I wanted to summarize the actions for Robert and others:

  1. Robert will resolve the issue by adding restartCount = current++ in the case of restart = false, so that a JobInstance will now consistently be JobInstance = Job + JobParameters
  2. Robert will update the JavaDocs of JobInstance to bring the docs up to date with this clarified definition.
  3. Doug will fix the HOWTOs to show the various restart scenarios and the various permutations between job and step configurations on restartability (e.g. global on off switches, etc.)
  4. Wayne will fix the docbook glossary and include the HOWTOs at some point (not sure where its going into the docbook yet).

Can everybody stack hands on that?

Douglas C. Kaminsky commented

I agree. To maintain the JobName+JobParameters uniqueness invariant, the new job instance in the case where the job is not restartable should have an automatically injected parameter from the job repository.

My hand's in, but mine is not really the important one here ;)

On another note, should we keep a list of agreed-upon invariants to aid in development?

Robert Kasanicky commented

I still feel uncomfortable about the proposed solution. Some thoughts to consider:

  1. Using technical tricks like automatically appending parameters to create a unique JobInstance makes alarm bells ring - we should not be trying to do clever, arbitrary things in classes that form the foundation of the framework. If a job instance has a lifecycle that ends with the first successful execution, we should refuse to start the same job instance ever after.

  2. The repository is quickly becoming a magic box with huge responsibility. My humble bet is that Eric Evans has hardly ever seen a repository as capable as our SimpleJobRepository. My understanding is that a repository is a data store abstraction - it should provide and persist data, i.e. do what it is told to do without thinking, rather than cooking up magic recipes. Its purpose is to hide persistence specifics, not domain semantics.

  3. I understand that job execution creation needs to be atomic, but I don't think that implies the whole logic around creating a job execution must be handled by the repository.

Robert Kasanicky commented

Maybe we are missing the concept of JobExecutor? When launching for restart a RestartJobExecutor would be instantiated that would find the crashed step and restore its execution context (only for this single step!). It would ignore successfully completed steps, restart the crashed step and run the following steps. DefaultJobExecutor instantiated on ordinary launch would just run steps in order.

For those who don't like having a separate restart launcher we can always put a facade in front of the two launchers to make the decision on behalf of the user.

Lucas Ward commented

Robert,

I have two points:

  1. Creating separate execution paths for restart and normal processing is a bad solution. There was very good reason to remove the executor interface, and I think the kind of change you're suggesting is a step backwards. I still feel that having the jobs and steps act the same regardless of restart is the best option. As mentioned above, pushing this knowledge down to the readers and writers makes much more sense to me.

  2. The repository should have all of the responsibility you mention. Your argument is perhaps more for a rename of the class (it's definitely not simple) than for a real redesign. It boils down to one irrefutable fact: if the repository is incapable of determining on its own whether or not a job instance is being restarted, then the responsibility has to be pushed to the client, which is unacceptable. I know we discussed there potentially being a wrapper, but if the repository can't figure it out, then neither could any type of wrapper or any other code we could write. Any users of our framework would have to write code to figure it out, or we would require users to know the difference, which in most cases they wouldn't. The repository code for making the determination really isn't that complex; it's a fairly trivial calculation. In my mind, it's still much preferable to create some rules about how we determine that a JobInstance is being restarted than to require users to do it for us.

Robert Kasanicky commented

Lucas,

answer to your two points:

  1. I understand you prefer a single execution "main line", but I don't understand why. I would much prefer two execution paths that make perfect sense to a single path with poor semantics - a single execution path can't avoid asking questions about which of the two start/restart scenarios applies, although the answer is clear from the moment the job is up and running. To emphasize: the two execution paths needn't be visible to users if we don't want them to be. Internally we simply have to deal with two scenarios - why not reflect that clearly in the implementation?

  2. I don't see why the repository should have responsibility beyond persisting data and fetching persisted data. I think a "Repository" shouldn't be "figuring something out" and making decisions - we should have a separate class (maybe JobExecutor, maybe something else) that encapsulates the decision making and has a reference to a dumb repository. I see separation of concerns as the issue here.

nebhale commented

(In reply to comment #9)

  1. Using technical tricks like automatically appending parameters to create
    unique JobInstance makes alarm bell ring - we should not be trying to do clever
    arbitrary things in classes that form the foundation of the framework. If job
    instance has a lifecycle that ends with first successful execution we should
    refuse to start the same job instance ever after.

I'm totally in agreement here. We shouldn't need to be clever in this case, but rather very clear and explicit. I personally don't have a problem with Robert's described behavior here, but at the same time, I don't really have a strong opinion either way. The most important thing is that the framework isn't being clever behind the scenes in a way that will have to be documented and understood by the users. I can only imagine what happens when a user tries to query for a Job Id and finds that we've mucked with it.

  2. The repository is quickly becoming a magic box with huge responsibility.

This is actually a huge concern of mine as well. I'm with Robert on the idea that a repository is for data access only, not to encapsulate any sort of business behavior.

  3. I understand job execution creation needs to be atomic, but I don't think it implies the whole logic around creating job execution must be handled by repository.

Agreed! If we're worried about atomicity in a single VM, we can synchronize at a higher level of code, rather than trying to bundle it completely under the repository. If we're worried about atomicity between multiple VMs, then proper database locking is the answer here.

nebhale commented

(In reply to comment #10)

Maybe we are missing the concept of JobExecutor?

I don't think that's quite necessary. I like the idea of not having multiple execution implementations (for this purpose, different environments are another question).

For those who don't like having a separate restart launcher we can always put a
facade in front of the two launchers to make the decision on behalf of the user.

I don't think a separate launcher is necessary either. I see a situation where the sole launcher has some well defined semantics with the possibility for explicit override (--force equivalents). But inside the launcher it does the following:

  1. Request DB lock on the Job tables
  2. Attempt to retrieve a Job based on an identifier (passed in or generated)
  3. If null is returned, create a new Job and persist it
  4. Release lock
  5. Execute Job

In a situation like this the job is protected across different VMs, but it doesn't require us to push all of that behavior down into the repository, or split launchers/executions.
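The five-step launcher flow above might look roughly like this sketch, where an in-JVM monitor stands in for the database-level lock (so it only protects a single VM; real cross-VM safety would need the DB lock as described). All names here are invented:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the launcher flow; the synchronized block stands in for a database
// lock, so this only illustrates the shape of the logic, not real atomicity.
class LockingLauncher {
    private final Map<String, String> jobTable = new HashMap<>();
    private final Object lock = new Object(); // stand-in for a lock on the Job tables

    String launch(String jobId) {
        String job;
        synchronized (lock) {              // 1. request lock on the Job tables
            job = jobTable.get(jobId);     // 2. attempt to retrieve the Job
            if (job == null) {             // 3. if null, create and persist it
                job = "JobInstance[" + jobId + "]";
                jobTable.put(jobId, job);
            }
        }                                  // 4. release lock
        return "executing " + job;         // 5. execute the Job outside the lock
    }
}
```

Note that execution happens after the lock is released, keeping the critical section as small as possible.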

nebhale commented

  1. Creating separate execution paths for restart and normal processing is a bad
    solution.

I think this can be clarified a bit. I think there should be a single line of execution, but there should be a decision block very early that either avoids or gets the system to the same state that it would be in on a fresh start. Then the main-line execution can run.

It boils down to one irrefutable fact: If the repository
is incapable of determining on it's own whether or not a job instance is being
restarted, then the responsibility has to be pushed to the client, which is
unacceptable.

I refute this fact! :) There is an intermediary between the client and the repository and that's the launcher. I think the launcher should be responsible for attempting to retrieve an existing instance and if one doesn't exist, creating a new one via a factory. As discussed earlier, this can be done in a nice atomic way all the way down to the database level.

Having the repository code for making the determination really isn't
that complex. It's a fairly trivial calculation.

Which is a great reason to move it out of the repository, which should be used for persistence behaviors.

In my mind, it's still much
much preferable to create some rules about how we determine a JobInstance is
being restarted, than to require users to do it for us.

The more I think about it, I think there should be a very simple default behavior (I'm still working out what I think would be best) with explicit alternate (simple) behaviors in the launcher. So for example, the launcher might always fail if an instance already exists unless the user puts in a --force argument. Then if the batch writer knew that they always wanted to restart existing instances (when they existed), they could just make sure that their cron did a --force.

Dave Syer commented

I don't want to comment on everything that's been said on this issue, but one thing concerns me the most: the repository very explicitly has responsibility for atomically creating the JobExecution (across all clients sharing the database, relying on database locking as described in the JavaDocs). This probably cannot be delegated to a higher level without either breaking the atomicity or extending the transaction boundary to cover more processing than necessary, making it more likely that a job will block and maybe time out unnecessarily.

Lucas Ward commented

"I refute this fact! :) There is an intermediary between the client and the repository and that's the launcher. I think the launcher should be responsible for attempting to retrieve an existing instance and if one doesn't exist, creating a new one via a factory. As discussed earlier, this can be done in a nice atomic way all the way down to the database level."

You actually didn't refute what I was saying. My point was that either we handle it somewhere, or the client (user) has to - and that client can be a scheduler or a manual process. However, I agree with Dave: it's difficult to keep atomicity if it's done in multiple steps through the launcher. But if there was a decent way to do it, I would have no issue with the Launcher having that logic.

"The more I think about it, I think there should be a very simple default behavior (I'm still working out what I think would be best) with explicit alternate (simple) behaviors in the launcher. So for example, the launcher might always fail if an instance already exists unless the user puts in a --force argument. Then if the batch writer knew that they always wanted to restart existing instances (when they existed), they could just make sure that their cron did a --force."

What's the difference between --force and job.restartable(volatile) = true? It looks about the same to me. I still don't like pushing these things to the users, which are inevitably batch schedulers and cron tabs. It's just generally not how batch is done. Sure, there are parameters for a job, and some are variable for business reasons, but having to provide an additional argument to a script because a job failed, in order to make it restart, is very different. However, I am okay with forcing an additional runtime parameter if you have a job that should normally restart from the position it was at when it failed, but you want to override that behavior and have it start from the beginning.

Robert Kasanicky commented

Concerning the "two execution paths" - maybe it is just a scary name for a completely innocent idea. What I'm really advocating is moving the restart concerns from step to job level. Job can coordinate steps in a meaningful way in restart scenario i.e. skip completed steps, restart just the single crashed step and run remaining steps in ordinary fashion. Rather than forking the execution logic for each step, we would just fork it once at the job level.

The effect would be:

  1. simplification of the overly complex step execution logic allowing the step to focus on processing items (which is hard enough)
  2. promoting job to a first-class citizen with non-trivial responsibility - rather than being a dumb collection of smart steps

It just makes sense, don't you think?
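That single job-level fork can be sketched as below; the step statuses and action labels are invented for illustration. The point is that the steps themselves never ask whether they are restarting:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Invented statuses and names; illustrates the single fork at the job level.
class RestartAwareJob {
    enum StepStatus { COMPLETED, FAILED, NOT_YET_RUN }

    /** Returns the action taken for each step, in order. */
    static List<String> run(Map<String, StepStatus> stepStates) {
        List<String> actions = new ArrayList<>();
        for (Map.Entry<String, StepStatus> entry : stepStates.entrySet()) {
            StepStatus status = entry.getValue();
            if (status == StepStatus.COMPLETED) {
                actions.add("skip " + entry.getKey());            // already done
            } else if (status == StepStatus.FAILED) {
                actions.add("restore+execute " + entry.getKey()); // the one crashed step
            } else {
                actions.add("execute " + entry.getKey());         // ordinary run
            }
        }
        return actions;
    }
}
```

On a fresh launch every step is NOT_YET_RUN, so the loop degenerates to plain sequential execution with no restart logic touched.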

Lucas Ward commented

I agree, the StepExecutor shouldn't have to think about it, but I think the solution is to load the StepExecution with an empty ExecutionContext if it's not a restart scenario, and with the context of the last execution if it is a restart. I'm absolutely okay with the Job making this determination instead of the repository, but I don't like having restoreFrom(EC) on the Step, or even executeForRestart(StepExecution).

nebhale commented

What about simply never calling the Step at all if the Job determines that the step shouldn't be restarted? That removes the need for the no-op call and any logic for handling empty arguments, etc.

Robert Kasanicky commented

I would prefer a very explicit execution flow, i.e. having restoreFrom(EC) on Step. In a restart scenario the job wouldn't run completed steps at all, would call restoreFrom + execute on the crashed step, and would just call execute on the remaining steps. I think it would be straightforward to implement and the semantics would make perfect sense.

Using the open(EC) approach on readers and writers means making lots of redundant if-checks in the open(EC) method, and I don't really see what benefit it brings. I think it artificially binds together purely technical initialization with restoring from saved state (= semantically significant logic). Fundamentally it means pushing restart concerns from the Step down to the ItemStream, rather than up to the Job. I'd say it means cleaning up the step at the expense of polluting the ItemStream.

Consequences for number of "execution paths":

  1. Job handles restart (what I would love to see) => single semantically sound fork in execution logic
  2. Step handles restart (current) => fork in every step, no matter whether job was actually restarted
  3. ItemStream handles restart (what we're moving towards) => fork in every restartable ItemStream, no matter whether job was actually restarted
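Option 1 can be sketched like this: restoreFrom(..) is a separate, explicit call that the job makes only on the crashed step, while execute() itself never looks back. All names and the position-based state are invented for illustration:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a step with an explicit restore hook; invented API, not Spring Batch's.
class ExplicitRestartStep {
    private final String name;
    private int startFrom = 0;
    private final List<String> log = new ArrayList<>();

    ExplicitRestartStep(String name) { this.name = name; }

    // Called by the job only in a restart scenario, only on the crashed step.
    void restoreFrom(int savedPosition) {
        startFrom = savedPosition;
        log.add("restored " + name + " at " + savedPosition);
    }

    // Straightforward execution: no restart checks anywhere in here.
    void execute() {
        log.add("executed " + name + " from " + startFrom);
    }

    List<String> getLog() { return log; }
}
```

A fresh step just gets execute(); only the restarted step ever sees restoreFrom, so the fork is visible exactly once, in the caller.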

nebhale commented

I definitely agree with number 1 Robert.

Lucas Ward commented

Robert and Ben,

I cannot disagree more. In number 3, you're asserting a 'fork' in every ItemStream, which isn't really the case. Further, the Job can still handle restart even with the current approach, by determining whether the ExecutionContext should be loaded from a previous one - that's fine, and if it makes sense for that to live there rather than in the repository, then that's fine too. However, I disagree with the argument that the ItemStreams are 'forking' in the current approach. With what you're advocating, the Step and ItemStream interfaces would have to look like the following:

Step:

restart(ExecutionContext executionContext)

ItemStream:

open(ExecutionContext ex)
openForRestart(ExecutionContext ex)

It should be noted that open would still be needed because of the recent change where there is one ExecutionContext per step, rather than trying to aggregate disparate ones. This is a separate issue, of whose value I'm definitely convinced. Given that, the only advantage of having the two methods on ItemStream (plus the extra one on Step) is that open(EC) wouldn't need to do a contains-key check. I think this very trivial advantage is greatly outweighed by the disadvantages of having the two extra methods. I just don't see why the ItemStream should care. To me, this is much more 'forked', and in a worse way: the API now has to know about the fork. And the logic in both approaches gains nothing from the fork in the API:

One Open:

open(EC){
    if(containsRestartKey(EC)){
        openAt(EC.location());
    }
    else{
        openAtBeginning();
    }
}

'Forked' ItemStream:

open(EC){
    openAtBeginning();
}

openForRestart(EC){
    if(containsRestartKey(EC)){
        openAt(EC.location());
    }
    else{
        throw new Exception("Restart key wasn't found");
    }
}

To me the bottom approach is much more complicated, and even worse, has to add the complication to the API as well.

Wayne Lund commented

I was surprised to see this issue still getting so much commentary. I'm going to jump back to Robert's initial pushback here:

"1. Using technical tricks like automatically appending parameters to create unique JobInstance makes alarm bell ring - we should not be trying to do clever arbitrary things in classes that form the foundation of the framework. If job instance has a lifecycle that ends with first successful execution we should refuse to start the same job instance ever after."

This is a very confusing response to me. We established before that there are times when operators, for whatever reason, want to restart a job with the same instance. It could be something as stupid as accidentally deleting the output file - so it's like getting a new copy. This isn't a violation of the job instance lifecycle; it respects it by requiring the user to specify through job parameters that they really intended to do this. If someone has a cleaner way to do it I'm certainly open to that, but I'm not hearing any alarm bells.

"2. The repository is quickly becoming a magic box with huge responsibility. My humble bet is Eric Evans has hardly ever seen a repository as capable as our SimpleJobRepository. My understanding is the repository is a data store abstraction - it should provide and persist data i.e. do what it is told to do without thinking, rather than cooking magic recipes. It's purpose is to hide persistence specifics not domain semantics."

The repository is already more of a service object than a DAO, so I agree with your comments about our capable repository. But the line between persistence specifics and domain semantics is very subjective, and I don't have a problem with some logic in the repository, keeping the transactional semantics as close to the data as possible.

"3. I understand job execution creation needs to be atomic, but I don't think it implies the whole logic around creating job execution must be handled by repository."

Dave Syer commented

This has been a really interesting discussion. The consensus on the status call today was that the repository is part of the domain (not just a DAO), so it is appropriate for it to reason about domain objects. Hence we close this one as "won't fix" (there is another interesting thread on the side about whether the ExecutionContext can be handled completely separately from the StepExecution at the repository level).

@spring-projects-issues spring-projects-issues added type: bug status: declined Features that we don't intend to implement or Bug reports that are invalid or missing enough details labels Dec 17, 2019
@spring-projects-issues spring-projects-issues added this to the 1.0.0.m5 milestone Dec 17, 2019