
Deprecate the Map-based JobRepository/JobExplorer implementations #3780

Closed
fmbenhassine opened this issue Sep 16, 2020 · 0 comments

The Map-based job repository was never intended for production use. However, even though this is clearly documented, people use (or, more precisely, misuse) it in production and complain about thread-safety and performance issues.

When there is no need to persist metadata, we have always recommended using the JDBC-based job repository with an in-memory database; this is a recurring recommendation in our answers on Stack Overflow.
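As an illustration, here is a minimal sketch of that recommendation (the class and bean names are ours; `JobRepositoryFactoryBean` and the `schema-h2.sql` script are the ones shipped with Spring Batch):

```java
import javax.sql.DataSource;

import org.springframework.batch.core.repository.JobRepository;
import org.springframework.batch.core.repository.support.JobRepositoryFactoryBean;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.jdbc.datasource.embedded.EmbeddedDatabaseBuilder;
import org.springframework.jdbc.datasource.embedded.EmbeddedDatabaseType;
import org.springframework.transaction.PlatformTransactionManager;

@Configuration
public class InMemoryJdbcRepositoryConfiguration {

    // embedded H2 database initialized with the batch metadata schema
    @Bean
    public DataSource batchDataSource() {
        return new EmbeddedDatabaseBuilder()
                .setType(EmbeddedDatabaseType.H2)
                .addScript("/org/springframework/batch/core/schema-h2.sql")
                .generateUniqueName(true)
                .build();
    }

    // JDBC-based job repository on top of the in-memory database:
    // metadata stays in memory, but goes through the supported JDBC code path
    @Bean
    public JobRepository jobRepository(DataSource batchDataSource,
            PlatformTransactionManager transactionManager) throws Exception {
        JobRepositoryFactoryBean factory = new JobRepositoryFactoryBean();
        factory.setDataSource(batchDataSource);
        factory.setTransactionManager(transactionManager);
        factory.afterPropertiesSet();
        return factory.getObject();
    }
}
```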

The Map-based job repository suffers from many drawbacks:

1. Lack of Thread Safety

Even though some of the Map-based DAOs are backed by thread-safe data structures, this job repository is not safe to use in a multi-threaded job with splits, as mentioned in its Javadoc.

2. Poor Performance

The Map-based job repository is very slow in a partitioned step. This is due to the jobRepository.saveAll(stepExecutions) call in StepExecutionSplitter, which takes 20+ minutes (and fails with an OOM error, even with -Xmx8g) for 5000 partitions, versus only 0.42 seconds with the JDBC job repository and an embedded H2 database (see attached benchmark [1]). It is also due to the creation of several copies of step and job execution data through reflection, and to the serialization and deserialization of execution contexts.

Using a partitioned step is a very common use case, and many people have been hit by this performance issue (for example, see "Step initialization time too long using Partitioner in Spring-Batch?").

3. Inflexibility

Since the Map-based job repository is the default, some people keep using it in a 24/7 running JVM with all their jobs in it. This leads to huge memory consumption by the job repository, so people eventually want to clean up metadata older than a given date or remove the metadata of a specific job.

This is impossible with the Map-based job repository, as it provides a single clear() method that wipes the entire entity graph. It is, however, possible with an in-memory database (you can get a handle to the DataSource and run any deletion query).
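For example, removing the metadata of a single job is a plain JDBC matter once a DataSource is available. A hedged sketch (the class and method names are ours; the BATCH_ table names and their parent/child relationships come from the standard Spring Batch schema):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import javax.sql.DataSource;

public class BatchMetadataCleaner {

    // child tables first, so the foreign keys of the standard BATCH_ schema
    // are honored; the remaining child tables (BATCH_STEP_EXECUTION_CONTEXT,
    // BATCH_STEP_EXECUTION, BATCH_JOB_EXECUTION_CONTEXT,
    // BATCH_JOB_EXECUTION_PARAMS) would be cleaned the same way before the
    // two statements below
    static final String[] DELETE_STATEMENTS = {
        "DELETE FROM BATCH_JOB_EXECUTION WHERE JOB_INSTANCE_ID IN "
            + "(SELECT JOB_INSTANCE_ID FROM BATCH_JOB_INSTANCE WHERE JOB_NAME = ?)",
        "DELETE FROM BATCH_JOB_INSTANCE WHERE JOB_NAME = ?"
    };

    // deletes all metadata of the given job from the batch DataSource
    public static void deleteJobMetadata(DataSource dataSource, String jobName) throws Exception {
        try (Connection connection = dataSource.getConnection()) {
            for (String sql : DELETE_STATEMENTS) {
                try (PreparedStatement statement = connection.prepareStatement(sql)) {
                    statement.setString(1, jobName);
                    statement.executeUpdate();
                }
            }
        }
    }
}
```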

4. Incompatibility with Spring Boot

In Spring Batch, we claim that you can run a job without a data source. However, Spring Boot requires a DataSource.

This inconsistency is confusing and leads to a poor user experience on start.spring.io: people coming from the batch world (where the DataSource is optional) who want to migrate to Boot download a project with only the Batch dependency and expect things to work out of the box. Unfortunately, this is not the case; see example 1 and example 2.

5. Confusing Configuration

@EnableBatchProcessing does a good job of setting up batch artifacts, including the default Map-based job repository. However, it does so only when the application context does not contain a DataSource bean. As soon as you have a DataSource but do not want to use it for batch metadata, things become complicated and confusing for many people, even though the documentation says to use a custom BatchConfigurer in this case.
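A sketch of that documented approach (the class name is ours; DefaultBatchConfigurer and its DataSource constructor are from Spring Batch 4):

```java
import org.springframework.batch.core.configuration.annotation.DefaultBatchConfigurer;
import org.springframework.context.annotation.Configuration;
import org.springframework.jdbc.datasource.embedded.EmbeddedDatabaseBuilder;
import org.springframework.jdbc.datasource.embedded.EmbeddedDatabaseType;

// routes batch metadata to a dedicated embedded database, leaving the
// application's main DataSource untouched
@Configuration
public class DedicatedBatchConfigurer extends DefaultBatchConfigurer {

    public DedicatedBatchConfigurer() {
        super(new EmbeddedDatabaseBuilder()
                .setType(EmbeddedDatabaseType.H2)
                .addScript("/org/springframework/batch/core/schema-h2.sql")
                .build());
    }
}
```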

"How can I use the Map-based job repository with an application context that contains a DataSource?" is one of the most frequently asked questions on Stack Overflow, GitHub, and Gitter.

It is concerning that people end up resorting to an ugly empty setter for the DataSource.
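For reference, that workaround typically looks like the following sketch (the class name is ours): overriding DefaultBatchConfigurer.setDataSource with an empty body so that no DataSource ever reaches the batch configuration.

```java
import javax.sql.DataSource;

import org.springframework.batch.core.configuration.annotation.DefaultBatchConfigurer;
import org.springframework.context.annotation.Configuration;

@Configuration
public class MapRepositoryBatchConfigurer extends DefaultBatchConfigurer {

    @Override
    public void setDataSource(DataSource dataSource) {
        // intentionally left empty: with no DataSource set, the Map-based
        // job repository is used even though the context contains a
        // DataSource bean
    }
}
```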

Conclusion

In sum, the Map-based job repository (and all the Map-based DAOs behind it) creates more problems than it solves.
For all these reasons, we plan to deprecate it in v4.3 and remove it in v5.

Now what is the alternative? The alternative is to use the JDBC-based job repository with an in-memory database. For production, this should not be an issue: any production-grade application should already define a DataSource that can be used for batch processing. If you have no need to persist or use batch metadata, you can always define another, embedded DataSource and use it for batch (Spring Boot provides @BatchDataSource to make doing so easy), or provide a "NoOp" implementation of the JobRepository interface (as long as it honors the contract). For testing and prototyping, you can use an embedded database (with Boot, this is as simple as putting one of the supported embedded databases on the classpath) or a containerized one (using testcontainers.org, for instance).
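With Boot, the dedicated-DataSource option can be sketched as follows (the class and bean names are ours; @BatchDataSource marks the DataSource that Boot's batch auto-configuration should use, and it assumes the application also defines a primary DataSource for its own data):

```java
import javax.sql.DataSource;

import org.springframework.boot.autoconfigure.batch.BatchDataSource;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.jdbc.datasource.embedded.EmbeddedDatabaseBuilder;
import org.springframework.jdbc.datasource.embedded.EmbeddedDatabaseType;

@Configuration
public class BatchDataSourceConfiguration {

    // secondary, embedded DataSource reserved for batch metadata;
    // the application's primary DataSource remains free of BATCH_ tables
    @Bean
    @BatchDataSource
    public DataSource batchDataSource() {
        return new EmbeddedDatabaseBuilder()
                .setType(EmbeddedDatabaseType.H2)
                .addScript("/org/springframework/batch/core/schema-h2.sql")
                .generateUniqueName(true)
                .build();
    }
}
```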


[1] Job repository benchmark: JobRepositoryBenchmark.zip
