High memory consumption during long running jobs #3790

Closed
PauliusPeciura opened this issue Oct 12, 2020 · 13 comments · May be fixed by #4705
Labels
for: backport-to-4.3.x, has: minimal-example, has: votes, in: core, related-to: performance, type: enhancement

@PauliusPeciura

PauliusPeciura commented Oct 12, 2020

Bug description
We found that memory consumption is fairly high on one of the service nodes that uses Spring Batch. Even though both data nodes did a similar amount of work, memory consumption across the nodes was uneven: 15GB vs 1.5GB (see the memory use screenshot).

Some of our jobs run for seconds while others might run for hours, so we set the polling interval (MessageChannelPartitionHandler#setPollInterval) to 1 second rather than the default of 10 seconds. In a long-running job scenario, we ended up creating 837 step executions.

What I found was that MessageChannelPartitionHandler#pollReplies gets a full StepExecution representation for each step. Each StepExecution contains a JobExecution, which in turn contains the StepExecutions of every step in the job. However, they are retrieved at different times and stages, so the instances are not shared. This means that we end up with a quadratic number of StepExecution objects, e.g. 837 * 837 = 700,569 StepExecutions (see screenshot below).
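
To make the arithmetic concrete, here is a minimal sketch of how the object graph grows. The classes are simplified stand-ins invented for illustration (they are not the real Spring Batch `JobExecution` and `StepExecution`), and the sketch assumes that every lookup builds a fresh job execution with all of its sibling steps, as described above.

```java
import java.util.ArrayList;
import java.util.List;

class QuadraticGraphSketch {

    // Hypothetical stand-in for a job execution: it owns the list of its steps.
    static final class SketchJobExecution {
        final List<SketchStepExecution> stepExecutions = new ArrayList<>();
    }

    // Hypothetical stand-in for a step execution: it points back to its job.
    static final class SketchStepExecution {
        final long id;
        final SketchJobExecution jobExecution;

        SketchStepExecution(long id, SketchJobExecution jobExecution) {
            this.id = id;
            this.jobExecution = jobExecution;
        }
    }

    // Assumed behaviour: loading one step builds a NEW job execution
    // together with all of its sibling step executions.
    static SketchStepExecution loadStepExecution(long stepId, int totalSteps) {
        SketchJobExecution job = new SketchJobExecution();
        SketchStepExecution requested = null;
        for (long id = 0; id < totalSteps; id++) {
            SketchStepExecution step = new SketchStepExecution(id, job);
            job.stepExecutions.add(step);
            if (id == stepId) {
                requested = step;
            }
        }
        return requested;
    }

    // One polling pass that keeps a reference to every loaded step.
    static long countRetainedStepExecutions(int totalSteps) {
        List<SketchStepExecution> polled = new ArrayList<>();
        for (long id = 0; id < totalSteps; id++) {
            polled.add(loadStepExecution(id, totalSteps));
        }
        long retained = 0;
        for (SketchStepExecution step : polled) {
            retained += step.jobExecution.stepExecutions.size();
        }
        return retained;
    }

    public static void main(String[] args) {
        System.out.println(countRetainedStepExecutions(837)); // prints 700569
    }
}
```

Each of the 837 retained results keeps its own copy of the whole job alive, which is where the 837 * 837 figure comes from.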

Environment
Initially reproduced on Spring Batch 4.1.4.

Expected behavior
My proposal would be to:

  1. Issue a SQL query to get the count of running StepExecutions instead of retrieving DTOs. This way, fewer objects are loaded into the heap.
  2. Once all steps are finished, query for all StepExecutions of that job. We can then assign the same JobExecution to each step.
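
The two steps above could be sketched roughly as follows. This is only an illustration of the idea, not the code from the PR: `awaitThenLoad`, its parameters, and the SQL text are hypothetical, and the query assumes the default `BATCH_` table prefix and that 'STARTING', 'STARTED' and 'STOPPING' are the statuses that count as running.

```java
import java.util.List;
import java.util.function.LongSupplier;
import java.util.function.Supplier;

class CountThenLoadSketch {

    // Hypothetical count query; the status list is an assumption.
    static final String COUNT_RUNNING_SQL =
            "SELECT COUNT(*) FROM BATCH_STEP_EXECUTION "
            + "WHERE JOB_EXECUTION_ID = ? AND STATUS IN ('STARTING', 'STARTED', 'STOPPING')";

    // Polls a cheap count until no step is running, then loads the results once.
    // runningCount would run the count query; loadAll would fetch the step
    // executions of the job a single time, all sharing one job execution.
    static <T> List<T> awaitThenLoad(LongSupplier runningCount,
                                     Supplier<List<T>> loadAll,
                                     long pollIntervalMillis,
                                     long timeoutMillis) {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (runningCount.getAsLong() > 0) {
            if (System.currentTimeMillis() >= deadline) {
                throw new IllegalStateException("Timed out waiting for step executions");
            }
            try {
                Thread.sleep(pollIntervalMillis);
            } catch (InterruptedException e) {
                // Restore the interrupt flag and stop waiting.
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while polling", e);
            }
        }
        return loadAll.get();
    }
}
```

While the job is running, only a single number crosses the wire per poll, so the heap cost of polling no longer depends on the number of partitions.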

Memory usage graph comparison between two service nodes, doing roughly equal amount of work:

memoryUse - redacted

My apologies for a messy screenshot, but it does explain the number of StepExecution objects:

stepExecutions - redacted

@ssanghavi-appdirect

We are facing the same issue. When the number of steps in a job increases, it leads to an OOM that kills the manager JVM.
Is there a plan to fix this?

@fmbenhassine
Contributor

@PauliusPeciura Thank you for reporting this issue and for opening a PR! I would like to be able to reproduce the issue first in order to validate a fix if any. From your usage of MessageChannelPartitionHandler, I understand that this is related to a remote partitioning setup. However, you did not share your job/step configuration. Is a job with a single partitioned step configured with a high number of worker steps enough to reproduce the issue? Do you think the same problem would happen locally with a TaskExecutorPartitionHandler (this would be easier to test in comparison to a remote partitioning setup)? I would be grateful if you could share more details on your configuration or provide a minimal example.

@ssanghavi-appdirect Yes. If we can reproduce the issue in a reliable manner, we will plan a fix for one of the upcoming releases.

fmbenhassine added the status: waiting-for-reporter label and removed the status: waiting-for-triage label on Mar 22, 2021
@ssanghavi-appdirect

@benas I am able to reproduce this with TaskExecutorPartitionHandler as well. However, the fix provided by @PauliusPeciura is very specific to DB polling and won't fix what I reproduced with TaskExecutorPartitionHandler.
Basically, this issue can occur in any code path that holds references to StepExecution objects returned by JobExplorer.getStepExecution. Similar code exists in RemoteStepExecutionAggregator.aggregate() and MessageChannelPartitionHandler.pollReplies.

Scenario to reproduce: create a job with more than 900 remote partitions and wait for it to complete. If -Xmx is set, the manager JVM fails with an OOM; otherwise, memory consumption keeps increasing.
The issue can be reproduced with both MessageChannelPartitionHandler and TaskExecutorPartitionHandler. With MessageChannelPartitionHandler, we are able to reproduce it using both DB polling and the request-reply channel.

What is the most convenient way to share code that reproduces issue?

@ssanghavi-appdirect

Attaching a Spring Boot project that can reproduce the issue with TaskExecutorPartitionHandler. It requires Maven and Java 11 to run.

Steps to execute the program

  1. Download the attached zip file and extract the contents
  2. Navigate to the spring-batch-remoting directory created in step 1
  3. Build the project with Maven: mvn clean install
  4. Start java process with java -Xmx250m -jar target/spring-batch-remoting-0.0.1-SNAPSHOT.jar

spring-batch-remoting.zip

fmbenhassine added the has: minimal-example label on Mar 31, 2021
@cazacmarin

Will this picture help? It was taken with the latest Spring Batch version. Does it convince you that there is a memory leak inside?

image

@fmbenhassine
Contributor

Thank you all for your feedback here! This is a valid performance issue. There is definitely no need to load the entire object graph of step executions when polling the status of workers.

Ideally, polling for running workers could be done with a single query, and once they are all done, we should grab shallow copies of step executions with the minimum required to do the aggregation.
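
As a rough illustration of what a "shallow copy with the minimum required" might look like, the sketch below aggregates flat per-step summaries that hold no reference back to a job execution. `StepSummary`, `Totals`, and the chosen fields are hypothetical and only stand in for whatever the real aggregation needs. This is not the Spring Batch aggregator.

```java
import java.util.List;

class ShallowAggregationSketch {

    // Hypothetical flat view of one worker step: a few scalar fields,
    // no back-reference to a job execution or to sibling steps.
    static final class StepSummary {
        final long id;
        final String status;
        final long readCount;
        final long writeCount;

        StepSummary(long id, String status, long readCount, long writeCount) {
            this.id = id;
            this.status = status;
            this.readCount = readCount;
            this.writeCount = writeCount;
        }
    }

    // Hypothetical aggregate result for the partitioned step.
    static final class Totals {
        final boolean allCompleted;
        final long readCount;
        final long writeCount;

        Totals(boolean allCompleted, long readCount, long writeCount) {
            this.allCompleted = allCompleted;
            this.readCount = readCount;
            this.writeCount = writeCount;
        }
    }

    // Sums the counters and checks that every worker reported COMPLETED.
    static Totals aggregate(List<StepSummary> summaries) {
        boolean allCompleted = true;
        long read = 0;
        long written = 0;
        for (StepSummary summary : summaries) {
            allCompleted &= "COMPLETED".equals(summary.status);
            read += summary.readCount;
            written += summary.writeCount;
        }
        return new Totals(allCompleted, read, written);
    }
}
```

With summaries like these, memory use grows linearly with the number of partitions rather than with its square.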

I will plan the fix for the upcoming 5.0.1 / 4.3.8.

fmbenhassine added the for: backport-to-4.3.x label and removed the status: waiting-for-reporter label on Feb 22, 2023
fmbenhassine added this to the 5.0.1 milestone on Feb 22, 2023
fmbenhassine added the has: votes label on Feb 22, 2023
fmbenhassine pushed a commit that referenced this issue Feb 22, 2023
@galovics

galovics commented Mar 29, 2023

@fmbenhassine I'm afraid the issue is still present. I've checked the commit you made but since it's still working with entities, the associations are still there.

Here's a snapshot from a heap dump I've taken:
image
image
image

And here's the relevant stacktrace where the objects are coming from:

Scheduler1_Worker-1
  at java.lang.Thread.sleep(J)V (Native Method)
  at org.springframework.batch.poller.DirectPoller$DirectPollingFuture.get(JLjava/util/concurrent/TimeUnit;)Ljava/lang/Object; (DirectPoller.java:109)
  at org.springframework.batch.poller.DirectPoller$DirectPollingFuture.get()Ljava/lang/Object; (DirectPoller.java:80)
  at org.springframework.batch.integration.partition.MessageChannelPartitionHandler.pollReplies(Lorg/springframework/batch/core/StepExecution;Ljava/util/Set;)Ljava/util/Collection; (MessageChannelPartitionHandler.java:288)
  at org.springframework.batch.integration.partition.MessageChannelPartitionHandler.handle(Lorg/springframework/batch/core/partition/StepExecutionSplitter;Lorg/springframework/batch/core/StepExecution;)Ljava/util/Collection; (MessageChannelPartitionHandler.java:251)
  at org.springframework.batch.core.partition.support.PartitionStep.doExecute(Lorg/springframework/batch/core/StepExecution;)V (PartitionStep.java:106)
  at org.springframework.batch.core.step.AbstractStep.execute(Lorg/springframework/batch/core/StepExecution;)V (AbstractStep.java:208)
  at org.springframework.batch.core.job.SimpleStepHandler.handleStep(Lorg/springframework/batch/core/Step;Lorg/springframework/batch/core/JobExecution;)Lorg/springframework/batch/core/StepExecution; (SimpleStepHandler.java:152)
  at org.springframework.batch.core.job.AbstractJob.handleStep(Lorg/springframework/batch/core/Step;Lorg/springframework/batch/core/JobExecution;)Lorg/springframework/batch/core/StepExecution; (AbstractJob.java:413)
  at org.springframework.batch.core.job.SimpleJob.doExecute(Lorg/springframework/batch/core/JobExecution;)V (SimpleJob.java:136)
  at org.springframework.batch.core.job.AbstractJob.execute(Lorg/springframework/batch/core/JobExecution;)V (AbstractJob.java:320)
  at org.springframework.batch.core.launch.support.SimpleJobLauncher$1.run()V (SimpleJobLauncher.java:149)
  at org.springframework.core.task.SyncTaskExecutor.execute(Ljava/lang/Runnable;)V (SyncTaskExecutor.java:50)
  at org.springframework.batch.core.launch.support.SimpleJobLauncher.run(Lorg/springframework/batch/core/Job;Lorg/springframework/batch/core/JobParameters;)Lorg/springframework/batch/core/JobExecution; (SimpleJobLauncher.java:140)
  ...
  at org.springframework.scheduling.quartz.QuartzJobBean.execute(Lorg/quartz/JobExecutionContext;)V (QuartzJobBean.java:75)
  at org.quartz.core.JobRunShell.run()V (JobRunShell.java:202)
  at org.quartz.simpl.SimpleThreadPool$WorkerThread.run()V (SimpleThreadPool.java:573)

Note: this specific job can run for hours and processes a lot of data (millions of records). When the number of partitions exceeds roughly 500 (not an exact threshold), the manager slowly accumulates more and more memory.
As a mitigation, I've reduced the number of partitions to 36ish and now it doesn't fail. It probably still consumes more and more memory, but it finishes before it runs out of memory.

@fmbenhassine
Contributor

@galovics Thank you for reporting this.

I'm afraid the issue is still present. I've checked the commit you made but since it's still working with entities, the associations are still there.

We will always work with entities according to the domain model. What we can do is reduce the number of entities loaded in memory to the minimum required. Before 93800c6, the code was loading job executions in a loop for every partitioned step execution, which is obviously not necessary.

In your screenshot, I see you have several JobExecution objects with different IDs. Are you running several job instances in the same JVM and sharing the MessageChannelPartitionHandler between them?

To correctly address any performance issue, we need to analyse the performance for a single job execution first. So I am expecting to see a single job execution in memory with a partitioned step. Once we ensure that a single partitioned execution is optimized, we can discuss if the packaging/deployment pattern is suitable to run several job executions in the same JVM or not.

Please open a separate issue and provide a minimal example to be sure we are addressing your specific issue and we will dig deeper. Thank you upfront.

@galovics

galovics commented Apr 5, 2023

@fmbenhassine

In your screenshot, I see you have several JobExecution objects with different IDs. Are you running several job instances in the same JVM and sharing the MessageChannelPartitionHandler between them?

That's strange to me too. I re-read the Spring Batch docs on job instances to make sure we use the same terminology, and I can confirm there's a single job instance being run. In fact, it's the textbook example from the Spring Batch docs.
It's a remote partitioned end of day job (close of business (COB) as we refer to it) running once each day.

I can even show you the code because the project is open-source.
Here's the whole manager configuration: https://github.com/apache/fineract/blob/dbfedf5cfdffbddfd400f51498c02a88c0551bd1/fineract-provider/src/main/java/org/apache/fineract/cob/loan/LoanCOBManagerConfiguration.java
Here's the worker configuration: https://github.com/apache/fineract/blob/dbfedf5cfdffbddfd400f51498c02a88c0551bd1/fineract-provider/src/main/java/org/apache/fineract/cob/loan/LoanCOBWorkerConfiguration.java

@fmbenhassine
Contributor

Thank you for your feedback.

I can confirm there's a single job instance being run

In that case, there should really be a single JobExecution object in memory. By design, Spring Batch does not allow concurrent job executions of the same job instance. Therefore, if a single job instance is launched within a JVM, there should be a single job execution for that instance running at a time (and consequently, a single JobExecution object in memory). That is the setup we need to analyse the performance issue.

As mentioned previously, as this issue has been closed and assigned to a release, please open a separate one with all these details and I will take a look. Thank you upfront.

@pstetsuk

pstetsuk commented Jan 8, 2024

We have the same problem. I modified PR #3791 so it can be merged into the main branch.

@hpoettker
Contributor

@galovics @pstetsuk
If you find the time, it would be interesting to hear whether #4599 improves the situation for you.

@pstetsuk

@hpoettker our problem is that we have thousands of steps and all of them are loaded into memory every time the step result is checked. This leads to an OutOfMemory error. Your fix doesn't change this behavior and can't resolve the problem. The fix from @galovics doesn't load all the steps but gets the count of incomplete steps from the database. It works much faster and consumes much less memory.
