nextflow 24.04.4 never exit due to incomplete file transfer #5363
Comments
Did it hang or did it just exit without finishing all of the file transfers? Your issue title suggests the former but your log suggests the latter |
It hangs for ~12 hours, then shows "Exiting before file transfers were completed -- Some files may be lost" message, and then hangs without exit for days. |
Then it looks like one of the file uploads hung. Nextflow times out after 12 hours, so that is the expected behavior. As for the file upload, it's hard to know the root cause. I would first check whether it happens consistently. If not, it might be an intermittent networking issue |
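For reference, the log in the bug report shows a "Waiting for file transfers to complete" message once a minute, followed 12 hours later by the "Exiting before file transfers were completed" warning. Below is a minimal Java sketch of that kind of wait-with-deadline loop. It is not Nextflow's ThreadPoolHelper code: the class name AwaitDemo, the method names, and the millisecond durations (standing in for the 12-hour limit) are all assumptions made for illustration.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class AwaitDemo {

    /**
     * Hypothetical helper: stop accepting new tasks, then poll until the pool
     * terminates or the deadline passes. Returns true if all tasks completed.
     */
    static boolean awaitCompletion(ExecutorService pool, long maxAwaitMillis, long pollMillis)
            throws InterruptedException {
        pool.shutdown(); // soft shutdown: queued and running tasks keep going
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(maxAwaitMillis);
        while (true) {
            long remaining = deadline - System.nanoTime();
            if (remaining <= 0) {
                return false; // give up: some tasks never finished
            }
            long wait = Math.min(TimeUnit.MILLISECONDS.toNanos(pollMillis), remaining);
            if (pool.awaitTermination(wait, TimeUnit.NANOSECONDS)) {
                return true;
            }
            System.out.println("Waiting for tasks to complete");
        }
    }

    /** All tasks finish quickly, so the wait succeeds. */
    static boolean demoQuick() throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        for (int i = 0; i < 4; i++) {
            pool.execute(() -> { });
        }
        return awaitCompletion(pool, 5000, 100);
    }

    /** One task never finishes, so the wait hits the deadline. */
    static boolean demoStuck() throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(1);
        CountDownLatch never = new CountDownLatch(1);
        pool.execute(() -> {
            try {
                never.await(); // simulates a transfer that hangs
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        boolean done = awaitCompletion(pool, 300, 100);
        pool.shutdownNow(); // interrupt the stuck task so the JVM can exit
        return done;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("quick pool finished: " + demoQuick());
        System.out.println("stuck pool finished: " + demoStuck());
    }
}
```

In plain Java, a soft shutdown that times out leaves the stuck worker thread alive, and a non-daemon thread keeps the JVM from exiting. Whether that accounts for the process lingering after the warning in this report is not established here.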
We are encountering the same issue using both 24.04 and the latest edge release. Is there an option to "retry" the file transfer after it has hung for a given amount of time? |
@matthdsm does it happen consistently? and are you saying it doesn't happen for other versions? |
It happens often, but not consistently. We started noticing the phenomenon after updating to 24.04, but I'm not 100% sure it didn't happen before. |
I'm experiencing the same issue. It does not happen for every workflow but it seems to happen consistently for one of the workflows we run. @matthdsm What did you do in the end to resolve the issue? |
Are you able to include the |
See attached, does that suffice @pditommaso ? |
There are two threads in waiting status for publishing data. Still don't know the reason
|
Interestingly both are executed via |
This is what I think is happening: |
I don't think it has anything to do with the S3 transfer, because we're seeing this issue on a shared FS too. |
@jorgee was suspecting something like that. What about using nextflow/modules/nextflow/src/main/groovy/nextflow/util/ThreadPoolManager.groovy Lines 98 to 99 in 0e30a8f
|
Would it help to break up directory publishing by publishing individual files instead? See #3933 |
Thanks for your feedback! @bentsherman Your PR is still open. Do you happen to have a nightly build or something I can try and test it? Reconfiguring the workflow so it outputs individual files is not an option, I'm afraid. We just spent quite some time tuning the output structure according to our needs. @pditommaso @jorgee What could I try setting then?
And what would be good values for these? |
I would be very much in favour of this option. The situation right now is the worst possible one: users think the pipeline ran successfully but no data is written out. And when they resume the pipeline (in order to pick up cached tasks and just try to re-publish the files) it turns out no caching information is available because the head job did not finish properly. |
@tverbeiren can you please try the following settings?
|
These are the current defaults; maxQueueSize must be bigger in your case.
|
I claim |
If I correctly understood the rejection mechanism, it is mainly when threads and queues are full. So, if we reduce the queue, the rejection should happen earlier. |
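The rejection behavior described above can be reproduced with a plain java.util.concurrent.ThreadPoolExecutor. The sketch below is illustrative only and does not use Nextflow's ThreadPoolManager: the class name RejectionDemo and the thread and queue sizes are assumptions. With the default AbortPolicy, a task is rejected only once all threads are busy and the bounded queue is full, so a smaller queue leads to earlier rejections.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class RejectionDemo {

    /**
     * Submits `tasks` tasks that all block until released, and returns how
     * many submissions were rejected by the pool.
     */
    static int countRejections(int maxThreads, int queueSize, int tasks)
            throws InterruptedException {
        CountDownLatch release = new CountDownLatch(1);
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                maxThreads, maxThreads, 60, TimeUnit.SECONDS,
                new LinkedBlockingQueue<>(queueSize)); // default handler: AbortPolicy
        int rejected = 0;
        try {
            for (int i = 0; i < tasks; i++) {
                try {
                    pool.execute(() -> {
                        try {
                            release.await(); // keep the worker busy
                        } catch (InterruptedException e) {
                            Thread.currentThread().interrupt();
                        }
                    });
                } catch (RejectedExecutionException e) {
                    rejected++; // all threads busy and queue full
                }
            }
        } finally {
            release.countDown(); // let the accepted tasks finish
            pool.shutdown();
            pool.awaitTermination(10, TimeUnit.SECONDS);
        }
        return rejected;
    }

    public static void main(String[] args) throws InterruptedException {
        // 2 threads + 3 queue slots accept 5 of 10 tasks
        System.out.println(countRejections(2, 3, 10)); // prints 5
        // 2 threads + 1 queue slot accept 3 of 10 tasks
        System.out.println(countRejections(2, 1, 10)); // prints 7
    }
}
```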
I think you are right. My assumption was that a blocking queue was used, which prevents more jobs from being added once it is full. We may need to restore this implementation: nextflow/modules/nextflow/src/main/groovy/nextflow/util/BlockingBlockingQueue.groovy Line 33 in c0e2aa7
|
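One way to get the blocking behavior mentioned above is a queue whose offer() waits for free space instead of returning false, so the submitting thread is slowed down rather than having its task rejected. The sketch below shows that general technique under stated assumptions: BlockingOfferQueue and BlockingOfferDemo are hypothetical names, and this is not the BlockingBlockingQueue.groovy source referenced above.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class BlockingOfferDemo {

    /** Hypothetical bounded queue: offer() blocks until there is room. */
    static class BlockingOfferQueue<E> extends LinkedBlockingQueue<E> {
        BlockingOfferQueue(int capacity) {
            super(capacity);
        }

        @Override
        public boolean offer(E e) {
            try {
                put(e); // waits for a free slot instead of failing
                return true;
            } catch (InterruptedException ex) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    /** Submits `tasks` short tasks and returns how many completed. */
    static int runAll(int maxThreads, int queueSize, int tasks) throws InterruptedException {
        AtomicInteger completed = new AtomicInteger();
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                maxThreads, maxThreads, 60, TimeUnit.SECONDS,
                new BlockingOfferQueue<>(queueSize));
        for (int i = 0; i < tasks; i++) {
            // ThreadPoolExecutor.execute() enqueues via offer(), so this call
            // blocks while the queue is full instead of throwing
            pool.execute(() -> {
                try {
                    Thread.sleep(5); // simulated work
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                completed.incrementAndGet();
            });
        }
        pool.shutdown();
        pool.awaitTermination(30, TimeUnit.SECONDS);
        return completed.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(runAll(2, 1, 10)); // all 10 tasks run, none rejected
    }
}
```

A general caveat with this design: if a task running inside the pool submits further work to the same pool while the queue is full, it blocks a worker thread, and if every worker does so the pool can stall.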
Pushed a tentative solution: #5700 |
Using the following configuration, all files are properly published:
Do you see any disadvantages in setting this for all our workflows (with a proper solution pending)? |
It should be a valid workaround |
Bug report
Expected behavior and actual behavior
I previously used Nextflow 22.10.6.5843, which runs smoothly. After I updated Nextflow to 24.04.4, the same script hangs with some file transfers unfinished. The files to be transferred total around 50 GB.
Steps to reproduce the problem
Program output
Oct-03 12:42:12.157 [main] DEBUG nextflow.Session - Session await > all processes finished
Oct-03 12:42:17.082 [Task monitor] DEBUG n.processor.TaskPollingMonitor - <<< barrier arrives (monitor: slurm) - terminating tasks monitor poll loop
Oct-03 12:42:17.082 [main] DEBUG nextflow.Session - Session await > all barriers passed
Oct-03 12:42:17.093 [main] DEBUG nextflow.util.ThreadPoolManager - Thread pool 'TaskFinalizer' shutdown completed (hard=false)
Oct-03 12:42:22.095 [main] INFO nextflow.util.ThreadPoolHelper - Waiting for file transfers to complete (7 files)
Oct-03 12:43:22.102 [main] DEBUG nextflow.util.ThreadPoolHelper - Waiting for file transfers to complete (7 files)
Oct-03 12:44:22.104 [main] DEBUG nextflow.util.ThreadPoolHelper - Waiting for file transfers to complete (6 files)
Oct-03 12:45:22.106 [main] DEBUG nextflow.util.ThreadPoolHelper - Waiting for file transfers to complete (6 files)
Oct-03 12:46:22.108 [main] DEBUG nextflow.util.ThreadPoolHelper - Waiting for file transfers to complete (6 files)
Oct-03 12:47:22.110 [main] DEBUG nextflow.util.ThreadPoolHelper - Waiting for file transfers to complete (6 files)
Oct-03 12:48:22.112 [main] DEBUG nextflow.util.ThreadPoolHelper - Waiting for file transfers to complete (4 files)
Oct-03 12:49:22.114 [main] DEBUG nextflow.util.ThreadPoolHelper - Waiting for file transfers to complete (3 files)
Oct-03 12:50:22.116 [main] DEBUG nextflow.util.ThreadPoolHelper - Waiting for file transfers to complete (3 files)
.......
.......
.......
Oct-04 00:41:23.430 [main] DEBUG nextflow.util.ThreadPoolHelper - Waiting for file transfers to complete (3 files)
Oct-04 00:42:18.432 [main] WARN nextflow.util.ThreadPoolHelper - Exiting before file transfers were completed -- Some files may be lost
Oct-04 00:42:18.432 [main] DEBUG nextflow.util.ThreadPoolManager - Thread pool 'PublishDir' shutdown completed (hard=false)
Oct-04 00:42:18.463 [main] DEBUG n.trace.WorkflowStatsObserver - Workflow completed > WorkflowStats[succeededCount=32; failedCount=0; ignoredCount=0; cachedCount=0; pendingCount=0; submittedCount=0; runningCount=0; retriesCount=0; abortedCount=0; succeedDuration=226d 11h 31m 9s; failedDuration=0ms; cachedDuration=0ms;loadCpus=0; loadMemory=0; peakRunning=5; peakCpus=125; peakMemory=0; ]
Oct-04 00:42:18.733 [main] DEBUG nextflow.cache.CacheDB - Closing CacheDB done
Oct-04 00:42:18.820 [main] DEBUG nextflow.util.ThreadPoolManager - Thread pool 'FileTransfer' shutdown completed (hard=false)
Oct-04 00:42:18.820 [main] DEBUG nextflow.script.ScriptRunner - > Execution complete -- Goodbye
Environment
Additional context