Markdup step TimeOut exiting with IHEC usage #53

Open
paulstretenowich opened this issue Nov 11, 2019 · 13 comments

@paulstretenowich

Hi,

I'm using the pipeline as part of IHEC, with the following versions:

  • Singularity version 3.2.1-1.el7
  • IHEC Fork of grape-nf
  • Nextflow version 19.04.0.5069

The pipeline itself runs well except for the mergeBam step (it does not always work). The markdup step takes a very long time to finish (I tried allowing up to 3 days) and ends with TIMEOUT. I noticed that the sambamba command is started but "stuck", using 0% CPU (monitoring with htop). I then tried the sambamba command defined inside .command.sh both outside the container (it worked) and inside the container (it also worked). I don't know what's happening there. If you need me to attach some logs, please tell me.
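
For reference, this is roughly how I tested it from the task work directory; the thread count, BAM file names and image name below are placeholders, not the exact values from my run:

    # outside the container, command copied from .command.sh (placeholder arguments)
    sambamba markdup -t 8 input.bam output.markdup.bam

    # same command inside the Singularity container (placeholder image name)
    singularity exec grape-nf.simg sambamba markdup -t 8 input.bam output.markdup.bam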

Thanks,
Paul

@emi80
Member

emi80 commented Nov 13, 2019

Hi Paul,

apologies for the late reply.

About mergeBam not always working, I suspect it is related to the open issue #48. I am going to fix it soon.

About markDup, it would be useful to get the process logs and the Nextflow log for the pipeline run. Another test you could do would be to manually launch the .command.run script from within the process work folder and see if that works. Also, if you run the pipeline again, does the problem still arise? It looks like a weird behavior...
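
In case it helps, that would be something along these lines, assuming you change into the task work directory reported in the Nextflow log (the hash path below is just an example, take the real one from .nextflow.log):

    # example work directory hash
    cd work/4e/7f21a0
    bash .command.run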

Best,
Emilio

@paulstretenowich
Author

Hi Emilio,

Yes, the mergeBam issue is related to #48; that's why I haven't given you much more information about it.

About markDup, if I run .command.run outside of the pipeline I have the same issue. If I run .command.sh it works well. If I re-run the pipeline I also have the same issue; however, sometimes it works without changing anything.
Here are the logs from a run: nextflow.log command.err.txt command.log command.run.txt command.sh.txt

Thanks,
Paul

@emi80
Member

emi80 commented Nov 15, 2019

Hi Paul,

I could not find anything useful in the logs.

Did you run .command.run and .command.sh locally or submit them via Slurm? I am wondering whether the problem lies in running within a submitted job vs locally, or whether the .command.run script has some incompatibility with your system.

Best,
Emilio

@paulstretenowich
Author

Hi Emilio,

When I use the pipeline I run it with Slurm, but when I tried manually it was without Slurm. In both cases it was with the Singularity image.
Running .command.sh either inside or outside the container worked, but when it comes to running .command.run I have the timeout/0% CPU usage issue, with or without Slurm.

Thanks,
Paul

@emi80
Member

emi80 commented Nov 19, 2019

Hi Paul,

could you please try running the pipeline with the included small test dataset and the markdup profile? E.g.:

nextflow run grape-nf -profile markdup -with-singularity

Does the problem occur also in this case?

Best,
Emilio

@paulstretenowich
Author

paulstretenowich commented Nov 21, 2019

Hi Emilio,

Testing with the markdup profile on the test dataset worked without issue.

Thanks,
Paul

@emi80
Member

emi80 commented Nov 22, 2019

Hi Paul,

thanks, that does not help much.

I just realized you are not using the latest version of Nextflow. Any chance you could run a test with that version?

nextflow self-update

      N E X T F L O W
      version 19.10.0 build 5170
      created 21-10-2019 15:07 UTC (17:07 CEST)
      cite doi:10.1038/nbt.3820
      http://nextflow.io

If the problem persists, I would then suggest you run the hanging job via .command.run and inspect the process tree to see what's going on. You could use top or ps to check. In case you need help, please just send me the output of the ps -faux command.
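
For example, something like this while the job is hanging:

    # save the full process listing (tree view) so you can share it
    ps -faux > ps-faux.txt

    # or watch it interactively with full command lines
    top -c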

Another test would be to run the pipeline after adding trace.enabled = false to your local nextflow.config file and see whether the problem comes from that. I am not sure that's the case, as the test dataset runs without issues.
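
Disabling the trace should just be a one-line change to the nextflow.config in the directory you launch the pipeline from, e.g.:

    # append to the nextflow.config used for the run
    echo 'trace.enabled = false' >> nextflow.config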

Best,
Emilio

@paulstretenowich
Author

paulstretenowich commented Nov 25, 2019

Hi Emilio,

Updating Nextflow seems to solve the issue only when I run locally; when I'm using Slurm I still have the same issue. EDIT: The update solved the issue for 2 samples, but for the other 2 the issue remains even locally.

You can see what's going on at the markdup step in the attached htop screenshot, if that helps, and here is the corresponding ps-faux.txt output.

Changing the value of trace.enabled to false doesn't change anything...

Thanks,
Paul

https://user-images.githubusercontent.com/31796146/69554181-6647f600-0f6f-11ea-8df4-017ba69f3f40.png

@emi80
Member

emi80 commented Nov 25, 2019

Hi Paul,

it's a weird issue and it's hard to tell what the cause is. Maybe removing some of the complexity would help. Could you try running it without Singularity (e.g. with environment modules or Conda)?

The best would be to find a minimal dataset for which we can reproduce the issue.

Best,
Emilio

@paulstretenowich
Author

Hi Emilio,

Just to update you: I'm installing all the tools required for the pipeline and will test without Singularity, as you suggested. I will let you know if that changes anything with the issue.

Thanks,
Paul

@emi80
Member

emi80 commented Jan 28, 2020

Hi Paul,

any news regarding this issue?

Best,
Emilio

@paulstretenowich
Author

Hi Emilio,

I moved to another cluster and that specific issue is not happening there. It might be related to the infrastructure of the first cluster I tried. I'm waiting for a file system update and will test again, hoping it will work then. I'll keep you posted.

On the other cluster where I'm running the pipeline, the only remaining issue is the mergeBam one, which you are fixing.

Thanks,
Paul

@emi80
Member

emi80 commented Jan 28, 2020

Hi Paul,

thanks for the update.

I'm closing this for now. Please feel free to reopen it again after the file system update if needed.

Best,
Emilio

@emi80 emi80 closed this as completed Jan 28, 2020
@emi80 emi80 reopened this May 5, 2020