
Cannot run on LSF system, not sure if exit code 127 is the reason #196

Open
Jixuan-Huang opened this issue Jul 12, 2023 · 6 comments

@Jixuan-Huang

Hi,

When I tried to run caper with the ENCODE ATAC-seq pipeline, I could see that the job was submitted to the LSF system, but the job disappeared immediately without any report.

When I looked into the detailed record of the job, I found an "exit code 127". I think the shell script may never have been created in the home directory, but I have no idea how to solve it. Here are the command and the job report:

(encd-atac) caper hpc submit atac.wdl -i test.F.json --singularity --leader-job-name pipeline.test
2023-07-12 18:37:19,187|caper.hpc|INFO| Running shell command: bsub -W 2880 -M 4G -q ser -env all -J CAPER_pipeline.test /work/bio-huangjx/n6vro_by.sh
Job <4995762> is submitted to queue <ser>.
(encd-atac) bjobs
No unfinished job found
(encd-atac) bjobs -l 4995762

Job <4995762>, Job Name <CAPER_pipeline.test>, User <bio-huangjx>, Project <def
                     ault>, Status <EXIT>, Queue <ser>, Command </work/bio-huan
                     gjx/n6vro_by.sh>, Share group charged </bio-huangjx>
Wed Jul 12 18:38:25: Submitted from host <login02>, CWD <$HOME/TempDir/2305atac
                     seq/11.encode.atac>, Re-runnable;

 RUNLIMIT                
 2880.0 min of r01n14

 MEMLIMIT
      4 G 
Wed Jul 12 18:38:27: Started 5 Task(s) on Host(s) <1*r01n14> <3*r01n15> <1*r01n
                     12>, Allocated 5 Slot(s) on Host(s) <1*r01n14> <3*r01n15> 
                     <1*r01n12>, Execution Home </work/bio-huangjx>, Execution 
                     CWD </work/bio-huangjx/TempDir/2305atacseq/11.encode.atac>
                     ;
Wed Jul 12 18:38:29: Exited with exit code 127. The CPU time used is 0.1 second
                     s.
Wed Jul 12 18:38:29: Completed <exit>.

 MEMORY USAGE:
 MAX MEM: 1 Mbytes;  AVG MEM: 1 Mbytes

 SCHEDULING PARAMETERS:
           r15s   r1m  r15m   ut      pg    io   ls    it    tmp    swp    mem
 loadSched   -     -     -     -       -     -    -     -     -      -      -  
 loadStop    -     -     -     -       -     -    -     -     -      -      -  

 RESOURCE REQUIREMENT DETAILS:
 Combined: select[type == local] order[-slots]
 Effective: select[type == local] order[-slots] 
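
For reference, exit code 127 normally means the command could not be found, which would be consistent with the leader script /work/bio-huangjx/n6vro_by.sh not existing (or not being visible) on the execution host. A quick check from a shell on one of the execution hosts could look like this (a minimal sketch using the script path from the log above; the script name changes on every submit):

# does the leader script that bsub tried to execute exist on this node?
ls -l /work/bio-huangjx/n6vro_by.sh
# is the filesystem it lives on mounted here at all?
df -h /work/bio-huangjx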

Here is the config file for caper:

backend=lsf

# Local directory for localized files and Cromwell's intermediate files.
# If not defined then Caper will make .caper_tmp/ on CWD or `local-out-dir`.
# /tmp is not recommended since Caper store localized data files here.
local-loc-dir=

# This parameter defines resource parameters for Caper's leader job only.
lsf-leader-job-resource-param=-W 2880 -M 4G -q ser

# This parameter defines resource parameters for submitting WDL task to job engine.
# It is for HPC backends only (slurm, sge, pbs and lsf).
# It is not recommended to change it unless your cluster has custom resource settings.
# See https://github.com/ENCODE-DCC/caper/blob/master/docs/resource_param.md for details.
lsf-resource-param=${"-n " + cpu} ${if defined(gpu) then "-gpu " + gpu else ""} ${if defined(memory_mb) then "-M " else ""}${memory_mb}${if defined(memory_mb) then "m" else ""} ${"-W " + 60*time}

cromwell=/work/bio-huangjx/.caper/cromwell_jar/cromwell-82.jar
womtool=/work/bio-huangjx/.caper/womtool_jar/womtool-82.jar
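
For reference, the lsf-resource-param line above is a per-task template that Cromwell fills in from each task's runtime attributes. For a hypothetical task with cpu=4, memory_mb=8000 and time=2 (hours) and no GPU, it would expand to roughly the following bsub options (illustrative values only):

-n 4 -M 8000m -W 120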

And here is the information about the software and environment:

(encd-atac) lsb_release -a
LSB Version:	:core-4.1-amd64:core-4.1-noarch:cxx-4.1-amd64:cxx-4.1-noarch:desktop-4.1-amd64:desktop-4.1-noarch:languages-4.1-amd64:languages-4.1-noarch:printing-4.1-amd64:printing-4.1-noarch
Distributor ID:	CentOS
Description:	CentOS Linux release 7.5.1804 (Core) 
Release:	7.5.1804
Codename:	Core

(encd-atac) caper -v
2.2.2

(encd-atac) cat test.F.json 
{
    "atac.title" : "XenTroTissues",
    "atac.description" : "15XenTroTissue",

    "atac.pipeline_type" : "atac",
    "atac.align_only" : false,
    "atac.true_rep_only" : false,

    "atac.genome_tsv" : "/work/bio-huangjx/data/refgenome/ENCO.atac/xetro10_NCBI.tsv",

    "atac.paired_end" : true,

    "atac.F_m1_R1" : [ "/work/bio-huangjx/TempDir/2305atacseq/00.rawdata/atac-m1-F/atac-m1-F_R1.fq.gz" ],
    "atac.F_m1_R2" : [ "/work/bio-huangjx/TempDir/2305atacseq/00.rawdata/atac-m1-F/atac-m1-F_R2.fq.gz" ],
    "atac.F_m2_R1" : [ "/work/bio-huangjx/TempDir/2305atacseq/00.rawdata/atac-m2-F/atac-m2-F_R1.fq.gz" ],
    "atac.F_m2_R2" : [ "/work/bio-huangjx/TempDir/2305atacseq/00.rawdata/atac-m2-F/atac-m2-F_R2.fq.gz" ],

    "atac.auto_detect_adapter" : true,
    
    "atac.multimapping" : 4

    "atac.smooth_win" : 140,
}
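
For reference, the inputs JSON can be syntax-checked against the WDL before submitting, e.g. with the womtool jar already referenced in the config above (a minimal sketch; run from the directory containing atac.wdl):

java -jar /work/bio-huangjx/.caper/womtool_jar/womtool-82.jar validate atac.wdl --inputs test.F.json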

Thanks for responding!

myylee commented Nov 1, 2023

I am also encountering the same issue. I would appreciate any help with this. Thanks.

leepc12 (Contributor) commented Nov 6, 2023

Can you edit the conf file like the following (adding -o and -e to redirect stdout and stderr logs to local files) and try again?

lsf-leader-job-resource-param=-W 2880 -M 4G -q ser -o /YOUR/HOME/stdout.txt -e /YOUR/HOME/stderr.txt

Replace /YOUR/HOME with a directory that you have access to, and please post those two log files here.
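
For example, after editing the conf you could resubmit with the same command as before and then look at the two files once the leader job has exited (a minimal sketch; /YOUR/HOME is a placeholder):

caper hpc submit atac.wdl -i test.F.json --singularity --leader-job-name pipeline.test
# once bjobs shows the leader job has finished/exited:
cat /YOUR/HOME/stderr.txt
cat /YOUR/HOME/stdout.txt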

lewkiewicz commented Feb 26, 2024

Hi! I am also dealing with this exact issue on an LSF cluster. I edited the conf file to include the line:

lsf-leader-job-resource-param=-W 2880 -M 4G -q ser -o /YOUR/HOME/stdout.txt -e /YOUR/HOME/stderr.txt

as suggested by leepc12. The output and error files are as follows:

stderr.txt

/home/lewks/.lsbatch/1708824945.82031239: line 8: /home/lewks/6pcgf87d.sh: No such file or directory

stdout.txt

Sender: LSF System lsfadmin@node184.hpc.local
[Subject: Job 82031239: <CAPER_ANY_GOOD_LEADER_JOB_NAME> in cluster Exited]
Job <CAPER_ANY_GOOD_LEADER_JOB_NAME> was submitted from host <node156.hpc.local> by user in > cluster at Sat Feb 24 20:35:45 2024
Job was executed on host(s) <node184.hpc.local>, in queue , as user in cluster at Sat Feb 24 20:35:45 2024
</home/lewks> was used as the home directory.
</home/lewks/atac-seq-pipeline> was used as the working directory.
Started at Sat Feb 24 20:35:45 2024
Terminated at Sat Feb 24 20:35:45 2024
Results reported at Sat Feb 24 20:35:45 2024

Your job looked like:


LSBATCH: User input

/home/lewks/6pcgf87d.sh

Exited with exit code 127.

Resource usage summary:

CPU time : 0.02 sec.
Max Memory : -
Average Memory : -
Total Requested Memory : -
Delta Memory : -
Max Swap : -
Max Processes : -
Max Threads : -
Run time : 0 sec.
Turnaround time : 0 sec.

The output (if any) follows:

PS:

Read file </home/lewks/stderr.txt> for stderr output of this job.

Thank you so much for any insight you might have as to how to fix this!

Best,
Stephanie

gabdank commented Feb 27, 2024

I'm truly sorry to hear about the difficulties you're experiencing with running CAPER. Unfortunately, due to our current bandwidth and personnel limitations, we are unable to provide immediate attention to resolving this particular issue.
We sincerely apologize for any inconvenience this may cause and greatly appreciate your understanding.

@lewkiewicz

No problem! Thanks for letting us know.

@sbresnahan

Bump - I am experiencing the same issue. The error logs show a call to the supposedly generated shell script, which the system checks for and reports does not exist...
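
In case it helps anyone narrowing this down, one thing worth ruling out is whether the directory Caper writes the leader script into (the submitting user's home/work directory) is actually visible from the compute nodes. A minimal check could look something like this (paths are placeholders, and the queue name ser is just the one from the original report):

# submit a trivial job that lists the directory Caper writes its scripts to,
# as seen from a compute node
bsub -q ser -o /YOUR/HOME/check_out.txt "ls -l /YOUR/HOME"
# after the job finishes, inspect the output
cat /YOUR/HOME/check_out.txt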
