Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fail to create scTAR_cellranger env #10

Open
adonatiucsd opened this issue Oct 5, 2021 · 29 comments
Open

fail to create scTAR_cellranger env #10

adonatiucsd opened this issue Oct 5, 2021 · 29 comments

Comments

@adonatiucsd
Copy link

Hi,
I am trying to get the fromcellranger pipeline to work on MacOS Catalina 10.15.7
When I run
conda env create -f scTAR_cellranger.yml
(after modifying the yml file prefix to /home/miniconda3/envs/scTAR_cellranger)
I get the following error message:

Collecting package metadata (repodata.json): done
Solving environment: failed

ResolvePackageNotFound:

  • libtiff==4.2.0=hdc55705_0
  • _openmp_mutex==4.5=1_gnu
  • xorg-libxrender==0.9.10=h7f98852_1003
  • xorg-libxext==1.3.4=h7f98852_1
  • make==4.3=hd18ef5c_1
  • curl==7.76.1=h979ede3_1
  • c-ares==1.17.1=h7f98852_1
  • libxcb==1.13=h7f98852_1003
  • libgcc-devel_linux-64==9.3.0=h7864c58_19
  • readline==8.0=he28a2e2_2
  • gfortran_impl_linux-64==9.3.0=hc4a2995_19
  • pthread-stubs==0.4=h36c2ea0_1001
  • xz==5.2.5=h516909a_1
  • xorg-kbproto==1.0.7=h7f98852_1002
  • libcurl==7.76.1=hc4aaa36_1
  • cairo==1.16.0=h6cf1ce9_1008
  • ucsc-gtftogenepred==377=h0b8a92a_4
  • tk==8.6.10=h21135ba_1
  • binutils_linux-64==2.35=h67ddf6f_30
  • libuuid==2.32.1=h7f98852_1000
  • xorg-libxau==1.0.9=h7f98852_0
  • libnghttp2==1.43.0=h812cca2_0
  • libgomp==9.3.0=h2828fa1_19
  • libstdcxx-devel_linux-64==9.3.0=hb016644_19
  • xorg-renderproto==0.11.1=h7f98852_1002
  • libedit==3.1.20191231=he28a2e2_2
  • htslib==1.12=h9093b5e_1
  • gxx_linux-64==9.3.0=h3fbe746_30
  • zstd==1.4.9=ha95c52a_0
  • mysql-connector-c==6.1.11=h6eb9d5d_1007
  • icu==68.1=h58526e2_0
  • libxml2==2.9.10=h72842e0_4
  • perl==5.32.0=h36c2ea0_0
  • libgfortran-ng==9.3.0=hff62375_19
  • jpeg==9d=h36c2ea0_0
  • pcre2==10.36=h032f7d1_1
  • lz4-c==1.9.3=h9c3ff4c_0
  • gfortran_linux-64==9.3.0=hdc58fab_30
  • pcre==8.44=he1b5a44_0
  • fribidi==1.0.10=h36c2ea0_0
  • gcc_linux-64==9.3.0=hf25ea35_30
  • gcc_impl_linux-64==9.3.0=h70c0ae5_19
  • samtools==1.12=h9aed4be_1
  • pixman==0.40.0=h36c2ea0_0
  • bzip2==1.0.8=h7f98852_4
  • xorg-libx11==1.7.0=h7f98852_0
  • bedtools==2.30.0=h7d7f7ad_1
  • libiconv==1.16=h516909a_0
  • openssl==1.1.1k=h7f98852_0
  • libev==4.33=h516909a_1
  • libglib==2.68.1=h3e27bee_0
  • xorg-libxdmcp==1.1.3=h7f98852_0
  • openjdk==10.0.2=h14c3975_1015
  • gxx_impl_linux-64==9.3.0=hd87eabc_19
  • ca-certificates==2020.12.5=ha878542_0
  • xorg-xproto==7.0.31=h7f98852_1007
  • ncurses==6.2=h58526e2_4
  • libwebp-base==1.2.0=h7f98852_2
  • gettext==0.19.8.1=h0b5b191_1005
  • xorg-libsm==1.2.3=hd9c2040_1000
  • harfbuzz==2.8.0=h83ec7ef_1
  • krb5==1.17.2=h926e7f8_0
  • pango==1.48.4=hb8ff022_0
  • libgcc-ng==9.3.0=h2828fa1_19
  • libdeflate==1.7=h7f98852_5
  • _libgcc_mutex==0.1=conda_forge
  • gsl==2.6=he838d99_2
  • r-base==4.0.3=h349a78a_8
  • ld_impl_linux-64==2.35.1=hea4e1c9_2
  • xorg-libice==1.0.10=h7f98852_0
  • xorg-libxt==1.2.1=h7f98852_2
  • libstdcxx-ng==9.3.0=h6de172a_19
  • sed==4.8=he412f7d_0
  • libffi==3.3=h58526e2_2
  • bwidget==1.9.14=ha770c72_0
  • binutils_impl_linux-64==2.35.1=h193b22a_2
  • xorg-xextproto==7.3.0=h7f98852_1002
  • tktable==2.10=hb7b940f_3
  • libgfortran5==9.3.0=hff62375_19
  • zlib==1.2.11=h516909a_1010
  • expat==2.3.0=h9c3ff4c_0
  • fontconfig==2.13.1=hba837de_1005
  • graphite2==1.3.13=h58526e2_1001
  • freetype==2.10.4=h0708190_1
  • libopenblas==0.3.12=pthreads_h4812303_1
  • libpng==1.6.37=h21135ba_2
  • libssh2==1.9.0=ha56f1ee_6

any idea about what is wrong?
I tried to run snakemake without creating the scTAR_cellranger environment but I get a bunch of errors
for example after doing
conda activate snakemake
snakemake -j
I get:
The flag 'directory' used in rule getMatsSteps is only valid for outputs, not inputs.
Building DAG of jobs...
MissingInputException in line 99 of /Users/perrylabmac/TAR-scRNA-seq/from_cellranger/Snakefile:
Missing input files for rule convertToRefFlat:
/Users/perrylabmac/Musca_ref_genome_Cellranger/genes/genes.gtf

thanks!

Antoine

@fw262
Copy link
Owner

fw262 commented Oct 6, 2021

Hi Antoine,

The ResolvePackageNotFound is a known issue if you're trying to create the environment on Mac using an environment.yml file created on a different OS, in this case Linux. I suggest running this pipeline either in Linux or try running without creating the scTAR_cellranger environment.

It looks like the error you're getting is related to your /Users/perrylabmac/Musca_ref_genome_Cellranger folder. Can you make sure this folder contains genes/genes.gtf?

Best,
Michael

@adonatiucsd
Copy link
Author

Hi Michael,

It turns out the genes.gtf was zipped; after unzipping and rerunning the snakemake I got this error:

Error in rule convertToRefFlat:
jobid: 9
output: /Users/perrylabmac/Musca_ref_genome_Cellranger/genes/genes.refFlat
shell:

            /Users/perrylabmac/miniconda3/pkgs/ucsc-gtftogenepred-377-h516baf0_3/bin/gtfToGenePred -genePredExt -geneNameAsName2 /Users/perrylabmac/Musca_ref_genome_Cellranger/genes/genes.gtf refFlat.tmp
            paste <(cut -f 12 refFlat.tmp) <(cut -f 1-10 refFlat.tmp) > /Users/perrylabmac/Musca_ref_genome_Cellranger/genes/genes.refFlat
            rm refFlat.tmp

    (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

thanks a lot for your help,

Antoine

@fw262
Copy link
Owner

fw262 commented Oct 7, 2021

Can you share the error message you get when you run the following commands on the command line? The snakemake log file does not provide the error message info.

/Users/perrylabmac/miniconda3/pkgs/ucsc-gtftogenepred-377-h516baf0_3/bin/gtfToGenePred -genePredExt -geneNameAsName2 /Users/perrylabmac/Musca_ref_genome_Cellranger/genes/genes.gtf refFlat.tmp
            paste <(cut -f 12 refFlat.tmp) <(cut -f 1-10 refFlat.tmp) > /Users/perrylabmac/Musca_ref_genome_Cellranger/genes/genes.refFlat
            rm refFlat.tmp

Thanks,
Michael

@adonatiucsd
Copy link
Author

I get this
(snakemake) Perrys-MacBook-Pro:log perrylabmac$ /Users/perrylabmac/miniconda3/pkgs/ucsc-gtftogenepred-377-h516baf0_3/bin/gtfToGenePred -genePredExt -geneNameAsName2 /Users/perrylabmac/Musca_ref_genome_Cellranger/genes/genes.gtf refFlat.tmp
dyld: Library not loaded: @rpath/libssl.1.1.dylib
Referenced from: /Users/perrylabmac/miniconda3/pkgs/ucsc-gtftogenepred-377-h516baf0_3/bin/gtfToGenePred
Reason: image not found
Abort trap: 6
(snakemake) Perrys-MacBook-Pro:log perrylabmac$ paste <(cut -f 12 refFlat.tmp) <(cut -f 1-10 refFlat.tmp) > /Users/perrylabmac/Musca_ref_genome_Cellranger/genes/genes.refFlat
cut: cut: refFlat.tmp: No such file or directory
refFlat.tmp: No such file or directory
(snakemake) Perrys-MacBook-Pro:log perrylabmac$ rm refFlat.tmp
rm: refFlat.tmp: No such file or directory

@fw262
Copy link
Owner

fw262 commented Oct 7, 2021

Hmm, seems like an issue with the libssl library. Try conda installing with conda install libssh2 and re-runing.

Michael

@adonatiucsd
Copy link
Author

So I installed libssh2 and rerun, here is the log file:
Select jobs to execute...

[Fri Oct 8 10:07:32 2021]
rule calcHMMbed:
input: /Users/perrylabmac/M1_Musca/Lib_10X_Musca_larva_2019_02_25_M1_1sttry/outs/possorted_genome_bam.bam
output: /Users/perrylabmac/M1_Musca/Lib_10X_Musca_larva_2019_02_25_M1_1sttry/TAR/TAR_reads.bed.gz
jobid: 8
wildcards: DATADIR=/Users/perrylabmac/M1_Musca, sample=Lib_10X_Musca_larva_2019_02_25_M1_1sttry
threads: 3
resources: tmpdir=/var/folders/kb/1djtgry52557wtjq8j15n_pm0000gn/T

[Fri Oct 8 12:18:55 2021]
Finished job 8.
1 of 11 steps (9%) done
Select jobs to execute...

[Fri Oct 8 12:18:55 2021]
rule calcHMMrefFlat:
input: /Users/perrylabmac/M1_Musca/Lib_10X_Musca_larva_2019_02_25_M1_1sttry/TAR/TAR_reads.bed.gz, /Users/perrylabmac/Musca_ref_genome_Ce$
output: /Users/perrylabmac/M1_Musca/Lib_10X_Musca_larva_2019_02_25_M1_1sttry/TAR/TAR_reads.bed.gz.withDir.genes.refFlat
jobid: 7
wildcards: DATADIR=/Users/perrylabmac/M1_Musca, sample=Lib_10X_Musca_larva_2019_02_25_M1_1sttry
resources: tmpdir=/var/folders/kb/1djtgry52557wtjq8j15n_pm0000gn/T

[Fri Oct 8 12:18:55 2021]
Error in rule calcHMMrefFlat:
jobid: 7
output: /Users/perrylabmac/M1_Musca/Lib_10X_Musca_larva_2019_02_25_M1_1sttry/TAR/TAR_reads.bed.gz.withDir.genes.refFlat
shell:

            Rscript scripts/generate_refFlat_script_both.R /Users/perrylabmac/Musca_ref_genome_Cellranger/genes/genes.refFlat /Users/per$

    (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /Users/perrylabmac/TAR-scRNA-seq/from_cellranger/.snakemake/log/2021-10-08T100732.520897.snakemake.log

and what I get during execution:

(snakemake) Perrys-MacBook-Pro:from_cellranger perrylabmac$ snakemake -j --rerun-incomplete
The flag 'directory' used in rule getMatsSteps is only valid for outputs, not inputs.
Building DAG of jobs...
Using shell: /usr/local/bin/bash
Provided cores: 8
Rules claiming more threads will be scaled down.
Job stats:
job count min threads max threads


HMM_refFlat_to_gtf_WITHDIR 1 1 1
all 1 1 1
calcHMMbed 1 3 3
calcHMMrefFlat 1 1 1
extract_HMM_expression_withDir 1 1 1
getDiffFeatures 1 1 1
getDiffRegionsToBlast 1 1 1
getDiffSeqsToBlastFa 1 1 1
labelDiffuTARs 1 1 1
ruleBlast 1 3 3
stage3_withDir 1 1 1
total 11 1 3

Select jobs to execute...

[Fri Oct 8 10:07:32 2021]
rule calcHMMbed:
input: /Users/perrylabmac/M1_Musca/Lib_10X_Musca_larva_2019_02_25_M1_1sttry/outs/possorted_genome_bam.bam
output: /Users/perrylabmac/M1_Musca/Lib_10X_Musca_larva_2019_02_25_M1_1sttry/TAR/TAR_reads.bed.gz
jobid: 8
wildcards: DATADIR=/Users/perrylabmac/M1_Musca, sample=Lib_10X_Musca_larva_2019_02_25_M1_1sttry
threads: 3
resources: tmpdir=/var/folders/kb/1djtgry52557wtjq8j15n_pm0000gn/T

Number of aligned reads is 594024319
mkdir: /Users/perrylabmac/M1_Musca/Lib_10X_Musca_larva_2019_02_25_M1_1sttry/TAR/possorted_genome_bam_HMM_features: File exists
tee: SingleCellHMM_Run_/Users/perrylabmac/M1_Musca/Lib_10X_Musca_larva_2019_02_25_M1_1sttry/TAR/possorted_genome_bam_HMM_features.log: No such file or directory
Path to SingleCellHMM.R /Users/perrylabmac/TAR-scRNA-seq/from_cellranger/scripts
INPUT_BAM /Users/perrylabmac/M1_Musca/Lib_10X_Musca_larva_2019_02_25_M1_1sttry/outs/possorted_genome_bam.bam
cellranger count folder /Users/perrylabmac/M1_Musca/Lib_10X_Musca_larva_2019_02_25_M1_1sttry/TAR
tmp folder /Users/perrylabmac/M1_Musca/Lib_10X_Musca_larva_2019_02_25_M1_1sttry/TAR/possorted_genome_bam_HMM_features
number Of thread 3
minimum coverage 59
thresholded at 1 in 10000000 reads

Reads spanning over splicing junction will join HMM blocks
To avoid that, split reads into small blocks before input to groHMM
Spliting and sorting reads...
awk: syntax error at source line 1
context is
{print $0 &gt;&gt; &gt;&gt;&gt; "chr"$ <<< 1".bed"}
awk: illegal statement at source line 1
zcat: can't stat: possorted_genome_bam_split.sorted.bed.gz (possorted_genome_bam_split.sorted.bed.gz.Z): No such file or directory
find: illegal option -- n
usage: find [-H | -L | -P] [-EXdsx] [-f path] path ... [expression]
find [-H | -L | -P] [-EXdsx] -f path [path ...] [expression]

Start to run groHMM on each individual chromosome...
chr*.bed

Merging HMM blocks within 500bp...
sort: No such file or directory
mkdir: toremove: File exists

Calculating the coverage...
zcat: can't stat: possorted_genome_bam_split.sorted.bed.gz (possorted_genome_bam_split.sorted.bed.gz.Z): No such file or directory

Filtering the HMM blocks by coverage...

Please examine if major chromosomes are all present in the final TAR_reads.bed.gz file

zcat: can't stat: TAR_reads.bed.gz (TAR_reads.bed.gz.Z): No such file or directory

Link the final TAR_reads.bed.gz file to the working directory

Move intermediate files to /Users/perrylabmac/M1_Musca/Lib_10X_Musca_larva_2019_02_25_M1_1sttry/TAR/possorted_genome_bam_HMM_features/toremove ...

/Users/perrylabmac/M1_Musca/Lib_10X_Musca_larva_2019_02_25_M1_1sttry/TAR/possorted_genome_bam_HMM_features/toremove can be deleted if no error message in SingleCellHMM_Run log file and
all major chromosomes are present in the final TAR_reads.bed.gz file

gzip: toremove is a directory

done!
gzip: possorted_genome_bam_split.sorted.bed.gz already has .gz suffix -- unchanged
gzip: TAR_reads.bed.gz already has .gz suffix -- unchanged
[Fri Oct 8 12:18:55 2021]
Finished job 8.
1 of 11 steps (9%) done
Select jobs to execute...

[Fri Oct 8 12:18:55 2021]
rule calcHMMrefFlat:
input: /Users/perrylabmac/M1_Musca/Lib_10X_Musca_larva_2019_02_25_M1_1sttry/TAR/TAR_reads.bed.gz, /Users/perrylabmac/Musca_ref_genome_Cellranger/genes/genes.refFlat
output: /Users/perrylabmac/M1_Musca/Lib_10X_Musca_larva_2019_02_25_M1_1sttry/TAR/TAR_reads.bed.gz.withDir.genes.refFlat
jobid: 7
wildcards: DATADIR=/Users/perrylabmac/M1_Musca, sample=Lib_10X_Musca_larva_2019_02_25_M1_1sttry
resources: tmpdir=/var/folders/kb/1djtgry52557wtjq8j15n_pm0000gn/T

Error in read.table(file = file, header = header, sep = sep, quote = quote, :
no lines available in input
Calls: read.delim -> read.table
Execution halted
[Fri Oct 8 12:18:55 2021]
Error in rule calcHMMrefFlat:
jobid: 7
output: /Users/perrylabmac/M1_Musca/Lib_10X_Musca_larva_2019_02_25_M1_1sttry/TAR/TAR_reads.bed.gz.withDir.genes.refFlat
shell:

	Rscript scripts/generate_refFlat_script_both.R /Users/perrylabmac/Musca_ref_genome_Cellranger/genes/genes.refFlat /Users/perrylabmac/M1_Musca/Lib_10X_Musca_larva_2019_02_25_M1_1sttry/TAR/TAR_reads.bed.gz
	
    (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /Users/perrylabmac/TAR-scRNA-seq/from_cellranger/.snakemake/log/2021-10-08T100732.520897.snakemake.log

thanks!

Antoine

@fw262
Copy link
Owner

fw262 commented Oct 12, 2021

Hi Antoine,

I think this issue may again be related to your MacOS system. The awk syntax seem to be a little different depending on the OS. I propose 2 temporary solutions you can try.

  1. Try using gawk instead of awk (command: alias awk=gawk).
  2. Go into the file scripts/SingleCellHMM_MW.bash and change the line bedtools bamtobed -i ${INPUT_BAM} -split |LC_ALL=C sort -k1,1V -k2,2n --parallel=30| awk '{print $0}' | gzip > ${TMPDIR}/${PREFIX}_split.sorted.bed.gz into bedtools bamtobed -i ${INPUT_BAM} -split |LC_ALL=C sort -k1,1V -k2,2n --parallel=30| awk '{print ($0)}' | gzip > ${TMPDIR}/${PREFIX}_split.sorted.bed.gz where there is a parenthesis added to the awk print command.

Let me know if that fixes it.

Michael

@adonatiucsd
Copy link
Author

Hi Michael,

I tried both, I still get this error:

The flag 'directory' used in rule getMatsSteps is only valid for outputs, not inputs.
Building DAG of jobs...
Using shell: /usr/local/bin/bash
Provided cores: 8
Rules claiming more threads will be scaled down.
Job stats:
job count min threads max threads


HMM_refFlat_to_gtf_WITHDIR 1 1 1
all 1 1 1
calcHMMrefFlat 1 1 1
extract_HMM_expression_withDir 1 1 1
getDiffFeatures 1 1 1
getDiffRegionsToBlast 1 1 1
getDiffSeqsToBlastFa 1 1 1
labelDiffuTARs 1 1 1
ruleBlast 1 3 3
stage3_withDir 1 1 1
total 10 1 3

Select jobs to execute...

[Tue Oct 12 09:22:09 2021]
rule calcHMMrefFlat:
input: /Users/perrylabmac/M1_Musca/Lib_10X_Musca_larva_2019_02_25_M1_1sttry/TAR/TAR_reads.bed.gz, /Users/perrylabmac/Musca_ref_genome_Cellranger/genes/genes.refFlat
output: /Users/perrylabmac/M1_Musca/Lib_10X_Musca_larva_2019_02_25_M1_1sttry/TAR/TAR_reads.bed.gz.withDir.genes.refFlat
jobid: 7
wildcards: DATADIR=/Users/perrylabmac/M1_Musca, sample=Lib_10X_Musca_larva_2019_02_25_M1_1sttry
resources: tmpdir=/var/folders/kb/1djtgry52557wtjq8j15n_pm0000gn/T

Error in read.table(file = file, header = header, sep = sep, quote = quote, :
no lines available in input
Calls: read.delim -> read.table
Execution halted
[Tue Oct 12 09:22:09 2021]
Error in rule calcHMMrefFlat:
jobid: 7
output: /Users/perrylabmac/M1_Musca/Lib_10X_Musca_larva_2019_02_25_M1_1sttry/TAR/TAR_reads.bed.gz.withDir.genes.refFlat
shell:

	Rscript scripts/generate_refFlat_script_both.R /Users/perrylabmac/Musca_ref_genome_Cellranger/genes/genes.refFlat /Users/perrylabmac/M1_Musca/Lib_10X_Musca_larva_2019_02_25_M1_1sttry/TAR/TAR_reads.bed.gz
	
    (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /Users/perrylabmac/TAR-scRNA-seq/from_cellranger/.snakemake/log/2021-10-12T092208.761573.snakemake.log

Thanks!

Antoine

@fw262
Copy link
Owner

fw262 commented Oct 12, 2021

Can you share the content of your /Users/perrylabmac/M1_Musca/Lib_10X_Musca_larva_2019_02_25_M1_1sttry/TAR directory with a ls -hlt command? It looks like you have an empty file somewhere along the TAR discover step.

Thanks,
Michael

@adonatiucsd
Copy link
Author

Hi!

(base) Perrys-MacBook-Pro:TAR perrylabmac$ ls -hlt
total 0
lrwxr-xr-x 1 perrylabmac staff 123B Oct 12 15:23 TAR_reads.bed.gz -> /Users/perrylabmac/M1_Musca/Lib_10X_Musca_larva_2019_02_25_M1_1sttry/TAR/possorted_genome_bam_HMM_features/TAR_reads.bed.gz
drwxr-xr-x 7 perrylabmac staff 224B Oct 12 15:23 possorted_genome_bam_HMM_features

and the content of the inner folder:

(base) Perrys-MacBook-Pro:possorted_genome_bam_HMM_features perrylabmac$ ls -hlt
total 11993240
drwxr-xr-x 2 perrylabmac staff 64B Oct 12 15:23 toremove
-rw-r--r-- 1 perrylabmac staff 20B Oct 12 15:23 TAR_reads.bed.gz
-rw-r--r-- 1 perrylabmac staff 67B Oct 12 15:23 possorted_genome_bam_merge500.sorted.bed_count.gz
-rw-r--r-- 1 perrylabmac staff 61B Oct 12 15:23 possorted_genome_bam_merge500.sorted.bed.gz
-rw-r--r-- 1 perrylabmac staff 5.7G Oct 12 15:23 possorted_genome_bam_split.sorted.bed.gz

thanks,

Antoine

@fw262
Copy link
Owner

fw262 commented Oct 13, 2021

Can you make sure you have the necessary R packages, listed below, installed?

BiocManager
rtracklayer
groHMM
Seurat, version >= 4.0
data.table
dplyr
stringr

Can you also share what is inside the toremove folder with ls -lht?

Thanks,
michael

@adonatiucsd
Copy link
Author

Hi,

So I already had these R packages installed via Rstudio but I reinstalled them within miniconda3 just to make sure; I get the same error message as before.
The toremove folder is empty; nothing shows up when I type ls -lht

thanks,

Antoine

@fw262
Copy link
Owner

fw262 commented Oct 14, 2021

Can you share the SingleCellHMM_Run*.log file generated after implementing the awk fix? I would also recommend running the pipeline on a Linux system if possible.

Michael

@adonatiucsd
Copy link
Author

ok I will try to run this on Linux. Where can I find the SingleCellHMM log file?

@fw262
Copy link
Owner

fw262 commented Oct 14, 2021

I made a change to scripts/SingleCellHMM_MW.bash to fix a bug with the TAR discovery log file. If you update this file and re-run your pipeline, you should find the log file in /Users/perrylabmac/M1_Musca/Lib_10X_Musca_larva_2019_02_25_M1_1sttry/TAR/possorted_genome_bam_HMM_features/SingleCellHMM_Run.log.

Michael

@adonatiucsd
Copy link
Author

Here it is:

Path to SingleCellHMM.R /Users/perrylabmac/TAR-scRNA-seq/from_cellranger/scripts
INPUT_BAM /Users/perrylabmac/M1_Musca/Lib_10X_Musca_larva_2019_02_25_M1_1sttry/outs/possorted_genome_bam.bam
cellranger count folder /Users/perrylabmac/M1_Musca/Lib_10X_Musca_larva_2019_02_25_M1_1sttry/TAR
tmp folder /Users/perrylabmac/M1_Musca/Lib_10X_Musca_larva_2019_02_25_M1_1sttry/TAR/possorted_genome_bam_HMM_features
number Of thread 10
minimum coverage 59
thresholded at 1 in 10000000 reads

Reads spanning over splicing junction will join HMM blocks
To avoid that, split reads into small blocks before input to groHMM
Spliting and sorting reads...
zcat: can't stat: possorted_genome_bam_split.sorted.bed.gz (possorted_genome_bam_split.sorted.bed.gz.Z): No such file or directory
awk: syntax error at source line 1
context is
{print $0 &gt;&gt; &gt;&gt;&gt; "chr"$ <<< 1".bed"}
awk: illegal statement at source line 1
find: illegal option -- n
usage: find [-H | -L | -P] [-EXdsx] [-f path] path ... [expression]
find [-H | -L | -P] [-EXdsx] -f path [path ...] [expression]

Start to run groHMM on each individual chromosome...
chr*.bed

Merging HMM blocks within 500bp...
sort: No such file or directory

Calculating the coverage...
zcat: can't stat: possorted_genome_bam_split.sorted.bed.gz (possorted_genome_bam_split.sorted.bed.gz.Z): No such file or directory

Filtering the HMM blocks by coverage...

Please examine if major chromosomes are all present in the final TAR_reads.bed.gz file

zcat: can't stat: TAR_reads.bed.gz (TAR_reads.bed.gz.Z): No such file or directory

Link the final TAR_reads.bed.gz file to the working directory

Move intermediate files to /Users/perrylabmac/M1_Musca/Lib_10X_Musca_larva_2019_02_25_M1_1sttry/TAR/possorted_genome_bam_HMM_features/toremove ...

/Users/perrylabmac/M1_Musca/Lib_10X_Musca_larva_2019_02_25_M1_1sttry/TAR/possorted_genome_bam_HMM_features/toremove can be deleted if no error message in SingleCellHMM_Run log file and
all major chromosomes are present in the final TAR_reads.bed.gz file

@adonatiucsd
Copy link
Author

Hi,

So I still haven't managed to get it to work on MacOS but I was able to run the pipeline on Linux.
However when loading the TAR bed file on IGV I noticed that many scaffolds did not have anything in the TAR track; and from the groHMM log file it looks like the program did not go through all the 20487 scaffold of the Housefly genome:

Path to SingleCellHMM.R /home/adonati/TAR-scRNA-seq/from_cellranger/scripts
INPUT_BAM /home/adonati/Desktop/M1/Lib_10X_Musca_larva_2019_02_25_M1_1sttry/outs/possorted_genome_bam.bam
cellranger count folder /home/adonati/Desktop/M1/Lib_10X_Musca_larva_2019_02_25_M1_1sttry/TAR
tmp folder /home/adonati/Desktop/M1/Lib_10X_Musca_larva_2019_02_25_M1_1sttry/TAR/possorted_genome_bam_HMM_features
number Of thread 6
minimum coverage 59
thresholded at 1 in 10000000 reads

Reads spanning over splicing junction will join HMM blocks
To avoid that, split reads into small blocks before input to groHMM
Spliting and sorting reads...
Finished splitting input bam.
awk: cannot open "chrNW_004756140.1.bed" for output (Too many open files)

Start to run groHMM on each individual chromosome...
chrNW_004754939.1.bed
chrNW_004754940.1.bed
chrNW_004754941.1.bed
chrNW_004754943.1.bed
chrNW_004754965.1.bed
chrNW_004755053.1.bed
chrNW_004755054.1.bed
chrNW_004755111.1.bed
chrNW_004755164.1.bed
chrNW_004755198.1.bed
chrNW_004755220.1.bed
chrNW_004755231.1.bed
chrNW_004755318.1.bed
chrNW_004755333.1.bed
chrNW_004755352.1.bed
chrNW_004755353.1.bed
chrNW_004755386.1.bed
chrNW_004755398.1.bed
chrNW_004755497.1.bed
chrNW_004755531.1.bed
chrNW_004755586.1.bed
chrNW_004755622.1.bed
chrNW_004755631.1.bed
chrNW_004755653.1.bed
chrNW_004755684.1.bed
chrNW_004755784.1.bed
chrNW_004755797.1.bed
chrNW_004755819.1.bed
chrNW_004755838.1.bed
chrNW_004755853.1.bed
chrNW_004755919.1.bed
chrNW_004755924.1.bed
chrNW_004755942.1.bed
chrNW_004755953.1.bed
chrNW_004756008.1.bed
chrNW_004756052.1.bed
chrNW_004756054.1.bed
chrNW_004756076.1.bed
chrNW_004756109.1.bed
chrNW_004756120.1.bed

Merging HMM blocks within 500bp...

Calculating the coverage...

Filtering the HMM blocks by coverage...

Please examine if major chromosomes are all present in the final TAR_reads.bed.gz file

NW_004754939.1
NW_004754940.1
NW_004754941.1
NW_004754943.1
NW_004754965.1
NW_004755053.1
NW_004755054.1
NW_004755164.1
NW_004755198.1
NW_004755220.1
NW_004755231.1
NW_004755353.1
NW_004755386.1
NW_004755398.1
NW_004755497.1
NW_004755531.1
NW_004755586.1
NW_004755631.1
NW_004755653.1
NW_004755797.1
NW_004755819.1
NW_004755838.1
NW_004755853.1
NW_004755919.1
NW_004755924.1
NW_004755942.1
NW_004755953.1
NW_004756008.1
NW_004756052.1
NW_004756054.1
NW_004756076.1
NW_004756109.1

Link the final TAR_reads.bed.gz file to the working directory

Move intermediate files to /home/adonati/Desktop/M1/Lib_10X_Musca_larva_2019_02_25_M1_1sttry/TAR/possorted_genome_bam_HMM_features/toremove ...

/home/adonati/Desktop/M1/Lib_10X_Musca_larva_2019_02_25_M1_1sttry/TAR/possorted_genome_bam_HMM_features/toremove can be deleted if no error message in SingleCellHMM_Run log file and
all major chromosomes are present in the final TAR_reads.bed.gz file

gzip: possorted_genome_bam_split.sorted.bed.gz already has .gz suffix -- unchanged
gzip: TAR_reads.bed.gz already has .gz suffix -- unchanged

Looking at the TAR_reads. bed it seems many scaffold are absent indeed. Any idea why the program skipped so many scaffolds?

Thanks!

Antoine

@fw262
Copy link
Owner

fw262 commented Oct 19, 2021

Hi Antoine,

I'm glad the pipeline worked better on Linux. I updated scripts/SingleCellHMM_MW.bash so please try re-running with this new script. With over 20k scaffolds, it's overloading memory so I modified that script to close bed files that are generated. This may take a little longer to run but should work.

Best,
Michael

@adonatiucsd
Copy link
Author

Hi Michael,

I am trying to get the pipeline to work on UCSD supercomputer Expanse, I installed miniconda3 and all the software required except R, which I can't figure out how to install without sudo (which I can't use on Expanse); do you think the pipeline could work if I install R via conda?

Thanks!

Antoine

@fw262
Copy link
Owner

fw262 commented Oct 21, 2021

A conda installed R should work fine, as long as you have all the required R packages as well. Let me know if it works out.

Michael

@adonatiucsd
Copy link
Author

Hi Michael,

Sorry about the stupid question, but am I supposed to activate the scTAR_cellranger environment before running snakemake? I do have R and installed all the required libraries but within the scTAR_cellranger environement. The problem is that to run snakemake I need to use "conda activate snakemake", which deactivates scTAR_cellranger...
Thanks a lot!

Antoine

@adonatiucsd
Copy link
Author

Hi Michael,

I get this error when I run the pipeline after installing R with conda:

wildcards: CR_REF=/expanse/lustre/scratch/adonati2/temp_project/10XMusca/mapping/Musca_ref_genome
resources: tmpdir=/tmp

[Sat Oct 23 18:45:20 2021]
Finished job 9.
1 of 11 steps (9%) done
Select jobs to execute...

[Sat Oct 23 18:45:20 2021]
rule calcHMMrefFlat:
input: /expanse/lustre/scratch/adonati2/temp_project/10XMusca/mapping/M1/Lib_10X_Musca_larva_2019_02_25_M1_1sttry/TAR/TAR_reads.bed.gz, /expanse/lustre/scratch/adon$
output: /expanse/lustre/scratch/adonati2/temp_project/10XMusca/mapping/M1/Lib_10X_Musca_larva_2019_02_25_M1_1sttry/TAR/TAR_reads.bed.gz.withDir.genes.refFlat
jobid: 7
wildcards: DATADIR=/expanse/lustre/scratch/adonati2/temp_project/10XMusca/mapping/M1, sample=Lib_10X_Musca_larva_2019_02_25_M1_1sttry
resources: tmpdir=/tmp

/usr/bin/bash: line 1: Rscript: command not found
[Sat Oct 23 18:45:20 2021]
Error in rule calcHMMrefFlat:
jobid: 7
output: /expanse/lustre/scratch/adonati2/temp_project/10XMusca/mapping/M1/Lib_10X_Musca_larva_2019_02_25_M1_1sttry/TAR/TAR_reads.bed.gz.withDir.genes.refFlat
shell:

        Rscript scripts/generate_refFlat_script_both.R /expanse/lustre/scratch/adonati2/temp_project/10XMusca/mapping/Musca_ref_genome/genes/genes.refFlat /expa$

(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /home/adonati2/TAR-scRNA-seq/from_cellranger/.snakemake/log/2021-10-23T184509.175761.snakemake.log

it looks like the Rscript command is not working from within the snakemake pipeline even if I do have Rscript in miniconda3/bin...

thanks!

Antoine

@fw262
Copy link
Owner

fw262 commented Oct 25, 2021

Hi Antoine,

It looks like Rscript is not available within the snakemake environment. You can try simply replacing Rscript command within the rule calcHMMrefFlat of Snakefile to explicitly call Rscript in miniconda3/bin.

In regards to the snakemake environment, it may be worthwhile to install snakemake via pip so you can activate the scTAR_cellranger environment without conflicting with the snakemake environment.

Best,
Michael

@adonatiucsd
Copy link
Author

Hi Michael,

I had a problem with R because for some reason conda installed R 3.2
I managed to install R 4.1 with mamba and then I was able to run the pipeline
It ran for a full 24h on the computer before stopping before completion (max time reached): it created tones of .bed files (looks like one per scaffold) in /TAR/possorted_genome_bam_HMM_features
along with "possorted_genome_bam_split.sorted.bed.gz" and "SingleCellHMM_Run.log" which reads like:

Path to SingleCellHMM.R /home/adonati2/TAR-scRNA-seq/from_cellranger/scripts
INPUT_BAM /expanse/lustre/scratch/adonati2/temp_project/10XMusca/mapping/M1/Lib_10X_Musca_larva_2019_02_25_$
cellranger count folder /expanse/lustre/scratch/adonati2/temp_project/10XMusca/mapping/M1/Lib_10X_Musca_larva_2019_02_25_$
tmp folder /expanse/lustre/scratch/adonati2/temp_project/10XMusca/mapping/M1/Lib_10X_Musca_larva_2019_02_25_$
number Of thread 24
minimum coverage 59
thresholded at 1 in 10000000 reads

Reads spanning over splicing junction will join HMM blocks
To avoid that, split reads into small blocks before input to groHMM
Spliting and sorting reads...
Finished splitting input bam.

I was wondering if the pipeline is really supposed to generate and keep all these .bed files? Is it normal that the pipeline took so long even running on Expanse (I ran it on just one Node, which is two 64-core AMD EPYC 7742 processors and contain 256 GB of DDR4 memory)?
Thanks!

Antoine

@fw262
Copy link
Owner

fw262 commented Oct 27, 2021

Hi Antoine,

Yes, the pipeline is suppose to generate those bed files. The groHMM algorithm is run on individual chromosomes/scaffolds. There is a limitation with the tool if you have a very high number of scaffolds (over 20,000 in your case). Are you interested in the expression across all scaffolds? Perhaps you can filter for the major scaffolds where you'd expect the most expression.

Best,
Michael

@adonatiucsd
Copy link
Author

Hi Michael,

I guess it would be nice to have the TARs for all scaffold; here I have many scaffolds indeed but most of them are small, only 35 of them are over 1Mb, and the total genome size is only 750Mb; will the pipeline run slower with a more fragmented genome or is the speed only dependent on genome size?

@fw262
Copy link
Owner

fw262 commented Oct 27, 2021

Hi Antoine,

The pipeline would run slower with more fragmented genome. It wouldn't take as long if you ran the pipeline on, for example, the human or mouse genome.

Best,
Michael

@adonatiucsd
Copy link
Author

Hi Michael,

Are all these bedfiles deleted at the end of the pipeline or are they all merged together and zipped?
thanks

Antoine

@fw262
Copy link
Owner

fw262 commented Nov 1, 2021

Hi Antoine,

These bed files are zipped, not merged, and moved to the possorted_genome_bam_HMM_features/toremove folder which you can remove manually.

Best,
Michael

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants