-
Notifications
You must be signed in to change notification settings - Fork 31
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
IndexError: list index out of range during "Screening of possible de novo microRNAs" #146
Comments
I've had this same issue and I was able to fix it but I'm not sure it will work for everyone. I noticed that my FASTA files had an "@" symbol at the beginning of each header line so that the sequences looked something like this:
I removed For whatever reason this worked and ShortStack ran without any issues. |
See #136 ... FASTA headers that start with >@ are invalid with |
Just curious because these '>@' reads seem to be common : What software is making these FASTA files with '>@' headers? Would be good to add a feature request to change that. |
I don't actually know. These were sequences I got from Novogene, which use Illumina's sequencing platform. For small RNA-Seq they give you both raw FASTQs and cleaned/trimmed FASTAs. Whatever cleaning pipeline they use must add '>@' to any small RNA-Seq data they generate. |
Thanks for the suggestion. I just The other counfounding aspect of this is that I've used the same genome file and mirBase file with the same reads succesfully. The difference being the way the reads were trimmed... Additionally, the sequencing data we have is paired-end, so in addition to the different trimming parameters, I'm also testing using just the R1 reads or merged R1/R2 reads. Any chance this would lead to this error? Looking at the source code, it seems like something isn't getting parsed correctly: # Compute position of miR* in genomic coordinates given relative coordinates
s_gen_start, s_gen_stop = get_star_genomic(mir_locus, bedfields, s_rel_start, s_rel_stop)
if s_gen_start is None:
return None, None, None
# Now that the miR* position is known, the locus needs to be trimmed
# in size.
new_start = min(int(bedfields[1]) - 21, s_gen_start - 20)
new_end = max(int(bedfields[2]) + 20, s_gen_stop + 20)
new_locus = bedfields[0] + ':' + str(new_start) + '-' + str(new_end)
# It is possible that the new locus folds differently. So, re-analyze!
# Get the dotbracket
cmd = 'samtools faidx'
if bedfields[5] == '-':
cmd = cmd + ' -i'
cmd = cmd + f' {args.genomefile} {new_locus}'
cmd = cmd + f' | RNAfold --noPS'
foldjob = subprocess.run(cmd, shell=True, text=True, capture_output=True)
foldlines = foldjob.stdout.split('\n')
new_dotbracket = foldlines[2].rstrip().split(' ')[0] Thanks again to both of you for the suggestions! Much appreciated! |
Well, I've added some print statements to the # It is possible that the new locus folds differently. So, re-analyze!
# Get the dotbracket
cmd = 'samtools faidx'
if bedfields[5] == '-':
cmd = cmd + ' -i'
cmd = cmd + f' {args.genomefile} {new_locus}'
cmd = cmd + f' | RNAfold --noPS'
foldjob = subprocess.run(cmd, shell=True, text=True, capture_output=True)
foldlines = foldjob.stdout.split('\n')
# Debug prints
print("foldjob.stdout:", foldjob.stdout)
print("foldlines:", foldlines)
print("foldlines position 3:", foldlines[2]) # Print the third element of foldlines
print("Length of foldlines:", len(foldlines)) # Print the length of foldlines
new_dotbracket = foldlines[2].rstrip().split(' ')[0] Of course, this doesn't reveal too much. Here's an example of the output and the subsequent foldlines: ['>Porites_evermani_scaffold_4181:23753-23844', 'GCUGAGAUUGCUGUAGGCUGCUAUGAGCCUUGGGCGCAUCAGCGCAAGCUGUGUGCUUAAGCGUGAGUAGUUGUCUGCUGUGUUUCUUAGCU', '(((((((.(((.((((((.(((((..((((((((((((.((((....)))))))))))))).))..))))).)))))).)))..))))))). (-48.70)', '']
foldlines position 3: (((((((.(((.((((((.(((((..((((((((((((.((((....)))))))))))))).))..))))).)))))).)))..))))))). (-48.70)
Length of foldlines: 4
foldjob.stdout: >Porites_evermani_scaffold_875:120550-120743
AUAAUCAAGCUACAGUCUCUUUACGUUAGGAGCGCAGCUUAGAUGUUGAAUAAUGCUGCUUAUUCCUGCACCCAUCACUUUCUGUGGAGUCAUCUUCACAUAAUCAAGCCACAGUCUCUUUAUGUUGUAAGCGCAGCUCAGAUGUUGAAUUAAGCUGUCGACUCCUGCACCAAGAAGAGUGUUUUCUCCAUUGU
(((((..((..(((.((((((...(((((((((((((((((...(((.((((..(((((.......................(((((((....))))))).......((.((((.(.......)))))..))))))).....)))).)))))))))).)).)))))).)).))).))).)))...))..))))) (-45.90)
foldlines: ['>Porites_evermani_scaffold_875:120550-120743', 'AUAAUCAAGCUACAGUCUCUUUACGUUAGGAGCGCAGCUUAGAUGUUGAAUAAUGCUGCUUAUUCCUGCACCCAUCACUUUCUGUGGAGUCAUCUUCACAUAAUCAAGCCACAGUCUCUUUAUGUUGUAAGCGCAGCUCAGAUGUUGAAUUAAGCUGUCGACUCCUGCACCAAGAAGAGUGUUUUCUCCAUUGU', '(((((..((..(((.((((((...(((((((((((((((((...(((.((((..(((((.......................(((((((....))))))).......((.((((.(.......)))))..))))))).....)))).)))))))))).)).)))))).)).))).))).)))...))..))))) (-45.90)', '']
foldlines position 3: (((((..((..(((.((((((...(((((((((((((((((...(((.((((..(((((.......................(((((((....))))))).......((.((((.(.......)))))..))))))).....)))).)))))))))).)).)))))).)).))).))).)))...))..))))) (-45.90)
Length of foldlines: 4
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/home/sam/programs/mambaforge/envs/ShortStack-4.0.3_env/lib/python3.10/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "/home/sam/programs/mambaforge/envs/ShortStack-4.0.3_env/lib/python3.10/multiprocessing/pool.py", line 51, in starmapstar
return list(itertools.starmap(args[0], args[1]))
File "/home/sam/programs/mambaforge/envs/ShortStack-4.0.3_env/bin/ShortStack", line 2391, in mir_analysis
new_locus, s_start, s_end = analyze_fold(mir_locus, dotbracket, bedfields, merged_bam, args)
File "/home/sam/programs/mambaforge/envs/ShortStack-4.0.3_env/bin/ShortStack", line 2958, in analyze_fold
print("foldlines position 3:", foldlines[2]) # Print the third element of foldlines
IndexError: list index out of range
""" Would you be able to suggest which input file(s) I could examine to try to determine what's going wrong? |
Thanks for your persistence in trying to isolate this bug. It has me stumped too. I'm no expert but I seem to recall that debugging by "print" statements can be wonkly for processes that are part of multiprcessing pools in python. The error message and program halt may need be synchronized with the printed parts. Anyway, let's start here: Can you show the first 10-20 lines of each of your input files? e.g.
Otherwise, to reproduce the error on my end, I'd need to have you share all of these input files with me (post to cloud somewhere, private for me). The error is clearly the result of an empty line being produced by a call to RNAfold, but I don't quite understand why. |
Thanks so much for looking into this! It greatly appreciated. I'll get you the outputs from those files listed in a few minutes. In the meantime, if you want to try stuff yourself, here are links to files:
|
(base) sam@raven:/home/shared/8TB_HDD_01/sam/gitrepos/deep-dive$ head E-Peve/data/Porites_evermanni_v1.fa
>Porites_evermani_scaffold_1
GGCGGGGGGGGGGGGGGGGGGGGTACTCCCATACATTACCTATACGGGTATGTGCCGCCC
AAAAGGGGCCGTGATTTTGAAGCTCCTGATTTAGAACGGGGTATCCATTTCAGAGGCGTT
TTCTAGAACGGGGTGTAATATTTCGAACGCACGAAAGCTCCACTTTTGTGTAAGCAGCCA
TTTGAAATTATTCAAGGACAGATTGCTTTTAAAAATACGGTTCAGCGCGTTAACAAGCAA
ACCGTTGTACTCTTGTTGCACCCTAGAACGGTGTATAAAAAATTGGCCCATTTCTAGAAC
GGGGTATCAGTTTTAGGGAGAATTCTAGAACGGGGTATAAAAAATTGGCCCTTTTCTGAA
CGGGGCATCAATGTTAGGGGAAATTTTTTCCAGAACGGGGTGCCAATTTGGAGTCCCGGG
CGGCACATACCCACCCAAAAAATACCCAAGTGCCCCCCCGGGGTCTAAACCCACATATTC
TTCACACTGTTCACAATTTACCTCTTTTGGCTCTTCTAAGGAGAGCTCATCTAAATATTG (base) sam@raven:/home/shared/8TB_HDD_01/sam/gitrepos/deep-dive$ head data/mirbase-mature-v22.1.fa
>cel-let-7-5p MIMAT0000001 Caenorhabditis elegans let-7-5p
UGAGGUAGUAGGUUGUAUAGUU
>cel-let-7-3p MIMAT0015091 Caenorhabditis elegans let-7-3p
CUAUGCAAUUUUCUACCUUACC
>cel-lin-4-5p MIMAT0000002 Caenorhabditis elegans lin-4-5p
UCCCUGAGACCUCAAGUGUGA
>cel-lin-4-3p MIMAT0015092 Caenorhabditis elegans lin-4-3p
ACACCUGGGCUCUCCGGGUACC
>cel-miR-1-5p MIMAT0020301 Caenorhabditis elegans miR-1-5p
CAUACUUCCUUACAUGCCCAUA (base) sam@raven:/home/shared/8TB_HDD_01/sam/gitrepos/deep-dive$ zcat E-Peve/output/06.1-Peve-sRNAseq-trimming-R1-only/trimmed-reads/sRNA-POR-73-S1-TP2_R1_001.fastp-R1-31bp-auto_adapters-polyG.fq.gz | head
@GWNJ-1013:680:GW2305123155th:1:1101:6171:1000 1:N:0:GTGGCCAT
TNCCGTAGATCCGAACTTGT
+
F#FFFFFFFFFFFFFFFFFF
@GWNJ-1013:680:GW2305123155th:1:1101:13349:1000 1:N:0:GTGGCCAT
ANCACTGATGACTGTTCAGTTTTTCTGAATT
+
F#FFFFFFFFF,FFFFFFFFFFFFFFFFFFF
@GWNJ-1013:680:GW2305123155th:1:1101:15085:1000 1:N:0:GTGGCCAT
TNAGGTCTAGGCTGGTTAGTTT (base) sam@raven:/home/shared/8TB_HDD_01/sam/gitrepos/deep-dive$ zcat E-Peve/output/06.1-Peve-sRNAseq-trimming-R1-only/trimmed-reads/sRNA-POR-79-S1-TP2_R1_001.fastp-R1-31bp-auto_adapters-polyG.fq.gz | head
@GWNJ-1013:680:GW2305123155th:1:1101:2917:1000 1:N:0:CGTACGAT
TNAAAATCTTTGCTCTGAAGTGGAA
+
F#FFFFFFFFFFFFFFFFFFFFFFF
@GWNJ-1013:680:GW2305123155th:1:1101:3640:1000 1:N:0:CGTACGAT
GNACTGGTGGTTCAGTGGTAGAATTCTCGCC
+
F#FFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@GWNJ-1013:680:GW2305123155th:1:1101:7672:1000 1:N:0:CGTACGAT
TNCCAGTCACATCCTTTACTTTGGCA (base) sam@raven:/home/shared/8TB_HDD_01/sam/gitrepos/deep-dive$ zcat E-Peve/output/06.1-Peve-sRNAseq-trimming-R1-only/trimmed-reads/sRNA-POR-82-S1-TP2_R1_001.fastp-R1-31bp-auto_adapters-polyG.fq.gz | head
@GWNJ-1013:680:GW2305123155th:1:1101:1217:1000 1:N:0:GAGTGGAT
GNGGTCCGGACGGAGGAGGGTTAT
+
F#FFFFFFFFFFFFFFFFFFFFFF
@GWNJ-1013:680:GW2305123155th:1:1101:4797:1000 1:N:0:GAGTGGAT
TNTCGGCTCACCAATCTCTGCT
+
F#FFFFFFFFFFFFFFFFFFFF
@GWNJ-1013:680:GW2305123155th:1:1101:5231:1000 1:N:0:GAGTGGAT
TNGGCGACCTTTGAT Again, thank you for taking the time to investigate!! |
These files all look fine to me. Not the news you want, but .... I just ran your data on my system with the following call and was not able to reproduce your error .. it ran to completion with no problem, no errors.
Here's a guess: ShortStack uses a lot of subprocess.run calls, including at the RNAfold job that is causing you pain. I wonder if your compute cluster has some funny business with permissions of sub-processes? It's a stab in the dark. But, I can't reproduce the error here with your data nor can I see anything obvious in my code. Here's another guess: Maybe also versioning issue of a dependency? Here are my versions:
If it is version issue RNAfold or samtools would be my two most likely suspects. |
Here's the conda environment I'm running: conda list
# packages in environment at /home/sam/programs/mambaforge/envs/ShortStack-4.0.3_env:
#
# Name Version Build Channel
_libgcc_mutex 0.1 conda_forge conda-forge
_openmp_mutex 4.5 2_gnu conda-forge
bedtools 2.31.1 hf5e1c6e_0 bioconda
biopython 1.83 py310h2372a71_0 conda-forge
bowtie 1.3.1 py310h7b97f60_6 bioconda
bzip2 1.0.8 hd590300_5 conda-forge
c-ares 1.26.0 hd590300_0 conda-forge
ca-certificates 2024.2.2 hbcca054_0 conda-forge
cffi 1.16.0 py310h2fee648_0 conda-forge
colorama 0.4.6 pyhd8ed1ab_0 conda-forge
cutadapt 4.6 py310h4b81fae_1 bioconda
dnaio 1.2.0 py310h4b81fae_0 bioconda
htslib 1.19.1 h81da01d_1 bioconda
icu 73.2 h59595ed_0 conda-forge
isa-l 2.31.0 hd590300_1 conda-forge
keyutils 1.6.1 h166bdaf_0 conda-forge
krb5 1.21.2 h659d440_0 conda-forge
ld_impl_linux-64 2.40 h41732ed_0 conda-forge
libblas 3.9.0 21_linux64_openblas conda-forge
libcblas 3.9.0 21_linux64_openblas conda-forge
libcurl 8.5.0 hca28451_0 conda-forge
libdeflate 1.18 h0b41bf4_0 conda-forge
libedit 3.1.20191231 he28a2e2_2 conda-forge
libev 4.33 hd590300_2 conda-forge
libffi 3.4.2 h7f98852_5 conda-forge
libgcc-ng 13.2.0 h807b86a_5 conda-forge
libgfortran-ng 13.2.0 h69a702a_5 conda-forge
libgfortran5 13.2.0 ha4646dd_5 conda-forge
libgomp 13.2.0 h807b86a_5 conda-forge
libhwloc 2.9.3 default_h554bfaf_1009 conda-forge
libiconv 1.17 hd590300_2 conda-forge
liblapack 3.9.0 21_linux64_openblas conda-forge
libnghttp2 1.58.0 h47da74e_1 conda-forge
libnsl 2.0.1 hd590300_0 conda-forge
libopenblas 0.3.26 pthreads_h413a1c8_0 conda-forge
libpng 1.6.42 h2797004_0 conda-forge
libsqlite 3.45.1 h2797004_0 conda-forge
libssh2 1.11.0 h0841786_0 conda-forge
libstdcxx-ng 13.2.0 h7e041cc_5 conda-forge
libuuid 2.38.1 h0b41bf4_0 conda-forge
libxcrypt 4.4.36 hd590300_1 conda-forge
libxml2 2.12.5 h232c23b_0 conda-forge
libzlib 1.2.13 hd590300_5 conda-forge
mysql-connector-c 6.1.11 h659d440_1008 conda-forge
ncurses 6.4 h59595ed_2 conda-forge
numpy 1.26.4 py310hb13e2d6_0 conda-forge
openssl 3.2.1 hd590300_0 conda-forge
pbzip2 1.1.13 h1fcc475_2 conda-forge
perl 5.32.1 7_hd590300_perl5 conda-forge
pigz 2.8 h2797004_0 conda-forge
pip 24.0 pyhd8ed1ab_0 conda-forge
pycparser 2.21 pyhd8ed1ab_0 conda-forge
python 3.10.13 hd12c33a_1_cpython conda-forge
python-isal 1.5.3 py310h2372a71_0 conda-forge
python_abi 3.10 4_cp310 conda-forge
readline 8.2 h8228510_1 conda-forge
samtools 1.19.2 h50ea8bc_0 bioconda
setuptools 69.0.3 pyhd8ed1ab_0 conda-forge
shortstack 4.0.3 hdfd78af_0 bioconda
shorttracks 1.1 hdfd78af_0 bioconda
strucvis 0.7 hdfd78af_0 bioconda
tbb 2021.11.0 h00ab1b0_1 conda-forge
tk 8.6.13 noxft_h4845f30_101 conda-forge
tqdm 4.66.2 pyhd8ed1ab_0 conda-forge
tzdata 2024a h0c530f3_0 conda-forge
ucsc-wigtobigwig 447 h2a80c09_1 bioconda
viennarna 2.6.4 py310pl5321h6cc9453_0 bioconda
wheel 0.42.0 pyhd8ed1ab_0 conda-forge
xopen 1.9.0 py310hff52083_0 conda-forge
xz 5.2.6 h166bdaf_0 conda-forge
zlib 1.2.13 hd590300_5 conda-forge
zstandard 0.22.0 py310h1275a96_0 conda-forge
zstd 1.5.5 hfc55251_0 conda-forge |
Ha! True, but it's honestly great that it ran! It means it can be done!! Thanks! We'll test this out on another machine and see how it goes. Currently, have been running on a machine with Ubuntu 18.04LTS. |
Well, the "fix" for this turned out to be to run this via the command line! I've been running ShortStack in an R Markdown document, as part of an R Project in our current project repo. Despite the fact that we've been successfully running ShortStack in this way for other genomes/trimming params, for some reason it must be causing this particular combination to fail. Again, thanks for your help/time investigating this issue and thanks for publishing such a cool/useful piece of software. It's all greatly appreciated. Sincerely. |
Gah! Spoke too soon!!! The command you tested (as well as myself, after your solution worked) omitted an option I had been running So, I guess the good news is that we now know that it's the |
OK, thanks. I was able to reproduce the error using this command:
I am investigating. |
Thanks again for your persistence. I found the bug and fixed it, as of commit 4c6dd4a This fix will be incorporated in the next release, 4.0.4, which should be soon. |
Thank you! We really, really appreciate it! |
I'm getting the above listed error. This is odd, as I've actually run this successfully previously, with a different subset of the same input reads!
If you happen to have any insight, I'd greatly appreciate it.
Error messages:
Full error
shortstack.log
:Full set of commands:
The text was updated successfully, but these errors were encountered: