Error: _Map_base::at #31

nick-youngblut · 2024-04-12T16:20:40Z

During the Reading Gene Annotations step in the sc_long_pipeline() workflow, I'm getting a _Map_base::at error.

I'm using a decompressed fastq for the BLAZE output (ran that stand-alone). I'm using GCF_000001405.40_GRCh38.p14 for the reference: (GCF_000001405.40_GRCh38.p14_genomic.fna.gz and GCF_000001405.40_GRCh38.p14_genomic.gtf.gz).

My config:

config_file = FLAMES::create_config(
  config_dir,
  type = "sc_3end",
  do_barcode_demultiplex = FALSE,
  threads = 12
)

My sc_long_pipeline() job:

sce = FLAMES::sc_long_pipeline(
    fastq = fastq_input,
    genome_fa = ref_fna_input,
    annotation = ref_annot_input,
    outdir = work_dir,
    config = config_file,
    minimap2 = minimap2_path,
    k8 = k8_path,
    expect_cell_number = 8000
)

My console output:

Skipping Demultiplexing step...
Please make sure the ` /home/rstudio/workspace//data/SspArc0008_10x_cDNA_longRead//blaze_output/matched_reads.fastq `` is the the demultiplexing output from previous FLAEMS call.
Running FLAMES pipeline...
#### Input parameters:
{
  "pipeline_parameters": {
    "seed": [2022],
    "threads": [12],
    "do_barcode_demultiplex": [false],
    "do_gene_quantification": [true],
    "do_genome_alignment": [true],
    "do_isoform_identification": [true],
    "bambu_isoform_identification": [false],
    "multithread_isoform_identification": [true],
    "do_read_realignment": [true],
    "do_transcript_quantification": [true]
  },
  "barcode_parameters": {
    "max_bc_editdistance": [2],
    "max_flank_editdistance": [8],
    "pattern": {
      "primer": ["CTACACGACGCTCTTCCGATCT"],
      "BC": ["NNNNNNNNNNNNNNNN"],
      "UMI": ["NNNNNNNNNNNN"],
      "polyT": ["TTTTTTTTT"]
    },
    "TSO_seq": ["CCCATGTACTCTGCGTTGATACCACTGCTT"],
    "TSO_prime": [3],
    "full_length_only": [false]
  },
  "isoform_parameters": {
    "generate_raw_isoform": [false],
    "max_dist": [10],
    "max_ts_dist": [100],
    "max_splice_match_dist": [10],
    "min_fl_exon_len": [40],
    "max_site_per_splice": [3],
    "min_sup_cnt": [5],
    "min_cnt_pct": [0.001],
    "min_sup_pct": [0.2],
    "bambu_trust_reference": [true],
    "strand_specific": [0],
    "remove_incomp_reads": [4],
    "downsample_ratio": [1]
  },
  "alignment_parameters": {
    "use_junctions": [true],
    "no_flank": [false]
  },
  "realign_parameters": {
    "use_annotation": [true]
  },
  "transcript_counting": {
    "min_tr_coverage": [0.4],
    "min_read_coverage": [0.4]
  }
} 
gene annotation: /home/rstudio/workspace//data/references/human/FLAMES/GCF_000001405.40_GRCh38.p14_genomic.gtf.gz 
genome fasta: /home/rstudio/workspace//data/references/human/FLAMES/GCF_000001405.40_GRCh38.p14_genomic.fna.gz 
input fastq: /home/rstudio/workspace//data/SspArc0008_10x_cDNA_longRead//blaze_output/matched_reads.fastq 
output directory: /home/rstudio/workspace//data/SspArc0008_10x_cDNA_longRead//flames 
minimap2 path: /home/rstudio/miniconda3/bin/minimap2 
k8 path: /home/rstudio/miniconda3/bin/k8 
#### Aligning reads to genome using minimap2
02:49:30 PM Fri Apr 12 2024 minimap2_align
[M::mm_idx_gen::52.772*2.11] collected minimizers
[M::mm_idx_gen::59.234*3.06] sorted minimizers
[M::main::59.234*3.06] loaded/built the index for 705 target sequence(s)
[M::mm_mapopt_update::60.414*3.02] mid_occ = 2177
[M::mm_idx_stat] kmer size: 14; skip: 5; is_hpc: 0; #seq: 705
[M::mm_idx_stat::61.052*3.00] distinct minimizers: 65648619 (22.00% are singletons); average occurrences: 16.258; average spacing: 3.090; total length: 3298430636
[M::worker_pipeline::305.468*9.91] mapped 482699 sequences
[M::worker_pipeline::539.678*10.81] mapped 485246 sequences
[M::worker_pipeline::772.824*11.17] mapped 494113 sequences
[M::worker_pipeline::1008.325*11.36] mapped 492944 sequences
[M::worker_pipeline::1250.776*11.48] mapped 487522 sequences
[M::worker_pipeline::1515.925*11.57] mapped 468694 sequences
[M::worker_pipeline::1778.447*11.64] mapped 468982 sequences
[M::worker_pipeline::2041.090*11.68] mapped 468921 sequences
[M::worker_pipeline::2305.636*11.72] mapped 468068 sequences
[M::worker_pipeline::2566.991*11.75] mapped 469711 sequences
[M::worker_pipeline::2814.316*11.77] mapped 476070 sequences
[M::worker_pipeline::3044.860*11.79] mapped 482292 sequences
[M::worker_pipeline::3277.376*11.80] mapped 482075 sequences
[M::worker_pipeline::3507.651*11.81] mapped 482633 sequences
[M::worker_pipeline::3738.925*11.82] mapped 482293 sequences
[M::worker_pipeline::3910.235*11.82] mapped 377902 sequences
[M::main] Version: 2.28-r1209
[M::main] CMD: /home/rstudio/miniconda3/bin/minimap2 -ax splice -t 12 -k14 --secondary=no --seed 2022 --junc-bed /home/rstudio/workspace//data/SspArc0008_10x_cDNA_longRead//flames/tmp_splice_anno.bed12 --junc-bonus 1 /home/rstudio/workspace//data/references/human/FLAMES/GCF_000001405.40_GRCh38.p14_genomic.fna.gz /home/rstudio/workspace//data/SspArc0008_10x_cDNA_longRead//blaze_output/matched_reads.fastq
[M::main] Real time: 3910.908 sec; CPU: 46208.963 sec; Peak RSS: 21.147 GB
[bam_sort_core] merging from 17 files and 1 in-memory blocks...
04:02:56 PM Fri Apr 12 2024 Start gene quantification and UMI deduplication
04:02:56 PM Fri Apr 12 2024 quantify genes 
Found genome alignment file(s): 	align2genome.bam
Connected to your session in progress, last started 2024-Apr-12 14:48:40 UTC (1 hour ago)
Assigning reads to genes...
Processed: 100%|██████████| 40375/40375 [02:41<00:00, 249.45gene_group/s]
Finalising the gene count matrix ...
Plotting the saturation curve ...
Generating deduplicated fastq file ...
Processed: 7572500.0Read [01:27, 87015.60Read/s] 
04:14:35 PM Fri Apr 12 2024 Gene quantification and UMI deduplication done!
04:14:35 PM Fri Apr 12 2024 Start isoform identificaiton
04:14:35 PM Fri Apr 12 2024 find_isoform
#### Reading Gene Annotations
Error: _Map_base::at

I'm using the patched version of FLAMES from #26.

My sessionInfo:

R version 4.3.1 (2023-06-16)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.4 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: Etc/UTC
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] FLAMES_1.9.2

loaded via a namespace (and not attached):
  [1] BiocIO_1.12.0               bitops_1.0-7                filelock_1.0.2             
  [4] tibble_3.2.1                R.oo_1.25.0                 basilisk.utils_1.14.1      
  [7] bambu_3.4.0                 graph_1.80.0                XML_3.99-0.14              
 [10] rpart_4.1.19                lifecycle_1.0.3             edgeR_4.0.16               
 [13] doParallel_1.0.17           OrganismDbi_1.44.0          globals_0.16.2             
 [16] lattice_0.21-8              ensembldb_2.26.0            MultiAssayExperiment_1.28.0
 [19] backports_1.4.1             magrittr_2.0.3              limma_3.58.1               
 [22] Hmisc_5.1-1                 rmarkdown_2.25              yaml_2.3.7                 
 [25] metapod_1.10.1              reticulate_1.34.0           ggbio_1.50.0               
 [28] cowplot_1.1.1               DBI_1.1.3                   RColorBrewer_1.1-3         
 [31] abind_1.4-5                 zlibbioc_1.48.2             GenomicRanges_1.54.1       
 [34] purrr_1.0.2                 R.utils_2.12.3              AnnotationFilter_1.26.0    
 [37] biovizBase_1.50.0           BiocGenerics_0.48.1         RCurl_1.98-1.12            
 [40] nnet_7.3-19                 VariantAnnotation_1.48.1    rappdirs_0.3.3             
 [43] circlize_0.4.15             GenomeInfoDbData_1.2.11     IRanges_2.36.0             
 [46] S4Vectors_0.40.2            ggrepel_0.9.4               irlba_2.3.5.1              
 [49] listenv_0.9.0               dqrng_0.3.1                 parallelly_1.36.0          
 [52] DelayedMatrixStats_1.24.0   codetools_0.2-19            DropletUtils_1.22.0        
 [55] DelayedArray_0.28.0         scuttle_1.12.0              xml2_1.3.5                 
 [58] tidyselect_1.2.0            shape_1.4.6                 viridis_0.6.4              
 [61] ScaledMatrix_1.10.0         matrixStats_1.0.0           stats4_4.3.1               
 [64] BiocFileCache_2.10.2        base64enc_0.1-3             GenomicAlignments_1.38.2   
 [67] jsonlite_1.8.7              BiocNeighbors_1.20.2        GetoptLong_1.0.5           
 [70] Formula_1.2-5               scater_1.30.1               iterators_1.0.14           
 [73] foreach_1.5.2               tools_4.3.1                 progress_1.2.2             
 [76] Rcpp_1.0.11                 glue_1.7.0                  gridExtra_2.3              
 [79] SparseArray_1.2.4           xfun_0.40                   MatrixGenerics_1.14.0      
 [82] GenomeInfoDb_1.38.8         dplyr_1.1.4                 HDF5Array_1.30.1           
 [85] withr_2.5.1                 BiocManager_1.30.22         fastmap_1.1.1              
 [88] GGally_2.1.2                basilisk_1.14.3             bluster_1.12.0             
 [91] rhdf5filters_1.14.1         fansi_1.0.5                 rsvd_1.0.5                 
 [94] digest_0.6.33               R6_2.5.1                    colorspace_2.1-0           
 [97] dichromat_2.0-0.1           biomaRt_2.58.2              RSQLite_2.3.2              
[100] R.methodsS3_1.8.2           utf8_1.2.4                  tidyr_1.3.1                
[103] generics_0.1.3              data.table_1.14.8           rtracklayer_1.62.0         
[106] prettyunits_1.2.0           httr_1.4.7                  htmlwidgets_1.6.2          
[109] S4Arrays_1.2.1              pkgconfig_2.0.3             gtable_0.3.4               
[112] blob_1.2.4                  ComplexHeatmap_2.18.0       SingleCellExperiment_1.24.0
[115] XVector_0.42.0              htmltools_0.5.6.1           RBGL_1.78.0                
[118] ProtGenerics_1.34.0         clue_0.3-65                 scales_1.3.0               
[121] Biobase_2.62.0              png_0.1-8                   scran_1.30.2               
[124] knitr_1.44                  rstudioapi_0.15.0           reshape2_1.4.4             
[127] rjson_0.2.21                checkmate_2.3.0             curl_5.1.0                 
[130] cachem_1.0.8                rhdf5_2.46.1                GlobalOptions_0.1.2        
[133] stringr_1.5.1               vipor_0.4.5                 parallel_4.3.1             
[136] foreign_0.8-84              AnnotationDbi_1.64.1        restfulr_0.0.15            
[139] pillar_1.9.0                grid_4.3.1                  reshape_0.8.9              
[142] vctrs_0.6.4                 BiocSingular_1.18.0         dbplyr_2.4.0               
[145] beachmat_2.18.1             cluster_2.1.4               beeswarm_0.4.0             
[148] htmlTable_2.4.2             evaluate_0.22               GenomicFeatures_1.54.4     
[151] cli_3.6.1                   locfit_1.5-9.8              compiler_4.3.1             
[154] Rsamtools_2.18.0            rlang_1.1.1                 crayon_1.5.2               
[157] ggbeeswarm_0.7.2            plyr_1.8.9                  stringi_1.7.12             
[160] viridisLite_0.4.2           BiocParallel_1.36.0         munsell_0.5.0              
[163] Biostrings_2.70.3           lazyeval_0.2.2              Matrix_1.6-5               
[166] dir.expiry_1.10.0           BSgenome_1.70.2             hms_1.1.3                  
[169] sparseMatrixStats_1.14.0    bit64_4.0.5                 future_1.33.0              
[172] ggplot2_3.5.0               Rhdf5lib_1.24.2             KEGGREST_1.42.0            
[175] statmod_1.5.0               SummarizedExperiment_1.32.0 igraph_1.5.1               
[178] memoise_2.0.1               bit_4.0.5                   xgboost_1.7.5.1

The text was updated successfully, but these errors were encountered:

nick-youngblut · 2024-04-12T16:23:53Z

Note: I'm using the gtf file instead of gff based on #27

ChangqingW · 2024-04-15T00:23:07Z

Seems to be a bug in the new Rcpp implementation. I have not been able to locate the bug, but you could edit the config file to either switch multithread_isoform_identification to false (to use the old python implementation) or switch bambu_isoform_identification to true (to use bambu instead)

nick-youngblut · 2024-04-15T16:50:38Z

If I use bambu_isoform_identification = TRUE, then I get the following error:

Error in check_arguments(annotation, fastq, genome_bam, outdir, genome_fa,  : 
  Bambu requires GTF format for annotation file.

I'm using a GTF file for the annotation input, but it is gzip'd, which I'm guessing is causing the error.

ChangqingW · 2024-04-16T00:50:28Z

Hmm, bambu can probably read gziped annotation but the custom parser in our legacy code won't. I'll specify that in the docs.

For the multithread_isoform_identification implementation I got a Error: unordered_map::at error with plain GTF but there seems to a ton of unordered_map calls in the functions and I cannot locate where the issue is at the moment

nick-youngblut · 2024-04-18T15:24:18Z

Moreover, I get the following error even when using an uncompressed gtf file:

[E::fai_build3_core] Cannot index files compressed with gzip, please use bgzip

It appears that the genome_fa must be bgzip'd or uncompressed. Do you think that you will just create uncompressed temporary files for these reference files, at least when the user selects bambu?

nick-youngblut · 2024-06-06T03:57:25Z

error with plain GTF but there seems to a ton of unordered_map calls in the functions and I cannot locate where the issue is at the moment

Any updates on this?

Same for the please use bgzip error: is it necessary for FLAMES to create a temporary uncompressed (or bgzip-compressed) file when the user selects bambu?

ChangqingW · 2024-06-13T00:29:39Z

Same for the please use bgzip error: is it necessary for FLAMES to create a temporary uncompressed (or bgzip-compressed) file when the user selects bambu?

Sounds reasonable, I think it assumes everything is uncompressed at the moment

maxim-h · 2024-07-14T12:33:40Z

Some printf debugging led me to believe the problem is coming from the get_gene_blocks function around here, because not all genes present in chr_to_gene map are present in the gene_dict map for some reason.

maxim-h · 2024-07-15T09:44:47Z

@ChangqingW
So, in my case the issue is coming from the get_gene_blocks as I described before. The gene it's failing at is a small mitochondrial gene with only 1 transcript containing only 1 exon.

Looks like it's missing from gene_dict because it's also not present in the gene_to_transcript map after remove_similar_tr was ran. Pretty sure on this line:

FLAMES/src/classes/junctions.cpp

Line 101 in 8510bdb

if (transcript.size() < 2) continue;

the whole outer loop iteration is skipped, including this block

FLAMES/src/classes/junctions.cpp

Lines 126 to 133 in 8510bdb

    
           // Create a new transcriptvector of the transcripts that weren't removed 
        
           out_gene_to_transcript[gene] = {}; 
        
           for (int i = 0; i < (int)transcript.size(); i++) { 
        
               // Add this transcript to the transcriptvector if it wasn't marked as a duplicate 
        
               if (dup_set.count(i) == 0) { 
        
                   out_gene_to_transcript.at(gene).push_back(transcript.at(i)); 
        
               } 
        
           }

that is supposed to add the gene to the new out_gene_to_transcript map.

I'm not quite sure if the intention behind that continue was to skip those genes entirely or to only skip the isoform comparison. If it's the former you'd need to add some condition downstream that skips the genes not present in gene_to_transcript and gene_dict, and if the latter you need to change the continue statement to only apply to the inner loop.

…ltithread. #31

OliverVoogd · 2024-07-24T04:36:12Z

@maxim-h
Thanks for your suggestions! It looks like remove_similar_tr was ignoring genes with only 1 transcript, which was not the intended behaviour. I've updated the relevant function in 6412113

…| fixed some minor log messages

ChangqingW mentioned this issue Jul 12, 2024

Error with reading gene annotation file #37

Closed

maxim-h mentioned this issue Jul 12, 2024

Fix GFF parser for gencode #39

Merged

ChangqingW assigned OliverVoogd Jul 22, 2024

ChangqingW added the bug Something isn't working label Jul 22, 2024

OliverVoogd added a commit that referenced this issue Jul 24, 2024

genes with only 1 transcript are no longer removed in find_isoform_mu…

6412113

…ltithread. #31

OliverVoogd added a commit that referenced this issue Aug 8, 2024

Added unzip .gtf to temp file for bambu isoform identification. #31

ea9d870

OliverVoogd added a commit that referenced this issue Aug 8, 2024

#31 added output message and correct use of unzipped annotation file …

ad3a218

…| fixed some minor log messages

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Error: _Map_base::at #31

Error: _Map_base::at #31

nick-youngblut commented Apr 12, 2024

nick-youngblut commented Apr 12, 2024 •

edited

Loading

ChangqingW commented Apr 15, 2024 •

edited

Loading

nick-youngblut commented Apr 15, 2024

ChangqingW commented Apr 16, 2024 •

edited

Loading

nick-youngblut commented Apr 18, 2024

nick-youngblut commented Jun 6, 2024

ChangqingW commented Jun 13, 2024

maxim-h commented Jul 14, 2024 •

edited

Loading

maxim-h commented Jul 15, 2024 •

edited

Loading

OliverVoogd commented Jul 24, 2024

Error: _Map_base::at #31

Error: _Map_base::at #31

Comments

nick-youngblut commented Apr 12, 2024

nick-youngblut commented Apr 12, 2024 • edited Loading

ChangqingW commented Apr 15, 2024 • edited Loading

nick-youngblut commented Apr 15, 2024

ChangqingW commented Apr 16, 2024 • edited Loading

nick-youngblut commented Apr 18, 2024

nick-youngblut commented Jun 6, 2024

ChangqingW commented Jun 13, 2024

maxim-h commented Jul 14, 2024 • edited Loading

maxim-h commented Jul 15, 2024 • edited Loading

OliverVoogd commented Jul 24, 2024

nick-youngblut commented Apr 12, 2024 •

edited

Loading

ChangqingW commented Apr 15, 2024 •

edited

Loading

ChangqingW commented Apr 16, 2024 •

edited

Loading

maxim-h commented Jul 14, 2024 •

edited

Loading

maxim-h commented Jul 15, 2024 •

edited

Loading