Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error: _Map_base::at #31

Open
nick-youngblut opened this issue Apr 12, 2024 · 10 comments
Open

Error: _Map_base::at #31

nick-youngblut opened this issue Apr 12, 2024 · 10 comments
Assignees
Labels
bug Something isn't working

Comments

@nick-youngblut
Copy link

During the Reading Gene Annotations step in the sc_long_pipeline() workflow, I'm getting a _Map_base::at error.

I'm using a decompressed fastq for the BLAZE output (ran that stand-alone). I'm using GCF_000001405.40_GRCh38.p14 for the reference: (GCF_000001405.40_GRCh38.p14_genomic.fna.gz and GCF_000001405.40_GRCh38.p14_genomic.gtf.gz).

My config:

config_file = FLAMES::create_config(
  config_dir,
  type = "sc_3end",
  do_barcode_demultiplex = FALSE,
  threads = 12
)

My sc_long_pipeline() job:

sce = FLAMES::sc_long_pipeline(
    fastq = fastq_input,
    genome_fa = ref_fna_input,
    annotation = ref_annot_input,
    outdir = work_dir,
    config = config_file,
    minimap2 = minimap2_path,
    k8 = k8_path,
    expect_cell_number = 8000
)

My console output:

Skipping Demultiplexing step...
Please make sure the ` /home/rstudio/workspace//data/SspArc0008_10x_cDNA_longRead//blaze_output/matched_reads.fastq `` is the the demultiplexing output from previous FLAEMS call.
Running FLAMES pipeline...
#### Input parameters:
{
  "pipeline_parameters": {
    "seed": [2022],
    "threads": [12],
    "do_barcode_demultiplex": [false],
    "do_gene_quantification": [true],
    "do_genome_alignment": [true],
    "do_isoform_identification": [true],
    "bambu_isoform_identification": [false],
    "multithread_isoform_identification": [true],
    "do_read_realignment": [true],
    "do_transcript_quantification": [true]
  },
  "barcode_parameters": {
    "max_bc_editdistance": [2],
    "max_flank_editdistance": [8],
    "pattern": {
      "primer": ["CTACACGACGCTCTTCCGATCT"],
      "BC": ["NNNNNNNNNNNNNNNN"],
      "UMI": ["NNNNNNNNNNNN"],
      "polyT": ["TTTTTTTTT"]
    },
    "TSO_seq": ["CCCATGTACTCTGCGTTGATACCACTGCTT"],
    "TSO_prime": [3],
    "full_length_only": [false]
  },
  "isoform_parameters": {
    "generate_raw_isoform": [false],
    "max_dist": [10],
    "max_ts_dist": [100],
    "max_splice_match_dist": [10],
    "min_fl_exon_len": [40],
    "max_site_per_splice": [3],
    "min_sup_cnt": [5],
    "min_cnt_pct": [0.001],
    "min_sup_pct": [0.2],
    "bambu_trust_reference": [true],
    "strand_specific": [0],
    "remove_incomp_reads": [4],
    "downsample_ratio": [1]
  },
  "alignment_parameters": {
    "use_junctions": [true],
    "no_flank": [false]
  },
  "realign_parameters": {
    "use_annotation": [true]
  },
  "transcript_counting": {
    "min_tr_coverage": [0.4],
    "min_read_coverage": [0.4]
  }
} 
gene annotation: /home/rstudio/workspace//data/references/human/FLAMES/GCF_000001405.40_GRCh38.p14_genomic.gtf.gz 
genome fasta: /home/rstudio/workspace//data/references/human/FLAMES/GCF_000001405.40_GRCh38.p14_genomic.fna.gz 
input fastq: /home/rstudio/workspace//data/SspArc0008_10x_cDNA_longRead//blaze_output/matched_reads.fastq 
output directory: /home/rstudio/workspace//data/SspArc0008_10x_cDNA_longRead//flames 
minimap2 path: /home/rstudio/miniconda3/bin/minimap2 
k8 path: /home/rstudio/miniconda3/bin/k8 
#### Aligning reads to genome using minimap2
02:49:30 PM Fri Apr 12 2024 minimap2_align
[M::mm_idx_gen::52.772*2.11] collected minimizers
[M::mm_idx_gen::59.234*3.06] sorted minimizers
[M::main::59.234*3.06] loaded/built the index for 705 target sequence(s)
[M::mm_mapopt_update::60.414*3.02] mid_occ = 2177
[M::mm_idx_stat] kmer size: 14; skip: 5; is_hpc: 0; #seq: 705
[M::mm_idx_stat::61.052*3.00] distinct minimizers: 65648619 (22.00% are singletons); average occurrences: 16.258; average spacing: 3.090; total length: 3298430636
[M::worker_pipeline::305.468*9.91] mapped 482699 sequences
[M::worker_pipeline::539.678*10.81] mapped 485246 sequences
[M::worker_pipeline::772.824*11.17] mapped 494113 sequences
[M::worker_pipeline::1008.325*11.36] mapped 492944 sequences
[M::worker_pipeline::1250.776*11.48] mapped 487522 sequences
[M::worker_pipeline::1515.925*11.57] mapped 468694 sequences
[M::worker_pipeline::1778.447*11.64] mapped 468982 sequences
[M::worker_pipeline::2041.090*11.68] mapped 468921 sequences
[M::worker_pipeline::2305.636*11.72] mapped 468068 sequences
[M::worker_pipeline::2566.991*11.75] mapped 469711 sequences
[M::worker_pipeline::2814.316*11.77] mapped 476070 sequences
[M::worker_pipeline::3044.860*11.79] mapped 482292 sequences
[M::worker_pipeline::3277.376*11.80] mapped 482075 sequences
[M::worker_pipeline::3507.651*11.81] mapped 482633 sequences
[M::worker_pipeline::3738.925*11.82] mapped 482293 sequences
[M::worker_pipeline::3910.235*11.82] mapped 377902 sequences
[M::main] Version: 2.28-r1209
[M::main] CMD: /home/rstudio/miniconda3/bin/minimap2 -ax splice -t 12 -k14 --secondary=no --seed 2022 --junc-bed /home/rstudio/workspace//data/SspArc0008_10x_cDNA_longRead//flames/tmp_splice_anno.bed12 --junc-bonus 1 /home/rstudio/workspace//data/references/human/FLAMES/GCF_000001405.40_GRCh38.p14_genomic.fna.gz /home/rstudio/workspace//data/SspArc0008_10x_cDNA_longRead//blaze_output/matched_reads.fastq
[M::main] Real time: 3910.908 sec; CPU: 46208.963 sec; Peak RSS: 21.147 GB
[bam_sort_core] merging from 17 files and 1 in-memory blocks...
04:02:56 PM Fri Apr 12 2024 Start gene quantification and UMI deduplication
04:02:56 PM Fri Apr 12 2024 quantify genes 
Found genome alignment file(s): 	align2genome.bam
Connected to your session in progress, last started 2024-Apr-12 14:48:40 UTC (1 hour ago)
Assigning reads to genes...
Processed: 100%|██████████| 40375/40375 [02:41<00:00, 249.45gene_group/s]
Finalising the gene count matrix ...
Plotting the saturation curve ...
Generating deduplicated fastq file ...
Processed: 7572500.0Read [01:27, 87015.60Read/s] 
04:14:35 PM Fri Apr 12 2024 Gene quantification and UMI deduplication done!
04:14:35 PM Fri Apr 12 2024 Start isoform identificaiton
04:14:35 PM Fri Apr 12 2024 find_isoform
#### Reading Gene Annotations
Error: _Map_base::at

I'm using the patched version of FLAMES from #26.

My sessionInfo:

R version 4.3.1 (2023-06-16)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.4 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: Etc/UTC
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] FLAMES_1.9.2

loaded via a namespace (and not attached):
  [1] BiocIO_1.12.0               bitops_1.0-7                filelock_1.0.2             
  [4] tibble_3.2.1                R.oo_1.25.0                 basilisk.utils_1.14.1      
  [7] bambu_3.4.0                 graph_1.80.0                XML_3.99-0.14              
 [10] rpart_4.1.19                lifecycle_1.0.3             edgeR_4.0.16               
 [13] doParallel_1.0.17           OrganismDbi_1.44.0          globals_0.16.2             
 [16] lattice_0.21-8              ensembldb_2.26.0            MultiAssayExperiment_1.28.0
 [19] backports_1.4.1             magrittr_2.0.3              limma_3.58.1               
 [22] Hmisc_5.1-1                 rmarkdown_2.25              yaml_2.3.7                 
 [25] metapod_1.10.1              reticulate_1.34.0           ggbio_1.50.0               
 [28] cowplot_1.1.1               DBI_1.1.3                   RColorBrewer_1.1-3         
 [31] abind_1.4-5                 zlibbioc_1.48.2             GenomicRanges_1.54.1       
 [34] purrr_1.0.2                 R.utils_2.12.3              AnnotationFilter_1.26.0    
 [37] biovizBase_1.50.0           BiocGenerics_0.48.1         RCurl_1.98-1.12            
 [40] nnet_7.3-19                 VariantAnnotation_1.48.1    rappdirs_0.3.3             
 [43] circlize_0.4.15             GenomeInfoDbData_1.2.11     IRanges_2.36.0             
 [46] S4Vectors_0.40.2            ggrepel_0.9.4               irlba_2.3.5.1              
 [49] listenv_0.9.0               dqrng_0.3.1                 parallelly_1.36.0          
 [52] DelayedMatrixStats_1.24.0   codetools_0.2-19            DropletUtils_1.22.0        
 [55] DelayedArray_0.28.0         scuttle_1.12.0              xml2_1.3.5                 
 [58] tidyselect_1.2.0            shape_1.4.6                 viridis_0.6.4              
 [61] ScaledMatrix_1.10.0         matrixStats_1.0.0           stats4_4.3.1               
 [64] BiocFileCache_2.10.2        base64enc_0.1-3             GenomicAlignments_1.38.2   
 [67] jsonlite_1.8.7              BiocNeighbors_1.20.2        GetoptLong_1.0.5           
 [70] Formula_1.2-5               scater_1.30.1               iterators_1.0.14           
 [73] foreach_1.5.2               tools_4.3.1                 progress_1.2.2             
 [76] Rcpp_1.0.11                 glue_1.7.0                  gridExtra_2.3              
 [79] SparseArray_1.2.4           xfun_0.40                   MatrixGenerics_1.14.0      
 [82] GenomeInfoDb_1.38.8         dplyr_1.1.4                 HDF5Array_1.30.1           
 [85] withr_2.5.1                 BiocManager_1.30.22         fastmap_1.1.1              
 [88] GGally_2.1.2                basilisk_1.14.3             bluster_1.12.0             
 [91] rhdf5filters_1.14.1         fansi_1.0.5                 rsvd_1.0.5                 
 [94] digest_0.6.33               R6_2.5.1                    colorspace_2.1-0           
 [97] dichromat_2.0-0.1           biomaRt_2.58.2              RSQLite_2.3.2              
[100] R.methodsS3_1.8.2           utf8_1.2.4                  tidyr_1.3.1                
[103] generics_0.1.3              data.table_1.14.8           rtracklayer_1.62.0         
[106] prettyunits_1.2.0           httr_1.4.7                  htmlwidgets_1.6.2          
[109] S4Arrays_1.2.1              pkgconfig_2.0.3             gtable_0.3.4               
[112] blob_1.2.4                  ComplexHeatmap_2.18.0       SingleCellExperiment_1.24.0
[115] XVector_0.42.0              htmltools_0.5.6.1           RBGL_1.78.0                
[118] ProtGenerics_1.34.0         clue_0.3-65                 scales_1.3.0               
[121] Biobase_2.62.0              png_0.1-8                   scran_1.30.2               
[124] knitr_1.44                  rstudioapi_0.15.0           reshape2_1.4.4             
[127] rjson_0.2.21                checkmate_2.3.0             curl_5.1.0                 
[130] cachem_1.0.8                rhdf5_2.46.1                GlobalOptions_0.1.2        
[133] stringr_1.5.1               vipor_0.4.5                 parallel_4.3.1             
[136] foreign_0.8-84              AnnotationDbi_1.64.1        restfulr_0.0.15            
[139] pillar_1.9.0                grid_4.3.1                  reshape_0.8.9              
[142] vctrs_0.6.4                 BiocSingular_1.18.0         dbplyr_2.4.0               
[145] beachmat_2.18.1             cluster_2.1.4               beeswarm_0.4.0             
[148] htmlTable_2.4.2             evaluate_0.22               GenomicFeatures_1.54.4     
[151] cli_3.6.1                   locfit_1.5-9.8              compiler_4.3.1             
[154] Rsamtools_2.18.0            rlang_1.1.1                 crayon_1.5.2               
[157] ggbeeswarm_0.7.2            plyr_1.8.9                  stringi_1.7.12             
[160] viridisLite_0.4.2           BiocParallel_1.36.0         munsell_0.5.0              
[163] Biostrings_2.70.3           lazyeval_0.2.2              Matrix_1.6-5               
[166] dir.expiry_1.10.0           BSgenome_1.70.2             hms_1.1.3                  
[169] sparseMatrixStats_1.14.0    bit64_4.0.5                 future_1.33.0              
[172] ggplot2_3.5.0               Rhdf5lib_1.24.2             KEGGREST_1.42.0            
[175] statmod_1.5.0               SummarizedExperiment_1.32.0 igraph_1.5.1               
[178] memoise_2.0.1               bit_4.0.5                   xgboost_1.7.5.1
@nick-youngblut
Copy link
Author

nick-youngblut commented Apr 12, 2024

Note: I'm using the gtf file instead of gff based on #27

@ChangqingW
Copy link
Collaborator

ChangqingW commented Apr 15, 2024

Seems to be a bug in the new Rcpp implementation. I have not been able to locate the bug, but you could edit the config file to either switch multithread_isoform_identification to false (to use the old python implementation) or switch bambu_isoform_identification to true (to use bambu instead)

@nick-youngblut
Copy link
Author

If I use bambu_isoform_identification = TRUE, then I get the following error:

Error in check_arguments(annotation, fastq, genome_bam, outdir, genome_fa,  : 
  Bambu requires GTF format for annotation file.

I'm using a GTF file for the annotation input, but it is gzip'd, which I'm guessing is causing the error.

@ChangqingW
Copy link
Collaborator

ChangqingW commented Apr 16, 2024

Hmm, bambu can probably read gziped annotation but the custom parser in our legacy code won't. I'll specify that in the docs.

For the multithread_isoform_identification implementation I got a Error: unordered_map::at error with plain GTF but there seems to a ton of unordered_map calls in the functions and I cannot locate where the issue is at the moment

@nick-youngblut
Copy link
Author

Moreover, I get the following error even when using an uncompressed gtf file:

[E::fai_build3_core] Cannot index files compressed with gzip, please use bgzip

It appears that the genome_fa must be bgzip'd or uncompressed. Do you think that you will just create uncompressed temporary files for these reference files, at least when the user selects bambu?

@nick-youngblut
Copy link
Author

error with plain GTF but there seems to a ton of unordered_map calls in the functions and I cannot locate where the issue is at the moment

Any updates on this?

Same for the please use bgzip error: is it necessary for FLAMES to create a temporary uncompressed (or bgzip-compressed) file when the user selects bambu?

@ChangqingW
Copy link
Collaborator

Same for the please use bgzip error: is it necessary for FLAMES to create a temporary uncompressed (or bgzip-compressed) file when the user selects bambu?

Sounds reasonable, I think it assumes everything is uncompressed at the moment

@maxim-h
Copy link
Contributor

maxim-h commented Jul 14, 2024

Some printf debugging led me to believe the problem is coming from the get_gene_blocks function around here, because not all genes present in chr_to_gene map are present in the gene_dict map for some reason.

@maxim-h
Copy link
Contributor

maxim-h commented Jul 15, 2024

@ChangqingW
So, in my case the issue is coming from the get_gene_blocks as I described before. The gene it's failing at is a small mitochondrial gene with only 1 transcript containing only 1 exon.

Looks like it's missing from gene_dict because it's also not present in the gene_to_transcript map after remove_similar_tr was ran. Pretty sure on this line:

if (transcript.size() < 2) continue;

the whole outer loop iteration is skipped, including this block

// Create a new transcriptvector of the transcripts that weren't removed
out_gene_to_transcript[gene] = {};
for (int i = 0; i < (int)transcript.size(); i++) {
// Add this transcript to the transcriptvector if it wasn't marked as a duplicate
if (dup_set.count(i) == 0) {
out_gene_to_transcript.at(gene).push_back(transcript.at(i));
}
}

that is supposed to add the gene to the new out_gene_to_transcript map.

I'm not quite sure if the intention behind that continue was to skip those genes entirely or to only skip the isoform comparison. If it's the former you'd need to add some condition downstream that skips the genes not present in gene_to_transcript and gene_dict, and if the latter you need to change the continue statement to only apply to the inner loop.

@ChangqingW ChangqingW added the bug Something isn't working label Jul 22, 2024
@OliverVoogd
Copy link
Collaborator

@maxim-h
Thanks for your suggestions! It looks like remove_similar_tr was ignoring genes with only 1 transcript, which was not the intended behaviour. I've updated the relevant function in 6412113

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

4 participants