Skip to content

Latest commit

 

History

History
388 lines (283 loc) · 14.4 KB

FAQ.rst

File metadata and controls

388 lines (283 loc) · 14.4 KB

FAQ

If there is no read captured from the bam file, there can be multiple reasons:

Improper --UMItag setting
if your data has UMI tag, please use the right value (default value for --UMItag is UB). Otherwise, please set --UMItag None, e.g., for SMART-seq2, bulk RNA-seq, 10x scDNA-seq, scATAC-seq data. See also issue #26, #44.
Improper --cellTAG setting
if your data has cell tag, please use the right value (default value for --cellTAG is CB). Otherwise, please set --cellTAG None, e.g., for SMART-seq2, bulk RNA-seq data.
Unmatched cell barcodes
please check whether the cell barcode (string) in the cell tag of BAM file can match exactly with the barcodes in the input file (-b), including the possible suffix (e.g., -1). See also issue #44.
Unmatched chromosome names

the input chromosome names, either in the input VCF file or specified by --chrom option, should match the @SQ records in the SAM/BAM header, especially for mitochondrial chromosome, which has multiple names, such as chrM and chrMT. Please use the right chromosome names, you may check the @SQ records with samtools view -h.

Notably, cellsnp-lite would internally remove the "chr" prefix (if available) of all chromosome names, including the names specified by --chrom option and the ones in the input BAM and VCF records. Therefore, users do not need to tweak the chromosome names in the option and the two files if they only differ in the "chr" prefix.

Yes. Cellsnp-lite supports multiple omics data from various sequencing platforms. You may use proper options for different omics, mainly the --cellTAG for cell tag and --UMItag for UMI tag, e.g., please set --UMItag None for 10x scDNA-seq or scATAC-seq data.

See :ref:`Processing other omics data <manual Advanced Usage Other Omics>` for details and more options.

See also: issue #26, #44.

Yes. Efforts have been made to explore using cellsnp-lite to calculate the B allele frequency (BAF) as one input signal for Numbat to calculate the CNV probability per spot in spatial transcriptomics (Arora et al, 2023).

Yes.

For Mode 2, by default it runs on chr1 to 22 on human. For mouse, you need to specify it to 1,2,...,19 (replace the ellipsis).

On the other hand, if it is lab straint with genotype available, e.g., from the mouse genome project (ftp://ftp-mouse.sanger.ac.uk/current_snps), you can also run Mode 1.

See also: issue #3.

The --genotype option should be added for cellsnp-lite to output the file cellSNP.cells.vcf.gz.

See also: issue #4, #22.

Mode 2a is more suitable for small datasets. For large datasets, you may try Mode 2b + Mode 1a. Mode 2a does joint calling and genotyping, but it is substantially slower than calling first in a bulk manner by Mode 2b followed by genotyping in Mode 1a. To speed up, you may try --minMAF 0.1 --minCOUNT 100 options in both modes.

See also: issue #70.

To speedup, you may

  • Check whether the cell barcodes are "filtered", i.e., from filtered_gene_bc_matrices instead of from raw_gene_bc_matrices in the cellranger output folder (update -b);
  • Try to use the SNP list from AF5e2 VCF file instead of AF5e4 in this human SNP list folder (update -R);
  • Use more threads or cores (update -p).

See also: issue #78.

Cellsnp-lite Mode 2 takes the allele with the highest count as REF and the second highest as ALT by default. Therefore, neither allele is necessarily identical to the actual (genomic) reference in Mode 2. This is different from Mode 1, which uses the REF and ALT alleles specified in the input VCF.

However, since v1.2.2, cellsnp-lite has the -f or --refseq option to extract the real (genomic) reference allele from FASTA file as REF, and assign the allele (other than REF) with the highest UMI/read counts as the ALT.

See also: issue #28.

You can extract the reads by samtools and then view them in IGV.

To extract reads covering a SNP and output to a BAM file (assuming the SNP position is chr1:100000):

samtools view -h -b  "input_BAM"  chr1:100000  >  "output_BAM"
samtoos index "output_BAM"

If you only want to extract SNP reads in specific cell (assuming cell barcode is XXX-1 and cell tag is CB):

samtools view -h -b  -d CB:XXX-1  "input_BAM"  chr1:100000  >  "output_BAM"
samtoos index "output_BAM"

Then you can load the output BAM file above into IGV to view the reads.

See also: issue #107.

IGV would filter some reads by default, which could lead to the difference in allele counts with cellsnp-lite output. The allele counts should be the same if given the same read filtering settings.

You may refer to the question on this page Which reads will be filtered by cellsnp-lite? and Preferences -> Alignments for read filtering settings of cellsnp-lite and IGV, respectively.

See also: issue #95.

The inconsistency of read counts between samtools and cellsnp-lite is probably due to the different filtering settings of the two tools, e.g., by default, cellsnp-lite will filter some low-quality reads (please check --exclFLAG option) while samtools do not. To make the filtering settings the same, you can use -F option in samtools view.

See also: issue #107.

The difference in read counts is probably because the two methods used different read filtering settings, especially in Mode 2.

In Mode 2, cellSNP (actually the dependency pysam.pileup()) has a default limitation that the max_depth (i.e., max pileup-ed read count) is 8000, However, cellsnp-lite does not have this max_depth limitation by default, it will pileup as many reads as possible. You may try using the same read filtering settings for both cellsnp-lite and cellSNP, to make their read counts highly concordant in Mode 2.

See also: issue #33.

Cellsnp-lite has a few options for SNP filtering. By default, SNP will be filtered if

  • its aggregated UMI (if --UMItag is not None) or read (otherwise) count is <20 (--minCOUNT);
  • its minor allele frequency (the frequency of the allele with second highest read or UMI count) is <0 (--minMAF).

See :ref:`Optional Arguments <manual Full Parameters Optional Arguments>` in manual for details and more options.

Cellsnp-lite has a few options for read filtering. By default, read will be filtered if

  • it does not contain target cell tag (if set in --cellTAG) or its cell barcode is not in the input barcode list (-b);
  • it does not contain target UMI tag (if set in --UMItag);
  • any mask bits is set in SAM FLAG: UNMAP, SECONDARY, QCFAIL (when use UMI) or UNMAP, SECONDARY, QCFAIL, DUP (otherwise).
  • its mapped length is <30 (--minLEN);
  • its mapping quality MAPQ is <20 (--minMAPQ);
  • total pileup read count per input file is >INT_MAX (--maxDEPTH);
  • it is not mapped in proper pairs (--countORPHAN).

See :ref:`Read Filtering <manual Full Parameters Read Filtering>` in manual for details and more options.

See also: issue #25.

In theory, the computational complexity (i.e., running time) of cellSNP-lite is O(n) for number of variants and O(n*log(n)) for number of cells, which means it is more sensitive to the cell counts.

For human SNP list, we suggest using the version with AF5e2 (i.e., AF>5%, 7.4M SNPs), instead of AF5e4 (i.e., AF>0.05%, 36.6M SNPs).

IMPO, the "multi-primary" strategy, in which multiple alignments with the best score are labeled as primary, should be fine for downstream tasks if the fraction of the "extra" primary alignments is low.

Generally, we recommend to use "single-primary" strategy for genotyping, in which only one alignment with best alignment score is labelled as primary and the rest as secondary.

See detailed discussion in issue #39.

Cellsnp-lite was designed for bi-allelic SNPs. In its Mode 1, REF and ALT alleles are specified by user while in mode 2, REF and ALT are inferred from data as the alleles with highest and second highest read(UMI) counts. Therefore, in Mode 1, the REF or ALT in the reference VCF could be different from the major or minor allele inferred from data. For example, the ALT in VCF could be REF in the data.

In cellsnp cmdline (for both Mode 1 and 2), MAF is always caculated as the fraction read(UMI)_count_of_minor_allele / read(UMI)_count_of_all_alleles, where the minor allele is the allele with second highest read(UMI) count inferred from data. See also issue #77.

Therefore, in Mode 1, post-filtering SNPs based on the minimum allele frequency of the REF and ALT alleles in VCF file could be different from filtering SNPs with --minMAF in the cellsnp cmdline, for a small subset of SNPs whose major allele (with highest read/UMI count) or minor allele (second highest) is neither REF or ALT allele but one of the OTH alleles. See also issue #90.

The number of SNPs whose major or minor allele is one of the OTH alleles is expected to be quite small (in Mode 1), given the input reference VCF is reliable (e.g., with common SNPs compiled from 1000 genome project), hence should have limited influence on downstream donor deconvolution.

See also: issue #77, #90, #93.

The large mitochondrial read counts in cellsnp-lite output makes it more likely for vireo to reach local optima so that the parameters of donors become the same and hence vireo cannot assign the cells to certain donor.

Besides, vireo is designed for nuclear SNVs. For mito SNVs, you may want to try this vireo Mito tutorial, which was used by MQuad. Note that the duplicate reads should probably be removed beforehand, if there are no UMIs in your data.

See also: issue #33.

You may use bcftools merge to make a combined VCF for all donors.

See also: issue #21, #100, #106.

For SMART-seq2 data, cellsnp-lite supports multiple BAM files as input, so there is no need to merge all FASTQ files to generate single BAM file.

Specifically, one BAM file (and corresponding .bai index file) should be generated per FASTQ (pairs, Read 1&2), e.g, with STAR. Then cellsnp-lite can be run with the -S option, which will pileup every well (cell) in one go. The "-S" option requires a plain file listing BAM files, each per line.

Notably, as SMART-seq2 reads have no cell barcode and UMI tag, you may need to specify --cellTAG None and --UMItag None in cellsnp-lite. Additionally, the —gzip option can be used to output zipped VCF file(s).

The output of cellsnp-lite ("-S" option) run on SMART-seq2 data will be in the same format as 10x scRNA-seq, i.e., the three SNP x cell count matrices in the output folder are already "pooled" from all input wells (cells), each row is SNP and column is cell.

Therefore, the downstream processing (after running cellsnp-lite "-S") should be similar to 10x scRNA-seq data, such as running MQuad with the cellsnp folder (three “pooled” matrices) and then using vireo for clonal reconstruction (see vireo Mito tutorial for details).