Skip to content
cziegenhain edited this page Jan 22, 2020 · 12 revisions

zUMIs' parent output directory contains following files:

This is a file where all the analysed cell barcodes with total number of reads are listed

zUMIs_outputs/<Project_name>.kept_barcodes.txt
  • Log files generated by STAR
<Project_name>.*.out
  • BAM files generated by zUMIs:
# filtered, unmapped BAM file tagged by cell barcode (BC) and UMI (UB)
will be found temporarily in zUMIs_output/.tmpMerge/
and after merging in <Project_name>.filtered.tagged.unmapped.bam

# filtered, STAR mapped, coordinate sorted BAM file with retained BC and UB tags & featureCount Gene assignments (exon or exon+intron)
<Project_name>.filtered.Aligned.GeneTagged.sorted.bam

Explanation of the bam tags zUMIs uses:

BC --- Barcode sequence
BX --- Raw barcode sequence (if hamming distance error correction is used)
QB --- Barcode sequence quality scores
UB --- UMI sequence
UX --- Raw UMI sequence (if hamming distance collapse is used)
QU --- UMI sequence quality scores
ES --- featureCounts status for exon assignment
EN --- Number of overlapping exons
GE --- GeneID for exon assignment
IS --- featureCounts status for intron assignment (if used)
IN --- Number of overlapping introns
GI --- GeneID for intron assignment

Output files

The final output is structured in two subdirectories:

zUMIs_output/expression
zUMIs_output/stats

The retained cell barcodes with total number of reads are listed in zUMIs_output/<Project_name>kept_barcodes.txt

  • "expression" contains .rds files
    • list of count matrices as sparseMatrix (*.dgecounts.rds)
  • "stats" contains plots and data files with descriptive statistics

Optionally, per-cell demultiplexed bam files are found in:

zUMIs_output/demultiplex

Structure of the output dgecounts object in <project_name>.dgecounts.rds

zUMIs produces dge output in .rds format that can be read in R with the following command.

AllCounts <- readRDS("zUMIs_output/expression/example.dgecounts.rds")
names(AllCounts)
[1] "umicount"  "readcount"

names(AllCounts$umicount)
[1] "exon"   "inex"   "intron"

names(AllCounts$umicount$exon)
[1] "all"          "downsampling"

AllCounts is a list of lists with all the count matrices as sparseMatrix. The parent list contains UMI and read count quantification. In each of these counting types, you will find the three feature types (introns,exons and intron+exon). Each of those contain a sparseMatrix generating using all reads observed and a list for the downsampling sizes requested. Each of the downsampling list elements is also a sparseMatrix.

The sparseMatrix can be converted into a conventional count table using "as.matrix" function in R and saved as a text file using the code below.

#Feature exons umicounts table
dge <- as.matrix(AllCounts$umicount$exon$all)
write.table(dge,"exons.umicounts.txt",quote=F,sep="\t")

NOTE: It is highly recommended to retain sparseMatrix format for the downstream analysis, especially when you have >20,000 cells to save time and space.

Loading zUMIs RDS output in Python

To load zUMIs output as a pandas object, please find code below Thanks to @adamgayoso for this!

import rpy2.robjects as robjects
from rpy2.robjects import pandas2ri
pandas2ri.activate()
from rpy2.robjects.packages import STAP

def zumis_output2pandas(filename):
    mfunc = 'to_df <- function(dobj){return(as.data.frame(as.matrix(dobj)))}'
    rsparse2pandas = STAP(mfunc, "to_df")

    readRDS = robjects.r['readRDS']
    zumis_data = readRDS(filename)
    zd = dict(zip(zumis_data.names, list(zumis_data)))
    for key in ['umicount', 'readcount']:
        zd[key] = dict(zip(zd[key].names, list(zd[key])))
        zd[key]['exon'] = dict(zip(zd[key]['exon'].names, list(zd[key]['exon'])))
        zd[key]['exon']['all'] = rsparse2pandas.to_df(zd[key]['exon']['all']).transpose()
        zd[key]['exon']['downsampling'] = dict(zip(zd[key]['exon']['downsampling'].names, list(zd[key]['exon']['downsampling'])))
        for k in zd[key]['exon']['downsampling'].keys():
            zd[key]['exon']['downsampling'][k] = rsparse2pandas.to_df(zd[key]['exon']['downsampling'][k]).transpose()
    return zd