Barcodes

cziegenhain edited this page Mar 27, 2020 · 3 revisions

zUMIs provides three main options for selecting relevant barcodes:

  • automatic detection
  • number of barcodes with most reads
  • barcode list annotation

Here is more information on each of the modes:

Automatic barcode detection

zUMIs infers which barcodes mark good cells from the observed sequences. To this end, we fit a mixture of k normal distributions to the number of reads per barcode (BC) using the R package mclust, where k is determined empirically by mclust via the Bayesian Information Criterion (BIC). We reason that only the component with the largest mean contains barcodes that identify reads originating from intact cells. To remove spurious barcodes, we additionally exclude all barcodes that fall in the lower 1% tail of this component.
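The final filtering step can be sketched as follows. This is a minimal illustration, not the zUMIs implementation: it assumes the mixture has already been fitted elsewhere (the mclust fit is not reproduced here), that the mean and standard deviation of the component with the largest mean are known, and that read counts are on a log10 scale. The function names and all numbers are hypothetical.

```python
from statistics import NormalDist


def lower_tail_cutoff(mean, sd, tail=0.01):
    """Value below which a fraction `tail` of a normal(mean, sd) component lies."""
    return NormalDist(mu=mean, sigma=sd).inv_cdf(tail)


def keep_barcodes(log_reads, mean, sd, tail=0.01):
    """Keep barcodes whose (assumed log10) read count is at or above the cutoff."""
    cutoff = lower_tail_cutoff(mean, sd, tail)
    return {bc: value for bc, value in log_reads.items() if value >= cutoff}


# Hypothetical values: top component with mean 4.0 and sd 0.5 on a log10 scale.
log_reads = {"GGGGCA": 4.1, "TATTGT": 3.9, "GCACGG": 2.0}
good = keep_barcodes(log_reads, mean=4.0, sd=0.5)
```

With these made-up numbers the cutoff is about 2.84, so the barcode with a value of 2.0 is dropped.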

Number of barcodes with most reads

zUMIs tabulates all observed barcode sequences together with their read counts. The barcodes are then ranked by read count in descending order, and the user-specified number of top-ranked barcodes is selected.
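As a rough sketch of this mode (not the zUMIs code itself), the selection amounts to counting reads per barcode and keeping the n most frequent ones. The function name `top_barcodes` and the example reads are hypothetical.

```python
from collections import Counter


def top_barcodes(barcodes, n):
    """Return the n barcodes with the most reads, most frequent first."""
    counts = Counter(barcodes)  # one entry per observed barcode sequence
    return [bc for bc, _ in counts.most_common(n)]


# One barcode per read; made-up data for illustration.
reads = ["GGGGCA"] * 5 + ["TATTGT"] * 3 + ["GCACGG"]
selected = top_barcodes(reads, 2)
```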

Barcode annotation

If expected barcodes are known a priori, it is usually advisable to provide these. The format should be a plain text file without headers, where each line contains the exact barcode sequence.

For instance:

GGGGCA
TATTGT
GCACGG
CAATAA
CGCGTG

Attention: If you have specified a 6-mer in the barcode range (e.g. 1-6), this annotation must also contain 6-mer reference barcodes!

If you are using several barcode ranges in zUMIs, the expected barcode list should contain the concatenated strings of all possible expected barcode combinations!

For instance, if the cell barcodes above all share the same plate barcode (here CGTACTAG), the list becomes:

CGTACTAGGGGGCA
CGTACTAGTATTGT
CGTACTAGGCACGG
CGTACTAGCAATAA
CGTACTAGCGCGTG

Attention: Make sure the annotation always contains reference barcodes with correct length (sum of all specified barcode lengths)!
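A list like the one above can be generated and checked with a few lines of Python. This is only a sketch under two assumptions: that the parts are concatenated in the same order as in the example (plate barcode first, then cell barcode), and that the helper names `concatenated_barcodes` and `check_lengths` are hypothetical, not part of zUMIs.

```python
from itertools import product


def concatenated_barcodes(*parts):
    """Concatenate every combination of the given barcode lists, in order."""
    return ["".join(combo) for combo in product(*parts)]


def check_lengths(barcodes, expected_length):
    """Raise if any barcode differs from the summed barcode range length."""
    bad = [bc for bc in barcodes if len(bc) != expected_length]
    if bad:
        raise ValueError(f"barcodes with wrong length: {bad}")


plate = ["CGTACTAG"]  # 8-mer plate barcode from the example
cells = ["GGGGCA", "TATTGT", "GCACGG", "CAATAA", "CGCGTG"]  # 6-mer cell barcodes
expected = concatenated_barcodes(plate, cells)
check_lengths(expected, 8 + 6)  # sum of all specified barcode lengths

# One barcode per line, no header, as described above.
annotation_text = "\n".join(expected) + "\n"
```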

Intersection of automatic + barcode whitelist

In this mode, zUMIs will use its automatic BC detection as described above and make sure that each detected BC is part of the given BC whitelist. If your whitelist barcodes are shorter than the barcode extracted from the sequence reads, zUMIs will still try to match them up using a grep command. Note that this may become slow if you have many cells and many whitelisted barcodes.

Example: You are using 10xGenomics data with a 16 bp RT barcode plus an 8 bp i7 index (so the zUMIs internal BC is 24 bp) but provide only the 16 bp RT barcodes in the whitelist. The matching will still work in this case.
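The idea of matching shorter whitelist entries against longer internal barcodes can be sketched as follows. This is an illustration only: it assumes plain substring matching, similar to an unanchored grep, and does not reproduce the exact matching rules zUMIs applies. The function name and the sequences are hypothetical.

```python
def intersect_with_whitelist(detected, whitelist):
    """Keep detected barcodes that equal or contain a whitelisted barcode."""
    whitelist = set(whitelist)
    kept = []
    for bc in detected:
        # Exact match first (cheap), then substring search over the whitelist.
        if bc in whitelist or any(entry in bc for entry in whitelist):
            kept.append(bc)
    return kept


# Made-up 16 bp RT barcodes followed by an 8 bp index (24 bp internal BC).
detected = [
    "AAACCCAAGAAACACT" + "GGTTTACT",
    "TTTTTTTTTTTTTTTT" + "GGTTTACT",
]
whitelist = ["AAACCCAAGAAACACT"]  # only the 16 bp part is whitelisted
kept = intersect_with_whitelist(detected, whitelist)
```

The substring search compares every detected barcode against every whitelist entry, which gives an intuition for why this mode can become slow with many cells and a large whitelist.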

Barcode Sharing Feature

For some scRNA-seq protocols, the same cell may be observed with several barcode sequences. Examples are:

  • SPLiT-seq: Round 1 RT barcode for the same cell differs if using oligo-dT and random hexamer priming together.
  • 10xGenomics: i7 library barcode is actually a mix of 4 primers with distinct sequences to improve sequencer quality.

zUMIs can combine the reads that belong together when the user provides an annotation file in the barcode_sharing: field of the YAML config file. The file should have the following format:

  1. A hashed-out header line defining which portion of the full zUMIs barcode to match up (e.g. #17-26 if bases 17-26 contain the barcode portion of interest)
  2. Tab-separated barcodes that belong together. Each line of the file should contain all the barcodes that belong together, with as many columns as necessary, e.g.:

GGTTTACT CTAAACGG TCGGCGTC AACCGTAA
TTTCATGA ACGTCCCT CGCATGTG GAAGGAAC
...

or

AACGTGAT GATAGACA
CGCTGATC GTCGTAGA
...
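To make the file format concrete, here is a hedged sketch of how such an annotation could be parsed and applied. It is not the zUMIs implementation and rests on several assumptions: the header positions are treated as 1-based and inclusive, the first barcode of each line is used as the representative of its group, and the names `parse_sharing` and `merge_barcode` are hypothetical. The header #17-24 is chosen here only so that it matches the 8-mer example barcodes.

```python
def parse_sharing(lines):
    """Parse the header range and map every barcode to its group's first entry."""
    lines = [ln.rstrip("\n") for ln in lines if ln.strip()]
    header = lines[0]
    if not header.startswith("#"):
        raise ValueError("first line must be a header such as #17-26")
    start, end = (int(x) for x in header[1:].split("-"))
    mapping = {}
    for ln in lines[1:]:
        group = ln.split("\t")  # columns are tab separated
        for bc in group:
            mapping[bc] = group[0]  # assumption: first column is the representative
    return start, end, mapping


def merge_barcode(full_bc, start, end, mapping):
    """Replace the shared portion of a full barcode with its representative."""
    part = full_bc[start - 1:end]  # assumption: 1-based, inclusive positions
    return full_bc[:start - 1] + mapping.get(part, part) + full_bc[end:]


sharing_file = [
    "#17-24",
    "GGTTTACT\tCTAAACGG\tTCGGCGTC\tAACCGTAA",
    "TTTCATGA\tACGTCCCT\tCGCATGTG\tGAAGGAAC",
]
start, end, mapping = parse_sharing(sharing_file)
merged = merge_barcode("A" * 16 + "CTAAACGG", start, end, mapping)
```

In this sketch, two reads whose barcodes differ only in the shared portion end up with the same merged barcode and would be counted as one cell.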