Protocol specific setup

cziegenhain edited this page Nov 18, 2020 · 10 revisions

The new release of zUMIs2 has made feature extraction from sequencing files much more flexible. Because of this, we could remove the protocol-specific processing modes necessary for some more complicated sequencing setups.

Smart-seq3

To process Smart-seq3 data, we need to instruct the pipeline to parse the UMI-containing and internal reads correctly. Please check Hagemann-Jensen et al., 2020 for more background information. Below is an example for paired-end, dual-indexed data:

project: Smartseq3
sequence_files:
  file1:
    name: /smartseq3/fastq/Undetermined_S0_R1_001.fastq.gz
    base_definition:
      - cDNA(23-150)
      - UMI(12-19)
    find_pattern: ATTGCGCAATG
  file2:
    name: /smartseq3/fastq/Undetermined_S0_R2_001.fastq.gz
    base_definition:
      - cDNA(1-150)
  file3:
    name: /smartseq3/fastq/Undetermined_S0_I1_001.fastq.gz
    base_definition:
      - BC(1-8)
  file4:
    name: /smartseq3/fastq/Undetermined_S0_I2_001.fastq.gz
    base_definition:
      - BC(1-8)
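
To make the file1 definition concrete, the following Python sketch splits a single read according to the ranges above: the tag `ATTGCGCAATG` fills bases 1-11, the UMI is bases 12-19, and cDNA starts at base 23. It only illustrates the coordinates and is not zUMIs code. The function name `split_umi_read` and the example read are made up, and two points are assumptions: that the tag sits at the very start of the read, and that reads without the tag are internal reads whose whole sequence counts as cDNA.

```python
TAG = "ATTGCGCAATG"  # find_pattern from the config above (11 bases)

def split_umi_read(read):
    """Split one read using the 1-based, inclusive ranges from the config."""
    if not read.startswith(TAG):
        # Assumption: no tag means an internal read, so everything is cDNA.
        return {"umi": None, "cdna": read}
    return {
        "umi": read[11:19],    # UMI(12-19)
        "cdna": read[22:150],  # cDNA(23-150); bases 20-22 are skipped
    }

# Made-up read: tag + 8-base UMI + 3 skipped bases + cDNA
example = TAG + "ACGTACGT" + "GGG" + "TTTTCCCCAAAA"
parts = split_umi_read(example)
print(parts["umi"])   # ACGTACGT
print(parts["cdna"])  # TTTTCCCCAAAA
```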

10x Genomics

zUMIs is compatible with 10x Genomics v1, v2, v3 and v3.1 NextGem!

project: tenX_v2
sequence_files:
  file1:
    name: R1.BCUMI.fastq.gz
    base_definition:
      - UMI(17-26)
      - BC(1-16)
  file2:
    name: R2.cDNA.fastq.gz
    base_definition:
      - cDNA(1-97)

Note on the i7 barcode read in 10x samples: 10x Genomics is a special case because each individual library is indexed with a mix of 4 primers. To prevent each cell from being wrongly split into 4 cell barcodes in the analysis, use the zUMIs barcode sharing feature. When a multi-sample 10x zUMIs run includes the i7 file, add a barcode sharing annotation file.

Example:

project: tenX_v3_multisample
sequence_files:
  file1:
    name: R1.BCUMI.fastq.gz
    base_definition:
      - BC(1-16)
      - UMI(17-28)
  file2:
    name: Index.fastq.gz
    base_definition:
      - BC(1-8)
  file3:
    name: R2.cDNA.fastq.gz
    base_definition:
      - cDNA(1-97)

barcodes:
  barcode_sharing: shared_i7_bcs.txt

Format of shared_i7_bcs.txt: Begin with a header line defining the barcode range to look at. Here, bases 17-24 are correct because zUMIs appends the i7 to the 16-base RT barcode. Each following line lists all 4 barcodes that belong together in one sample, tab-delimited.

#17-24
GGTTTACT	CTAAACGG	TCGGCGTC	AACCGTAA
TTTCATGA	ACGTCCCT	CGCATGTG	GAAGGAAC
CAGTACTG	AGTAGTCT	GCAGTAGA	TTCCCGAC
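
As a rough illustration of this format, the Python sketch below builds the file contents from the three index sets above and derives the `#17-24` header from the barcode lengths in the config. The helper names `sharing_file_text` and `collapse` are hypothetical. In particular, `collapse` is only one guess at what sharing amounts to (mapping every i7 of a sample onto the first one listed); the page does not describe how zUMIs merges the barcodes internally.

```python
# i7 index sets from the example file above: one list per sample
SAMPLES = [
    ["GGTTTACT", "CTAAACGG", "TCGGCGTC", "AACCGTAA"],
    ["TTTCATGA", "ACGTCCCT", "CGCATGTG", "GAAGGAAC"],
    ["CAGTACTG", "AGTAGTCT", "GCAGTAGA", "TTCCCGAC"],
]
CELL_BC_LEN = 16  # BC(1-16) in R1
I7_LEN = 8        # BC(1-8) in the index read

def sharing_file_text(samples, first, last):
    """Header '#first-last', then one tab-delimited line per sample."""
    lines = [f"#{first}-{last}"] + ["\t".join(group) for group in samples]
    return "\n".join(lines) + "\n"

def collapse(barcode, samples, first, last):
    """Hypothetical: replace the i7 part with the first i7 of its sample."""
    i7 = barcode[first - 1:last]
    for group in samples:
        if i7 in group:
            return barcode[:first - 1] + group[0] + barcode[last:]
    return barcode  # unknown i7: leave the barcode unchanged

# The i7 follows the 16-base cell barcode, so it occupies bases 17-24.
first, last = CELL_BC_LEN + 1, CELL_BC_LEN + I7_LEN
text = sharing_file_text(SAMPLES, first, last)
print(text, end="")
print(collapse("ACGTACGTACGTACGT" + "TCGGCGTC", SAMPLES, first, last))
# last line printed: ACGTACGTACGTACGTGGTTTACT
```

Saving `text` as shared_i7_bcs.txt gives a file with the layout shown above.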

Combinatorial Indexing

zUMIs is also compatible with combinatorial indexing protocols, such as sci-RNA-seq (Cao et al., 2018) and SPLiT-seq (Rosenberg et al., 2018).

In the previous version of zUMIs, a preprocessing step was necessary to accommodate the library structure of these protocols. This is no longer necessary!

Here is an example for sci-Seq:

project: SciSeq
sequence_files:
  file1:
    name: i7.fastq.gz
    base_definition:
      - BC(1-8)
  file2:
    name: i5.fastq.gz
    base_definition:
      - BC(1-8)
  file3:
    name: R1.fastq.gz
    base_definition:
      - UMI(1-6)
      - BC(7-16)
  file4:
    name: R2.fastq.gz
    base_definition:
      - cDNA(1-50)

Here is an example for SPLiT-seq:

project: SplitSeq
sequence_files:
  file1:
    name: R1.fastq.gz
    base_definition:
      - cDNA(1-50)
  file2:
    name: R2.fastq.gz
    base_definition:
      - UMI(1-10)
      - BC(11-18,49-56,87-94) 
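
The multi-range barcode in the SPLiT-seq example can be read as "take each listed range and join the pieces". The Python sketch below shows that reading with two hypothetical helpers, `parse_base_definition` and `extract`. It is meant to clarify the 1-based, inclusive coordinates, not to reproduce how zUMIs parses the YAML, and the example read with its 30-base linkers is made up to fit the ranges.

```python
import re

def parse_base_definition(definition):
    """Turn 'BC(11-18,49-56,87-94)' into ('BC', [(11, 18), (49, 56), (87, 94)])."""
    match = re.fullmatch(r"(\w+)\(([\d,\-]+)\)", definition.strip())
    if match is None:
        raise ValueError(f"unrecognised base definition: {definition!r}")
    kind, ranges = match.groups()
    spans = [tuple(int(x) for x in part.split("-")) for part in ranges.split(",")]
    return kind, spans

def extract(read, spans):
    """Concatenate the 1-based, inclusive spans of a read."""
    return "".join(read[start - 1:end] for start, end in spans)

kind, spans = parse_base_definition("BC(11-18,49-56,87-94)")

# Made-up read: 10-base UMI, then three 8-base barcode blocks
# separated by 30-base linkers (written as lowercase x).
read = "N" * 10 + "AAAAAAAA" + "x" * 30 + "CCCCCCCC" + "x" * 30 + "GGGGGGGG"
print(kind, spans)           # BC [(11, 18), (49, 56), (87, 94)]
print(extract(read, spans))  # AAAAAAAACCCCCCCCGGGGGGGG
```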

If you use oligo-dT and hexamer priming with distinct barcodes on the same cells, you may find the zUMIs barcode sharing feature useful for combining the counts of each cell.

ddSEQ / SureCell 3'

Illumina/Bio-Rad ddSEQ data can be processed with zUMIs from version 2.2.0 onwards. In this protocol, the cell barcode is composed of three blocks that may be shifted in the read. zUMIs can detect the linker sequence and use it to account for the phase shift. When setting up base definitions, give the linker sequence and the base ranges for the "unshifted" case.

Follow this example for ddSEQ data:

sequence_files:
  file1:
    name: BCUMIread_R1.fastq.gz
    base_definition:
      - BC(1-6,22-27,43-48)
      - UMI(50-57)
    correct_frameshift: TAGCCATCGCATTGC
  file2:
    name: cDNAread_R2.fastq.gz
    base_definition:
      - cDNA(1-75)
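
One way to picture the frameshift handling: in the unshifted layout the 15-base linker sits directly after the first barcode block (bases 7-21), so a linker found further along means extra bases at the start of the read. The Python sketch below trims those bases and then applies the unshifted ranges from the config. This is an assumed reading of the description above, not the zUMIs implementation; `unshift` and `barcode_and_umi` are hypothetical helpers and the example reads are made up.

```python
LINKER = "TAGCCATCGCATTGC"  # correct_frameshift value from the config above
LINKER_START = 6            # 0-based; unshifted, the linker follows bases 1-6

def unshift(read):
    """Trim leading bases so the linker sits at its unshifted position."""
    found = read.find(LINKER)
    if found < LINKER_START:
        return None  # linker missing (find gives -1) or too early to repair
    return read[found - LINKER_START:]

def barcode_and_umi(read):
    """Apply BC(1-6,22-27,43-48) and UMI(50-57) after removing the shift."""
    fixed = unshift(read)
    if fixed is None or len(fixed) < 57:
        return None
    barcode = fixed[0:6] + fixed[21:27] + fixed[42:48]
    umi = fixed[49:57]
    return barcode, umi

# Made-up reads: the second has 2 extra bases in front of the first block.
unshifted = "AAAAAA" + LINKER + "CCCCCC" + "N" * 15 + "GGGGGG" + "N" + "TTTTACGT"
shifted = "GT" + unshifted
print(barcode_and_umi(unshifted))  # ('AAAAAACCCCCCGGGGGG', 'TTTTACGT')
print(barcode_and_umi(shifted))    # ('AAAAAACCCCCCGGGGGG', 'TTTTACGT')
```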

InDrops

InDrops data can be processed by zUMIs when generated by the v2 or v3 protocols. The InDrops-specific mode has been removed because zUMIs2 can handle the data directly.

Here are examples for InDrops:

project: InDropsV2
sequence_files:
  file1:
    name: cdnaread.R1.fastq.gz
    base_definition:
      - cDNA(1-36)
  file3:
    name: librarybc.R2.fastq.gz
    base_definition:
      - BC(1-6)
  file4:
    name: barcodeUMIread.R3.fastq.gz
    base_definition:
      - BC(1-8,31-38)
      - UMI(39-44)
    correct_frameshift: AAGGCGTCACAAGCAATCACTC

project: InDropsV3
sequence_files:
  file1:
    name: cdnaread.R1.fastq
    base_definition:
      - cDNA(1-50)
  file2:
    name: barcode1read.R2.fastq
    base_definition:
      - BC(1-8)
  file3:
    name: librarybc.R3.fastq
    base_definition:
      - BC(1-6)
  file4:
    name: barcode2UMIread.R4.fastq
    base_definition:
      - BC(1-8)
      - UMI(9-14)

STRT & STRT-2i

zUMIs can process STRT and STRT-2i data. The STRT-specific mode and options have been removed with zUMIs2.

Here is an example for STRT-2i:

project: STRT
sequence_files:
  file1:
    name: umicdnaread.R1.fastq
    base_definition:
      - UMI(1-6)
      - cDNA(9-50)
  file2:
    name: barcode1read.R2.fastq
    base_definition:
      - BC(1-8)
  file3:
    name: barcode1read.R3.fastq
    base_definition:
      - BC(1-8)

BD Rhapsody WTA 3'

project: Rhapsody
sequence_files:
  file1:
    name: R1.fastq.gz
    base_definition:
      - BC(1-9,22-30,44-52)
      - UMI(53-60)
  file2:
    name: R2.fastq.gz
    base_definition:
      - cDNA(1-100)
  file3:
    name: I1.fastq.gz
    base_definition:
      - BC(1-8)