-
Notifications
You must be signed in to change notification settings - Fork 10
10x Genomics scRNA seq tutorial
This tutorial will guide you through a full analysis of single-cell RNA seq (scRNA-seq) data obtained from 10x Genomics protocol.
Before starting, you may want to check:
- How to install Eoulsan.
- How to use Eoulsan for the first time
- If you have a functional installation of Docker. Docker allows us to easilly deploy complex packages (e.g. Bioconductor) in a reproducible way. If you don't have Docker, you will need to install:
- UMI-tools
- STAR mapper
- subread package (version >= 1.5.3)
This tutorial uses an example dataset, supplied by 10x Genomics. You can find other public datasets on their webpage. This tool is inspired from UMI-tools single cell workflow.
You can find a 10x Genomics workflow demo in the single cell dedicated GitHub.
There are some requirements to run a full workflow (you can download those files by following the instructions in this tutorial). You would need:
- A dataset (FASTQ files);
- A reference genome (FASTA file);
- A transcriptome annotation (GFF, GFF3, or GTF file);
- A gene annotation file (TSV file);
- A workflow file (XML file);
- A design file (TXT file).
Here is a concise view of the steps of the workflow:
Step | Module | Inputs | Output |
---|---|---|---|
Fastqc* | fastqc | FASTQ (R1, R2) | HTML |
Filter reads* | filterreads | FASTQ (R1, R2) | FASTQ (R1, R2) |
Find cell barcodes whitelist | umiwhitelist | FASTQ (R1) | TXT |
Extract CB/UMI from R1 to R2 | umiextract | FASTQ (R1, R2) + TXT | FASTQ (R2) |
Create STAR index* | starindexgenerator | FASTQ (R2) | Star index |
Reads mapping | mapreads | FASTQ (R2) | BAM |
Multiqc* | multiqc | Log files | HTML |
Filter SAM files* | filtersam | SAM | SAM |
Convert SAM to BAM files | sam2bam | SAM | BAM |
Assign reads to genes | featurecounts | BAM + GTF | BAM |
Count unique reads per genes per cell | umicount | BAM | TSV |
Create a Bioconductor SingleCellExperiment Object* | rsinglecellexperimentcreator | TSV (matrix) + TSV (gene annotation) | RDS |
* Optional steps
# Step 1: get data
### FASTQ files
wget http://outils.genomique.biologie.ens.fr/leburon/downloads/eoulsan-samples/scrnaseq-chromium/hgmm_100_fastqs.tar
tar -xf hgmm_100_fastqs.tar
mkdir data
cat fastqs/hgmm_100_S1_L00?_R1_001.fastq.gz > data/hgmm_R1.fastq.gz
cat fastqs/hgmm_100_S1_L00?_R2_001.fastq.gz > data/hgmm_R2.fastq.gz
### Reference genome
wget http://outils.genomique.biologie.ens.fr/leburon/downloads/eoulsan-samples/genomes/hg19ens91.fasta.bz2
### Annotation files
wget http://outils.genomique.biologie.ens.fr/leburon/downloads/eoulsan-samples/genomes/hg19ens91.gtf.bz2
wget http://outils.genomique.biologie.ens.fr/leburon/downloads/eoulsan-samples/genomes/hg19ens91.gff.bz2
wget http://outils.genomique.biologie.ens.fr/leburon/downloads/eoulsan-samples/genomes/hg19ens91.tsv.bz2
# Step 2: create the design file
eoulsan.sh createdesign -p data/*.fastq.gz hg19ens91.fasta.bz2 hg19ens91.gtf.bz2 hg19ens91.gff.bz2 hg19ens91.tsv.bz2
# Step 3: create the workflow file or use an existing one
wget https://github.com/GenomicParisCentre/eoulsan/wiki/files/workflow-scrnaseq-chromium.xml
# Step 4: launch the analysis
eoulsan.sh exec workflow-scrnaseq-chromium.xml design.txt
We provide in this section some samples files to test Eoulsan single-cell RNA seq workflow on 10x Genomics data. The dataset is from a 10x Genomics experiment of a mixture of 100 cells from human and mouse. To download it:
wget http://outils.genomique.biologie.ens.fr/leburon/downloads/eoulsan-samples/scrnaseq-chromium/hgmm_100_fastqs.tar
tar -xf hgmm_100_fastqs.tar;
You will get a directory called "fastqs". It contains data from 8 sequencing lanes, with three files for each lane: a read 1 file (R1), a read 2 file (R2) and a file containing the Illumina indexes. We don't need the indexes files. To process the analysis, you need to merge all reads 1 files into one single file (same for reads 2):
mkdir data
cat fastqs/hgmm_100_S1_L00?_R1_001.fastq.gz > data/hgmm_R1.fastq.gz;
cat fastqs/hgmm_100_S1_L00?_R2_001.fastq.gz > data/hgmm_R2.fastq.gz;
Let's have a look at the content of the reads files. In the R1, you will see that the reads have a length of 26 bases. This corresponds to the cell barcode's 16 bases, followed by the UMI's 10 bases. In the R2, you will find reads that are 98 bases long. See 10x Genomics support for more information on these reads length.
# R1: cell barcodes (16 nt) + UMI (10 nt)
$ zcat hgmm_R1.fastq.gz | head -n 4
@ST-K00126:308:HFLYFBBXX:1:1101:25834:1173 1:N:0:NACCACCA
NGGGTCAGTCTAGTGTGGCGATTCAC
+
#AAFFJJJJJJJJJJJJJJJJJJJJJ
# R2: read (98 nt)
$ zcat hgmm_R2.fastq.gz | head -n 4
@ST-K00126:308:HFLYFBBXX:1:1101:25834:1173 2:N:0:NACCACCA
NGAAGCTGCATGAATGTAATCAACATTCCAACTGGAGCTCTCATTTGCTTGTCCTCTTTGCCTTCGGTAATATGTATAAACTTACATCACGACTTTCT
+
#<-7AAAA7FF<JFJJFJJF-AJAJ----AAA<--A<77-7----7---7A<--7-7------7--77AFFJAA<A<F7F<7<FAJAFF<7FFF77<A
In an empty directory, copy the reads, genome and annotation files, then you can create a design file with the next command:
eoulsan.sh createdesign -p data/*.fastq.gz hg19ens91.fasta.bz2 hg19ens91.gtf.bz2 hg19ens91.gff.bz2
You can modify the design file (design.txt
) to add additional information. Note that Eoulsan handles automatically compressed files. The design file can be modify by any text editor or spreadsheet like Microsoft Excel or Libreoffice Calc.
You can add additional columns to the design to describe your cells. In this workflow, all the annotations in column names starting with "Cell." will be automatically added to the Bioconductor SingleCellExperiment object generated at the end of the workflow.
To create a workflow file, the best solution is to reuse an existing workflow file and adapt it to your needs. You can either copy the workflow from the 10x Genomics Demo or download it with the following command:
wget https://github.com/genomicpariscentre/eoulsan/wiki/files/workflow-scrnaseq-chromium.xml
The workflow file contains the list of the steps that will be executed by Eoulsan. Each step have parameters and is related to a module to be executed. For each step you can change, add or remove parameters. Parameters are specific to a module, consult the documentation of the built-in steps for the list of available parameters of each step.
This step returns a HTML report on the sequencing reads quality. FastQC provides a simple way to do some quality control checks on raw sequence data coming from high throughput sequencing pipelines. It provides a modular set of analyses which you can use to give a quick impression of whether your data has any problems of which you should be aware before doing any further analysis. For more information, see the Fastqc website.
<!-- FastQC of non filtered reads -->
<step id="step1fastqc" skip="false">
<module>fastqc</module>
</step>
You can add this step before and after any step that manipulate raw sequencing reads (FASTQ, BAM, SAM) to check the data quality (before or after filtering). To get a typical output of this report within Eoulsan workflow, see the demo fastqc folder.
This step filters sequencing reads taken as inputs. It can trim polyN read tails, remove reads with a short length and discard reads with bad base quality mean. For more information, see the filterreads module documentation.
<!-- Filter reads: remove low quality reads -->
<step id="step2filterreads" skip="false" discardoutput="asap">
<module>filterreads</module>
<parameters>
<parameter>
<name>illuminaid</name>
<value></value>
</parameter>
<parameter>
<name>quality.threshold</name>
<value>30</value>
</parameter>
</parameters>
</step>
In this example, the parameter illumnaid means that you will remove all reads that do not pass the Illumina filters. The quality.threshold parameters refers to the minimal phred quality score allowed to keep the reads (considered as good quality reads). A phred score of 30, such as in this example, means that there is a probability of 1/1000 for a base to be incorrectly identified.
The 10x Genomics protocol uses cell barcodes (CB) to identify single cells. Those barcodes are short nucleotide sequences (16 bases long). In the experiment there will be much more CB than there are cells (factor x10 to x1000). We thus need a way to identify the "true" CB, to identify those that can be assigned to real cells. UMI-tools whitelist detect these "true" CB with the common knee method, by taking the top most abundant CB. For more details on this method, see this blog post.
<!-- Extract cell barcodes and identify the most likely true barcodes using the 'knee' method. -->
<step id="step3whitelist" skip="false" requiredprocs="8">
<module>umiwhitelist</module>
<parameters></parameters>
</step>
As an output, you will get knee plots and a whitelist (text file). See the demo whitelist folder for an example output.
The whitelist contains a list of all the accepted CB. It has four columns:
- Column 1: "true" CB
- Column 2: list of CBs within a edit distance of the "true" CB (in column 1)
- Column 3: number of UMI (or reads) assigned to the corresponding "true" CB (in column 1)
- Column 4: number of UMI (or reads) assigned to the others CBs (in column 2)
E.g:
$ head -n 2 whitelist.txt
AAAGATGAGAAACGAG AAAAATGAGAAACGAG,AAACATGAGAAACGAG,... 46467 4,5,...
AAAGCAAGTACCTACA AAAACAAGTACCTACA,AAACCAAGTACCTACA,... 31947 2,3,...
<!-- Extract UMI barcode from a read and add it to the read name, leaving any sample barcode in place. -->
<step id="step4extract" dataproduct="match" skip="false">
<module>umiextract</module>
<parameters></parameters>
</step>
<!-- Create STAR index -->
<!-- This step may be skipped if needed (long time running). In this case, the step "mapreads" will automatically generate an index. -->
<step id="step5createstarindex" skip="false" requiredprocs="8">
<module>starindexgenerator</module>
<parameters>
<!-- The overhang value must be greater than the length of the reads -->
<parameter>
<name>overhang</name>
<value>100</value>
</parameter>
<parameter>
<name>use.gtf.file</name>
<value>true</value>
</parameter>
<parameter>
<name>gtf.feature.exon</name>
<value>exon</value>
</parameter>
<parameter>
<name>gtf.tag.exon.parent.transcript</name>
<value>Parent</value>
</parameter>
</parameters>
</step>
<!-- Map reads -->
<step id="step6mapreads" skip="false" discardoutput="asap" requiredprocs="8">
<module>mapreads</module>
<parameters>
<parameter>
<name>mapper</name>
<value>star</value>
</parameter>
<parameter>
<name>mapper.arguments</name>
<value>--outSAMunmapped Within</value>
</parameter>
</parameters>
</step>
<!-- Quality filter of SAM files -->
<step id="step7filtersam" skip="false">
<module>filtersam</module>
<parameters>
<parameter>
<name>removeunmapped</name>
<value>true</value>
</parameter>
<parameter>
<name>removemultimatches</name>
<value>true</value>
</parameter>
</parameters>
</step>
<!-- Assign reads to genes -->
<step id="step8featurecounts" requiredprocs="8" skip="false">
<module>featurecounts</module>
<inputs>
<input>
<port>input</port>
<fromstep>step8filtersam</fromstep>
<fromport>output</fromport>
</input>
</inputs>
<parameters></parameters>
</step>
See Multiqc
<!-- MutliQC of filtered and mapped reads -->
<step id="step9multiqc" skip="false">
<module>multiqc</module>
<parameters>
<parameter>
<name>reports</name>
<value>fastqc,mapreads</value>
</parameter>
<parameter>
<name>use.docker</name>
<value>false</value>
</parameter>
</parameters>
</step>
<step id="step10samtobam" skip="false">
<module>sam2bam</module>
<inputs>
<input>
<port>input</port>
<fromstep>step9featurecounts</fromstep>
<fromport>outputsam</fromport>
</input>
</inputs>
<parameters></parameters>
</step>
<!-- Count UMIs per gene per cell -->
<step id="step11umicounts" skip="false">
<module>umicount</module>
<parameters></parameters>
</step>
This step allow to create a RDS file that contains a Bioconductor SingleCellExperiment object. This object contains the matrixes generated by the step 3.11 and the cell and gene annotations.
The cell annotations is defined in the design file. User can add many columns to describes its cells. In this step, only the annotation which column names start with a prefix (e.g. "Cell.") will be add to the SingleCellExperiment object.
The gene annotations is defined is defined in the hg19ens91.tsv.bz2
file.
<!-- Create a SingleCellExperiment Bioconductor Object in a RDS file -->
<step id="step12singlecellexperiment" skip="false">
<module>rsinglecellexperimentcreator</module>
<parameters>
<parameter>
<name>input.matrices</name>
<value>true</value>
</parameter>
<parameter>
<name>design.prefix</name>
<value>Cell.</value>
</parameter>
<parameter>
<name>r.execution.mode</name>
<value>docker</value>
</parameter>
</parameters>
</step>
Once your design file and workflow file are ready, you can launch Eoulsan analysis with the following command:
eoulsan.sh exec workflow-scrnaseq-chromium.xml design.txt
Before starting the analysis, Eoulsan will check if:
- the design file and workflow file are valid
- the input files are valid (fasta and annotation files)
- modules called in the workflow do exist
- the order of the steps is correct (in terms of inputs / outputs needed for each step)
- a genome mapper index is needed. In this case, a step to generate this index is automatically added.
If successful, you will obtain a new directory named eoulsan-YearMonthDay-Time
with log files about the analysis. Result files are stored in the current directory.
sce <- readRDS('step12singlecellexperiment_output/step12singlecellexperiment_output_sce_matrix.rds')
summary(sce)