-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathlow_level.yaml
162 lines (121 loc) · 8.65 KB
/
low_level.yaml
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
- Class: meta
Course: BiocSwirl
Lesson: Low Level Analyses
Author: Paaksum Wong
Type: Standard
Organization: The University of British Columbia
Version: 1.0
#Introduction
- Class: text
Output: Welcome to BiocSwirl - Terminal based Bioconductor Courses. This lesson will cover how to install required packages, loading scRNA data and exploring various visualization methods, and processing raw reads.
- Class: text
Output: Install the required software at the following link, https://github.com/lisancao/biocswirl/blob/master/biocswirl_dev/devenv_install.
#Explain alignment
- Class: text
Output: Next-gen sequencers output a FASTQ file for each sequencing read (single-end or paired-end), which will be the first thing you work with in scRNA-seq analysis. Each FASTQ file contains a sequence identifier, the nucleotide sequence as well as quality scores for each nucleotide encoded in ASCII characters. The first step of analysis is to align these FASTQ files to a reference.
- Class: text
Output: There are many open source and commercial alignment tools with varying sensitivities and speeds. Which aligner you choose to use depends on available computing power and the acceptable trade-off between accuracy and speed. Commonly used aligners include BWA, Bowtie2, HISAT2, Bfast, and Stampy.
- Class: text
Output: In our sample data, 100bp single-end reads were aligned using RSEM to the mm10 mouse genome build with the RefSeq annotation downloaded on 11 June 2013. Raw fastq files are available at Gene Expression Omnibus, accession ID GSE71585.
#Alignment QC
- Class: text
Output: The standard output of alignment packages is a BAM file. Before downstream processing of aligned reads (.bam) in R, we must assess the quality of the alignments.
- Class: mult_question
Output: What tool can we use to quality check the aligned .bam files?
AnswerChoices: FastQC;samtools;STAR;DESeq2
CorrectAnswer: FastQC
AnswerTests: omnitest(correctVal='FastQC')
- Class: text
Output: Go to https://www.bioinformatics.babraham.ac.uk/projects/fastqc/ and follow instructions to perform QC on .bam files.
#Import data into R
- Class: text
Output: Now that we have good quality alignment files, let's load our data into R. First, let's install bioconductor and the required packages.
- Class: cmd_question
Output: In the R console, enter (without asterisks) *if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install()*
CorrectAnswer: if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install()
AnswerTests: omnitest(correctExpr='if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install()')
Hint: Type the expression into the R console.
- Class: cmd_question
Output: Install the packages tidyverse, ___.
CorrectAnswer: BiocManager::install(c("tidyverse","___"))
AnswerTests: omnitest(correctExpr='BiocManager::install(c("tidyverse","___"))')
Hint: Refer to the bioconductor install page. Use quotations.
- Class: cmd_question
Output: Load the installed packages.
CorrectAnswer: library("")
AnswerTests: omnitest(correctExpr='library("")')
Hint: Load each library separately.
- Class: text
Output: Set working directory.
- Class: cmd_question
Output: Input the .csv metadata file.
CorrectAnswer: read_csv("cell_metadata.csv")
AnswerTests: omnitest(correctExpr='read_csv("cell_metadata.csv")')
Hint: Use tidyverse.
- Class: mult_question
Output: What are the dimensions of the metadata table?
AnswerChoices: 1680,16;16,1679;16,1680;1679,16; none of the above
CorrectAnswer: 1679,16
AnswerTests: omnitest(correctVal='1679,16')
Hint: Rows first, then columns.
- Class: text
Output: Confirm whether your data was imported correctly using the head() function.
# Counting reads
- Class: text
Output: From the aligned sequence BAM files and a list of genomic features, we can now count how many reads map to each feature. First we must obtain the list which contains annotated genomic features.
- Class: mult_question
Output: Which of the following file types contain annotated genomic features?
AnswerChoices: SAM, BAM, VCF, GTF, PDF
CorrectAnswer: GTF
AnswerTests: omnitest(correctVal='GTF')
- Class: text
Output: General Transfer Format (GTF) files generally contain the chromosome number, the feature type name, and the nucleotide start/end position of the feature. Find and download your specific species from the Ensembl FTP Downloads https://uswest.ensembl.org/info/data/ftp/index.html.
- Class: text
Output: Once the GTF file is downloaded, read counting can proceed. htseq-count is a python script that outputs a table with counts for each feature. Follow https://htseq.readthedocs.io/en/release_0.11.1/install.html for installation instructions.
- Class: cmd_question
Output: Write a function that counts the features of a BAM input file, skipping all reads with alignment quality lower than 10. For the file names, use 'sample.gtf' and count all bam files
CorrectAnswer: htseq-count -f bam -a *.bam sample.gtf
AnswerTests: omnitest(correctExpr='htseq-count -f bam -a *.bam sample.gtf')
Hint: Make sure the correct options are used. Recall the bash wildcard when searching for a pattern to return all BAM files.
# Remove low read counts
- Class: text
Output: Raw single-cell data is noisy in that read counts contain many genes with zero counts or insignificantly low count numbers. We will be filtering these low counts before proceeding to differential gene expression.
- Class: mult_question
Output: For less than how many counts should genes be filtered out?
AnswerChoices: 1;5;10;20;it depends
CorrectAnswer: it depends
AnswerTests: omnitest(correctVal='it depends')
- Class: text
Output: The filter threshold will depend on the differential expression algorithm that is being used. Refer to the specific differential expression vignette for details.
# Normalization
- Class: text
Output: There are many factors to account for which contribute to the number of reads mapped to a gene. Such factors include gene length, GC content, and sequencing depth; during library prep, variability in number of molecules sequenced between samples may contribute to different total read counts for different samples. Normalization is therefore an essential step when converting raw read counts to gene expression values.
- Class: text
Output: Normalization methods can be categorized into several types - Normalization by library size, normalization by distribution, and normalization by controls. Proceed to learn more about each method.
- Class: text
Output: Normalization by library size removes differences in sequencing depth by dividing total number of reads in each sample. It assumes that the amount of total expression is the same under different experimental conditions. Reads per kilobase per million mapped reads (RPKM) and fragments per kilobase per million mapped reads (FPKM) are common methods that normalize by library size.
- Class: text
Output: Normalization by distribution equilibrates expression levels for non-DE genes by comparing distributions of read counts across samples. This method assumes that DE and non-DE genes behave the same, and that there is roughly balanced expression across conditions.
- Class: text
Output: Normalization by controls uses control genes which are not affected by biological conditions but the same amount of control molecules are present. A disparity in the number of control molecules sequenced allows for normalization.
- Class: cmd_question
Output: Our sample data was normalized by library size. RPKM data was generated by aligning 100bp single-end reads using RSEM to the mm10 mouse genome. Now, import the file containing RPKM values into R.
CorrectAnswer: read_csv("genes_rpkm.csv")
AnswerTests: omnitest(correctExpr='read_csv("genes_rpkm.csv")')
Hint: Import using tidyverse, similarly to the metadata file.
# Dimensionality Reduction
- Class: text
Output: Raw single-cell data is noisy in that the data is high dimensional. High dimensional data is complex and can affect the accuracy of downstream algorithms since there is more data that needs to be generalized. Principal component analysis is a common method to perform dimensionality reduction.
- Class: cmd_question
Output: Plot a PCA plot of the sample dataset.
CorrectAnswer: plotPCA(genes_rpkm)
AnswerTests: omnitest(correctExpr='plotPCA(genes_rpkm)')
Hint: Use plotPCA.
# References
#https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6171491/