07-rnaseq-day2.html

<!DOCTYPE html>
<html>
  <head>
    <meta charset="utf-8">
    <meta name="generator" content="pandoc">
    <title>RNA-seq analysis in R</title>
    <link rel="shortcut icon" type="image/x-icon" href="/favicon.ico" />
    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
    <link rel="stylesheet" type="text/css" href="css/bootstrap/bootstrap.css" />
    <link rel="stylesheet" type="text/css" href="css/bootstrap/bootstrap-theme.css" />
    <link rel="stylesheet" type="text/css" href="css/swc.css" />
    <link rel="alternate" type="application/rss+xml" title="Software Carpentry Blog" href="http://software-carpentry.org/feed.xml"/>
    <meta charset="UTF-8" />
    <!-- HTML5 shim, for IE6-8 support of HTML5 elements -->
    <!--[if lt IE 9]>
      <script src="http://html5shim.googlecode.com/svn/trunk/html5.js"></script>
    <![endif]-->
  </head>
  <body class="lesson">
    <div class="container card">
      
      <article>
      <div class="row">
        <div class="col-md-10 col-md-offset-1">
                    <a href="index.html"><h1 class="title">RNA-seq analysis in R</h1></a>
          <h2 class="subtitle">Alignment and feature counting</h2>
          <p><strong>Authors: Belinda Phipson, Maria Doyle, Harriet Dashnow</strong></p>
<p>This material has been created using the following sources:<br />
http://www.statsci.org/smyth/pubs/QLedgeRPreprint.pdf <span class="citation">(Lun, Chen, and Smyth 2016)</span> http://bioinf.wehi.edu.au/RNAseqCaseStudy/</p>
<p>Packages used:<br />
Rsubread</p>
<p>Data files needed:<br />
Mouse chromosome 1 Rsubread index files (~400MB).<br />
Targets2.txt<br />
The 12 fastq.gz files for the mouse dataset.</p>
<p>Mouse mammary data (fastq files): <a href="https://figshare.com/s/f5d63d8c265a05618137" class="uri">https://figshare.com/s/f5d63d8c265a05618137</a> You should download these files and place them in your <code>/data</code> directory.</p>
<p>GEO entry for the dataset:<br />
http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE60450</p>
<p>The raw reads were downloaded from SRA from the link given in GEO for the dataset (ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/sra/SRP%2FSRP045%2FSRP045534). These files are in .sra format. The sra toolkit from NCBI was used to convert the .sra files to .fastq files using the fastq-dump command.</p>
<h2 id="downloading-genome-files">Downloading genome files</h2>
<p>We have provided the index files for chromosome 1 for the mouse genome build mm10 for this workshop in order to save time on building the index. However, full genome fasta files for a number of different genomes are available to download from the UCSC genome browser, see http://hgdownload.soe.ucsc.edu/downloads.html; from NCBI: http://www.ncbi.nlm.nih.gov/genome; or from ENSEMBL: http://asia.ensembl.org/info/data/ftp/index.html.</p>
<h2 id="introduction-and-data-import">Introduction and data import</h2>
<p>For the purpose of this workshop, we are going to be working with a small part of the mouse reference genome (chromosome 1) to demonstrate how to do read alignment and counting using R. Mapping reads to the genome is a very important task, and many different aligners are available, such as bowtie <span class="citation">(Langmead and Salzberg 2012)</span>, topHat <span class="citation">(Trapnell, Pachter, and Salzberg 2009)</span>, STAR <span class="citation">(Dobin et al. 2013)</span> and Rsubread <span class="citation">(Liao, Smyth, and Shi 2013)</span>. Rsubread is the only aligner that can run in R. Most alignment tools are run in a linux environment, and they are very computationally intensive. Most mapping tasks require larger computers than an average laptop, so usually read mapping is done on a server in a linux-like environment. Here we are only going to be mapping 1000 reads from each sample from our mouse lactation dataset <span class="citation">(Fu et al. 2015)</span>, and we will only be mapping to chromosome 1. This is so that everyone can have a go at alignment and counting on their laptops using RStudio.</p>
<p><em>Don’t forget to open a new project specifying the directory where you have saved all the data files for day 2</em></p>
<p>First, let’s load the Rsubread package into R.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">library</span>(Rsubread)</code></pre></div>
<p>Earlier we put all the sequencing read data (.fastq.gz files) in the data directory. Now we need to find them in order to tell the Rsubread aligner which files to look at. We can search for all .fastq.gz files in the data directory using the <code>list.files</code> command. The pattern argument takes a regular expression. In this case we are using the <code>$</code> to mean the end of the string, so we will only get files that end in “.fastq.gz”</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">fastq.files &lt;-<span class="st"> </span><span class="kw">list.files</span>(<span class="dt">path =</span> <span class="st">&quot;./data&quot;</span>, <span class="dt">pattern =</span> <span class="st">&quot;.fastq.gz$&quot;</span>, <span class="dt">full.names =</span> <span class="ot">TRUE</span>)
fastq.files</code></pre></div>
<pre class="output"><code> [1] &quot;./data/SRR1552444.fastq.gz&quot; &quot;./data/SRR1552445.fastq.gz&quot;
 [3] &quot;./data/SRR1552446.fastq.gz&quot; &quot;./data/SRR1552447.fastq.gz&quot;
 [5] &quot;./data/SRR1552448.fastq.gz&quot; &quot;./data/SRR1552449.fastq.gz&quot;
 [7] &quot;./data/SRR1552450.fastq.gz&quot; &quot;./data/SRR1552451.fastq.gz&quot;
 [9] &quot;./data/SRR1552452.fastq.gz&quot; &quot;./data/SRR1552453.fastq.gz&quot;
[11] &quot;./data/SRR1552454.fastq.gz&quot; &quot;./data/SRR1552455.fastq.gz&quot;
</code></pre>
<h2 id="alignment">Alignment</h2>
<h3 id="build-the-index">Build the index</h3>
<p>Read sequences are stored in compressed (gzipped) FASTQ files. Before the differential expression analysis can proceed, these reads must be aligned to the mouse genome and counted into annotated genes. This can be achieved with functions in the Rsubread package.</p>
<p>The first step in performing the alignment is to build an index. In order to build an index you need to have the fasta file (.fa), which can be downloaded from the UCSC genome browser. Here we are building the index just for chromosome 1. This may take several minutes to run. Building the full index using the whole mouse genome usually takes about 30 minutes to an hr on a server. We won’t be building the index in the workshop due to time constraints, we have provided the index files for you. The command below assumes the chr1 genome information for mm10 is stored in the “chr1.fa” file.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="co"># buildindex(basename=&quot;chr1_mm10&quot;,reference=&quot;chr1.fa&quot;)</span></code></pre></div>
<p>The above command will generate the index files in the working directory. In this example, the prefix for the index files is chr1_mm10. You can see the additional files generated using the <code>dir</code> command, which will list every file in your current working directory.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">dir</span>()</code></pre></div>
<h3 id="aligning-reads-to-chromosome-1-of-reference-genome">Aligning reads to chromosome 1 of reference genome</h3>
<p>Now that we have generated our index, we can align our reads using the <code>align</code> command. There are often numerous mapping parameters that we can specify, but usually the default mapping parameters for the <code>align</code> function are fine. If we had paired end data, we would specify the second read file/s using the <code>readfile2</code> argument. Our mouse data comprises 100bp single end reads.</p>
<p>We can specify the output files, or we can let <code>Rsubread</code> choose the output file names for us. The default output file name is the filename with “.subread.BAM” added at the end.</p>
<p>Now we can align our 12 fastq.gz files using the <code>align</code> command.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">align</span>(<span class="dt">index=</span><span class="st">&quot;data/chr1_mm10&quot;</span>,<span class="dt">readfile1=</span>fastq.files)</code></pre></div>
<p>This will align each of the 12 samples one after the other. As we’re only using a subset of 1000 reads per sample, aligning should just take a minute or so for each sample. To run the full samples from this dataset would take several hours per sample. The BAM files are saved in the working directory.</p>
<p>To see how many parameters you can change try the <code>args</code> function:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">args</span>(align)</code></pre></div>
<pre class="output"><code>function (index, readfile1, readfile2 = NULL, type = &quot;rna&quot;, input_format = &quot;gzFASTQ&quot;, 
    output_format = &quot;BAM&quot;, output_file = paste(as.character(readfile1), 
        &quot;subread&quot;, output_format, sep = &quot;.&quot;), nsubreads = 10, 
    TH1 = 3, TH2 = 1, maxMismatches = 3, nthreads = 1, indels = 5, 
    complexIndels = FALSE, phredOffset = 33, unique = TRUE, nBestLocations = 1, 
    minFragLength = 50, maxFragLength = 600, PE_orientation = &quot;fr&quot;, 
    nTrim5 = 0, nTrim3 = 0, readGroupID = NULL, readGroup = NULL, 
    color2base = FALSE, DP_GapOpenPenalty = -1, DP_GapExtPenalty = 0, 
    DP_MismatchPenalty = 0, DP_MatchScore = 2, detectSV = FALSE) 
NULL
</code></pre>
<p>In this example we have kept many of the default settings, which have been optimised to work well under a variety of situations. The default setting for <code>align</code> is that it only keeps reads that uniquely map to the reference genome. For testing differential expression of genes, this is what we want, as the reads are unambigously assigned to one place in the genome, allowing for easier interpretation of the results. Understanding all the different parameters you can change involves doing a lot of reading about the aligner that you are using, and can take a lot of time to understand! Today we won’t be going into the details of the parameters you can change, but you can get more information from looking at the help:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">?align</code></pre></div>
<p>We can get a summary of the proportion of reads that mapped to the reference genome using the <code>propmapped</code> function.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">bam.files &lt;-<span class="st"> </span><span class="kw">list.files</span>(<span class="dt">path =</span> <span class="st">&quot;./data&quot;</span>, <span class="dt">pattern =</span> <span class="st">&quot;.BAM$&quot;</span>, <span class="dt">full.names =</span> <span class="ot">TRUE</span>)
bam.files</code></pre></div>
<pre class="output"><code> [1] &quot;./data/SRR1552444.fastq.gz.subread.BAM&quot;
 [2] &quot;./data/SRR1552445.fastq.gz.subread.BAM&quot;
 [3] &quot;./data/SRR1552446.fastq.gz.subread.BAM&quot;
 [4] &quot;./data/SRR1552447.fastq.gz.subread.BAM&quot;
 [5] &quot;./data/SRR1552448.fastq.gz.subread.BAM&quot;
 [6] &quot;./data/SRR1552449.fastq.gz.subread.BAM&quot;
 [7] &quot;./data/SRR1552450.fastq.gz.subread.BAM&quot;
 [8] &quot;./data/SRR1552451.fastq.gz.subread.BAM&quot;
 [9] &quot;./data/SRR1552452.fastq.gz.subread.BAM&quot;
[10] &quot;./data/SRR1552453.fastq.gz.subread.BAM&quot;
[11] &quot;./data/SRR1552454.fastq.gz.subread.BAM&quot;
[12] &quot;./data/SRR1552455.fastq.gz.subread.BAM&quot;
</code></pre>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">props &lt;-<span class="st"> </span><span class="kw">propmapped</span>(<span class="dt">files=</span>bam.files)</code></pre></div>
<pre class="output"><code>The input file is opened as a BAM file.
The fragments in the input file are being counted.
Your operation system does not allow a single process to open more then 400 files. You may need to change this setting by using a &#39;ulimit -n 500&#39; command, or the program may crash.
Finished. All records: 1000; all fragments: 1000; mapped fragments: 133; the mappability is 13.30%
The input file is opened as a BAM file.
The fragments in the input file are being counted.
Your operation system does not allow a single process to open more then 400 files. You may need to change this setting by using a &#39;ulimit -n 500&#39; command, or the program may crash.
Finished. All records: 1000; all fragments: 1000; mapped fragments: 130; the mappability is 13.00%
The input file is opened as a BAM file.
The fragments in the input file are being counted.
Your operation system does not allow a single process to open more then 400 files. You may need to change this setting by using a &#39;ulimit -n 500&#39; command, or the program may crash.
Finished. All records: 1000; all fragments: 1000; mapped fragments: 111; the mappability is 11.10%
The input file is opened as a BAM file.
The fragments in the input file are being counted.
Your operation system does not allow a single process to open more then 400 files. You may need to change this setting by using a &#39;ulimit -n 500&#39; command, or the program may crash.
Finished. All records: 1000; all fragments: 1000; mapped fragments: 103; the mappability is 10.30%
The input file is opened as a BAM file.
The fragments in the input file are being counted.
Your operation system does not allow a single process to open more then 400 files. You may need to change this setting by using a &#39;ulimit -n 500&#39; command, or the program may crash.
Finished. All records: 1000; all fragments: 1000; mapped fragments: 53; the mappability is 5.30%
The input file is opened as a BAM file.
The fragments in the input file are being counted.
Your operation system does not allow a single process to open more then 400 files. You may need to change this setting by using a &#39;ulimit -n 500&#39; command, or the program may crash.
Finished. All records: 1000; all fragments: 1000; mapped fragments: 67; the mappability is 6.70%
The input file is opened as a BAM file.
The fragments in the input file are being counted.
Your operation system does not allow a single process to open more then 400 files. You may need to change this setting by using a &#39;ulimit -n 500&#39; command, or the program may crash.
Finished. All records: 1000; all fragments: 1000; mapped fragments: 119; the mappability is 11.90%
The input file is opened as a BAM file.
The fragments in the input file are being counted.
Your operation system does not allow a single process to open more then 400 files. You may need to change this setting by using a &#39;ulimit -n 500&#39; command, or the program may crash.
Finished. All records: 1000; all fragments: 1000; mapped fragments: 145; the mappability is 14.50%
The input file is opened as a BAM file.
The fragments in the input file are being counted.
Your operation system does not allow a single process to open more then 400 files. You may need to change this setting by using a &#39;ulimit -n 500&#39; command, or the program may crash.
Finished. All records: 1000; all fragments: 1000; mapped fragments: 133; the mappability is 13.30%
The input file is opened as a BAM file.
The fragments in the input file are being counted.
Your operation system does not allow a single process to open more then 400 files. You may need to change this setting by using a &#39;ulimit -n 500&#39; command, or the program may crash.
Finished. All records: 1000; all fragments: 1000; mapped fragments: 142; the mappability is 14.20%
The input file is opened as a BAM file.
The fragments in the input file are being counted.
Your operation system does not allow a single process to open more then 400 files. You may need to change this setting by using a &#39;ulimit -n 500&#39; command, or the program may crash.
Finished. All records: 1000; all fragments: 1000; mapped fragments: 127; the mappability is 12.70%
The input file is opened as a BAM file.
The fragments in the input file are being counted.
Your operation system does not allow a single process to open more then 400 files. You may need to change this setting by using a &#39;ulimit -n 500&#39; command, or the program may crash.
Finished. All records: 1000; all fragments: 1000; mapped fragments: 113; the mappability is 11.30%
</code></pre>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">props</code></pre></div>
<pre class="output"><code>                                  Samples NumTotal NumMapped PropMapped
1  ./data/SRR1552444.fastq.gz.subread.BAM     1000       133      0.133
2  ./data/SRR1552445.fastq.gz.subread.BAM     1000       130      0.130
3  ./data/SRR1552446.fastq.gz.subread.BAM     1000       111      0.111
4  ./data/SRR1552447.fastq.gz.subread.BAM     1000       103      0.103
5  ./data/SRR1552448.fastq.gz.subread.BAM     1000        53      0.053
6  ./data/SRR1552449.fastq.gz.subread.BAM     1000        67      0.067
7  ./data/SRR1552450.fastq.gz.subread.BAM     1000       119      0.119
8  ./data/SRR1552451.fastq.gz.subread.BAM     1000       145      0.145
9  ./data/SRR1552452.fastq.gz.subread.BAM     1000       133      0.133
10 ./data/SRR1552453.fastq.gz.subread.BAM     1000       142      0.142
11 ./data/SRR1552454.fastq.gz.subread.BAM     1000       127      0.127
12 ./data/SRR1552455.fastq.gz.subread.BAM     1000       113      0.113
</code></pre>
<section class="challenge panel panel-success">
<div class="panel-heading">
<h2 id="challenge"><span class="glyphicon glyphicon-pencil"></span>Challenge</h2>
</div>
<div class="panel-body">
<ol style="list-style-type: decimal">
<li>Try aligning the fastq files allowing multi-mapping reads (set <code>unique = FALSE</code>), and allowing for up to 6 “best” locations to be reported (<code>nBestLocations = 6</code>). Specify the output file names (bam.files.multi) by substituting “.fastq.gz” with “.multi.bam” so we don’t overwrite our unique alignment bam files.</li>
<li>Look at the proportion of reads mapped and see if we get any more reads mapping by specifying a less stringent criteria.</li>
</ol>
</div>
</section>
<h2 id="quality-control">Quality control</h2>
<p>We can have a look at the quality scores associated with each base that has been called by the sequencing machine using the <code>qualityScores</code> function in <em>Rsubread</em>.</p>
<p>Let’s first extract quality scores for 100 reads for the file “SRR1552450.fastq.gz”.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="co"># Extract quality scores</span>
qs &lt;-<span class="st"> </span><span class="kw">qualityScores</span>(<span class="dt">filename=</span><span class="st">&quot;data/SRR1552450.fastq.gz&quot;</span>,<span class="dt">nreads=</span><span class="dv">100</span>)</code></pre></div>
<pre class="output"><code>
qualityScores Rsubread 1.22.3

Scan the input file...
Totally 1000 reads were scanned; the sampling interval is 9.
Now extract read quality information...

Completed successfully. Quality scores for 100 reads (equally spaced in the file) are returned.
</code></pre>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="co"># Check dimension of qs</span>
<span class="kw">dim</span>(qs)</code></pre></div>
<pre class="output"><code>[1] 100 100
</code></pre>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="co"># Check first few elements of qs with head</span>
<span class="kw">head</span>(qs)</code></pre></div>
<pre class="output"><code>      1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
[1,] 31 34 31 37 37 37 37 37 39 37 39 39 39 41 40 41 41 41 36 38 40 41 39
[2,] 34 34 34 37 37 37 37 37 39 39 39 39 39 41 41 41 41 39 40 41 41 41 41
[3,] 34 34 31 37 37 37 37 37 39 39 39 39 39 41 41 41 41 41 41 41 41 41 41
[4,] 34 34 34 37 37 37 37 37 39 39 39 39 39 41 41 41 41 39 40 40 41 41 41
[5,] 34 34 34 37 37 37 37 37 39 39 39 39 39 41 41 41 40 41 41 41 41 41 41
[6,] 34 34 34 37 37 37 37 37 39 39 39 39 39 41 41 41 41 41 41 41 41 41 41
     24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46
[1,] 40 38 40 40 40 40 40 38 40 40 40 38 37 39 40 40 41 40 41 41 38 40 40
[2,] 41 41 41 41 41 41 41 41 41 38 40 41 41 41 41 41 41 41 41 41 41 41 41
[3,] 41 41 41 41 41 41 40 40 41 38 38 38 40 41 40 41 39 40 40 41 41 41 39
[4,] 41 40 41 41 41 41 41 40 41 41 41 41 40 41 41 41 41 41 41 41 41 41 41
[5,] 40 41 41 41 41 41 39 40 41 41 41 41 40 41 41 41 41 41 41 41 41 41 41
[6,] 41 41 36 40 40 41 41 41 41 40 38 40 38 40 41 41 40 41 41 41 40 40 41
     47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69
[1,] 40 41 41 41 40 40 36 40 38 37 37 35 35 33 35 35 33 35 34 35 35 35 34
[2,] 41 41 41 41 41 41 41 41 41 41 41 40 41 41 41 39 39 39 39 37 37 37 37
[3,] 40 38 40 40 37 39 40 41 41 40 37 39 39 39 37 37 37 37 37 37 36 36 35
[4,] 41 41 41 41 41 41 41 41 41 40 40 41 40 41 41 41 41 40 41 41 39 39 39
[5,] 41 41 41 41 41 40 41 41 41 41 40 41 41 41 38 40 41 41 41 41 39 39 39
[6,] 40 40 40 41 40 39 33 36 35 34 35 35 35 35 35 35 35 33 33 33 34 31 31
     70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92
[1,] 34 34 35 35 35 35 35 35 35 36 35 35 35 34 34 35 34 35 35 35 35 35 35
[2,] 36 36 36 34 36 35 35 35 35 35 35 35 35 35 35 37 35 36 35 35 35 35 35
[3,] 36 35 33 35 35 35 35 34 36 35 35 34 35 35 35 34 35 35 35 35 35 33 35
[4,] 39 37 37 37 37 37 37 36 36 36 36 36 36 35 35 35 35 35 35 35 35 35 34
[5,] 39 39 38 37 37 37 37 37 35 35 35 35 35 36 35 35 35 35 35 35 35 35 35
[6,] 34 29 34 33 34 35 32 32 34 34 33 31 34 34 18 24 32 33 35 35 27 32 31
     93 94 95 96 97 98 99 100
[1,] 35 35 35 35 35 35 35  33
[2,] 35 36 36 35 35 34 34  35
[3,] 35 35 31 34 35 35 33  33
[4,] 35 35 35 35 35 35 35  35
[5,] 35 36 35 35 35 35 35  35
[6,] 32 34 34 35 34 34 33  34
</code></pre>
<p>A quality score of 30 corresponds to a 1 in 1000 chance of an incorrect base call. (A quality score of 10 is a 1 in 10 chance of an incorrect base call.) To look at the overall distribution of quality scores across the 100 reads, we can look at a boxplot</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">boxplot</span>(qs)</code></pre></div>
<p><img src="fig-07/unnamed-chunk-12-1.png" title="plot of chunk unnamed-chunk-12" alt="plot of chunk unnamed-chunk-12" style="display: block; margin: auto;" /></p>
<section class="challenge panel panel-success">
<div class="panel-heading">
<h2 id="challenge-1"><span class="glyphicon glyphicon-pencil"></span>Challenge</h2>
</div>
<div class="panel-body">
<ol style="list-style-type: decimal">
<li>Extract quality scores for SRR1552451.fastq.gz for 50 reads.</li>
<li>Plot a boxplot of the quality scores for SRR1552451.fastq.gz.</li>
</ol>
</div>
</section>
<h2 id="counting">Counting</h2>
<p>Now that we have figured out where each read comes from in the genome, we need to summarise the information across genes or exons. The alignment produces a set of BAM files, where each file contains the read alignments for each library. In the BAM file, there is a chromosomal location for every read that mapped uniquely. The mapped reads can be counted across mouse genes by using the <code>featureCounts</code> function. <code>featureCounts</code> contains built-in annotation for mouse (mm9, mm10) and human (hg19) genome assemblies (NCBI refseq annotation).</p>
<p>The code below uses the exon intervals defined in the NCBI refseq annotation of the mm10 genome. Reads that map to exons of genes are added together to obtain the count for each gene, with some care taken with reads that span exon-exon boundaries. <code>featureCounts</code> takes all the BAM files as input, and outputs an object which includes the count matrix, similar to the count matrix we have been working with on Day 1. Each sample is a separate column, each row is a gene.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">fc &lt;-<span class="st"> </span><span class="kw">featureCounts</span>(bam.files, <span class="dt">annot.inbuilt=</span><span class="st">&quot;mm10&quot;</span>)</code></pre></div>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="co"># See what slots are stored in fc</span>
<span class="kw">names</span>(fc)</code></pre></div>
<pre class="output"><code>[1] &quot;counts&quot;     &quot;annotation&quot; &quot;targets&quot;    &quot;stat&quot;      
</code></pre>
<p>The statistics of the read mapping can be seen with fc$stats. This reports the numbers of unassigned reads and the reasons why they are not assigned (eg. ambiguity, multi-mapping, secondary alignment, mapping quality, fragment length, chimera, read duplicate, non-junction and so on), in addition to the number of successfully assigned reads for each library. (We know the real reason why the majority of the reads aren’t mapping - they’re not from chr 1!)</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">## Take a look at the featurecounts stats
fc$stat</code></pre></div>
<pre class="output"><code>                      Status ..data.SRR1552444.fastq.gz.subread.BAM
1                   Assigned                                     48
2       Unassigned_Ambiguity                                      0
3    Unassigned_MultiMapping                                      0
4      Unassigned_NoFeatures                                     85
5        Unassigned_Unmapped                                    867
6  Unassigned_MappingQuality                                      0
7  Unassigned_FragmentLength                                      0
8         Unassigned_Chimera                                      0
9       Unassigned_Secondary                                      0
10    Unassigned_Nonjunction                                      0
11      Unassigned_Duplicate                                      0
   ..data.SRR1552445.fastq.gz.subread.BAM
1                                      45
2                                       1
3                                       0
4                                      84
5                                     870
6                                       0
7                                       0
8                                       0
9                                       0
10                                      0
11                                      0
   ..data.SRR1552446.fastq.gz.subread.BAM
1                                      34
2                                       0
3                                       0
4                                      77
5                                     889
6                                       0
7                                       0
8                                       0
9                                       0
10                                      0
11                                      0
   ..data.SRR1552447.fastq.gz.subread.BAM
1                                      34
2                                       0
3                                       0
4                                      69
5                                     897
6                                       0
7                                       0
8                                       0
9                                       0
10                                      0
11                                      0
   ..data.SRR1552448.fastq.gz.subread.BAM
1                                      13
2                                       1
3                                       0
4                                      39
5                                     947
6                                       0
7                                       0
8                                       0
9                                       0
10                                      0
11                                      0
   ..data.SRR1552449.fastq.gz.subread.BAM
1                                      20
2                                       0
3                                       0
4                                      47
5                                     933
6                                       0
7                                       0
8                                       0
9                                       0
10                                      0
11                                      0
   ..data.SRR1552450.fastq.gz.subread.BAM
1                                      50
2                                       1
3                                       0
4                                      68
5                                     881
6                                       0
7                                       0
8                                       0
9                                       0
10                                      0
11                                      0
   ..data.SRR1552451.fastq.gz.subread.BAM
1                                      69
2                                       0
3                                       0
4                                      76
5                                     855
6                                       0
7                                       0
8                                       0
9                                       0
10                                      0
11                                      0
   ..data.SRR1552452.fastq.gz.subread.BAM
1                                      50
2                                       2
3                                       0
4                                      81
5                                     867
6                                       0
7                                       0
8                                       0
9                                       0
10                                      0
11                                      0
   ..data.SRR1552453.fastq.gz.subread.BAM
1                                      52
2                                       1
3                                       0
4                                      89
5                                     858
6                                       0
7                                       0
8                                       0
9                                       0
10                                      0
11                                      0
   ..data.SRR1552454.fastq.gz.subread.BAM
1                                      54
2                                       0
3                                       0
4                                      73
5                                     873
6                                       0
7                                       0
8                                       0
9                                       0
10                                      0
11                                      0
   ..data.SRR1552455.fastq.gz.subread.BAM
1                                      42
2                                       0
3                                       0
4                                      71
5                                     887
6                                       0
7                                       0
8                                       0
9                                       0
10                                      0
11                                      0
</code></pre>
<p>The counts for the samples are stored in fc$counts. Take a look at that.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">## Take a look at the dimensions to see the number of genes
<span class="kw">dim</span>(fc$counts)</code></pre></div>
<pre class="output"><code>[1] 27179    12
</code></pre>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">## Take a look at the first 6 lines
<span class="kw">head</span>(fc$counts)</code></pre></div>
<pre class="output"><code>          ..data.SRR1552444.fastq.gz.subread.BAM
497097                                         0
100503874                                      0
100038431                                      0
19888                                          0
20671                                          0
27395                                          0
          ..data.SRR1552445.fastq.gz.subread.BAM
497097                                         0
100503874                                      0
100038431                                      0
19888                                          0
20671                                          0
27395                                          0
          ..data.SRR1552446.fastq.gz.subread.BAM
497097                                         0
100503874                                      0
100038431                                      0
19888                                          0
20671                                          0
27395                                          0
          ..data.SRR1552447.fastq.gz.subread.BAM
497097                                         0
100503874                                      0
100038431                                      0
19888                                          0
20671                                          0
27395                                          0
          ..data.SRR1552448.fastq.gz.subread.BAM
497097                                         0
100503874                                      0
100038431                                      0
19888                                          0
20671                                          0
27395                                          0
          ..data.SRR1552449.fastq.gz.subread.BAM
497097                                         0
100503874                                      0
100038431                                      0
19888                                          0
20671                                          0
27395                                          0
          ..data.SRR1552450.fastq.gz.subread.BAM
497097                                         0
100503874                                      0
100038431                                      0
19888                                          0
20671                                          0
27395                                          0
          ..data.SRR1552451.fastq.gz.subread.BAM
497097                                         0
100503874                                      0
100038431                                      0
19888                                          0
20671                                          0
27395                                          0
          ..data.SRR1552452.fastq.gz.subread.BAM
497097                                         0
100503874                                      0
100038431                                      0
19888                                          0
20671                                          0
27395                                          0
          ..data.SRR1552453.fastq.gz.subread.BAM
497097                                         0
100503874                                      0
100038431                                      0
19888                                          0
20671                                          0
27395                                          0
          ..data.SRR1552454.fastq.gz.subread.BAM
497097                                         0
100503874                                      0
100038431                                      0
19888                                          0
20671                                          0
27395                                          0
          ..data.SRR1552455.fastq.gz.subread.BAM
497097                                         0
100503874                                      0
100038431                                      0
19888                                          0
20671                                          0
27395                                          0
</code></pre>
<p>The row names of the fc$counts matrix represent the Entrez gene identifiers for each gene and the column names are the output filenames from calling the <code>align</code> function. The <code>annotation</code> slot shows the annotation information that <code>featureCounts</code> used to summarise reads over genes.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">head</span>(fc$annotation)</code></pre></div>
<pre class="output"><code>     GeneID                           Chr
1    497097                chr1;chr1;chr1
2 100503874                     chr1;chr1
3 100038431                          chr1
4     19888 chr1;chr1;chr1;chr1;chr1;chr1
5     20671      chr1;chr1;chr1;chr1;chr1
6     27395      chr1;chr1;chr1;chr1;chr1
                                            Start
1                         3214482;3421702;3670552
2                                 3647309;3658847
3                                         3680155
4 4290846;4343507;4351910;4352202;4360200;4409170
5         4490928;4493100;4493772;4495136;4496291
6         4773198;4777525;4782568;4783951;4785573
                                              End      Strand Length
1                         3216968;3421901;3671498       -;-;-   3634
2                                 3650509;3658904         -;-   3259
3                                         3681788           +   1634
4 4293012;4350091;4352081;4352837;4360314;4409241 -;-;-;-;-;-   9747
5         4492668;4493466;4493863;4495942;4496413   -;-;-;-;-   3130
6         4776801;4777648;4782733;4784105;4785726   -;-;-;-;-   4203
</code></pre>
<section class="challenge panel panel-success">
<div class="panel-heading">
<h2 id="challenge-2"><span class="glyphicon glyphicon-pencil"></span>Challenge</h2>
</div>
<div class="panel-body">
<ol style="list-style-type: decimal">
<li>Redo the counting over the exons, rather than the genes (specify <code>useMetaFeatures = FALSE</code>). Use the bam files generated doing alignment reporting only unique reads, and call the <code>featureCounts</code> object <code>fc.exon</code>. Check the dimension of the counts slot to see how much larger it is.</li>
<li>Using your “.multi.bam” files, redo the counting over genes, allowing for multimapping reads (specify <code>countMultiMappingReads = TRUE</code>), calling the object <code>fc.multi</code>. Check the stats.</li>
</ol>
</div>
</section>
<p>Notes</p>
<ul>
<li>If you are sequencing your own data, the sequencing facility will almost always provide fastq files.<br />
</li>
<li>For publicly available sequence data from GEO/SRA, the files are usually in the Sequence Read Archive (SRA) format. Prior to read alignment, these files need to be converted into the FASTQ format using the fastq-dump utility from the SRA Toolkit. See http: //www.ncbi.nlm.nih.gov/books/NBK158900 for how to download and use the SRA Toolkit.<br />
</li>
<li>By default, alignment is performed with <code>unique=TRUE</code>. If a read can be aligned to two or more locations, <em>Rsubread</em> will attempt to select the best location using a number of criteria. Only reads that have a unique best location are reported as being aligned. Keeping this default is recommended, as it avoids spurious signal from non-uniquely mapped reads derived from, e.g., repeat regions.<br />
</li>
<li>The Phred offset determines the encoding for the base-calling quality string in the FASTQ file. For the Illumina 1.8 format onwards, this encoding is set at +33. However, older formats may use a +64 encoding. Users should ensure that the correct encoding is specified during alignment. If unsure, one can examine the first several quality strings in the FASTQ file. A good rule of thumb is to check whether lower-case letters are present (+64 encoding) or absent (+33).<br />
</li>
<li><code>featureCounts</code> requires gene annotation specifying the genomic start and end position of each exon of each gene. <em>Rsubread</em> contains built-in gene annotation for mouse and human. For other species, users will need to read in a data frame in GTF format to define the genes and exons. Users can also specify a custom annotation file in SAF format. See the Rsubread users guide for more information, or try <code>?featureCounts</code>, which has an example of what an SAF file should like like.</li>
</ul>
<h1 id="package-versions-used">Package versions used</h1>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">sessionInfo</span>()</code></pre></div>
<pre class="output"><code>R version 3.3.1 (2016-06-21)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.6 (El Capitan)

locale:
[1] en_AU.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  base     

other attached packages:
[1] Rsubread_1.22.3 knitr_1.14     

loaded via a namespace (and not attached):
[1] magrittr_1.5  formatR_1.4   tools_3.3.1   stringi_1.1.1 methods_3.3.1
[6] stringr_1.1.0 evaluate_0.9 
</code></pre>
<h1 id="references" class="unnumbered">References</h1>
<div id="refs" class="references">
<div id="ref-Dobin2013">
<p>Dobin, Alexander, Carrie A Davis, Felix Schlesinger, Jorg Drenkow, Chris Zaleski, Sonali Jha, Philippe Batut, Mark Chaisson, and Thomas R Gingeras. 2013. “STAR: ultrafast universal RNA-seq aligner.” <em>Bioinformatics (Oxford, England)</em> 29 (1): 15–21. doi:<a href="https://doi.org/10.1093/bioinformatics/bts635">10.1093/bioinformatics/bts635</a>.</p>
</div>
<div id="ref-Fu2015">
<p>Fu, Nai Yang, Anne C Rios, Bhupinder Pal, Rina Soetanto, Aaron T L Lun, Kevin Liu, Tamara Beck, et al. 2015. “EGF-mediated induction of Mcl-1 at the switch to lactation is essential for alveolar cell survival.” <em>Nature Cell Biology</em> 17 (4): 365–75. doi:<a href="https://doi.org/10.1038/ncb3117">10.1038/ncb3117</a>.</p>
</div>
<div id="ref-Langmead2012">
<p>Langmead, Ben, and Steven L Salzberg. 2012. “Fast gapped-read alignment with Bowtie 2.” <em>Nature Methods</em> 9 (4). Nature Publishing Group, a division of Macmillan Publishers Limited. All Rights Reserved.: 357–9. doi:<a href="https://doi.org/10.1038/nmeth.1923">10.1038/nmeth.1923</a>.</p>
</div>
<div id="ref-liao2013subread">
<p>Liao, Yang, Gordon K Smyth, and Wei Shi. 2013. “The Subread aligner: fast, accurate and scalable read mapping by seed-and-vote.” <em>Nucleic Acids Research</em> 41: 16 pages.</p>
</div>
<div id="ref-Lun2016">
<p>Lun, Aaron T L, Yunshun Chen, and Gordon K Smyth. 2016. “It’s DE-licious: A Recipe for Differential Expression Analyses of RNA-seq Experiments Using Quasi-Likelihood Methods in edgeR.” <em>Methods in Molecular Biology (Clifton, N.J.)</em> 1418 (January): 391–416. doi:<a href="https://doi.org/10.1007/978-1-4939-3578-9\_19">10.1007/978-1-4939-3578-9\_19</a>.</p>
</div>
<div id="ref-trapnell2009tophat">
<p>Trapnell, Cole, Lior Pachter, and Steven L Salzberg. 2009. “TopHat: discovering splice junctions with RNA-seq.” <em>Bioinformatics</em> 25 (9): 1105–11. doi:<a href="https://doi.org/doi:10.1093/bioinformatics/btp120">doi:10.1093/bioinformatics/btp120</a>.</p>
</div>
</div>
        </div>
      </div>
      </article>
      
    </div>
    <!-- Javascript placed at the end of the document so the pages load faster -->
    <script src="http://software-carpentry.org/v5/js/jquery-1.9.1.min.js"></script>
    <script src="css/bootstrap/bootstrap-js/bootstrap.js"></script>
    <script src='https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML'></script>
  </body>
</html>