-
Notifications
You must be signed in to change notification settings - Fork 623
IGV Tutorial
Previous Section | This Section | Next Section |
---|---|---|
Alignment | IGV | Alignment Visualization |
- Introduction
- Description of the lab
- Requirements
- Compatibility
- Data Set for IGV
- Visualization Part 1: Getting familiar with IGV
- Get familiar with the interface 1. Load a Genome and Data Tracks 2. Navigation
- Region Lists
- Loading Read Alignments
- Visualizing read alignments
- Visualization Part 2: Inspecting SNPs, SNVs, and SVs
- Neighbouring somatic SNV and germline SNP
- Homopolymer repeat with indel
- Coverage by GC
- Heterozygous SNPs on different alleles
- Low mapping quality
- Homozygous deletion
- Mis-alignment
- Translocation
- Visualization Part 3: Automating Tasks in IGV
Welcome to the lab for Genome Visualization! This lab will introduce you to the [Integrative Genomics Viewer](http://www.broadinstitute.org/igv/home), one of the most popular visualization tools for High Throughput Sequencing (HTS) data.
Lecture files that accompany this tutorial:
After this lab, you will be able to:
- Visualize a variety of genomic data
- Quickly navigate around the genome
- Visualize read alignments
- Validate SNP/SNV calls and structural re-arrangements by eye
Things to know before you start:
- The lab may take between 1-2 hours, depending on your familiarity with genome browsing. Do not worry if you do not complete the lab. It will remain available to review later.
- There are a few thought-provoking Questions or Notes pertaining to sections of the lab. These are optional, and may take more time, but are meant to help you better understand the visualizations you are seeing. These questions will be denoted by boxes, as follows: Question(s):
Thought-provoking question goes here
- Integrative Genomics Viewer
- Ability to run Java
- Note that while most tutorials in this course are performed on the cloud, IGV will always be run on your local machine
This tutorial was intended for IGV v2.3, which is available on the IGV Download page. It is strongly recommended that you use this version.
We will be using publicly available Illumina sequence data from the HCC1143 cell line. The HCC1143 cell line was generated from a 52 year old caucasian woman with breast cancer. Additional information on this cell line can be found here: HCC1143 (tumor, TNM stage IIA, grade 3, primary ductal carcinoma) and HCC1143/BL (matched normal EBV transformed lymphoblast cell line).
- Sequence read alignments generated from a cell line HCC1143 that have been filtered to this region:
- Chromosome 21: 19,000,000-20,000,000
- HCC1143.normal.21.19M-20M.bam
- HCC1143.normal.21.19M-20M.bam.bai
We will be visualizing read alignments using IGV, a popular visualization tool for HTS data.
First, lets familiarize ourselves with it.
By default, IGV loads Human hg19. If you work with another version of the human genome, or another organism altogether, you can change the genome by clicking the drop down menu in the upper-left. For this lab, we will be using Human hg19.
We will also load additional tracks from Server using (File
-> Load from Server...
):
- Ensembl genes (or your favourite source of gene annotations)
- GC Percentage
- dbSNP 1.3.1 or 1.3.7
Load hg19 genome and additional data tracks
You should see listing of chromosomes in this reference genome. Choose 1, for chromosome 1.
Chromosome chooser
Navigate to chr1:10,000-11,000 by entering this into the location field (in the top-left corner of the interface) and clicking Go
. This shows a window of chromosome 1 that is 1,000 base pairs wide and beginning at position 10,000.
Navigition using Location text field. Sequence displayed as thin coloured rectangles.
IGV displays the sequence of letters in a genome as a sequence of colours (e.g. A = green, C = blue, etc.). This makes repetitive sequences, like the ones found at the start of this region, easy to identify. Zoom in a bit more using the +
button to see the individual bases of the reference genome sequence.
You can navigate to a gene of interest by typing it in the same box the genomic coordinates are in and pressing Enter/Return. Try it for your favourite gene, or BRCA1 if you can not decide.
Gene model
Genes are represented as lines and boxes. Lines represent intronic regions, and boxes represent exonic regions. The arrows indicate the direction/strand of transcription for the gene. When an exon box become narrower in height, this indicates a UTR.
When loaded, tracks are stacked on top of each other. You can identify which track is which by consulting the label to the left of each track.
Sometimes, it is really useful to save where you are, or to load regions of interest. For this purpose, there is a Region Navigator in IGV. To access it, click Regions
> Region Navigator
. While you browse around the genome, you can save some bookmarks by pressing the Add
button at any time.
Bookmarks in IGV
We will be using the breast cancer cell line HCC1143 to visualize alignments. For speed, only a small portion of chr21 will be loaded (19M:20M).
HCC1143 Alignments to hg19:
Copy the files to your local drive, and in IGV choose File
> Load from File...
, select the bam file, and click OK
. Note that the bam and index files must be in the same directory for IGV to load these properly.
Load BAM track from File
Navigate to a narrow window on chromosome 21: chr21:19,480,041-19,480,386
.
To start our exploration, right click on the track-name, and select the following options:
- Sort alignments by
start location
- Group alignments by
pair orientation
Experiment with the various settings by right clicking the read alignment track and toggling the options. Think about which would be best for specific tasks (e.g. quality control, SNP calling, CNV finding).
Changing how read alignments are sorted, grouped, and colored
You will see reads represented by grey or white bars stacked on top of each other, where they were aligned to the reference genome. The reads are pointed to indicate their orientation (i.e. the strand on which they are mapped). Mouse over any read and notice that a lot of information is available. To toggle read display from hover
to click
, select the yellow box and change the setting.
Changing how read information is shown (i.e. on hover, click, never)
Once you select a read, you will learn what many of these metrics mean, and how to use them to assess the quality of your datasets. At each base that the read sequence mismatches the reference, the colour of the base represents the letter that exists in the read (using the same colour legend used for displaying the reference).
Viewing read information for a single aligned read
In this section we will be looking in detail at 8 positions in the genome, and determining whether they represent real events or artifacts.
- Navigate to region
chr21:19,479,237-19,479,814
- Note two heterozygous variants, one corresponds to a known dbSNP (
G/T
on the right) the other does not (C/T
on the left) - Zoom in and center on the
C/T
SNV on the left, sort by base (windowchr21:19,479,321
is the SNV position) - Sort alignments by
base
- Color alignments by
read strand
Example1. Good quality SNVs/SNPs
Notes:
- High base qualities in all reads except one (where the alt allele is the last base of the read)
- Good mapping quality of reads, no strand bias, allele frequency consistent with heterozygous mutation
Question(s):
* What does *Shade base by quality* do? How might this be helpful?
* How does Color by *read strand* help?
Navigate to position chr21:19,518,412-19,518,497
Example 2a
- Group alignments by
read strand
- Center on the
A
within the homopolymer run (chr21:19,518,470
), andSort alignments by
->base
Example 2b
- Center on the one base deletion (
chr21:19,518,452
), andSort alignments by
->base
Notes:
- The alt allele is either a deletion or insertion of one or two
T
s - The remaining bases are mismatched, because the alignment is now out of sync
- The dpSNP entry at this location (rs74604068) is an A->T, and in all likelihood an artifact
- i.e. the common variants from dbSNP include some cases that are actually common misalignments caused by repeats
Navigate to position chr21:19,611,925-19,631,555
. Note that the range contains areas where coverage drops to zero in a few places.
Example 3
- Use
Collapsed
view - Use
Color alignments by
->insert size and pair orientation
- Load GC track
- See concordance of coverage with GC content
Question:
* Why are there blue and red reads throughout the alignments?
Navigate to region chr21:19,666,833-19,667,007
Example 4
- Sort by base (at position
chr21:19,666,901
)
Note:
- There is no linkage between alleles for these two SNPs because reads covering both only contain one or the other
Navigate to region chr21:19,800,320-19,818,162
- Load repeat track (
File
->Load from server...
)
Load repeats
Example 5
Notes:
- Mapping quality plunges in all reads (white instead of grey). Once we load repeat elements, we see that there are two LINE elements that cause this.
Navigate to region chr21:19,324,469-19,331,468
Example 6
- Turn on
View as Pairs
andExpanded
view - Use
Color alignments by
->insert size and pair orientation
- Sort reads by insert size
- Click on a red read pair to pull up information on alignments
Notes:
- Typical insert size of read pair in the vicinity: 350bp
- Insert size of red read pairs: 2,875bp
- This corresponds to a homozygous deletion of 2.5kb
Navigate to region chr21:19,102,154-19,103,108
Example 7
Notes:
- This is a position where AluY element causes mis-alignment.
- Misaligned reads have mismatches to the reference and well-aligned reads have partners on other chromosomes where additional ALuY elements are encoded.
- Zoom out until you can clearly see the contrast between the difficult alignment region (corresponding to an AluY) and regions with clean alignments on either side
Navigate to region chr21:19,089,694-19,095,362
Example 8
- Expanded view
-
Group alignments by
->pair orientation
-
Color alignments by
->pair orientation
Notes:
- Many reads with mismatches to reference
- Read pairs in RL pattern (instead of LR pattern)
- Region is flanked by reads with poor mapping quality (white instead of grey)
- Presence of reads with pairs on other chromosomes (coloured reads at the bottom when scrolling down)
We can use the Tools menu to invoke running a batch script. Batch scripts are described on the IGV website:
- Batch file requirements: https://www.broadinstitute.org/igv/batch
- Commands recognized in a batch script: https://www.broadinstitute.org/software/igv/PortCommands
- We also need to provide sample attribute file as described here: http://www.broadinstitute.org/software/igv/?q=SampleInformation
Download the batch script and the attribute file for our dataset:
- Batch script: Run_batch_IGV_snapshots.txt
- Attribute file: Igv_HCC1143_attributes.txt
Now run the file from the Tools
menu:
Automation
Notes:
- This script will navigate automatically to each location in the lab
- A screenshot will be taken and saved to the screenshots directory specified
Malachi Griffith, Sorana Morrissy, Jim Robinson, Ben Ainscough, Jason Walker, Obi Griffith
Previous Section | This Section | Next Section |
---|---|---|
Alignment | IGV | Alignment Visualization |
NOTICE: This resource has been moved to rnabio.org. The version here will be maintained for legacy use only. All future development and maintenance will occur only at rnabio.org. Please proceed to rnabio.org for the current version of this course.
Table of Contents
Module 0: Authors | Citation | Syntax | Intro to AWS | Log into AWS | Unix | Environment | Resources
Module 1: Installation | Reference Genomes | Annotations | Indexing | Data | Data QC
Module 2: Adapter Trim | Alignment | IGV | Alignment Visualization | Alignment QC
Module 3: Expression | Differential Expression | DE Visualization
Module 4: Alignment Free - Kallisto
Module 5: Ref Guided | De novo | Merging | Differential Splicing | Splicing Visualization
Module 6: Trinity
Module 7: Trinotate
Appendix: Saving Results | Abbreviations | Lectures | Practical Exercise Solutions | Integrated Assignment | Proposed Improvements | AWS Setup