Sequencing of new genomes has become commonplace. In this episode of SOC, Sofia Robb will discuss available open source methods for sharing genome-scale data if it is not feasible to share it through standard databases such as Ensembl. She will also demonstrate helpful ways to mine Ensembl data with Biomart and useful UNIX command line tricks for sorting, searching, and reformatting text data files.
Genome Tools Talk
Career Path Talk
Live Action Role Playing
You work with chickens and have completed an RNA-seq experiment. You have two conditions:
- condition 1: g1 = 'h3.3a-/-, h3.3b-/-'
- condition 2: g2 = 'wild type genotype'
You performed differential expression analysis, perhaps with cuffdiff.
Part 1: You are going to download your data.
Part 2: You are going to create up- and down-regulated gene lists.
Part 3: You are going to find out more information about what genes are in your gene lists.
Part 4: You are going to search your gene list for genes involved in your favorite processes.
Let's get the expression data from the EBI Expression Atlas
Part 1 Tasks:
- Go to EBI Expression Atlas
- Select chicken
- Check the box to download the first experiment, "RNA-seq of H3.3 knockout and wild type chicken DT40 cells". If you get lost, directly download here
- Click the download link at the top of the last column.
- Navigate to the E-MTAB-2754 directory
- Check out the contents of E-MTAB-2754-analytics.tsv
Contents of E-MTAB-2754-analytics.tsv:
$ head E-MTAB-2754-analytics.tsv
Gene ID Gene Name g1_g2.p-value g1_g2.log2foldchange
ENSGALG00000000003 PANX2 0.100242375805959 -0.4
ENSGALG00000000011 C10orf88 0.0802046773105167 0.2
ENSGALG00000000038 CTRB2 NA 0.2
ENSGALG00000000044 WFIKKN1 NA 0
ENSGALG00000000048 0.288103121752422 0.4
ENSGALG00000000055 LAMTOR3 0.529728058895927 0.1
ENSGALG00000000059 TUBB3 0.228430079834946 -0.2
ENSGALG00000000067 SPR 0.0560358954256604 -0.4
ENSGALG00000000071 0.878861305389193 0
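Before filtering, it can help to count what is in the file. The sketch below is one way to do that. It assumes the file is tab-separated with the four columns shown above, and it runs on a small stand-in file (the hypothetical name `toy-analytics.tsv`, built from three of the rows above) so that it is self-contained. Point it at E-MTAB-2754-analytics.tsv to check the real data.

```shell
# Stand-in for E-MTAB-2754-analytics.tsv: header plus three rows copied
# from the head output above (assumption: columns are tab-separated).
printf 'Gene ID\tGene Name\tg1_g2.p-value\tg1_g2.log2foldchange\n' > toy-analytics.tsv
printf 'ENSGALG00000000003\tPANX2\t0.100242375805959\t-0.4\n' >> toy-analytics.tsv
printf 'ENSGALG00000000038\tCTRB2\tNA\t0.2\n' >> toy-analytics.tsv
printf 'ENSGALG00000000048\t\t0.288103121752422\t0.4\n' >> toy-analytics.tsv

# Number of data rows (every line except the header).
n_rows=$(awk 'END { print NR - 1 }' toy-analytics.tsv)

# Rows whose p-value (column 3) is NA.
n_na=$(awk -F'\t' 'NR > 1 && $3 == "NA" { c++ } END { print c + 0 }' toy-analytics.tsv)

# Rows with an empty gene name (column 2).
n_unnamed=$(awk -F'\t' 'NR > 1 && $2 == "" { c++ } END { print c + 0 }' toy-analytics.tsv)

echo "rows=$n_rows NA_pvalues=$n_na unnamed=$n_unnamed"
```

Rows with an NA p-value cannot pass a significance cutoff, so it is worth knowing how many there are before Part 2.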
What do people usually want to do with differential expression data?
They usually want to find the top up-regulated genes and the top down-regulated genes.
Let's do it!!
Where do we start?
Part 2 Tasks:
- We want to make sure we are only looking at data points that are statistically significant (p-value < 0.001).
a. Sort the expression file by p-value
b. Keep only the lines that have a p-value < 0.001
- Now let's find our most up- and down-regulated genes, which means we need to sort by the log2foldchange column (4th column).
a. Sort the file by log2foldchange
b. Get the top 100 up- and down-regulated genes
c. Get a list of all the genes with the most significant changes
d. Do it a different way
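One possible way through the Part 2 steps is sketched below. It is not the only solution. It assumes a tab-separated file with the p-value in column 3 and the log2foldchange in column 4, and it treats a p-value below 0.001 as significant. The file `toy-de.tsv` and its `toyG` rows are made up so that the example runs on its own; `N=2` stands in for the 100 used in the task.

```shell
export LC_ALL=C        # predictable sort order and decimal points
TAB=$(printf '\t')     # literal tab, used as the sort field separator
N=2                    # use N=100 on the real file

# Made-up rows in the same four-column, tab-separated layout.
printf '%s\t%s\t%s\t%s\n' \
  'Gene ID' 'Gene Name' 'g1_g2.p-value' 'g1_g2.log2foldchange' \
  toyG1 GENE1 0.0001 2.5 \
  toyG2 GENE2 0.5 3.0 \
  toyG3 GENE3 NA 1.0 \
  toyG4 GENE4 2e-05 -3.1 \
  toyG5 GENE5 0.0009 0.7 \
  toyG6 GENE6 0.0004 -1.2 \
  toyG7 GENE7 0.02 -4.0 > toy-de.tsv

# 1a. Drop the header, then sort by p-value (column 3), smallest first.
#     -g is a general numeric sort that understands values like 2e-05.
tail -n +2 toy-de.tsv | sort -t "$TAB" -k3,3g > by_pvalue.tsv

# 1b. Keep rows with a numeric p-value below 0.001 (NA rows are dropped).
awk -F'\t' '$3 != "NA" && $3 + 0 < 0.001' by_pvalue.tsv > significant.tsv

# 2a + 2b. Sort by log2foldchange (column 4) and take N rows from each end.
sort -t "$TAB" -k4,4gr significant.tsv | head -n "$N" > up.tsv    # largest first
sort -t "$TAB" -k4,4g significant.tsv | head -n "$N" > down.tsv   # smallest first

# 2c. One combined list of gene IDs (column 1), duplicates removed.
cut -f1 up.tsv down.tsv | sort -u > changed_ids.txt

# 2d. A different way: filter on p-value and sign in a single awk pass.
awk -F'\t' 'NR > 1 && $3 != "NA" && $3 + 0 < 0.001 && $4 + 0 > 0' toy-de.tsv |
  sort -t "$TAB" -k4,4gr | head -n "$N" > up_alt.tsv
```

Note that `head -n "$N"` on a sorted file simply takes the first N rows. If fewer than N significant genes have a positive log2foldchange, the "up" list from step 2b will also contain negative values, which the sign check in step 2d avoids.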
Now what are these genes?
We are going to mine gene info data from Ensembl BioMart. BioMart is a SUPER handy tool (if your organism is in Ensembl).
Ensembl has 6 different sites for different groups of organisms:
Ensembl (vertebrates)
Ensembl Plants
Ensembl Fungi
Ensembl Bacteria
Ensembl Protists
Ensembl Metazoa
Let's find out more about chicken genes using Ensembl's BioMart tool.
Part 3 Tasks:
- Retrieve the gene ID, gene name, and gene description, plus the InterProScan ID, short description, and description, for every chicken gene. Need Help?
- Find the gene information about our upregulated genes.
- Find the gene information about our downregulated genes.
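Once the BioMart table is downloaded, looking up a gene list in it is a small command line job. The sketch below shows one approach with `awk`. It assumes the export is tab-separated with the gene ID in the first column. The files `up_ids.txt` and `toy-mart.tsv`, and every value in them, are made up for illustration; substitute your own gene list and BioMart export.

```shell
# Made-up gene IDs standing in for the up-regulated list.
printf '%s\n' toyG1 toyG5 > up_ids.txt

# Made-up BioMart-style export: one row per gene and InterPro entry,
# so a gene with two domains appears on two rows.
printf '%s\t%s\t%s\t%s\n' \
  'Gene stable ID' 'Gene name' 'Gene description' 'Interpro ID' \
  toyG1 GENE1 'toy description 1' TOYIPR1 \
  toyG1 GENE1 'toy description 1' TOYIPR2 \
  toyG2 GENE2 'toy description 2' TOYIPR3 \
  toyG5 GENE5 'toy description 5' '' > toy-mart.tsv

# First file: remember every wanted ID.
# Second file: print the header plus each row whose first column is wanted.
awk -F'\t' 'NR == FNR { want[$1]; next } FNR == 1 || ($1 in want)' \
  up_ids.txt toy-mart.tsv > up_info.tsv

cat up_info.tsv
```

Matching on the whole first column avoids the partial matches that a plain `grep` for an ID could pick up elsewhere on a line. The same command works for the downregulated list by swapping the ID file.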
Are any of our up- or down-regulated genes involved in a process you are super interested in?
Of our most significant up- and down-regulated genes, are any involved in stem cell proliferation (GO:0072089) or pigmentation (GO:0043473)?
Part 4 Tasks:
- Get a list of genes involved in stem cell proliferation (GO:0072089). Need Help?
- Are any of our up-regulated genes also in our list of genes involved in stem cell proliferation? Need Help?
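Comparing two gene lists comes down to finding the lines they share. A minimal sketch with `sort` and `comm` follows. It assumes each file holds one gene ID per line. The file names and the `toyG` IDs are hypothetical placeholders for your up-regulated list and your GO:0072089 list.

```shell
export LC_ALL=C   # comm expects both inputs sorted with the same collation

# Made-up ID lists, one gene ID per line.
printf '%s\n' toyG5 toyG1 toyG9 > up_ids.txt
printf '%s\n' toyG9 toyG3 toyG1 > go_0072089_ids.txt

# comm needs sorted input; -u also removes duplicate IDs.
sort -u up_ids.txt > up_ids.sorted.txt
sort -u go_0072089_ids.txt > go_0072089_ids.sorted.txt

# -1 hides lines found only in the first file, -2 hides lines found only
# in the second, so only the shared gene IDs are printed.
comm -12 up_ids.sorted.txt go_0072089_ids.sorted.txt > up_in_go_0072089.txt

cat up_in_go_0072089.txt
```

An empty output file means no overlap. For unsorted lists, `grep -Fx -f go_0072089_ids.txt up_ids.txt` is another option that gives the same set of shared IDs.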