Skip to content

Latest commit

 

History

History
211 lines (163 loc) · 5.75 KB

2. CpG islands Assessment.md

File metadata and controls

211 lines (163 loc) · 5.75 KB
title author date output
CpG islands Assessment
Quynh Nhu Nguyen
March 4th 2023
html_document

Question 1:

In the video we briefly described an algorithm that defines CpG islands. This algorithm has been used to create a list that is available in most genomic annotation data bases.

The Bioconductor package AnnotationHub package permits us to explore available annotations. By typing the following commands we can see what is available:

BiocManager::install("AnnotationHub")
library(AnnotationHub)
ah = AnnotationHub()
head(ah)

We can then subset these to just the databases related to the hg19 genome:

ah = subset(ah,ah$genome=="hg19")

We can then use the query function to search the available annotations in this "hub". For example:

query(ah,"genes")

On the left you see the record IDs for annotationHub.

What is the record ID used by AnnotationHub for hg19 CpG Islands? Hint use query on the object ah we created above

query(ah, "cpg Islands")
##or
query(ah,"cpg")
## or 
query(ah,"Islands")

Question 2:

Use AnnotationHub to create an R object representing CpG Islands:

library(AnnotationHub)
ah = AnnotationHub()
cgi = ah[["AH5086"]]

What is the class of the object cgi (hint: use class function)?

class(cgi)

Question 3:

How many CpG islands are represented in the object cgi defined in the previous question?

length(cgi)

Question 4:

We can extract the sequence of each CpG Islands this way:

library(BSgenome.Hsapiens.UCSC.hg19)
cgiseq= getSeq(Hsapiens,cgi)

Note that it is indispensable that we assure the same genome builds are being used here:

genome(cgi)[1:24]
genome(Hsapiens)[1:24]

We will use this to determine the observed to expected ratio described in the video.

Compute the proportion of Cs for each island and report the median of these proportions (hint: use the alphabetFrequency function).

res = alphabetFrequency(cgiseq)
L = rowSums(res)
cprop = res[,"C"]/L
median(cprop)
# or
median(letterFrequency(cgiseq, "C", as.prob=TRUE))
# Make a histogram to see they are generally high
hist(cprop)

Question 5:

Compute the proportion of Gs for each island and report the median of these proportions.

res = alphabetFrequency(cgiseq)
L = rowSums(res)
gprop = res[,"G"]/L
median(gprop)
# or
median(letterFrequency(cgiseq, "G", as.prob=TRUE))
# Make a histogram to see they are generally high
hist(gprop)

Question 6:

Now that for each CpG island we have the proportion of Cs $p_C$ and the proportion of Gs $p_G$ we can compute the proportion of CG we expect to see by chance if all dinucleotides have the same probability of appearing. To a close approximation this expected proportion is simply $p_C\times p_G$.

The number of CpGs that we expect to see in a genomic interval of size $L$ is then $L\times p_C\times p_G$. Once we have this expectation we can compute the observed to expected ratio.

Compute the expected number of CpGs in each CpG island using the formula. For each island divide the observed number of CpGs by the expected number of CpGs.

Report the median of these observed to expected ratios (hint: use the vcountPattern function to get the number of CpGs in each island).

res = alphabetFrequency(cgiseq)
L = rowSums(res)
cprop = res[,"C"]/L
gprop = res[,"G"]/L
expected=L*cprop*gprop
observed=vcountPattern("CG",cgiseq)
cpgoe=observed/expected
median(cpgoe)
## We can look at a histogram
hist(cpgoe)
abline(v=1,col=2)
##because these are ratios, better to look at log
hist(log2 ( cpgoe ))
abline(v=0,col=2)

We see that the ratio is less than one for most islands

mean(cpgoe[log2(cpgoe)<0])

Question 7:

Repeat the entire exercise for GC instead of CG. What is the median observed to expected ratio?

res = alphabetFrequency(cgiseq)
L = rowSums(res)
cprop = res[,"C"]/L
gprop = res[,"G"]/L
expected=L*cprop*gprop
observed=vcountPattern("GC",cgiseq)
gpcoe=observed/expected
median(gpcoe)
## We can look at a histogram
##because these are ratios, better to look at log
hist(log2 ( gpcoe ))
abline(v=0,col=2)
mean(gpcoe[log2(gpcoe)<0])

Compare CpG aand GpC ratios with a box plot

boxplot(list(cpgoe,gpcoe))

Question 8:

Note that the CpG observed to expected ratio is below 1 and that few islands actually surpass a ratio of 1 or more. However, for the rest of the genome the observed to expected ratio is substantially smaller. To look at regions that are not islands let's shift the islands we have by 20,000.

To avoid problems with short sequences, we will restrict our analysis to the mapped chromosomes:

chr2use = seqlevels(cgi)[1:24]
index = which( seqnames(cgi) %in% chr2use)

And define the non CpG islands by shifting the known ones by 20K.

noncgi = shift(cgi[index],20000)

Some of these regions contain repeats or are unmapped so we remove regions that have 0 Cs or 0 Gs:

library(BSgenome.Hsapiens.UCSC.hg19)
noncgiseq= getSeq(Hsapiens,noncgi)
nullres = alphabetFrequency(noncgiseq)
keepIndex=nullres[,"G"]>0 &  nullres[,"C"]>0 & nullres[,"N"]==0
nullres = nullres[keepIndex,]
noncgiseq=noncgiseq[keepIndex]

Use nullres and noncgiseq defined above to compute the expected number of CpGs we should see in each of the regions. Report the median observed to expected ratio for these regions?

L2 = rowSums(nullres)
cprop2 = nullres[,"C"]/L2
gprop2 = nullres[,"G"]/L2
expected2=L2*cprop2*gprop2
observed2=vcountPattern("CG",noncgiseq)
noncgioe=observed2/expected2
median(noncgioe)

We can compare them all

boxplot(gpcoe,noncgioe,cpgoe)