title | author | date | output |
---|---|---|---|
CpG islands Assessment |
Quynh Nhu Nguyen |
March 4th 2023 |
html_document |
In the video we briefly described an algorithm that defines CpG islands. This algorithm has been used to create a list that is available in most genomic annotation data bases.
The Bioconductor package AnnotationHub package permits us to explore available annotations. By typing the following commands we can see what is available:
BiocManager::install("AnnotationHub")
library(AnnotationHub)
ah = AnnotationHub()
head(ah)
We can then subset these to just the databases related to the hg19 genome:
ah = subset(ah,ah$genome=="hg19")
We can then use the query function to search the available annotations in this "hub". For example:
query(ah,"genes")
On the left you see the record IDs for annotationHub.
What is the record ID used by AnnotationHub for hg19 CpG Islands? Hint use query on the object ah we created above
query(ah, "cpg Islands")
##or
query(ah,"cpg")
## or
query(ah,"Islands")
Use AnnotationHub to create an R object representing CpG Islands:
library(AnnotationHub)
ah = AnnotationHub()
cgi = ah[["AH5086"]]
What is the class of the object cgi (hint: use class function)?
class(cgi)
How many CpG islands are represented in the object cgi defined in the previous question?
length(cgi)
We can extract the sequence of each CpG Islands this way:
library(BSgenome.Hsapiens.UCSC.hg19)
cgiseq= getSeq(Hsapiens,cgi)
Note that it is indispensable that we assure the same genome builds are being used here:
genome(cgi)[1:24]
genome(Hsapiens)[1:24]
We will use this to determine the observed to expected ratio described in the video.
Compute the proportion of Cs for each island and report the median of these proportions (hint: use the alphabetFrequency function).
res = alphabetFrequency(cgiseq)
L = rowSums(res)
cprop = res[,"C"]/L
median(cprop)
# or
median(letterFrequency(cgiseq, "C", as.prob=TRUE))
# Make a histogram to see they are generally high
hist(cprop)
Compute the proportion of Gs for each island and report the median of these proportions.
res = alphabetFrequency(cgiseq)
L = rowSums(res)
gprop = res[,"G"]/L
median(gprop)
# or
median(letterFrequency(cgiseq, "G", as.prob=TRUE))
# Make a histogram to see they are generally high
hist(gprop)
Now that for each CpG island we have the proportion of Cs
The number of CpGs that we expect to see in a genomic interval of size
Compute the expected number of CpGs in each CpG island using the formula. For each island divide the observed number of CpGs by the expected number of CpGs.
Report the median of these observed to expected ratios (hint: use the vcountPattern function to get the number of CpGs in each island).
res = alphabetFrequency(cgiseq)
L = rowSums(res)
cprop = res[,"C"]/L
gprop = res[,"G"]/L
expected=L*cprop*gprop
observed=vcountPattern("CG",cgiseq)
cpgoe=observed/expected
median(cpgoe)
## We can look at a histogram
hist(cpgoe)
abline(v=1,col=2)
##because these are ratios, better to look at log
hist(log2 ( cpgoe ))
abline(v=0,col=2)
We see that the ratio is less than one for most islands
mean(cpgoe[log2(cpgoe)<0])
Repeat the entire exercise for GC instead of CG. What is the median observed to expected ratio?
res = alphabetFrequency(cgiseq)
L = rowSums(res)
cprop = res[,"C"]/L
gprop = res[,"G"]/L
expected=L*cprop*gprop
observed=vcountPattern("GC",cgiseq)
gpcoe=observed/expected
median(gpcoe)
## We can look at a histogram
##because these are ratios, better to look at log
hist(log2 ( gpcoe ))
abline(v=0,col=2)
mean(gpcoe[log2(gpcoe)<0])
Compare CpG aand GpC ratios with a box plot
boxplot(list(cpgoe,gpcoe))
Note that the CpG observed to expected ratio is below 1 and that few islands actually surpass a ratio of 1 or more. However, for the rest of the genome the observed to expected ratio is substantially smaller. To look at regions that are not islands let's shift the islands we have by 20,000.
To avoid problems with short sequences, we will restrict our analysis to the mapped chromosomes:
chr2use = seqlevels(cgi)[1:24]
index = which( seqnames(cgi) %in% chr2use)
And define the non CpG islands by shifting the known ones by 20K.
noncgi = shift(cgi[index],20000)
Some of these regions contain repeats or are unmapped so we remove regions that have 0 Cs or 0 Gs:
library(BSgenome.Hsapiens.UCSC.hg19)
noncgiseq= getSeq(Hsapiens,noncgi)
nullres = alphabetFrequency(noncgiseq)
keepIndex=nullres[,"G"]>0 & nullres[,"C"]>0 & nullres[,"N"]==0
nullres = nullres[keepIndex,]
noncgiseq=noncgiseq[keepIndex]
Use nullres and noncgiseq defined above to compute the expected number of CpGs we should see in each of the regions. Report the median observed to expected ratio for these regions?
L2 = rowSums(nullres)
cprop2 = nullres[,"C"]/L2
gprop2 = nullres[,"G"]/L2
expected2=L2*cprop2*gprop2
observed2=vcountPattern("CG",noncgiseq)
noncgioe=observed2/expected2
median(noncgioe)
We can compare them all
boxplot(gpcoe,noncgioe,cpgoe)