-
Notifications
You must be signed in to change notification settings - Fork 8
/
Copy pathsnpClust.Rmd
153 lines (112 loc) · 4.58 KB
/
snpClust.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
---
title: "Inferring Linkage Disequilibrium blocks from genotypes"
author: "Shubham Chaturvedi, Pierre Neuvial, Nathalie Vialaneix"
date: "`r Sys.Date()`"
output:
html_document:
toc: yes
vignette: >
%\VignetteIndexEntry{Inferring Linkage Disequilibrium blocks from genotypes}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r skipNoSNPSTATS}
# IMPORTANT: this vignette is not created if snpStats is not installed
if (!require("snpStats")) {
knitr::opts_chunk$set(eval = FALSE)
}
```
# Introduction
In this vignette we demonstrate the use of `snpClust` function in the `adjclust`
package. `snpClust` performs adjacency-constrained hierarchical clustering of
single nucleotide polymorphisms (SNPs), where the similarity between SNPs is
defined by linkage disequilibrium (LD).
This function implements the algorithm described in [1]. It is an extension of
the algorithm described in [3,4]. Denoting by $p$ the number of SNPs to cluster
and assuming that the similarity between SNPs whose indices are more distant
than $h$, its time complexity is $O(p (\log(p) + h))$, and its space complexity
is $O(hp)$.
```{r loadLib, message=FALSE}
library("adjclust")
```
# Loading and displaying genotype data
The beginning of this vignette closely follows the "LD vignette" of the SnpStats
package [2]. First, we load genotype data.
```{r loadData, results="hide", message=FALSE}
data("ld.example", package = "snpStats")
```
We focus on the `ceph.1mb` data.
```{r preData}
geno <- ceph.1mb[, -316] ## drop one SNP leading to one missing LD value
p <- ncol(geno)
nSamples <- nrow(geno)
geno
```
These data are drawn from the International HapMap Project and concern 602
SNPs[^1] over a 1Mb region of chromosome 22 in sample of 90 Europeans.
We can compute and display the LD between these SNPs.
[^1]: We have dropped SNP rs2401075 because it produced a missing value due to
the lack of genetic diversity in the considered sample.
```{r LD}
ld.ceph <- snpStats::ld(geno, stats = "R.squared", depth = p-1)
image(ld.ceph, lwd = 0)
```
# Adjacency-constrained Hierarchical Agglomerative Clustering
The `snpClust` function can handle genotype data as an input:
```{r snpClust}
fit <- snpClust(geno, stats = "R.squared")
```
Note that due to numerical errors in the LD estimation, some of the estimated LD values may be slightly larger than 1. These values are rounded to 1 internally.
The above figure suggests that the LD signal is concentrated close to the
diagonal. We can focus on a diagonal band with the bandwidth parameter `h`:
```{r snpClust-sparse}
fitH <- snpClust(geno, h = 100, stats = "R.squared")
fitH
```
# Output
The output of the `snpClust` is of class `chac`. In particular, it can be
plotted as a dendrogram silently relying on the function `plot.dendrogram`:
```{r dendro}
plot(fitH, type = "rectangle", leaflab = "perpendicular")
```
Moreover, the output contains an element named `merge` which describes the
successive merges of the clustering, and an element `gains` which gives the
improvement in the criterion optimized by the clustering at each successive
merge.
```{r objectDesc}
head(cbind(fitH$merge, fitH$gains))
```
# Other types of input
In this section we show how the `snpClust` function can also be applied directly
to LD values.
```{r snpClust-LD}
h <- 100
ld.ceph <- snpStats::ld(geno, stats = "R.squared", depth = h, symmetric = TRUE)
image(ld.ceph, lwd = 0)
```
Note that we have forced the `snpStats::ld` function to return a symmetric matrix.
We can apply `snpClust` directly to this LD matrix (of class `Matrix::dsCMatrix`):
```{r snpClust-sMatrix}
fitL <- snpClust(ld.ceph, h)
```
`snpClust` also handles inputs of class `base::matrix`:
```{r snpClust-matrix, warning=FALSE}
gmat <- as(geno, "matrix")
fitM <- snpClust(geno, h, stats = "R.squared")
```
# References
[1] Ambroise C., Dehman A., Neuvial P., Rigaill G., and Vialaneix N. (2019).
Adjacency-constrained hierarchical clustering of a band similarity matrix with
application to genomics. *Algorithms for Molecular Biology*, **14**, 22.
[2] Clayton D. (2015). snpStats: SnpMatrix and XSnpMatrix classes and methods.
R package version 1.20.0
[3] Dehman A., Ambroise C., Neuvial P. (2015). Performance of a blockwise
approach in variable selection using linkage disequilibrium information. *BMC
Bioinformatics*, **16**, 148.
[4] Randriamihamison N., Vialaneix N., and Neuvial P. (2021). Applicability and
interpretability of Ward's hierarchical agglomerative clustering with or without
contiguity constraints. *Journal of Classification*, **38**, 363–389.
# Session information
```{r session}
sessionInfo()
```