vignettes/snpClust.Rmd

---
title: "Inferring Linkage Disequilibrium blocks from genotypes"
author: "Shubham Chaturvedi, Pierre Neuvial, Nathalie Vialaneix"
date: "`r Sys.Date()`"
output: 
  html_document:
    toc: yes
vignette: >
  %\VignetteIndexEntry{Inferring Linkage Disequilibrium blocks from genotypes}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r skipNoSNPSTATS}
# IMPORTANT: this vignette is not created if snpStats is not installed
if (!require("snpStats")) {
  knitr::opts_chunk$set(eval = FALSE)
}
```

# Introduction

In this vignette we demonstrate the use of `snpClust` function in the `adjclust`
package. `snpClust` performs adjacency-constrained hierarchical clustering of 
single nucleotide polymorphisms (SNPs), where the similarity between SNPs is 
defined by linkage disequilibrium (LD).

This function implements the algorithm described in [1]. It is an extension of
the algorithm described in [3,4]. Denoting by $p$ the number of SNPs to cluster
and assuming that the similarity between SNPs whose indices are more distant
than $h$, its time complexity is $O(p (\log(p) + h))$, and its space complexity
is $O(hp)$.

```{r loadLib, message=FALSE}
library("adjclust")
```

# Loading and displaying genotype data

The beginning of this vignette closely follows the "LD vignette" of the SnpStats 
package [2]. First, we load genotype data.

```{r loadData, results="hide", message=FALSE}
data("ld.example", package = "snpStats")
```

We focus on the `ceph.1mb` data. 

```{r preData}
geno <- ceph.1mb[, -316]  ## drop one SNP leading to one missing LD value
p <- ncol(geno)
nSamples <- nrow(geno)
geno
```

These data are drawn from the International HapMap Project and concern 602 
SNPs[^1] over a 1Mb region of chromosome 22 in sample of 90 Europeans.

We can compute and display the LD between these SNPs.

[^1]: We have dropped SNP rs2401075 because it produced a missing value due to 
the lack of genetic diversity in the considered sample.

```{r LD}
ld.ceph <- snpStats::ld(geno, stats = "R.squared", depth = p-1)
image(ld.ceph, lwd = 0)
```

# Adjacency-constrained Hierarchical Agglomerative Clustering

The `snpClust` function can handle genotype data as an input:

```{r snpClust}
fit <- snpClust(geno, stats = "R.squared")
```

 
Note that due to numerical errors in the LD estimation, some of the estimated LD values may be slightly larger than 1. These values are rounded to 1 internally.

The above figure suggests that the LD signal is concentrated close to the 
diagonal. We can focus on a diagonal band with the bandwidth parameter `h`:

```{r snpClust-sparse}
fitH <- snpClust(geno, h = 100, stats = "R.squared")
fitH
```

# Output

The output of the `snpClust` is of class `chac`. In particular, it can be 
plotted as a dendrogram silently relying on the function `plot.dendrogram`:

```{r dendro}
plot(fitH, type = "rectangle", leaflab = "perpendicular")
```


Moreover, the output contains an element named `merge` which describes the 
successive merges of the clustering, and an element `gains` which gives the 
improvement in the criterion optimized by the clustering at each successive 
merge.

```{r objectDesc}
head(cbind(fitH$merge, fitH$gains))
```

# Other types of input

In this section we show how the `snpClust` function can also be applied directly
to LD values.

```{r snpClust-LD}
h <- 100
ld.ceph <- snpStats::ld(geno, stats = "R.squared", depth = h, symmetric = TRUE)
image(ld.ceph, lwd = 0)
```

Note that we have forced the `snpStats::ld` function to return a symmetric matrix. 
We can apply `snpClust` directly to this LD matrix (of class `Matrix::dsCMatrix`):

```{r snpClust-sMatrix}
fitL <- snpClust(ld.ceph, h)
```

`snpClust` also handles inputs of class `base::matrix`:

```{r snpClust-matrix, warning=FALSE}
gmat <- as(geno, "matrix")
fitM <- snpClust(geno, h, stats = "R.squared")
```

# References

[1] Ambroise C., Dehman A., Neuvial P., Rigaill G., and Vialaneix N. (2019). 
Adjacency-constrained hierarchical clustering of a band similarity matrix with 
application to genomics. *Algorithms for Molecular Biology*, **14**, 22.

[2] Clayton D. (2015). snpStats: SnpMatrix and XSnpMatrix classes and methods.
R package version 1.20.0

[3] Dehman A., Ambroise C., Neuvial P. (2015). Performance of a blockwise 
approach in variable selection using linkage disequilibrium information. *BMC 
Bioinformatics*, **16**, 148.

[4] Randriamihamison N., Vialaneix N., and Neuvial P. (2021). Applicability and
interpretability of Ward's hierarchical agglomerative clustering with or without
contiguity constraints. *Journal of Classification*, **38**, 363–389.

# Session information

```{r session}
sessionInfo()
```