-
Notifications
You must be signed in to change notification settings - Fork 2
/
Copy pathREADME.Rmd
137 lines (101 loc) · 3.36 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
---
output:
md_document:
variant: markdown_github
---
```{r, echo = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "README-",
cache = TRUE
)
```
<!-- badges: start -->
[![CRAN status](https://www.r-pkg.org/badges/version/tgstat)](https://CRAN.R-project.org/package=tgstat)
[![R-CMD-check](https://github.com/tanaylab/tgstat/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/tanaylab/tgstat/actions/workflows/R-CMD-check.yaml)
<!-- badges: end -->
# tgstat
The goal of `tgstat` is to provide fast and efficient implementation of certain R functions
such as 'cor' and 'dist', along with specific statistical tools.
Various approaches are used to boost the performance, including multi-processing and use of
optimized functions provided by the Basic Linear Algebra Subprograms (BLAS) library.
## Installation
Install from CRAN:
```{r, eval = FALSE}
install.packages("tgstat")
```
For the development version:
```{r, eval = FALSE}
remotes::install_github("tanaylab/tgstat")
```
## Examples
```{r, eval=TRUE}
library(tgstat)
set.seed(seed = 60427)
rows <- 3000
cols <- 3000
vals <- sample(1:(rows * cols / 2), rows * cols, replace = T)
m <- matrix(vals, nrow = rows, ncol = cols)
m_with_NAs <- m
m_with_NAs[sample(1:(rows * cols), rows * cols / 10)] <- NA
dim(m)
```
### Fast computation of correlation matrices
Pearson correlation without BLAS, no NAs:
```{r, eval = TRUE}
options(tgs_use.blas = F)
system.time(tgs_cor(m))
```
Same with BLAS:
```{r, eval = TRUE}
# tgs_cor, with BLAS, no NAs, pearson
options(tgs_use.blas = T)
system.time(tgs_cor(m))
```
Base R version:
```{r, eval = TRUE}
system.time(cor(m))
```
Pearson correlation without BLAS, with NAs:
```{r, eval = TRUE}
options(tgs_use.blas = F)
system.time(tgs_cor(m_with_NAs, pairwise.complete.obs = T))
```
Same with BLAS:
```{r, eval = TRUE}
options(tgs_use.blas = T)
system.time(tgs_cor(m_with_NAs, pairwise.complete.obs = T))
```
Base R version:
```{r, eval = TRUE}
system.time(cor(m_with_NAs, use = "pairwise.complete.obs"))
```
### Fast computation of distance matrices
Distance without BLAS, no NAs:
```{r, eval = TRUE}
options(tgs_use.blas = F)
system.time(tgs_dist(m))
```
Same with BLAS:
```{r, eval = TRUE}
options(tgs_use.blas = T)
system.time(tgs_dist(m))
```
Base R:
```{r, eval = TRUE}
system.time(dist(m, method = "euclidean"))
```
## Notes regarding the usage of `BLAS`
`tgstat` runs best when R is linked with an optimized BLAS implementation.
Many optimized BLAS implementations are available, both proprietary (e.g. Intel's MKL,
Apple's vecLib) and opensource (e.g. OpenBLAS, ATLAS). Unfortunately, R often uses by default
the reference BLAS implementation, which is known to have poor performance.
Having `tgstat` rely on the reference BLAS will result in very poor performance and is strongly
discouraged. If your R implementation uses an optimized BLAS, set `options(tgs_use.blas=TRUE)`
to allow `tgstat` to make BLAS calls. Otherwise, set `options(tgs_use.blas=FALSE)` (default)
which instructs `tgstat` to avoid BLAS and instead rely only on its own optimization methods.
If in doubt, it is possible to run one of `tgstat` CPU intensive functions (e.g. `tgs_cor`) and
compare its run time under both `options(tgs_use.blas=FALSE)`.
Exact instructions for linking R with an optimized BLAS library are system dependent and are out
of scope of this document.