-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathhomework-2-KPDuBose.qmd
235 lines (181 loc) · 4.66 KB
/
homework-2-KPDuBose.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
---
title: "Assignment 2 - PHS 7045"
author: "Kline DuBose"
format: gfm
editor_options:
chunk_output_type: console
---
# Due Date
Tuesday, February 28
# Background
For this assignment, you'll be quested with speeding up some code using what you have learned about vectorization and Rcpp.
# Part 1: Vectorizing code
## Function 1
This function generates a `n x k` dataset with all its entries distributed Poisson with mean `lambda`.
```{r}
# function that was given
fun1 <- function(n = 100, k = 4, lambda = 4){
x <- NULL
for (i in 1:n) {
x <- rbind(x, rpois(k, lambda))
}
return(x)
}
# my function
fun1alt <- function(n = 100, k = 4, lambda = 4){
x <- matrix(data = rpois(n * k, lambda = lambda),
ncol = k)
return(x)
}
# benchmarking
bench::mark(
fun1(),
fun1alt(), relative = TRUE, check = FALSE
)
```
## Function 2
Like before, speed up the following functions (it is OK to use StackOverflow)
```{r}
# Total row sums
fun1 <- function(mat) {
n <- nrow(mat)
ans <- double(n)
for (i in 1:n) {
ans[i] <- sum(mat[i, ])
}
ans
}
fun1alt <- function(mat) {
rowSums(mat)
}
# Cumulative sum by row
fun2 <- function(mat) {
n <- nrow(mat)
k <- ncol(mat)
ans <- mat
for (i in 1:n) {
for (j in 2:k) {
ans[i,j] <- mat[i, j] + ans[i, j - 1]
}
}
ans
}
fun2alt <- function(mat) {
n <- nrow(mat)
ans <- mat
for (i in 1:n) {
ans[i,] <- cumsum(mat[i,])
}
ans
}
## Another function I'm trying to make faster
# fun2alt <- function(mat) {
# ans <- mat
# ans[] <- vapply(ans, cumsum, FUN.VALUE = 1)
# ans
# }
# Use the data with this code
set.seed(2315)
dat <- matrix(rnorm(200 * 100), nrow = 200)
# Test for the first
bench::mark(
fun1(dat),
fun1alt(dat), relative = TRUE
)
# Test for the second
bench::mark(
fun2(dat),
fun2alt(dat), relative = TRUE
)
```
## Function 3
Find the column max (hint: Check out the function `max.col()`)
```{r}
#| eval: true
# Data Generating Process (10 x 10,000 matrix)
set.seed(1234)
x <- matrix(rnorm(1e4), nrow = 10)
# Find each column's max value
fun2 <- function(x) {
apply(x, 2, max)
}
# My function
fun2alt <- function(x) {
x[cbind(max.col(t(x)), 1:ncol(x))]
}
# Benchmarking
bench::mark(
fun2(x),
fun2alt(x), relative = TRUE
)
```
# Part 2: Rcpp code
As we saw in the Rcpp week, vectorization may not be the best solution. For this
part, you must write a function using Rcpp that implements the propensity score
matching algorithm. You can use [Week 5's lab](https://github.com/UofUEpiBio/PHS7045-advanced-programming/issues/8#issuecomment-1424974938) as a starting point for the problem. Your C++ file
should look something like the following:
```{Rcpp}
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
List psmatch(
NumericVector pscores,
LogicalVector is_treated
)
{
/*... setup the problem creating the output...*/
int n = static_cast<int>(pscores.size());
IntegerVector indices(n);
NumericVector values(n);
LogicalVector treatment = is_treated;
NumericVector treated;
NumericVector untreated;
/*
... Implement your matching (start from Week 5's lab)...
... You have to consider that matches are done against groups, i.e.,
Treated (is_treated == true) must be matched to control
(is_treated == false)
*/
for (int i = 0; i < n; i++){ /* This matches someone in the treatment group to one person
in the control group*/
if (treatment[i]){
double cur_best = std::numeric_limits< double >::max();
auto & cur_i = indices[i];
for (int j = 0; j < n; j++){
if (!treatment[j]){
double d = std::abs(pscores[i] - pscores[j]);
if (d < cur_best){
cur_best = d;
cur_i = j;
}
if (d < values[j]) {
values[j] = d;
indices[j] = i;
}
}
}
treated.push_back(i);
untreated.push_back(cur_i);
}
}
NumericVector values1(treated.size());
NumericVector values2(untreated.size());
for (int i = 0; i < treated.size(); ++i)
values1[i] = pscores[treated[i]];
for (int i = 0; i < untreated.size(); ++i)
values2[i] = pscores[untreated[i]];
return List::create(/*Returns the ID's of the treated group, the matched control
group, and the p-values of each respective patient*/
_["match_treated_id"] = treated + 1, // We add one to match R's indices
_["match_untreated_id"] = untreated + 1,
_["match_pscore_treated"] = values1,
_["match_pscore_untreated"] = values2
);
}
```
```{r}
set.seed(123)
pscores <- runif(30)
is_treated <- sample(c(0,1), 30, replace = TRUE)
psmatch(pscores, is_treated)
```