-
Notifications
You must be signed in to change notification settings - Fork 11
/
matching.R
265 lines (253 loc) · 10.1 KB
/
matching.R
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
#' @title Relaxed Value Matching
#'
#' @description
#' These functions offer relaxed matching of one vector in another.
#' In contrast to the similar [`match()`] and [`%in%`] functions they
#' just accept `numeric` arguments but have an additional `tolerance`
#' argument that allows relaxed matching.
#'
#' @param x `numeric`, the values to be matched. In contrast to
#' [`match()`] `x` has to be sorted in increasing order and must not contain any
#' `NA`.
#' @param table `numeric`, the values to be matched against. In contrast to
#' [`match()`] `table` has to be sorted in increasing order and must not contain
#' any `NA`.
#' @param tolerance `numeric`, accepted tolerance. Could be of length one or
#' the same length as `x`.
#' @param ppm `numeric(1)` representing a relative, value-specific
#' parts-per-million (PPM) tolerance that is added to `tolerance`.
#' @param duplicates `character(1)`, how to handle duplicated matches. Has to be
#' one of `c("keep", "closest", "remove")`. No abbreviations allowed.
#' @param nomatch `integer(1)`, if the difference
#' between the value in `x` and `table` is larger than
#' `tolerance` `nomatch` is returned.
#' @param .check `logical(1)` turn off checks for increasingly sorted `x`
#' and `table`. It also disables most other input validation checks.
#' This should just be done if it is ensured by other methods
#' that `x` and `table` are sorted, see details.
#'
#' @details
#' For `closest`/`common` the `tolerance` argument could be set to `0` to get
#' the same results as for [`match()`]/[`%in%`]. If it is set to `Inf` (default)
#' the index of the closest values is returned without any restriction.
#'
#' It is not guaranteed that there is a one-to-one matching for neither the
#' `x` to `table` nor the `table` to `x` matching.
#'
#' If multiple elements in `x` match a single element in `table` all their
#' corresponding indices are returned if `duplicates="keep"` is set (default).
#' This behaviour is identical to [`match()`]. For `duplicates="closest"` just
#' the closest element in `x` gets the corresponding index in `table` and
#' for `duplicates="remove"` all elements in `x` that match to the same element
#' in `table` are set to `nomatch`.
#'
#' If a single element in `x` matches multiple elements in `table` the *closest*
#' is returned for `duplicates="keep"` or `duplicates="closest"` (*keeping*
#' multiple matches isn't possible in this case because the return value should
#' be of the same length as `x`). If the differences between `x` and the
#' corresponding matches in `table` are identical the lower index (the smaller
#' element in `table`) is returned. There is one exception: if the lower index
#' is already returned for another `x` with a smaller difference to this
#' `index` the higher one is returned for `duplicates = "closer"`
#' (but only if there is no other `x` that is closer to the higher one).
#' For `duplicates="remove"` all multiple matches are returned as `nomatch` as
#' above.
#'
#' `.checks = TRUE` tests among other input validation checks for increasingly
#' sorted `x` and `table` arguments that are mandatory assumptions for the
#' `closest` algorithm. These checks require to loop through both vectors and
#' compare each element against its precursor.
#' Depending on the length and distribution of `x` and `table` these checks take
#' equal/more time than the whole `closest` algorithm. If it is ensured by other
#' methods that both arguments `x` and `table` are sorted the tests could be
#' skipped by `.check = FALSE`. In the case that `.check = FALSE` is used
#' and one of `x` and `table` is not sorted (or decreasingly sorted)
#' the output would be incorrect in the best case and result in infinity
#' loop in the average and worst case.
#'
#' @return `closest` returns an `integer` vector of the same length as `x`
#' giving the closest position in `table` of the first match or `nomatch` if
#' there is no match.
#'
#' @rdname matching
#' @author Sebastian Gibb, Johannes Rainer
#' @seealso [`match()`]
#' @aliases closest
#' @family grouping/matching functions
#' @useDynLib MsCoreUtils, .registration = TRUE
#' @export
#' @examples
#' ## Define two vectors to match
#' x <- c(1, 3, 5)
#' y <- 1:10
#'
#' ## Compare match and closest
#' match(x, y)
#' closest(x, y)
#'
#' ## If there is no exact match
#' x <- x + 0.1
#' match(x, y) # no match
#' closest(x, y)
#'
#' ## Some new values
#' x <- c(1.11, 45.02, 556.45)
#' y <- c(3.01, 34.12, 45.021, 46.1, 556.449)
#'
#' ## Using a single tolerance value
#' closest(x, y, tolerance = 0.01)
#'
#' ## Using a value-specific tolerance accepting differences of 20 ppm
#' closest(x, y, ppm = 20)
#'
#' ## Same using 50 ppm
#' closest(x, y, ppm = 50)
#'
#' ## Sometimes multiple elements in `x` match to `table`
#' x <- c(1.6, 1.75, 1.8)
#' y <- 1:2
#' closest(x, y, tolerance = 0.5)
#' closest(x, y, tolerance = 0.5, duplicates = "closest")
#' closest(x, y, tolerance = 0.5, duplicates = "remove")
closest <- function(x, table, tolerance = Inf, ppm = 0,
duplicates = c("keep", "closest", "remove"),
nomatch = NA_integer_, .check = TRUE) {
if (.check) {
ntolerance <- length(tolerance)
if (ntolerance != 1L && ntolerance != length(x))
stop("'tolerance' has to be of length 1 or equal to 'length(x)'")
if (!is.numeric(tolerance) || any(tolerance < 0))
stop("'tolerance' has to be a 'numeric' larger or equal zero.")
if(!is.numeric(ppm) || any(ppm < 0))
stop("'ppm' has to be a 'numeric' larger or equal zero.")
if (!is.numeric(nomatch) || length(nomatch) != 1L)
stop("'nomatch' has to be a 'numeric' of length one.")
if (!identical(FALSE, is.unsorted(x)) ||
!identical(FALSE, is.unsorted(table))) {
stop("'x' and 'table' have to be sorted non-decreasingly and ",
"must not contain NA.")
}
}
if (!length(table))
return(rep_len(nomatch, length(x)))
tolerance <- tolerance + ppm(x, ppm) + sqrt(.Machine$double.eps)
switch(duplicates[1L],
"keep" = .Call(
C_closest_dup_keep,
as.double(x), as.double(table),
as.double(tolerance),
as.integer(nomatch)
),
"closest" = .Call(
C_closest_dup_closest,
as.double(x), as.double(table),
as.double(tolerance),
as.integer(nomatch)
),
"remove" = .Call(
C_closest_dup_remove,
as.double(x), as.double(table),
as.double(tolerance),
as.integer(nomatch)
),
stop("'duplicates' has to be one of \"keep\", \"closest\" ",
"or \"remove\".")
)
}
#' @rdname matching
#'
#' @return `common` returns a `logical` vector of length `x` that is `TRUE` if the
#' element in `x` was found in `table`. It is similar to [`%in%`].
#' @aliases common
#' @seealso [`%in%`]
#' @export
#' @examples
#'
#' ## Are there any common values?
#' x <- c(1.6, 1.75, 1.8)
#' y <- 1:2
#' common(x, y, tolerance = 0.5)
#' common(x, y, tolerance = 0.5, duplicates = "closest")
#' common(x, y, tolerance = 0.5, duplicates = "remove")
common <- function(x, table, tolerance = Inf, ppm = 0,
duplicates = c("keep", "closest", "remove"), .check = TRUE) {
!is.na(closest(x, table, tolerance = tolerance, ppm = ppm,
duplicates = duplicates, .check = .check))
}
#' @rdname matching
#'
#' @details
#' `join`: joins two `numeric` vectors by mapping values in `x` with
#' values in `y` and *vice versa* if they are similar enough (provided the
#' `tolerance` and `ppm` specified). The function returns a `matrix` with the
#' indices of mapped values in `x` and `y`. Parameter `type` allows to define
#' how the vectors will be joined: `type = "left"`: values in `x` will be
#' mapped to values in `y`, elements in `y` not matching any value in `x` will
#' be discarded. `type = "right"`: same as `type = "left"` but for `y`.
#' `type = "outer"`: return matches for all values in `x` and in `y`.
#' `type = "inner"`: report only indices of values that could be mapped.
#'
#' @param y `numeric`, the values to be joined. Should be sorted.
#' @param type `character(1)`, defines how `x` and `y` should be joined. See
#' details for `join`.
#' @param .check `logical(1)` turn off checks for increasingly sorted `x` and
#' `y`. This should just be done if it is ensured by other methods that `x` and
#' `y` are sorted, see also [`closest()`].
#' @param ... ignored.
#'
#' @note `join` is based on `closest(x, y, tolerance, duplicates = "closest")`.
#' That means for multiple matches just the closest one is reported.
#'
#' @return `join` returns a `matrix` with two columns, namely `x` and `y`,
#' representing the index of the values in `x` matching the corresponding value
#' in `y` (or `NA` if the value does not match).
#'
#' @export
#' @examples
#'
#' ## Join two vectors
#' x <- c(1, 2, 3, 6)
#' y <- c(3, 4, 5, 6, 7)
#'
#' jo <- join(x, y, type = "outer")
#' jo
#' x[jo$x]
#' y[jo$y]
#'
#' jl <- join(x, y, type = "left")
#' jl
#' x[jl$x]
#' y[jl$y]
#'
#' jr <- join(x, y, type = "right")
#' jr
#' x[jr$x]
#' y[jr$y]
#'
#' ji <- join(x, y, type = "inner")
#' ji
#' x[ji$x]
#' y[ji$y]
join <- function(x, y, tolerance = 0, ppm = 0,
type = c("outer", "left", "right", "inner"), .check = TRUE,
...) {
if (is.integer(x))
x <- as.numeric(x)
if (is.integer(y))
y <- as.numeric(y)
if (.check && (
!identical(FALSE, is.unsorted(x)) ||
!identical(FALSE, is.unsorted(y)))) {
stop("'x' and 'y' have to be sorted non-decreasingly and must not ",
" contain NA.")
}
tolerance <- tolerance + ppm(x, ppm = ppm) + sqrt(.Machine$double.eps)
switch(type[1L],
"outer" = .Call(C_join_outer, x, y, tolerance, NA_integer_),
"left" = .Call(C_join_left, x, y, tolerance, NA_integer_),
"right" = .Call(C_join_right, x, y, tolerance, NA_integer_),
"inner" = .Call(C_join_inner, x, y, tolerance, NA_integer_),
stop("'type' has to be one of \"outer\", \"left\", \"right\", or ",
"\"inner\"")
)
}