-
Notifications
You must be signed in to change notification settings - Fork 23
/
Copy pathsession2b-interval-overlap.qmd
474 lines (350 loc) · 17.7 KB
/
session2b-interval-overlap.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
---
title: "2b: Determining Interval Overlap"
author: "Douglas Bates and Claudia Solis-Lemus"
jupyter: julia-1.8
---
## Load packages to be used
```{julia}
#| code-fold: show
using Arrow # Arrow storage and file format
using BenchmarkTools # tools for benchmarking code
using DataFrames # versatile tabular data format
using IntervalTrees # interval trees from BioJulia
using RangeTrees # a bespoke implementation of interval trees
using Tables # row- or column-oriented tabular data
using Base: intersect! # not exported from Base
datadir = joinpath(@__DIR__, "biofast-data-v1");
```
# Strategy for computing overlaps
- Split the data in the Arrow data tables into a dictionary where the keys are the chromosome tag and values are the `start-stop` pairs expressed as a `Vector{UnitRange}`. For the reference ranges, sort the vectors by increasing `first` value.
- For the reference ranges create two other dictionaries with values from [RangeTrees.jl](https://github.com/dmbates/RangeTrees.jl) and from [IntervalTrees.jl](https://github.com/BioJulia/IntervalTrees.jl), respectively.
- Benchmark the intersection with a target `UnitRange` from each of the representations of the reference ranges. Do this in two ways: `intersect`, which allocates the storage for the result, and `intersect!`, which modifies one of its arguments with the result.
- Compute the coverage from the vector of intersections.
- Apply the methods to the complete set of targets.
## Creating dictionaries of Vector{UnitRange}
- A `UnitRange`, such as `2:10`, includes the end points, accessed by `first` and `last` methods.
```{julia}
typeof(2:10), length(2:10), first(2:10), last(2:10)
```
- However, the positions in the `start` and `stop` columns in a `.bed` file are not both included in the range they represent. The positions correspond to base pairs in the range `start:(stop - 1)` as 0-based indices on the chromosome or `(start + 1):stop` as 1-based indices.
- It doesn't matter which one we use as long as we are consistent in creating ranges from the reference intervals and the target intervals.
- We choose to start counting from 1, just as the [world's foremost expert on counting](https://twitter.com/CountVonCount) does.
- We wrap this conversion in a utility function to help ensure consistency.
```{julia}
asrange(start, stop) = (start+one(start)):stop
```
::: {.callout-note collapse="true"}
### "One-liner" function (actually method) definitions
This method definition uses the compact "one-liner" form, like the math notation `f(x) = x + 1`.
:::
::: {.callout-note collapse="true"}
### `one(x)` versus literal `1`
`one(x)` is used instead of the literal `1` in `asrange` to preserve the integer type (see also `?oneunit`, which is slighly more general).
```{julia}
st = Int32(2314)
typeof(st + 1) # type gets promoted to Int64
```
```{julia}
typeof(st + one(st)) # type not promoted
```
:::
- Create a utility, `chromodict` to read an `Arrow.Table` and convert it to a `Dict{Symbol, Vector{UnitRange{T}}}`, assuming that the table contains columns named `:chromo`, `:start`, and `:stop`.
- We define two methods, one that takes a "row-table", which is a vector of named tuples, and one that takes the file name of the arrow file, reads the (column oriented)table, converts it to a row table, and then calls the first method.
```{julia}
function chromodict(rtbl::Vector{<:NamedTuple})
r1 = first(rtbl)
itype = promote_type(typeof(r1.start), typeof(r1.stop))
vtype = Vector{UnitRange{itype}}
dict = Dict{Symbol,vtype}()
for (; chromo, start, stop) in rtbl
push!(get!(dict, Symbol(chromo), vtype()), asrange(start, stop))
end
return dict
end
function chromodict(fnm::AbstractString)
return chromodict(rowtable(Arrow.Table(joinpath(datadir, fnm))))
end
tarrngvecs = chromodict("ex-rna.arrow")
refrngvecs = chromodict("ex-anno.arrow")
```
::: {.callout-note collapse="true"}
### Mutating `get!` for a `Dict`
The call `get!(dict, Symbol(chromo), vtype())` in `chromodict` returns `dict[Symbol(chromo)]` or the default value, which is an empty `Vector{UnitRange{T}}`.
For the case of the default, it also installs that key/value pair in `dict`.
:::
::: {.callout-note collapse="true"}
### `Symbol`s versus `String`s for `Dict` keys
We use `Symbol`s for the keys in these `Dict`s because they are easier to type and because symbol table lookup is very fast, although that doesn't really matter when we only have 24 distinct keys.
:::
::: {.callout-note collapse="true"}
### Destructuring a struct or NamedTuple
The expression `for (; chromo, start, stop) in rtbl` is equivalent to three local assignments
```julia
for r in rtbl
chromo = r.chromo
start = r.start
stop = r.stop
...
end
```
In other words, it is equivalent to taking the named fields of a `NamedTuple` or a `struct` and making them available in the local namespace under the same names.
:::
::: {.callout-note collapse="true"}
### Interfaces to row- and column-oriented tables in `Tables.jl`
[Tables.jl](https://github.com/JuliaData/Tables.jl) provides interfaces for row- and column-oriented tables, allowing for one orientation to be viewed as the other. In this case an `Arrow.Table` is column-oriented but `rowtable` of this table returns a `Vector{NamedTuple{(:chromo, :start, :stop), Tuple{String, Int32, Int32}}}` to iterate over the rows. When a "method instance" of `chromodict` is compiled for such a table, the schema of the rows will be known.
```{julia}
@code_warntype chromodict(
rowtable(Arrow.Table("biofast-data-v1/ex-anno.arrow")),
)
```
:::
- In `refrngvecs`, each of the values, a `Vector{UnitRange}`, should be sorted by the `first` element of the `UnitRange`.
- Creating an `IntervalTree` requires the ranges to be sorted by `first` element **and** by `last` element when the `first` elements are equal.
- Define a custom `lt` comparison for this.
```{julia}
let
function lt(x::UnitRange{T}, y::UnitRange{T}) where {T}
fx, fy = first(x), first(y)
return fx == fy ? last(x) < last(y) : fx < fy
end
for v in values(refrngvecs)
sort!(v; lt)
end
end
refrngvecs # note changes in refrngvecs[:chr01]
```
::: {.callout-note collapse="true"}
### `let` blocks
A `let/end` block provides a local namespace. The custom `lt` comparison method will not be visible outside the scope of the block.
:::
- We will use the ranges on chromsome 1 for our timing benchmarks. The target for tests of intersection with a single target range will be the last range on chromosome 1 in "ex-rna.arrow".
```{julia}
refrngvec01 = refrngvecs[:chr01]
tarrngvec01 = tarrngvecs[:chr01]
target = last(tarrngvec01)
```
# Intersecting reference ranges with a target
- Create a method to intersect a target range with a `Vector{UnitRange}` returning the intersections as another `Vector{UnitRange}`.
- A common Julia programming idiom for such cases is to create two methods: one for `intersect!`, which modifies an argument that will hold the result, and one for `intersect`, which allocates the result then calls the `intersect!` method.
- For the `intersect!` method we first `empty!` the result vector, which zeros the vector length but does not shrink the memory chunk allocated to the vector, then `push!` intersections onto it. If there are many calls to an `intersect!` method like this only a few will cause memory allocations.
- The elements of our vectors of reference ranges, like `refrngvec01`, are sorted by their `first` element but globally the `last` elements do not have any special relationships. The best we can do to determine all intersections is a sequential scan of the sorted vector of ranges with an early exit when `last(target) < first(vectorelement)`.
```{julia}
function Base.intersect!(
result::Vector{UnitRange{T}},
target::AbstractUnitRange{<:Integer},
refs::Vector{UnitRange{T}},
) where {T}
empty!(result)
target = UnitRange{T}(target) # coerce the type, if necessary
lastt = last(target)
for rng in refs
lastt < first(rng) && break
isect = intersect(target, rng)
!isempty(isect) && push!(result, isect)
end
return result
end
function Base.intersect(
target::AbstractUnitRange{<:Integer},
refs::Vector{UnitRange{T}},
) where {T}
return intersect!(similar(refs, 0), target, refs)
end
savedresult = intersect(target, refrngvec01) # will use for comparisons
```
```{julia}
result = similar(refrngvec01, 0) # 0-length vector of same type
isequal(intersect!(result, target, refrngvec01), savedresult)
```
# Dictionaries of RangeTrees and IntervalTrees
## RangeTrees
- [RangeTrees.jl](https://github.com/dmbates/RangeTrees.jl) provides an implementation of [interval trees](https://en.wikipedia.org/wiki/Interval_tree) using the [augmented binary tree](https://en.wikipedia.org/wiki/Interval_tree#Augmented_tree) formulation.
- Because the tree is represented by its root node, there is no `RangeTree` type or constructor, only a `RangeNode`.
```{julia}
refrngtrees = Dict(k => RangeNode(v) for (k, v) in refrngvecs)
rangetree01 = refrngtrees[:chr01]
treesize(rangetree01), treeheight(rangetree01), treebreadth(rangetree01)
```
```{julia}
print_tree(rangetree01; maxdepth = 3)
```
::: {.callout-note collapse="true"}
### Augmented interval trees
A smaller example may help to understand how this type of interval tree.
Consider the first 7 `UnitRange`s in `refrngvecs[:chr01]`
```{julia}
smallrngvec = refrngvecs[:chr01][1:15]
```
```{julia}
rn = RangeNode(smallrngvec)
print_tree(rn)
```
- The `UnitRange` in the root node is the 7th out of the 15 sorted ranges from which the tree was constructed.
- Each of the nodes in the tree can have 0, 1, or 2 child nodes. Those with 0 children are called the "leaves" of the tree.
```{julia}
collect(Leaves(rn))
```
- In addition to the `UnitRange` it represents, each `RangeNode` stores `maxlast`, the maximum value of `last(rng)` for any `UnitRange` in the tree rooted at this node. For a leaf `maxlast` is simply `last` of its `UnitRange`.
- For other nodes, `maxlast` can be larger than `last` of its `UnitRange`.
- In particular, for the root node `maxlast` is the maximum of all the `last` values of the `UnitRange`s that generated the tree.
```{julia}
maximum(last.(smallrngvec))
```
[RangeTrees.jl](https://github.com/dmbates/RangeTrees.jl) defines methods for generics like `children`, `getroot`, and `nodevalue` from [AbstractTrees.jl](https://github.com/JuliaCollections/AbstractTrees.jl) and these allow for many other generics to be applied to a `RangeNode`.
```{julia}
children(rangetree01)
```
The root of the tree can be obtained from any node using `getroot`. The combination of `getroot` and `children` allows traversal of the tree.
```{julia}
getroot(first(children(first(children(rangetree01))))) # get the root from its grandchild
```
Several methods for generic functions are exported from [RangeTrees.jl](https://github.com/dmbates/RangeTrees.jl)
```{julia}
varinfo(RangeTrees)
```
but most of these are defined in [AbstractTrees.jl](https://github.com/JuliaCollections/AbstractTrees.jl).
:::
- Evaluating and storing `maxlast` in the nodes allows for overlap searches to be truncated at nodes for which `maxlast < first(target)`. The sorting by `first` allows for skipping the right subtree whenever `last(target) < first(thisnode)` (as in the `intersect!` method for `Vector{UnitRange}`).
- `intersect` and `intersect!` methods are already defined in [RangeTrees.jl](https://github.com/dmbates/RangeTrees.jl).
- Check that their results agree with the saved result.
```{julia}
isequal(intersect(target, rangetree01), savedresult)
```
```{julia}
isequal(intersect!(result, target, rangetree01), savedresult)
```
## IntervalTrees
- Create a dictionary of `IntervalTree`s. It is somewhat tedious to get the type of the result correct and we create a function to hide the details.
```{julia}
function toitrees(rngdict::Dict{S,Vector{UnitRange{T}}}) where {S,T}
return Dict(
k => IntervalTree{T,Interval{T}}(Interval.(v)) for (k, v) in rngdict
)
end
refintvltrees = toitrees(refrngvecs)
intvltree01 = refintvltrees[:chr01]
show(intvltree01)
```
- Creating an `intersect!` method is also tedious because the package has its own `Interval` data type and defines `intersect(itr::IntervalTree, (frst, lst))` to return an iterator of `Interval`s in the tree, not the intersection
```{julia}
Interval(target)
```
```{julia}
function Base.intersect!(
res::Vector{UnitRange{T}},
target::AbstractUnitRange,
refs::IntervalTree{T},
) where {T}
empty!(res)
firstt, lastt = first(target), last(target)
for isect in intersect(refs, (firstt, lastt))
push!(res, max(first(isect), firstt):min(last(isect), lastt))
end
return res
end
isequal(intersect!(result, target, intvltree01), savedresult)
```
# Time for a shootout
### Vector{UnitRange}
```{julia}
@benchmark intersect!(res, tar, ref) setup =
(res = result; tar = target; ref = refrngvec01)
```
```{julia}
@benchmark intersect(tar, ref) setup = (tar = target; ref = refrngvec01)
```
### RangeTree
```{julia}
@benchmark intersect!(res, tar, ref) setup =
(res = result; tar = target; ref = rangetree01)
```
```{julia}
@benchmark intersect(tar, ref) setup = (tar = target; ref = rangetree01)
```
### IntervalTree
```{julia}
@benchmark intersect!(res, tar, ref) setup =
(res = result; tar = target; ref = intvltree01)
```
- In what follows we will use the `RangeTree` method for `intersect!` to obtain the intersections.
# Determining coverage
- The `coverage` of a target by a set of reference ranges is the proportion of the base pairs in the target that intersect with one or more of the reference ranges.
- We need to somehow count the number of elements in the union of the ranges returned from `intersect!`.
- This could be done using the `union!` method for a [BitSet](https://docs.julialang.org/en/v1/base/collections/#Base.BitSet) but that approach has two problems: it is comparatively slow and, in Julia versions up to 1.8.0-rc1, it can be wrong.
::: {.callout-note collapse="true"}
### Bug in `union!` for `BitSet`
- While creating these notes we discovered a [bug](https://github.com/JuliaLang/julia/pull/45574) in the `union!` method for a `BitSet` and a `UnitRange`.
- There was a [PR](https://github.com/JuliaLang/julia/pull/45578) to fix it the next morning.
- Versions of Julia prior to 1.8.0-rc2 can (and probably will) return incorrect values of coverage.
- Replacing `union!(bs, isect)` by `union!(bs, BitSet(isect))` avoids this "infelicity" at the expense of more memory usage and compute time.
:::
- There is a better method that takes advantage of the intersecting ranges being sorted by `first`
- The idea is to "keep moving the goalposts". When evaluating the coverage count, add the length of the current reference range's intersection with only the part to the right of what has already been covered. The thing we know about the intersecting ranges is that each successive ranges's `first` position is greater than or equal to the `first` position of all the ranges preceding it. That is, the ranges can't "move left" as we iterate through them.
```{julia}
function coveragecount(
target::AbstractUnitRange,
isects::Vector{UnitRange{T}},
) where {T}
leftpost, rightpost = T(first(target)), T(last(target))
coverage = 0
for isect in isects
coverage += length(intersect(isect, leftpost:rightpost))
leftpost = max(leftpost, last(isect) + one(T))
end
return coverage
end
coveragecount(target, savedresult)
```
# Iterating over a collection of targets
```{julia}
function overlaps(targets::Vector{UnitRange{T}}, refs::RangeNode{T}) where {T}
nover = similar(targets, Int)
covcount = similar(targets, T)
result = empty!(similar(targets, ceil(Int, sqrt(treesize(refs)))))
@inbounds for (i, tar) in enumerate(targets)
nover[i] = length(intersect!(result, tar, refs))
covcount[i] = coveragecount(tar, result)
end
DataFrame(targets = targets, nover = nover, covcount = covcount)
end
overlaps(tarrngvec01, rangetree01)
```
```{julia}
@benchmark overlaps(targets, refs) setup =
(targets = tarrngvec01; refs = rangetree01)
```
# The whole shootin' match
- The computations on different chromosome are independent of each other and can be assigned to different threads when Julia is started with multiple threads.
```{julia}
function overlaps(
tardict::Dict{Symbol,Vector{UnitRange{T}}},
refdict::Dict{Symbol,RangeNode{T,R}},
) where {T,R}
matchedkeys = intersect(keys(tardict), keys(refdict))
result = Dict(k => DataFrame() for k in matchedkeys) # pre-assign key/value pairs
@sync for k in matchedkeys
Threads.@spawn result[k] = overlaps(tardict[k], refdict[k])
end
return result
end
bigresult = overlaps(tarrngvecs, refrngtrees);
bigresult[:chrX]
```
::: {.callout-note collapse="true"}
### Pre-assigning dictionary entries to avoid thread conflict
- The `result` is initialized with all the keys that will be in the result and with values that are empty `DataFrame`s.
- Individual tasks that are `@spawn`ed allocate their `DataFrame` value and merely update the pointer in `result` to that value, eliminating the possibility of collisions with other tasks.
:::
```{julia}
@benchmark overlaps(targets, refs) setup =
(targets = tarrngvecs; refs = refrngtrees)
```
# Conclusions
- The ability to work in the REPL (or VS Code or Jupyter notebooks) encourages iterative refinement of algorithms.
- "Trust but verify" - when making a change or introducing new methods it helps to have results from previous methods for comparison. In general, continuous integration (CI) testing is straightforward for Julia packages and is strongly encouraged.
- There are many tools for benchmarking function execution or storage allocation, allowing a developer to concentrate on where the "real problem" is.
- In certain cases, enhancements like multi-threading can be achieved with very little effort.
# Version information
```{julia}
versioninfo()
```