Sample entropy #71
Conversation
Codecov Report
```
@@            Coverage Diff             @@
##             main      #71      +/-   ##
==========================================
- Coverage   80.02%   79.65%   -0.38%
==========================================
  Files          32       33       +1
  Lines        706      747      +41
==========================================
+ Hits         565      595      +30
- Misses       141      152      +11
```
Locating and counting neighbors is the performance bottleneck for this algorithm. Using `bulkisearch` allocates a vector of neighbor indices for every query point. The newly added `inrangecount` in NearestNeighbors.jl only counts the neighbors within the radius. For this algorithm, dropping the per-point index allocations gives a substantial speed-up, so I am therefore using `inrangecount` where only the counts are needed:

```julia
using NearestNeighbors, Neighborhood
using DelayEmbeddings   # for `genembed`
using StatsBase         # for `std`

function computeprobs_onlycts(x; k::Int, r, metric = Chebyshev())
    N = length(x)
    pts = genembed(x, 0:(k - 1))
    # For each `k`-dimensional xᵢ ∈ pts, count its within-range-`r` neighbors,
    # excluding the point `xᵢ` as a neighbor to itself.
    tree = KDTree(pts, metric)
    # `inrangecount` includes the point itself, so subtract 1.
    cts = [inrangecount(tree, pᵢ, r) - 1 for pᵢ in pts]
    # Pᵐ := the probability that two sequences will match for `k` points.
    Pᵐ = 0.0
    c = N - k - 1
    for ct in cts
        Pᵐ += ct / c
    end
    Pᵐ /= N - k
    return Pᵐ
end

function computeprobs(x; k::Int, r, metric = Chebyshev())
    N = length(x)
    pts = genembed(x, 0:(k - 1))
    # For each `k`-dimensional xᵢ ∈ pts, locate its within-range-`r` neighbors,
    # excluding the point `xᵢ` as a neighbor to itself.
    tree = KDTree(pts, metric)
    theiler = Theiler(0) # w = 0 in the Theiler window means self-exclusion
    idxs = bulkisearch(tree, pts, WithinRange(r), theiler)
    # Pᵐ := the probability that two sequences will match for `k` points.
    Pᵐ = 0.0
    c = N - k - 1
    for nn_idxsᵢ in idxs
        Pᵐ += length(nn_idxsᵢ) / c
    end
    Pᵐ /= N - k
    return Pᵐ
end

function sample_entropy(x; m = 2, r = StatsBase.std(x), base = MathConstants.e,
        metric = Chebyshev())
    Aᵐ⁺¹ = computeprobs(x; k = m + 1, r = r, metric = metric)
    Bᵐ = computeprobs(x; k = m, r = r, metric = metric)
    return -log(base, Aᵐ⁺¹ / Bᵐ)
end

function sample_entropy_ctsonly(x; m = 2, r = StatsBase.std(x),
        base = MathConstants.e, metric = Chebyshev())
    Aᵐ⁺¹ = computeprobs_onlycts(x; k = m + 1, r = r, metric = metric)
    Bᵐ = computeprobs_onlycts(x; k = m, r = r, metric = metric)
    return -log(base, Aᵐ⁺¹ / Bᵐ)
end
```

Benchmarks:

```julia
using BenchmarkTools
x = rand(10000)

sample_entropy(x, r = 0.25, m = 2)
@btime sample_entropy($x, r = 0.25, m = 2)
# 587.655 ms (139491 allocations: 741.82 MiB)

sample_entropy_ctsonly(x, r = 0.25, m = 2)
@btime sample_entropy_ctsonly($x, r = 0.25, m = 2)
# 283.440 ms (117696 allocations: 3.38 MiB)
```
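For cross-checking the tree-based implementations, sample entropy can also be computed by brute force in pure Julia. This is a minimal O(N²) sketch, not part of the PR; `sample_entropy_bruteforce` and its helpers are hypothetical names, and it uses the common convention of comparing the same N − m windows for both template lengths:

```julia
# Chebyshev distance between the m-length windows of x starting at i and j.
function window_dist(x, i, j, m)
    d = 0.0
    for k in 0:(m - 1)
        d = max(d, abs(x[i + k] - x[j + k]))
    end
    return d
end

# Count ordered pairs (i, j), i ≠ j, of m-length windows within radius r,
# taking the first `nwin` windows.
function count_matches(x, m, r, nwin)
    c = 0
    for i in 1:nwin, j in 1:nwin
        i == j && continue
        window_dist(x, i, j, m) <= r && (c += 1)
    end
    return c
end

function sample_entropy_bruteforce(x; m = 2, r = 0.2)
    nwin = length(x) - m  # same number of windows for lengths m and m + 1
    B = count_matches(x, m, r, nwin)
    A = count_matches(x, m + 1, r, nwin)
    return -log(A / B)
end
```

Because every (m+1)-point match is also an m-point match, A ≤ B, so the result is non-negative whenever A > 0.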
You don't need to add NearestNeighbors to Project.toml. Do instead:
@Datseris This PR has been updated to use the new API.
```math
\begin{aligned}
B(r, m, N) = \sum_{i = 1}^{N-m\tau} \sum_{j = 1, j \neq i}^{N-m\tau} \theta(d({\bf x}_i^m, {\bf x}_j^m) \leq r) \\
A(r, m, N) = \sum_{i = 1}^{N-m\tau} \sum_{j = 1, j \neq i}^{N-m\tau} \theta(d({\bf x}_i^{m+1}, {\bf x}_j^{m+1}) \leq r)
\end{aligned}
```
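For reference, these two counts combine into the reported quantity as follows (assuming the standard definition of sample entropy, which matches the `-log(base, Aᵐ⁺¹ / Bᵐ)` used in the code earlier in this thread):

```math
SampEn(m, r, N) = -\ln\left(\frac{A(r, m, N)}{B(r, m, N)}\right)
```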
Hmmm, this is suspiciously similar to Cao's method for deducing an optimal embedding... Right? https://juliadynamics.github.io/DynamicalSystems.jl/dev/embedding/traditional/#DelayEmbeddings.delay_afnn You don't compute the average distance, but you compute how many points are within distance `r`, and how this changes from embedding `m` to embedding `m+1`.
I'm not intimately familiar with Cao's method, but I just had a quick glance at the source code, and it seems there is some overlap. I see that you use `bulkisearch` in the `_average_a` function. I saw at least a 2x speed-up and 2+ times fewer allocations here by using `inrangecount`, so using it there too could probably improve performance.
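The difference between the two styles can be sketched without any package (pure Julia; `count_in_range` and `collect_in_range` are hypothetical illustrative names, not the NearestNeighbors.jl API):

```julia
# Chebyshev (maximum-coordinate) distance between two points.
chebyshev(a, b) = maximum(abs.(a .- b))

# Counting style (what `inrangecount` enables): one integer per query,
# no intermediate index vector.
count_in_range(pts, p, r) = count(q -> chebyshev(p, q) <= r, pts)

# Index-collecting style (what `bulkisearch`-like APIs return): allocates a
# Vector{Int} per query point, which is wasted work if only the count is used.
collect_in_range(pts, p, r) = [i for (i, q) in pairs(pts) if chebyshev(p, q) <= r]
```

Both agree on the count; when only counts are needed, the first style avoids one heap allocation per query point, which is where the large allocation reduction in the benchmarks earlier in this thread comes from.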
Co-authored-by: George Datseris <datseris.george@gmail.com>
- Add convenience method
- Missing word
- Address review comments
- Improve syntax
I think all comments should be addressed now. The PR also uses the new version of Neighborhood.jl.
No description provided.