Commit

More encodings (#299)
* add a tutorial file: up to probabilities so far

* re-write the surrounding docs to account for tutorial file

* rename `Bayes` to `BayesianRegularization`

Because Bayes and Bayesian are such broad terms in the literature, it doesn't feel appropriate to use them here like that.

* correctly rename file

* add discrete info section to tutorial

* correct merging

* fix complexity docstring

* finish first draft of tutorial

* use literate to build the tutorial.

* port emphasis of what is a new entropy

* Correct file name

* Typo

* Minor typo and text fixes to the tutorial.

* Punctuation.

* Amplitude and first difference encoding

* Finish first difference and amplitude encodings

* Add CombinationEncoding

* `encode`/`decode` for state vectors for `GaussianCDF`

* Systematic tests for encoding.

* Add `CombinationEncoding` to docs

* Test `CombinationEncoding`

* Remove utils file

* Inner/outer constructors and tests

* Better descriptions of the new encodings

* Clarify inputs to `CombinationEncoding`

* Use outer constructors, not inner

* Add comment on why we're not enforcing multi-element vectors.

* Fix tests

* Disallow `CombinationEncoding` as input to `CombinationEncoding`s

* Correct `total_outcomes` - use `prod`, not `sum`

* Change names

* Add references in `Encoding` docstring

* Remove redundant type info

* Base.show for the encodings.

* Use `RectangularBinEncoding` internally for GaussianCDFEncoding

Fixes #300 too.

* `RectangularBinEncoding` internally for the new encodings

* Add test

* More tests

* More tests

* Update src/encoding_implementations/gaussian_cdf.jl

Co-authored-by: George Datseris <datseris.george@gmail.com>

* Update src/encoding_implementations/combination_encoding.jl

Co-authored-by: George Datseris <datseris.george@gmail.com>

* Analytical encoding/decoding tests

* Analytical tests for `CombinationEncoding`

* Symbol naming, and drop extra doctest

* Better description

* Remove type restriction. Code will error at lower level if relevant

* Remove show methods

Do at abstract level later

* New constructor

* Update src/encoding_implementations/combination_encoding.jl

Co-authored-by: George Datseris <datseris.george@gmail.com>

* Ensure encodings for `CombinationEncoding` is always a tuple

* Return a tuple of decoded symbols

* Enforce encoding tuple input. Use number of encodings a type param

* Test convenience constructor

* Fix and rearrange tests

* Fix tests

---------

Co-authored-by: Datseris <datseris.george@gmail.com>
kahaaga and Datseris authored Aug 25, 2023
1 parent dfc973b commit 231ae8c
Showing 20 changed files with 710 additions and 187 deletions.
5 changes: 5 additions & 0 deletions CHANGELOG.md
@@ -15,11 +15,16 @@ Further additions to the library in v3:
- Add the 1976 Lempel-Ziv complexity measure (`LempelZiv76`).
- New entropy definition: identification entropy (`Identification`).
- Minor documentation fixes.
- `GaussianCDFEncoding` now can be used with vector-valued inputs.

### Bug fixes

- `outcome_space` for `Dispersion` now correctly returns all possible **sorted** outcomes
(as promised by the `outcome_space` docstring).
- `decode` with `GaussianCDFEncoding` now correctly returns only the left sides of the
`[0, 1]` subintervals, and always returns the decoded symbol as a `Vector{SVector}`
(consistent with `RectangularBinEncoding`), regardless of whether the encoded input was a
scalar or a vector (see the sketch below).

### Renaming

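A minimal sketch of the corrected `decode` behaviour described above, assuming the post-#299 API (the exact bin numbers depend on `μ`, `σ`, and `c`):

```julia
using ComplexityMeasures

# Scalar input: the unit interval is split into c = 3 subintervals.
encoding = GaussianCDFEncoding(; μ = 0.0, σ = 1.0, c = 3)

s = encode(encoding, 0.5)   # an integer in 1:3
decode(encoding, s)         # a Vector{SVector} holding only the left side of the subinterval
```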
3 changes: 3 additions & 0 deletions docs/src/probabilities.md
@@ -143,4 +143,7 @@ decode
OrdinalPatternEncoding
GaussianCDFEncoding
RectangularBinEncoding
RelativeMeanEncoding
RelativeFirstDifferenceEncoding
CombinationEncoding
```
3 changes: 3 additions & 0 deletions src/core/encodings.jl
@@ -13,6 +13,9 @@ Current available encodings are:
- [`OrdinalPatternEncoding`](@ref).
- [`GaussianCDFEncoding`](@ref).
- [`RectangularBinEncoding`](@ref).
- [`RelativeMeanEncoding`](@ref).
- [`RelativeFirstDifferenceEncoding`](@ref).
- [`CombinationEncoding`](@ref), which can combine any of the above encodings.
"""
abstract type Encoding end

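The encodings listed in the docstring above share the `encode`/`decode` interface (and `total_outcomes`, as used throughout this commit). A minimal sketch using `OrdinalPatternEncoding` as a stand-in; the variable names are illustrative, not part of the API:

```julia
using ComplexityMeasures

x = [1.2, 0.4, 2.2]                  # a length-3 state vector
encoding = OrdinalPatternEncoding(3)

total_outcomes(encoding)             # 3! = 6 possible symbols
ω = encode(encoding, x)              # an integer in 1:6
pattern = decode(encoding, ω)        # the outcome that integer stands for (a permutation)
```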
92 changes: 92 additions & 0 deletions src/encoding_implementations/combination_encoding.jl
@@ -0,0 +1,92 @@
export CombinationEncoding

"""
CombinationEncoding <: Encoding
CombinationEncoding(encodings)
A `CombinationEncoding` takes multiple [`Encoding`](@ref)s and creates a combined
encoding that can be used to encode inputs that are compatible with the
given `encodings`.
## Encoding/decoding
When used with [`encode`](@ref), the `i`-th [`Encoding`](@ref) in `encodings` returns an
integer in the set `1, 2, …, nᵢ`, where `nᵢ` is the total number of outcomes for that
encoding. For `k` different encodings, we can thus construct the
cartesian coordinate `(c₁, c₂, …, cₖ)` (`cᵢ ∈ 1, 2, …, nᵢ`), which can be uniquely
identified by an integer. We can thus identify each unique *combined* encoding
with a single integer.
When used with [`decode`](@ref), the integer symbol is converted to its corresponding
cartesian coordinate, which is used to retrieve the decoded symbols for each of
the encodings, and a tuple of the decoded symbols is returned.
The total number of outcomes is `prod(total_outcomes(e) for e in encodings)`.
## Examples
```julia
using ComplexityMeasures
# We want to encode the vector `x`.
x = [0.9, 0.2, 0.3]
# To do so, we will use a combination of first-difference encoding, amplitude encoding,
# and ordinal pattern encoding.
encodings = (
RelativeFirstDifferenceEncoding(0, 1; n = 2),
RelativeMeanEncoding(0, 1; n = 5),
OrdinalPatternEncoding(3) # x is a three-element vector
)
c = CombinationEncoding(encodings)
# Encode `x` as integer
ω = encode(c, x)
# Decode symbol (into a tuple of decodings, one for each encoding `e ∈ encodings`).
# In this particular case, the first two elements will be left bin edges, and
# the last element will be the decoded ordinal pattern (indices that would sort `x`).
d = decode(c, ω)
```
"""
struct CombinationEncoding{N, L, C} <: Encoding
# An iterable of encodings.
encodings::NTuple{N, Encoding}

# internal fields: LinearIndices/CartesianIndices for encodings/decodings.
linear_indices::L
cartesian_indices::C

function CombinationEncoding(encodings::NTuple{N, Encoding}, l::L, c::C) where {N, L, C}
if any(e isa CombinationEncoding for e in encodings)
s = "CombinationEncoding doesn't accept a CombinationEncoding as one of its " *
"sub-encodings."
throw(ArgumentError(s))
end
new{N, L, C}(encodings, l, c)
end
end
CombinationEncoding(encodings) = CombinationEncoding(encodings...)
function CombinationEncoding(encodings::Vararg{Encoding, N}) where N
ranges = tuple([1:total_outcomes(e) for e in encodings]...)
linear_indices = LinearIndices(ranges)
cartesian_indices = CartesianIndices(ranges)
return CombinationEncoding(tuple(encodings...), linear_indices, cartesian_indices)
end

function encode(encoding::CombinationEncoding, χ)
symbols = CartesianIndex(map(e -> encode(e, χ), encoding.encodings))
ω::Int = encoding.linear_indices[symbols]
return ω
end

function decode(encoding::CombinationEncoding, ω::Int)
es = encoding.encodings
cidx = encoding.cartesian_indices[ω]
return map(e -> decode(e, cidx[findfirst(eᵢ -> eᵢ == e, es)]), es)
end

function total_outcomes(encoding::CombinationEncoding)
return prod(total_outcomes.(encoding.encodings))
end
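A sketch of how the cartesian-to-linear bookkeeping above plays out in practice, mirroring the docstring example; the decoded representations (a left bin edge and an ordinal pattern) are assumptions based on the underlying encodings:

```julia
using ComplexityMeasures

x = [0.9, 0.2, 0.3]
c = CombinationEncoding(RelativeMeanEncoding(0, 1; n = 5), OrdinalPatternEncoding(3))

total_outcomes(c)   # 5 * 6 = 30 combined outcomes
ω = encode(c, x)    # a single integer in 1:30
d = decode(c, ω)    # a 2-tuple: (decoded RelativeMeanEncoding symbol, ordinal pattern)
```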
3 changes: 3 additions & 0 deletions src/encoding_implementations/encoding_implementations.jl
@@ -2,3 +2,6 @@ include("fasthist.jl")
include("rectangular_binning.jl")
include("gaussian_cdf.jl")
include("ordinal_pattern.jl")
include("relative_mean_encoding.jl")
include("relative_first_difference_encoding.jl")
include("combination_encoding.jl")
91 changes: 74 additions & 17 deletions src/encoding_implementations/gaussian_cdf.jl
@@ -4,19 +4,25 @@ export GaussianCDFEncoding

"""
GaussianCDFEncoding <: Encoding
GaussianCDFEncoding(; μ, σ, c::Int = 3)
GaussianCDFEncoding{m}(; μ, σ, c::Int = 3)
An encoding scheme that [`encode`](@ref)s a scalar value into one of the integers
An encoding scheme that [`encode`](@ref)s a scalar or vector `χ` into one of the integers
`sᵢ ∈ [1, 2, …, c]` based on the normal cumulative distribution function (NCDF),
and [`decode`](@ref)s the `sᵢ` into subintervals of `[0, 1]` (with some loss of information).
Notice that the decoding step does not yield an element of any outcome space of the
estimators that use `GaussianCDFEncoding` internally, such as [`Dispersion`](@ref).
That is because these estimators additionally delay embed the encoded data.
## Initializing a `GaussianCDFEncoding`
The size of the input to be encoded must be known beforehand. One must therefore set
`m = length(χ)`, where `χ` is the input (`m = 1` for scalars, `m ≥ 2` for vectors).
To do so, one must explicitly give `m` as a type parameter: e.g.
`encoding = GaussianCDFEncoding{3}(; μ = 0.0, σ = 0.1)` to encode 3-element vectors,
or `encoding = GaussianCDFEncoding{1}(; μ = 0.0, σ = 0.1)` to encode scalars.
## Description
`GaussianCDFEncoding` first maps an input point ``x`` (scalar) to a new real number
### Encoding/decoding scalars
`GaussianCDFEncoding` first maps an input scalar ``χ`` to a new real number
``y \\in [0, 1]`` by using the normal cumulative distribution function (CDF) with the
given mean `μ` and standard deviation `σ`, according to the map
@@ -31,6 +37,24 @@ Next, the interval `[0, 1]` is equidistantly binned and enumerated ``1, 2, \\ldo
Because of the floor operation, some information is lost, so when used with
[`decode`](@ref), each decoded `sᵢ` is mapped to a *subinterval* of `[0, 1]`.
This subinterval is returned as a length-`1` `Vector{SVector}`.
Notice that the decoding step does not yield an element of any outcome space of the
estimators that use `GaussianCDFEncoding` internally, such as [`Dispersion`](@ref).
That is because these estimators additionally delay embed the encoded data.
### Encoding/decoding vectors
If `GaussianCDFEncoding` is used with a vector `χ`, then each element of `χ` is
encoded separately, resulting in a length-`length(χ)` sequence of integers which may be
treated as a `CartesianIndex`. The encoded symbol `s ∈ [1, 2, …, cᵐ]` is then just the
linear index corresponding to this cartesian index (similar to how
[`CombinationEncoding`](@ref) works).
When [`decode`](@ref)d, the integer symbol `s` is converted back into its `CartesianIndex`
representation, which is just a sequence of integers that refer to subdivisions
of the `[0, 1]` interval. The relevant subintervals are then returned as a length-`m`
`Vector{SVector}`.
## Examples
@@ -55,31 +79,64 @@ julia> decode(encoding, 3)
0.6
```
"""
struct GaussianCDFEncoding{T} <: Encoding
struct GaussianCDFEncoding{m, T, L <: LinearIndices, C <: CartesianIndices, R} <: Encoding
c::Int
σ::T
μ::T
# We require the input data, because we need σ and μ for encoding single values.
function GaussianCDFEncoding(; μ::T, σ::T, c::Int = 3) where T
new{T}(c, σ, μ)

# Internal fields: LinearIndices/CartesianIndices for encoding/decoding, and a
# RectangularBinEncoding for discretizing the interval [0, 1].
linear_indices::L
cartesian_indices::C
binencoder::R # RectangularBinEncoding

# The input `m` restricts what length the input scalar/vector can be.
function GaussianCDFEncoding{m}(; μ::T, σ::T, c::Int = 3) where {m, T}
m >= 1 || throw(ArgumentError("m must be an integer ≥ 1. Got $m."))
ranges = tuple([1:c for i in 1:m]...)
cartesian_indices = CartesianIndices(ranges)
linear_indices = LinearIndices(ranges)
L = typeof(linear_indices)
C = typeof(cartesian_indices)
binencoder = RectangularBinEncoding(FixedRectangularBinning(0, 1, c + 1))
R = typeof(binencoder)
new{m, T, L, C, R}(c, σ, μ, linear_indices, cartesian_indices, binencoder)
end
end

total_outcomes(encoding::GaussianCDFEncoding) = encoding.c
# Backwards compatibility (previously, only scalars were encodable)
GaussianCDFEncoding(; kwargs...) = GaussianCDFEncoding{1}(; kwargs...)

function total_outcomes(encoding::GaussianCDFEncoding{m}) where m
c = encoding.c
return c^m
end

gaussian(x, μ, σ) = exp((-(x - μ)^2)/(2σ^2))

function encode(encoding::GaussianCDFEncoding, x::Real)
(; c, σ, μ) = encoding
σ, μ = encoding.σ, encoding.μ
# We only need the value of the integral (not the error), so
# index first element returned from quadgk
k = 1/(σ*sqrt(2π))
y = k * first(quadgk(x -> gaussian(x, μ, σ), -Inf, x))
return floor(Int, y / (1 / c)) + 1
# The integral estimate sometimes returns a value slightly above 1.0, so we need
# to adjust to be sure that all points fall within the FixedRectangularBinning.
y_corrected = min(y, 1.0)
return encode(encoding.binencoder, y_corrected)
end

function decode(encoding::GaussianCDFEncoding, i::Int)
c = encoding.c
lower_interval_bound = (i - 1)/(c)
return SVector(lower_interval_bound, prevfloat(lower_interval_bound + 1/c))
function encode(encoding::GaussianCDFEncoding{m}, x::AbstractVector) where m
L = length(x)
if L != m
throw(ArgumentError("length(`x`) must equal `m` (got length(x)=$L, m=$m)"))
end
symbols = encode.(Ref(encoding), x)
ω::Int = encoding.linear_indices[symbols...]
return ω
end

function decode(encoding::GaussianCDFEncoding, ω::Int)
cidxs = Tuple(encoding.cartesian_indices[ω])
return [decode(encoding.binencoder, cᵢ) for cᵢ in cidxs]
end
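A sketch of the scalar and vector code paths implemented above, assuming the constructors and defaults shown in this file (the specific input values are arbitrary):

```julia
using ComplexityMeasures

# Scalar input (m = 1): the CDF value of the input is binned into one of c = 3 bins.
enc1 = GaussianCDFEncoding{1}(; μ = 0.0, σ = 1.0, c = 3)
encode(enc1, 0.5)

# Vector input (m = 3): each element is binned separately, giving a CartesianIndex
# that is flattened to a single integer in 1:c^m = 1:27.
enc3 = GaussianCDFEncoding{3}(; μ = 0.0, σ = 1.0, c = 3)
ω = encode(enc3, [-0.5, 0.0, 1.2])
decode(enc3, ω)             # a Vector of 3 SVectors (left edges of the subintervals)
total_outcomes(enc3)        # 3^3 = 27
```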
99 changes: 99 additions & 0 deletions src/encoding_implementations/relative_first_difference_encoding.jl
@@ -0,0 +1,99 @@
export RelativeFirstDifferenceEncoding

"""
RelativeFirstDifferenceEncoding <: Encoding
RelativeFirstDifferenceEncoding(minval::Real, maxval::Real; n = 2)
`RelativeFirstDifferenceEncoding` encodes a vector based on the relative position that the
average of the vector's *first differences* has with respect to predefined minimum and
maximum values (`minval` and `maxval`, respectively).
## Description
This encoding is inspired by Azami & Escudero[^Azami2016]'s algorithm for amplitude-aware
permutation entropy. They use a linear combination of amplitude information and
first differences information of state vectors to correct probabilities. Here, however,
we explicitly encode the first differences part of the correction as an a integer symbol
`Λ ∈ [1, 2, …, n]`. The amplitude part of the encoding is available
as the [`RelativeMeanEncoding`](@ref) encoding.
## Encoding/decoding
When used with [`encode`](@ref), an ``m``-element state vector
``\\bf{x} = (x_1, x_2, \\ldots, x_m)`` is encoded
as ``Λ = \\dfrac{1}{m - 1}\\sum_{k=2}^m |x_{k} - x_{k-1}|``. The value of ``Λ`` is then
normalized to lie on the interval `[0, 1]`, assuming that the minimum/maximum value any
single ``|x_k - x_{k-1}|`` can take is `minval`/`maxval`, respectively. Finally, the
interval `[0, 1]` is discretized into `n` discrete bins, enumerated by positive integers
`1, 2, …, n`, and the number of the bin that the normalized ``Λ`` falls into is returned.
The smaller the mean first difference of the state vector, the smaller the bin number;
the larger the mean first difference, the larger the bin number.
When used with [`decode`](@ref), the left-edge of the bin that the normalized ``Λ``
fell into is returned.
## Performance tips
If you are encoding multiple input vectors, it is more efficient to construct a
[`RelativeFirstDifferenceEncoding`](@ref) instance and re-use it:
```julia
minval, maxval = 0, 1
encoding = RelativeFirstDifferenceEncoding(minval, maxval; n = 4)
pts = [rand(3) for i = 1:1000]
[encode(encoding, x) for x in pts]
```
[^Azami2016]:
Azami, H., & Escudero, J. (2016). Amplitude-aware permutation entropy:
Illustration in spike detection and signal segmentation. Computer methods and
programs in biomedicine, 128, 40-51.
"""
Base.@kwdef struct RelativeFirstDifferenceEncoding{R} <: Encoding
n::Int = 2
minval::Real
maxval::Real
binencoder::R # RectangularBinEncoding

function RelativeFirstDifferenceEncoding(n::Int, minval::Real, maxval::Real, binencoder::R) where R
if minval > maxval
s = "Need minval <= maxval. Got minval=$minval and maxval=$maxval."
throw(ArgumentError(s))
end
if n < 1
throw(ArgumentError("n must be ≥ 1"))
end
new{typeof(binencoder)}(n, minval, maxval, binencoder)
end
end

function RelativeFirstDifferenceEncoding(minval::Real, maxval::Real; n = 2)
binencoder = RectangularBinEncoding(FixedRectangularBinning(0, 1, n + 1))
return RelativeFirstDifferenceEncoding(n, minval, maxval, binencoder)
end

function encode(encoding::RelativeFirstDifferenceEncoding, x::AbstractVector{<:Real})
(; n, minval, maxval, binencoder) = encoding

L = length(x)
Λ = 0.0 # a loop is much faster than using `diff` (which allocates a new vector)
for i = 2:L
Λ += abs(x[i] - x[i - 1])
end
Λ /= (L - 1)

# Normalize to [0, 1]
Λ_normalized = (Λ - minval) / (maxval - minval)

# Return an integer from the set {1, 2, …, encoding.n}
return encode(binencoder, Λ_normalized)
end

function decode(encoding::RelativeFirstDifferenceEncoding, ω::Int)
# Return the left-edge of the bin.
return decode(encoding.binencoder, ω)
end

function total_outcomes(encoding::RelativeFirstDifferenceEncoding)
return total_outcomes(encoding.binencoder)
end
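A worked sketch of the encoding step defined above, with the mean first difference computed by hand (assuming bin edges as produced by `FixedRectangularBinning(0, 1, n + 1)`):

```julia
using ComplexityMeasures

x = [0.1, 0.8, 0.3, 0.5]
encoding = RelativeFirstDifferenceEncoding(0.0, 1.0; n = 4)

# By hand: Λ = (|0.8 - 0.1| + |0.3 - 0.8| + |0.5 - 0.3|) / 3 = 1.4 / 3 ≈ 0.467.
# With minval = 0 and maxval = 1, the normalized value is also ≈ 0.467, which falls
# into the second of the four bins [0, 0.25), [0.25, 0.5), [0.5, 0.75), [0.75, 1].
ω = encode(encoding, x)   # expected: 2
decode(encoding, ω)       # the left edge of that bin, i.e. 0.25
```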
