Commit

More encodings (#299)
* add a tutorial file: up to probabilities so far

* re-write the surrounding docs to account for tutorial file

* rename `Bayes` to `BayesianRegularization`

Because Bayes and Bayesian are such broad terms in the literature, it doesn't feel appropriate to use them here like that.

* correctly rename file

* add discrete info section to tutorial

* correct merging

* fix complexity docstring

* finish first draft of tutorial

* use literate to build the tutorial.

* port emphasis of what is a new entropy

* Correct file name

* Typo

* Minor typo and text fixes to the tutorial.

* Punctuation.

* Amplitude and first difference encoding

* Finish first difference and amplitude encodings

* Add CombinationEncoding

* `encode`/`decode` for state vectors for `GaussianCDF`

* Systematic tests for encoding.

* Add `CombinationEncoding` to docs

* Test `CombinationEncoding`

* Remove utils file

* Inner/outer constructors and tests

* Better descriptions of the new encodings

* Clarify inputs to `CombinationEncoding`

* Use outer constructors, not inner

* Add comment on why we're not enforcing multi-element vectors.

* Fix tests

* Disallow `CombinationEncoding` as input to `CombinationEncoding`s

* Correct `total_outcomes` - use `prod`, not `sum`

* Change names

* Add references in `Encoding` docstring

* Remove redundant type info

* Base.show for the encodings.

* Use `RectangularBinEncoding` internally for GaussianCDFEncoding

Fixes #300 too.

* `RectangularBinEncoding` internally for the new encodings

* Add test

* More tests

* More tests

* Update src/encoding_implementations/gaussian_cdf.jl

Co-authored-by: George Datseris <datseris.george@gmail.com>

* Update src/encoding_implementations/combination_encoding.jl

Co-authored-by: George Datseris <datseris.george@gmail.com>

* Analytical encoding/decoding tests

* Analytical tests for `CombinationEncoding`

* Symbol naming, and drop extra doctest

* Better description

* Remove type restriction. Code will error at lower level if relevant

* Remove show methods

Do at abstract level later

* New constructor

* Update src/encoding_implementations/combination_encoding.jl

Co-authored-by: George Datseris <datseris.george@gmail.com>

* Ensure encodings for `CombinationEncoding` is always a tuple

* Return a tuple of decoded symbols

* Enforce encoding tuple input. Use number of encodings a type param

* Test convenience constructor

* Fix and rearrange tests

* Fix tests

---------

Co-authored-by: Datseris <datseris.george@gmail.com>
kahaaga and Datseris authored Aug 25, 2023
1 parent dfc973b commit 231ae8c
Showing 20 changed files with 710 additions and 187 deletions.
5 changes: 5 additions & 0 deletions CHANGELOG.md
@@ -15,11 +15,16 @@ Further additions to the library in v3:
- Add the 1976 Lempel-Ziv complexity measure (`LempelZiv76`).
- New entropy definition: identification entropy (`Identification`).
- Minor documentation fixes.
- `GaussianCDFEncoding` now can be used with vector-valued inputs.

### Bug fixes

- `outcome_space` for `Dispersion` now correctly returns all possible **sorted** outcomes
(as promised by the `outcome_space` docstring).
- `decode` with `GaussianCDFEncoding` now correctly returns only the left sides of the
`[0, 1]` subintervals, and always returns the decoded symbol as a `Vector{SVector}`
(consistent with `RectangularBinEncoding`), regardless of whether the encoded input was a
scalar or a vector (see the sketch below).

### Renaming

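A minimal sketch of the corrected `decode` behaviour described above, assuming the post-#299 API (the exact bin numbers depend on `μ`, `σ`, and `c`):

```julia
using ComplexityMeasures

# Scalar input: the unit interval is split into c = 3 subintervals.
encoding = GaussianCDFEncoding(; μ = 0.0, σ = 1.0, c = 3)

s = encode(encoding, 0.5)   # an integer in 1:3
decode(encoding, s)         # a Vector{SVector} holding only the left side of the subinterval
```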
3 changes: 3 additions & 0 deletions docs/src/probabilities.md
@@ -143,4 +143,7 @@ decode
OrdinalPatternEncoding
GaussianCDFEncoding
RectangularBinEncoding
RelativeMeanEncoding
RelativeFirstDifferenceEncoding
CombinationEncoding
```
3 changes: 3 additions & 0 deletions src/core/encodings.jl
@@ -13,6 +13,9 @@ Current available encodings are:
- [`OrdinalPatternEncoding`](@ref).
- [`GaussianCDFEncoding`](@ref).
- [`RectangularBinEncoding`](@ref).
- [`RelativeMeanEncoding`](@ref).
- [`RelativeFirstDifferenceEncoding`](@ref).
- [`CombinationEncoding`](@ref), which can combine any of the above encodings.
"""
abstract type Encoding end

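The encodings listed in the docstring above share the `encode`/`decode` interface (and `total_outcomes`, as used throughout this commit). A minimal sketch using `OrdinalPatternEncoding` as a stand-in; the variable names are illustrative, not part of the API:

```julia
using ComplexityMeasures

x = [1.2, 0.4, 2.2]                  # a length-3 state vector
encoding = OrdinalPatternEncoding(3)

total_outcomes(encoding)             # 3! = 6 possible symbols
ω = encode(encoding, x)              # an integer in 1:6
pattern = decode(encoding, ω)        # the outcome that integer stands for (a permutation)
```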
92 changes: 92 additions & 0 deletions src/encoding_implementations/combination_encoding.jl
@@ -0,0 +1,92 @@
export CombinationEncoding

"""
CombinationEncoding <: Encoding
CombinationEncoding(encodings)
A `CombinationEncoding` takes multiple [`Encoding`](@ref)s and creates a combined
encoding that can be used to encode inputs that are compatible with the
given `encodings`.
## Encoding/decoding
When used with [`encode`](@ref), the `i`-th [`Encoding`](@ref) in `encodings` returns an
integer in the set `1, 2, …, nᵢ`, where `nᵢ` is the total number of outcomes for that
encoding. For `k` different encodings, we can thus construct the
cartesian coordinate `(c₁, c₂, …, cₖ)` (`cᵢ ∈ 1, 2, …, nᵢ`), which can be uniquely
identified by an integer. We can thus identify each unique *combined* encoding
with a single integer.
When used with [`decode`](@ref), the integer symbol is converted to its corresponding
cartesian coordinate, which is used to retrieve the decoded symbols for each of
the encodings, and a tuple of the decoded symbols is returned.
The total number of outcomes is `prod(total_outcomes(e) for e in encodings)`.
## Examples
```julia
using ComplexityMeasures
# We want to encode the vector `x`.
x = [0.9, 0.2, 0.3]
# To do so, we will use a combination of first-difference encoding, amplitude encoding,
# and ordinal pattern encoding.
encodings = (
RelativeFirstDifferenceEncoding(0, 1; n = 2),
RelativeMeanEncoding(0, 1; n = 5),
OrdinalPatternEncoding(3) # x is a three-element vector
)
c = CombinationEncoding(encodings)
# Encode `x` as integer
ω = encode(c, x)
# Decode symbol (into a tuple of decodings, one for each encoding `e ∈ encodings`).
# In this particular case, the first two elements will be left bin edges, and
# the last element will be the decoded ordinal pattern (indices that would sort `x`).
d = decode(c, ω)
```
"""
struct CombinationEncoding{N, L, C} <: Encoding
# An iterable of encodings.
encodings::NTuple{N, Encoding}

# internal fields: LinearIndices/CartesianIndices for encodings/decodings.
linear_indices::L
cartesian_indices::C

function CombinationEncoding(encodings::NTuple{N, Encoding}, l::L, c::C) where {N, L, C}
if any(e isa CombinationEncoding for e in encodings)
s = "CombinationEncoding doesn't accept a CombinationEncoding as one of its " *
"sub-encodings."
throw(ArgumentError(s))
end
new{N, L, C}(encodings, l, c)
end
end
CombinationEncoding(encodings) = CombinationEncoding(encodings...)
function CombinationEncoding(encodings::Vararg{Encoding, N}) where N
ranges = tuple([1:total_outcomes(e) for e in encodings]...)
linear_indices = LinearIndices(ranges)
cartesian_indices = CartesianIndices(ranges)
return CombinationEncoding(tuple(encodings...), linear_indices, cartesian_indices)
end

function encode(encoding::CombinationEncoding, χ)
symbols = CartesianIndex(map(e -> encode(e, χ), encoding.encodings))
ω::Int = encoding.linear_indices[symbols]
return ω
end

function decode(encoding::CombinationEncoding, ω::Int)
es = encoding.encodings
cidx = encoding.cartesian_indices[ω]
return map(e -> decode(e, cidx[findfirst(eᵢ -> eᵢ == e, es)]), es)
end

function total_outcomes(encoding::CombinationEncoding)
return prod(total_outcomes.(encoding.encodings))
end
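A sketch of how the cartesian-to-linear bookkeeping above plays out in practice, mirroring the docstring example; the decoded representations (a left bin edge and an ordinal pattern) are assumptions based on the underlying encodings:

```julia
using ComplexityMeasures

x = [0.9, 0.2, 0.3]
c = CombinationEncoding(RelativeMeanEncoding(0, 1; n = 5), OrdinalPatternEncoding(3))

total_outcomes(c)   # 5 * 6 = 30 combined outcomes
ω = encode(c, x)    # a single integer in 1:30
d = decode(c, ω)    # a 2-tuple: (decoded RelativeMeanEncoding symbol, ordinal pattern)
```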
3 changes: 3 additions & 0 deletions src/encoding_implementations/encoding_implementations.jl
@@ -2,3 +2,6 @@ include("fasthist.jl")
include("rectangular_binning.jl")
include("gaussian_cdf.jl")
include("ordinal_pattern.jl")
include("relative_mean_encoding.jl")
include("relative_first_difference_encoding.jl")
include("combination_encoding.jl")
91 changes: 74 additions & 17 deletions src/encoding_implementations/gaussian_cdf.jl
@@ -4,19 +4,25 @@ export GaussianCDFEncoding

"""
GaussianCDFEncoding <: Encoding
GaussianCDFEncoding(; μ, σ, c::Int = 3)
GaussianCDFEncoding{m}(; μ, σ, c::Int = 3)
An encoding scheme that [`encode`](@ref)s a scalar value into one of the integers
An encoding scheme that [`encode`](@ref)s a scalar or vector `χ` into one of the integers
`sᵢ ∈ [1, 2, …, c]` based on the normal cumulative distribution function (NCDF),
and [`decode`](@ref)s the `sᵢ` into subintervals of `[0, 1]` (with some loss of information).
Notice that the decoding step does not yield an element of any outcome space of the
estimators that use `GaussianCDFEncoding` internally, such as [`Dispersion`](@ref).
That is because these estimators additionally delay embed the encoded data.
## Initializing a `GaussianCDFEncoding`
The size of the input to be encoded must be known beforehand. One must therefore set
`m = length(χ)`, where `χ` is the input (`m = 1` for scalars, `m ≥ 2` for vectors).
To do so, one must explicitly give `m` as a type parameter: e.g.
`encoding = GaussianCDFEncoding{3}(; μ = 0.0, σ = 0.1)` to encode 3-element vectors,
or `encoding = GaussianCDFEncoding{1}(; μ = 0.0, σ = 0.1)` to encode scalars.
## Description
`GaussianCDFEncoding` first maps an input point ``x`` (scalar) to a new real number
### Encoding/decoding scalars
`GaussianCDFEncoding` first maps an input scalar ``χ`` to a new real number
``y \\in [0, 1]`` by using the normal cumulative distribution function (CDF) with the
given mean `μ` and standard deviation `σ`, according to the map
@@ -31,6 +37,24 @@ Next, the interval `[0, 1]` is equidistantly binned and enumerated ``1, 2, \\ldo
Because of the floor operation, some information is lost, so when used with
[`decode`](@ref), each decoded `sᵢ` is mapped to a *subinterval* of `[0, 1]`.
This subinterval is returned as a length-`1` `Vector{SVector}`.
Notice that the decoding step does not yield an element of any outcome space of the
estimators that use `GaussianCDFEncoding` internally, such as [`Dispersion`](@ref).
That is because these estimators additionally delay embed the encoded data.
### Encoding/decoding vectors
If `GaussianCDFEncoding` is used with a vector `χ`, then each element of `χ` is
encoded separately, resulting in a length-`length(χ)` sequence of integers which may be
treated as a `CartesianIndex`. The encoded symbol `s ∈ [1, 2, …, cᵐ]` is then just the
linear index corresponding to this cartesian index (similar to how
[`CombinationEncoding`](@ref) works).
When [`decode`](@ref)d, the integer symbol `s` is converted back into its `CartesianIndex`
representation, which is just a sequence of integers that refer to subdivisions
of the `[0, 1]` interval. The relevant subintervals are then returned as a length-`m`
`Vector{SVector}`.
## Examples
@@ -55,31 +79,64 @@ julia> decode(encoding, 3)
0.6
```
"""
struct GaussianCDFEncoding{T} <: Encoding
struct GaussianCDFEncoding{m, T, L <: LinearIndices, C <: CartesianIndices, R} <: Encoding
c::Int
σ::T
μ::T
# We require the input data, because we need σ and μ for encoding single values.
function GaussianCDFEncoding(; μ::T, σ::T, c::Int = 3) where T
new{T}(c, σ, μ)

# Internal fields: LinearIndices/CartesianIndices for encoding/decoding, and a
# RectangularBinEncoding for discretizing the interval [0, 1].
linear_indices::L
cartesian_indices::C
binencoder::R # RectangularBinEncoding

# The input `m` restricts what length the input scalar/vector can be.
function GaussianCDFEncoding{m}(; μ::T, σ::T, c::Int = 3) where {m, T}
m >= 1 || throw(ArgumentError("m must be an integer ≥ 1. Got $m."))
ranges = tuple([1:c for i in 1:m]...)
cartesian_indices = CartesianIndices(ranges)
linear_indices = LinearIndices(ranges)
L = typeof(linear_indices)
C = typeof(cartesian_indices)
binencoder = RectangularBinEncoding(FixedRectangularBinning(0, 1, c + 1))
R = typeof(binencoder)
new{m, T, L, C, R}(c, σ, μ, linear_indices, cartesian_indices, binencoder)
end
end

total_outcomes(encoding::GaussianCDFEncoding) = encoding.c
# Backwards compatibility (previously, only scalars were encodable)
GaussianCDFEncoding(; kwargs...) = GaussianCDFEncoding{1}(; kwargs...)

function total_outcomes(encoding::GaussianCDFEncoding{m}) where m
c = encoding.c
return c^m
end

gaussian(x, μ, σ) = exp((-(x - μ)^2)/(2σ^2))

function encode(encoding::GaussianCDFEncoding, x::Real)
(; c, σ, μ) = encoding
σ, μ = encoding.σ, encoding.μ
# We only need the value of the integral (not the error), so
# index first element returned from quadgk
k = 1/(σ*sqrt(2π))
y = k * first(quadgk(x -> gaussian(x, μ, σ), -Inf, x))
return floor(Int, y / (1 / c)) + 1
# The integral estimate sometimes returns a value slightly above 1.0, so we need
# to adjust to be sure that all points fall within the FixedRectangularBinning.
y_corrected = min(y, 1.0)
return encode(encoding.binencoder, y_corrected)
end

function decode(encoding::GaussianCDFEncoding, i::Int)
c = encoding.c
lower_interval_bound = (i - 1)/(c)
return SVector(lower_interval_bound, prevfloat(lower_interval_bound + 1/c))
function encode(encoding::GaussianCDFEncoding{m}, x::AbstractVector) where m
L = length(x)
if L != m
throw(ArgumentError("length(`x`) must equal `m` (got length(x)=$L, m=$m)"))
end
symbols = encode.(Ref(encoding), x)
ω::Int = encoding.linear_indices[symbols...]
return ω
end

function decode(encoding::GaussianCDFEncoding, ω::Int)
cidxs = Tuple(encoding.cartesian_indices[ω])
return [decode(encoding.binencoder, cᵢ) for cᵢ in cidxs]
end
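A sketch of the scalar and vector code paths implemented above, assuming the constructors and defaults shown in this file (the specific input values are arbitrary):

```julia
using ComplexityMeasures

# Scalar input (m = 1): the CDF value of the input is binned into one of c = 3 bins.
enc1 = GaussianCDFEncoding{1}(; μ = 0.0, σ = 1.0, c = 3)
encode(enc1, 0.5)

# Vector input (m = 3): each element is binned separately, giving a CartesianIndex
# that is flattened to a single integer in 1:c^m = 1:27.
enc3 = GaussianCDFEncoding{3}(; μ = 0.0, σ = 1.0, c = 3)
ω = encode(enc3, [-0.5, 0.0, 1.2])
decode(enc3, ω)             # a Vector of 3 SVectors (left edges of the subintervals)
total_outcomes(enc3)        # 3^3 = 27
```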
99 changes: 99 additions & 0 deletions src/encoding_implementations/relative_first_difference_encoding.jl
@@ -0,0 +1,99 @@
export RelativeFirstDifferenceEncoding

"""
RelativeFirstDifferenceEncoding <: Encoding
RelativeFirstDifferenceEncoding(minval::Real, maxval::Real; n = 2)
`RelativeFirstDifferenceEncoding` encodes a vector based on the relative position that the
average of the vector's *first differences* has with respect to predefined minimum and
maximum values (`minval` and `maxval`, respectively).
## Description
This encoding is inspired by Azami & Escudero[^Azami2016]'s algorithm for amplitude-aware
permutation entropy. They use a linear combination of amplitude information and
first differences information of state vectors to correct probabilities. Here, however,
we explicitly encode the first differences part of the correction as an a integer symbol
`Λ ∈ [1, 2, …, n]`. The amplitude part of the encoding is available
as the [`RelativeMeanEncoding`](@ref) encoding.
## Encoding/decoding
When used with [`encode`](@ref), an ``m``-element state vector
``\\bf{x} = (x_1, x_2, \\ldots, x_m)`` is encoded
as ``Λ = \\dfrac{1}{m - 1}\\sum_{k=2}^m |x_{k} - x_{k-1}|``. The value of ``Λ`` is then
normalized to lie on the interval `[0, 1]`, assuming that the minimum/maximum value any
single ``|x_k - x_{k-1}|`` can take is `minval`/`maxval`, respectively. Finally, the
interval `[0, 1]` is discretized into `n` discrete bins, enumerated by positive integers
`1, 2, …, n`, and the number of the bin that the normalized ``Λ`` falls into is returned.
The smaller the mean first difference of the state vector, the smaller the bin number;
the larger the mean first difference, the larger the bin number.
When used with [`decode`](@ref), the left-edge of the bin that the normalized ``Λ``
fell into is returned.
## Performance tips
If you are encoding multiple input vectors, it is more efficient to construct a
[`RelativeFirstDifferenceEncoding`](@ref) instance and re-use it:
```julia
minval, maxval = 0, 1
encoding = RelativeFirstDifferenceEncoding(minval, maxval; n = 4)
pts = [rand(3) for i = 1:1000]
[encode(encoding, x) for x in pts]
```
[^Azami2016]:
Azami, H., & Escudero, J. (2016). Amplitude-aware permutation entropy:
Illustration in spike detection and signal segmentation. Computer methods and
programs in biomedicine, 128, 40-51.
"""
Base.@kwdef struct RelativeFirstDifferenceEncoding{R} <: Encoding
n::Int = 2
minval::Real
maxval::Real
binencoder::R # RectangularBinEncoding

function RelativeFirstDifferenceEncoding(n::Int, minval::Real, maxval::Real, binencoder::R) where R
if minval > maxval
s = "Need minval <= maxval. Got minval=$minval and maxval=$maxval."
throw(ArgumentError(s))
end
if n < 1
throw(ArgumentError("n must be ≥ 1"))
end
new{typeof(binencoder)}(n, minval, maxval, binencoder)
end
end

function RelativeFirstDifferenceEncoding(minval::Real, maxval::Real; n = 2)
binencoder = RectangularBinEncoding(FixedRectangularBinning(0, 1, n + 1))
return RelativeFirstDifferenceEncoding(n, minval, maxval, binencoder)
end

function encode(encoding::RelativeFirstDifferenceEncoding, x::AbstractVector{<:Real})
(; n, minval, maxval, binencoder) = encoding

L = length(x)
Λ = 0.0 # a loop is much faster than using `diff` (which allocates a new vector)
for i = 2:L
Λ += abs(x[i] - x[i - 1])
end
Λ /= (L - 1)

# Normalize to [0, 1]
Λ_normalized = (Λ - minval) / (maxval - minval)

# Return an integer from the set {1, 2, …, encoding.n}
return encode(binencoder, Λ_normalized)
end

function decode(encoding::RelativeFirstDifferenceEncoding, ω::Int)
# Return the left-edge of the bin.
return decode(encoding.binencoder, ω)
end

function total_outcomes(encoding::RelativeFirstDifferenceEncoding)
return total_outcomes(encoding.binencoder)
end
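A worked sketch of the encoding step defined above, with the mean first difference computed by hand (assuming bin edges as produced by `FixedRectangularBinning(0, 1, n + 1)`):

```julia
using ComplexityMeasures

x = [0.1, 0.8, 0.3, 0.5]
encoding = RelativeFirstDifferenceEncoding(0.0, 1.0; n = 4)

# By hand: Λ = (|0.8 - 0.1| + |0.3 - 0.8| + |0.5 - 0.3|) / 3 = 1.4 / 3 ≈ 0.467.
# With minval = 0 and maxval = 1, the normalized value is also ≈ 0.467, which falls
# into the second of the four bins [0, 0.25), [0.25, 0.5), [0.5, 0.75), [0.75, 1].
ω = encode(encoding, x)   # expected: 2
decode(encoding, ω)       # the left edge of that bin, i.e. 0.25
```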
