More encodings #299

Merged
merged 59 commits into from
Aug 25, 2023
Changes from 43 commits
Commits
965b9df
add a tutorial file: up to probabilities so far
Datseris Aug 18, 2023
e45cb1d
re-write the surrounding docs to account for tutorial file
Datseris Aug 18, 2023
107d5ca
rename `Bayes` to `BayesianRegularization`
Datseris Aug 18, 2023
5b0c7f5
correctly rename file
Datseris Aug 19, 2023
c25997b
add discrete info section to tutorial
Datseris Aug 19, 2023
72f4090
Merge remote-tracking branch 'origin/main' into tutorial
Datseris Aug 19, 2023
43049af
correct merging
Datseris Aug 19, 2023
d1e745a
fix complexit ydocstring
Datseris Aug 19, 2023
a86102a
finish first draft of tutorial
Datseris Aug 19, 2023
baec12b
use literate to build the tutorial.
Datseris Aug 19, 2023
81b6e10
port emphasis of what is a new entropy
Datseris Aug 19, 2023
84d3956
Correct file name
kahaaga Aug 19, 2023
baa44d7
Typo
kahaaga Aug 19, 2023
4f909ae
Minor typo and text fixes to the tutorial.
kahaaga Aug 19, 2023
24b4b38
Punctuation.
kahaaga Aug 19, 2023
ee892be
Amplitude and first difference encoding
kahaaga Aug 20, 2023
e11d46d
Merge branch 'main' into amplitude_encoding
kahaaga Aug 21, 2023
4c30e78
Finish first difference and amplitude encodings
kahaaga Aug 21, 2023
c4e7548
Add CombinationEncoding
kahaaga Aug 21, 2023
481b1bc
`encode`/`decode` for state vectors for `GaussianCDF``
kahaaga Aug 22, 2023
fa7ff8f
Systematic tests for encoding.
kahaaga Aug 22, 2023
d8079ca
Add `CombinationEncoding` to docs
kahaaga Aug 22, 2023
3255acf
Test `CombinationEncoding`
kahaaga Aug 22, 2023
3a045ad
Merge branch 'main' into amplitude_encoding
kahaaga Aug 22, 2023
394d618
Remove utils file
kahaaga Aug 22, 2023
66207ed
Inner/outer constructors and tests
kahaaga Aug 22, 2023
bfb9403
Better descriptions of the new encodings
kahaaga Aug 22, 2023
605f603
Clarify inputs to `CombinationEncoding`
kahaaga Aug 22, 2023
0a12e38
Use outer constructors, not inner
kahaaga Aug 22, 2023
e7a9606
Add comment on not why we're not enforcing multi-element vectors.
kahaaga Aug 22, 2023
f459bdb
Fix tests
kahaaga Aug 22, 2023
b3df322
Disallow `CombinationEncoding` as input to `CombinationEncoding`s
kahaaga Aug 22, 2023
1801107
Correct `total_outcomes` - use `prod`, not `sum`
kahaaga Aug 22, 2023
c6002b8
Change names
kahaaga Aug 23, 2023
f533cf0
Add references in `Encoding` docstring
kahaaga Aug 23, 2023
ef6c46a
Remove redundant type info
kahaaga Aug 23, 2023
67e93d5
Base.show for the encodings.
kahaaga Aug 23, 2023
d55bf58
Use `RectangularBinEncoding` internally for GaussianCDFEncoding
kahaaga Aug 24, 2023
714d869
`RectangularBinEncoding` internally for the new encodings
kahaaga Aug 24, 2023
45b6031
Add test
kahaaga Aug 24, 2023
d74a71b
More tests
kahaaga Aug 24, 2023
a872497
More tests
kahaaga Aug 24, 2023
1955f50
Merge remote-tracking branch 'origin/amplitude_encoding' into amplitu…
kahaaga Aug 24, 2023
cb04fad
Update src/encoding_implementations/gaussian_cdf.jl
kahaaga Aug 24, 2023
eb12867
Update src/encoding_implementations/combination_encoding.jl
kahaaga Aug 24, 2023
4c3f65f
Analytical encoding/decoding tests
kahaaga Aug 24, 2023
ce071b2
Analytical tests for `CombinationEncoding`
kahaaga Aug 24, 2023
42ab904
Symbol naming, and drop extra doctest
kahaaga Aug 24, 2023
1c9aab6
Better description
kahaaga Aug 24, 2023
d42dedd
Remove type restriction. Code will error at lower level if relevant
kahaaga Aug 24, 2023
c82e746
Remove show methods
kahaaga Aug 24, 2023
3a44f7d
New constructor
kahaaga Aug 25, 2023
e8cd945
Update src/encoding_implementations/combination_encoding.jl
kahaaga Aug 25, 2023
58802bf
Ensure encodings for `CombinationEncoding` is always a tuple
kahaaga Aug 25, 2023
1d1fba0
Return a tuple of decoded symbols
kahaaga Aug 25, 2023
c7d6d96
Enforce encoding tuple input. Use number of encodings a type param
kahaaga Aug 25, 2023
6c5365a
Test convenience constructor
kahaaga Aug 25, 2023
8ae7157
Fix and rearrange tests
kahaaga Aug 25, 2023
ad13c9b
Fix tests
kahaaga Aug 25, 2023
5 changes: 5 additions & 0 deletions CHANGELOG.md
@@ -15,11 +15,16 @@ Further additions to the library in v3:
- Add the 1976 Lempel-Ziv complexity measure (`LempelZiv76`).
- New entropy definition: identification entropy (`Identification`).
- Minor documentation fixes.
- `GaussianCDFEncoding` can now be used with vector-valued inputs.

### Bug fixes

- `outcome_space` for `Dispersion` now correctly returns all possible **sorted** outcomes
(as promised by the `outcome_space` docstring).
- `decode` with `GaussianCDFEncoding` now correctly returns only the left-sides of the
`[0, 1]` subintervals, and always returns the decoded symbol as a `Vector{SVector}`
(consistent with `RectangularBinEncoding`), regardless of whether the input is a scalar
or a vector.

### Renaming

3 changes: 3 additions & 0 deletions docs/src/probabilities.md
@@ -143,4 +143,7 @@ decode
OrdinalPatternEncoding
GaussianCDFEncoding
RectangularBinEncoding
RelativeMeanEncoding
RelativeFirstDifferenceEncoding
CombinationEncoding
```
3 changes: 3 additions & 0 deletions src/core/encodings.jl
@@ -13,6 +13,9 @@ Current available encodings are:
- [`OrdinalPatternEncoding`](@ref).
- [`GaussianCDFEncoding`](@ref).
- [`RectangularBinEncoding`](@ref).
- [`RelativeMeanEncoding`](@ref).
- [`RelativeFirstDifferenceEncoding`](@ref).
- [`CombinationEncoding`](@ref), which can combine any of the above encodings.
"""
abstract type Encoding end

97 changes: 97 additions & 0 deletions src/encoding_implementations/combination_encoding.jl
@@ -0,0 +1,97 @@
export CombinationEncoding

"""
CombinationEncoding <: Encoding
CombinationEncoding(encodings)

A `CombinationEncoding` takes multiple [`Encoding`](@ref)s and creates a combined
encoding that can be used to encode vectors.

## Encoding/decoding

When used with [`encode`](@ref), each [`Encoding`](@ref) in `encodings` returns
an integer in the set `1, 2, …, nᵢ`, where `nᵢ` is the total number of outcomes
for the `i`-th encoding. For `k` different encodings, we can thus construct the
cartesian coordinate `(c₁, c₂, …, cₖ)` (`cᵢ ∈ 1, 2, …, nᵢ`), which can be uniquely
identified by a single integer (its linear index). Each unique *combined* encoding
is therefore represented by one integer.

When used with [`decode`](@ref), the integer symbol is converted to its corresponding
cartesian coordinate, which is used to retrieve the decoded symbols for each of
the encodings.

The total number of outcomes is `prod(total_outcomes(e) for e in encodings)`.

## Examples

```julia
using ComplexityMeasures

# We want to encode the vector `x`.
x = [0.9, 0.2, 0.3]

# To do so, we will use a combination of first-difference encoding, amplitude encoding,
# and ordinal pattern encoding.

encodings = [
RelativeFirstDifferenceEncoding(0, 1; n = 2),
RelativeMeanEncoding(0, 1; n = 5),
OrdinalPatternEncoding(3) # x is a three-element vector
]
c = CombinationEncoding(encodings)

# Encode `x` as integer
ω = encode(c, x)

# Decode symbol (into a vector of decodings, one for each encoding `e ∈ encodings`).
# In this particular case, the first two elements will be left-bin edges, and
# the last element will be the decoded ordinal pattern (indices that would sort `x`).
d = decode(c, ω)
```
"""
struct CombinationEncoding{VE, L, C} <: Encoding
# An iterable of encodings.
encodings::VE

# internal fields: LinearIndices/CartesianIndices for encodings/decodings.
linear_indices::L
cartesian_indices::C

function CombinationEncoding(encodings::VE, l::L, c::C) where {VE, L, C}
if any(e isa CombinationEncoding for e in encodings)
s = "CombinationEncoding doesn't accept a CombinationEncoding as one of its " *
"sub-encodings."
throw(ArgumentError(s))
end
new{VE, L, C}(encodings, l, c)
end
end

function CombinationEncoding(encodings::Vararg{<:Encoding, N}) where N
ranges = tuple([1:total_outcomes(e) for e in encodings]...)
linear_indices = LinearIndices(ranges)
cartesian_indices = CartesianIndices(ranges)
return CombinationEncoding(encodings, linear_indices, cartesian_indices)
end
CombinationEncoding(encodings::Vector{<:Encoding}) = CombinationEncoding(encodings...)

# We could in principle allow any `x` here, but not all encodings support encoding
# single numbers. In particular, the `RelativeFirstDifferenceEncoding` isn't even defined
# for single numbers, and `OrdinalPatternEncoding` also isn't defined for single numbers.
# Therefore, we enforce vector-valued input with encoding.
function encode(encoding::CombinationEncoding, x::AbstractVector{<:Real})
# note: we don't enforce length(x) >= 2 here, because some combinations of
# encodings may work on single-element vectors (even though most don't).
symbols = [encode(e, x) for e in encoding.encodings]
ω::Int = encoding.linear_indices[symbols...]
return ω
end

function decode(encoding::CombinationEncoding, ω::Int)
cidx = encoding.cartesian_indices[ω]
return [decode(e, cidx[i]) for (i, e) in enumerate(encoding.encodings)]
Review comment (Member):

Similar to the above: don't make a vector, use `map` so that a tuple is created.

Reply (Member Author):

> Similar to the above: don't make a vector, use `map` so that a tuple is created.

I'm not sure I get this comment. Mapping anything returns a vector:

julia> map(x -> x, 1:5)
5-element Vector{Int64}:
 1
 2
 3
 4
 5

Reply (Member, Datseris, Aug 24, 2023):

`map` only returns a vector if the collection is a vector. If it's a tuple, it returns a tuple and doesn't allocate anything:

julia> map(cos, (1,2,3))
(0.5403023058681398, -0.4161468365471424, -0.9899924966004454)

So `encodings` must be a tuple.

Reply (Member):

Also, a tuple is equivalent to the cartesian indices, so we also reduce computations when converting.

end
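
The point made in the review thread above can be checked in plain Julia. What follows is a minimal sketch using only Base functions; the names `sizes`, `li`, and `symbols` are illustrative stand-ins and are not part of the PR.

```julia
# `map` preserves the container kind: a range or vector input gives a Vector,
# a tuple input gives a Tuple.
v = map(x -> x, 1:5)
t = map(cos, (1, 2, 3))
println(v isa Vector)  # true
println(t isa Tuple)   # true

# Hypothetical stand-in for a tuple of sub-encodings: here, only their outcome counts.
sizes = (2, 5, 6)
li = LinearIndices(map(n -> 1:n, sizes))

# Mapping over a tuple yields a tuple of sub-symbols, which can be splatted
# directly into the linear-index lookup.
symbols = map(n -> n, sizes)   # (2, 5, 6): the last symbol of each "encoding"
println(li[symbols...])        # 60, the linear index of the last cell
```

Keeping the sub-encodings in a tuple means the per-encoding symbols come back as a tuple too, which is the shape that `LinearIndices` and `CartesianIndices` already work with.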

function total_outcomes(encoding::CombinationEncoding)
return prod(total_outcomes(e) for e in encoding.encodings)
end
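
The linear/cartesian index mapping described in the `CombinationEncoding` docstring can be sketched without the package. This is an illustration under stated assumptions, not the PR's implementation: the sub-encodings are replaced by their outcome counts (2, 5 and 3! = 6, the sizes implied by the docstring example), and `symbols` is a made-up set of per-encoding symbols.

```julia
# Outcome counts of three hypothetical sub-encodings.
sizes = (2, 5, 6)
ranges = map(n -> 1:n, sizes)
linear = LinearIndices(ranges)
cartesian = CartesianIndices(ranges)

# "Encode": combine one symbol per sub-encoding into a single integer.
symbols = (2, 3, 4)
ω = linear[symbols...]

# "Decode": recover the per-encoding symbols from that integer.
recovered = Tuple(cartesian[ω])

println(ω)                              # 36 = 2 + (3 - 1)*2 + (4 - 1)*2*5
println(recovered == symbols)           # true
println(length(linear) == prod(sizes))  # true: 60 combined outcomes
```

The last line shows why the total number of outcomes is a product rather than a sum: every combination of sub-symbols occupies its own cell in the index array.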
3 changes: 3 additions & 0 deletions src/encoding_implementations/encoding_implementations.jl
@@ -2,3 +2,6 @@ include("fasthist.jl")
include("rectangular_binning.jl")
include("gaussian_cdf.jl")
include("ordinal_pattern.jl")
include("relative_mean_encoding.jl")
include("relative_first_difference_encoding.jl")
include("combination_encoding.jl")
108 changes: 94 additions & 14 deletions src/encoding_implementations/gaussian_cdf.jl
@@ -4,19 +4,26 @@

"""
GaussianCDFEncoding <: Encoding
GaussianCDFEncoding(; μ, σ, c::Int = 3)
GaussianCDFEncoding(m::Int = 1; μ, σ, c::Int = 3)
GaussianCDFEncoding(x::AbstractVector; μ, σ, c::Int = 3)

An encoding scheme that [`encode`](@ref)s a scalar value into one of the integers
An encoding scheme that [`encode`](@ref)s a scalar or vector `x` into one of the integers
`sᵢ ∈ [1, 2, …, c]` based on the normal cumulative distribution function (NCDF),
and [`decode`](@ref)s the `sᵢ` into subintervals of `[0, 1]` (with some loss of information).

The size of the input to be encoded must be known beforehand, and one must set
`m = length(x)`, where `x` is the input (`m = 1` for scalars, `m ≥ 2` for vectors).
Alternatively, provide the vector `x` to the constructor to infer `m` automatically.

Notice that the decoding step does not yield an element of any outcome space of the
estimators that use `GaussianCDFEncoding` internally, such as [`Dispersion`](@ref).
That is because these estimators additionally delay embed the encoded data.

## Description

`GaussianCDFEncoding` first maps an input point ``x`` (scalar) to a new real number
### Encoding/decoding scalars

`GaussianCDFEncoding` first maps an input scalar ``x`` to a new real number
``y \\in [0, 1]`` by using the normal cumulative distribution function (CDF) with the
given mean `μ` and standard deviation `σ`, according to the map

@@ -31,6 +38,20 @@

Because of the floor operation, some information is lost, so when used with
[`decode`](@ref), each decoded `sᵢ` is mapped to a *subinterval* of `[0, 1]`.
This subinterval is returned as a length-`1` `Vector{SVector}`.

### Encoding/decoding vectors

If `GaussianCDFEncoding` is used with a vector `x`, then each element of `x` is
encoded separately, resulting in a length-`m` sequence of integers (`m = length(x)`)
which may be treated as a `CartesianIndex`. The encoded symbol `s ∈ [1, 2, …, c^m]` is
then just the linear index corresponding to this cartesian index (similar to how
[`CombinationEncoding`](@ref) works).

When [`decode`](@ref)d, the integer symbol `s` is converted back into its `CartesianIndex`
representation, which is just a sequence of integers that refer to subdivisions
of the `[0, 1]` interval. The relevant subintervals are then returned as a length-`m`
`Vector{SVector}`.

## Examples

@@ -54,32 +75,91 @@
0.4
0.6
```

One can also encode the entire vector as an integer.

```jldoctest
julia> using ComplexityMeasures, Statistics

julia> x = [0.1, 0.4, 0.7, -2.1, 8.0];

julia> μ, σ = mean(x), std(x); encoding = GaussianCDFEncoding(x; μ, σ, c = 2)
GaussianCDFEncoding(m=5; c=2, μ=1.42, σ=3.840182287340016)

julia> symbol = encode(encoding, x)
17

julia> decode(encoding, symbol)
5-element Vector{SVector{1, Float64}}:
[0.0]
[0.0]
[0.0]
[0.0]
[0.5000000000000001]
```
"""
struct GaussianCDFEncoding{T} <: Encoding
struct GaussianCDFEncoding{m, T, L <: LinearIndices, C <: CartesianIndices, R} <: Encoding
m::Int
c::Int
σ::T
μ::T
# We require the input data, because we need σ and μ for encoding single values.
function GaussianCDFEncoding(; μ::T, σ::T, c::Int = 3) where T
new{T}(c, σ, μ)

# internal fields: LinearIndices/CartesianIndices for encodings/decodings. binencoder
# for discretizing the interval [0, 1]
linear_indices::L
cartesian_indices::C
binencoder::R # RectangularBinEncoding

# The input `m` restricts what length the input scalar/vector can be.
function GaussianCDFEncoding(m::Int = 1; μ::T, σ::T, c::Int = 3) where T
m >= 1 || throw(ArgumentError("m must be an integer ≥ 1. Got $m."))
ranges = tuple([1:c for i in 1:m]...)
cartesian_indices = CartesianIndices(ranges)
linear_indices = LinearIndices(ranges)
L = typeof(linear_indices)
C = typeof(cartesian_indices)
binencoder = RectangularBinEncoding(FixedRectangularBinning(0, 1, c + 1))
R = typeof(binencoder)
new{m, T, L, C, R}(m, c, σ, μ, linear_indices, cartesian_indices, binencoder)
end
end
GaussianCDFEncoding(x::AbstractVector; kwargs...) = GaussianCDFEncoding(length(x); kwargs...)

function Base.show(io::IO, e::GaussianCDFEncoding{m, T, L, C}) where {m, T, L, C}
c, μ, σ = e.c, e.μ, e.σ
print(io, "GaussianCDFEncoding(m=$m; c=$c, μ=$μ, σ=$σ)")

Codecov check warning: added lines #L128-L130 of src/encoding_implementations/gaussian_cdf.jl were not covered by tests.
end

total_outcomes(encoding::GaussianCDFEncoding) = encoding.c
function total_outcomes(encoding::GaussianCDFEncoding{m}) where m
c = encoding.c
return prod(c for i = 1:m)
end

gaussian(x, μ, σ) = exp((-(x - μ)^2)/(2σ^2))

function encode(encoding::GaussianCDFEncoding, x::Real)
(; c, σ, μ) = encoding
σ, μ = encoding.σ, encoding.μ
# We only need the value of the integral (not the error), so
# index first element returned from quadgk
k = 1/(σ*sqrt(2π))
y = k * first(quadgk(x -> gaussian(x, μ, σ), -Inf, x))
return floor(Int, y / (1 / c)) + 1
# The integral estimate sometimes returns a value slightly above 1.0, so we need
# to adjust to be sure that all points fall within the FixedRectangularBinning.
y_corrected = min(y, 1.0)
return encode(encoding.binencoder, y_corrected)
end

function decode(encoding::GaussianCDFEncoding, i::Int)
c = encoding.c
lower_interval_bound = (i - 1)/(c)
return SVector(lower_interval_bound, prevfloat(lower_interval_bound + 1/c))
function encode(encoding::GaussianCDFEncoding{m}, x::AbstractVector) where m
L = length(x)
if L != m
throw(ArgumentError("length(`x`) must equal `m` (got length(x)=$L, m=$m)"))
end
symbols = encode.(Ref(encoding), x)
ω::Int = encoding.linear_indices[symbols...]
return ω
end

function decode(encoding::GaussianCDFEncoding, ω::Int)
cidxs = Tuple(encoding.cartesian_indices[ω])
return [decode(encoding.binencoder, cᵢ) for cᵢ in cidxs]
end
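
The vector encoding path above (per-element CDF value, binning of `[0, 1]`, then a linear index) can be reproduced in a package-free sketch. Several pieces here are assumptions made for illustration: `normal_cdf` uses a composite Simpson's rule instead of the `quadgk` call in the PR, `bin` is a hand-written replacement for the internal `RectangularBinEncoding`, and assigning `y == 1.0` to the last bin is a choice made here, not something the PR states. The input values are those of the docstring example.

```julia
# Normal CDF via composite Simpson's rule (stand-in for `quadgk`; n must be even).
function normal_cdf(x, μ, σ; n = 2000)
    a = μ - 10σ                          # mass below μ - 10σ is negligible
    x <= a && return 0.0
    h = (x - a) / n
    f(t) = exp(-(t - μ)^2 / (2σ^2)) / (σ * sqrt(2π))
    s = f(a) + f(x)
    for i in 1:(n - 1)
        s += (isodd(i) ? 4 : 2) * f(a + i * h)
    end
    return min(s * h / 3, 1.0)           # clamp, as the PR does with `min(y, 1.0)`
end

# Bin y ∈ [0, 1] into 1, …, c; y == 1.0 goes to the last bin (an assumption).
bin(y, c) = min(floor(Int, y * c) + 1, c)

x = [0.1, 0.4, 0.7, -2.1, 8.0]
μ, σ, c = 1.42, 3.840182287340016, 2     # values from the docstring example
m = length(x)

ranges = ntuple(_ -> 1:c, m)
symbols = [bin(normal_cdf(xi, μ, σ), c) for xi in x]
s = LinearIndices(ranges)[symbols...]
left_edges = [(i - 1) / c for i in Tuple(CartesianIndices(ranges)[s])]

println(symbols)     # [1, 1, 1, 1, 2]
println(s)           # 17, the symbol shown in the docstring example
println(left_edges)  # [0.0, 0.0, 0.0, 0.0, 0.5]
```

With `c = 2` and `m = 5` there are `2^5 = 32` possible symbols, which is why the encoded value 17 exceeds `c`. The docstring output shows the last left edge as `0.5000000000000001`; that small difference comes from the package's binning, which this sketch does not replicate.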
4 changes: 4 additions & 0 deletions src/encoding_implementations/ordinal_pattern.jl
@@ -60,6 +60,10 @@
return OrdinalPatternEncoding{m, F}(zero(MVector{m, Int}), lt)
end

function Base.show(io::IO, e::OrdinalPatternEncoding{M}) where {M}
print(io, "OrdinalPatternEncoding{$M}(lt = $(e.lt))")

Codecov check warning: added lines #L63-L64 of src/encoding_implementations/ordinal_pattern.jl were not covered by tests.
end

# So that SymbolicPerm stuff fallback here
total_outcomes(::OrdinalPatternEncoding{m}) where {m} = factorial(m)
outcome_space(::OrdinalPatternEncoding{m}) where {m} = permutations(1:m) |> collect