
More encodings #299

Merged
merged 59 commits into main from amplitude_encoding on Aug 25, 2023

Conversation

kahaaga (Member) commented Aug 22, 2023

Summary

After this PR, we have the following Encodings:

  • OrdinalPatternEncoding
  • GaussianCDFEncoding (extended to state vectors in this PR, was just for integers before)
  • RectangularBinEncoding
  • AmplitudeEncoding (new)
  • FirstDifferenceEncoding (new)

Any combination of two or more of these five encodings is also a valid encoding (via the new CombinationEncoding).
Thus, we officially support sum(binomial(5, i) for i = 1:5) = 31 different encodings (of combination length 1 to 5) for explicitly discretising a StateSpaceSet into a Vector{Int}.
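As a quick sanity check of the combination count, here is a minimal sketch (in Python for concreteness; the package itself is Julia):

```python
from math import comb

# Number of non-empty subsets of the 5 encodings,
# i.e. combinations of length 1 through 5.
n_combinations = sum(comb(5, i) for i in range(1, 6))
print(n_combinations)  # → 31
```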

The docs are here.

Implications upstream

Since the new CombinationEncoding gives us 31 different ways of discretising, in CausalityTools.jl we will have 31 plugin estimators of KL-divergence, 31 plugin estimators of discrete mutual information, 31 plugin estimators of transfer entropy, etc. This doesn't even include more advanced estimation schemes that go beyond plugin estimation.

New encodings

For the OrdinalPatterns outcome space, we explicitly use encode to discretise. However, for AmplitudeAwareOrdinalPatterns we don't, because it was intended in the original paper as a correction to the raw counts of outcomes.

Here, I introduce two new encodings, AmplitudeEncoding and FirstDifferenceEncoding, which use average absolute amplitude information and average first-difference magnitude information, respectively (relative to some user-defined min/max ranges), to encode state vectors into integer symbols. Since the encodings accept min/max values as mandatory inputs, we can rescale the average amplitude or first-difference information to the interval [0, 1], and then discretise this interval. The average amplitude/first difference falls into one of these bins, and the encoded symbol is the index of that bin (enumeration starts from 1). This is similar to how GaussianCDFEncoding is implemented, and I use RectangularBinEncoding internally for the discretisation of the binned unit interval.
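To illustrate the idea (a minimal sketch in Python for concreteness, not the package's actual implementation; the function name and the clamping behaviour for out-of-range inputs are assumptions):

```python
def encode_amplitude(x, minval, maxval, c):
    """Map a state vector to an integer in 1..c based on its
    average absolute amplitude, rescaled by [minval, maxval]."""
    avg_amp = sum(abs(v) for v in x) / len(x)
    # Rescale to the unit interval, clamping out-of-range inputs.
    y = (avg_amp - minval) / (maxval - minval)
    y = min(max(y, 0.0), 1.0)
    # Bin index in 1..c; the right edge y == 1.0 falls in bin c.
    return min(int(y * c) + 1, c)

print(encode_amplitude([0.1, 0.5, 0.4, 0.2], 0.0, 1.0, 4))  # → 2
```

The first-difference variant is analogous, with the mean magnitude of successive differences in place of the mean absolute amplitude.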

Roughly speaking, the original amplitude-aware permutation entropy uses a linear combination of these two quantities (although not discretised) as a correction to the counts. Having these encodings as Encoding instances allows a user to explicitly discretise data using amplitude/first difference information. This is directly relevant upstream, because I'm using it to discretise data for multi-variable measures such as transfer entropy.

A similar encoding can be defined based on the WeightedOrdinalPatterns too, but isn't included in this PR.

Combination encodings

These new encodings are useful by themselves, but even more powerful when applied together, and they may also be combined with the existing encodings. To treat this formally, I have defined a CombinationEncoding, which accepts multiple encodings and constructs a combined encoding by treating the integer symbols (each an integer in the range 1:n, where n varies between encodings) as a Cartesian index.

For example, if we use three different encodings, and encode the vector x = [0.1, 0.5, 0.4, 0.2], then we could get the integers s_a = 2 for AmplitudeEncoding, s_f = 1 for FirstDifferenceEncoding and s_o = 4 for OrdinalPatternEncoding. The combined encoding is then (2, 1, 4). Because the total number of outcomes is always known for each encoding, we can map this combined code to a unique integer, and vice versa, by using LinearIndices and CartesianIndices.
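The round trip between the combined symbol and a unique integer can be sketched as follows (in Python for concreteness; the per-encoding outcome counts below are hypothetical, and the mixed-radix arithmetic mimics the 1-based, column-major convention of Julia's LinearIndices/CartesianIndices):

```python
def to_linear(symbols, sizes):
    """Map a tuple of 1-based symbols to a unique 1-based integer,
    column-major, as Julia's LinearIndices does."""
    idx, stride = 0, 1
    for s, n in zip(symbols, sizes):
        idx += (s - 1) * stride
        stride *= n
    return idx + 1

def from_linear(idx, sizes):
    """Inverse of to_linear: recover the tuple of 1-based symbols."""
    idx -= 1
    symbols = []
    for n in sizes:
        symbols.append(idx % n + 1)
        idx //= n
    return tuple(symbols)

sizes = (3, 2, 24)    # hypothetical outcome counts for the three encodings
combined = (2, 1, 4)  # (s_a, s_f, s_o) from the example above
i = to_linear(combined, sizes)
print(i, from_linear(i, sizes))  # → 20 (2, 1, 4)
```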

We can thus apply any combination of encodings to quantify different properties of state vectors. The more encodings that are combined, the larger the number of possible outcomes (integer symbols). The total number of possible outcomes is simply the product of the number of possible outcomes for each input encoding.

Other changes

  • Rearranged the tests for encodings: tests for each Encoding are now in a separate file, and utility tests have been moved to a file of their own.
  • Implemented encode(::GaussianCDFEncoding, x::AbstractVector{<:Real}), so that GaussianCDFEncoding is compatible with CombinationEncoding.
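A rough sketch of Gaussian-CDF binning (in Python for concreteness; the helper name is hypothetical, and applying the scalar rule element-wise is just one plausible reading of the vector extension, not the package's exact implementation):

```python
from math import erf, sqrt

def gaussian_cdf_encode(x, mu, sigma, c):
    """Encode a scalar into 1..c via the Gaussian CDF of (x - mu)/sigma."""
    y = 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))))
    return min(int(y * c) + 1, c)

# Element-wise encoding of a state vector.
x = [-1.0, 0.0, 2.0]
print([gaussian_cdf_encode(v, 0.0, 1.0, 4) for v in x])  # → [1, 3, 4]
```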

This PR was originally part of CausalityTools, and the encodings here are used in an upcoming paper of mine. But since we now define a public API for encodings, and these encodings are completely generic, we should offer everything here.

@kahaaga added the tests, improvement, cleanup and encodings labels on Aug 22, 2023
@kahaaga kahaaga requested a review from Datseris August 22, 2023 11:37
codecov bot commented Aug 22, 2023

Codecov Report

Merging #299 (58802bf) into main (dfc973b) will increase coverage by 0.80%.
The diff coverage is 97.75%.

@@            Coverage Diff             @@
##             main     #299      +/-   ##
==========================================
+ Coverage   86.41%   87.22%   +0.80%     
==========================================
  Files          72       75       +3     
  Lines        1767     1847      +80     
==========================================
+ Hits         1527     1611      +84     
+ Misses        240      236       -4     
Files Changed Coverage Δ
...encoding_implementations/relative_mean_encoding.jl 94.44% <94.44%> (ø)
...lementations/relative_first_difference_encoding.jl 95.83% <95.83%> (ø)
...c/encoding_implementations/combination_encoding.jl 100.00% <100.00%> (ø)
src/encoding_implementations/gaussian_cdf.jl 100.00% <100.00%> (+38.46%) ⬆️

... and 1 file with indirect coverage changes


Datseris (Member) left a comment:

We need to discuss a bit the usefulness of adding things that are not used in the "main" API of outcome spaces, and have not been used in the literature explicitly either.

export AmplitudeEncoding

"""
AmplitudeEncoding <: Encoding
Datseris (Member) commented:

So to get things straight, the AmplitudeAwareOrdinalPattern outcome space is the combination encoding of OrdinalPatternEncoding and AmplitudeEncoding?

Unfortunately the name AmplitudeEncoding is too simple, while this encoding is really complex, and its name doesn't convey what it does. With that name, my mind went to a value histogram.

More importantly, is this encoding useful in the first place? It hasn't been used in the literature as far as I understand. We should be mindful not to add too many things to the code base that may have no use on their own.

Datseris (Member) commented:
As far as I can tell, this PR does not alter any of the probabilities function calls. So I have to wonder: should we first do some research showing that all this extra functionality is useful in some way?

Datseris (Member) commented:
I just don't see the reason for existence of an encoding without a corresponding outcome space to use it.

kahaaga (Member, author) commented:

So to get things straight, the AmplitudeAwareOrdinalPattern outcome space is the combination encoding of OrdinalPatternEncoding and AmplitudeEncoding?

When used with probabilities, AmplitudeAwareOrdinalPattern is (roughly) the combination of OrdinalPatternEncoding with a linear combination of AmplitudeEncoding and FirstDifferenceEncoding dictating the weights.

I just don't see the reason for existence of an encoding without a corresponding outcome space to use it.

As stated at the bottom of the initial PR comment, the reason is to have complete flexibility in how to discretize data in upstream functions. I am using the new encodings in an upcoming paper to develop information transfer algorithms that improve on the "naive", purely ordinal-pattern-based ones.

They are not yet in any papers, because nobody has used them in this way yet.

@kahaaga kahaaga requested a review from Datseris August 24, 2023 12:21
Datseris (Member):

so this is good to go?

Datseris (Member):

or, good for a review?

Datseris (Member):

yes, I'll try to do it asap.


function decode(encoding::CombinationEncoding, ω::Int)
    cidx = encoding.cartesian_indices[ω]
    return [decode(e, cidx[i]) for (i, e) in enumerate(encoding.encodings)]
end
Datseris (Member) commented Aug 24, 2023:
map only returns a vector if the collection is a vector. If it's a tuple, it returns a tuple and doesn't allocate anything:

julia> map(cos, (1,2,3))
(0.5403023058681398, -0.4161468365471424, -0.9899924966004454)

so encodings must be a tuple.


kahaaga commented Aug 25, 2023

@kahaaga you haven't addressed this comment. You should make this change and then merge. You should also create the default constructor GaussianCDFEncoding(; μ ...) that assumes m=1.

I'm not finished with addressing the reviews yet. Working on it as we speak 👍

(If you got a notification for a PR review, it should have been for #302).

@kahaaga kahaaga merged commit 231ae8c into main Aug 25, 2023
2 checks passed
@kahaaga kahaaga deleted the amplitude_encoding branch August 25, 2023 10:19