
More encodings #299

Merged
merged 59 commits into main from amplitude_encoding on Aug 25, 2023

Conversation

kahaaga (Member) commented Aug 22, 2023

Summary

After this PR, we have the following Encodings:

  • OrdinalPatternEncoding
  • GaussianCDFEncoding (extended to state vectors in this PR, was just for integers before)
  • RectangularBinEncoding
  • AmplitudeEncoding (new)
  • FirstDifferenceEncoding (new)

Any combination of two or more of these five encodings is also a valid encoding (via the new CombinationEncoding).
Thus, we officially support sum(binomial(5, i) for i = 1:5) = 31 different encodings (of combination length 1 to 5) for explicitly discretising a StateSpaceSet into a Vector{Int}.
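As a quick sanity check of the combination count, here is a minimal sketch (in Python for concreteness; the package itself is Julia):

```python
from math import comb

# Number of non-empty subsets of the 5 encodings,
# i.e. combinations of length 1 through 5.
n_combinations = sum(comb(5, i) for i in range(1, 6))
print(n_combinations)  # → 31
```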

The docs are here.

Implications upstream

Since the new CombinationEncoding gives us 31 different ways of discretising, in CausalityTools.jl we will have 31 plugin estimators of KL-divergence, 31 plugin estimators of discrete mutual information, 31 plugin estimators of transfer entropy, etc. This doesn't even include more advanced estimation schemes that go beyond plugin estimation.

New encodings

For the OrdinalPatterns outcome space, we explicitly use encode to discretise. However, for AmplitudeAwareOrdinalPatterns we don't, because it was intended in the original paper as a correction to the raw counts of outcomes.

Here, I introduce two new encodings, AmplitudeEncoding and FirstDifferenceEncoding, which use average absolute amplitude information and average first-difference magnitude information, respectively (relative to some user-defined min/max ranges), to encode state vectors into integer symbols. Since the encodings accept min/max values as mandatory inputs, we can rescale the average amplitude or first-difference information to the interval [0, 1], and then discretise this interval. The average amplitude/first difference falls into one of these bins, and the encoded symbol is the index of that bin (enumeration starts from 1). This is similar to how GaussianCDFEncoding is implemented, and I use RectangularBinEncoding internally for the discretisation of the binned unit interval.
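To illustrate the idea (a minimal sketch in Python for concreteness, not the package's actual implementation; the function name and the clamping behaviour for out-of-range inputs are assumptions):

```python
def encode_amplitude(x, minval, maxval, c):
    """Map a state vector to an integer in 1..c based on its
    average absolute amplitude, rescaled by [minval, maxval]."""
    avg_amp = sum(abs(v) for v in x) / len(x)
    # Rescale to the unit interval, clamping out-of-range inputs.
    y = (avg_amp - minval) / (maxval - minval)
    y = min(max(y, 0.0), 1.0)
    # Bin index in 1..c; the right edge y == 1.0 falls in bin c.
    return min(int(y * c) + 1, c)

print(encode_amplitude([0.1, 0.5, 0.4, 0.2], 0.0, 1.0, 4))  # → 2
```

The first-difference variant is analogous, with the mean magnitude of successive differences in place of the mean absolute amplitude.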

Roughly speaking, the original amplitude-aware permutation entropy uses a linear combination of these two quantities (although not discretised) as a correction to the counts. Having these encodings as Encoding instances allows a user to explicitly discretise data using amplitude/first difference information. This is directly relevant upstream, because I'm using it to discretise data for multi-variable measures such as transfer entropy.

A similar encoding can be defined based on the WeightedOrdinalPatterns too, but isn't included in this PR.

Combination encodings

These new encodings are useful by themselves, but even more powerful when applied together, and they may also be combined with the existing encodings. To treat this formally, I have defined a CombinationEncoding, which accepts multiple encodings and constructs a combined encoding by treating the integer symbols (each an integer in the range 1:n, where n varies between encodings) as a Cartesian index.

For example, if we use three different encodings, and encode the vector x = [0.1, 0.5, 0.4, 0.2], then we could get the integers s_a = 2 for AmplitudeEncoding, s_f = 1 for FirstDifferenceEncoding and s_o = 4 for OrdinalPatternEncoding. The combined encoding is then (2, 1, 4). Because the total number of outcomes is always known for each encoding, we can map this combined code to a unique integer, and vice versa, by using LinearIndices and CartesianIndices.
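The round trip between the combined symbol and a unique integer can be sketched as follows (in Python for concreteness; the per-encoding outcome counts below are hypothetical, and the mixed-radix arithmetic mimics the 1-based, column-major convention of Julia's LinearIndices/CartesianIndices):

```python
def to_linear(symbols, sizes):
    """Map a tuple of 1-based symbols to a unique 1-based integer,
    column-major, as Julia's LinearIndices does."""
    idx, stride = 0, 1
    for s, n in zip(symbols, sizes):
        idx += (s - 1) * stride
        stride *= n
    return idx + 1

def from_linear(idx, sizes):
    """Inverse of to_linear: recover the tuple of 1-based symbols."""
    idx -= 1
    symbols = []
    for n in sizes:
        symbols.append(idx % n + 1)
        idx //= n
    return tuple(symbols)

sizes = (3, 2, 24)    # hypothetical outcome counts for the three encodings
combined = (2, 1, 4)  # (s_a, s_f, s_o) from the example above
i = to_linear(combined, sizes)
print(i, from_linear(i, sizes))  # → 20 (2, 1, 4)
```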

We can thus apply any combination of encodings to quantify different properties of state vectors. The more encodings that are combined, the larger the number of possible outcomes (integer symbols). The total number of possible outcomes is simply the product of the number of possible outcomes for each input encoding.

Other changes

  • Rearranged the tests for encodings: tests for each Encoding are now in a separate file, and utility tests have been moved to a file of their own.
  • Implemented encode(::GaussianCDFEncoding, x::AbstractVector{<:Real}), so that GaussianCDFEncoding is compatible with CombinationEncoding.
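A rough sketch of Gaussian-CDF binning (in Python for concreteness; the helper name is hypothetical, and applying the scalar rule element-wise is just one plausible reading of the vector extension, not the package's exact implementation):

```python
from math import erf, sqrt

def gaussian_cdf_encode(x, mu, sigma, c):
    """Encode a scalar into 1..c via the Gaussian CDF of (x - mu)/sigma."""
    y = 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))))
    return min(int(y * c) + 1, c)

# Element-wise encoding of a state vector.
x = [-1.0, 0.0, 2.0]
print([gaussian_cdf_encode(v, 0.0, 1.0, 4) for v in x])  # → [1, 3, 4]
```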

This PR was originally part of CausalityTools, and the encodings here are used in an upcoming paper of mine. But since we now define a public API for encodings, and these encodings are completely generic, we should offer everything here.

@kahaaga added the tests, improvement, cleanup and encodings labels on Aug 22, 2023
@kahaaga kahaaga requested a review from Datseris August 22, 2023 11:37
codecov bot commented Aug 22, 2023

Codecov Report

Merging #299 (58802bf) into main (dfc973b) will increase coverage by 0.80%.
The diff coverage is 97.75%.

@@            Coverage Diff             @@
##             main     #299      +/-   ##
==========================================
+ Coverage   86.41%   87.22%   +0.80%     
==========================================
  Files          72       75       +3     
  Lines        1767     1847      +80     
==========================================
+ Hits         1527     1611      +84     
+ Misses        240      236       -4     
Files Changed Coverage Δ
...encoding_implementations/relative_mean_encoding.jl 94.44% <94.44%> (ø)
...lementations/relative_first_difference_encoding.jl 95.83% <95.83%> (ø)
...c/encoding_implementations/combination_encoding.jl 100.00% <100.00%> (ø)
src/encoding_implementations/gaussian_cdf.jl 100.00% <100.00%> (+38.46%) ⬆️

... and 1 file with indirect coverage changes


Datseris (Member) left a comment:

We need to discuss a bit the usefulness of adding things that are not used in the "main" API of outcome spaces, and have not been used in the literature explicitly either.

export AmplitudeEncoding

"""
AmplitudeEncoding <: Encoding
Datseris (Member) commented:

So to get things straight, the AmplitudeAwareOrdinalPattern outcome space is the combination encoding of OrdinalPatternEncoding and AmplitudeEncoding?

Unfortunately the name AmplitudeEncoding is too simple, while this encoding is really complex, and its name doesn't convey what it does. With that name, my mind went to a value histogram.

More importantly, is this encoding useful in the first place? It hasn't been used in the literature as far as I understand. We should be mindful not to add too many things to the code base that may have no use on their own.

Datseris (Member) commented:
As far as I can tell, this PR does not alter any of the probabilities function calls. So I have to wonder: should we first do some research showing that all this extra functionality is useful in some way?

Datseris (Member) commented:
I just don't see the reason for existence of an encoding without a corresponding outcome space to use it.

kahaaga (Member, author) commented:

So to get things straight, the AmplitudeAwareOrdinalPattern outcome space is the combination encoding of OrdinalPatternEncoding and AmplitudeEncoding?

When used with probabilities, AmplitudeAwareOrdinalPattern is (roughly) the combination of OrdinalPatternEncoding with a linear combination of AmplitudeEncoding and FirstDifferenceEncoding dictating the weights.

I just don't see the reason for existence of an encoding without a corresponding outcome space to use it.

As stated at the bottom of the initial PR comment, the reason is to have complete flexibility in how to discretize data in upstream functions. I am using the new encodings in an upcoming paper to develop information transfer algorithms that improve on the "naive", purely ordinal-pattern-based ones.

They are not yet in any papers, because nobody has used them in this way yet.

@kahaaga kahaaga requested a review from Datseris August 24, 2023 12:21
Datseris (Member):

so this is good to go?

Datseris (Member):

or, good for a review?

Datseris (Member):

yes, I'll try to do it asap.


function decode(encoding::CombinationEncoding, ω::Int)
    cidx = encoding.cartesian_indices[ω]
    return [decode(e, cidx[i]) for (i, e) in enumerate(encoding.encodings)]
end
Datseris (Member) commented Aug 24, 2023:
map only returns a vector if the collection is a vector. If it's a tuple, it returns a tuple and doesn't allocate anything:

julia> map(cos, (1,2,3))
(0.5403023058681398, -0.4161468365471424, -0.9899924966004454)

so encodings must be a tuple.


kahaaga commented Aug 25, 2023

@kahaaga you haven't addressed this comment. You should make this change and then merge. You should also create the default constructor GaussianCDFEncoding(; μ ...) that assumes m=1.

I'm not finished with addressing the reviews yet. Working on it as we speak 👍

(If you got a notification for a PR review, it should have been for #302).

@kahaaga kahaaga merged commit 231ae8c into main Aug 25, 2023
2 checks passed
@kahaaga kahaaga deleted the amplitude_encoding branch August 25, 2023 10:19