Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rank-based outcome space #421

Open
kahaaga opened this issue Jul 31, 2024 · 4 comments
Open

Rank-based outcome space #421

kahaaga opened this issue Jul 31, 2024 · 4 comments
Labels
encodings Related to the Encodings API enhancement New feature or request that is non-breaking new outcome space

Comments

@kahaaga
Copy link
Member

kahaaga commented Jul 31, 2024

Describe the feature you'd like to have

While implementing the Chatterjee coefficient of correlation in Associations.jl, I needed to compute ranks of an input vector, according to some method. That made me think: why not have an encoding and an outcome space in ComplexityMeasures.jl that does the same? I imagine something like this:

The encoding takes as an input the input data, which allows a mapping encoding the raw values xi onto an index idx_i, and decode that index onto xi. If length(x) == length(rank(x)), then there are as many outcomes as there are values in x (so the outcome space is a bit useless for complexity quantification). However, in datasets with repetitions, we can have length(rank(x)) << length(x). Then each rank index is an outcome, and we meaningfully estimate counts/probabilities by counting how many input data points map to each rank index.

Cite scientific papers related to the feature/algorithm

I have no references atm. This may or may not have been done by someone before.

If possible, sketch out an implementation strategy

Relatively straight-forward: follow the dev docs on implementing the encoding and outcome space.

A few notes:

  • The encoding must store the raw data, so that we can both and encode/decode.
  • There are many possible rank methods. The user can control which method to use by applying keyword argument to the encoding/outcome space (probably most elegant with dispatch on a T <: RankType). StatsBase.jl implements a few such ranking methods we can use for inspiration. I also implemented a randomization-tie-breaking variant for the Chatterjee coefficient that we would use.
@kahaaga kahaaga added enhancement New feature or request that is non-breaking encodings Related to the Encodings API new outcome space labels Jul 31, 2024
@Datseris
Copy link
Member

What's the index...? I must misunderstood something because I can't imagine a scenario where the length of the indices is not the same as the length of the data.

@Datseris
Copy link
Member

The encoding must store the raw data, so that we can both and encode/decode.

This is free as Julia stores objects by reference.

@kahaaga
Copy link
Member Author

kahaaga commented Jul 31, 2024

What's the index...? I must misunderstood something because I can't imagine a scenario where the length of the indices is not the same as the length of the data.

Sorry, I typed before thinking. What I meant is:

If length(x) == length(unique(rank(x))), then there are as many outcomes as there are values in x (so the outcome space is a bit useless for complexity quantification). However, in datasets with repetitions, we can have length(unique(rank(x))) << length(x).

Does that clarify it?

@Datseris
Copy link
Member

right, yeah this makes sense.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
encodings Related to the Encodings API enhancement New feature or request that is non-breaking new outcome space
Projects
None yet
Development

No branches or pull requests

2 participants