Rank-based outcome space #421

kahaaga · 2024-07-31T06:44:40Z

Describe the feature you'd like to have

While implementing the Chatterjee coefficient of correlation in Associations.jl, I needed to compute ranks of an input vector, according to some method. That made me think: why not have an encoding and an outcome space in ComplexityMeasures.jl that does the same? I imagine something like this:

The encoding takes as an input the input data, which allows a mapping encoding the raw values xi onto an index idx_i, and decode that index onto xi. If length(x) == length(rank(x)), then there are as many outcomes as there are values in x (so the outcome space is a bit useless for complexity quantification). However, in datasets with repetitions, we can have length(rank(x)) << length(x). Then each rank index is an outcome, and we meaningfully estimate counts/probabilities by counting how many input data points map to each rank index.

Cite scientific papers related to the feature/algorithm

I have no references atm. This may or may not have been done by someone before.

If possible, sketch out an implementation strategy

Relatively straight-forward: follow the dev docs on implementing the encoding and outcome space.

A few notes:

The encoding must store the raw data, so that we can both and encode/decode.
There are many possible rank methods. The user can control which method to use by applying keyword argument to the encoding/outcome space (probably most elegant with dispatch on a T <: RankType). StatsBase.jl implements a few such ranking methods we can use for inspiration. I also implemented a randomization-tie-breaking variant for the Chatterjee coefficient that we would use.

The text was updated successfully, but these errors were encountered:

Datseris · 2024-07-31T06:49:44Z

What's the index...? I must misunderstood something because I can't imagine a scenario where the length of the indices is not the same as the length of the data.

Datseris · 2024-07-31T06:50:14Z

The encoding must store the raw data, so that we can both and encode/decode.

This is free as Julia stores objects by reference.

kahaaga · 2024-07-31T06:52:59Z

What's the index...? I must misunderstood something because I can't imagine a scenario where the length of the indices is not the same as the length of the data.

Sorry, I typed before thinking. What I meant is:

If length(x) == length(unique(rank(x))), then there are as many outcomes as there are values in x (so the outcome space is a bit useless for complexity quantification). However, in datasets with repetitions, we can have length(unique(rank(x))) << length(x).

Does that clarify it?

Datseris · 2024-07-31T07:22:11Z

right, yeah this makes sense.

kahaaga added enhancement New feature or request that is non-breaking encodings Related to the Encodings API new outcome space labels Jul 31, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Rank-based outcome space #421

Rank-based outcome space #421

kahaaga commented Jul 31, 2024

Datseris commented Jul 31, 2024

Datseris commented Jul 31, 2024

kahaaga commented Jul 31, 2024 •

edited

Loading

Datseris commented Jul 31, 2024

Rank-based outcome space #421

Rank-based outcome space #421

Comments

kahaaga commented Jul 31, 2024

Describe the feature you'd like to have

Cite scientific papers related to the feature/algorithm

If possible, sketch out an implementation strategy

Datseris commented Jul 31, 2024

Datseris commented Jul 31, 2024

kahaaga commented Jul 31, 2024 • edited Loading

Datseris commented Jul 31, 2024

kahaaga commented Jul 31, 2024 •

edited

Loading