Rank vector values #16

Open
wants to merge 4 commits into
base: master


@jcquarto jcquarto commented Nov 12, 2021

Ranking vector values

I often need to rank the values in a DF column (i.e. a Vector), especially when they represent ordered values such as a time series.

new methods for Vectors:

df[:a].rank() : returns a Vector of the same length as df[:a] holding each element's rank (1 for first, 2 for second, etc.). Tied elements share the same rank, and the ranks they use up are skipped for the elements below them.

So if the original vector is [20,50,42,11], calling rank() on it returns [3,1,2,4], because 20 is the 3rd highest value, 50 is the highest, and so on. An example of a tie is [42,50,42,20], which returns [2,1,2,4]: the two 42s use up both the 2nd and 3rd spots, so 20 is ranked 4th (this is a semi-standard way of dealing with ties when ranking).

Sometimes you want to rank in reverse order, so that lower values are "better". In this case use df[:a].rank(ascending=false)
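The ranking rule described above (the highest value gets rank 1, tied values share a rank and use up the following spots) can be sketched on a plain Ruby Array. This is only an illustration under stated assumptions, not the code in this PR: rank_values is a hypothetical helper name, it works on an Array rather than a Vector, and it takes a keyword argument where the text above writes rank(ascending=false).

```ruby
# Hypothetical helper, not the PR's implementation.
# An element's rank is 1 plus the number of elements that beat it,
# so tied elements share a rank and the next rank is skipped.
def rank_values(values, ascending: true)
  values.map do |v|
    better = values.count { |other| ascending ? other > v : other < v }
    better + 1
  end
end

p rank_values([20, 50, 42, 11])                   # => [3, 1, 2, 4]
p rank_values([42, 50, 42, 20])                   # => [2, 1, 2, 4]
p rank_values([20, 50, 42, 11], ascending: false) # => [2, 4, 3, 1]
```

Counting the elements that beat each value is quadratic in the vector length, which keeps the sketch short; a sort-based approach would suit long vectors better.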

more methods:

I also often want to know, "the last value in this time series is the best one in how many periods?" For example, "this week's sales of Widget X are the highest in 3 weeks!"

df[:a].best_in() : returns how many elements back from the last one you have to go to find an element ranked better than the last.

Returning to the original example: if the original vector is [20,50,42,11], calling best_in() on it returns 1, because the default ranking is ascending and the last value is the worst of the 4 elements, so it is only the best since... itself. But if the ranking is reversed so that lower values are better, the last value is the best of all of them and the method returns 4: "the best in 4 periods". For this case use best_in(ascending=false)
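The behaviour described for best_in() can be sketched as a backward walk from the last element. This is a minimal sketch, not the PR's code: best_in_periods is a hypothetical name, it works on a plain Array, and it assumes that a value equal to the last one does not count as better and that an empty input returns 0 (neither case is specified above).

```ruby
# Hypothetical helper, not the PR's implementation.
# Counts the last element plus every earlier element, walking backwards,
# until one is found that ranks better than the last.
def best_in_periods(values, ascending: true)
  return 0 if values.empty? # assumption: the text does not cover empty input

  last = values.last
  better = ->(other) { ascending ? other > last : other < last }
  earlier = values[0...-1].reverse
  earlier.take_while { |other| !better.call(other) }.size + 1
end

p best_in_periods([20, 50, 42, 11])                   # => 1
p best_in_periods([20, 50, 42, 11], ascending: false) # => 4
p best_in_periods([120, 95, 100, 110])                # => 3 ("highest in 3 weeks")
```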

As a mirror image, there is a method
df[:a].worst_in() : returns how many elements back from the last one you have to go to find an element ranked worse than the last.

This can be useful for red flags such as "this month's sales numbers are the worst in 5 months!"
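A mirror-image sketch for worst_in(), under the same caveats: worst_in_periods is a hypothetical name operating on a plain Array, ties are assumed not to count as worse, an empty input is assumed to return 0, and the results shown follow from the description by symmetry rather than from the PR's tests.

```ruby
# Hypothetical helper, not the PR's implementation.
# Counts the last element plus every earlier element, walking backwards,
# until one is found that ranks worse than the last.
def worst_in_periods(values, ascending: true)
  return 0 if values.empty? # assumption: the text does not cover empty input

  last = values.last
  worse = ->(other) { ascending ? other < last : other > last }
  earlier = values[0...-1].reverse
  earlier.take_while { |other| !worse.call(other) }.size + 1
end

p worst_in_periods([90, 80, 70, 85, 60])               # => 5 ("worst in 5 months")
p worst_in_periods([50, 90, 80, 60])                   # => 3
p worst_in_periods([20, 50, 42, 11], ascending: false) # => 1
```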

Also updated: the README.md file, with (hopefully) better written documentation than what I wrote above. There are also tests for all this.

comments on code

Since the Vector class converts nils to NaNs, it can be a bit problematic to deal with those when doing <=> sorting and comparisons. So for the rank() method my code converts the incoming vector to an array for the purposes of ranking and then converts back to a Vector for output. Both best_in and worst_in, however, return integers.
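To illustrate why NaNs are awkward here: in Ruby, comparing a Float NaN with <=> returns nil, so it cannot be ordered against other values. The sketch below shows that, plus one possible way to rank around NaNs on a plain Array. It is an assumption for illustration only; rank_skipping_nan is a hypothetical helper, and the PR does not state how its code treats NaN entries.

```ruby
# NaN is not comparable: <=> gives nil instead of -1, 0 or 1.
p(Float::NAN <=> 1.0) # => nil

# Hypothetical helper, not the PR's implementation.
# Ranks only the comparable values and leaves NaN entries as NaN.
def rank_skipping_nan(values)
  nan = ->(v) { v.respond_to?(:nan?) && v.nan? }
  comparable = values.reject(&nan)
  values.map do |v|
    next v if nan.call(v)

    comparable.count { |other| other > v } + 1
  end
end

ranks = rank_skipping_nan([20.0, Float::NAN, 42.0])
p ranks[0] # => 2
p ranks[2] # => 1
```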
