Ranking vector values
I often need to rank the values in a DF column (i.e. a Vector), especially when they represent ordered values, such as a time series.
new methods for Vectors:
df[:a].rank()
: returns a Vector of the same length as df[:a]
with each element's rank (1 for first, 2 for second, etc.). Tied values share the same rank, and the ranks of the values below them are pushed down accordingly. So if the original vector is like [20,50,42,11], calling
rank()
on it will return a vector like [3,1,2,4], because 20 is the 3rd highest value, 50 is the highest value, etc. An example of a tie would be [42,50,42,20] returning a ranking of [2,1,2,4]: the two 42s share rank 2 and use up both the 2nd and 3rd spots, so the next value is ranked 4 (this is the common "standard competition" way of dealing with ties when ranking).
Sometimes you want to rank in reverse order, so that lower values are "better". In this case use
df[:a].rank(ascending=false)
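The ranking rule described above can be sketched in plain Ruby on an Array. This is only an illustration under assumptions, not the code in this PR: `rank_values` is a hypothetical helper name, and the `ascending:` keyword argument merely mirrors the option shown above.

```ruby
# Hypothetical helper, not the PR's implementation: competition ranking on an Array.
# ascending: true follows the default described above (highest value gets rank 1);
# ascending: false makes the lowest value rank 1.
def rank_values(values, ascending: true)
  values.map do |v|
    # Rank = 1 + number of elements strictly better than v, so ties share a rank
    # and the ranks after a tie skip ahead.
    better = values.count { |other| ascending ? other > v : other < v }
    better + 1
  end
end

p rank_values([20, 50, 42, 11])                    # [3, 1, 2, 4]
p rank_values([42, 50, 42, 20])                    # [2, 1, 2, 4]
p rank_values([20, 50, 42, 11], ascending: false)  # [2, 4, 3, 1]
```

Counting strictly better elements is quadratic in the vector length, which is fine for a sketch; a sort-based version would scale better.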
more methods:
I also often want to know, given the last value in a time series, over how many periods it is the best one. For example, "this week's sales of Widget X are the highest in 3 weeks!".
df[:a].best_in()
: returns the number of elements one has to go back from the last one to find an element ranked better than the last.
Returning to the original example: if the original vector is like [20,50,42,11], calling
best_in()
on it returns 1, because the default ranking is ascending (higher is better) and the last value is the worst of the 4 elements: it is only the best since ... itself. But if the ranking is in descending order (lower is better), then the last value is the best of all of them, so it returns 4: "the best in 4 periods". For this case use best_in(ascending=false)
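A minimal sketch of this lookback on a plain Ruby Array is shown here. The name `best_in_periods` and the `ascending:` keyword are hypothetical, and the sketch assumes that an earlier value equal to the last one does not count as "better"; the PR's own tie handling may differ.

```ruby
# Hypothetical helper, not the PR's implementation.
# Returns how many periods back one must go from the last value to find a better one.
def best_in_periods(values, ascending: true)
  last = values.last
  # Walk backwards, starting from the element just before the last one.
  values[0...-1].reverse.each_with_index do |v, i|
    better = ascending ? v > last : v < last
    return i + 1 if better
  end
  # No earlier element is better: the last value is the best in the whole series.
  values.length
end

p best_in_periods([20, 50, 42, 11])                    # 1
p best_in_periods([20, 50, 42, 11], ascending: false)  # 4
p best_in_periods([90, 40, 50, 60])                    # 3 ("the highest in 3 weeks")
```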
As a mirror image, there is a method
df[:a].worst_in()
: returns the number of elements one has to go back from the last one to find an element ranked worse than the last.
This can be useful for red flags such as "this month's sales numbers are the worst in 5 months!"
Also updated: the README.md file, with (hopefully) better written documentation than what I wrote above. There are also tests for all this.
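The mirror-image lookback can be sketched the same way on a plain Ruby Array. As before, `worst_in_periods` and the `ascending:` keyword are hypothetical names, and ties are assumed not to count as "worse".

```ruby
# Hypothetical helper, not the PR's implementation.
# Returns how many periods back one must go from the last value to find a worse one.
def worst_in_periods(values, ascending: true)
  last = values.last
  values[0...-1].reverse.each_with_index do |v, i|
    # With the default (higher is better), "worse" means a lower value.
    worse = ascending ? v < last : v > last
    return i + 1 if worse
  end
  # No earlier element is worse: the last value is the worst in the whole series.
  values.length
end

p worst_in_periods([20, 50, 42, 11])          # 4
p worst_in_periods([10, 30, 40, 35, 50, 20])  # 5 ("the worst in 5 months")
```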
comments on code
Since the Vector class converts nils to NaNs, those values can be problematic when sorting and comparing with <=>. So for the
rank()
method. However bothbest_in
andworst_in
return integers
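To illustrate why NaNs get in the way of <=> comparisons, here is a small plain-Ruby sketch. The helper `rank_skipping_nan` is hypothetical, and leaving NaN entries unranked (nil) is an assumption made for illustration; it is not necessarily how this PR treats them.

```ruby
# NaN does not compare with anything, so <=> returns nil instead of -1, 0 or 1.
p(Float::NAN <=> 1.0)  # nil

# Hypothetical helper: rank only the non-NaN values (highest gets rank 1).
# NaN entries get a nil rank, which is an assumption of this sketch.
def rank_skipping_nan(values)
  values.map do |v|
    next nil if v.respond_to?(:nan?) && v.nan?
    # NaN > v is always false, so NaN entries never count as "better".
    values.count { |other| other > v } + 1
  end
end

p rank_skipping_nan([20.0, Float::NAN, 42.0, 11.0])  # [2, nil, 1, 3]
```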