-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
implement median aggregate #2411
Conversation
a42c260
to
386ee70
Compare
this is ready for a look now. |
90047e4
to
7cf2247
Compare
@neonstalwart would you mind rebasing this? |
no problem. first thing tomorrow |
7cf2247
to
5314d4a
Compare
@toddboom rebased. one thing i wondered was if |
} | ||
} | ||
|
||
func getSortedRange(data []float64, start int, count int) []float64 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While these functions are unexported, they are not trivial. Would doc strings help?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what's your convention/requirements? i'll try to add something if you like.
i just tried to make the names descriptive since i think that things like
// MapMin collects the values to pass to the reducer
are kind of redundant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A doc string introducing getSortedRange
would be useful -- why it exists, and its advantages over the standard library. Basically answering the question I raised.
Thanks @neonstalwart -- tests look good, results make sense. I would like to know why you just didn't use the standard library to sort the data before selecting the mean. I might be missing something about median, so please let me know. |
the partitioning and discarding is O(N) in the average case compared to O(NlgN) for the library sort. i tried to make the partitioning and discarding generic enough that they could be used to do things like get the largest/smallest N elements without sorting the whole list. assuming that your data set is large, sorting the whole thing is more work than needed when you just want a small subset of the sorted set. |
the last piece of feedback remaining to be addressed is to add unit tests for |
Great, thanks @neonstalwart -- looking forward to it. |
bbad2ae
to
c47f803
Compare
@otoolep i added a few tests for in the process of adding tests i thought i would add some benchmarks to compare with the built-in sort and found that i had missed the mark by a lot (about 3 times slower) due to poor memory management. i was able to get closer to where it should be by making some tweaks and on my machine i'm now seeing |
Nice. It seems like you used the standard Go benchmark approach to profiling the code, correct? Can you show us the full output? |
@otoolep here's the full output of another test run - the tests are included in this PR > go test -v -bench=BenchmarkGetSortedRange -run=XXX ./influxql
PASS
BenchmarkGetSortedRangeByPivot 300000 5822 ns/op
BenchmarkGetSortedRangeBySort 100000 10717 ns/op
ok github.com/influxdb/influxdb/influxql 3.112s |
Great -- thanks @neonstalwart. I will take 1 final look at this, and then merge. Thanks again for the thorough job. |
@pauldix -- you want to take a quick look? |
refs #1824
this isn't quite working yet but if anyone wants to provide early feedback about the approach and/or help bring it over the line, that would be great!
i've included
getSortedRange
as a building block for some other aggregations. it partitions the input until we have a range of values that we are interested in and then sorts just that sub-range. the partitioning is an attempt at avoid sorting the whole series. the partitioning to find the range should be O(n) in the average case.