implement median aggregate #2411

Merged: 5 commits, May 6, 2015

neonstalwart
Contributor

refs #1824

this isn't quite working yet but if anyone wants to provide early feedback about the approach and/or help bring it over the line, that would be great!

i've included getSortedRange as a building block for some other aggregations. it partitions the input until we have the range of values that we are interested in and then sorts just that sub-range. the partitioning is an attempt to avoid sorting the whole series. finding the range by partitioning should be O(n) in the average case.

@neonstalwart
Contributor Author

this is ready for a look now.

@neonstalwart neonstalwart force-pushed the median-aggregate branch 2 times, most recently from 90047e4 to 7cf2247 Compare April 24, 2015 22:27
@toddboom
Contributor

@neonstalwart would you mind rebasing this?

@neonstalwart
Contributor Author

no problem. first thing tomorrow

@neonstalwart
Contributor Author

@toddboom rebased. one thing i wondered was if MapStddev should get a more generic name but i couldn't think of something good off the top of my head - MapFloat64 just doesn't seem right.

}
}

func getSortedRange(data []float64, start int, count int) []float64 {
Contributor

While these functions are unexported, they are not trivial. Would doc strings help?

Contributor Author

what's your convention/requirements? i'll try to add something if you like.

i just tried to make the names descriptive since i think that things like

// MapMin collects the values to pass to the reducer

are kind of redundant.

Contributor

A doc string introducing getSortedRange would be useful -- why it exists, and its advantages over the standard library. Basically answering the question I raised.

@otoolep
Contributor

otoolep commented May 1, 2015

Thanks @neonstalwart -- tests look good, results make sense.

I would like to know why you didn't just use the standard library to sort the data before selecting the median. I might be missing something about median, so please let me know.

@neonstalwart
Contributor Author

the partitioning and discarding is O(N) in the average case compared to O(N log N) for the library sort.

i tried to make the partitioning and discarding generic enough that they could be used to do things like get the largest/smallest N elements without sorting the whole list. assuming that your data set is large, sorting the whole thing is more work than needed when you just want a small subset of the sorted set.
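
For the median specifically, only the one or two middle positions of the sorted data are ever read, which is why a full sort does more work than needed. The sketch below shows which positions matter. The names `middleRange`, `sortedRangeBySort` and `median` are hypothetical, and averaging the two middle values for even-length input is an assumed convention rather than something confirmed in this thread. The range helper here is the plain sort-everything baseline; any function that returns the same sorted sub-range could be swapped in.

```go
package main

import (
	"fmt"
	"sort"
)

// middleRange reports which sorted positions a median needs: one position
// for odd n, two for even n, none for empty input.
func middleRange(n int) (start, count int) {
	if n == 0 {
		return 0, 0
	}
	if n%2 == 1 {
		return n / 2, 1
	}
	return n/2 - 1, 2
}

// sortedRangeBySort is the baseline: sort a full copy, then slice out the
// requested positions. It assumes the range is valid for data.
func sortedRangeBySort(data []float64, start, count int) []float64 {
	work := make([]float64, len(data))
	copy(work, data)
	sort.Float64s(work)
	return work[start : start+count]
}

// median averages the middle one or two sorted values. The boolean is false
// when data is empty.
func median(data []float64) (float64, bool) {
	start, count := middleRange(len(data))
	if count == 0 {
		return 0, false
	}
	mid := sortedRangeBySort(data, start, count)
	sum := 0.0
	for _, v := range mid {
		sum += v
	}
	return sum / float64(len(mid)), true
}

func main() {
	odd, _ := median([]float64{5, 1, 4, 2, 3})
	even, _ := median([]float64{4, 1, 3, 2})
	fmt.Println(odd, even) // prints 3 2.5
}
```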

@neonstalwart
Contributor Author

the last piece of feedback remaining to be addressed is to add unit tests for getSortedRange and friends. it will probably be next week before that happens.

@otoolep
Contributor

otoolep commented May 1, 2015

Great, thanks @neonstalwart -- looking forward to it.

@neonstalwart
Contributor Author

@otoolep i added a few tests for getSortedRange. i'm open to any suggestions for other things to test - sometimes it's easier to see things from the outside.

in the process of adding tests i thought i would add some benchmarks to compare with the built-in sort and found that i had missed the mark by a lot (about 3 times slower) due to poor memory management. i was able to get closer to where it should be by making some tweaks, and on my machine getSortedRange now takes about 40% less time than the built-in sort (5961 ns/op vs 10084 ns/op) on those benchmarks.

@otoolep
Contributor

otoolep commented May 4, 2015

Nice.

It seems like you used the standard Go benchmark approach to profiling the code, correct? Can you show us the full output?

@neonstalwart
Contributor Author

@otoolep here's the full output of another test run - the tests are included in this PR

> go test -v -bench=BenchmarkGetSortedRange -run=XXX ./influxql
PASS
BenchmarkGetSortedRangeByPivot    300000              5822 ns/op
BenchmarkGetSortedRangeBySort     100000             10717 ns/op
ok      github.com/influxdb/influxdb/influxql   3.112s

@otoolep
Contributor

otoolep commented May 5, 2015

Great -- thanks @neonstalwart. I will take 1 final look at this, and then merge.

Thanks again for the thorough job.

@otoolep
Contributor

otoolep commented May 5, 2015

@pauldix -- you want to take a quick look?

toddboom added a commit that referenced this pull request May 6, 2015
@toddboom toddboom merged commit 710576e into influxdata:master May 6, 2015
@neonstalwart neonstalwart deleted the median-aggregate branch May 6, 2015 14:09