
feat(query/stdlib): enable read group min max #19158

Merged
merged 1 commit into master from feat/read-group-min-max on Aug 5, 2020

Conversation

fchikwekwe
Contributor

@fchikwekwe fchikwekwe commented Jul 30, 2020

This PR will enable readGroup min/max in OSS behind a feature flag.

To do:

  • Create feature flags specific to min/max for ReadGroup
  • Update storage capabilities in OSS and/or idpe to include min/max for ReadGroup
  • Update planner rule PushDownGroupRule and add tests
  • Add tests to storage/flux/table_test.go (tests added, and passing, but there are issues with GroupTable as seen below)
  • Add flux end-to-end tests (test(stdlib/planner): test readGroup max/min flux#3051)
  • Add launcher test to OSS
  • Add pipeline test to idpe

@fchikwekwe fchikwekwe requested a review from a team July 30, 2020 16:53
@fchikwekwe fchikwekwe force-pushed the feat/read-group-min-max branch 5 times, most recently from cfde09c to df6b766 Compare July 31, 2020 17:29
Comment on lines 747 to 748
if err != nil {
panic(err)
Contributor Author

I don't think that I should panic here, but I'm not sure how to handle this case.

Contributor

I agree. It seems to me that if it's not possible to guarantee that only supported aggregate types will be passed down here then advance() bool needs to be refactored to advance() (bool, error).

It looks like t.do above, which takes the advance function, returns an error, so perhaps this would be a relatively small refactor?
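A minimal standalone sketch of that refactor (all names here are hypothetical stand-ins, not the actual influxdb types): advance returns (bool, error) instead of panicking, and a do-style driver propagates the error to its caller.

```go
package main

import (
	"errors"
	"fmt"
)

// table is a hypothetical stand-in for the cursor-backed table type.
type table struct {
	unsupported bool // simulates an unsupported aggregate being pushed down
	rows        int  // remaining rows to advance over
}

// advance reports whether there is more data, surfacing errors instead
// of panicking on an unsupported aggregate type.
func (t *table) advance() (bool, error) {
	if t.unsupported {
		return false, errors.New("unsupported aggregate type")
	}
	if t.rows == 0 {
		return false, nil
	}
	t.rows--
	return true, nil
}

// do drives the advance function and propagates any error upward,
// mirroring how t.do already returns an error in the PR.
func (t *table) do(f func() (bool, error)) error {
	for {
		more, err := f()
		if err != nil {
			return err
		}
		if !more {
			return nil
		}
	}
}

func main() {
	t := &table{unsupported: true}
	if err := t.do(t.advance); err != nil {
		fmt.Println("error:", err)
	}
}
```

With this shape the panic site simply becomes a `return false, err`, and every caller of advance already has an error path to hand it to.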

@fchikwekwe fchikwekwe force-pushed the feat/read-group-min-max branch 4 times, most recently from b0a8f4f to 5c6a1bd Compare July 31, 2020 22:57
Comment on lines 760 to 776
length := arr.Size()
for i := 0; i < length; i++ {
t, v := groupBy(arr.Timestamps, arr.Values)
timestamps = append(timestamps, t)
values = append(values, v)
}

{{if and (ne .Name "Boolean") (ne .Name "String")}}
timestamp, next = groupBy(timestamps, values)
//todo(faith): trying to aggregate values between array cursors if the subsequent
// value is still part of the same group. Unsure of the logic here.
if length != 0 {
value += next
} else {
value = next
}
{{end}}
Contributor Author

I'm missing something in my logic here to preserve values between array cursors. I feel like the overall logic is pretty close, but there's something off here.
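One possible shape for this, as a hypothetical standalone sketch rather than the actual template code: keep a running total plus a "seen" flag outside the per-cursor loop, so the first cursor initializes the value and each later cursor folds into it instead of overwriting it.

```go
package main

import "fmt"

// sumAcrossCursors folds the values of several array cursors belonging
// to the same group into one running sum. The "seen" flag distinguishes
// "no values yet" from "running total happens to be zero", so state is
// preserved between cursors instead of being reset per cursor.
func sumAcrossCursors(cursors [][]float64) (float64, bool) {
	var total float64
	seen := false
	for _, arr := range cursors { // each arr is one array cursor's values
		for _, v := range arr {
			if !seen {
				total = v // first value initializes the aggregate
				seen = true
			} else {
				total += v // later cursors fold into the same total
			}
		}
	}
	return total, seen
}

func main() {
	total, ok := sumAcrossCursors([][]float64{{1, 2}, {3, 4}})
	fmt.Println(total, ok)
}
```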

Comment on lines 886 to 885
func groupByCount{{.Name}}(timestamps []int64, values []{{.Type}})(int64, {{.Type}}) {
//todo(faith): I need a way to keep track of the values between array cursors
return math.MaxInt64, values[0]
}
{{end}}

{{if and (ne .Name "Boolean") (ne .Name "String")}}
func groupBySum{{.Name}}(timestamps []int64, values []{{.Type}})(int64, {{.Type}}) {
//todo(faith): I need a way to keep track of the values between array cursors
return math.MaxInt64, values[0]
}
Contributor Author

These are the affected cases that are causing the remaining tests to fail.

@jsternberg jsternberg force-pushed the feat/read-group-min-max branch from 5c6a1bd to 4ea9b11 Compare August 3, 2020 22:27
Contributor

@e-dard e-dard left a comment

This looks great to me @fchikwekwe. I haven't been keeping track of the pushdown work but I found this PR easy to digest :-)

Only requesting changes because I don't think we can let storage panic if we are sending down unsupported aggregates. I'm guessing a slight refactor on advance would fix that?

The performance stuff on sum aggregates is probably not important but I thought it was worth highlighting anyway.

}
}

func groupByMinFloat(timestamps []int64, values []float64) (int64, float64) {
Contributor

I'm interested in understanding if we can in some cases have an invariant that timestamps is sorted? We always return data out of the storage engine in timestamp order, because that's how we store it. If we have done any merging of series then data could be out of order.

Anyway, not sure if this is something worth thinking about but I just wondered is all :-)

Contributor

In this specific circumstance I think the timestamps would generally not be sorted. These methods all aggregate series that have already been aggregated individually, so here we're combining the per-series results. For min, the first series can select a point at 40 seconds and the second series one at 10 seconds.
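To illustrate the point above (hypothetical helper, not the generated template code): each series has already been reduced to its own (timestamp, value) result, and taking the min across those per-series results can pick timestamps in any order.

```go
package main

import "fmt"

// point is one per-series aggregate result: the timestamp and value
// that series selected.
type point struct {
	ts  int64
	val float64
}

// groupMin merges per-series min results into the group's min. The
// winning timestamp depends on which series held the smallest value,
// so the output timestamps carry no ordering guarantee.
func groupMin(perSeries []point) point {
	best := perSeries[0]
	for _, p := range perSeries[1:] {
		if p.val < best.val {
			best = p
		}
	}
	return best
}

func main() {
	// series 1's min landed at t=40, series 2's at t=10: unsorted input.
	fmt.Println(groupMin([]point{{40, 2.5}, {10, 1.0}}))
}
```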

// their final result does not contain _time, so this timestamp value can be anything
// and it won't matter.

func groupBySumFloat(timestamps []int64, values []float64) (int64, float64) {
Contributor

Typically this is something that would benefit from going as fast as possible, and it shouldn't be too complicated to improve performance for a nice tight loop like this.

One thing that might be worth trying here is to unroll the loop. With tight loops like this I have usually found you can improve performance by 30-40% depending on the situation.

I haven't tested this but I would try something like:

	// we need to know how many values might not fit into blocks of four.
	// we can sum them up afterwards.
	remainder := len(values) % 4

	// we will store the intermediate results in this array.
	var sums [4]float64

	// sum up any extra values...
	for _, v := range values[len(values)-remainder:] {
		sums[0] += v
	}

	// iterate over four values at a time. This can improve overall perf
	// for a bunch of reasons (bounds checking, pipelining, better cache use, fewer instructions).
	for i := 0; i < len(values)-remainder; i += 4 {
		sums[0] += values[i+3] // as a habit I put the biggest index first in case it helps reduce bounds checking...
		sums[1] += values[i+2]
		sums[2] += values[i+1]
		sums[3] += values[i]
	}

	// just need to sum up the intermediate results
	return math.MaxInt64, sums[0] + sums[1] + sums[2] + sums[3]

Of course, I'm speculating and benchmarks are the source of truth 😄. But I would expect something like the above to be quite a bit faster for larger groups of values.

Contributor

This is a cool idea and I'd be interested in benchmarking this as part of a future PR.

{{end}}

{{if and (ne .Name "Boolean") (ne .Name "String")}}
func groupBySum{{.Name}}(timestamps []int64, values []{{.Type}})(int64, {{.Type}}) {
Contributor

Since timestamps is not used it can be omitted for clarity.

Suggested change
func groupBySum{{.Name}}(timestamps []int64, values []{{.Type}})(int64, {{.Type}}) {
func groupBySum{{.Name}}(_ []int64, values []{{.Type}})(int64, {{.Type}}) {


@@ -2564,6 +2564,115 @@ func TestStorageReader_EmptyTableNoEmptyWindows(t *testing.T) {
}
}

func TestStorageReader_ReadGroup(t *testing.T) {
Contributor

👍 nice tests here

@jsternberg jsternberg force-pushed the feat/read-group-min-max branch 2 times, most recently from 3583c14 to 2bdfe43 Compare August 4, 2020 16:58
@jsternberg
Contributor

@e-dard I think I addressed your feedback and I think it passes tests now. Can you re-review?

@e-dard
Contributor

e-dard commented Aug 4, 2020

@jsternberg howdy. Is advance still able to panic? I don't know if it's a GitHub caching issue, but it still looks like it's panicking.

@jsternberg
Contributor

I think it's a caching issue. It doesn't panic anymore. Can you refresh to try and confirm that?

Contributor

@e-dard e-dard left a comment

LGTM 👍

@jsternberg jsternberg requested review from a team and wolffcm and removed request for a team August 4, 2020 17:23
@wolffcm wolffcm left a comment

I had a question about the extra plan node we're adding in the planner rule. I don't think we need it anymore. My guess is that you can remove the extra node, update the planner rule test, and the other tests will continue to pass and you'll be good to go.

I also had a suggestion about the templated code in storage/flux but that's more subjective.

SelectorConfig: execute.SelectorConfig{
Column: execute.DefaultValueColLabel,
},
})

I think that @yzhang1991 recently completed work so that storage will more completely compute aggregate values. It used to be that it would produce an aggregate value per input series, rather than per output group.

That was why this extra min node was needed, but I think that it is not needed anymore? Or if it is still needed, we should be doing the re-aggregation in storage in the same place we do it for sum, count, etc, rather than here.

Note that for sum, count, first and last above we just return the single source node.

Contributor

I'll try this out.

return groupByLast{{.Name}}, nil
case datatypes.AggregateTypeCount:
{{if eq .Name "Integer"}}
return groupByCount{{.Name}}, nil

This is a little confusing, since groupByCount actually does a sum, as it should, but I had to double-check to make sure. Maybe just put a comment here explaining this?
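To make that concrete, a standalone sketch of the re-aggregation (hypothetical helper, using the aggregateCountGroups name proposed later in this thread): storage hands back one partial count per series, and the group's count is the sum of those partials, which is why the count path reuses the sum logic.

```go
package main

import "fmt"

// aggregateCountGroups combines per-series counts into the group's
// count. Counting the group is summing the partial counts, so this
// "count" aggregation is genuinely a sum.
func aggregateCountGroups(partialCounts []int64) int64 {
	var total int64
	for _, c := range partialCounts {
		total += c
	}
	return total
}

func main() {
	// three series contributed 5, 2, and 7 points to the group
	fmt.Println(aggregateCountGroups([]int64{5, 2, 7}))
}
```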

// their final result does not contain _time, so this timestamp value can be anything
// and it won't matter.
{{if eq .Name "Integer"}}
func groupByCount{{.Name}}(timestamps []int64, values []{{.Type}}) (int64, {{.Type}}) {

Do we need this? Can we just call groupBySum directly everywhere this would be called? It might make this a little easier to understand.

Contributor

I don't think it's needed, but I didn't want it to be confusing. Do you think it would be better named as aggregateCountGroups? That might represent more of what this is. It is an aggregation of existing groups rather than the grouping themselves.


Yes, I like that name just fine.

Enables the min and max aggregates for the ReadGroupAggregate pushdown behind a feature flag.

Co-authored-by: Jonathan A. Sternberg <jonathan@influxdata.com>
@jsternberg jsternberg force-pushed the feat/read-group-min-max branch from 2bdfe43 to 647f315 Compare August 4, 2020 20:46
@jsternberg jsternberg requested a review from wolffcm August 4, 2020 22:04
@wolffcm wolffcm left a comment

Looks good!

@jsternberg jsternberg merged commit d48dc69 into master Aug 5, 2020
@jsternberg jsternberg deleted the feat/read-group-min-max branch August 5, 2020 14:40