
feat(query/stdlib): enable read group min max #19158

Merged
merged 1 commit into master from feat/read-group-min-max on Aug 5, 2020

Conversation

fchikwekwe
Contributor

@fchikwekwe fchikwekwe commented Jul 30, 2020

This PR will enable readGroup min/max in OSS behind a feature flag.

To do:

  • Create feature flags specific to min/max for ReadGroup
  • Update storage capabilities in OSS and/or idpe to include min/max for ReadGroup
  • Update planner rule PushDownGroupRule and add tests
  • Add tests to storage/flux/table_test.go (tests added, and passing, but there are issues with GroupTable as seen below)
  • Add flux end-to-end tests (test(stdlib/planner): test readGroup max/min flux#3051)
  • Add launcher test to OSS
  • Add pipeline test to idpe

@fchikwekwe fchikwekwe requested a review from a team July 30, 2020 16:53
@fchikwekwe fchikwekwe force-pushed the feat/read-group-min-max branch 5 times, most recently from cfde09c to df6b766 Compare July 31, 2020 17:29
Comment on lines 747 to 748
if err != nil {
panic(err)
Contributor Author

I don't think that I should panic here, but I'm not sure how to handle this case.

Contributor

I agree. It seems to me that if it's not possible to guarantee that only supported aggregate types will be passed down here then advance() bool needs to be refactored to advance() (bool, error).

It looks like t.do above, which takes the advance function, returns an error, so perhaps this would be a relatively small refactor?
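A minimal standalone sketch of that refactor (all names here are hypothetical stand-ins, not the actual influxdb types): advance returns (bool, error) instead of panicking, and a do-style driver propagates the error to its caller.

```go
package main

import (
	"errors"
	"fmt"
)

// table is a hypothetical stand-in for the cursor-backed table type.
type table struct {
	unsupported bool // simulates an unsupported aggregate being pushed down
	rows        int  // remaining rows to advance over
}

// advance reports whether there is more data, surfacing errors instead
// of panicking on an unsupported aggregate type.
func (t *table) advance() (bool, error) {
	if t.unsupported {
		return false, errors.New("unsupported aggregate type")
	}
	if t.rows == 0 {
		return false, nil
	}
	t.rows--
	return true, nil
}

// do drives the advance function and propagates any error upward,
// mirroring how t.do already returns an error in the PR.
func (t *table) do(f func() (bool, error)) error {
	for {
		more, err := f()
		if err != nil {
			return err
		}
		if !more {
			return nil
		}
	}
}

func main() {
	t := &table{unsupported: true}
	if err := t.do(t.advance); err != nil {
		fmt.Println("error:", err)
	}
}
```

With this shape the panic site simply becomes a `return false, err`, and every caller of advance already has an error path to hand it to.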

@fchikwekwe fchikwekwe force-pushed the feat/read-group-min-max branch 4 times, most recently from b0a8f4f to 5c6a1bd Compare July 31, 2020 22:57
Comment on lines 760 to 776
length := arr.Size()
for i := 0; i < length; i++ {
t, v := groupBy(arr.Timestamps, arr.Values)
timestamps = append(timestamps, t)
values = append(values, v)
}

{{if and (ne .Name "Boolean") (ne .Name "String")}}
timestamp, next = groupBy(timestamps, values)
//todo(faith): trying to aggregate values between array cursors if the subsequent
// value is still part of the same group. Unsure of the logic here.
if length != 0 {
value += next
} else {
value = next
}
{{end}}
Contributor Author

I'm missing something in my logic here to preserve values between array cursors. I feel like the overall logic is pretty close, but there's something off here.
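One possible shape for this, as a hypothetical standalone sketch rather than the actual template code: keep a running total plus a "seen" flag outside the per-cursor loop, so the first cursor initializes the value and each later cursor folds into it instead of overwriting it.

```go
package main

import "fmt"

// sumAcrossCursors folds the values of several array cursors belonging
// to the same group into one running sum. The "seen" flag distinguishes
// "no values yet" from "running total happens to be zero", so state is
// preserved between cursors instead of being reset per cursor.
func sumAcrossCursors(cursors [][]float64) (float64, bool) {
	var total float64
	seen := false
	for _, arr := range cursors { // each arr is one array cursor's values
		for _, v := range arr {
			if !seen {
				total = v // first value initializes the aggregate
				seen = true
			} else {
				total += v // later cursors fold into the same total
			}
		}
	}
	return total, seen
}

func main() {
	total, ok := sumAcrossCursors([][]float64{{1, 2}, {3, 4}})
	fmt.Println(total, ok)
}
```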

Comment on lines 886 to 885
func groupByCount{{.Name}}(timestamps []int64, values []{{.Type}})(int64, {{.Type}}) {
//todo(faith): I need a way to keep track of the values between array cursors
return math.MaxInt64, values[0]
}
{{end}}

{{if and (ne .Name "Boolean") (ne .Name "String")}}
func groupBySum{{.Name}}(timestamps []int64, values []{{.Type}})(int64, {{.Type}}) {
//todo(faith): I need a way to keep track of the values between array cursors
return math.MaxInt64, values[0]
}
Contributor Author

These are the affected cases that are causing the remaining tests to fail.

@jsternberg jsternberg force-pushed the feat/read-group-min-max branch from 5c6a1bd to 4ea9b11 Compare August 3, 2020 22:27
Contributor

@e-dard e-dard left a comment

This looks great to me @fchikwekwe. I haven't been keeping track of the pushdown work but I found this PR easy to digest :-)

Only requesting changes because I don't think we can let storage panic if we are sending down unsupported aggregates. I'm guessing a slight refactor on advance would fix that?

The performance stuff on sum aggregates is probably not important but I thought it was worth highlighting anyway.

}
}

func groupByMinFloat(timestamps []int64, values []float64) (int64, float64) {
Contributor

I'm interested in understanding if we can in some cases have an invariant that timestamps is sorted? We always return data out of the storage engine in timestamp order, because that's how we store it. If we have done any merging of series then data could be out of order.

Anyway, not sure if this is something worth thinking about but I just wondered is all :-)

Contributor

In this specific circumstance I think the timestamps would generally not be sorted. These methods all aggregate series that have already been aggregated individually, so here we're combining the per-series results. For min, the first series can select a point at 40 seconds and the second series one at 10 seconds.
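To illustrate the point above (hypothetical helper, not the generated template code): each series has already been reduced to its own (timestamp, value) result, and taking the min across those per-series results can pick timestamps in any order.

```go
package main

import "fmt"

// point is one per-series aggregate result: the timestamp and value
// that series selected.
type point struct {
	ts  int64
	val float64
}

// groupMin merges per-series min results into the group's min. The
// winning timestamp depends on which series held the smallest value,
// so the output timestamps carry no ordering guarantee.
func groupMin(perSeries []point) point {
	best := perSeries[0]
	for _, p := range perSeries[1:] {
		if p.val < best.val {
			best = p
		}
	}
	return best
}

func main() {
	// series 1's min landed at t=40, series 2's at t=10: unsorted input.
	fmt.Println(groupMin([]point{{40, 2.5}, {10, 1.0}}))
}
```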

// their final result does not contain _time, so this timestamp value can be anything
// and it won't matter.

func groupBySumFloat(timestamps []int64, values []float64) (int64, float64) {
Contributor

Typically this is something that would benefit from going as fast as possible, and it shouldn't be too complicated to improve performance for a nice tight loop like this.

One thing that might be worth trying here is to unroll the loop. With tight loops like this I have usually found you can improve performance by 30-40% depending on the situation.

I haven't tested this but I would try something like:

	// we need to know how many values might not fit into blocks of four.
	// we can sum them up afterwards.
	remainder := len(values) % 4

	// we will store the intermediate results in this array.
	var sums [4]float64

	// sum up any extra values...
	for _, v := range values[len(values)-remainder:] {
		sums[0] += v
	}

	// iterate over four values at a time. This can improve overall perf
	// for a bunch of reasons (bounds checking, pipelining, better cache use, fewer instructions).
	for i := 0; i < len(values)-remainder; i += 4 {
		sums[0] += values[i+3] // as a habit I put the biggest index first in case it helps reduce bounds checking...
		sums[1] += values[i+2]
		sums[2] += values[i+1]
		sums[3] += values[i]
	}

	// just need to sum up the intermediate results
	return math.MaxInt64, sums[0] + sums[1] + sums[2] + sums[3]

Of course, I'm speculating and benchmarks are the source of truth 😄. But I would expect something like the above to be quite a bit faster for larger groups of values.

Contributor

This is a cool idea and I'd be interested in benchmarking this as part of a future PR.

{{end}}

{{if and (ne .Name "Boolean") (ne .Name "String")}}
func groupBySum{{.Name}}(timestamps []int64, values []{{.Type}})(int64, {{.Type}}) {
Contributor

Since timestamps is not used it can be omitted for clarity.

Suggested change
func groupBySum{{.Name}}(timestamps []int64, values []{{.Type}})(int64, {{.Type}}) {
func groupBySum{{.Name}}(_ []int64, values []{{.Type}})(int64, {{.Type}}) {


@@ -2564,6 +2564,115 @@ func TestStorageReader_EmptyTableNoEmptyWindows(t *testing.T) {
}
}

func TestStorageReader_ReadGroup(t *testing.T) {
Contributor

👍 nice tests here

@jsternberg jsternberg force-pushed the feat/read-group-min-max branch 2 times, most recently from 3583c14 to 2bdfe43 Compare August 4, 2020 16:58
@jsternberg
Contributor

@e-dard I think I addressed your feedback and I think it passes tests now. Can you re-review?

@e-dard
Contributor

e-dard commented Aug 4, 2020

@jsternberg howdy. Is advance still able to panic? I don't know if it's a GitHub caching issue, but it still looks like it's panicking.

@jsternberg
Contributor

I think it's a caching issue. It doesn't panic anymore. Can you refresh to try and confirm that?

Contributor

@e-dard e-dard left a comment

LGTM 👍

@jsternberg jsternberg requested review from a team and wolffcm and removed request for a team August 4, 2020 17:23
@wolffcm wolffcm left a comment

I had a question about the extra plan node we're adding in the planner rule. I don't think we need it anymore. My guess is that you can remove the extra node, update the planner rule test, and the other tests will continue to pass and you'll be good to go.

I also had a suggestion about the templated code in storage/flux but that's more subjective.

SelectorConfig: execute.SelectorConfig{
Column: execute.DefaultValueColLabel,
},
})

I think that @yzhang1991 recently completed work so that storage will more completely compute aggregate values. It used to be that it would produce an aggregate value per input series, rather than per output group.

That was why this extra min node was needed, but I think that it is not needed anymore? Or if it is still needed, we should be doing the re-aggregation in storage in the same place we do it for sum, count, etc, rather than here.

Note that for sum, count, first and last above we just return the single source node.

Contributor

I'll try this out.

return groupByLast{{.Name}}, nil
case datatypes.AggregateTypeCount:
{{if eq .Name "Integer"}}
return groupByCount{{.Name}}, nil

This is a little confusing, since groupByCount actually does a sum, as it should, but I had to double-check to make sure. Maybe just put a comment here explaining this?
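To make that concrete, a standalone sketch of the re-aggregation (hypothetical helper, using the aggregateCountGroups name proposed later in this thread): storage hands back one partial count per series, and the group's count is the sum of those partials, which is why the count path reuses the sum logic.

```go
package main

import "fmt"

// aggregateCountGroups combines per-series counts into the group's
// count. Counting the group is summing the partial counts, so this
// "count" aggregation is genuinely a sum.
func aggregateCountGroups(partialCounts []int64) int64 {
	var total int64
	for _, c := range partialCounts {
		total += c
	}
	return total
}

func main() {
	// three series contributed 5, 2, and 7 points to the group
	fmt.Println(aggregateCountGroups([]int64{5, 2, 7}))
}
```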

// their final result does not contain _time, so this timestamp value can be anything
// and it won't matter.
{{if eq .Name "Integer"}}
func groupByCount{{.Name}}(timestamps []int64, values []{{.Type}}) (int64, {{.Type}}) {

Do we need this? Can we just call groupBySum directly everywhere this would be called? It might make this a little easier to understand.

Contributor

I don't think it's needed, but I didn't want it to be confusing. Do you think it would be better named as aggregateCountGroups? That might represent more of what this is. It is an aggregation of existing groups rather than the grouping themselves.


Yes, I like that name just fine.

Enables the min and max aggregates for the ReadGroupAggregate pushdown behind a feature flag.

Co-authored-by: Jonathan A. Sternberg <jonathan@influxdata.com>
@jsternberg jsternberg force-pushed the feat/read-group-min-max branch from 2bdfe43 to 647f315 Compare August 4, 2020 20:46
@jsternberg jsternberg requested a review from wolffcm August 4, 2020 22:04
@wolffcm wolffcm left a comment

Looks good!

@jsternberg jsternberg merged commit d48dc69 into master Aug 5, 2020
@jsternberg jsternberg deleted the feat/read-group-min-max branch August 5, 2020 14:40