Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Min and max numerical aggregates #3078

Merged
merged 12 commits into from
Oct 2, 2024

Conversation

nasdf
Copy link
Member

@nasdf nasdf commented Sep 27, 2024

Relevant issue(s)

Resolves #2978

Description

This PR adds support for two new query aggregates min and max.

Tasks

  • I made sure the code is well commented, particularly hard-to-understand areas.
  • I made sure the repository-held documentation is changed accordingly.
  • I made sure the pull request title adheres to the conventional commit style (the subset used in the project can be found in tools/configs/chglog/config.yml).
  • I made sure to discuss its limitations such as threats to validity, vulnerability to mistake and misuse, robustness to invalidation of assumptions, resource requirements, ...

How has this been tested?

Added integration tests

Specify the platform(s) on which this was tested:

  • MacOS

@nasdf nasdf added feature New feature or request area/query Related to the query component labels Sep 27, 2024
@nasdf nasdf added this to the DefraDB v0.14 milestone Sep 27, 2024
@nasdf nasdf self-assigned this Sep 27, 2024
@nasdf nasdf requested a review from a team September 27, 2024 00:10
Copy link

codecov bot commented Sep 27, 2024

Codecov Report

Attention: Patch coverage is 93.67681% with 27 lines in your changes missing coverage. Please review.

Project coverage is 79.85%. Comparing base (c986312) to head (3cff29d).
Report is 1 commits behind head on develop.

Files with missing lines Patch % Lines
internal/planner/max.go 92.57% 8 Missing and 3 partials ⚠️
internal/planner/min.go 92.57% 8 Missing and 3 partials ⚠️
internal/request/graphql/schema/generate.go 93.65% 2 Missing and 2 partials ⚠️
internal/planner/sum.go 95.83% 1 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@             Coverage Diff             @@
##           develop    #3078      +/-   ##
===========================================
+ Coverage    79.67%   79.85%   +0.18%     
===========================================
  Files          348      351       +3     
  Lines        27014    27358     +344     
===========================================
+ Hits         21522    21845     +323     
- Misses        3962     3975      +13     
- Partials      1530     1538       +8     
Flag Coverage Δ
all-tests 79.85% <93.68%> (+0.18%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
internal/planner/aggregate.go 100.00% <100.00%> (ø)
internal/planner/explain.go 60.87% <ø> (ø)
internal/planner/select.go 85.36% <100.00%> (+0.25%) ⬆️
internal/planner/top.go 70.63% <100.00%> (+0.96%) ⬆️
internal/planner/sum.go 87.72% <95.83%> (-1.94%) ⬇️
internal/request/graphql/schema/generate.go 88.15% <93.65%> (+0.04%) ⬆️
internal/planner/max.go 92.57% <92.57%> (ø)
internal/planner/min.go 92.57% <92.57%> (ø)

... and 14 files with indirect coverage changes


Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c986312...3cff29d. Read the comment docs.

}

// Note: this test should follow a different code path to `_avg` on it's own
// utilising the existing `_min` node instead of adding a new one. This test cannot
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: _max

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

Copy link
Contributor

@AndrewSisley AndrewSisley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want too review this but will wait until the tests are passing, please re-request my review when they are and I'll get right on this :)

@nasdf nasdf force-pushed the nasdf/feat/min-max-aggregate branch from db85fa3 to 58a5093 Compare September 27, 2024 17:22
@nasdf
Copy link
Member Author

nasdf commented Sep 27, 2024

I want too review this but will wait until the tests are passing, please re-request my review when they are and I'll get right on this :)

Should be all fixed now. There was an issue with casting max value floats to ints which I think may exist in the sum aggregate where I copied the code from.

@nasdf nasdf requested a review from AndrewSisley September 27, 2024 17:52
Copy link
Contributor

@AndrewSisley AndrewSisley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, just some minor stuff to sort out first IMO, and the below question.

question: It does not lok like you have added max/min to non-numeric types such as DateTime, String, Boolean, etc, and the gql related code does not look like it will facilitate the introduction of that support - was this an oversight or do you have a plan on how to introduce it later? I am concerned that doing so will be a breaking change to the GQL syntax due to the return type.

return a.Value() < b.Value()
}

func reverse[T any](original func(T, T) bool) func(T, T) bool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: inverse might be a better name? I'm not sure though. Maybe a line of documentation would be handy as reverse suggests this is order related, but the implementation and signature does not suggest it is limited to that.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

if err != nil {
return nil, err
}
// If one source property is a float, the result will be a float - no need to check the rest
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: This comment looks copy pasted and doesn't quite make sense for min/max If one source property is a float, the result will be a float is not really correct, although this is a valid enough way to determine return-type. I suggest tweaking the comment slightly for min/max.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not seeing which part is incorrect.

Copy link
Contributor

@AndrewSisley AndrewSisley Sep 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you had:

[]any{
  float(3.5),
  int(9999),
}

max would be an int. Summing/averaging the set would always result in a float, but min/maxing the set would not. The comment makes sense for other aggregates, but not for min/max.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay I see what you mean. I've refactored the logic so it actually returns the correct type now.

Correct me if I'm wrong but I don't think we have any operations that would allow us to test mixed type aggregates.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct me if I'm wrong but I don't think we have any operations that would allow us to test mixed type aggregates.

No you are correct, I'm only commenting on the comment, the code/tests are all good IMO :)

I've refactored the logic so it actually returns the correct type now.

Nice, thanks!

func (n *maxNode) Kind() string { return "maxNode" }
func (n *maxNode) Init() error { return n.plan.Init() }
func (n *maxNode) Start() error { return n.plan.Start() }
func (n *maxNode) Spans(spans core.Spans) { n.plan.Spans(spans) }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick: This should probably be covered by tests, I think having a docID param on the aggregate should hit this, otherwise it could be deleted/broken and it wouldn't break the build.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added a test that should cover it

max := -math.MaxFloat64
for _, source := range n.aggregateMapping {
child := n.currentValue.Fields[source.Index]
collectionMax := -math.MaxFloat64
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

todo: The code in this for loop looks like it will return -math.MaxFloat64 if the target is empty/nil, have you tested this, and is it the behaviour we want (nil might be better, otherwise an empty set will have the same result as an item with a literal -math.MaxFloat64?)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

refactored to return nil on empty values

max := -math.MaxFloat64
for _, source := range n.aggregateMapping {
child := n.currentValue.Fields[source.Index]
collectionMax := -math.MaxFloat64
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggstion: You seem to alternate between using ``-math.MaxFloat64and-math.MinFloat64`, this makes spotting its significance/other-refs harder and adds a easily avoided little confusion - I suggest picking one and using it throughout.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

refactored, this is no longer applicable

plan planNode

isFloat bool
virtualFieldIndex int
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: I'm sorry about the bother, because you are copying my undocumented work in the other aggregates, but I think it would be worth a line of documentation on virtualFieldIndex on these nodes, its not very clear.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Collaborator

@fredcarle fredcarle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work! I like what you did with the aggregate generics and thanks for adding all those tests. Just need to resole Andy's comments before merge.

@nasdf
Copy link
Member Author

nasdf commented Sep 30, 2024

Looks good to me, just some minor stuff to sort out first IMO, and the below question.

question: It does not lok like you have added max/min to non-numeric types such as DateTime, String, Boolean, etc, and the gql related code does not look like it will facilitate the introduction of that support - was this an oversight or do you have a plan on how to introduce it later? I am concerned that doing so will be a breaking change to the GQL syntax due to the return type.

I'd prefer to add support for non-numerical types in a new PR. I think it will require a good amount of changes to the aggregate types and I'm not worried about a breaking GQL change if it's needed.

internal/planner/max.go Show resolved Hide resolved
internal/planner/max.go Show resolved Hide resolved
internal/planner/max.go Show resolved Hide resolved
internal/planner/min.go Show resolved Hide resolved
internal/planner/min.go Show resolved Hide resolved
tests/integration/explain.go Show resolved Hide resolved
@AndrewSisley
Copy link
Contributor

I think it will require a good amount of changes to the aggregate types

Which aggregate types? The new types added in this PR? Or some existing stuff to?

I am concerned that you will just end up re-writing a lot the new stuff added in this PR to handle (IMO) a pretty core sub-feature (DateTime).

Note: I am alright with you doing that in another PR is everyone else is, especially given this has already been reviewed now (and so a lot of the extra time has already been spent), but my preference would be to get the long-term syntax (and internal code structure) correct first.

@nasdf nasdf changed the title feat: Min and max aggregates feat: Min and max numerical aggregates Sep 30, 2024
var max *float64
for _, source := range n.aggregateMapping {
child := n.currentValue.Fields[source.Index]
var collectionMax *float64
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

todo: This is incorrect for large integers, you cannot convert a very big int to a float and then back again and expect it to have the same value.

suggestion: Add failing tests before fixing this (if you can, you might struggle to set an Int to Int64MaxVal atm via our public clients, as GQL is technically only 32bit Ints)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

e.g. the below will result in a negative number:

a := math.MaxInt64 - 1
		println(fmt.Sprint(a))
		b := float64(a)
		c := int64(b)
		panic(fmt.Sprint(c))

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

refactored to use big.Float

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did you try to test this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added tests for int max value

@nasdf
Copy link
Member Author

nasdf commented Sep 30, 2024

Which aggregate types? The new types added in this PR? Or some existing stuff to?

I haven't thought through all of it yet, but because the aggregates are expected to return a float and can then be used as input into other aggregates, there will be issues with an aggregate that can return a union type of String, DateTime, etc.

I am concerned that you will just end up re-writing a lot the new stuff added in this PR to handle (IMO) a pretty core sub-feature (DateTime).

The planner code is very generic. I'm not too worried about this.

@nasdf nasdf requested a review from shahzadlone September 30, 2024 22:32
@nasdf nasdf requested a review from AndrewSisley September 30, 2024 22:32
Copy link
Member

@shahzadlone shahzadlone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM thanks for doing the suggestions

@AndrewSisley
Copy link
Contributor

there will be issues with an aggregate that can return a union type of String, DateTime, etc.

Yes, that is why I am concerned you are painting the syntax into a corner and we will need to make an avoidable breaking change in the (possibly near) future in order to correct this problem. The breaking change will require us to change our tests, and users to change their code.

I do not think this is a good thing to do, but will not block the PR over it. Please resolve the cast problem/test request before merge though.

@nasdf
Copy link
Member Author

nasdf commented Oct 1, 2024

there will be issues with an aggregate that can return a union type of String, DateTime, etc.

Yes, that is why I am concerned you are painting the syntax into a corner and we will need to make an avoidable breaking change in the (possibly near) future in order to correct this problem. The breaking change will require us to change our tests, and users to change their code.

I do not think this is a good thing to do, but will not block the PR over it. Please resolve the cast problem/test request before merge though.

The GQL type system won't allow for an aggregate that returns a union to be used as input in another aggregate.

I think the best approach will be to add new aggregates: _minDateTime, _maxString, etc.

This shouldn't result in any breaking changes to the existing aggregates.

@AndrewSisley
Copy link
Contributor

I think the best approach will be to add new aggregates: _minDateTime, _maxString, etc.

This shouldn't result in any breaking changes to the existing aggregates.

Only if we dont consider it weird that we have _min and not _minFloat, if that is the long term syntax we want I would prefer that all the types be consistent.

@nasdf nasdf force-pushed the nasdf/feat/min-max-aggregate branch from 623e275 to 4f8b79c Compare October 1, 2024 15:59
Copy link
Contributor

@AndrewSisley AndrewSisley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Talked over the long term plan in discord, is fully compatible with this PR - thanks for all of this Keenan, and for the discussion :)

@nasdf nasdf merged commit e59f6d9 into sourcenetwork:develop Oct 2, 2024
43 checks passed
@islamaliev
Copy link
Contributor

bug bash result: all looks good.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/query Related to the query component feature New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support for _min and _max aggregates
5 participants