
Add datetime aggregation fixes #144

Merged

1 commit merged into integ-datetime-aggreg on Nov 9, 2022

Conversation

@Yury-Fridlyand commented Oct 26, 2022

Signed-off-by: Yury-Fridlyand yuryf@bitquilltech.com

Description

Make the listed aggregations work with datetime types.

Fixes

  • min
  • max
  • avg

Not included

  • var_samp
  • var_pop
  • stddev_samp
  • stddev_pop
  • sum [1]

Implementation details:

Precision

Aggregation is done with millisecond precision; this follows the OpenSearch approach. See code snippets: one, two.
We convert datetimes to millis and back on our own (see below), so we could scale up to nanoseconds. Such a fix requires additional changes, though, and a few tests fail.
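To illustrate what millisecond precision means for the round trip described above, here is a minimal plain-Java sketch (class name and sample values are mine, not from the plugin):

```java
import java.time.Instant;
import java.time.LocalTime;
import static java.time.temporal.ChronoUnit.MILLIS;

public class PrecisionSketch {
    public static void main(String[] args) {
        // Datetime -> millis since Epoch and back; sub-millisecond digits are lost.
        Instant ts = Instant.parse("2022-10-26T12:34:56.789123Z");
        long millis = ts.toEpochMilli();                 // .789123 truncated to .789
        Instant restored = Instant.ofEpochMilli(millis); // 2022-10-26T12:34:56.789Z

        // TIME values are measured as millis from midnight instead.
        long timeMillis = MILLIS.between(LocalTime.MIN, LocalTime.of(22, 16, 42));
        System.out.println(restored + " " + timeMillis);
    }
}
```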

ExpressionAggregationScript::execute

A callback function, called by the OpenSearch node to extract a value while processing an aggregation script.
This was added for backward compatibility:

And this, to extract the value. Fortunately, toEpochMilli() returns negative values for pre-Epoch timestamps, so we are able to use it to group and compare any values:
switch ((ExprCoreType) expr.type()) {
  case TIME:
    // workaround for session context issue
    // TODO remove once fixed
    return MILLIS.between(LocalTime.MIN, expr.timeValue());
  case DATE:
  case DATETIME:
  case TIMESTAMP:
    return expr.timestampValue().toEpochMilli();
  default:
    return expr.value();
}
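The pre-Epoch claim is easy to verify in plain Java (values are assumed for illustration):

```java
import java.time.Instant;

public class PreEpochSketch {
    public static void main(String[] args) {
        // toEpochMilli() is negative before 1970-01-01, so numeric ordering
        // still matches chronological ordering across the Epoch boundary.
        long before = Instant.parse("1969-12-31T23:59:59Z").toEpochMilli(); // -1000
        long after  = Instant.parse("1970-01-01T00:00:01Z").toEpochMilli(); //  1000
        System.out.println(before < 0 && before < after);                   // true
    }
}
```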

OpenSearch accepts dates as Joda library types and converts them to milliseconds since the Epoch. We have Java datetime types, so I do the conversion there.
https://github.com/opensearch-project/OpenSearch/blob/140e8d3e6c91519edc47be07b4cd053fdfac1769/server/src/main/java/org/opensearch/search/aggregations/support/values/ScriptDoubleValues.java#L123

OpenSearchExprValueFactory parseTimestamp and constructTimestamp

OpenSearch returns aggregated values as strings, even milliseconds since the Epoch. I have to extract them, so I combined these functions.

private ExprValue parseTimestamp(Content value) {
  if (value.isString()) {
    // value may contain epoch millis as a string, try to extract it
    try {
      return parseTimestamp(new ObjectContent(Long.parseLong(value.stringValue())));
    } catch (NumberFormatException ignored) { /* nothing to do, try another format */ }
    try {
      return new ExprTimestampValue(
          // Using OpenSearch DateFormatters for now.
          DateFormatters.from(DATE_TIME_FORMATTER.parse(value.stringValue())).toInstant());
    } catch (DateTimeParseException e) {
      throw new IllegalStateException(String.format(
          "Construct ExprTimestampValue from \"%s\" failed, unsupported date format.",
          value.stringValue()), e);
    }
  }
  if (value.isNumber()) {
    return new ExprTimestampValue(Instant.ofEpochMilli(value.longValue()));
  }
  return new ExprTimestampValue((Instant) value.objectValue());
}
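The fallback logic can be illustrated standalone. This sketch uses plain java.time types and a hypothetical parse helper with an ISO formatter, rather than the plugin's Content/ExprValue classes and its DATE_TIME_FORMATTER:

```java
import java.time.Instant;
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

public class TimestampParseSketch {
    // Try epoch millis first, then fall back to a formatted date string.
    static Instant parse(String raw) {
        try {
            return Instant.ofEpochMilli(Long.parseLong(raw));
        } catch (NumberFormatException ignored) { /* not epoch millis, try a date format */ }
        try {
            return OffsetDateTime.parse(raw, DateTimeFormatter.ISO_OFFSET_DATE_TIME).toInstant();
        } catch (DateTimeParseException e) {
            throw new IllegalStateException("Unsupported date format: " + raw, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parse("1666742400000"));        // epoch-millis path: 2022-10-26T00:00:00Z
        System.out.println(parse("2022-10-26T00:00:00Z")); // formatted-string path, same instant
    }
}
```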

ExprValueUtils

These two methods are used by the avg in-memory aggregation for datetime types.

public static long extractEpochMilliFromAnyDateTimeType(ExprValue value) {

public static ExprValue convertEpochMilliToDateTimeType(long value, ExprCoreType type) {
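A rough sketch of what such conversions could look like. The real methods take ExprValue and ExprCoreType; this illustration uses java.time types directly, with TIME measured from midnight and the other types from the Epoch:

```java
import java.time.LocalDateTime;
import java.time.LocalTime;
import static java.time.temporal.ChronoUnit.MILLIS;

public class EpochMilliHelpers {
    // DATE/DATETIME/TIMESTAMP -> millis since Epoch.
    static long toEpochMilli(LocalDateTime value) {
        return MILLIS.between(LocalDateTime.of(1970, 1, 1, 0, 0), value);
    }

    // TIME -> millis since midnight.
    static long toMilliOfDay(LocalTime value) {
        return MILLIS.between(LocalTime.MIN, value);
    }

    // Inverse conversions, used after averaging.
    static LocalDateTime dateTimeFromMillis(long millis) {
        return LocalDateTime.of(1970, 1, 1, 0, 0).plus(millis, MILLIS);
    }

    static LocalTime timeFromMillis(long millis) {
        return LocalTime.MIN.plus(millis, MILLIS);
    }

    public static void main(String[] args) {
        long m = toEpochMilli(LocalDateTime.of(1984, 3, 17, 22, 16, 42));
        System.out.println(dateTimeFromMillis(m)); // round-trips to 1984-03-17T22:16:42
    }
}
```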

AvgAggregator

  1. Added a new field to store which type is being aggregated.
  2. AvgState was renamed to DoubleAvgState.
  3. DateTimeAvgState implements the same logic, but for datetime types:

     @Override
     public ExprValue result() {
       if (count == 0) {
         return ExprNullValue.of();
       }
       return ExprValueUtils.convertEpochMilliToDateTimeType(Math.round(total / count), dataType);
     }

     @Override
     protected AvgState iterate(ExprValue value) {
       total += ExprValueUtils.extractEpochMilliFromAnyDateTimeType(value);
       return super.iterate(value);
     }

  4. Added an abstract class AvgState.
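The resulting behaviour, averaging epoch millis and converting back, can be sketched with assumed values (plain Java, not the plugin classes):

```java
import java.time.Instant;

public class AvgDateTimeSketch {
    public static void main(String[] args) {
        // Average two timestamps by averaging their epoch millis (assumed sample values).
        Instant a = Instant.parse("2022-01-01T00:00:00Z");
        Instant b = Instant.parse("2022-01-03T00:00:00Z");

        double total = (double) a.toEpochMilli() + b.toEpochMilli();
        long avgMillis = Math.round(total / 2); // mirrors Math.round(total / count)
        System.out.println(Instant.ofEpochMilli(avgMillis)); // 2022-01-02T00:00:00Z
    }
}
```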

AggregatorFunction

New signatures were added.

.put(new FunctionSignature(functionName, Collections.singletonList(DATE)),
arguments -> new AvgAggregator(arguments, DATE))
.put(new FunctionSignature(functionName, Collections.singletonList(DATETIME)),
arguments -> new AvgAggregator(arguments, DATETIME))
.put(new FunctionSignature(functionName, Collections.singletonList(TIME)),
arguments -> new AvgAggregator(arguments, TIME))
.put(new FunctionSignature(functionName, Collections.singletonList(TIMESTAMP)),
arguments -> new AvgAggregator(arguments, TIMESTAMP))

Actually, the plugin was already able to push down aggregation on datetime types, but the results weren't accepted.

[1] sum works on MySQL, but I don't see any reason to implement it for datetime types.

Test queries

In-memory aggregation

// Playing with 'OVER (PARTITION BY `datetime1`)' - the `datetime1` column has the same value for all rows,
// so partitioning by this column is meaningless and doesn't (shouldn't) affect the results.
// Aggregations with an `OVER` clause are executed in memory (in SQL plugin memory);
// aggregations without it are performed on the OpenSearch node itself (pushed down to OpenSearch).

SELECT avg(CAST(time1 AS time)) OVER(PARTITION BY datetime1) from calcs;

(The cast is required to ensure that you're working with TIME.)
To try other types, use:

CAST(date0 AS date)            
datetime(CAST(time0 AS STRING))
CAST(time1 AS time)            
CAST(datetime0 AS timestamp)   

Push Down aggregation (metric)

SELECT min(CAST(date0 AS date)) from calcs;

Push Down aggregation (composite_buckets)

SELECT CAST(datetime0 AS timestamp) FROM bank GROUP BY 1;

Issues Resolved

opensearch-project#645

Check List

  • New functionality includes testing.
    • All tests pass, including unit test, integration test and doctest
  • New functionality has been documented.
    • New functionality has javadoc added
    • New functionality has user manual doc added
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.


Comment on lines 55 to 56
case TIME:
return MILLIS.between(LocalTime.MIN, expr.timeValue());
Author: Workaround for session context issue.

Reviewer: please add a TODO to fix this.

.shouldMatch(new ExprTimestampValue("1984-03-17 22:16:42")
.timestampValue().toEpochMilli());
}

Reviewer: should we add the all-nulls case here too?

Author: I'm planning to add ITs that cover that too.

Author: The existing UT framework is very limited and can't properly cover such a case. There is a test for that, and it didn't detect this case:

@Test
void can_execute_expression_with_null_field() {
  assertThat()
      .docValues("age", null)
      .evaluate(ref("age", INTEGER))
      .shouldMatch(null);
}

case TIME:
state.total += MILLIS.between(LocalTime.MIN, value.timeValue());
break;
case DOUBLE:

Reviewer: we cannot compute AVG for SHORT, INTEGER, FLOAT, etc.?

Author: These types are cast to DOUBLE automatically.

Reviewer: Should the automatic casting of types to double be tested in AvgAggregatorTest? I think only INTEGER and DOUBLE are in there for number values.

docs/user/dql/aggregations.rst (outdated comment, resolved)
@MaxKsyunz MaxKsyunz self-requested a review October 28, 2022 18:53
@Yury-Fridlyand Yury-Fridlyand marked this pull request as ready for review November 1, 2022 02:35
public static long extractEpochMilliFromAnyDateTimeType(ExprValue value) {
switch ((ExprCoreType)value.type()) {
case DATE:
return MILLIS.between(LocalDateTime.of(1970, 1, 1, 0, 0), value.datetimeValue());

Reviewer: I may be wrong, but it seems like we could use Instant.toEpochMilli for this call too. Something like:

value.datetimeValue().toInstant().toEpochMilli();

Author: Nice catch!
Initially, case DATE: fell through to TIMESTAMP and DATETIME; conversion through a timestamp (Instant) returned a wrong result, because the timestamp changes the TZ to UTC. The DATE -> MILLIS -> DATE -> MILLIS round-trip check failed.
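The timezone pitfall can be demonstrated with a small sketch (the zone and sample date here are assumed for illustration, not necessarily what the plugin sees):

```java
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.ZoneId;
import static java.time.temporal.ChronoUnit.MILLIS;

public class DateTzSketch {
    public static void main(String[] args) {
        LocalDate date = LocalDate.of(2022, 10, 26); // assumed sample value

        // Millis from the Epoch computed in the same zone-less frame of reference.
        long direct = MILLIS.between(LocalDateTime.of(1970, 1, 1, 0, 0), date.atStartOfDay());

        // Converting through an Instant applies a timezone and shifts the value.
        ZoneId zone = ZoneId.of("America/Los_Angeles"); // assumed session TZ
        long viaInstant = date.atStartOfDay(zone).toInstant().toEpochMilli();

        // The two differ by the zone offset, so a DATE -> MILLIS -> DATE
        // round trip through Instant lands on the wrong value.
        System.out.println(viaInstant - direct); // 25200000 (7 hours, PDT)
    }
}
```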

Author: Thank you. While thinking about your comment I found a mistake. Fixed in 478141e.

@GumpacG commented Nov 3, 2022

I think I may have found a bug where min of a datetime column containing nulls does not return null.
[Two screenshots attached showing the query results.]

@Yury-Fridlyand (author) commented Nov 3, 2022

> I think I may have found a bug where min of a datetime column containing nulls does not return null.

Nice catch! It is fixed in #145 (not merged yet on upstream - opensearch-project#1000).

@GumpacG commented Nov 3, 2022

> > I think I may have found a bug where min of a datetime column containing nulls does not return null.
>
> Nice catch! It is fixed in #145 (not merged yet on upstream - opensearch-project#1000).

I see. Thanks. LGTM!

@GumpacG commented Nov 4, 2022

> > I think I may have found a bug where min of a datetime column containing nulls does not return null.
>
> Nice catch! It is fixed in #145 (not merged yet on upstream - opensearch-project#1000).

I think we should add integration tests for these cases once the PR upstream is merged.

* Add aggregator fixes and some useless unit tests.
* Add `avg` aggregation on datetime types.
* Rework in-memory `AVG`. Fix parsing value returned from the OpenSearch node.

Signed-off-by: Yury-Fridlyand <yuryf@bitquilltech.com>
@Yury-Fridlyand (author) commented:
> > > I think I may have found a bug where min of a datetime column containing nulls does not return null.
> >
> > Nice catch! It is fixed in #145 (not merged yet on upstream - opensearch-project#1000).
>
> I think we should add integration tests for these cases once the PR upstream is merged.

Done! Thank you for the idea.

@Yury-Fridlyand Yury-Fridlyand merged commit 4a9dfda into integ-datetime-aggreg Nov 9, 2022
@Yury-Fridlyand Yury-Fridlyand deleted the dev-datetime-aggreg branch November 9, 2022 16:59
@forestmvey left a comment:

LGTM

@MitchellGale left a comment:

Validation looks good.
