Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add serialization of ScalarValue::Binary and ScalarValue::LargeBinary, ScalarValue::Time64 #3534

Merged
merged 1 commit into from
Sep 21, 2022

Conversation

alamb
Copy link
Contributor

@alamb alamb commented Sep 19, 2022

Which issue does this PR close?

Draft as it builds on

Part of #3531

Rationale for this change

See #3531

What changes are included in this PR?

  1. Implement serialization code for: ScalarValue::{,Large}Binary
  2. Implement serialization code for ScalarValue::Time64
  3. Fix code for ScalarValue::Timestamp*
  4. Add tests

Are there any user-facing changes?

Better serialization support

@@ -796,6 +799,9 @@ enum PrimitiveScalarType{
TIME_MILLISECOND = 22;
INTERVAL_YEARMONTH = 23;
INTERVAL_DAYTIME = 24;

BINARY = 25;
LARGE_BINARY = 26;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interestingly TIME_NANOSECOND already existed

Copy link
Contributor Author

@alamb alamb Sep 19, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not a huge fan of the duplication between PrimitiveScalarType and ArrowType -- I am just following the existing patterns in this PR, but I will attempt to fix this in a follow on PR

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Turns out this was a bug in my original implementation (which was caught by #3537)

@alamb alamb marked this pull request as ready for review September 19, 2022 18:15
@alamb alamb marked this pull request as draft September 19, 2022 20:05
@@ -786,16 +789,21 @@ enum PrimitiveScalarType{
UTF8 = 11;
LARGE_UTF8 = 12;
DATE32 = 13;
TIME_MICROSECOND = 14;
TIME_NANOSECOND = 15;
TIMESTAMP_MICROSECOND = 14;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I renamed these fields because they are for Timestamp not actually Time (which are different in Arrow).

DataType::Time64(TimeUnit::Nanosecond)
}
protobuf::PrimitiveScalarType::TimestampMicrosecond => {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed the names of PrimitiveScalarType from Time here to Timestamp be consistent with the ScalarValue variants as well as the arrow type system

PrimitiveScalarType::Decimal128 => Self::Decimal128(None, 0, 0),
PrimitiveScalarType::Date64 => Self::Date64(None),
PrimitiveScalarType::TimeSecond => Self::TimestampSecond(None, None),
PrimitiveScalarType::TimeMillisecond => {
PrimitiveScalarType::TimestampSecond => Self::TimestampSecond(None, None),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These were incorrectly previously set to be Time rather than Timestamp

@alamb alamb marked this pull request as ready for review September 20, 2022 10:58
@@ -1098,7 +1098,7 @@ impl TryFrom<&ScalarValue> for protobuf::ScalarValue {
})
}
datafusion::scalar::ScalarValue::TimestampMicrosecond(val, tz) => {
create_proto_scalar(val, PrimitiveScalarType::TimeMicrosecond, |s| {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these names were super confusing as the protobuf definition used Time and DataType and ScalarValue used Timestamp.

Making it more confusing is that ScalarValue::Time64 is not a timestamp (it is the time of day!)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think TimeMicrosecond stands for Timestamp with time unit as TimeUnit::MicroSecond so that it names TimeMicroSecond

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rename to TimestampMicrosecond LGTM

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alamb do you know any history reason this field still named as TimeMilliSecond?
https://github.com/apache/arrow-datafusion/blob/master/datafusion/proto/proto/datafusion.proto#L669-L674

i think the original naming comes from here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alamb do you know any history reason this field still named as TimeMilliSecond?

I do not know why that field is called TimeMillisecond -- it is called Millisecond in the arrow schema so I think we could do the same in Datafusion: https://docs.rs/arrow/23.0.0/arrow/datatypes/enum.TimeUnit.html

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is a PR to make the naming of TimeUnit consistent: #3575

@alamb
Copy link
Contributor Author

alamb commented Sep 20, 2022

cc @waitingkuo @avantgardnerio

Copy link
Contributor

@avantgardnerio avantgardnerio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM as far as extending existing patterns. I very much agree with you @alamb that the whole situation is very confusing however. One day I hope someone has enough high-level knowledge to clean it up in a sensible way.

@alamb
Copy link
Contributor Author

alamb commented Sep 21, 2022

that the whole situation is very confusing however. One day I hope someone has enough high-level knowledge to clean it up in a sensible way.

I have plans (see #3547 ) but it has somewhat turned into

yak_hair_overflow

I think I can get remove the entire PrimitiveScalarType enum after I remove the multi-level list changes

@alamb alamb merged commit 0a2b0a7 into apache:master Sep 21, 2022
@alamb alamb deleted the alamb/serialize_binary_time branch September 21, 2022 16:37
@ursabot
Copy link

ursabot commented Sep 21, 2022

Benchmark runs are scheduled for baseline = 6be3301 and contender = 0a2b0a7. 0a2b0a7 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants