Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simplify serialization by removing redundant PrimitiveScalarValue #3612

Merged
merged 6 commits into from
Oct 25, 2022
Merged
Show file tree
Hide file tree
Changes from 4 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
21 changes: 21 additions & 0 deletions datafusion/common/src/scalar.rs
Original file line number Diff line number Diff line change
Expand Up @@ -2238,6 +2238,15 @@ impl_try_from!(Float32, f32);
impl_try_from!(Float64, f64);
impl_try_from!(Boolean, bool);

impl TryFrom<DataType> for ScalarValue {
type Error = DataFusionError;

/// Create a Null instance of ScalarValue for this datatype
fn try_from(datatype: DataType) -> Result<Self> {
(&datatype).try_into()
}
}

impl TryFrom<&DataType> for ScalarValue {
type Error = DataFusionError;

Expand All @@ -2260,6 +2269,9 @@ impl TryFrom<&DataType> for ScalarValue {
}
DataType::Utf8 => ScalarValue::Utf8(None),
DataType::LargeUtf8 => ScalarValue::LargeUtf8(None),
DataType::Binary => ScalarValue::Binary(None),
DataType::FixedSizeBinary(len) => ScalarValue::FixedSizeBinary(*len, None),
DataType::LargeBinary => ScalarValue::LargeBinary(None),
DataType::Date32 => ScalarValue::Date32(None),
DataType::Date64 => ScalarValue::Date64(None),
DataType::Time64(TimeUnit::Nanosecond) => ScalarValue::Time64(None),
Expand All @@ -2275,6 +2287,15 @@ impl TryFrom<&DataType> for ScalarValue {
DataType::Timestamp(TimeUnit::Nanosecond, tz_opt) => {
ScalarValue::TimestampNanosecond(None, tz_opt.clone())
}
DataType::Interval(IntervalUnit::YearMonth) => {
ScalarValue::IntervalYearMonth(None)
}
DataType::Interval(IntervalUnit::DayTime) => {
ScalarValue::IntervalDayTime(None)
}
DataType::Interval(IntervalUnit::MonthDayNano) => {
ScalarValue::IntervalMonthDayNano(None)
}
DataType::Dictionary(index_type, value_type) => ScalarValue::Dictionary(
index_type.clone(),
Box::new(value_type.as_ref().try_into()?),
Expand Down
50 changes: 6 additions & 44 deletions datafusion/proto/proto/datafusion.proto
Original file line number Diff line number Diff line change
Expand Up @@ -762,8 +762,9 @@ message StructValue {

message ScalarValue{
oneof value {
// Null value of any type (type is encoded)
PrimitiveScalarType null_value = 19;
// was PrimitiveScalarType null_value = 19;
alamb marked this conversation as resolved.
Show resolved Hide resolved
// Null value of any type
ArrowType null_value = 33;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be a dumb question but why are nulls typed? This encoding of scalarvalue seems to conflate encoding the schema with encoding the values, which seems unfortunate.

Perhaps we could take a look at what substrait does and copy that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean why are nulls typed in general? Basically because of how ScalarValue:: is implemented (as an Option<> around the underlying native type). I think @jimexist tried to clean it up at some point and make ScalarValue::None and then all the variants like ScalarValue::Int8 have values like i8 rather than Option<i8>.

I can't remember what the problem was but it didn't work easily.

In my opinion at least the serialization should follow how they are implemented in ScalarValue and if we improve ScalarValue then we can also improve the serialization code


bool bool_value = 1;
string utf8_value = 2;
Expand All @@ -778,7 +779,7 @@ message ScalarValue{
uint64 uint64_value = 11;
float float32_value = 12;
double float64_value = 13;
//Literal Date32 value always has a unit of day
// Literal Date32 value always has a unit of day
int32 date_32_value = 14;
ScalarListValue list_value = 17;
//WAS: ScalarType null_list_value = 18;
Expand All @@ -803,48 +804,9 @@ message Decimal128{
int64 s = 3;
}

// Contains all valid datafusion scalar type except for
// List
enum PrimitiveScalarType{
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All of these types are already handled in ArrowType


BOOL = 0; // arrow::Type::BOOL
UINT8 = 1; // arrow::Type::UINT8
INT8 = 2; // arrow::Type::INT8
UINT16 = 3; // represents arrow::Type fields in src/arrow/type.h
INT16 = 4;
UINT32 = 5;
INT32 = 6;
UINT64 = 7;
INT64 = 8;
FLOAT32 = 9;
FLOAT64 = 10;
UTF8 = 11;
LARGE_UTF8 = 12;
DATE32 = 13;
TIMESTAMP_MICROSECOND = 14;
TIMESTAMP_NANOSECOND = 15;
NULL = 16;
DECIMAL128 = 17;
DATE64 = 20;
TIMESTAMP_SECOND = 21;
TIMESTAMP_MILLISECOND = 22;
INTERVAL_YEARMONTH = 23;
INTERVAL_DAYTIME = 24;
INTERVAL_MONTHDAYNANO = 28;

BINARY = 25;
LARGE_BINARY = 26;

TIME64 = 27;
}


// Broke out into multiple message types so that type
// metadata did not need to be in separate message
// All types that are of the empty message types contain no additional metadata
// about the type
// Serialized data type
message ArrowType{
oneof arrow_type_enum{
oneof arrow_type_enum {
EmptyMessage NONE = 1; // arrow::Type::NA
EmptyMessage BOOL = 2; // arrow::Type::BOOL
EmptyMessage UINT8 = 3; // arrow::Type::UINT8
Expand Down
Loading