Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make ScalarValue an ArrayRef Wrapper #7353

Open
tustvold opened this issue Aug 21, 2023 · 14 comments
Open

Make ScalarValue an ArrayRef Wrapper #7353

tustvold opened this issue Aug 21, 2023 · 14 comments
Labels
enhancement New feature or request

Comments

@tustvold
Copy link
Contributor

Is your feature request related to a problem or challenge?

We have recently standardised a scalar abstraction upstream, based around the existing Array abstractions.

The rationale behind this is expanded upon here apache/arrow-rs#4393 (comment)

Describe the solution you'd like

I would like ScalarValue to just be a wrapper around an ArrayRef. This would avoid a huge amount of duplicated and/or boilerplate code, whilst also discouraging implementing non-vectorizable kernels based around ScalarValue.

Describe alternatives you've considered

No response

Additional context

No response

@alamb
Copy link
Contributor

alamb commented Aug 21, 2023

FWIW using an ArrayRef for a single scalar value may be substantially more overhead than todays implementation, though the fact they can be copied cheaply might make it a wash

In general however, I think the basic idea is a great one and will simplify many things. Perhaps we can start at the edges: change code that today operates on ScalarValue to convert into ArrayRef and then call the appropriate arrow kernels

@jayzhan211
Copy link
Contributor

jayzhan211 commented Aug 27, 2023

ScalarValue::List(Option<Vec> will change to something like ScalarValue::List(ArrayRef)?

It would be really nice if we have ArrayRef for ScalarValue for #7242. The reason why I need to introduce ColumnValue::Array for Accumulator is that we don't have ArrayRef inside ScalarValue::List!!

@tustvold
Copy link
Contributor Author

#7352 could be a first step but eventually it would be good to just use an ArrayRef of length 1 for all scalars, inline with the design in arrow-rs

@jayzhan211
Copy link
Contributor

@tustvold It seems there is no prost implementation for ArrayRef. Should it be introduced in arrow-rs?

@tustvold
Copy link
Contributor Author

I think the idea would be to use arrow-ipc

@alamb
Copy link
Contributor

alamb commented Sep 28, 2023

I personally think there is great value in using ArrayRef for things like ScalarValue::List and ScalarValue::Struct. I am not at all sure making ScalarValue::Int8 or other primitive types is a great idea

My rationale is that the overhead for converting ScalarValue::Int8 to an ArrayRef on demand when needed by calculation kernels or something similar, is likely to be lower than the overhead of copying ArrayRefs around (serializing them, for example).

@jayzhan211
Copy link
Contributor

Actually, I am working on List only, which is this link, I paste the incorrect one. #7352

@jayzhan211
Copy link
Contributor

jayzhan211 commented Sep 30, 2023

I think the idea would be to use arrow-ipc

arrow-ipc converts data to EncodedData (Vec<u8>) and writes them to a file, it is quite different from the current proto/ implementation, which just turns them into bytes (Vec<u8>) but doesn't write to a file. If I understand correctly, prost::Message is the equivalent concept of arrow-ipc's EncodedData.

I think there are two approaches

If we remain from_proto / to_proto for ScalarValue::List , we can get the ipc-message from arrow-ipc and store those Vec in prost::Message. Not sure whether we can do this but since they are all Vec, so probably doable.

The other way is we introduce from_ipc / to_ipc for ScalarValue::List and no prost for ScalarValue::List, and use write/read to store EncodedData into a given filename, but I am not sure whether mixing prost and arrow-ipc makes sense while building logical plan and physical plan.

@tustvold How do you think about these two approaches?

@jayzhan211
Copy link
Contributor

https://github.com/apache/arrow-datafusion/actions/runs/6362768232/job/17278040087?pr=7629

Anyone know how to fix this kind of CI?

= note: /usr/bin/ld: final link failed: No space left on device
collect2: error: ld returned 1 exit status

@alamb
Copy link
Contributor

alamb commented Oct 2, 2023

Anyone know how to fix this kind of CI?

I think it was fixed by @sarutak in #7706

So merging up from apache/main should fix the issue

@jayzhan211
Copy link
Contributor

I think the idea would be to use arrow-ipc

arrow-ipc converts data to EncodedData (Vec) and writes them to a file, it is quite different from the current proto/ implementation, which just turns them into bytes (Vec) but doesn't write to a file. If I understand correctly, prost::Message is the equivalent concept of arrow-ipc's EncodedData.

I think there are two approaches

If we remain from_proto / to_proto for ScalarValue::List , we can get the ipc-message from arrow-ipc and store those Vec in prost::Message. Not sure whether we can do this but since they are all Vec, so probably doable.

The other way is we introduce from_ipc / to_ipc for ScalarValue::List and no prost for ScalarValue::List, and use write/read to store EncodedData into a given filename, but I am not sure whether mixing prost and arrow-ipc makes sense while building logical plan and physical plan.

@tustvold How do you think about these two approaches?

Maybe I could just convert ArrayRef to Vec, probably the easiest fix.

@alamb
Copy link
Contributor

alamb commented Oct 3, 2023

Maybe I could just convert ArrayRef to Vec, probably the easiest fix.

I am not sure that would work for all array types (like List, for example)

@jayzhan211
Copy link
Contributor

Maybe I could just convert ArrayRef to Vec, probably the easiest fix.

I am not sure that would work for all array types (like List, for example)

Fail to handle nested list. We should directly handle Array to proto/ipc-like things.

@jayzhan211
Copy link
Contributor

If we remain from_proto / to_proto for ScalarValue::List , we can get the ipc-message from arrow-ipc and store those Vec in prost::Message. Not sure whether we can do this but since they are all Vec, so probably doable.

I found this quite easy :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants