feat(sdk): row-based access to results in sdk #2903
Conversation
Unless there is a very compelling use case for this, I'd be very hesitant about exposing this functionality. This is an anti-pattern when working with columnar data.
At a minimum, I'd want to see a docstring that explicitly states the performance implications of row-based iteration. Something like:
Row iteration is not optimal, as the underlying data is stored in a columnar format.
Note that this method should not be used in place of native (columnar) operations, due to the high cost of materializing all data into a map;
it should be used only when you need to move values out of the columnar format, or into another object that cannot operate directly on Arrow/DataFusion columns.
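For concreteness, here is a sketch of how that suggested docstring might look on the method under review (the types are copied from the diff later in this thread, and the wording is the suggestion above, lightly edited; this is an illustration, not code from the PR):

```rust
use datafusion::scalar::ScalarValue;

/// An ordered map representing a single row (type from the PR diff).
type RowMap = indexmap::IndexMap<String, ScalarValue>;

/// A row-based view of a record batch (type from the PR diff).
pub struct RowMapBatch(Vec<RowMap>);

impl RowMapBatch {
    /// Returns an iterator over the rows of this batch.
    ///
    /// Row iteration is not optimal, as the underlying data is stored in a
    /// columnar format. Note that this method should not be used in place of
    /// native (columnar) operations, due to the high cost of materializing
    /// all data into a map; use it only when you need to move values out of
    /// the columnar format, or into another object that cannot operate
    /// directly on Arrow/DataFusion columns.
    pub fn iter(&self) -> impl Iterator<Item = RowMap> {
        self.0.clone().into_iter()
    }
}
```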
pub fn iter(&self) -> impl Iterator<Item = RowMap> {
    self.0.clone().into_iter()
}
Should implement Iterator instead.
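A minimal sketch of that suggestion, assuming the RowMapBatch shape shown later in this thread: implementing IntoIterator for both the owned and the borrowed batch, instead of a bespoke iter method, so callers can use ordinary for loops and the borrowed form avoids cloning:

```rust
use datafusion::scalar::ScalarValue;

type RowMap = indexmap::IndexMap<String, ScalarValue>;
pub struct RowMapBatch(Vec<RowMap>);

// Consuming iteration: `for row in batch { ... }`.
impl IntoIterator for RowMapBatch {
    type Item = RowMap;
    type IntoIter = std::vec::IntoIter<RowMap>;

    fn into_iter(self) -> Self::IntoIter {
        self.0.into_iter()
    }
}

// Borrowed iteration: `for row in &batch { ... }`, no cloning required.
impl<'a> IntoIterator for &'a RowMapBatch {
    type Item = &'a RowMap;
    type IntoIter = std::slice::Iter<'a, RowMap>;

    fn into_iter(self) -> Self::IntoIter {
        self.0.iter()
    }
}
```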
crates/glaredb/src/lib.rs
Outdated
for (idx, field) in schema.fields.into_iter().enumerate() {
    record.insert(
        field.name().to_owned(),
        ScalarValue::try_from_array(batch.column(idx), row)?,
Considering this operation is already very expensive, I'd prefer to see some fast paths instead of using ScalarValue::try_from_array, which has a lot of bounds checks and type casts that make this even more expensive than it needs to be.
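For illustration, a hedged sketch of the kind of fast path being suggested (not code from the PR; it assumes datafusion's re-exported arrow types): downcast once for a couple of common types and read the value directly, falling back to the generic conversion for everything else.

```rust
use datafusion::arrow::array::{Array, ArrayRef, Int64Array, StringArray};
use datafusion::arrow::datatypes::DataType;
use datafusion::error::Result;
use datafusion::scalar::ScalarValue;

// Hypothetical helper: fast paths for common primitive types, with the
// generic (and more expensive) conversion as the fallback.
fn value_at(array: &ArrayRef, row: usize) -> Result<ScalarValue> {
    if array.is_null(row) {
        // Delegate so we get a null of the correct type.
        return ScalarValue::try_from_array(array, row);
    }
    Ok(match array.data_type() {
        DataType::Int64 => {
            let arr = array.as_any().downcast_ref::<Int64Array>().unwrap();
            ScalarValue::Int64(Some(arr.value(row)))
        }
        DataType::Utf8 => {
            let arr = array.as_any().downcast_ref::<StringArray>().unwrap();
            ScalarValue::Utf8(Some(arr.value(row).to_string()))
        }
        // Generic fallback: handles all remaining types.
        _ => ScalarValue::try_from_array(array, row)?,
    })
}
```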
This seems fine to me. The bounds check is pretty much a constant assert, and the downcast needs to happen somewhere, and here is as good as any.
We use try_from_array for the pg stuff as well: https://github.com/GlareDB/glaredb/blob/main/crates/pgsrv/src/codec/server.rs#L327-L351 (wrapped, but internally this calls the same function)
We could feature-flag this for testing, but this isn't (and, for the reasons you point out, shouldn't be) used in production code, and we're not going to do that in this PR. I'm definitely willing to use a different scalar value wrapper, or another conversion method, if you have a better one in mind, but the performance can be improved later, if needed. (The fact that this code isn't and shouldn't be in any critical path means "if needed" is unlikely to be true.) Anyway, from a design perspective this is not radically different from the to_pydict method that we use in the tests for the bindings.
crates/glaredb/src/lib.rs
Outdated
/// RowMap represents a single record in an ordered map.
type RowMap = indexmap::IndexMap<String, ScalarValue>;

/// RowMapBatch is equivalent to a row-based view of a record batch.
pub struct RowMapBatch(Vec<RowMap>);
I'd add a quick comment that this should be used sparingly / for tests.
Also, we could use an Arc<str> instead of String for the key if we want to keep the size down.
arc str!
My other theory was to just use the bson library, but the bson value enum has fewer types than the scalar value type, so avoiding more downcasting than necessary seems good
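A quick sketch of the Arc<str> idea (hypothetical, not code from the PR): the key is allocated once per column and cheaply reference-counted into each row's map, instead of cloning a String per row, so an N-row batch stores one allocation per column name.

```rust
use std::sync::Arc;

use datafusion::scalar::ScalarValue;

// Hypothetical variant of RowMap with shared keys.
type RowMap = indexmap::IndexMap<Arc<str>, ScalarValue>;

fn rows_for_column(field_name: &str, rows: usize) -> Vec<RowMap> {
    // One shared allocation for the key...
    let key: Arc<str> = Arc::from(field_name);
    (0..rows)
        .map(|_| {
            let mut record = RowMap::new();
            // ...reference-counted into each row's map (placeholder value).
            record.insert(Arc::clone(&key), ScalarValue::Null);
            record
        })
        .collect()
}
```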
This fixes a bug introduced by DataFusion 36: some operations that previously failed during optimization, such as invalid casts, now fail at runtime. Since they now fail at runtime, we would still create the catalog entry and the table, and only fail during the insert afterwards. This left both the catalog and the storage in a bad state that didn't accurately reflect the operation. e.g.

```sql
create table invalid_ctas as (select cast('test' as int) as 'bad_cast');
```

This updated the catalog and created a table for `invalid_ctas`, but querying it would return an error.

This PR makes sure that the operation is successful before committing the changes. It does so by exposing some new methods on the catalog client: `commit_state`, `mutate_and_commit`, and `mutate`, instead of the previous `mutate`. The existing code was refactored to use `mutate_and_commit`, which is the same as the old `mutate`. The code that requires the commit semantics (create table) now uses `mutate` to first get an uncommitted catalog state with those changes, then does all of its other actions:

- create the "native" table
- optionally insert into the table
- commit the catalog state

If any of the operations before the commit fail, the catalog mutations are never committed.

---------

Co-authored-by: Sean Smith <scsmithr@gmail.com>
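A hedged sketch of that flow: the method names (mutate, mutate_and_commit, commit_state) come from the description above, but the types and signatures below are stand-ins for illustration, not the actual GlareDB catalog client API.

```rust
struct Mutation;       // stand-in for a catalog mutation
struct CatalogState;   // stand-in for the uncommitted catalog state
struct CatalogClient;  // stand-in for the catalog client

type Result<T> = std::result::Result<T, String>;

impl CatalogClient {
    // Apply mutations and return the *uncommitted* state.
    fn mutate(&mut self, _m: Vec<Mutation>) -> Result<CatalogState> {
        Ok(CatalogState)
    }
    // Commit a previously mutated state.
    fn commit_state(&mut self, _s: CatalogState) -> Result<()> {
        Ok(())
    }
    // Convenience matching the old `mutate`: mutate + immediate commit.
    fn mutate_and_commit(&mut self, m: Vec<Mutation>) -> Result<()> {
        let state = self.mutate(m)?;
        self.commit_state(state)
    }
}

// CTAS-style flow: only commit the catalog once every fallible step succeeds.
fn create_table_as(client: &mut CatalogClient) -> Result<()> {
    let state = client.mutate(vec![Mutation])?;

    create_native_table(&state)?; // fallible: create the "native" table
    insert_rows(&state)?;         // fallible: optional insert (e.g. CTAS)

    // If either step above failed, we returned early and `state` was
    // dropped without ever being committed.
    client.commit_state(state)
}

fn create_native_table(_s: &CatalogState) -> Result<()> { Ok(()) }
fn insert_rows(_s: &CatalogState) -> Result<()> { Ok(()) }
```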
closes #2683
The bindings have similar functionality via roughly-native mechanisms
(pyarrow, etc.), but if this is going to be useful for interacting with
results in tests, something like this is going to be needed.