
feat(sdk): row-based access to results in sdk #2903

Merged: 6 commits into main on Apr 22, 2024

Conversation

tychoish (Contributor):

The bindings have similar functionality via roughly-native mechanisms (pyarrow, etc.), but if this is going to be useful for interacting with results in tests, something like this is going to be needed.

@universalmind303 (Contributor) left a comment:

Unless there is a very compelling use case for this, I'd be very hesitant about exposing this functionality. This is an anti-pattern when working with columnar data.

At a minimum, I'd want to see a docstring that explicitly states the performance implications of row-based iteration. Something like:

Row iteration is not optimal, as the underlying data is stored in a columnar format.

Note that this method should not be used in place of native (columnar) operations, due to the high cost of materializing all data into a map.

It should be used only when you need to move values out of the columnar format, or into some other object that cannot operate directly on Arrow/DataFusion columns.
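
For illustration, a sketch of how that docstring might sit on the method under review (the wording is a suggestion, not final text):

```rust
/// Iterate over the results row by row.
///
/// Row iteration is not optimal, as the underlying data is stored in a
/// columnar format; materializing every value into a map is expensive.
/// Use this only to move values out of the columnar format, or into
/// objects that cannot operate directly on Arrow/DataFusion columns.
pub fn iter(&self) -> impl Iterator<Item = RowMap> {
    self.0.clone().into_iter()
}
```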

Comment on lines +366 to +368
```rust
pub fn iter(&self) -> impl Iterator<Item = RowMap> {
    self.0.clone().into_iter()
}
```
Contributor:

Should implement `Iterator` instead.
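
One way to follow that suggestion would be a standard `IntoIterator` impl on the wrapper; a minimal sketch, assuming `RowMapBatch(Vec<RowMap>)` as defined later in the diff:

```rust
// Sketch: a standard IntoIterator impl lets callers write
// `for row in batch { ... }` without an ad-hoc iter() that clones.
impl IntoIterator for RowMapBatch {
    type Item = RowMap;
    type IntoIter = std::vec::IntoIter<RowMap>;

    fn into_iter(self) -> Self::IntoIter {
        self.0.into_iter()
    }
}
```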

```rust
for (idx, field) in schema.fields.into_iter().enumerate() {
    record.insert(
        field.name().to_owned(),
        ScalarValue::try_from_array(batch.column(idx), row)?,
    );
}
```
Contributor:

Considering this operation is already very expensive, I'd prefer to see some fast paths instead of using `ScalarValue::try_from_array`, which has a lot of bounds checks and type casts that make this even more expensive than it needs to be.
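
A minimal sketch of such a fast path, assuming the arrow types re-exported by datafusion; the type coverage is illustrative, and anything unmatched would fall back to `ScalarValue::try_from_array`:

```rust
use datafusion::arrow::array::{Array, Int64Array, StringArray};
use datafusion::scalar::ScalarValue;

// Hypothetical fast path: downcast the column once per type and index the
// row directly, skipping try_from_array's generic dispatch. Returns None
// for unhandled types so the caller can fall back to try_from_array.
fn fast_scalar(col: &dyn Array, row: usize) -> Option<ScalarValue> {
    if let Some(a) = col.as_any().downcast_ref::<Int64Array>() {
        return Some(ScalarValue::Int64(a.is_valid(row).then(|| a.value(row))));
    }
    if let Some(a) = col.as_any().downcast_ref::<StringArray>() {
        let v = a.is_valid(row).then(|| a.value(row).to_owned());
        return Some(ScalarValue::Utf8(v));
    }
    None
}
```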

Member:

This seems fine to me. The bounds check is pretty much a constant assert, and the downcast needs to happen somewhere; here is as good a place as any.

We use try_from_array for the pg stuff as well: https://github.com/GlareDB/glaredb/blob/main/crates/pgsrv/src/codec/server.rs#L327-L351 (wrapped, but internally this calls the same function)

Co-authored-by: universalmind303 <cory.grinstead@gmail.com>
@tychoish (Contributor, Author) commented Apr 18, 2024:

We could feature flag this for testing, but this isn't (and, for the reasons you point out, shouldn't be) used in production code, and we're not doing that in this PR. I'm definitely willing to use a different scalar value wrapper, or another conversion method if you have a better one in mind, but the performance can be improved later, if needed. (The fact that this code isn't, and shouldn't be, in any critical path means "if needed" is unlikely to be true.)

Anyway, this is not radically different, from a design perspective, from the `to_pydict` method that we use in the tests for the bindings.

Comment on lines 316 to 320
```rust
/// RowMap represents a single record in an ordered map.
type RowMap = indexmap::IndexMap<String, ScalarValue>;

/// RowMapBatch is equivalent to a row-based view of a record batch.
pub struct RowMapBatch(Vec<RowMap>);
```
Member:

I'd add a quick comment that this should be used sparingly / only for tests.

Also, we could use an `Arc<str>` instead of `String` for the key if we want to keep the size down.

tychoish (Contributor, Author):

arc str!

My other theory was to just use the bson library, but the bson value enum has fewer types than the scalar value type, so avoiding more downcasting than necessary seems good.
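
For reference, a sketch of the `Arc<str>` variant being agreed on here (keys would then be built with something like `Arc::from(field.name().as_str())`):

```rust
use std::sync::Arc;

use datafusion::scalar::ScalarValue;

/// RowMap keyed by Arc<str>: cloned rows share the column-name
/// allocations instead of copying a String per row.
type RowMap = indexmap::IndexMap<Arc<str>, ScalarValue>;
```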

universalmind303 and others added 3 commits April 19, 2024 11:58
This fixes a bug introduced by datafusion 36: some operations that previously failed during optimization, such as invalid casts, now fail at runtime.

Since they now fail at runtime, we would still create the catalog entry and the table, and only fail during the insert afterwards. This left both the catalog and the storage in a bad state that didn't accurately reflect the operation, e.g.:

```sql
create table invalid_ctas as (select cast('test' as int) as 'bad_cast');
```

This updated the catalog and created a table for `invalid_ctas`, but querying it would return an error.


This PR makes sure that the operation is successful before committing the changes. It does so by exposing some new methods on the catalog client, `commit_state`, `mutate_and_commit`, and `mutate`, in place of the previous `mutate`.

The existing code was refactored to use `mutate_and_commit`, which is the same as the old `mutate`. The code that requires the commit semantics (create table) now uses `mutate` to first get an uncommitted catalog state with those changes, then performs all of its other actions:

- create the "native" table
- _optionally_ insert into the table
- commit the catalog state

If any of the operations before the commit fail, then the catalog
mutations are never committed.
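
A minimal, self-contained sketch of that ordering; every type and method name below is a hypothetical stand-in mirroring the description above, not the real catalog client API:

```rust
// Stub types standing in for the real catalog/storage machinery.
struct Mutation;
struct RecordBatch;
struct CatalogState;
struct NativeTable;
struct CatalogClient;
type Error = String;

impl CatalogClient {
    // Apply a mutation, returning an *uncommitted* catalog state.
    fn mutate(&mut self, _op: Mutation) -> Result<CatalogState, Error> {
        Ok(CatalogState)
    }
    // Persist a previously mutated state.
    fn commit_state(&mut self, _state: CatalogState) -> Result<(), Error> {
        Ok(())
    }
}

impl NativeTable {
    fn create(_state: &CatalogState) -> Result<Self, Error> {
        Ok(NativeTable)
    }
    fn insert(&self, _batches: Vec<RecordBatch>) -> Result<(), Error> {
        Ok(())
    }
}

fn create_table_as(
    client: &mut CatalogClient,
    op: Mutation,
    input: Option<Vec<RecordBatch>>,
) -> Result<(), Error> {
    // 1. Mutate the catalog without committing anything yet.
    let state = client.mutate(op)?;
    // 2. Create the backing "native" table.
    let table = NativeTable::create(&state)?;
    // 3. Optionally insert; an invalid cast (as in the CTAS above) fails
    //    here, before any catalog or storage change is committed.
    if let Some(batches) = input {
        table.insert(batches)?;
    }
    // 4. Only commit once every fallible step has succeeded.
    client.commit_state(state)
}
```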

---------

Co-authored-by: Sean Smith <scsmithr@gmail.com>
@tychoish merged commit 7cc2862 into main on Apr 22, 2024
26 checks passed
@tychoish deleted the tycho/sdk-row-access branch on Apr 22, 2024 at 13:45