
feat(sdk): row-based access to results in sdk #2903

Merged: 6 commits into main on Apr 22, 2024

Conversation

tychoish (Contributor):

The bindings have similar functionality via roughly-native mechanisms (pyarrow, etc.), but if this is going to be useful for interacting with results in tests, something like this is going to be needed.

@universalmind303 (Contributor) left a comment:

Unless there is a very compelling use case for this, I'd be very hesitant about exposing this functionality. This is an anti-pattern when working with columnar data.

At a minimum, I'd want to see a docstring that explicitly states the performance implications of row-based iteration. Something like:

Row iteration is not optimal, as the underlying data is stored in a columnar format.

Note that this method should not be used in place of native (columnar) operations, due to the high cost of materializing all data into a map.

It should be used only when you need to move values out of the columnar format, or into some other object that cannot operate directly on Arrow/DataFusion columns.
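
For illustration, a sketch of how that docstring might sit on the method under review (the wording is a suggestion, not final text):

```rust
/// Iterate over the results row by row.
///
/// Row iteration is not optimal, as the underlying data is stored in a
/// columnar format; materializing every value into a map is expensive.
/// Use this only to move values out of the columnar format, or into
/// objects that cannot operate directly on Arrow/DataFusion columns.
pub fn iter(&self) -> impl Iterator<Item = RowMap> {
    self.0.clone().into_iter()
}
```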

Comment on lines +366 to +368
```rust
pub fn iter(&self) -> impl Iterator<Item = RowMap> {
    self.0.clone().into_iter()
}
```
Contributor:

Should implement `Iterator` instead.
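
One way to follow that suggestion would be a standard `IntoIterator` impl on the wrapper; a minimal sketch, assuming `RowMapBatch(Vec<RowMap>)` as defined later in the diff:

```rust
// Sketch: a standard IntoIterator impl lets callers write
// `for row in batch { ... }` without an ad-hoc iter() that clones.
impl IntoIterator for RowMapBatch {
    type Item = RowMap;
    type IntoIter = std::vec::IntoIter<RowMap>;

    fn into_iter(self) -> Self::IntoIter {
        self.0.into_iter()
    }
}
```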

```rust
for (idx, field) in schema.fields.into_iter().enumerate() {
    record.insert(
        field.name().to_owned(),
        ScalarValue::try_from_array(batch.column(idx), row)?,
    );
}
```
Contributor:

Considering this operation is already very expensive, I'd prefer to see some fast paths instead of using `ScalarValue::try_from_array`, which has a lot of bounds checks and type casts that make this even more expensive than it needs to be.
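
A minimal sketch of such a fast path, assuming the arrow types re-exported by datafusion; the type coverage is illustrative, and anything unmatched would fall back to `ScalarValue::try_from_array`:

```rust
use datafusion::arrow::array::{Array, Int64Array, StringArray};
use datafusion::scalar::ScalarValue;

// Hypothetical fast path: downcast the column once per type and index the
// row directly, skipping try_from_array's generic dispatch. Returns None
// for unhandled types so the caller can fall back to try_from_array.
fn fast_scalar(col: &dyn Array, row: usize) -> Option<ScalarValue> {
    if let Some(a) = col.as_any().downcast_ref::<Int64Array>() {
        return Some(ScalarValue::Int64(a.is_valid(row).then(|| a.value(row))));
    }
    if let Some(a) = col.as_any().downcast_ref::<StringArray>() {
        let v = a.is_valid(row).then(|| a.value(row).to_owned());
        return Some(ScalarValue::Utf8(v));
    }
    None
}
```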

Member:

This seems fine to me. The bounds check is pretty much a constant assert, and the downcast needs to happen somewhere; here is as good a place as any.

We use try_from_array for the pg stuff as well: https://github.com/GlareDB/glaredb/blob/main/crates/pgsrv/src/codec/server.rs#L327-L351 (wrapped, but internally this calls the same function)

Co-authored-by: universalmind303 <cory.grinstead@gmail.com>
@tychoish (Contributor, Author) commented Apr 18, 2024:

We could feature flag this for testing, but this isn't (and, for the reasons you point out, shouldn't be) used in production code, and we're not doing that in this PR. I'm definitely willing to use a different scalar value wrapper, or another conversion method if you have a better one in mind, but the performance can be improved later, if needed. (The fact that this code isn't, and shouldn't be, in any critical path means "if needed" is unlikely to be true.)

Anyway, this is not radically different, from a design perspective, from the `to_pydict` method that we use in the tests for the bindings.

Comment on lines 316 to 320
```rust
/// RowMap represents a single record in an ordered map.
type RowMap = indexmap::IndexMap<String, ScalarValue>;

/// RowMapBatch is equivalent to a row-based view of a record batch.
pub struct RowMapBatch(Vec<RowMap>);
```
Member:

I'd add a quick comment that this should be used sparingly / only for tests.

Also, we could use an `Arc<str>` instead of `String` for the key if we want to keep the size down.

tychoish (Contributor, Author):

arc str!

My other theory was to just use the bson library, but the bson value enum has fewer types than the scalar value type, so avoiding more downcasting than necessary seems good.
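
For reference, a sketch of the `Arc<str>` variant being agreed on here (keys would then be built with something like `Arc::from(field.name().as_str())`):

```rust
use std::sync::Arc;

use datafusion::scalar::ScalarValue;

/// RowMap keyed by Arc<str>: cloned rows share the column-name
/// allocations instead of copying a String per row.
type RowMap = indexmap::IndexMap<Arc<str>, ScalarValue>;
```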

universalmind303 and others added 3 commits April 19, 2024 11:58
This fixes a bug introduced by datafusion 36: some operations that previously failed during optimization, such as invalid casts, now fail at runtime.

Since they now fail at runtime, we would still create the catalog entry and the table, and only fail during the insert afterwards. This left both the catalog and the storage in a bad state that didn't accurately reflect the operation, e.g.:

```sql
create table invalid_ctas as (select cast('test' as int) as 'bad_cast');
```

This updated the catalog and created a table for `invalid_ctas`, but querying it would return an error.


This PR makes sure that the operation is successful before committing the changes. It does so by exposing some new methods on the catalog client, `commit_state`, `mutate_and_commit`, and `mutate`, in place of the previous `mutate`.

The existing code was refactored to use `mutate_and_commit`, which is the same as the old `mutate`. The code that requires the commit semantics (create table) now uses `mutate` to first get an uncommitted catalog state with those changes, then performs all of its other actions:

- create the "native" table
- _optionally_ insert into the table
- commit the catalog state

If any of the operations before the commit fail, then the catalog
mutations are never committed.
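
A minimal, self-contained sketch of that ordering; every type and method name below is a hypothetical stand-in mirroring the description above, not the real catalog client API:

```rust
// Stub types standing in for the real catalog/storage machinery.
struct Mutation;
struct RecordBatch;
struct CatalogState;
struct NativeTable;
struct CatalogClient;
type Error = String;

impl CatalogClient {
    // Apply a mutation, returning an *uncommitted* catalog state.
    fn mutate(&mut self, _op: Mutation) -> Result<CatalogState, Error> {
        Ok(CatalogState)
    }
    // Persist a previously mutated state.
    fn commit_state(&mut self, _state: CatalogState) -> Result<(), Error> {
        Ok(())
    }
}

impl NativeTable {
    fn create(_state: &CatalogState) -> Result<Self, Error> {
        Ok(NativeTable)
    }
    fn insert(&self, _batches: Vec<RecordBatch>) -> Result<(), Error> {
        Ok(())
    }
}

fn create_table_as(
    client: &mut CatalogClient,
    op: Mutation,
    input: Option<Vec<RecordBatch>>,
) -> Result<(), Error> {
    // 1. Mutate the catalog without committing anything yet.
    let state = client.mutate(op)?;
    // 2. Create the backing "native" table.
    let table = NativeTable::create(&state)?;
    // 3. Optionally insert; an invalid cast (as in the CTAS above) fails
    //    here, before any catalog or storage change is committed.
    if let Some(batches) = input {
        table.insert(batches)?;
    }
    // 4. Only commit once every fallible step has succeeded.
    client.commit_state(state)
}
```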

---------

Co-authored-by: Sean Smith <scsmithr@gmail.com>
@tychoish merged commit 7cc2862 into main on Apr 22, 2024
26 checks passed
@tychoish deleted the tycho/sdk-row-access branch on Apr 22, 2024 at 13:45