Add remaining data types to schema visiting in ffi #187

nicklan · 2024-05-01T21:41:15Z

Adds more types, and uses a macro to reduce repetition for the primitive ones. Also adds some rust docs to try and explain how the visitors for schema work.

Tested by adding a visitor to the read_table program that builds and prints the schema.

Output:

$ ./read_table file:///home/nick/databricks/delta-kernel-rs/acceptance/tests/dat/out/reader_tests/generated/nested_types/delta/
Reading table at file:///home/nick/databricks/delta-kernel-rs/acceptance/tests/dat/out/reader_tests/generated/nested_types/delta/
version: 0

Schema:
├─ pk: integer
├─ struct: struct
│  ├─ float64: double
│  └─ bool: boolean
├─ array: array
│  └─ array_data_type: short
└─ map: map
   ├─ map_key_type: string
   └─ map_value_type: integer

Got some data
  Of this data, here is a selection vector
    sel[0] = 0
    sel[1] = 0
    sel[2] = 0
    sel[3] = 1
called back to actually read!
  path: part-00000-3b8ca3f0-6444-4ed5-8961-c605cba95bf1-c000.snappy.parquet
  No selection vector for this call
  no partition here
Iterator done
All done

scovich

First pass

ffi/examples/read-table/schema.h

scovich · 2024-05-04T12:40:33Z

ffi/examples/read-table/schema.h

+  printf("Making a list of lenth %i with id %i\n", reserve, id);
+#endif
+  builder->list_count++;
+  builder->lists = realloc(builder->lists, sizeof(SchemaItemList) * builder->list_count);


why realloc, out of curiosity? What previous usage could there have been?

(if there was some valid previous usage, wouldn't we need to free the lists it contained?)

Yikes, this is being used to grow the array... but realloc often (usually?) moves the memory to a new location, which would invalidate all the child pointers. To avoid dangling pointers, SchemaItem::children needs to be an index into SchemaBuilder::lists. Doing that would also make it obvious in the free_builder function, why we don't need to free the child list when freeing its parent.

Yeah, real bad on the pointers 🤦🏽 .

I've updated things to use indexes.
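A rough sketch of the index-based approach, for readers following along. The struct layout and the helper names (`make_list`, `get_list`, `NO_CHILDREN`) are assumptions for illustration and are not copied from the actual schema.h. Each item stores the id of its child list instead of a pointer, so a later realloc of `lists` cannot leave it dangling.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical layout: names follow the discussion, not the real schema.h. */
#define NO_CHILDREN UINTPTR_MAX

typedef struct {
  const char* name;   /* not owned in this sketch */
  uintptr_t children; /* index into SchemaBuilder.lists, or NO_CHILDREN */
} SchemaItem;

typedef struct {
  uint32_t len;
  SchemaItem* list;
} SchemaItemList;

typedef struct {
  uintptr_t list_count;
  SchemaItemList* lists;
} SchemaBuilder;

/* Append one list and return its id. realloc may move `lists`; that is
 * harmless here because ids, unlike pointers into `lists`, stay valid. */
uintptr_t make_list(SchemaBuilder* builder, uint32_t reserve) {
  uintptr_t id = builder->list_count;
  SchemaItemList* grown =
      realloc(builder->lists, sizeof(SchemaItemList) * (id + 1));
  assert(grown != NULL);
  builder->lists = grown;
  builder->list_count = id + 1;
  grown[id].len = 0;
  grown[id].list = calloc(reserve, sizeof(SchemaItem));
  return id;
}

/* Resolve an id to a pointer only at the point of use. */
SchemaItemList* get_list(SchemaBuilder* builder, uintptr_t id) {
  assert(id < builder->list_count);
  return &builder->lists[id];
}

/* Every list is owned by the builder, so one flat loop frees them all;
 * no parent has to free its child list. */
void free_builder(SchemaBuilder* builder) {
  for (uintptr_t i = 0; i < builder->list_count; i++) {
    free(builder->lists[i].list);
  }
  free(builder->lists);
  builder->lists = NULL;
  builder->list_count = 0;
}
```

With this shape, a pointer obtained from `get_list` should not be held across a call to `make_list`; the id should be looked up again instead.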

Why realloc, out of curiosity? What previous usage could there have been?

We don't know the total number of lists ahead of time. SchemaBuilder.lists is an array of SchemaItemList, so this realloc is just adding one more to it (list_count gets incremented before this realloc call). Previous usage was when we made the builder and did

SchemaBuilder builder = { .list_count = 0, .lists = calloc(0, sizeof(SchemaItem*)), };

as well as each other call to this method.

(if there was some valid previous usage, wouldn't we need to free the lists it contained?)

realloc does that for you (afaiu): "Unless ptr is NULL, it must have been returned by an earlier call to malloc or related functions. If the area pointed to was moved, a free(ptr) is done."

(if there was some valid previous usage, wouldn't we need to free the lists it contained?)

realloc does that for you (afaiu)

I was referring to the lists of SchemaItem that each SchemaItemList in the builder referred to. But it's not a problem as you say, because we're doing realloc to extend rather than replace.

One gotcha with using realloc this way -- it has O(n**2) cost, where n is the final list_count. The usual workaround is to increase the capacity of the array by a fixed ratio like 3/2; in return for potentially 50% wasted space, it is amortized O(n) cost. Probably doesn't matter for this example code tho.

Yep. I considered increasing more each time, but:
a) it requires tracking an extra capacity param and
b) for example code as you say it's probably overkill
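For reference, the fixed-ratio growth mentioned above might look something like the sketch below. `IntVec` and `intvec_push` are hypothetical names, and a plain `int` array stands in for the list array; the point is only the extra `capacity` field and the 3/2 growth step.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical growable array; `capacity` is the extra field to track. */
typedef struct {
  size_t count;
  size_t capacity;
  int* items;
} IntVec;

/* Append a value, growing by roughly 3/2 when full.
 * Returns 0 on success, -1 if allocation fails. */
int intvec_push(IntVec* v, int value) {
  if (v->count == v->capacity) {
    size_t new_cap = v->capacity < 4 ? 4 : v->capacity + v->capacity / 2;
    int* grown = realloc(v->items, new_cap * sizeof(int));
    if (grown == NULL) {
      return -1; /* the old block is still valid and still owned by v */
    }
    v->items = grown;
    v->capacity = new_cap;
  }
  v->items[v->count++] = value;
  return 0;
}
```

Because each reallocation grows the capacity geometrically, the number of reallocations is logarithmic in the final count, which is what makes the total copying cost amortized linear.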

scovich · 2024-05-04T12:43:55Z

ffi/examples/read-table/schema.h

+  SchemaBuilder *builder = (SchemaBuilder*)data;
+  char* name_ptr = allocate_name(name);
+  char* type = malloc(16 * sizeof(char));
+  sprintf(type, "decimal(%i)(%i)", precision, scale);


Always use snprintf! This string could technically be up to 19 bytes long, for example.

Good call to use snprintf, but how would it be 19 bytes? The protocol says precision and scale max out at 38, and "decimal(38)(38)" is 16 bytes.

That's what the spec says, yes... but to avoid buffer overrun risk, we have to handle the biggest i8/u8 values an adversary could throw at us.
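One way to avoid guessing the buffer size at all is to let snprintf report the length first. This is a minimal sketch, assuming precision arrives as a `uint8_t` and scale as an `int8_t`; the helper name `format_decimal_type` is hypothetical and is not what schema.h uses.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Returns a heap-allocated string such as "decimal(38)(10)", or NULL on
 * failure. The caller frees the result. Parameter types are assumptions. */
char* format_decimal_type(uint8_t precision, int8_t scale) {
  /* First pass: with a NULL buffer and size 0, snprintf only reports how
   * many characters would be written, excluding the terminating NUL. */
  int needed = snprintf(NULL, 0, "decimal(%u)(%d)", (unsigned)precision, (int)scale);
  if (needed < 0) {
    return NULL;
  }
  char* type = malloc((size_t)needed + 1);
  if (type == NULL) {
    return NULL;
  }
  /* Second pass: write into a buffer that is exactly large enough. */
  snprintf(type, (size_t)needed + 1, "decimal(%u)(%d)", (unsigned)precision, (int)scale);
  return type;
}
```

A fixed stack buffer with `snprintf(buf, sizeof(buf), ...)` would also be safe, since snprintf truncates rather than overruns; the two-pass form just avoids truncation as well.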

scovich · 2024-05-04T13:42:09Z

ffi/src/lib.rs

        }
        child_list_id
    }

+    fn visit_array_item(visitor: &EngineSchemaVisitor, at: &ArrayType) -> usize {
+        let child_list_id = (visitor.make_field_list)(visitor.data, 1);
+        visit_schema_item(&at.element_type, "array_data_type", visitor, child_list_id);


Shouldn't this be something like "array_element"? AFAIK it corresponds to an actual column in the parquet?
(ditto for "map_key" and "map_value" below).

We also need to decide how to present these types to the engine, since parquet can have several different physical representations for LIST and MAP types (legacy 2-level vs. 3-level preferred), and we probably want to protect the engine from having to care what the underlying parquet ended up containing -- @roeap knows more about that situation. Either kernel or parquet reader will have to deal with that name mapping problem.

scovich · 2024-05-04T13:56:19Z

ffi/src/lib.rs

-    // Creates a new field list, optionally reserving capacity up front
-    make_field_list: extern "C" fn(data: *mut c_void, reserve: usize) -> usize,
+    /// opaque state pointer
+    pub data: *mut c_void,


Should we require NonNull<c_void>? I guess we don't actually care what the engine does with its own pointer tho?

yeah, like you might want something that just prints out with no need for data, so it could be null.
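As an illustration of the print-only case, an engine-side callback could simply ignore the state pointer. The callback shape below is an assumption (it presumes `usize` surfaces as `uintptr_t` in the generated header), and `print_make_field_list` is a hypothetical name.

```c
#include <stdint.h>
#include <stdio.h>

/* Assumed C-side shape of the make_field_list callback. */
typedef uintptr_t (*MakeFieldListFn)(void* data, uintptr_t reserve);

/* A print-only callback: it never dereferences `data`, so the visitor's
 * data pointer can legitimately be NULL. Ids come from a local counter. */
uintptr_t print_make_field_list(void* data, uintptr_t reserve) {
  (void)data; /* intentionally unused */
  static uintptr_t next_id = 0;
  uintptr_t id = next_id++;
  printf("asked for a list of length %lu, returning id %lu\n",
         (unsigned long)reserve, (unsigned long)id);
  return id;
}
```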

ffi/src/lib.rs

Co-authored-by: Ryan Johnson <scovich@users.noreply.github.com>

scovich

LGTM. Bunch of nits to consider before merge, if you want.

scovich · 2024-05-07T02:00:01Z

ffi/examples/read-table/schema.h

 */

 // If you want the visitor to print out what it's being asked to do at each step, uncomment the
 // following line
-// #define PRINT_VISITS
+//#define PRINT_VISITS


nit: Could also potentially leverage variadic macro args for this:

#ifdef VERBOSE
#define PRINT_VISIT(...) printf(__VA_ARGS__)
#else
#define PRINT_VISIT(...)
#endif

and then just use it:

PRINT_VISIT("Making a list of length %i with id %i\n", reserve, id);

(a bit more work up front, but less #ifdef magic at use sites)

Also more fun :) switched to that

ffi/examples/read-table/schema.h

scovich · 2024-05-07T03:07:06Z

ffi/examples/read-table/schema.h

+        printf("│  ");
+      }
+    }
+    SchemaItem* item = &list->list[i];


We should probably decide whether to use &array[index] vs. array + index and be consistent?

(I personally prefer the former because it's obvious at a glance that it's not normal arithmetic... but I recognize it's also a bit more typing)

Yeah, makes sense. I tend to use array + index when I know I just want the pointer, and array[index] when I want to use the item there (like array[index].name), but consistency is good. I've updated to all &array[index] style for getting pointers.

scovich · 2024-05-09T02:22:38Z

ffi/examples/read-table/schema.h

+  printf("Making a list of lenth %i with id %i\n", reserve, id);
+#endif
+  builder->list_count++;
+  builder->lists = realloc(builder->lists, sizeof(SchemaItemList) * builder->list_count);


(if there was some valid previous usage, wouldn't we need to free the lists it contained?)

realloc does that for you (afaiu)

I was referring to the lists of SchemaItem that each SchemaItemList in the builder referred to. But it's not a problem as you say, because we're doing realloc to extend rather than replace.

scovich · 2024-05-09T02:25:46Z

ffi/examples/read-table/schema.h

+  printf("Making a list of lenth %i with id %i\n", reserve, id);
+#endif
+  builder->list_count++;
+  builder->lists = realloc(builder->lists, sizeof(SchemaItemList) * builder->list_count);


One gotcha with using realloc this way -- it has O(n**2) cost, where n is the final list_count. The usual workaround is to increase the capacity of the array by a fixed ratio like 3/2; in return for potentially 50% wasted space, it is amortized O(n) cost. Probably doesn't matter for this example code tho.

ffi/src/lib.rs

scovich · 2024-05-09T02:35:21Z

ffi/src/lib.rs

+            DataType::Map(mt) => call!(
+                visit_map,
+                mt.value_contains_null,
+                visit_map_types(visitor, mt)
+            ),


Out of curiosity, what decides whether to have => call!( vs. => {?

Suggested change

DataType::Map(mt) => call!(
    visit_map,
    mt.value_contains_null,
    visit_map_types(visitor, mt)
),

DataType::Map(mt) => {
    call!(visit_map, mt.value_contains_null, visit_map_types(visitor, mt))
}

not entirely sure, but i think for fmt the trailing ; does, at least if there is no macro involved.

Yeah, rustfmt is a bit odd here. It'll let me put the map call into {}s, but won't let it stay on one line.

It seems to use {}s if the body of the match case needs to go on its own line, so

&DataType::STRING => call!(visit_string), is okay because it's all one line, but

DataType::Array(at) => call!(visit_array, at.contains_null, visit_array_item(visitor, at)),

while valid, will be fmt'd to:

DataType::Array(at) => {
    call!(visit_array, at.contains_null, visit_array_item(visitor, at))
}

I've put the map case into {}s for a bit more consistency, but really can't find a formatting that will let it stay on one line, I think because fmt thinks it's just a bit too long.

roeap · 2024-05-10T10:19:20Z

ffi/src/lib.rs

+/// representation of a schema from a particular schema within kernel.
+///
+/// The model is list based. When the kernel needs a list, it will ask engine to allocate one of a
+/// particular size. Once allocated the engine returns an `id`, which can be any identifier the


nit: reading "which can be any identifier the engine wants" I thought this would be an opaque type, but IIUC we are assuming usize?

yeah, I was thinking that "identifier" was enough to make it clear, but i've amended to "integer identifier ([`usize`])" which is hopefully more clear (and links to the usize docs so implementers can understand what type it will be in c/c++)

roeap · 2024-05-10T10:21:18Z

ffi/src/lib.rs

+///      - For a map, visit the key and value, passing its special name ("map_key" or "map_value"),
+///        type, and value nullability (keys are never nullable)
+///      - For a list, visit the element, passing its special name ("array_element"), type, and
+///        nullability


nit: not sure if it makes that much of a difference, but should we be consistent with the arrow conversion in how we name the special fields?

I looked about and couldn't find any consistency online about what to name these. Agree that it's not great to have different names here than in our schema.rs code.

In the spirit of not bikeshedding too much I've opened #202 and will follow-up to rename everything properly.

roeap · 2024-05-10T10:28:32Z

ffi/src/lib.rs

+            DataType::Map(mt) => call!(
+                visit_map,
+                mt.value_contains_null,
+                visit_map_types(visitor, mt)
+            ),


not entirely sure, but i think for fmt the trailing ; does, at least if there is no macro involved.

roeap · 2024-05-10T10:37:40Z

Just left a few minor questions. Took me a little bit longer as I took the opportunity to go through most of the ffi code to get familiar.

Co-authored-by: Ryan Johnson <scovich@users.noreply.github.com>

nicklan changed the title ~~Add most types other than a few less common primitive ones~~ Add remaining data types May 1, 2024

nicklan changed the title ~~Add remaining data types~~ Add remaining data types to schema visiting in ffi May 1, 2024

nicklan force-pushed the add-most-schema-types-to-ffi branch 6 times, most recently from 86534d7 to 1aecefd Compare May 2, 2024 23:17

nicklan requested review from scovich and samansmink May 2, 2024 23:18

nicklan force-pushed the add-most-schema-types-to-ffi branch from 1aecefd to 2f475b4 Compare May 3, 2024 20:32

nicklan requested a review from zachschuermann May 3, 2024 20:32

scovich reviewed May 4, 2024


nicklan added 9 commits May 4, 2024 12:25

Add most types other than a few less common primitive ones

1d5a9dd

add remaining types

32428a5

include pragma once

7f59e54

include schema printing

bcc5a4f

docs mostly

9593038

fmt

6fd15cf

free schema!

4baa68a

cleanup

8747fe3

fix pointer issue

6c4a173

nicklan requested a review from roeap May 6, 2024 22:25

nicklan and others added 4 commits May 6, 2024 16:52

strndup + macro

87ef58c

Apply suggestions from code review

21a3402

Co-authored-by: Ryan Johnson <scovich@users.noreply.github.com>

Merge branch 'main' into add-most-schema-types-to-ffi

78378b8

don't forget long

0fc9bd7

nicklan force-pushed the add-most-schema-types-to-ffi branch from f3a27f9 to 0fc9bd7 Compare May 7, 2024 00:03

nicklan added 2 commits May 6, 2024 17:04

untabify :sigh:

a3ff414

Merge branch 'main' into add-most-schema-types-to-ffi

03c34db

nicklan added 8 commits May 7, 2024 15:50

fix up selection vector stuff

184e2c3

maps can have nulls + print if array/map has null in schema.h

0805f38

cleaner visit macro

76b942b

comment update

245da8a

clearer todo

d479631

fix schema printing for deep nesting

872b0e7

use schema appropriate names

47abc39

Merge branch 'main' into add-most-schema-types-to-ffi

84af4ba

nicklan requested a review from scovich May 8, 2024 20:15

scovich approved these changes May 9, 2024


roeap approved these changes May 10, 2024


nicklan and others added 6 commits May 10, 2024 11:17

Merge branch 'main' into add-most-schema-types-to-ffi

cd86766

macros are fun

93fc17c

Apply suggestions from code review

711b16b

Co-authored-by: Ryan Johnson <scovich@users.noreply.github.com>

doc update

953c3e8

slightly more consistent formatting

2498d10

more consistent array indexing

8cdf7f8

nicklan merged commit 663585d into delta-io:main May 10, 2024
8 checks passed
