-
Notifications
You must be signed in to change notification settings - Fork 65
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add remaining data types to schema visiting in ffi #187
Conversation
86534d7
to
1aecefd
Compare
1aecefd
to
2f475b4
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First pass
printf("Making a list of lenth %i with id %i\n", reserve, id); | ||
#endif | ||
builder->list_count++; | ||
builder->lists = realloc(builder->lists, sizeof(SchemaItemList) * builder->list_count); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why realloc
, out of curiosity? What previous usage could there have been?
(if there was some valid previous usage, wouldn't we need to free
the lists it contained?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yikes, this is being used to grow the array... but realloc
often (usually?) moves the memory to a new location, which would invalidate all the child pointers. To avoid dangling pointers, SchemaItem::children
needs to be an index into SchemaBuilder::lists
. Doing that would also make it obvious in the free_builder
function, why we don't need to free the child list when freeing its parent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, real bad on the pointers 🤦🏽 .
I've updated things to use indexes.
Why realloc, out of curiosity? What previous usage could there have been?
We don't know the total number of lists ahead of time. SchemaBuilder.lists
is an array of SchemaItemList
, so this realloc is just adding one more to it (list_count
gets incremented before this realloc call). Previous usage was when we made the builder and did
SchemaBuilder builder = {
.list_count = 0,
.lists = calloc(0, sizeof(SchemaItem*)),
};
as well as each other call to this method.
(if there was some valid previous usage, wouldn't we need to free the lists it contained?)
realloc
does that for you (afaiu): "Unless ptr is NULL, it must have been returned by an earlier call to malloc or related functions. If the area pointed to was moved, a free(ptr) is done."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(if there was some valid previous usage, wouldn't we need to free the lists it contained?)
realloc does that for you (afaiu)
I was referring to the lists of SchemaItem
that each SchemaItemList
in the builder referred to. But it's not a problem as you say, because we're doing realloc to extend rather than replace.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One gotcha with using realloc this way -- it has O(n**2)
cost, where n
is the final list_count
. The usual workaround is to increase the capacity of the array by a fixed ratio like 3/2; in return for potentially 50% wasted space, it is amortized O(n)
cost. Probably doesn't matter for this example code tho.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep. I considered increasing more each time, but:
a) it requires tracking an extra capacity param and
b) for example code as you say it's probably overkill
ffi/examples/read-table/schema.h
Outdated
SchemaBuilder *builder = (SchemaBuilder*)data; | ||
char* name_ptr = allocate_name(name); | ||
char* type = malloc(16 * sizeof(char)); | ||
sprintf(type, "decimal(%i)(%i)", precision, scale); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Always use snprintf
! This string could technically be up to 19 bytes long, for example.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good call to use snprintf, but how would it be 19 bytes? The protocol says precision and scale max out at 38, and "decimal(38)(38)" is 16 bytes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's what the spec says, yes... but to avoid buffer overrun risk, we have to handle the biggest i8/u8 values an adversary could throw at us.
ffi/src/lib.rs
Outdated
} | ||
child_list_id | ||
} | ||
|
||
fn visit_array_item(visitor: &EngineSchemaVisitor, at: &ArrayType) -> usize { | ||
let child_list_id = (visitor.make_field_list)(visitor.data, 1); | ||
visit_schema_item(&at.element_type, "array_data_type", visitor, child_list_id); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this be something like "array_element"? AFAIK it corresponds to an actual column in the parquet?
(ditto for "map_key" and "map_value" below).
We also need to decide how to present these types to the engine, since parquet can have several different physical representations for LIST and MAP types (legacy 2-level vs. 3-level preferred), and we probably want to protect the engine from having to care what the underlying parquet ended up containing -- @roeap knows more about that situation. Either kernel or parquet reader will have to deal with that name mapping problem.
// Creates a new field list, optionally reserving capacity up front | ||
make_field_list: extern "C" fn(data: *mut c_void, reserve: usize) -> usize, | ||
/// opaque state pointer | ||
pub data: *mut c_void, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we require NonNull<c_void>
? I guess we don't actually care what the engine does with its own pointer tho?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, like you might want something that just prints out with no need for data, so it could be null.
Co-authored-by: Ryan Johnson <scovich@users.noreply.github.com>
f3a27f9
to
0fc9bd7
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Bunch of nits to consider before merge, if you want.
ffi/examples/read-table/schema.h
Outdated
*/ | ||
|
||
// If you want the visitor to print out what it's being asked to do at each step, uncomment the | ||
// following line | ||
// #define PRINT_VISITS | ||
//#define PRINT_VISITS |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Could also potentially leverage variadic macro args for this:
#ifdef VERBOSE
#define PRINT_VISITS(...) printf(__VA_ARGS__)
#else
#define PRINT_VISIT(...)
#endif
and then just use it:
PRINT_VISIT("Making a list of length %i with id %i\n", reserve, id);
(a bit more work up front, but less #ifdef magic at use sites)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also more fun :) switched to that
printf("│ "); | ||
} | ||
} | ||
SchemaItem* item = &list->list[i]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should probably decide whether to use &array[index]
vs. array + index
and be consistent?
(I personally prefer the former because it's obvious at a glance that it's not normal arithmetic... but I recognize it's also a bit more typing)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, makes sense. I tend to use array + index
when I know I just want the pointer, and array[index]
when I want to use the item there (like array[index].name
), but consistency is good. I've updated to all &array[index]
style for getting pointers.
printf("Making a list of lenth %i with id %i\n", reserve, id); | ||
#endif | ||
builder->list_count++; | ||
builder->lists = realloc(builder->lists, sizeof(SchemaItemList) * builder->list_count); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(if there was some valid previous usage, wouldn't we need to free the lists it contained?)
realloc does that for you (afaiu)
I was referring to the lists of SchemaItem
that each SchemaItemList
in the builder referred to. But it's not a problem as you say, because we're doing realloc to extend rather than replace.
printf("Making a list of lenth %i with id %i\n", reserve, id); | ||
#endif | ||
builder->list_count++; | ||
builder->lists = realloc(builder->lists, sizeof(SchemaItemList) * builder->list_count); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One gotcha with using realloc this way -- it has O(n**2)
cost, where n
is the final list_count
. The usual workaround is to increase the capacity of the array by a fixed ratio like 3/2; in return for potentially 50% wasted space, it is amortized O(n)
cost. Probably doesn't matter for this example code tho.
ffi/src/lib.rs
Outdated
DataType::Map(mt) => call!( | ||
visit_map, | ||
mt.value_contains_null, | ||
visit_map_types(visitor, mt) | ||
), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Out of curiosity, what decides whether to have => call!(
vs. => {
?
DataType::Map(mt) => call!( | |
visit_map, | |
mt.value_contains_null, | |
visit_map_types(visitor, mt) | |
), | |
DataType::Map(mt) => { | |
call!(visit_map, mt.value_contains_null, visit_map_types(visitor, mt)) | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not entirely sure, but i think for fmt
the trailing ;
does, at least if there is no macro involved.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, rustfmt
is a bit odd here. It'll let me put the map call into {}
s, but won't let it stay on one line.
It seems to use {}
s if the body of the match case needs to go on its own line, so
&DataType::STRING => call!(visit_string),
is okay because it's all one line, but
DataType::Array(at) =>
call!(visit_array, at.contains_null, visit_array_item(visitor, at)),
while valid, will be fmt
'd to:
DataType::Array(at) => {
call!(visit_array, at.contains_null, visit_array_item(visitor, at))
}
I've put the map
case into {}
s for a bit more consistency, but really can't find a formatting that will let it stay on one line, I think because fmt
thinks it's just a bit too long.
ffi/src/lib.rs
Outdated
/// representation of a schema from a particular schema within kernel. | ||
/// | ||
/// The model is list based. When the kernel needs a list, it will ask engine to allocate one of a | ||
/// particular size. Once allocated the engine returns an `id`, which can be any identifier the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: reading "which can be any identifier the engine wants" I thought this would be an opaque type, but IIUC we are assuming usize
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, I was thinking that "identifier" was enough to make it clear, but i've amended to "integer identifier ([`usize`])"
which is hopefully more clear (and links to the usize docs so implementers can understand what type it will be in c/c++)
/// - For a map, visit the key and value, passing its special name ("map_key" or "map_value"), | ||
/// type, and value nullability (keys are never nullable) | ||
/// - For a list, visit the element, passing its special name ("array_element"), type, and | ||
/// nullability |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: not sure if it makes that much of a difference, but should be we consistent with the arrow conversion in how we name the special fields?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I looked about and couldn't find any consistency online about what to name these. Agree that it's not great to have different names here than in our schema.rs
code.
In the spirit of not bikeshedding too much I've opened #202 and will follow-up to rename everything properly.
ffi/src/lib.rs
Outdated
DataType::Map(mt) => call!( | ||
visit_map, | ||
mt.value_contains_null, | ||
visit_map_types(visitor, mt) | ||
), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not entirely sure, but i think for fmt
the trailing ;
does, at least if there is no macro involved.
Just left a few minor questions. Took me a little bit longer as I took the opportunity to go through most of the ffi code to get familiar. |
Co-authored-by: Ryan Johnson <scovich@users.noreply.github.com>
Adds more types, and uses a macro to reduce repetition for the primitive ones. Also adds some rust docs to try and explain how the visitors for schema work.
Tested by adding a visitor to the
read_table
program that builds and prints the schema.Output: