Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Splitting up #2769 to diagnose CI issues [WIP] #2773

Closed
wants to merge 6 commits into from

Conversation

johnkerl
Copy link
Contributor

@johnkerl johnkerl commented Jan 3, 2022

The array-schema dump method writes to stdout which is fine for many uses. In notebook contexts, though, it isn't fine -- stdout does not route to anything the notebook user can see. Notebook display of arr.schema works for Python because the Python library works around this by rolling its own array-schema dumper in its __repr__ method for ArraySchema here. Notebook display of schema(arr) does not work for R because the R library, in its show generic for ArraySchema, directly invokes the core function which prints to stdout here. Rather than taking the roll-another-of-our-own approach in TileDB-R (which could be done), on this PR we address the issue at the core level.

On #2769 I am getting too many CI errors for me to understand. Here I'm starting the commits bottom-up: filters, then dimensions, then domains, then attributes, then array schema at c_api/cpp_api, then array schema at top sm level.


TYPE: IMPROVEMENT
DESC: Create array-schema-to-string methods in support of notebook display of array schemas

@shortcut-integration
Copy link

@johnkerl johnkerl marked this pull request as draft January 3, 2022 16:56
@johnkerl johnkerl force-pushed the kerl/sc-12282/schema-dump-visibility-split-1 branch from b3d328a to 3071939 Compare January 3, 2022 17:14
@johnkerl johnkerl changed the title Splitting up #2769 to diagnose CI issues Splitting up #2769 to diagnose CI issues [WIP] Jan 3, 2022
Copy link
Contributor

@eric-hughes-tiledb eric-hughes-tiledb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm all in favor of converting the dump function to C++ streams. I am far less in favor of having them implemented side by side with C I/O. We only have dump functions publicly visible for four of the schema types: array_schema, attribute, domain, dimension. The API functions themselves (in tiledb.cc) are the place to do the conversion; everything else can be done with streams. That means eliminating all dump functions with FILE * arguments in array schema classes; the ones in the API (obviously) have to stay.

fprintf(out, "%s", ss.str().c_str());
}

void Dimension::dump_ss(std::stringstream& ss) const {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • If we were keeping the old dump function, there would be no need for a different function name. Argument-dependent resolution would work just fine.
  • There's no need to specialize for string streams. std::ostream will work just as well.

@@ -76,6 +76,9 @@ class Add1InPlace : public tiledb::sm::Filter {
void dump(FILE* out) const override {
(void)out;
}
void dump_ss(std::stringstream& ss) const override {
(void)ss;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you simply omit the argument name ss, leaving the type, you don't need to dummy out a value to avoid an unused argument warning.

@@ -222,23 +222,29 @@ const Range& Dimension::domain() const {
void Dimension::dump(FILE* out) const {
if (out == nullptr)
out = stdout;
std::stringstream ss;
dump_ss(ss);
fprintf(out, "%s", ss.str().c_str());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • No need for fprintf. A call to fwrite is all that's needed.
  • Should call fflush at the end, since this is the top-level call.

* @return `TILEDB_OK` for success and `TILEDB_ERR` for error.
*/
TILEDB_EXPORT int32_t tiledb_array_schema_dump_str(
tiledb_ctx_t* ctx, const tiledb_array_schema_t* array_schema, char** out);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This API has a security defect. There's no maximum length specified.

Personally, I'd rather not have our C API writing directly in user-allocated buffers. That pattern has created lots of systemic mischief (as with user-allocated query buffers). Better would be to have an API for strings with an alloc/free model and have the library manage the memory. We don't have such library-allocated strings ready today, however. (We're needing one already for better error reporting and for delivering warnings, for example.)

We will shortly, with #2748, have experimental features available. This API call could be put into experimental/ and used immediately pending a library-allocated string.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @eric-hughes-tiledb -- I should add a comment, it does not write into user-allocated buffers. It allocates. (I was, perhaps naively, following the patter of tiledb_stats_raw_dump.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is the deallocation handled?

Regardless of how it's handled, if there's an allocation here, the documentation for this function needs to (1) state that, and (2) say how to deallocate.

@johnkerl
Copy link
Contributor Author

johnkerl commented Jan 3, 2022

I'm all in favor of converting the dump function to C++ streams. I am far less in favor of having them implemented side by side with C I/O. We only have dump functions publicly visible for four of the schema types: array_schema, attribute, domain, dimension. The API functions themselves (in tiledb.cc) are the place to do the conversion; everything else can be done with streams. That means eliminating all dump functions with FILE * arguments in array schema classes; the ones in the API (obviously) have to stay.

@eric-hughes-tiledb indeed.

This draft PR is the outcome of a planning conversation with @ihnorton -- the problem at hand being, how to make Jupyter-notebook display of arr.schema work in R (it doesn't at present), as it currently does in Python.

One choice is doing this re-work in core -- which, as you point out, is open-ended. There are more spots to convert from FILE * to string.

Another choice is that I abandon this PR, and imitate what TileDB-Python is already doing here.

Initially I had thought that the first option -- this PR -- was the more elegant. But perhaps I'm mistaken; perhaps I should go with option two in TileDB-R.

Thoughts?

@eric-hughes-tiledb
Copy link
Contributor

One choice is doing this re-work in core -- which, as you point out, is open-ended. There are more spots to convert from FILE * to string.

The bulk of what you've got here, to use C++ streams instead of FILE, is worth doing regardless. If you scope the work to only deal with array schema classes, that would be fine. We're concentrating on array schema right now for other reasons; it's fine to limit work to this boundary.

The only question I've got is how we're returning internally-allocated objects through the C API. I've got a plan to keep a registry of all objects that we return this way (of any type, from any source), but that project is not quite ripe yet. (It's for lifespan management, detection of C client leaks, etc.) It's the reason I suggested putting the new API function into experimental/.

The much, much larger issue is the best way to implement this is with a generic schema walker that uses the visitor pattern. (I've got an independent reason for wanting this.) With such an interface available, you could implement your visitor completely externally and print out the schema however you wanted, all without hard-coding a particular dump format in the core library.

@eric-hughes-tiledb
Copy link
Contributor

generic schema walker that uses the visitor pattern

I should also mention that actually implementing this would force awareness of something that's not on the surface of the whole issue of dumping schemas: the schema is version dependent. In general, each schema version needs a separate schema dump. Formats don't change rapidly, so it's possible to have a single piece of code that dumps multiple versions, but that's not the point. Older format versions have implicit defaults (such as for nullable attributes). The "schema" that's dumped is actually two slightly different things: (1) the versioned schema as it sits in storage, and (2) the schema in memory that's upgraded from what's in storage. Getting a schema dump may be for a purpose of either (1) or (2) or both, so if there's a single dump function, it should have both kinds of information in it. Perhaps-better is a dump facility that allows all three of these possibilities.

@ihnorton
Copy link
Member

This will be rolled in to other C API improvements, thanks for getting the ball rolling here.

@ihnorton ihnorton closed this Mar 23, 2022
@ihnorton ihnorton deleted the kerl/sc-12282/schema-dump-visibility-split-1 branch March 23, 2022 20:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants