Splitting up #2769 to diagnose CI issues [WIP] #2773

johnkerl · 2022-01-03T16:56:22Z

The array-schema dump method writes to stdout which is fine for many uses. In notebook contexts, though, it isn't fine -- stdout does not route to anything the notebook user can see. Notebook display of arr.schema works for Python because the Python library works around this by rolling its own array-schema dumper in its __repr__ method for ArraySchema here. Notebook display of schema(arr) does not work for R because the R library, in its show generic for ArraySchema, directly invokes the core function which prints to stdout here. Rather than taking the roll-another-of-our-own approach in TileDB-R (which could be done), on this PR we address the issue at the core level.

On #2769 I am getting too many CI errors for me to understand. Here I'm starting the commits bottom-up: filters, then dimensions, then domains, then attributes, then array schema at c_api/cpp_api, then array schema at top sm level.

TYPE: IMPROVEMENT
DESC: Create array-schema-to-string methods in support of notebook display of array schemas

shortcut-integration · 2022-01-03T16:56:25Z

This pull request has been linked to Shortcut Story #12282: Expand R-UDF validation coverage: make sure essential API calls work.

eric-hughes-tiledb

I'm all in favor of converting the dump function to C++ streams. I am far less in favor of having them implemented side by side with C I/O. We only have dump functions publicly visible for four of the schema types: array_schema, attribute, domain, dimension. The API functions themselves (in tiledb.cc) are the place to do the conversion; everything else can be done with streams. That means eliminating all dump functions with FILE * arguments in array schema classes; the ones in the API (obviously) have to stay.

eric-hughes-tiledb · 2022-01-03T21:48:31Z

tiledb/sm/array_schema/dimension.cc

+  fprintf(out, "%s", ss.str().c_str());
+}
+
+void Dimension::dump_ss(std::stringstream& ss) const {


If we were keeping the old dump function, there would be no need for a different function name. Argument-dependent resolution would work just fine.

There's no need to specialize for string streams. std::ostream will work just as well.
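A minimal sketch of both points, assuming a stripped-down stand-in for the class (the `Dimension` below models only a name, and `dump_to_string` and the output format are illustrative, not the real ones). The stream version takes `std::ostream&` and keeps the name `dump`; the argument type alone selects the overload.

```cpp
#include <cstdio>
#include <ostream>
#include <sstream>
#include <string>
#include <utility>

// Hypothetical stand-in for tiledb::sm::Dimension; only a name is modeled.
class Dimension {
 public:
  explicit Dimension(std::string name)
      : name_(std::move(name)) {
  }

  // Stream-based dump. Taking std::ostream& means std::cout,
  // std::ostringstream, std::ofstream, etc. all work.
  void dump(std::ostream& os) const {
    os << "### Dimension ###\n";
    os << "- Name: " << name_ << "\n";
  }

  // Legacy FILE* dump with the same name; overload resolution picks
  // between the two based on the argument type.
  void dump(std::FILE* out) const {
    if (out == nullptr)
      out = stdout;
    std::ostringstream ss;
    dump(ss);  // resolves to dump(std::ostream&)
    const std::string text = ss.str();
    std::fwrite(text.data(), 1, text.size(), out);
  }

 private:
  std::string name_;
};

// Hypothetical helper: render the dump into a std::string.
std::string dump_to_string(const Dimension& dim) {
  std::ostringstream ss;
  dim.dump(ss);
  return ss.str();
}
```

A caller holding a `std::stringstream` can still pass it, since it derives from `std::ostream`.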

eric-hughes-tiledb · 2022-01-03T22:05:32Z

test/src/unit-filter-pipeline.cc

@@ -76,6 +76,9 @@ class Add1InPlace : public tiledb::sm::Filter {
  void dump(FILE* out) const override {
    (void)out;
  }
+  void dump_ss(std::stringstream& ss) const override {
+    (void)ss;


If you simply omit the argument name ss, leaving the type, you don't need to dummy out a value to avoid an unused argument warning.

eric-hughes-tiledb · 2022-01-03T22:06:34Z

tiledb/sm/array_schema/dimension.cc

@@ -222,23 +222,29 @@ const Range& Dimension::domain() const {
 void Dimension::dump(FILE* out) const {
  if (out == nullptr)
    out = stdout;
+  std::stringstream ss;
+  dump_ss(ss);
+  fprintf(out, "%s", ss.str().c_str());


No need for fprintf. A call to fwrite is all that's needed.

Should call fflush at the end, since this is the top-level call.
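Roughly what that could look like, as a sketch only. `write_dump` is a hypothetical free function rather than anything in the PR; it writes the rendered bytes with `fwrite`, skipping format-string parsing, and flushes once at the end.

```cpp
#include <cstdio>
#include <sstream>
#include <string>

// Hypothetical helper: write an already rendered dump to a C stream.
// Returns true if every byte was written and the flush succeeded.
bool write_dump(std::FILE* out, const std::ostringstream& ss) {
  if (out == nullptr)
    out = stdout;
  const std::string text = ss.str();
  // fwrite copies the bytes as-is; no "%s" formatting pass is involved.
  const std::size_t written = std::fwrite(text.data(), 1, text.size(), out);
  // Flush here because this is the outermost call that touches the FILE*.
  return written == text.size() && std::fflush(out) == 0;
}
```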

eric-hughes-tiledb · 2022-01-03T22:33:02Z

tiledb/sm/c_api/tiledb.h

+ * @return `TILEDB_OK` for success and `TILEDB_ERR` for error.
+ */
+TILEDB_EXPORT int32_t tiledb_array_schema_dump_str(
+    tiledb_ctx_t* ctx, const tiledb_array_schema_t* array_schema, char** out);


This API has a security defect. There's no maximum length specified.

Personally, I'd rather not have our C API writing directly in user-allocated buffers. That pattern has created lots of systemic mischief (as with user-allocated query buffers). Better would be to have an API for strings with an alloc/free model and have the library manage the memory. We don't have such library-allocated strings ready today, however. (We're needing one already for better error reporting and for delivering warnings, for example.)

We will shortly, with #2748, have experimental features available. This API call could be put into experimental/ and used immediately pending a library-allocated string.

Thanks @eric-hughes-tiledb -- I should add a comment: it does not write into user-allocated buffers. It allocates. (I was, perhaps naively, following the pattern of tiledb_stats_raw_dump.)

How is the deallocation handled?

Regardless of how it's handled, if there's an allocation here, the documentation for this function needs to (1) state that, and (2) say how to deallocate.
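One possible shape for an alloc/free pairing with the ownership rule spelled out in the doc comment. Every name below (`example_schema_dump_str`, `example_string_free`, `EXAMPLE_OK`, `EXAMPLE_ERR`) is hypothetical and not part of the TileDB C API, and the already rendered text is passed in directly to keep the sketch self-contained.

```cpp
#include <cstdlib>
#include <cstring>
#include <string>

// All names here are hypothetical; this is not the TileDB C API.
constexpr int EXAMPLE_OK = 0;
constexpr int EXAMPLE_ERR = -1;

/**
 * Copies a rendered dump into a newly allocated, NUL-terminated string.
 *
 * The library allocates `*out`. The caller owns it afterwards and must
 * release it with `example_string_free`, not with `free` or `delete`.
 *
 * @return `EXAMPLE_OK` for success and `EXAMPLE_ERR` for error.
 */
int example_schema_dump_str(const std::string& rendered, char** out) {
  if (out == nullptr)
    return EXAMPLE_ERR;
  char* buf = static_cast<char*>(std::malloc(rendered.size() + 1));
  if (buf == nullptr)
    return EXAMPLE_ERR;
  // c_str() is NUL-terminated, so size() + 1 bytes includes the terminator.
  std::memcpy(buf, rendered.c_str(), rendered.size() + 1);
  *out = buf;
  return EXAMPLE_OK;
}

/** Frees a string from `example_schema_dump_str` and nulls the pointer. */
void example_string_free(char** str) {
  if (str != nullptr && *str != nullptr) {
    std::free(*str);
    *str = nullptr;
  }
}
```

Keeping allocation and deallocation on the same side of the library boundary avoids mismatched allocators between the library and its client.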

johnkerl · 2022-01-03T23:04:02Z

I'm all in favor of converting the dump function to C++ streams. I am far less in favor of having them implemented side by side with C I/O. We only have dump functions publicly visible for four of the schema types: array_schema, attribute, domain, dimension. The API functions themselves (in tiledb.cc) are the place to do the conversion; everything else can be done with streams. That means eliminating all dump functions with FILE * arguments in array schema classes; the ones in the API (obviously) have to stay.

@eric-hughes-tiledb indeed.

This draft PR is the outcome of a planning conversation with @ihnorton -- the problem at hand being, how to make Jupyter-notebook display of arr.schema work in R (it doesn't at present), as it currently does in Python.

One choice is doing this re-work in core -- which, as you point out, is open-ended. There are more spots to convert from FILE * to string.

Another choice is that I abandon this PR, and imitate what TileDB-Python is already doing here.

Initially I had thought that the first option -- this PR -- was the more elegant. But perhaps I'm mistaken; perhaps I should go with option two in TileDB-R.

Thoughts?

eric-hughes-tiledb · 2022-01-04T22:45:30Z

One choice is doing this re-work in core -- which, as you point out, is open-ended. There are more spots to convert from FILE * to string.

The bulk of what you've got here, to use C++ streams instead of FILE, is worth doing regardless. If you scope the work to only deal with array schema classes, that would be fine. We're concentrating on array schema right now for other reasons; it's fine to limit work to this boundary.

The only question I've got is how we're returning internally-allocated objects through the C API. I've got a plan to keep a registry of all objects that we return this way (of any type, from any source), but that project is not quite ripe yet. (It's for lifespan management, detection of C client leaks, etc.) It's the reason I suggested putting the new API function into experimental/.

The much, much larger issue is that the best way to implement this is with a generic schema walker that uses the visitor pattern. (I've got an independent reason for wanting this.) With such an interface available, you could implement your visitor completely externally and print out the schema however you wanted, all without hard-coding a particular dump format in the core library.

eric-hughes-tiledb · 2022-01-05T15:53:46Z

generic schema walker that uses the visitor pattern

I should also mention that actually implementing this would force awareness of something that's not on the surface of the whole issue of dumping schemas: the schema is version dependent. In general, each schema version needs a separate schema dump. Formats don't change rapidly, so it's possible to have a single piece of code that dumps multiple versions, but that's not the point. Older format versions have implicit defaults (such as for nullable attributes). The "schema" that's dumped is actually two slightly different things: (1) the versioned schema as it sits in storage, and (2) the schema in memory that's upgraded from what's in storage. Getting a schema dump may be for a purpose of either (1) or (2) or both, so if there's a single dump function, it should have both kinds of information in it. Perhaps better is a dump facility that allows all three of these possibilities.

ihnorton · 2022-03-23T19:56:33Z

This will be rolled in to other C API improvements, thanks for getting the ball rolling here.

johnkerl marked this pull request as draft January 3, 2022 16:56

Splitting up #2769 to diagnose CI issues

3071939

johnkerl force-pushed the kerl/sc-12282/schema-dump-visibility-split-1 branch from b3d328a to 3071939 Compare January 3, 2022 17:14

Next step: add mods to filter classes

c754194

johnkerl changed the title ~~Splitting up #2769 to diagnose CI issues~~ Splitting up #2769 to diagnose CI issues [WIP] Jan 3, 2022

johnkerl added 3 commits January 3, 2022 15:27

Next step: add dimension class

13e4e01

Next step: add domain class

d27ebc9

add sm/c_api pieces

d0f6e7c

eric-hughes-tiledb reviewed Jan 3, 2022

remove a bit of code that should not have been in the previous commit

3106af0

This was referenced Jan 7, 2022

Adapt schema-print for notebook use TileDB-Inc/TileDB-R#342

Merged

Make array-schema prints visible from notebooks [WIP] #2769

Closed

ihnorton closed this Mar 23, 2022

ihnorton deleted the kerl/sc-12282/schema-dump-visibility-split-1 branch March 23, 2022 20:02
