Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adapt schema-print for notebook use #342

Merged
merged 14 commits into from
Jan 10, 2022
Merged

Conversation

johnkerl
Copy link
Collaborator

@johnkerl johnkerl commented Jan 7, 2022

Problem to solve

Currently the "show" generic for ArraySchema -- and its recursive attributes including Domain, Dimension, Attribute, FilterList, and Filter -- rely on libtiledb_array_schema_dump et al. In turn, those TileDB core functions print to a FILE* which is stdout. This works fine at the R CLI; for notebooks, stdout is not plumbed to the user so the user sees nothing when doing schema(arr).

Options

  • We could modify TileDB core to have parallel functions which return strings, rather than writing to FILE*. This was attempted on Splitting up #2769 to diagnose CI issues [WIP] TileDB#2773. However, the scoping of that work has other dependencies as narrated by @eric-hughes-tiledb there. While that work is worth doing, it's on a greater scope than the current task.
  • We could somehow manipulate Rcpp::Rcout (an ostream which is plumbed to the user in both CLI and notebook contexts) and graft it onto core's FILE* interface.
  • We can simply imitate what TileDB-Py does when presented with the same situation, namely, have __repr__ methods (Python's equivalent of "show" generics in R) make the relevant API calls.

This PR implements option three.

Validation

At the CLI:

library(tiledb)

for (uri in c(
  "~/demo/palmer_penguins2",
  "~/demo/quickstart_dense",
  "~/demo/quickstart_sparse",
  "~/demo/test-array-200x200x30",
  "~/demo/writing_dense_global",
  "~/demo/writing_sparse_global",
  "tiledb://TileDB-Inc/quickstart_dense",
  "tiledb://TileDB-Inc/quickstart_sparse",
  "tiledb://TileDB-Inc/gtex-analysis-rnaseqc-gene-tpm"
  )) {

  arr <- tiledb_array(uri, query_type="READ", as.data.frame=TRUE)
  sch <- schema(arr)
  cat("\n")
  cat("================================================================\n")
  cat("URI", uri, "\n")
  print(sch)
}

This should print the same information as before -- since it's to the CLI's stdout.

The more compelling test is to make a Jupyter notebook using a local build of this PR's code:
Screen Shot 2022-01-07 at 11 04 40 AM

Notes

@shortcut-integration
Copy link

@johnkerl johnkerl force-pushed the kerl/sc-12282-array-schema-show branch 2 times, most recently from 95b63f8 to 3acd0b8 Compare January 7, 2022 16:07
@johnkerl johnkerl marked this pull request as ready for review January 7, 2022 16:08
@johnkerl johnkerl requested a review from Shelnutt2 January 7, 2022 16:15
@johnkerl johnkerl force-pushed the kerl/sc-12282-array-schema-show branch from 3acd0b8 to c44fedb Compare January 7, 2022 16:18
Copy link
Contributor

@eddelbuettel eddelbuettel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was a little tied up earlier so belated comments but I would like to disentangle the PR into the (simpler, limited scope) UINT32 additions to some switch statements (maybe with unit tests hitting them) and the more fluid-in-motion issue of pretty printing schemas etc which is already in my lap and where the PR helps a little less.

@eddelbuettel
Copy link
Contributor

And, if I may, from a purely procedural standpoint, I have over the years added a github PR stanza / form / suggestion in one or two open source projects where I recommend 'issue ticket and discussion first' before entering with PR which (personal view here) can add more friction that needed unless coordinated. I still think that helps a PRs going beyond trivial ones.

johnkerl added a commit that referenced this pull request Jan 7, 2022
@johnkerl johnkerl marked this pull request as draft January 7, 2022 18:36
@johnkerl johnkerl force-pushed the kerl/sc-12282-array-schema-show branch 3 times, most recently from 2f8a6d9 to 6119c68 Compare January 7, 2022 19:27
@johnkerl
Copy link
Collaborator Author

johnkerl commented Jan 7, 2022

Thanks @eddelbuettel ! I'm looking forward to scheduling (currently unscheduled) of SC 13272 and 13273 which we had discussed, among other things, at some length.

Would you prefer I abandon this PR at this point?

@johnkerl johnkerl marked this pull request as ready for review January 7, 2022 19:29
@eddelbuettel
Copy link
Contributor

We can probably throw a few PR together in the blender because I'll do some work on SC 13272 and 13273 which are close in spirit and scope. So we'll see where we end up with. Nothing wrong with adding a few show methods, then again I do not know how much high-level stuff is actually given that the arr <- tiledb_array(uri, options) followed by work on arr covers most uses.

I have this overarching fundamental problem that we have "some many functions" already making it hard at time to find functionality (!!) and the only we seem to be able to offer is more functions still. Tricky issue.

@johnkerl
Copy link
Collaborator Author

johnkerl commented Jan 7, 2022

Thanks @eddelbuettel ! @antalakas and I are of the viewpoint that invisible schema-printing in notebook contexts is a blocker for promoting cloud-R in notebooks. Personally I would prefer to merge these "show" generics sooner, rather than waiting for a future refactor. I'll be happy to red-PR this stuff out in the future once your refactor work is complete -- ?

@eddelbuettel
Copy link
Contributor

Maybe I am not seeing the forest for the trees but is there a reason why you cannot add pretty-printers you need now into the cloud package you are in control of?

@johnkerl
Copy link
Collaborator Author

johnkerl commented Jan 7, 2022

@eddelbuettel our cloud docs show examples of things like printing schemas. Python support for doing this in notebooks exists; R has a gap for notebooks which this PR addresses. While I've done significant work in TileDB-Cloud-R over the last few months, the affected package for this particular task is indeed TileDB-R, not TileDB-Cloud-R.

See for example https://docs.tiledb.com/cloud/tutorials/start-here

eddelbuettel pushed a commit that referenced this pull request Jan 7, 2022
* uint32 mods in support of schema-printer PR #342

* unit-test cases for UINT32 attribute get/set fill value

* code-review feedback
@johnkerl johnkerl force-pushed the kerl/sc-12282-array-schema-show branch from d36f8dd to dc073b4 Compare January 7, 2022 20:50
@eddelbuettel
Copy link
Contributor

This PR is good, it gets us almost to the finish line of not relying on core code (to stdout) for object display and adding a numer of missing show() methods. It tickled an error an real penguins data set (with NA values, as opposed to the cleansed one in cloud use). I fixed that, made the code a little tighter, and removed use of try.

Please see #345 (a new PR into this branch for your review) and have a look with your test arrays, it it is looking ok on the ones I tried.

@eddelbuettel
Copy link
Contributor

GitHub Actions for R forces a qpdf installation which we do not even need or use but which currently times out :-/ I relaunched; that helped earlier on the weekend.

Copy link
Contributor

@eddelbuettel eddelbuettel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me -- but then I also tainted the review by adding some code myself :)

Anybody else?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants