Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[r] Add iterator classes #1274

Merged
merged 21 commits into from
May 26, 2023
Merged

[r] Add iterator classes #1274

merged 21 commits into from
May 26, 2023

Conversation

pablo-gar
Copy link
Member

@pablo-gar pablo-gar commented Apr 24, 2023

Summary

address #1285 and #1348

  • Implements read iterators in a similar way as python API.
  • Removes iteration capabilities from SOMADenseNDArray.
  • Expands matrix zero-based shim introduced by @mlin. It now supports arithmetic operations and Matrix::DenseMatrix
  • SOMADataFrame and SOMASparseNDArray use correspoding read iterators.
    • Final shape of exported sparse matrices is the same as SOMASparseNDArray$shape()
  • Refractors code to accommodate the changes above.
  • Updates docs.
  • Updates tests.

Notes for reviewers

  • @eddelbuettel @aaronwolen Please review changes to:

    • SOMADataFrame
    • SOMASparseNDArray
    • SOMADenseNDArray
    • SOMAArrayBase
    • ReadIter
    • SparseReadIter
    • TableReadIter
    • utils-readerTransformers
  • @mlin please review changes to:

    • SOMASparseNDArray and in particular $read_sparse_matrix_zero_based.
    • utils-readerTransformers and in particular arrow_table_to_sparse
    • utils-matrixZeroBasedView

Usage example (May 24)

library(tiledbsoma)

ctx <- tiledbsoma::SOMATileDBContext$new(
   config = c("soma.init_buffer_bytes" = "300")
)
 
human_experiment = load_dataset("soma-exp-pbmc-small", tiledbsoma_ctx = ctx)

###
# X reading (SOMASparseNDArray)

x_sparse <- human_experiment$ms$get("RNA")$X$get("data")
    
# Arrow
sparse_arrow <- x_sparse$read()$tables()

sparse_arrow$read_next()
# Table
# 37 rows x 3 columns
# $soma_dim_0 <int64 not null>
# $soma_dim_1 <int64 not null>
# $soma_data <double not null>

sparse_arrow$concat()
# Table
# 4419 rows x 3 columns
# $soma_dim_0 <int64 not null>
# $soma_dim_1 <int64 not null>
# $soma_data <double not null>

# Sparse zero_based view
sparse <- x_sparse$read()$sparse_matrix(zero_based = T)

sparse$read_next()
# Non-mutable 0-based 'view' class for matrices.
#  To get 1-based matrix use `x$one_based_matrix` or `as.one.based(x)`

sparse$concat()
# Non-mutable 0-based 'view' class for matrices.
#  To get 1-based matrix use `x$one_based_matrix` or `as.one.based(x)`

###
# obs reading (SOMADataFrame)

ctx <- tiledbsoma::SOMATileDBContext$new(
   config = c("soma.init_buffer_bytes" = "30000000")
)
 
human_experiment = SOMAExperimentOpen("./apis/python/notebooks/data/sparse/pbmc3k/", tiledbsoma_ctx =ctx)

# Full data frame
df <- human_experiment$obs$read()
df$read_next()
# Table
# 2638 rows x 6 columns
# $soma_joinid <int64 not null>
# $obs_id <large_string not null>
# $n_genes <int64 not null>
# $percent_mito <float not null>
# $n_counts <float not null>
# $louvain <large_string not null>

df$read_next()
# NULL
# Warning message:
# In df$read_next() : Iteration complete, returning NULL

# Iterated 
ctx <- tiledbsoma::SOMATileDBContext$new(
   config = c("soma.init_buffer_bytes" = "300")
)
 
human_experiment = SOMAExperimentOpen("./apis/python/notebooks/data/sparse/pbmc3k/", tiledbsoma_ctx =ctx)
iterator <- human_experiment$obs$read(iterated = TRUE)
iterator$read_next()
# Table
# 18 rows x 6 columns
# $soma_joinid <int64 not null>
# $obs_id <large_string not null>
# $n_genes <int64 not null>
# $percent_mito <float not null>
# $n_counts <float not null>
# $louvain <large_string not null>

iterator$concat() 
# Table
# 2620 rows x 6 columns
# $soma_joinid <int64 not null>
# $obs_id <large_string not null>
# $n_genes <int64 not null>
# $percent_mito <float not null>
# $n_counts <float not null>
# $louvain <large_string not null>

@pablo-gar pablo-gar changed the title add: iterator classes experiment add: iterator classes [experiment] Apr 24, 2023
@johnkerl johnkerl changed the title add: iterator classes [experiment] [r] Add iterator classes [experimental] Apr 24, 2023
@codecov-commenter
Copy link

codecov-commenter commented May 7, 2023

Codecov Report

Patch coverage has no change and project coverage change: -0.33 ⚠️

Comparison is base (83c4e30) 52.13% compared to head (8818b5f) 51.81%.

❗ Current head 8818b5f differs from pull request most recent head 07c2e94. Consider uploading reports for the commit 07c2e94 to get more accurate results

❗ Your organization is not using the GitHub App Integration. As a result you may experience degraded service beginning May 15th. Please install the Github App Integration for your organization. Read more.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1274      +/-   ##
==========================================
- Coverage   52.13%   51.81%   -0.33%     
==========================================
  Files          68       73       +5     
  Lines        5726     5690      -36     
==========================================
- Hits         2985     2948      -37     
- Misses       2741     2742       +1     
Flag Coverage Δ
r 51.81% <ø> (-0.33%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

see 13 files with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

@pablo-gar pablo-gar marked this pull request as ready for review May 11, 2023 19:07
@mlin
Copy link
Member

mlin commented May 12, 2023

@pablo-gar can you talk me through the need for arithmetic ops on matrixZeroBasedView?

I had a rationale for supporting only essential operations on it, namely that this forces user to explicitly use as.one.based() to do anything else with it, thus minimizing opportunities to implicitly mix zero- and one-based matrices in expressions.

I'm open to the possibilities that (i) the zero-based arithmetic ops are essential, or (ii) I should give up on that rationale -- just would like to hear more to either end.

@eddelbuettel
Copy link
Contributor

eddelbuettel commented May 12, 2023

@pablo-gar: On 'git checkout pablo-gar/r_read_iterators; ./cleanup; R CMD INSTALL .` I hit an immediate speed bump

installing to /usr/local/lib/R/site-library/00LOCK-r/00new/tiledbsoma/libs
** R
** inst
** byte-compile and prepare package for lazy loading
** help
Error : EphemeralCollection.Rd:7: undefined exports: DenseReadIter
ERROR: installing Rd objects failed for package ‘tiledbsoma’
* removing ‘/usr/local/lib/R/site-library/tiledbsoma’
* restoring previous ‘/usr/local/lib/R/site-library/tiledbsoma’
Warning message:
In install.packages(pkgs = f, lib = lib, repos = if (isMatchingFile(f)) NULL else repos) :
  installation of package ‘.’ had non-zero exit status
edd@rob:~/git/tiledb-soma/apis/r(pablo-gar/r_read_iterators)$ 

Also:

edd@rob:~/git/tiledb-soma/apis/r(pablo-gar/r_read_iterators)$ ag -l DenseReadIter
man/DenseReadIter.Rd
NAMESPACE
edd@rob:~/git/tiledb-soma/apis/r(pablo-gar/r_read_iterators)$ 

Did you forget to add a file? I see this in your TODO list so maybe your use in documentation 'got ahead' ?

@eddelbuettel
Copy link
Contributor

Easy enough to fix via

edd@rob:~/git/tiledb-soma/apis/r(pablo-gar/r_read_iterators)$ git diff
diff --git a/apis/r/NAMESPACE b/apis/r/NAMESPACE
index 76adcc03..9dfd9b77 100644
--- a/apis/r/NAMESPACE
+++ b/apis/r/NAMESPACE
@@ -22,7 +22,7 @@ S3method(write_soma,TsparseMatrix)
 S3method(write_soma,data.frame)
 S3method(write_soma,matrix)
 export(ConfigList)
-export(DenseReadIter)
+#export(DenseReadIter)
 export(EphemeralCollection)
 export(EphemeralExperiment)
 export(EphemeralMeasurement)
edd@rob:~/git/tiledb-soma/apis/r(pablo-gar/r_read_iterators)$ 

Copy link
Contributor

@eddelbuettel eddelbuettel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a large PR, and it currently borks tests so no way to approve it yet -- but I like it!

I am wondering if we should simplify and reduce its scope a little and take e.g. the changes to @mlin 's zero-based view out into a different PR ?

The addition of the iterator looks good (in principle) to me and does deliver what it set out to do.

# Get max soma dims for indeces via tiledb
tiledb_array <- tiledb::tiledb_array(uri)
tiledb::tiledb_array_open(tiledb_array, type = "READ")
max_soma_dim_0 <- as.integer(max(tiledb::tiledb_array_get_non_empty_domain_from_index(tiledb_array, 1)))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assert this is a 32-byte AND

get max_soma_dom_0 - min_soma_dim_0 as the worst-case cardinality

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

obsolete comment

@pablo-gar
Copy link
Member Author

Did you forget to add a file? I see this in your TODO list so maybe your use in documentation 'got ahead' ?

Yes apologies! It was the SOMADenseArray iterator that I did not include in the commit, which we will not implement in any case. Will fix!

@pablo-gar
Copy link
Member Author

pablo-gar commented May 16, 2023

@pablo-gar can you talk me through the need for arithmetic ops on matrixZeroBasedView?

I had a rationale for supporting only essential operations on it, namely that this forces user to explicitly use as.one.based() to do anything else with it, thus minimizing opportunities to implicitly mix zero- and one-based matrices in expressions.

I'm open to the possibilities that (i) the zero-based arithmetic ops are essential, or (ii) I should give up on that rationale -- just would like to hear more to either end.

@mlin

Not all of them are essential, only addition is essential to be able to seemingly concatenate chunks of the iterator.

A user can quickly do chunk_1 + chunk_2 and in fact that's what the $concat() method does:

mat <- mat + self$read_next()

I can reduce the scope to only support iteration, but I think it leads to a poor user experience to not allow for the others given the relatively easy lift.

@pablo-gar pablo-gar changed the title [r] Add iterator classes [experimental] [r] Add iterator classes May 18, 2023
@pablo-gar pablo-gar requested a review from eddelbuettel May 18, 2023 18:58
dims <- self$dimensions()
attr <- self$attributes()
shape <- self$shape()

stopifnot("'repr' must be a sinlge character string" = length(repr) == 1 | mode(repr) == "character",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo: 'single'


#' @description Concatenate remainder of iterator
# to be refined in derived classes
concat = function() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the very belated review -- @aaronwolen asked me to help review, and I'm happy to -- I believe my review is speaking for the both of us

There are definitely multiple ways to go here. However, given that on the Python side we have a single read, options of .tables, etc., and concat -- e.g. sdf.read().tables().concat() -- we should do the same here. It won't be ideal for everyone but it will work, it will parallel the Python implementation, and we can add some keystroke-saving syntactic-sugar functions on top of these later if we want.

Let's get this concat going, and get rid of the iterated argument to read. Then read will always return an iterator (low complexity), and tables() et al. will transform an iterator of one type to an iterator of another, and concat will have the single job of doing concatenation.

Note that apis/r/R/utils-readerTransformers.R on this PR already has reader-transformers, so (I think and hope) it should be easy to connect them as methods on the iterator objects.

Likewise, concat() exists in non-stub form elsewhere on this PR and should be able to be connected as a method on the iterator objects.

If read_next remains, it needs to be remain as a helper method, not as public API.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed with everything and I will try to get this done tonight.

If read_next remains, it needs to be remain as a helper method, not as public API.

@johnkerl I'm not sure that the suggestion here is? Unfortunately in R there's no generic next() function, so I'm not sure how we could give users the ability to iterate themselves unless we:

  1. make read_next a public method
  2. OR we create function next() that calls internally read_next

The second one is a little less intuitive imo.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

read will always return an iterator

I'm not sure I understand the benefit to this. While R does have package that implements iterators, it isn't widely used (or at all in the single-cell world). Because iterators don't exist in R, all this does is add complexity to R users. While I understand the desire to make the R API to act like the Python API, we do need to keep in mind that R is not Python and there are things that Python handles natively that just don't work as well in R

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the key is here:

we can add some keystroke-saving syntactic-sugar functions on top of these later if we want

and imo it's not a "if we want" but more a "we need to". I think we can keep the low-level functionality in the R6 level as proposed and then we provide wrapper functions like as_arrow_table(soma_obj) see here, as(soma_obj, "dgCMatrix"), as.data.frame(soma_obj), etc

Copy link
Member Author

@pablo-gar pablo-gar May 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@johnkerl I have addressed your comments with the exception of:

If read_next remains, it needs to be remain as a helper method, not as public API.

I also updated the usage section of top-level comment. The branch still needs to merge main but wanted to get your early thoughts on this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pablo-gar looking this morning -- thank you! :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pablo-gar sorry to confuse.

This looks good to me -- thank you!! :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome! I will update the docs and merge main into the PR branch

apis/r/R/SOMADataFrame.R Show resolved Hide resolved
apis/r/R/SOMADenseNDArray.R Show resolved Hide resolved
apis/r/R/SOMASparseNDArray.R Outdated Show resolved Hide resolved
@pablo-gar pablo-gar requested a review from johnkerl May 25, 2023 08:42

#' @description Concatenate remainder of iterator
# to be refined in derived classes
concat = function() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pablo-gar sorry to confuse.

This looks good to me -- thank you!! :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants