Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[r] Add iterator classes #1274

Merged
merged 21 commits into from
May 26, 2023
Merged
Show file tree
Hide file tree
Changes from 10 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
5 changes: 5 additions & 0 deletions apis/r/NAMESPACE
Original file line number Diff line number Diff line change
Expand Up @@ -4,13 +4,15 @@ S3method("[",matrixZeroBasedView)
S3method("[<-",matrixZeroBasedView)
S3method("[[",MappingBase)
S3method("[[<-",MappingBase)
S3method(Ops,matrixZeroBasedView)
S3method(as.list,MappingBase)
S3method(as.one.based,matrixZeroBasedView)
S3method(dim,matrixZeroBasedView)
S3method(length,MappingBase)
S3method(names,MappingBase)
S3method(pad_matrix,default)
S3method(pad_matrix,matrix)
S3method(print,matrixZeroBasedView)
S3method(write_soma,Assay)
S3method(write_soma,DimReduc)
S3method(write_soma,Graph)
Expand All @@ -20,6 +22,7 @@ S3method(write_soma,TsparseMatrix)
S3method(write_soma,data.frame)
S3method(write_soma,matrix)
export(ConfigList)
export(DenseReadIter)
export(EphemeralCollection)
export(EphemeralExperiment)
export(EphemeralMeasurement)
Expand Down Expand Up @@ -50,6 +53,8 @@ export(SOMASparseNDArrayCreate)
export(SOMASparseNDArrayOpen)
export(SOMATileDBContext)
export(ScalarMap)
export(SparseReadIter)
export(TableReadIter)
export(TileDBArray)
export(TileDBGroup)
export(TileDBObject)
Expand Down
84 changes: 84 additions & 0 deletions apis/r/R/ReadIter.R
Original file line number Diff line number Diff line change
@@ -0,0 +1,84 @@
#' SOMA Read Iterator Base class
#'
#' Class that allows for read iteration of SOMA reads.

ReadIter <- R6::R6Class(
classname = "ReadIter",

public = list(

#' @description Create (lifecycle: experimental)
#' @param uri Character value with URI path to a SOMADataFrame or SOMASparseNDArray
#' @param config character vector containing TileDB config.
#' @param colnames Optional vector of character value with the name of the columns to retrieve
#' @param qc Optional external Pointer object to TileDB Query Condition, defaults to \sQuote{NULL} i.e.
#' no query condition
#' @param dim_points Optional named list with vector of data points to select on the given
#' dimension(s). Each dimension can be one entry in the list.
#' @param loglevel Character value with the desired logging level, defaults to \sQuote{auto}
#' which lets prior setting prevail, any other value is set as new logging level.
initialize = function(uri,
config,
colnames = NULL,
qc = NULL,
dim_points = NULL,
loglevel = "auto") {

# Instantiate soma_reader_pointer with a soma_array_reader object
private$soma_reader_pointer <- sr_setup(
uri = uri,
config = config,
colnames = colnames,
qc = qc,
dim_points = dim_points,
loglevel = loglevel
)
},

#' @description Check if iterated read is complete or not. (lifecycle: experimental)
#' @return logical
read_complete = function() {
if (is.null(private$soma_reader_pointer)) {
TRUE
} else {
sr_complete(private$soma_reader_pointer)
}
},

#' @description Read the next chunk of an iterated read. (lifecycle: experimental).
#' If read is complete, retunrs `NULL` and raises warning.
#' @return \code{NULL} or one of arrow::\link[arrow]{Table}, \link{matrixZeroBasedView}
read_next = function() {
if (is.null(private$soma_reader_pointer)) {
NULL
} else {
if (self$read_complete()) {
warning("Iteration complete, returning NULL")
NULL
} else {
rl <- sr_next(private$soma_reader_pointer)
return(private$soma_reader_transform(rl))
}
}
},

#' @description Concatenate remainder of iterator
# to be refined in derived classes
concat = function() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the very belated review -- @aaronwolen asked me to help review, and I'm happy to -- I believe my review is speaking for the both of us

There are definitely multiple ways to go here. However, given that on the Python side we have a single read, options of .tables, etc., and concat -- e.g. sdf.read().tables().concat() -- we should do the same here. It won't be ideal for everyone but it will work, it will parallel the Python implementation, and we can add some keystroke-saving syntactic-sugar functions on top of these later if we want.

Let's get this concat going, and get rid of the iterated argument to read. Then read will always return an iterator (low complexity), and tables() et al. will transform an iterator of one type to an iterator of another, and concat will have the single job of doing concatenation.

Note that apis/r/R/utils-readerTransformers.R on this PR already has reader-transformers, so (I think and hope) it should be easy to connect them as methods on the iterator objects.

Likewise, concat() exists in non-stub form elsewhere on this PR and should be able to be connected as a method on the iterator objects.

If read_next remains, it needs to be remain as a helper method, not as public API.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed with everything and I will try to get this done tonight.

If read_next remains, it needs to be remain as a helper method, not as public API.

@johnkerl I'm not sure that the suggestion here is? Unfortunately in R there's no generic next() function, so I'm not sure how we could give users the ability to iterate themselves unless we:

  1. make read_next a public method
  2. OR we create function next() that calls internally read_next

The second one is a little less intuitive imo.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

read will always return an iterator

I'm not sure I understand the benefit to this. While R does have package that implements iterators, it isn't widely used (or at all in the single-cell world). Because iterators don't exist in R, all this does is add complexity to R users. While I understand the desire to make the R API to act like the Python API, we do need to keep in mind that R is not Python and there are things that Python handles natively that just don't work as well in R

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the key is here:

we can add some keystroke-saving syntactic-sugar functions on top of these later if we want

and imo it's not a "if we want" but more a "we need to". I think we can keep the low-level functionality in the R6 level as proposed and then we provide wrapper functions like as_arrow_table(soma_obj) see here, as(soma_obj, "dgCMatrix"), as.data.frame(soma_obj), etc

Copy link
Member Author

@pablo-gar pablo-gar May 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@johnkerl I have addressed your comments with the exception of:

If read_next remains, it needs to be remain as a helper method, not as public API.

I also updated the usage section of top-level comment. The branch still needs to merge main but wanted to get your early thoughts on this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pablo-gar looking this morning -- thank you! :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pablo-gar sorry to confuse.

This looks good to me -- thank you!! :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome! I will update the docs and merge main into the PR branch

.NotYetImplemented()
}

),

private = list(

# Internal 'external pointer' object used for iterated reads
soma_reader_pointer = NULL,

# to be refined in derived classes
soma_reader_transform = function(x) {
.NotYetImplemented()
}

)
)
44 changes: 0 additions & 44 deletions apis/r/R/SOMAArrayBase.R
Original file line number Diff line number Diff line change
Expand Up @@ -17,33 +17,6 @@ SOMAArrayBase <- R6::R6Class(
}
),

public = list(

#' @description Check if iterated read is complete or not. (lifecycle: experimental)
read_complete = function() {
if (is.null(private$soma_reader_pointer)) {
TRUE
} else {
sr_complete(private$soma_reader_pointer)
}
},

#' @description Read the next chunk of an iterated read. (lifecycle: experimental)
read_next = function() {
if (is.null(private$soma_reader_pointer)) {
NULL
} else {
if (sr_complete(private$soma_reader_pointer)) {
invisible(NULL)
} else {
rl <- sr_next(private$soma_reader_pointer)
private$soma_reader_transform(rl)
}
}
}

),

private = list(

# Cache object's SOMA_OBJECT_TYPE_METADATA_KEY
Expand All @@ -58,23 +31,6 @@ SOMAArrayBase <- R6::R6Class(
meta[[SOMA_OBJECT_TYPE_METADATA_KEY]] <- self$class()
meta[[SOMA_ENCODING_VERSION_METADATA_KEY]] <- SOMA_ENCODING_VERSION
self$set_metadata(meta)
},

# Internal 'external pointer' object used for iterated reads
soma_reader_pointer = NULL,

# Instantiate soma_reader_pointer with a soma_array_reader object
soma_reader_setup = function() {
private$soma_reader_pointer <- sr_setup(
self$uri,
config=as.character(tiledb::config(self$tiledbsoma_ctx$context()))
)
},

## to be refined in derived classes
soma_reader_transform = function(x) {
x
}

)
)
50 changes: 21 additions & 29 deletions apis/r/R/SOMADataFrame.R
Original file line number Diff line number Diff line change
Expand Up @@ -170,7 +170,7 @@ SOMADataFrame <- R6::R6Class(
#' @param iterated Option boolean indicated whether data is read in call (when
#' `FALSE`, the default value) or in several iterated steps.
#' @param log_level Optional logging level with default value of `"warn"`.
#' @return An [`arrow::Table`].
#' @return arrow::\link[arrow]{Table} or \link{TableReadIter}
johnkerl marked this conversation as resolved.
Show resolved Hide resolved
read = function(coords = NULL,
column_names = NULL,
value_filter = NULL,
Expand All @@ -181,7 +181,7 @@ SOMADataFrame <- R6::R6Class(
result_order <- match_query_layout(result_order)
uri <- self$uri
arr <- self$object # need array (schema) to properly parse query condition

## if unnamed set names
if (!is.null(coords)) {
if (!is.list(coords))
Expand All @@ -205,29 +205,26 @@ SOMADataFrame <- R6::R6Class(
value_filter <- parsed@ptr
}

cfg <- as.character(tiledb::config(self$tiledbsoma_ctx$context()))
if (isFALSE(iterated)) {
cfg <- as.character(tiledb::config(self$tiledbsoma_ctx$context()))
rl <- soma_array_reader(uri = uri,
colnames = column_names, # NULL dealt with by soma_array_reader()
qc = value_filter, # idem
dim_points = coords, # idem
loglevel = log_level, # idem
config = cfg)
private$soma_reader_transform(rl)
rl <- soma_array_reader(uri = self$uri,
config = cfg,
colnames = column_names, # NULL dealt with by sr_setup()
qc = value_filter, # idem
dim_points = coords,
loglevel = log_level
)

soma_array_to_arrow_table(rl)
} else {
## should we error if this isn't null?
if (!is.null(private$soma_reader_pointer)) {
warning("Reader pointer not null, skipping")
rl <- NULL
} else {
private$soma_reader_setup()
rl <- list()
while (!self$read_complete()) {
## soma_reader_transform() applied inside read_next()
rl <- c(rl, self$read_next())
}
}
invisible(rl)
read_iter <- TableReadIter$new(uri = self$uri,
config = cfg,
colnames = column_names, # NULL dealt with by sr_setup()
qc = value_filter, # idem
dim_points = coords, # idem
loglevel = log_level # idem
)
return(read_iter)
}
}

Expand Down Expand Up @@ -265,12 +262,7 @@ SOMADataFrame <- R6::R6Class(
}

schema
},

## refined from base class
soma_reader_transform = function(x) {
arrow::as_arrow_table(arrow::RecordBatch$import_from_c(x[[1]], x[[2]]))
}

)
)
Loading