Parallel support improvements. #124

dblodgett-usgs · 2024-12-20T19:09:08Z

This PR includes the commits in #122 and is related to issue #123

I've been working through lingering edge cases related to #105 and ended up with what's here.

I think this is fairly robust, but I'm really not sure I like how much complexity I had to bring in to support what should be pretty straightforward. Thoughts?

dblodgett-usgs · 2024-12-20T21:18:14Z

I just added a commit with a bunch of documentation stuff for #95 here as well. I can move these commits / contributions around if we decide not to go with this approach to parallelism.

dblodgett-usgs · 2025-02-20T03:26:28Z

@keller-mark -- are you ok merging this in? I'm back working with this stuff again and would be stoked to get it moved toward CRAN.

keller-mark · 2025-02-20T13:28:55Z

R/zarr-array.R

@@ -16,71 +16,70 @@ ZarrArray <- R6::R6Class("ZarrArray",
    # store Array store, already initialized.
    #' @keywords internal
    store = NULL,
-    #' chunk_store Separate storage for chunks. If not provided, `store` will be used for storage of both chunks and metadata.
+    # chunk_store Separate storage for chunks. If not provided, `store` will be used for storage of both chunks and metadata.


Why are the #' to # changes needed? I am not an expert in Roxygen comments but I see the apostrophe format at https://roxygen2.r-lib.org/articles/rd-other.html#r6

I don't remember precisely, but for internal methods, this was causing documentation to get polluted with stuff that didn't make sense.

Yeah... so with the #' on private methods, it does strange stuff.

keller-mark · 2025-02-20T13:31:38Z

R/zarr-array.R


      parts <- indexer$iter()
-      part1_results <- apply_func(parts, function(proj, cl = NA) {
+      part1_results <- ps$FUN(parts, function(proj, cl = NA) {


Suggested change

-part1_results <- ps$FUN(parts, function(proj, cl = NA) {
+part1_results <- ps$apply_func(parts, function(proj, cl = NA) {

Keeping apply somewhere in the naming of this function would make it clear that it is doing an lapply-type operation over parts

Yeah... I'll work that up.
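To illustrate the naming point, here is a minimal sketch of a settings list that carries an lapply-style function under the name `apply_func`. The helper `make_settings()` is hypothetical and is not the PR's `get_parallel_settings()`; only the `ps$apply_func` and `ps$cl` names come from the thread.

```r
# Hypothetical helper (not part of pizzarr): returns a list holding an
# lapply-style function plus the cluster it should use, if any.
make_settings <- function(cl = NULL) {
  if (is.null(cl)) {
    # Serial case: plain lapply.
    list(cl = NULL,
         apply_func = function(X, FUN, ...) lapply(X, FUN, ...))
  } else {
    # Parallel case: the cluster is captured by the closure, so the
    # callback passed to apply_func does not need its own `cl` argument.
    list(cl = cl,
         apply_func = function(X, FUN, ...) parallel::parLapply(cl, X, FUN, ...))
  }
}

ps <- make_settings()
parts <- list(1:3, 4:6)
part_sums <- ps$apply_func(parts, function(proj) sum(proj))
```

With the name `apply_func`, the call site reads like the `lapply()` call it replaces, whichever backend sits behind it.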

keller-mark · 2025-02-20T13:32:24Z

R/zarr-array.R


      parts <- indexer$iter()
-      part1_results <- apply_func(parts, function(proj, cl = NA) {
+      part1_results <- ps$FUN(parts, function(proj, cl = NA) {


Suggested change

-part1_results <- ps$FUN(parts, function(proj, cl = NA) {
+part1_results <- ps$FUN(parts, function(proj) {

It seems like since cl = cl has been changed to cl = ps$cl below, this cl = NA parameter is never used, correct? Do we still need to keep the cl = NA in the function signature (here, in the writing case, and within get_parallel_settings)?

I'll double check this and implement in a follow up commit.

keller-mark · 2025-02-20T13:37:24Z

R/zarr-array.R

@@ -1317,3 +1262,82 @@ ZarrArray <- R6::R6Class("ZarrArray",
 as.array.ZarrArray = function(x, ...) {
  x$as.array()
 }
+
+get_parallel_settings <- function(on_windows = (.Platform$OS.type == "windows"),


Suggested change

get_parallel_settings <- function(on_windows = (.Platform$OS.type == "windows"),

get_parallel_settings <- function(on_windows = (.Platform$OS.type == "windows"),

Would you be able to add documentation comments to document the parameters and return value?

keller-mark · 2025-02-20T13:40:16Z

R/zarr-array.R

+
+get_parallel_settings <- function(on_windows = (.Platform$OS.type == "windows"),
+                                  parallel_option = getOption("pizzarr.parallel_read_enabled", FALSE),
+                                  progress = getOption("pizzarr.progress_bar", FALSE)) {


Since pizzarr.progress_bar is a new option being introduced here, can it also be added to https://github.com/keller-mark/pizzarr/blob/main/R/options.R for initialization and to support controlling via environment variable?

This is done now.
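For readers unfamiliar with the pattern being requested, the following is a rough sketch of initializing an option from an environment variable. The helper `option_from_env()` and the variable name `PIZZARR_PROGRESS_BAR` are assumptions made for illustration; the actual names and parsing rules in `R/options.R` may differ.

```r
# Hypothetical helper: read a logical option default from an environment
# variable, falling back to `default` when the variable is unset.
option_from_env <- function(env_name, default = FALSE) {
  raw <- Sys.getenv(env_name, unset = NA_character_)
  if (is.na(raw)) {
    return(default)
  }
  # Treat a few common spellings as TRUE; anything else is FALSE.
  toupper(raw) %in% c("TRUE", "T", "1", "YES")
}

# Assumed variable name, for illustration only.
Sys.setenv(PIZZARR_PROGRESS_BAR = "true")
options(pizzarr.progress_bar = option_from_env("PIZZARR_PROGRESS_BAR"))
progress_enabled <- getOption("pizzarr.progress_bar", FALSE)
```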

keller-mark · 2025-02-20T13:44:31Z

R/zarr-array.R

+
+get_parallel_settings <- function(on_windows = (.Platform$OS.type == "windows"),
+                                  parallel_option = getOption("pizzarr.parallel_read_enabled", FALSE),
+                                  progress = getOption("pizzarr.progress_bar", FALSE)) {


Since there are now the 3 different parallel backends (pbapply, future.apply, and parallel), would it make more sense for this option to be a categorical/string instead of a binary flag? The naming of the option suggests that it will control whether the user sees a progress bar during processing (as opposed to something to do with parallelism or the pbapply package), so can we name it something like parallel_backend with values "pbapply", "future.apply", and "parallel"?

I was trying to be minimally invasive, so I didn't think to do something like this. I'll look into it.

OK, so we have a mess of options / considerations here. I think I like where it stands.

There are actually only 2 parallel back ends, plus the wrapper pbapply. pbapply works with either parallel back end, and we call it if we want a progress bar.

parallel_option accepts TRUE/FALSE, an integer, a cluster object, or the string "future". "future" works on any platform, but parallel has limited functionality on Windows.

I'll document these and make sure things are clear in options.R.
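The accepted value types described above could be told apart roughly as follows. This is only a sketch under stated assumptions: `describe_parallel_option()` is a hypothetical helper, and the labels it returns are invented for illustration rather than taken from pizzarr.

```r
# Hypothetical helper: classify a parallel_option value by type.
describe_parallel_option <- function(parallel_option) {
  if (inherits(parallel_option, "cluster")) {
    # Objects created by parallel::makeCluster() inherit from "cluster".
    "cluster"
  } else if (identical(parallel_option, "future")) {
    "future"
  } else if (is.logical(parallel_option) && length(parallel_option) == 1 &&
             !is.na(parallel_option)) {
    if (parallel_option) "default workers" else "serial"
  } else if (is.numeric(parallel_option) && length(parallel_option) == 1) {
    paste(as.integer(parallel_option), "workers")
  } else {
    stop("unsupported parallel_option")
  }
}

kinds <- vapply(list(FALSE, TRUE, 4L, "future"),
                describe_parallel_option, character(1))
```

Checking for the cluster class first keeps the later scalar checks simple, since a cluster object is a list rather than a length-one value.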


dblodgett-usgs · 2025-02-20T17:22:50Z

R/options.R

@@ -1,21 +1,29 @@
 # Adapted from https://github.com/IRkernel/IRkernel/blob/master/R/options.r

 #' pizzarr_option_defaults


@keller-mark -- see additions here. "parallel_backend" is now the master control on parallelization and there is a separate flag for whether to use it for writing.

keller-mark · 2025-02-20T17:53:01Z

Thanks!

dblodgett-usgs added 3 commits December 19, 2024 22:03

improve sample data fetch and parallel tests

26cb66a

clean up after sample data

9d25077

improve parallel settings code and testing

2ac1a22

dblodgett-usgs requested a review from keller-mark December 20, 2024 19:09

dblodgett-usgs added 4 commits December 20, 2024 13:16

test nits

6a3f186

.

f1eabbc

.

233f76c

doco cleanup for CRAN checks

bc4da73

dblodgett-usgs mentioned this pull request Dec 20, 2024

CRAN preparations #95

Open

7 tasks

keller-mark and others added 2 commits January 25, 2025 10:29

Merge branch 'main' into parallel

1a3f59e

short cut to avoid some web requests

92ca03a

keller-mark reviewed Feb 20, 2025


dblodgett-usgs added 4 commits February 20, 2025 08:11

apply_func instead of FUN

b44cefc

clean up parallel options and documentation

b214e5c

conflicts

e89b251

Merge branch 'main' of https://github.com/keller-mark/pizzarr # Conflicts: # tests/testthat/test-01-parallel.R

conflict

29e4c6b

Merge branch 'parallel' of https://github.com/dblodgett-usgs/pizzarr # Conflicts: # tests/testthat/test-01-parallel.R

dblodgett-usgs commented Feb 20, 2025


keller-mark approved these changes Feb 20, 2025


keller-mark merged commit 2a8ea4d into keller-mark:main Feb 20, 2025
3 checks passed
