
progress bar and parallel compute for cartogram_ncont and cartogram_cont #36

Open · wants to merge 19 commits into base: master

Conversation

@e-kotov commented Nov 3, 2024

Hi, trying to solve #35 myself, I started with the simplest function, cartogram_ncont(). I know there is an older pull request, #28, but this approach of simply building on furrr and future seems simpler to me. Arguably it could be improved with better handling of the workers; e.g., instead of users specifying n_cores in cartogram_ncont(), they should be able to set the number of workers and the type of parallel processing with future outside the function.

This PR gives cartogram_ncont() both a progress bar and parallel processing.

Before making changes to the code, I ran your default example for cartogram_ncont() and obtained a control value for the area of the resulting cartogram for the nc dataset from the sf package. I wrote a simple test for this area in the tests folder using {testthat}. The test passes with my updated code for cartogram_ncont().
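
The sketch below (illustrative only, not the PR's actual code) shows the general furrr + future + progressr pattern this approach builds on: iterate over the features in parallel while reporting progress; the per-feature work here is a placeholder.

library(sf)
library(furrr)
library(progressr)

future::plan(future::multisession, workers = 2)

nc <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)

with_progress({
  p <- progressor(steps = nrow(nc))
  areas <- future_map_dbl(seq_len(nrow(nc)), function(i) {
    p()                              # advance the progress bar
    as.numeric(st_area(nc[i, ]))     # stand-in for the per-feature scaling work
  })
})

future::plan(future::sequential)     # restore the default plan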

@e-kotov (Author) commented Nov 3, 2024

Update:

  1. fixed the missing CRS after applying cartogram_ncont() (a sketch of the idea follows below)
  2. added a test to guard against a missing CRS after applying cartogram_ncont()
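
A minimal sketch of the CRS-restoration idea (an assumption about the fix, not the PR's exact code): when geometries are rebuilt from a plain list, e.g. after a parallel map, the CRS of the input object is reattached so that downstream st_transform() keeps working.

library(sf)

nc <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)

# geometries collected one by one, as a plain list of sfg objects
geoms <- lapply(seq_len(nrow(nc)), function(i) st_geometry(nc)[[i]])

rebuilt <- st_sfc(geoms, crs = st_crs(nc))   # CRS restored here
stopifnot(!is.na(st_crs(rebuilt)))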

Review thread on the changed cartogram_ncont() function signature:

  weight,
  k = 1,
  inplace = TRUE,
  n_cpu = parallelly::availableCores()
Contributor:

I think that, in general, it is recommended to keep the default core count at 1.

Author:

I agree.

Author:

Actually, having thought about it, I don't fully agree. For highly advanced packages such as mlr3, I agree that the default should be 1 core and that the user should have full control over how many cores each algorithm uses in the inner and outer loops. For simpler functions that users are more likely to run interactively, I would argue that using the maximum number of cores is better, as most users probably don't read the docs and don't change the defaults. GUI applications (e.g. Photoshop, Final Cut, and many others) normally don't let users choose how many cores to use; they simply use as many cores as available to maximise performance. That default probably exists precisely because users (who almost never touch defaults) would otherwise complain that the software is slow or does not use their hardware to its full potential, or worse, would switch to a competitor that does use multiple cores by default. DuckDB, for example, uses 80% of RAM and all CPU cores by default (which I disagree with; I would rather use all cores minus one, reserving one for the OS to prevent freezes).

I am currently working on a new commit where I plan to offer three choices (a sketch of the idea follows below):

  • auto mode: similar to DuckDB's approach, it would use all available cores minus 1
  • respect future::plan()
  • an explicit number of cores, with the function setting future::plan() internally

The open question remains which of these should be the default. I would say that auto mode is the better design choice, as forcing cartogram users to call future::plan() seems like overkill.
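
A minimal sketch of how the three modes could resolve the number of workers; the helper name resolve_workers and the "auto"/"plan" values are illustrative assumptions, not the PR's actual API.

resolve_workers <- function(n_cpu = "auto") {
  if (identical(n_cpu, "auto")) {
    # auto mode: all available cores minus one, reserved for the OS
    return(max(parallelly::availableCores() - 1L, 1L))
  }
  if (identical(n_cpu, "plan")) {
    # respect whatever future::plan() the user has already set
    return(future::nbrOfWorkers())
  }
  # explicit number of cores supplied by the user
  as.integer(n_cpu)
}

resolve_workers()        # cores minus 1
resolve_workers("plan")  # workers provided by the current future plan
resolve_workers(4)       # explicit number of cores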

@Nowosad , @sjewo what do you think?

@@ -12,8 +12,16 @@ Authors@R: c(
Description: Construct continuous and non-contiguous area cartograms.
URL: https://github.com/sjewo/cartogram, https://sjewo.github.io/cartogram/
BugReports: https://github.com/sjewo/cartogram/issues
Imports: methods, sf, packcircles
Suggests:
Imports:
Contributor:

Alternatively, the furrr, parallelly, and future packages could be moved to Suggests:, and the code related to them would only be used if the progress bar is on.

(I also think there should be an option to disable the progress bar.)
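
A minimal sketch of the Suggests-plus-runtime-check mechanism: the parallel path is only taken if the caller asks for it and the optional packages are installed, otherwise the function falls back to a plain lapply(). The wrapper name and the parallel flag are illustrative assumptions.

map_features <- function(n, fun, parallel = FALSE) {
  if (parallel &&
      requireNamespace("future", quietly = TRUE) &&
      requireNamespace("furrr", quietly = TRUE)) {
    # optional packages available: run under the active future plan
    return(furrr::future_map(seq_len(n), fun))
  }
  lapply(seq_len(n), fun)   # base-R fallback, no extra dependencies
}

map_features(10, sqrt)                   # works without the optional packages
map_features(10, sqrt, parallel = TRUE)  # uses furrr/future if installed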

Author:

Makes sense!

Author:

@Nowosad On second thought, that would imply using a for loop (or purrr::map) by default, with no progress bar (though purrr::map also has a progress bar), which kind of defeats the purpose. And why would parallel processing depend on the progress bar setting?

@sjewo (Owner) commented Nov 4, 2024

Thank you for your suggestions and your work on implementing parallel processing, @e-kotov! I haven't had much time to keep an eye on this topic recently.

I would like to keep the imports as small as possible.

  • Maybe utils::txtProgressBar() is sufficient for adding a progress bar without additional imports? (see the sketch below)
  • I would also prefer to move the parallelly and future packages to Suggests, as @Nowosad suggested, and add an if-clause for n_cpu > 1 to use future's parallel execution.
  • I don't want to be petty, but is furrr absolutely necessary? If the code can be implemented with the future package alone, I would prefer that.

Best,
Sebastian
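
A minimal sketch of the two suggestions above: utils::txtProgressBar() for a zero-dependency progress bar, plus future's parallel execution only when n_cpu > 1 and the optional package is installed. The function and argument names are illustrative, not cartogram's actual code.

apply_with_progress <- function(n, fun, n_cpu = 1) {
  if (n_cpu > 1 && requireNamespace("future.apply", quietly = TRUE)) {
    # optional path: parallel execution via future.apply (kept in Suggests)
    return(future.apply::future_lapply(seq_len(n), fun))
  }
  # default path: sequential loop with a base-R progress bar
  pb <- utils::txtProgressBar(min = 0, max = n, style = 3)
  out <- vector("list", n)
  for (i in seq_len(n)) {
    out[[i]] <- fun(i)               # placeholder for the per-feature work
    utils::setTxtProgressBar(pb, i)  # advance the bar
  }
  close(pb)
  out
}

res <- apply_with_progress(50, sqrt)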

Author:

@sjewo points taken. I'm not sure excluding furrr is practical: the future package only provides the framework, and you then need some other package like furrr or future.apply to do the actual work. So why not furrr?

I will get to it next week. What do you think of the older PR #28? Did you have plans to merge it at all?

Owner:

Thanks for your reply, @e-kotov!

Regarding furrr: I would like to keep the number of dependencies low, and future.apply has half the number of imported packages. Nevertheless, if furrr is more practical in your opinion, that's fine with me!

I have not gotten PR #28 to work, and the future approach looks promising to me. So if you are working on a different solution that results in some speed improvement, I would certainly accept a PR.

@e-kotov (Author) commented Nov 3, 2024

@Nowosad I will work some more on this PR in the coming days. Clearly it is not ready for merging.

There is also something odd happening with the resulting cartogram. It does not show up with the test nc data, but with my own data (which I cannot upload right now) I get an error from ggplot2's geom_sf():

Error in `st_transform.sfc()`:
! cannot transform sfc object with missing crs

This may have something to do with attributes that I fail to restore to the geom column. I will look into this and update the PR.

@e-kotov (Author) commented Dec 23, 2024

@sjewo , @Nowosad

I tried to address all your concerns and suggestions.

future, future.apply, and progressr are all optional. By default, on a single core, the progress bar is implemented with base R.

By default the function works as usual, unless you have an active future::plan() that is not sequential. Please see the code below for testing.

I imagine you may have more comments on the argument names, values, and defaults. I very much welcome those and look forward to your feedback.

I would also like to say that I am a strong proponent of keeping show_progress = TRUE by default: even with the toy examples the function runtime is not instant, and beyond toy examples the runtime can become significantly longer, so I think it makes sense to keep the user informed.

I would also vote for using as many cores as the user has, but that is up to you.

if (!require("remotes")) install.packages("remotes")
remotes::install_github("e-kotov/cartogram@progress", force = TRUE)
library(cartogram)

nc = sf::st_read(system.file("shape/nc.shp", package="sf"), quiet = TRUE)
nc_utm <- sf::st_transform(nc, 26916)


# Create cartogram

# debugonce(cartogram_ncont)

# should work as usual, but with progress bar enabled by default
nc_utm_carto <- cartogram_ncont(nc_utm, weight = "BIR74")

# should work as usual, but with progress bar enabled by default
nc_utm_carto <- cartogram_ncont(nc_utm, weight = "BIR74", n_cpu = 1)

# should ask for future and future.apply, but not progressr, and with no visible progress bar
# if you refuse installing future and future.apply, fails with a message
# if you accept install, installs and continues with parallel processing and no progress
nc_utm_carto <- cartogram_ncont(nc_utm, weight = "BIR74", n_cpu = 2, show_progress = FALSE)

# should ask for progressr, and may ask for future and future.apply if they were not installed in the run above
# if you refuse installing future and future.apply, fails with a message
# if you are testing on a machine with 1 core, will work as usual
nc_utm_carto <- cartogram_ncont(nc_utm, weight = "BIR74", n_cpu = "auto")

Please test whenever you have time and provide feedback.

Best,
Egor

@e-kotov (Author) commented Dec 23, 2024

Added similar parallelization for cartogram_cont() to the same PR. The progress bar is shown for each iteration separately.

@Nowosad (Contributor) commented Jan 1, 2025

Hi @e-kotov -- the progress bar looks good. I have two related comments/questions:

  1. Why did you decide to set the future::plan() inside the function(s) instead of having it set before the function call (as suggested at https://future.futureverse.org/articles/future-7-for-package-developers.html#avoid-changing-the-future-strategy)?
  2. If you decide to keep the plan setting inside the function(s), then you should "make sure to undo when the function exits" (second code chunk at https://future.futureverse.org/articles/future-7-for-package-developers.html#avoid-changing-the-future-strategy); a sketch of that idiom follows below.
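
The "undo on exit" idiom from the linked vignette, sketched on a toy function (my_parallel_fun and its body are illustrative, not the PR's code): future::plan() returns the previous plan when a new one is set, so it can be restored via on.exit().

my_parallel_fun <- function(x, n_cpu = 2) {
  oplan <- future::plan(future::multisession, workers = n_cpu)
  on.exit(future::plan(oplan), add = TRUE)   # restore the user's plan on exit

  future.apply::future_lapply(x, function(el) el^2)
}

my_parallel_fun(1:4)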

@e-kotov (Author) commented Jan 1, 2025

@Nowosad, thanks for raising both of those questions.

First, I had not stumbled upon the page you mentioned with the best-practice recommendations (I should have, instead of reinventing the wheel). I did consider the potential side effects of setting the future plan within the function; the best solution I could come up with, without having seen the linked page, was to revert to a sequential plan, though I recognize that this is far from ideal.

The reasoning behind setting the plan internally was as follows: I assume that the likelihood of casual users engaging with {cartogram} is quite high, and I wanted to simplify things for them without requiring them to configure the future plan themselves. My goal was to streamline their experience rather than burden them with the complexities of the future package, such as the nuances of selecting the right plan. For 99% of {cartogram} use cases, multisession would likely be the best choice, since it's hard to imagine a casual user running {cartogram} calculations on a cluster of networked machines. I wanted to provide a straightforward n_cpu parameter while leaving flexibility for advanced users who are familiar with future plans and know how to set them up as needed.

What I would suggest is that I rewrite the functions following the recommendations at https://future.futureverse.org/articles/future-7-for-package-developers.html#avoid-changing-the-future-strategy, but keep the possibility for casual users to set n_cpu and thereby trigger the internal setting of the future plan. Advanced users would still be able to control the plan from outside the function.
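
A sketch of the compromise described above (assumed names, not the final PR code): by default, respect whatever future::plan() the user has set; if a casual user passes n_cpu, set a multisession plan internally and revert it on exit, as recommended in the vignette.

carto_apply <- function(x, fun, n_cpu = NULL) {
  if (!is.null(n_cpu) && n_cpu > 1) {
    oplan <- future::plan(future::multisession, workers = n_cpu)
    on.exit(future::plan(oplan), add = TRUE)
  }
  # runs under whichever plan is active (sequential if none was set)
  future.apply::future_lapply(x, fun)
}

carto_apply(1:4, sqrt)            # respects the user's existing plan
carto_apply(1:4, sqrt, n_cpu = 2) # temporary internal multisession plan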

@e-kotov (Author) commented Jan 5, 2025

@Nowosad, @sjewo, I updated the code to follow the recommendations at https://future.futureverse.org/articles/future-7-for-package-developers.html#avoid-changing-the-future-strategy.

To test:

if (!require("remotes")) install.packages("remotes")
remotes::install_github("e-kotov/cartogram@progress", force = TRUE)
library(cartogram)

nc = sf::st_read(system.file("shape/nc.shp", package="sf"), quiet = TRUE)
nc_utm <- sf::st_transform(nc, 26916)

future::plan(future::multisession, workers = 2)
current_plan <- future::plan()
current_plan

# this will use the current plan and will not touch the plan set above
nc_utm_carto <- cartogram_ncont(nc_utm, weight = "BIR74")
current_plan

# this will apply a multisession plan with 4 workers internally, then revert it
nc_utm_carto <- cartogram_ncont(nc_utm, weight = "BIR74", n_cpu = 4)
current_plan # workers are still set to 2

@mtennekes, hi, could you also have a look and perhaps test this, since you have tm_cartogram(), which wraps the {cartogram} functions?

@sjewo (Owner) commented Jan 8, 2025

Hi @e-kotov, great work! I have been thinking about @Nowosad's comment about setting the future plan internally.

Some other packages, like data.table, use multithreading by default. On the other hand, an ordinary R user should be able to follow an example that shows how to define a future plan.

In my opinion, your suggestions are a good compromise between ease of use and predictable behavior of the cartogram functions.

Also thanks for adding some tests; we'll have to check out the differences in the Ubuntu R-CMD-check run (see Actions).

@e-kotov (Author) commented Jan 8, 2025

@sjewo I will have a look at what fails on Ubuntu.

@e-kotov (Author) commented Jan 8, 2025

I had a look at the test results:

── Failure ('test-cartogram_ncont.R:10:3'): nc cartogram matches expected area ──
`cartogram_area` (`actual`) not equal to 22284872 (`expected`).

  `actual`: 22284531.0
`expected`: 22284872.0

I guess it has something to do with how rounding is handled differently on Linux (see this).

The difference is negligible for the data that is used in the test: 341 square meters between the expected and actual polygon areas, so approx. 0.0015% of the total area.

So I will just introduce some more tolerance in the test and it should work fine.
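
A sketch of the relaxed expectation (the numbers come from the failing check above; the exact tolerance used in the PR is an assumption). The platform difference is about 0.0015%, so a relative tolerance of 1e-4 comfortably covers it.

testthat::test_that("nc cartogram matches expected area", {
  cartogram_area <- 22284531   # value produced on Ubuntu in this run
  testthat::expect_equal(cartogram_area, 22284872, tolerance = 1e-4)
})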

@e-kotov (Author) commented Jan 8, 2025

Done, all Ubuntu checks pass (see the results for the same PR in my fork: e-kotov#1).

I also added some examples for parallel processing to the documentation of both functions in the most recent commit.

@e-kotov (Author) commented Jan 8, 2025

devtools::check(remote = TRUE, manual = TRUE)
devtools::check(cran = TRUE)

These also pass locally on my Mac, apart from the version number, which you will of course need to bump before submitting to CRAN.

@e-kotov changed the title from "progress bar and parallel compute for cartogram_ncont" to "progress bar and parallel compute for cartogram_ncont and cartogram_cont" on Jan 8, 2025
@sjewo (Owner) commented Jan 8, 2025

Thanks! I'll try the examples tomorrow.

Could you add yourself as a contributor in DESCRIPTION?

@e-kotov (Author) commented Jan 8, 2025

Could you add yourself as a contributor in DESCRIPTION?

I added myself, and also cleaned up the DESCRIPTION with usethis::use_tidy_description() (I hope you don't mind that).
