
progress bar and parallel compute for cartogram_ncont and cartogram_cont #36

Open · wants to merge 19 commits into base: master

Conversation

@e-kotov commented Nov 3, 2024

Hi, trying to solve #35 myself, I started with the simplest function, cartogram_ncont(). I know there is an older pull request, #28, but this approach of simply building on furrr and future seems simpler to me. Arguably it could be improved with better handling of the workers; e.g., instead of users specifying n_cores in cartogram_ncont(), they should be able to set the number of workers and the type of parallel processing with future outside the function.

This PR gives cartogram_ncont() both a progress bar and parallel processing.

Before making changes to the code, I ran your default example for cartogram_ncont() and obtained a control value for the area of the resulting cartogram for the nc dataset from the sf package. I wrote a simple test for this area in the tests folder using {testthat}. The test passes with my updated code for cartogram_ncont().
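
The sketch below (illustrative only, not the PR's actual code) shows the general furrr + future + progressr pattern this approach builds on: iterate over the features in parallel while reporting progress; the per-feature work here is a placeholder.

library(sf)
library(furrr)
library(progressr)

future::plan(future::multisession, workers = 2)

nc <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)

with_progress({
  p <- progressor(steps = nrow(nc))
  areas <- future_map_dbl(seq_len(nrow(nc)), function(i) {
    p()                              # advance the progress bar
    as.numeric(st_area(nc[i, ]))     # stand-in for the per-feature scaling work
  })
})

future::plan(future::sequential)     # restore the default plan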

@e-kotov (Author) commented Nov 3, 2024

Update:

  1. fixed the missing CRS after applying cartogram_ncont() (a sketch of the idea follows below)
  2. added a test to guard against a missing CRS after applying cartogram_ncont()
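
A minimal sketch of the CRS-restoration idea (an assumption about the fix, not the PR's exact code): when geometries are rebuilt from a plain list, e.g. after a parallel map, the CRS of the input object is reattached so that downstream st_transform() keeps working.

library(sf)

nc <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)

# geometries collected one by one, as a plain list of sfg objects
geoms <- lapply(seq_len(nrow(nc)), function(i) st_geometry(nc)[[i]])

rebuilt <- st_sfc(geoms, crs = st_crs(nc))   # CRS restored here
stopifnot(!is.na(st_crs(rebuilt)))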

Review thread on the changed cartogram_ncont() function signature:

  weight,
  k = 1,
  inplace = TRUE,
  n_cpu = parallelly::availableCores()
Contributor:

I think that, in general, it is recommended to keep the default core count at 1.

Author:

I agree.

Author:

Actually, having thought about it, I don't fully agree. For highly advanced packages such as mlr3, I agree that the default should be 1 core and that the user should have full control over how many cores each algorithm uses in the inner and outer loops. For simpler functions that users are more likely to run interactively, I would argue that using the maximum number of cores is better, as most users probably don't read the docs and don't change the defaults. GUI applications (e.g. Photoshop, Final Cut, and many others) normally don't let users choose how many cores to use; they simply use as many cores as available to maximise performance. That default probably exists precisely because users (who almost never touch defaults) would otherwise complain that the software is slow or does not use their hardware to its full potential, or worse, would switch to a competitor that does use multiple cores by default. DuckDB, for example, uses 80% of RAM and all CPU cores by default (which I disagree with; I would rather use all cores minus one, reserving one for the OS to prevent freezes).

I am currently working on a new commit where I plan to offer three choices (a sketch of the idea follows below):

  • auto mode: similar to DuckDB's approach, it would use all available cores minus 1
  • respect future::plan()
  • an explicit number of cores, with the function setting future::plan() internally

The open question remains which of these should be the default. I would say that auto mode is the better design choice, as forcing cartogram users to call future::plan() seems like overkill.
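
A minimal sketch of how the three modes could resolve the number of workers; the helper name resolve_workers and the "auto"/"plan" values are illustrative assumptions, not the PR's actual API.

resolve_workers <- function(n_cpu = "auto") {
  if (identical(n_cpu, "auto")) {
    # auto mode: all available cores minus one, reserved for the OS
    return(max(parallelly::availableCores() - 1L, 1L))
  }
  if (identical(n_cpu, "plan")) {
    # respect whatever future::plan() the user has already set
    return(future::nbrOfWorkers())
  }
  # explicit number of cores supplied by the user
  as.integer(n_cpu)
}

resolve_workers()        # cores minus 1
resolve_workers("plan")  # workers provided by the current future plan
resolve_workers(4)       # explicit number of cores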

@Nowosad , @sjewo what do you think?

@@ -12,8 +12,16 @@ Authors@R: c(
Description: Construct continuous and non-contiguous area cartograms.
URL: https://github.com/sjewo/cartogram, https://sjewo.github.io/cartogram/
BugReports: https://github.com/sjewo/cartogram/issues
Imports: methods, sf, packcircles
Suggests:
Imports:
Contributor:

Alternatively, the furrr, parallelly, and future packages could be moved to Suggests:, and the code related to them would only be used if the progress bar is on.

(I also think there should be an option to disable the progress bar.)
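
A minimal sketch of the Suggests-plus-runtime-check mechanism: the parallel path is only taken if the caller asks for it and the optional packages are installed, otherwise the function falls back to a plain lapply(). The wrapper name and the parallel flag are illustrative assumptions.

map_features <- function(n, fun, parallel = FALSE) {
  if (parallel &&
      requireNamespace("future", quietly = TRUE) &&
      requireNamespace("furrr", quietly = TRUE)) {
    # optional packages available: run under the active future plan
    return(furrr::future_map(seq_len(n), fun))
  }
  lapply(seq_len(n), fun)   # base-R fallback, no extra dependencies
}

map_features(10, sqrt)                   # works without the optional packages
map_features(10, sqrt, parallel = TRUE)  # uses furrr/future if installed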

Author:

Makes sense!

Author:

@Nowosad On second thought, that would imply using a for loop (or purrr::map) by default, with no progress bar (though purrr::map also has a progress bar), which kind of defeats the purpose. And why would parallel processing depend on the progress bar setting?

@sjewo (Owner) commented Nov 4, 2024

Thank you for your suggestions and your work on implementing parallel processing, @e-kotov! I haven't had much time to keep an eye on this topic recently.

I would like to keep the imports as small as possible.

  • Maybe utils::txtProgressBar() is sufficient for adding a progress bar without additional imports? (see the sketch below)
  • I would also prefer to move the parallelly and future packages to Suggests, as @Nowosad suggested, and add an if-clause for n_cpu > 1 to use future's parallel execution.
  • I don't want to be petty, but is furrr absolutely necessary? If the code can be implemented with the future package alone, I would prefer that.

Best,
Sebastian
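
A minimal sketch of the two suggestions above: utils::txtProgressBar() for a zero-dependency progress bar, plus future's parallel execution only when n_cpu > 1 and the optional package is installed. The function and argument names are illustrative, not cartogram's actual code.

apply_with_progress <- function(n, fun, n_cpu = 1) {
  if (n_cpu > 1 && requireNamespace("future.apply", quietly = TRUE)) {
    # optional path: parallel execution via future.apply (kept in Suggests)
    return(future.apply::future_lapply(seq_len(n), fun))
  }
  # default path: sequential loop with a base-R progress bar
  pb <- utils::txtProgressBar(min = 0, max = n, style = 3)
  out <- vector("list", n)
  for (i in seq_len(n)) {
    out[[i]] <- fun(i)               # placeholder for the per-feature work
    utils::setTxtProgressBar(pb, i)  # advance the bar
  }
  close(pb)
  out
}

res <- apply_with_progress(50, sqrt)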

Author:

@sjewo points taken. I'm not sure excluding furrr is practical: the future package only provides the framework, and you then need some other package like furrr or future.apply to do the actual work. So why not furrr?

I will get to it next week. What do you think of the older PR #28? Did you have plans to merge it at all?

Owner:

Thanks for your reply, @e-kotov!

Regarding furrr: I would like to keep the number of dependencies low, and future.apply has half the number of imported packages. Nevertheless, if furrr is more practical in your opinion, that's fine with me!

I have not gotten PR #28 to work, and the future approach looks promising to me. So if you are working on a different solution that results in some speed improvement, I would certainly accept a PR.

@e-kotov (Author) commented Nov 3, 2024

@Nowosad I will work some more on this PR in the coming days. Clearly it is not ready for merging.

There is also something odd happening with the resulting cartogram. It does not show up with the test nc data, but with my own data (which I cannot upload right now) I get an error from ggplot2's geom_sf():

Error in `st_transform.sfc()`:
! cannot transform sfc object with missing crs

This may have something to do with attributes that I fail to restore to the geom column. I will look into this and update the PR.

@e-kotov (Author) commented Dec 23, 2024

@sjewo , @Nowosad

I tried to address all your concerns and suggestions.

future, future.apply, and progressr are all optional. By default, on a single core, the progress bar is implemented with base R.

By default the function works as usual, unless you have an active future::plan() that is not sequential. Please see the code below for testing.

I imagine you may have more comments on the argument names, values, and defaults. I very much welcome those and look forward to your feedback.

I would also like to say that I am a strong proponent of keeping show_progress = TRUE by default: even with the toy examples the function runtime is not instant, and beyond toy examples the runtime can become significantly longer, so I think it makes sense to keep the user informed.

I would also vote for using as many cores as the user has, but that is up to you.

if (!require("remotes")) install.packages("remotes")
remotes::install_github("e-kotov/cartogram@progress", force = TRUE)
library(cartogram)

nc = sf::st_read(system.file("shape/nc.shp", package="sf"), quiet = TRUE)
nc_utm <- sf::st_transform(nc, 26916)


# Create cartogram

# debugonce(cartogram_ncont)

# should work as usual, but with progress bar enabled by default
nc_utm_carto <- cartogram_ncont(nc_utm, weight = "BIR74")

# should work as usual, but with progress bar enabled by default
nc_utm_carto <- cartogram_ncont(nc_utm, weight = "BIR74", n_cpu = 1)

# should ask for future and future.apply, but not progressr, and with no visible progress bar
# if you refuse installing future and future.apply, fails with a message
# if you accept install, installs and continues with parallel processing and no progress
nc_utm_carto <- cartogram_ncont(nc_utm, weight = "BIR74", n_cpu = 2, show_progress = FALSE)

# should ask for progressr, and may ask for future and future.apply if they were not installed in the run above
# if you refuse installing future and future.apply, fails with a message
# if you are testing on a machine with 1 core, will work as usual
nc_utm_carto <- cartogram_ncont(nc_utm, weight = "BIR74", n_cpu = "auto")

Please test whenever you have time and provide feedback.

Best,
Egor

@e-kotov (Author) commented Dec 23, 2024

Added similar parallelization for cartogram_cont() to the same PR. The progress bar is shown for each iteration separately.

@Nowosad (Contributor) commented Jan 1, 2025

Hi @e-kotov -- the progress bar looks good. I have two related comments/questions:

  1. Why did you decide to set the future::plan() inside the function(s) instead of having it set before the function call (as suggested at https://future.futureverse.org/articles/future-7-for-package-developers.html#avoid-changing-the-future-strategy)?
  2. If you decide to keep the plan setting inside the function(s), then you should "make sure to undo when the function exits" (second code chunk at https://future.futureverse.org/articles/future-7-for-package-developers.html#avoid-changing-the-future-strategy); a sketch of that idiom follows below.
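
The "undo on exit" idiom from the linked vignette, sketched on a toy function (my_parallel_fun and its body are illustrative, not the PR's code): future::plan() returns the previous plan when a new one is set, so it can be restored via on.exit().

my_parallel_fun <- function(x, n_cpu = 2) {
  oplan <- future::plan(future::multisession, workers = n_cpu)
  on.exit(future::plan(oplan), add = TRUE)   # restore the user's plan on exit

  future.apply::future_lapply(x, function(el) el^2)
}

my_parallel_fun(1:4)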

@e-kotov (Author) commented Jan 1, 2025

@Nowosad, thanks for raising both of those questions.

First, I had not stumbled upon the page you mentioned with the best-practice recommendations (I should have, instead of reinventing the wheel). I did consider the potential side effects of setting the future plan within the function; the best solution I could come up with, without having seen the linked page, was to revert to a sequential plan, though I recognize that this is far from ideal.

The reasoning behind setting the plan internally was as follows: I assume that the likelihood of casual users engaging with {cartogram} is quite high, and I wanted to simplify things for them without requiring them to configure the future plan themselves. My goal was to streamline their experience rather than burden them with the complexities of the future package, such as the nuances of selecting the right plan. For 99% of {cartogram} use cases, multisession would likely be the best choice, since it's hard to imagine a casual user running {cartogram} calculations on a cluster of networked machines. I wanted to provide a straightforward n_cpu parameter while leaving flexibility for advanced users who are familiar with future plans and know how to set them up as needed.

What I would suggest is that I rewrite the functions following the recommendations at https://future.futureverse.org/articles/future-7-for-package-developers.html#avoid-changing-the-future-strategy, but keep the possibility for casual users to set n_cpu and thereby trigger the internal setting of the future plan. Advanced users would still be able to control the plan from outside the function.
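
A sketch of the compromise described above (assumed names, not the final PR code): by default, respect whatever future::plan() the user has set; if a casual user passes n_cpu, set a multisession plan internally and revert it on exit, as recommended in the vignette.

carto_apply <- function(x, fun, n_cpu = NULL) {
  if (!is.null(n_cpu) && n_cpu > 1) {
    oplan <- future::plan(future::multisession, workers = n_cpu)
    on.exit(future::plan(oplan), add = TRUE)
  }
  # runs under whichever plan is active (sequential if none was set)
  future.apply::future_lapply(x, fun)
}

carto_apply(1:4, sqrt)            # respects the user's existing plan
carto_apply(1:4, sqrt, n_cpu = 2) # temporary internal multisession plan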

@e-kotov (Author) commented Jan 5, 2025

@Nowosad, @sjewo, I updated the code to follow the recommendations at https://future.futureverse.org/articles/future-7-for-package-developers.html#avoid-changing-the-future-strategy.

To test:

if (!require("remotes")) install.packages("remotes")
remotes::install_github("e-kotov/cartogram@progress", force = TRUE)
library(cartogram)

nc = sf::st_read(system.file("shape/nc.shp", package="sf"), quiet = TRUE)
nc_utm <- sf::st_transform(nc, 26916)

future::plan(future::multisession, workers = 2)
current_plan <- future::plan()
current_plan

# this will use the current plan and will not touch the plan set above
nc_utm_carto <- cartogram_ncont(nc_utm, weight = "BIR74")
current_plan

# this will apply a multisession plan with 4 workers internally, then revert it
nc_utm_carto <- cartogram_ncont(nc_utm, weight = "BIR74", n_cpu = 4)
current_plan # workers are still set to 2

@mtennekes, hi, could you also have a look and perhaps test this, since you have tm_cartogram(), which wraps the {cartogram} functions?

@sjewo (Owner) commented Jan 8, 2025

Hi @e-kotov, great work! I have been thinking about @Nowosad's comment about setting the future plan internally.

Some other packages, like data.table, use multithreading by default. On the other hand, an ordinary R user should be able to follow an example that shows how to define a future plan.

In my opinion, your suggestions are a good compromise between ease of use and predictable behavior of the cartogram functions.

Also thanks for adding some tests; we'll have to check out the differences in the Ubuntu R-CMD-check run (see Actions).

@e-kotov (Author) commented Jan 8, 2025

@sjewo I will have a look at what fails on Ubuntu.

@e-kotov (Author) commented Jan 8, 2025

I had a look at the test results:

── Failure ('test-cartogram_ncont.R:10:3'): nc cartogram matches expected area ──
`cartogram_area` (`actual`) not equal to 22284872 (`expected`).

  `actual`: 22284531.0
`expected`: 22284872.0

I guess it has something to do with how rounding is handled differently on Linux (see this).

The difference is negligible for the data that is used in the test: 341 square meters between the expected and actual polygon areas, so approx. 0.0015% of the total area.

So I will just introduce some more tolerance in the test and it should work fine.
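
A sketch of the relaxed expectation (the numbers come from the failing check above; the exact tolerance used in the PR is an assumption). The platform difference is about 0.0015%, so a relative tolerance of 1e-4 comfortably covers it.

testthat::test_that("nc cartogram matches expected area", {
  cartogram_area <- 22284531   # value produced on Ubuntu in this run
  testthat::expect_equal(cartogram_area, 22284872, tolerance = 1e-4)
})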

@e-kotov (Author) commented Jan 8, 2025

Done, all Ubuntu checks pass (see the results for the same PR in my fork: e-kotov#1).

I also added some examples for parallel processing to the documentation of both functions in the most recent commit.

@e-kotov (Author) commented Jan 8, 2025

devtools::check(remote = TRUE, manual = TRUE)
devtools::check(cran = TRUE)

These also pass locally on my Mac, apart from the version number, which you will of course need to bump before submitting to CRAN.

@e-kotov changed the title from "progress bar and parallel compute for cartogram_ncont" to "progress bar and parallel compute for cartogram_ncont and cartogram_cont" on Jan 8, 2025
@sjewo (Owner) commented Jan 8, 2025

Thanks! I'll try the examples tomorrow.

Could you add yourself as a contributor in DESCRIPTION?

@e-kotov (Author) commented Jan 8, 2025

Could you add yourself as a contributor in DESCRIPTION?

I added myself, and also cleaned up the DESCRIPTION with usethis::use_tidy_description() (I hope you don't mind that).
