progress bar and parallel compute for cartogram_ncont and cartogram_cont #36
Update:
|
R/cartogram_ncont.R (outdated)
weight,
k = 1,
inplace = TRUE,
n_cpu = parallelly::availableCores()
I think that in general it is recommended to keep the default core use to 1.
I agree
Actually, having thought about it, I don't fully agree. In the case of highly advanced packages such as mlr3, I would agree that the default should be 1 core and the user should have full control over which algorithm uses how many cores in the inner and outer loops. With simpler functions that users are more likely to use in interactive mode, I would argue that using the maximum number of cores would be better, as most users probably don't read the docs and don't change the defaults.
Say, you normally don't let users of GUI apps choose how many cores to use (e.g. in Photoshop, Final Cut, and many more). This kind of software just tries to use as many cores as you have to maximise performance, and this default probably exists precisely because otherwise users (who almost never touch defaults) would simply complain that the software is slow or does not use their hardware to its full potential, or worse, would switch to competitor software that does use multiple cores by default. If we take DuckDB as an example, it uses 80% of RAM by default and all of the CPU cores (which I disagree with; I would rather use the maximum less 1 core, reserving it for the OS to prevent freezes).
I am currently working on a new commit where I plan to design 3 choices:
- auto mode - similar to DuckDB's approach it would use max cores less 1
- respect future::plan()
- explicit number of cores, where the function will use future::plan() internally
The open question remains which of these should be the default. I would say that auto mode is the better design choice, as forcing the users of cartogram to use future::plan seems like overkill.
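The three modes listed above could be resolved to a worker count roughly as follows. This is only a minimal sketch, not the PR's code: `resolve_n_cpu()` and the `"respect_future_plan"` value are hypothetical names, and base R's `parallel::detectCores()` is used here in place of parallelly to keep the example dependency-free.

```r
# Hypothetical helper: map an n_cpu argument to a worker count.
# Returns NA_integer_ when the caller's future::plan() should stay in charge.
resolve_n_cpu <- function(n_cpu = "auto", available = parallel::detectCores()) {
  # detectCores() can return NA on some platforms; fall back to one core.
  if (is.na(available) || available < 1L) available <- 1L
  if (identical(n_cpu, "auto")) {
    # All cores less one, but never fewer than one.
    return(max(1L, as.integer(available) - 1L))
  }
  if (identical(n_cpu, "respect_future_plan")) {
    return(NA_integer_)
  }
  if (is.numeric(n_cpu) && length(n_cpu) == 1L && !is.na(n_cpu) && n_cpu >= 1) {
    # Explicit request from the user.
    return(as.integer(n_cpu))
  }
  stop("n_cpu must be \"auto\", \"respect_future_plan\", or a positive number")
}

resolve_n_cpu("auto", available = 8L)
resolve_n_cpu(2, available = 8L)
```

Whichever mode ends up as the default, keeping the resolution in one small function would make that default a one-line change.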
@@ -12,8 +12,16 @@ Authors@R: c(
Description: Construct continuous and non-contiguous area cartograms.
URL: https://github.com/sjewo/cartogram, https://sjewo.github.io/cartogram/
BugReports: https://github.com/sjewo/cartogram/issues
Imports: methods, sf, packcircles
Suggests:
Imports:
Alternatively, the furrr, parallelly, and future packages could be moved to Suggests: and the code related to them may be only used if the progress bar is on.
(I also think there should be an option to disable the progress bar)
makes sense!
@Nowosad On second thought, that would imply using the for loop (or purrr::map) by default with no progress bar... (though purrr::map also has a progress bar). So it kind of defeats the purpose. And why would parallel processing rely on the progress bar setting?
Thank you for your suggestions and your work on implementing parallel processing @e-kotov! I haven't had much time to keep an eye on this topic in the past.
I would like to keep the imports as small as possible.
- Maybe utils::txtProgressBar() is sufficient for adding a progress bar without additional imports?
- I would also prefer to move the parallelly and future packages to Suggests, as @Nowosad suggested, and add an if-clause for n_cpu > 1 to use future's parallel execution.
- I don't want to be petty, but is 'furrr' absolutely necessary? If the code can be implemented with only the future package, I would prefer that.
Best,
Sebastian
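The suggestions above (a base-R progress bar, parallel packages in Suggests, and an if-clause on n_cpu > 1) could fit together roughly like this. It is a sketch under assumptions rather than the PR's implementation: `process_features()` and `scale_one()` are hypothetical stand-ins for the per-feature work, and future.apply is only touched when it is requested and installed.

```r
# Hypothetical stand-in for the per-feature work done inside a cartogram function.
scale_one <- function(x) x * 2

# Sequential loop with a base-R progress bar by default;
# future.apply is only used when n_cpu > 1.
process_features <- function(xs, n_cpu = 1, show_progress = TRUE) {
  if (n_cpu > 1) {
    if (!requireNamespace("future.apply", quietly = TRUE)) {
      stop("Package 'future.apply' is needed for n_cpu > 1; please install it.")
    }
    # Runs under whatever future::plan() is active in the session.
    return(future.apply::future_lapply(xs, scale_one))
  }
  # txtProgressBar() needs max > min, so skip the bar for empty input.
  use_pb <- show_progress && length(xs) > 0
  if (use_pb) {
    pb <- utils::txtProgressBar(min = 0, max = length(xs), style = 3)
    on.exit(close(pb), add = TRUE)
  }
  out <- vector("list", length(xs))
  for (i in seq_along(xs)) {
    out[[i]] <- scale_one(xs[[i]])
    if (use_pb) utils::setTxtProgressBar(pb, i)
  }
  out
}

res <- process_features(as.list(1:5), n_cpu = 1, show_progress = FALSE)
```

With this layout the default path needs nothing beyond base R, which is the point of moving the parallel packages to Suggests.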
@sjewo points taken. I'm not sure excluding furrr is practical, as the future package only provides the framework, and then you need some other package like furrr or future.apply to do the job. So why not furrr?
I will get to it next week. What do you think of the older PR #28 ? Did you have plans to merge it at all?
Thanks for your reply, @e-kotov !
Regarding furrr: I would like to keep the number of dependencies low. future.apply has half the number of imported packages. Nevertheless, if furrr is more practical in your opinion, that's fine with me!
I have not gotten the PR #28 to work and the “future” approach looks promising to me. So if you are working on a different solution that results in some speed improvements, I would certainly accept a PR.
@Nowosad I will work some more on this PR in the coming days. Clearly it is not ready for merging. There is also something funny happening with the resulting cartogram. It does not show itself when applied to the test
Error in `st_transform.sfc()`:
! cannot transform sfc object with missing crs
This may have something to do with attributes that I fail to restore to the geom column. I will look into this and update the PR.
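One way a geometry column can lose its CRS is when it is rebuilt from a plain list after per-feature processing. The sketch below illustrates that mechanism with two made-up points; whether this is the actual cause of the error in the PR is not confirmed by the thread.

```r
library(sf)

# Two made-up points in EPSG:26916, the CRS used in the thread's examples.
pts <- st_sfc(st_point(c(0, 0)), st_point(c(1000, 1000)), crs = 26916)

# Per-geometry processing via lapply() yields a plain list of sfg objects.
shifted <- lapply(pts, function(g) st_point(as.numeric(g) + 10))

bare  <- st_sfc(shifted)                     # rebuilt without a CRS
fixed <- st_sfc(shifted, crs = st_crs(pts))  # CRS carried over explicitly

is.na(st_crs(bare))
st_crs(fixed) == st_crs(pts)
```

Calling `st_transform()` on an object like `bare` is the kind of situation that produces a "missing crs" error, so passing `crs = st_crs(...)` when reassembling the column is a cheap safeguard.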
…essr progress bar
I tried to address all your concerns and suggestions.
By default the function works as usual, unless you have an active future::plan that is not sequential. Please see the code below for testing. I imagine you may have more comments on the argument names, values and defaults. I very much welcome those and seek your feedback. I would also like to say that I am a very strong proponent of keeping the I would also vote for making use of as many cores as the user has, but it is up to you.
if (!require("remotes")) install.packages("remotes")
remotes::install_github("e-kotov/cartogram@progress", force = TRUE)
library(cartogram)
nc = sf::st_read(system.file("shape/nc.shp", package="sf"), quiet = TRUE)
nc_utm <- sf::st_transform(nc, 26916)
# Create cartogram
# debugonce(cartogram_ncont)
# should work as usual, but with progress bar enabled by default
nc_utm_carto <- cartogram_ncont(nc_utm, weight = "BIR74")
# should work as usual, but with progress bar enabled by default
nc_utm_carto <- cartogram_ncont(nc_utm, weight = "BIR74", n_cpu = 1)
# should ask for future and future.apply, but not progressr, and show no visible progress bar
# if you refuse installing future and future.apply, fails with a message
# if you accept install, installs and continues with parallel processing and no progress
nc_utm_carto <- cartogram_ncont(nc_utm, weight = "BIR74", n_cpu = 2, show_progress = FALSE)
# should ask for progressr, may ask for future and future.apply if they were not installed in the run above
# if you refuse installing future and future.apply, fails with a message
# if you are testing on a machine with 1 core, will work as usual
nc_utm_carto <- cartogram_ncont(nc_utm, weight = "BIR74", n_cpu = "auto")
Please test whenever you have time and provide feedback. Best,
Added similar parallelization for |
Hi @e-kotov -- the progress bar looks good. I have two related comments/questions:
|
@Nowosad, thanks for raising both of those questions.
First, I did not stumble upon the page you mentioned with the best practice recommendations (I should have, instead of reinventing the wheel). I considered the potential side effects of setting the future plan within the function. The best solution I could come up with, without reviewing the linked page, was to revert to a sequential plan, though I recognize that this is far from ideal.
The reasoning behind setting the plan internally was as follows: I assume that the likelihood of casual users engaging with What I would suggest we do is that I rewrite the function following the recommendations at https://future.futureverse.org/articles/future-7-for-package-developers.html#avoid-changing-the-future-strategy, but we keep the possibility for casual users to set the
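For illustration, the pattern of restoring the previous plan (rather than resetting to sequential) could look like the sketch below. `double_all()` is a hypothetical stand-in for a cartogram function, and the sketch assumes the future and future.apply packages are installed; it is not the code from this PR.

```r
# Hypothetical stand-in for a cartogram function with an optional n_cpu argument.
double_all <- function(xs, n_cpu = NULL) {
  if (!is.null(n_cpu)) {
    # plan() returns the previous plan, so it can be restored on exit,
    # whatever it was (sequential, multisession with 2 workers, ...).
    oplan <- future::plan(future::multisession, workers = n_cpu)
    on.exit(future::plan(oplan), add = TRUE)
  }
  # With n_cpu = NULL the caller's plan is used untouched.
  future.apply::future_lapply(xs, function(x) x * 2)
}

future::plan(future::sequential)
res <- double_all(as.list(1:4), n_cpu = 2)
inherits(future::plan(), "sequential")
```

The key difference from reverting to a sequential plan is that a user who had already set their own plan gets it back unchanged after the call.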
@Nowosad, @sjewo, I updated the code to respect the recommendations at https://future.futureverse.org/articles/future-7-for-package-developers.html#avoid-changing-the-future-strategy
To test:
if (!require("remotes")) install.packages("remotes")
remotes::install_github("e-kotov/cartogram@progress", force = TRUE)
library(cartogram)
nc = sf::st_read(system.file("shape/nc.shp", package="sf"), quiet = TRUE)
nc_utm <- sf::st_transform(nc, 26916)
future::plan(future::multisession, workers = 2)
# query the active plan each time; a stored copy would not reflect later changes
future::plan() # multisession with 2 workers
# this will use the current plan and will not touch the plan set above
nc_utm_carto <- cartogram_ncont(nc_utm, weight = "BIR74")
future::plan() # unchanged
# this will apply a multisession plan with 4 workers internally, and revert it afterwards
nc_utm_carto <- cartogram_ncont(nc_utm, weight = "BIR74", n_cpu = 4)
future::plan() # workers are still set to 2
@mtennekes, hi, could you also have a look and perhaps test, since you have
Hi @e-kotov - great work! I have been thinking about @Nowosad's comment about setting the future plan internally. Some other packages, like In my opinion, your suggestions are a good compromise between ease of use and predictable behavior of the cartogram functions. Also, thanks for adding some tests - we'll have to check out the differences in the Ubuntu R-CMD-check run (see actions).
@sjewo I will have a look at what fails on Ubuntu |
I had a look at the test results:
I guess it has something to do with how rounding is handled differently on Linux (see this). The difference is negligible for the data used in the test: 341 square meters between the expected and actual polygon areas, so approx. 0.0015% of the total area. So I will just introduce some more tolerance for the test and it should work fine.
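To make the tolerance idea concrete, here is a small base-R sketch. The control area is a hypothetical round number back-calculated from the 341 square meters and 0.0015% figures above, not the value used in the PR's test, and the tolerance of 1e-4 is likewise only an illustrative choice.

```r
# Hypothetical control value and a Linux result that differs by 341 m^2.
expected_area <- 22700000
actual_area   <- expected_area + 341

# Relative difference between the two areas.
rel_diff <- abs(actual_area - expected_area) / expected_area

# all.equal() compares relative differences against `tolerance`.
strict  <- isTRUE(all.equal(expected_area, actual_area))                    # default tolerance (about 1.5e-8)
relaxed <- isTRUE(all.equal(expected_area, actual_area, tolerance = 1e-4))  # loosened tolerance
```

`testthat::expect_equal()` accepts a `tolerance` argument as well, so the same relaxation can be applied directly in the test file.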
Done, all ubuntu checks pass (see the results for the same PR in my fork: e-kotov#1 ). I also added some examples for parallel processing to the documentation of both functions in the most recent commit. |
devtools::check(remote = TRUE, manual = TRUE)
devtools::check(cran = TRUE)
These also pass locally on my Mac, except for the version number of course; you will need to bump that before submitting to CRAN.
Thanks! I'll try the examples tomorrow. Could you add yourself as contributor in DESCRIPTION? |
I added myself, and also cleaned up the DESCRIPTION with |
Hi, trying to solve #35 myself I started with the simplest function, cartogram_ncont(). I know there is an older pull request #28, but this approach of just building on furrr and future seems simpler to me. Arguably it can be improved with better handling of the workers. E.g. instead of users specifying the n_cores in cartogram_ncont(), they should be able to set the number of workers and type of parallel processing with future outside the function.
This PR results in cartogram_ncont() gaining both a progress bar and parallel processing.
Before making changes to the code, I ran your default example from the cartogram_ncont() function and obtained a control value of the area of the resulting cartogram for nc from the sf package. I wrote a simple test for this area in the tests folder using {testthat}. The test passes with my updated code for cartogram_ncont().