Performance of {crew} vs {clustermq} #81
Looking at the profiles, I think R itself is going to be the limiting factor. Are there any bits that could be rewritten in C++?
It seems like the largest single chunk is the updates to `self$log`. If I'm not mistaken, you can use … As an aside, should this

```r
self$log$popped_errors[index] <- self$log$popped_errors[index] +
  !anyNA(out$error)
self$log$popped_warnings[index] <-
  self$log$popped_warnings[index] + !anyNA(out$error)
```

be `!anyNA(out$warning)` in the second assignment?
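A minimal base-R sketch of the overhead being discussed here (a plain `data.frame` stands in for the tibble log, and the variable names are illustrative): repeatedly incrementing a counter stored in a list versus in a data frame column.

```r
n <- 1e4L

# Counter stored in a plain list: cheap $<- dispatch, no validation.
lst <- list(count = 0L)
t_list <- system.time(
  for (i in seq_len(n)) lst$count <- lst$count + 1L
)

# Same counter in a data frame: each update dispatches to
# $<-.data.frame, which copies and re-validates the object.
df <- data.frame(count = 0L)
t_df <- system.time(
  for (i in seq_len(n)) df$count[1L] <- df$count[1L] + 1L
)

t_list
t_df
```

On a typical machine the data frame loop is much slower, which is why repeated in-place log updates are a natural hotspot.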
I noticed you classed your log as a tibble. Using lists would be faster than tibbles, but it seems that would negate some of the appeal of the nice interface that tibbles provide.
I'll offer up my fast df/tibble creator. If you are creating the tibbles yourself, you can cut out all the validation code for a dramatic speedup. All you need is to add the following attributes to a list: (i) `names`, (ii) `class`, (iii) `row.names`. Illustrative example below.

```r
cols <- 8L
colnames <- letters[seq_len(cols)]
rows <- 100L
df <- vector(mode = "list", length = cols)
for (i in seq_along(df)) {
  df[[i]] <- runif(rows)
}
attributes(df) <- list(
  names = colnames,
  class = c("tbl_df", "tbl", "data.frame"),
  row.names = .set_row_names(rows)
)
```

Comes courtesy of …
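A rough way to see the constructor-validation cost the trick above avoids (base-only sketch: `data.frame()` is the stand-in constructor here, and `tibble()` validates even more):

```r
# Build a tbl_df-classed object by attribute assignment, as above:
# no name checking, length recycling, or other constructor validation.
make_fast <- function(rows) {
  df <- list(a = runif(rows), b = runif(rows))
  attributes(df) <- list(
    names = c("a", "b"),
    class = c("tbl_df", "tbl", "data.frame"),
    row.names = .set_row_names(rows)
  )
  df
}

# Attribute assignment skips the validation code entirely.
t_fast <- system.time(for (i in 1:1e4) make_fast(10L))

# The constructor re-checks names, lengths, and recycling on every call.
t_ctor <- system.time(for (i in 1:1e4) data.frame(a = runif(10L), b = runif(10L)))

t_fast
t_ctor
```

The resulting object still satisfies the usual data frame accessors (`nrow()`, `names()`, `[[`), since those only inspect the attributes set above.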
Thanks everyone, this is such helpful advice. Changing/limiting interactions with tibbles seems to be improving things already.
I also wonder how much speed can be gained by simply removing the …
Doesn't seem to make a difference.
A fairer comparison is the following test, which runs in about 3.4 seconds locally (the …):

```r
library(crew)
controller <- crew_controller_local(workers = 4L)
controller$start()
controller$launch(n = 4L)
names <- character(0L)
index <- 0L
n_tasks <- 6000L
Sys.sleep(5)
system.time({
  for (task in seq_len(n_tasks)) {
    controller$push(
      name = as.character(index),
      command = TRUE,
      scale = FALSE,
      seed = 0L
    )
  }
  while (length(controller$queue) > 0L) {
    controller$collect()
  }
})
controller$terminate()
```
As `seed = sample.int(n = 1e9L, size = 1L)` is relatively slow, you can simply replace it with `seed = NULL`, or, if you need to record the actual seed, you could use `seed = as.integer(random() / 2)` for a c. 8x speed-up. The range of …
Thanks, @shikokuchuo. Implemented just now. My original thinking was that the default seed should be tied to the RNG state of the calling session, but I think …
After 7cb5111, the example from #81 (comment) looks like this (on the Macbook):
Moving to a more focused issue about multi-push and multi-pop.
On second thought, efficiency improvements may also require a more efficient approach to queueing on crew's end. |
I should probably try using environments/hash tables for the queues instead of lists that get incremented 1 task at a time. |
```r
> x <- rep(list(list(error = NULL, result = 3, warning = "")), 6000)
> system.time(while (length(x) > 0) x[1] <- NULL)
   user  system elapsed
  0.117   0.010   0.127
```

How would you implement a queue in an environment? My first stab at it is much slower than the list version:

```r
> x <- new.env()
> for (i in 1:6000) assign(sprintf("q%04d", i), list(error = NULL, result = 3, warning = ""), x)
> system.time(while (length(x) > 0) remove(list = ls(x)[1], pos = x))
   user  system elapsed
  5.212   0.000   5.220
```

The docs for `ls()` point to the `sorted` argument, and `sorted = FALSE` is much faster:

```r
> x <- new.env()
> system.time(for (i in 1:6000) assign(sprintf("q%04d", i), list(error = NULL, result = 3, warning = ""), x))
   user  system elapsed
  0.016   0.000   0.016
> system.time(while (length(x) > 0) remove(list = ls(x, sorted = FALSE)[1], pos = x))
   user  system elapsed
  0.002   0.000   0.003
```

If the results aren't coming out sorted, then it's not really a queue... but since we are doing asynchronous execution anyway, maybe it doesn't matter if … Using a non-hash environment seems to yield a stack, but then it's slower than the list version:

```r
> x <- new.env(hash = FALSE)
> system.time(for (i in 1:6000) assign(sprintf("q%04d", i), list(error = NULL, result = 3, warning = ""), x))
   user  system elapsed
  0.119   0.000   0.119
> ls(x, sorted = FALSE)[1]
[1] "q6000"
> ls(x, sorted = FALSE)[6000]
[1] "q0001"
> system.time(while (length(x) > 0) remove(list = ls(x, sorted = FALSE)[1], pos = x))
   user  system elapsed
  0.690   0.007   0.699
```

And popping the last element rather than the first (so it's FIFO) is even slower:

```r
> system.time(while (length(x) > 0) remove(list = ls(x, sorted = FALSE)[length(x)], pos = x))
   user  system elapsed
  1.001   0.000   1.002
```
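For reference, one standard answer to "how would you implement a queue in an environment" is to keep head/tail counters and index the slots by number, which makes both push and pop O(1) and FIFO without any `ls()` calls at all. A sketch (the function names are mine, not crew's):

```r
queue_new <- function() {
  q <- new.env(parent = emptyenv(), hash = TRUE)
  q$head <- 1L  # index of the next element to pop
  q$tail <- 0L  # index of the most recently pushed element
  q
}

queue_push <- function(q, value) {
  q$tail <- q$tail + 1L
  assign(as.character(q$tail), value, envir = q)
}

queue_pop <- function(q) {
  stopifnot(q$head <= q$tail)  # error on an empty queue
  key <- as.character(q$head)
  value <- get(key, envir = q)
  rm(list = key, envir = q)
  q$head <- q$head + 1L
  value
}

q <- queue_new()
for (task in 1:3) queue_push(q, task)
queue_pop(q)  # 1 -- first in, first out
queue_pop(q)  # 2
```

Because every lookup is by a known key, neither sorted nor unsorted `ls()` scans are needed, so the cost per operation stays constant as the queue grows.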
That certainly explains a lot.
I found …
As I mentioned, in f8290b5 I switched the task queue to a hashed environment. Now when I run:

```r
library(crew)
controller <- crew_controller_local(workers = 4L)
controller$start()
controller$launch(n = 4L)
Sys.sleep(5)
index <- 0L
n_tasks <- 6000L
Sys.sleep(5)
px <- proffer::pprof({
  for (task in seq_len(n_tasks)) {
    controller$push(
      name = as.character(index),
      command = TRUE,
      scale = FALSE,
      seed = 0L
    )
  }
  while (length(controller$queue) > 0L) {
    controller$collect()
  }
})
controller$terminate()
```

I see the same clear bottleneck on both my Macbook and Ubuntu machines (although slightly less severe on the Macbook for some reason). This bottleneck happens in the following lines of code (Lines 391 to 398 in f8290b5):
I don't know if there is a way in R to loop over an environment of …
I shouldn't think the hashed environment gives you much advantage, as you don't need to retrieve or replace values by name/key. I might be over-simplifying, but I think you can revert to using a list, and then it seems just this will do:

```r
as.logical(lapply(queue, .unresolved))
```

Or refactor your code so that's all you need?
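For what it's worth, `vapply()` would return the logical vector directly, skipping the intermediate list that `as.logical(lapply(...))` allocates. A sketch with a stand-in predicate in place of mirai's `.unresolved()`:

```r
# Stand-in predicate: treat a task as "unresolved" until a result arrives.
is_unresolved <- function(task) is.null(task$result)

queue <- list(
  list(result = NULL),  # still running
  list(result = 42),    # done
  list(result = NULL)   # still running
)

# vapply() checks each return value is a length-1 logical and
# fills a preallocated vector, so no list-to-logical coercion pass.
flags <- vapply(queue, is_unresolved, logical(1L))
flags  # TRUE FALSE TRUE
```

The type check in `vapply()` also catches a predicate that accidentally returns `NULL` or a longer vector, which `as.logical(lapply(...))` would silently mangle.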
That is depressing; this seems like the prototypical use case for …

Does that mean you think the actual calls to … Did you try …
I think that my suggestion would require allocating an extra list for the result of …
Forget my suggestion about reverting to lists: I forgot you were trying to avoid the other duplication costs. Just on the above, `lapply()` should be able to iterate over the environment without needing `as.list()`, although I'm not at a computer to check at the moment.
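For the record, a quick check of the claim above: `lapply()` coerces non-vector arguments with `as.list()` internally, so it does accept an environment directly. Note that element order is arbitrary for hashed environments.

```r
e <- new.env()
e$a <- 1
e$b <- 2

# lapply() coerces the environment via as.list() before iterating,
# so no explicit as.list() call is needed at the call site.
res <- lapply(e, function(v) v * 10)

sort(unlist(res))  # values 10 and 20, named "a" and "b"
```

`eapply()` is the purpose-built alternative, but since `lapply()` performs the same coercion, either works here.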
How is the speed comparing to clustermq at this point?
Yes, great point. Further profiling shows … (Line 390 in 3f1dfca):
On my Macbook, …
You know what? Maybe I am optimizing the wrong thing, or reading the profiler output wrong. If I add a sleep to let the tasks finish:

```r
for (task in seq_len(n_tasks)) {
  controller$push(
    name = as.character(index),
    command = TRUE,
    scale = FALSE,
    seed = 0L
  )
}
Sys.sleep(3)
while (length(controller$queue) > 0L) {
  controller$collect(throttle = TRUE)
}
```

then the … Alternatively, when I turn on throttling in …:

```r
for (task in seq_len(n_tasks)) {
  controller$push(
    name = as.character(index),
    command = TRUE,
    scale = FALSE,
    seed = 0L
  )
}
while (length(controller$queue) > 0L) {
  controller$collect(throttle = TRUE)
}
```

then the "bottleneck" appears to be in … These may not be bottlenecks at all; they may just be the places where the profiler happens to be sampling while … But I am pretty sure …
After solving #83, here is what I see in the flame graph after returning to the original … I do not believe … (Lines 149 to 151 in ef35e06).
Speeds are looking really good if we remove delays waiting for tasks. The following is on my Ubuntu machine, which has proved much slower than my Macbook for these tests.

```r
library(crew)
controller <- crew_controller_local(workers = 4L)
controller$start()
controller$launch(n = 4L)
Sys.sleep(5)
index <- 0L
n_tasks <- 6000L
system.time({
  for (index in seq_len(n_tasks)) {
    controller$push(command = TRUE)
  }
})
#>    user  system elapsed
#>   0.571   0.022   1.122
Sys.sleep(5)
system.time(controller$collect())
#>    user  system elapsed
#>   0.024   0.000   0.025
Sys.sleep(5)
system.time({
  for (index in seq_len(n_tasks)) {
    controller$pop()
  }
})
#>    user  system elapsed
#>   0.570   0.000   0.614
controller$terminate()
```
And the analogous flame graphs:

```r
library(crew)
controller <- crew_controller_local(workers = 4L)
controller$start()
controller$launch(n = 4L)
index <- 0L
n_tasks <- 6000L
```

When pushing tasks, …

```r
Sys.sleep(5)
proffer::pprof({
  for (index in seq_len(n_tasks)) {
    controller$push(command = TRUE)
  }
})
Sys.sleep(5)
proffer::pprof(controller$collect())
```

And for `pop()`:

```r
Sys.sleep(5)
proffer::pprof({
  for (index in seq_len(n_tasks)) {
    controller$pop()
  }
})
```
If we're in the business of eking out gains, one thing before I forget these details: you may pass the 'data' variables in …
Thanks! I implemented your suggestion and replaced …
The only 2 other things I can think of are to …

After that, I think I will have done all the optimization I can from …

After (2), the …
Deparsing is expensive. Why not just store the call (language object)? It still prints.

I like this idea!
From experience with …
I see; in which case, I can offer the following from …:

```r
deparse(x, backtick = TRUE, control = NULL, nlines = 1L)
```

Specifying the arguments led to a real speedup (I forget the exact magnitude). You can specify …
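A small illustration of those arguments (the call object below is made up): `control = NULL` drops the default deparse options, and `nlines = 1L` stops after the first line instead of deparsing the whole object.

```r
x <- quote(
  long_function_name(argument_one = 1L, argument_two = letters, argument_three = TRUE)
)

# Default settings may wrap long calls onto multiple lines
# (width.cutoff defaults to 60 characters).
d_default <- deparse(x)

# control = NULL skips options like keepInteger; nlines = 1L
# bails out after producing the first line.
d_tuned <- deparse(x, backtick = TRUE, control = NULL, nlines = 1L)

length(d_tuned)  # always 1
```

For a task label you only ever print, the first line is usually enough, which is where the savings come from.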
Thanks! I may take you up on that eventually. For now, however, the existing …
Because of the blazing speed of `mirai`, I think `crew` has the potential to reach speeds comparable to `clustermq`. But so far, `crew` version 0.2.1 looks to be slower.

**`crew` performance**

In the following example, the timed part of the code took 10.477 seconds to complete on my 4-core Ubuntu machine. Replacing `system.time()` with `proffer::pprof()`, it looks like task management is the bottleneck. There may be a way to rework the R code to make this more efficient.

**`clustermq` performance**

The equivalent `clustermq` example only took 3.060 seconds. And the flame graph: