Superfluous launches in the case of fully transient workers #51
To do: test in a temporary fork that uses …
Update: although other performance tests and benchmarks have improved immensely with …
In https://github.com/wlandau/crew/compare/main..096e6727cf2715983e7cb347c05748ca103a31aa, I try switching to …
I did do some work to clean up the controller and launcher logic, now in branch … There, I see the error "'miraiError' chr Error in envir[[\".expr\"]]: subscript out of bounds". Maybe I can reproduce the error with just …
Got a reprex, not of superfluous worker launches, but of the subscript-out-of-bounds error above:

```r
library(mirai)
library(nanonext)
daemons(n = 1L, url = "ws://127.0.0.1:5000")
tasks <- lapply(seq_len(100L), function(x) {
  mirai(x, x = x)
})
results <- list()
px <- NULL
launches <- 0L
while (length(results) < 100L) {
  if (is.null(px) || !px$is_alive()) {
    px <- callr::r_bg(function() {
      mirai::server("ws://127.0.0.1:5000", maxtasks = 1L)
    })
    launches <- launches + 1L
  }
  done <- integer(0L)
  for (i in seq_along(tasks)) {
    if (!.unresolved(tasks[[i]])) {
      done <- c(done, i)
      results[[length(results) + 1L]] <- tasks[[i]]
    }
  }
  tasks[done] <- NULL
}
print(launches)
#> [1] 100
data <- as.character(lapply(results, function(x) x$data))
print(data)
#> [1] "1"
#> [2] "2"
#> [3] "3"
#> [4] "4"
#> [5] "5"
#> [6] "6"
#> [7] "7"
#> [8] "Error in envir[[\".expr\"]]: subscript out of bounds"
#> ...
sum(grepl("^Error", data))
#> [1] 24
daemons(0L)
```
Just posted shikokuchuo/mirai#43 with the above reprex. After it's fixed, I will try the original reprex from this thread again and see how much closer to solved it is.
Smaller version of the same test code is below. Seems to work on my MacBook.

```r
library(mirai)
library(nanonext)
daemons(n = 1L, url = "ws://127.0.0.1:5000")
launches <- 0L
pids <- integer(0L)
while (length(pids) < 100L) {
  if (!exists("px") || !px$is_alive()) {
    px <- callr::r_bg(\() mirai::server("ws://127.0.0.1:5000", maxtasks = 1L))
    launches <- launches + 1L
  }
  if (!exists("m") || !.unresolved(m)) {
    if (exists("m")) pids <- c(pids, m$data)
    m <- mirai(ps::ps_pid())
  }
}
print(launches)
daemons(n = 0L)
```
After the improvements I made yesterday, I have observed that nearly all of the superfluous worker launches went away when I removed manual worker termination. (In …) I think I may need to implement an exit delay similar to the … Or maybe I should only force-quit the workers that never connect within the startup window. Once a worker connects to a custom …
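A minimal sketch of that force-quit idea, assuming the launcher keeps the callr/processx handle for each worker and its own record of whether the worker ever connected. The helper name, the `connected` flag, and the 5-second window are illustrative assumptions, not actual crew code:

```r
# Hypothetical helper: only force-quit a worker that never connected within the
# startup window. `handle` is a processx/callr process handle like the ones in the
# reprexes above; `connected` is the launcher's own bookkeeping.
should_force_quit <- function(handle, connected, startup_window = 5) {
  if (connected) {
    return(FALSE) # the worker dialed in, so let idle/max-task limits handle it
  }
  elapsed <- difftime(Sys.time(), handle$get_start_time(), units = "secs")
  handle$is_alive() && elapsed > startup_window
}
```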
@shikokuchuo, in these tests, I consistently see …
This is probably just a function of when the information snapshot is taken. I mean the important thing is that the work is actually being done, and I am confident from my own tests that it is. This is probably a next milestone thing: whether it makes sense to maintain a cumulative record at mirai of all server instances.
Thanks for explaining. I added some thoughts to #51 (comment):
It would really help with this to have an optional connection timeout in …
Alternatively, I would understand if you prefer to use the existing idle time for this. Just thought a separate connection timeout could be shorter and add additional safety to completely persistent workers (which have idletime = Inf).
Hmm...maybe I don't need an extra connection timeout in …
OK. What you propose in terms of the timeout is certainly feasible. Currently everything seems very robust in terms of tests etc., so I would like to get a stable version of 'mirai' onto CRAN this week as soon as possible, and then start making changes from there. So you have a bit of time to think on it.
Awesome! From my experiments, I am suspecting more and more that the original issue in this thread comes from …
On second thought, I actually do think it would very much help to have a …
I think I almost have this one solved now. Most of the superfluous launches are gone. There is just one dangling worker that seems to want to show up at the very end. I am not yet sure why, but this is now a simpler and less serious problem.
This bug, if it even is a bug anymore, is becoming extremely hard to reproduce (a very good problem).
Based on earlier hard-to-pin-down test failures, maybe this smaller case could help:

```r
library(crew)
crew_session_start()
x <- crew_controller_callr(workers = 1L, tasks_max = 1L)
x$start()
x$push(ps::ps_pid())
x$wait(mode = "all")
pid_out <- x$pop(scale = FALSE)$result[[1]]
pid_exp <- x$launcher$workers$handle[[1]]$get_pid()
identical(pid_out, pid_exp)
x$terminate()
crew_session_terminate()
```
And a controller group version:

```r
library(crew)
crew_session_start()
for (i in seq_len(100)) {
  print(i)
  x <- crew_controller_group(
    crew_controller_callr(workers = 1L, tasks_max = 1L, name = "a"),
    crew_controller_callr(workers = 1L, tasks_max = 1L, name = "b")
  )
  x$start()
  x$push(ps::ps_pid(), controller = "b")
  x$wait(mode = "all")
  pid_out <- x$pop(scale = FALSE)$result[[1]]
  pid_exp <- x$controllers[["b"]]$launcher$workers$handle[[1]]$get_pid()
  stopifnot(identical(pid_out, pid_exp))
  x$terminate()
}
crew_session_terminate()
```
It takes several iterations, but eventually I get a PID mismatch in the above test. When that happens, there is a worker still running with … I think this means there was a slight mismatch between workers and tasks, and the task got done anyway. At worst, it's an off-by-one error because …
I wonder if I can reproduce this with just …
This appears to be working well:

```r
library(callr)
library(crew) # for crew_wait()
library(mirai)
daemons(n = 1L, url = "ws://127.0.0.1:5000", .compute = "a")
daemons(n = 1L, url = "ws://127.0.0.1:5001", .compute = "b")
for (i in seq_len(100)) {
  print(i)
  m <- mirai(ps::ps_pid(), .compute = "b")
  px <- r_bg(\() mirai::server("ws://127.0.0.1:5001", maxtasks = 1L))
  crew_wait(~!unresolved(m), seconds_interval = 0.001, seconds_timeout = 5)
  stopifnot(identical(m$data, px$get_pid()))
}
daemons(n = 0L, .compute = "a")
daemons(n = 0L, .compute = "b")
```
This example with just …

```r
library(callr)
library(mirai)
library(nanonext)
library(purrr)
daemons(n = 4L, url = "ws://127.0.0.1:5005")
urls <- rownames(daemons()$daemons)
tasks <- map(seq_len(200), ~mirai(ps::ps_pid()))
launches <- rep(0L, 4L)
workers <- as.list(rep(FALSE, 4L))
names(launches) <- urls
names(workers) <- urls
while ((pending <- sum(map_lgl(tasks, .unresolved))) > 0L) {
  online <- daemons()$daemons[, "status_online"]
  disconnected <- names(online)[online < 1L]
  relaunch <- head(disconnected, n = pending)
  for (url in relaunch) {
    w <- workers[[url]]
    elapsed <- 10
    if (!isFALSE(w)) {
      elapsed <- difftime(Sys.time(), w$get_start_time(), units = "secs")
    }
    if (isFALSE(w) || (!w$is_alive() && (elapsed > 5))) {
      px <- r_bg(\(url) mirai::server(url, maxtasks = 1L), args = list(url = url))
      workers[[url]] <- px
      launches[url] <- launches[url] + 1L
    }
  }
  Sys.sleep(0.001)
}
daemons(n = 0L)
```
Successfully reproduced the mismatch without controller groups:

```r
library(crew)
crew_session_start()
x <- crew_controller_callr(workers = 1L, tasks_max = 1L, name = "a")
x$start()
for (i in seq_len(100)) {
  print(i)
  x$push(ps::ps_pid(), scale = FALSE)
  x$wait(mode = "all")
  pid_out <- x$pop(scale = FALSE)$result[[1]]
  pid_exp <- x$launcher$workers$handle[[1]]$get_pid()
  stopifnot(identical(pid_out, pid_exp))
}
x$terminate()
crew_session_terminate()
```
5a1f823 eliminates the remaining superfluous worker launches: right before a new launch, do a last-minute check to see if there is already a worker connected. I think this works because auto-scaling data may be out of date by the time the actual launch is reached.
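A rough sketch of what such a last-minute check could look like, reusing the daemons() snapshot format from the examples above. The wrapper itself is illustrative, not the actual code in 5a1f823:

```r
# Hypothetical wrapper: re-check the daemons() snapshot immediately before launching,
# in case a worker connected after the auto-scaling decision was made.
launch_if_still_needed <- function(url, launch_fun) {
  online <- mirai::daemons()$daemons[, "status_online"]
  if (!is.na(online[url]) && online[url] > 0L) {
    return(invisible(NULL)) # a worker is already connected at this URL; skip the launch
  }
  launch_fun(url)
}
```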
Or maybe that just masks the problem. The current which_active() checks "connected" status before it checks "discovered" status. Between those two checks, a worker could dial in.
What is your latency here - is it because you are only polling occasionally, or do you need faster updates than calling …? Sorry, this probably doesn't help, but this new feature is just so cool :)
#51 itself does not need a faster alternative to … The relevant {crew} utilities are at lines 51 to 69 in 18a8b28.
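For context, here is a sketch of the kind of stat()-based check those utilities perform. The socket setup and the "pipes" statistic name are assumptions about nanonext/NNG, not a quote of that code:

```r
library(nanonext)
# One listening socket per worker URL (the port is arbitrary for illustration).
sock <- socket(protocol = "bus", listen = "ws://127.0.0.1:5702")
# "Connected": a worker is attached to this socket right now. The "pipes" statistic
# counts currently open connections (an assumption about the available NNG stats).
connected <- stat(sock, "pipes") > 0L
# "Discovered" would come from a cumulative counter (for example a listener's
# accepted-connection count), read with a second stat() call -- hence the two
# stat() calls per socket discussed below.
```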
For #51 itself, the underlying issue turned out to be a really tricky race condition in the …
All I needed to do was switch the order of (2) and (4). Implemented in 18a8b28. Before 18a8b28, I could reliably reproduce the issue using #51 (comment). Now, I do not see any superfluous launches in the same test, or in … What a relief to have solved this!
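A toy simulation of the race, with hypothetical logic rather than crew's actual which_active() code, shows why reading the cumulative "discovered" status before the instantaneous "connected" status closes the window:

```r
# The "worker" dials in between the two status reads.
worker_online <- FALSE    # live connection right now ("connected")
worker_ever_seen <- FALSE # cumulative counter ("discovered")

check_racy <- function() {
  connected <- worker_online                         # read "connected" first
  worker_online <<- TRUE; worker_ever_seen <<- TRUE  # worker dials in between reads
  discovered <- worker_ever_seen                     # read "discovered" second
  discovered && !connected # TRUE => looks like a finished worker => relaunch
}
check_racy()
#> [1] TRUE   # superfluous launch

worker_online <- FALSE; worker_ever_seen <- FALSE
check_fixed <- function() {
  discovered <- worker_ever_seen                     # read "discovered" first
  worker_online <<- TRUE; worker_ever_seen <<- TRUE  # worker dials in between reads
  connected <- worker_online                         # read "connected" second
  discovered && !connected # FALSE => worker still counts as launching/connected
}
check_fixed()
#> [1] FALSE  # no superfluous launch
```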
Yes, that's great, indeed swapping the order seems to work. There might be some benefit from using condition variables, but only if latency or performance is important. You are basically registering callbacks to happen on each event - this is all asynchronous so it doesn't slow anything down, but you get the NNG stats 'for free', so I wouldn't implement it unless there's a need. But having said that, in the above, you are calling stat twice on each socket? If you have 500 sockets, that could be quite slow, right? Reading the value of a condition variable would be almost instantaneous. I can give you pointers if you want to go down this route.
Yes, for "discovered" workers, I do end up calling stat() twice. I haven't had the chance to load test 500 workers yet because I have only tried local process workers, but I will take your word for it that it could be slow. {crew} currently assumes these checks are practically instantaneous, so any slight inefficiency at scale may be noticeable. I would love to try condition variables for the {crew} utilities I linked to. If you would be willing to help me get started on cv-powered drop-in replacements for stat(), I would really appreciate it.
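As a rough starting point, this is the kind of setup being described, assuming nanonext's cv(), pipe_notify(), and cv_value() behave as documented (argument names and availability may differ across nanonext versions):

```r
library(nanonext)
# One socket per worker URL, as before (port arbitrary for illustration).
sock <- socket(protocol = "bus", listen = "ws://127.0.0.1:5703")
cv <- cv()
# Ask NNG to bump the condition variable whenever a pipe (connection) is added or
# removed on this socket. This registers asynchronous callbacks, so nothing polls.
pipe_notify(sock, cv, add = TRUE, remove = TRUE)
# Later, in the launcher's scaling code, reading the counter is almost instantaneous,
# with no per-socket stat() calls:
events <- cv_value(cv) # total connect + disconnect events seen so far
```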
Opened #57 for this.
Promoting #50 to an issue so it is more visible. See reprex below. I am not sure where the problem comes from. Things may improve with shikokuchuo/mirai#38, and they may improve if we can figure out a way to use mirai's own sockets instead of custom bus sockets to detect when particular new instances of worker processes connect.
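For context, a rough sketch of the custom-bus-socket approach mentioned above: the launcher listens at a unique URL for each worker launch, the launched worker dials that URL, and the launcher can then tell that this particular instance is online. The URL and the "pipes" statistic are illustrative assumptions, not crew's exact implementation:

```r
library(nanonext)
# Launcher side: a dedicated bus socket for one specific worker launch.
bus <- socket(protocol = "bus", listen = "ws://127.0.0.1:5710/worker-1")
# Worker side (inside the launched R process) would dial the same URL, e.g.:
#   nanonext::socket(protocol = "bus", dial = "ws://127.0.0.1:5710/worker-1")
# Back in the launcher, this tells us whether that particular instance has connected:
instance_connected <- stat(bus, "pipes") > 0L
```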