
Crash detection #20

Closed
wlandau opened this issue Feb 22, 2023 · 10 comments

wlandau commented Feb 22, 2023

If a server crashes while running a task, is there a way to promptly know if the task is never going to complete? I tried the following steps on my SGE cluster. On a server node:

library(mirai)
server("tcp://CLIENT_IP:5555")

On the client, which is a different node with a different IP from the server:

library(mirai)
daemons("tcp://CLIENT_IP:5555")
m <- mirai({Sys.sleep(10); "finished"})

Then, before the 10 seconds elapsed, I terminated the server process. On the client, the mirai object looks the same as it did while the job was still running.

m
#> < mirai >
#>  - $data for evaluated result

m$data
#>  'unresolved' logi NA

m$aio
#> <pointer: 0x29fca20>
#> attr(,"ctx")
#> < nanoContext >
#>  - id: 1
#>  - socket: 2
#>  - state: opened
#>  - protocol: req

shikokuchuo (Owner) commented:

The easy solution is to always set a .timeout in your mirai() request and then test your m$data with is_error_value() before using it. If you set the time wide enough, it will at least give you an upper bound. But let me investigate and see what can be done.
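
A minimal sketch of that pattern (values are hypothetical; .timeout is in milliseconds, and is_error_value() comes from nanonext):

library(mirai)

# Hypothetical task: stop waiting after 60 seconds.
m <- mirai({
  Sys.sleep(10)
  "finished"
}, .timeout = 60000)

call_mirai(m)  # block until the result arrives or the timeout fires

if (is_error_value(m$data)) {
  # timed out (or errored): treat the task as possibly lost and retry or fail here
} else {
  result <- m$data
}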

wlandau commented Feb 22, 2023

Thanks, I appreciate your openness to these features.

I do not think a timeout in each mirai() would be enough for my case. I am seldom sure how long each task will need a priori, and if I overestimate the timeout, that could mean a significant delay in finding out about the crash. For the efficiency of the {targets} pipelines I have in mind, I think I would need to find out as soon as possible if a particular job will not finish due to a broken connection.

shikokuchuo (Owner) commented:

Implemented in 5661217, an active queue is now self-repairing if a node fails.

This will be extended to the more general version of the active queue once that is implemented.

wlandau commented Feb 23, 2023

Amazing! I tested this locally, and it appears to work (see below).

I only worry about one possible edge case:

  1. mirai client sends the job.
  2. Job crashes the mirai server due to a bug in the job's code.
  3. crew launches a new server to replace the crashed one.
  4. mirai sends the job again.
  5. Job crashes a server again.
  6. Repeat...

How would you suggest I avoid this improbable but vicious loop?

Current test:

# client xx.xx.xx.138
library(mirai)
daemons("tcp://xx.xx.xx.138:5555", nodes = 2)

# server xx.xx.xx.175 and xx.xx.xx.176
library(mirai)
server("tcp://xx.xx.xx.138:5555")

# client again
m1 <- mirai({
  Sys.sleep(5)
  paste0("job1_", getip::getip())
})
m2 <- mirai({
  Sys.sleep(5)
  paste0("job2_", getip::getip())
})
m3 <- mirai({
  Sys.sleep(5)
  paste0("job3_", getip::getip())
})
m4 <- mirai({
  Sys.sleep(5)
  paste0("job4_", getip::getip())
})
for (i in 1:50) {
  Sys.sleep(1)
  print(sprintf("%ss %s %s %s %s", i, m1$data, m2$data, m3$data, m4$data))
}

# client output
[1] "1s NA NA NA NA"
[1] "2s NA NA NA NA"
[1] "3s NA NA NA NA" # Right here is when I terminated the server on xx.xx.xx.175. Job 1 got rescheduled!
[1] "4s NA NA NA NA"
[1] "5s NA job2_xx.xx.xx.176 NA NA"
[1] "6s NA job2_xx.xx.xx.176 NA NA"
[1] "7s NA job2_xx.xx.xx.176 NA NA"
[1] "8s NA job2_xx.xx.xx.176 NA NA"
[1] "9s NA job2_xx.xx.xx.176 NA NA"
[1] "10s NA job2_xx.xx.xx.176 NA job4_xx.xx.xx.176"
[1] "11s NA job2_xx.xx.xx.176 NA job4_xx.xx.xx.176"
[1] "12s NA job2_xx.xx.xx.176 NA job4_xx.xx.xx.176"
[1] "13s NA job2_xx.xx.xx.176 NA job4_xx.xx.xx.176"
[1] "14s NA job2_xx.xx.xx.176 NA job4_xx.xx.xx.176"
[1] "15s NA job2_xx.xx.xx.176 job3_xx.xx.xx.176 job4_xx.xx.xx.176"
[1] "16s NA job2_xx.xx.xx.176 job3_xx.xx.xx.176 job4_xx.xx.xx.176"
[1] "17s NA job2_xx.xx.xx.176 job3_xx.xx.xx.176 job4_xx.xx.xx.176"
[1] "18s NA job2_xx.xx.xx.176 job3_xx.xx.xx.176 job4_xx.xx.xx.176"
[1] "19s NA job2_xx.xx.xx.176 job3_xx.xx.xx.176 job4_xx.xx.xx.176"

shikokuchuo (Owner) commented:

This would be rare, as all evaluation is wrapped in tryCatch(). It might only occur if the OOM killer terminates the process, or something like that.

You would need to call stop_mirai() on the problematic mirai. By design, the req/rep protocol we use is for guaranteed delivery. If a server dies, NNG automatically detects that the connection has been broken and will resend the task to another connection (whether you launch a new node or not).
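
A minimal sketch of that escape hatch, for a hypothetical mirai m that keeps being resent:

library(mirai)

if (unresolved(m)) {
  stop_mirai(m)  # cancel the request so NNG stops redelivering it
}
m$data  # should now resolve to an error value rather than a result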

How quickly it attempts retries, and similar behaviour, can be set through the nanonext/NNG opt() interface. If you find you need it, I can re-export it in mirai.
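
For illustration only, a sketch of what tuning the resend interval might look like on a raw nanonext req socket (this assumes nanonext's opt() setter and NNG's "req:resend-time" option, in milliseconds; mirai does not expose this directly at the time of writing):

library(nanonext)

s <- socket("req")
opt(s, "req:resend-time") <- 30000  # re-send an unanswered request every 30s
close(s)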

wlandau commented Feb 27, 2023

I see. Is there an NNG option for the maximum number of retries?

shikokuchuo (Owner) commented:

That is not currently an option. This is such an edge case... I am not sure it is something that is worth handling, even on your side.

wlandau commented Feb 27, 2023

I agree that it is extremely rare, but I do care about it. The loop in #20 (comment) could be such a nightmare for an unlucky user with servers running as AWS Batch jobs backed by expensive EC2 instances.

Is there anything else we can do through mirai?

If not, I think I would need to let go of my plans for fault tolerance in crew. Instead of trying to restart a crashed server, crew would only ever launch a new server when a user submits a job. Then if crew ever detects more unfinished jobs than servers, it could report a crash and maybe error out the whole pipeline.

shikokuchuo (Owner) commented:

For something like this, if it is going to be useful for 99% of cases, then why not go ahead with the feature, but just have the option to turn it off for the 1% of times when this might happen. If you really care about the 1%, then have the default switched to off - but we tend to overestimate the probability of rare events in any case.

wlandau commented Feb 28, 2023

Yeah, I think I could let users opt in or out of fault tolerance via crew.
