
Crash detection #20

Closed
wlandau opened this issue Feb 22, 2023 · 10 comments

wlandau commented Feb 22, 2023

If a server crashes while running a task, is there a way to promptly know if the task is never going to complete? I tried the following steps on my SGE cluster. On a server node:

library(mirai)
server("tcp://CLIENT_IP:5555")

On the client, which is a different node with a different IP from the server:

library(mirai)
daemons("tcp://CLIENT_IP:5555")
m <- mirai({Sys.sleep(10); "finished"})

Then, before the 10 seconds elapsed, I terminated the server process. On the client, the mirai object looks the same as it did while the job was still running.

m
#> < mirai >
#>  - $data for evaluated result

m$data
#>  'unresolved' logi NA

m$aio
#> <pointer: 0x29fca20>
#> attr(,"ctx")
#> < nanoContext >
#>  - id: 1
#>  - socket: 2
#>  - state: opened
#>  - protocol: req

shikokuchuo (Owner) commented:

The easy solution is to always set a .timeout in your mirai() request and then test your m$data with is_error_value() before using it. If you set the time wide enough, it will at least give you an upper bound. But let me investigate and see what can be done.
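
A minimal sketch of that pattern (values are hypothetical; .timeout is in milliseconds, and is_error_value() comes from nanonext):

library(mirai)

# Hypothetical task: stop waiting after 60 seconds.
m <- mirai({
  Sys.sleep(10)
  "finished"
}, .timeout = 60000)

call_mirai(m)  # block until the result arrives or the timeout fires

if (is_error_value(m$data)) {
  # timed out (or errored): treat the task as possibly lost and retry or fail here
} else {
  result <- m$data
}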

wlandau commented Feb 22, 2023

Thanks, I appreciate your openness to these features.

I do not think a timeout in each mirai() would be enough for my case. I am seldom sure how long each task will need a priori, and if I overestimate the timeout, that could mean a significant delay in finding out about the crash. For the efficiency of the {targets} pipelines I have in mind, I think I would need to find out as soon as possible if a particular job will not finish due to a broken connection.

shikokuchuo (Owner) commented:

Implemented in 5661217, an active queue is now self-repairing if a node fails.

This will be extended to the more general version of the active queue once that is implemented.

wlandau commented Feb 23, 2023

Amazing! I tested this locally, and it appears to work (see below).

I only worry about one possible edge case:

  1. mirai client sends the job.
  2. Job crashes the mirai server due to a bug in the job's code.
  3. crew launches a new server to replace the crashed one.
  4. mirai sends the job again.
  5. Job crashes a server again.
  6. Repeat...

How would you suggest I avoid this improbable but vicious loop?

Current test:

# client xx.xx.xx.138
library(mirai)
daemons("tcp://xx.xx.xx.138:5555", nodes = 2)

# server xx.xx.xx.175 and xx.xx.xx.176
library(mirai)
server("tcp://xx.xx.xx.138:5555")

# client again
m1 <- mirai({
  Sys.sleep(5)
  paste0("job1_", getip::getip())
})
m2 <- mirai({
  Sys.sleep(5)
  paste0("job2_", getip::getip())
})
m3 <- mirai({
  Sys.sleep(5)
  paste0("job3_", getip::getip())
})
m4 <- mirai({
  Sys.sleep(5)
  paste0("job4_", getip::getip())
})
for (i in 1:50) {
  Sys.sleep(1)
  print(sprintf("%ss %s %s %s %s", i, m1$data, m2$data, m3$data, m4$data))
}

# client output
[1] "1s NA NA NA NA"
[1] "2s NA NA NA NA"
[1] "3s NA NA NA NA" # Right here is when I terminated the server on xx.xx.xx.175. Job 1 got rescheduled!
[1] "4s NA NA NA NA"
[1] "5s NA job2_xx.xx.xx.176 NA NA"
[1] "6s NA job2_xx.xx.xx.176 NA NA"
[1] "7s NA job2_xx.xx.xx.176 NA NA"
[1] "8s NA job2_xx.xx.xx.176 NA NA"
[1] "9s NA job2_xx.xx.xx.176 NA NA"
[1] "10s NA job2_xx.xx.xx.176 NA job4_xx.xx.xx.176"
[1] "11s NA job2_xx.xx.xx.176 NA job4_xx.xx.xx.176"
[1] "12s NA job2_xx.xx.xx.176 NA job4_xx.xx.xx.176"
[1] "13s NA job2_xx.xx.xx.176 NA job4_xx.xx.xx.176"
[1] "14s NA job2_xx.xx.xx.176 NA job4_xx.xx.xx.176"
[1] "15s NA job2_xx.xx.xx.176 job3_xx.xx.xx.176 job4_xx.xx.xx.176"
[1] "16s NA job2_xx.xx.xx.176 job3_xx.xx.xx.176 job4_xx.xx.xx.176"
[1] "17s NA job2_xx.xx.xx.176 job3_xx.xx.xx.176 job4_xx.xx.xx.176"
[1] "18s NA job2_xx.xx.xx.176 job3_xx.xx.xx.176 job4_xx.xx.xx.176"
[1] "19s NA job2_xx.xx.xx.176 job3_xx.xx.xx.176 job4_xx.xx.xx.176"

shikokuchuo (Owner) commented:

This would be rare, as all evaluation is wrapped in tryCatch(). It might only occur if the OOM killer terminates the process, or something like that.

You would need to call stop_mirai() on the problematic mirai. By design, the req/rep protocol we use is for guaranteed delivery. If a server dies, NNG automatically detects that the connection has been broken and will resend the task to another connection (whether you launch a new node or not).
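
A minimal sketch of that escape hatch, for a hypothetical mirai m that keeps being resent:

library(mirai)

if (unresolved(m)) {
  stop_mirai(m)  # cancel the request so NNG stops redelivering it
}
m$data  # should now resolve to an error value rather than a result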

How quickly it attempts retries, and similar behaviour, can be set through the nanonext/NNG opt() interface. If you find you need it, I can re-export it in mirai.
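
For illustration only, a sketch of what tuning the resend interval might look like on a raw nanonext req socket (this assumes nanonext's opt() setter and NNG's "req:resend-time" option, in milliseconds; mirai does not expose this directly at the time of writing):

library(nanonext)

s <- socket("req")
opt(s, "req:resend-time") <- 30000  # re-send an unanswered request every 30s
close(s)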

wlandau commented Feb 27, 2023

I see. Is there an NNG option for the maximum number of retries?

shikokuchuo (Owner) commented:

That is not currently an option. This is such an edge case... I am not sure it is something that is worth handling, even on your side.

wlandau commented Feb 27, 2023

I agree that it is extremely rare, but I do care about it. The loop in #20 (comment) could be such a nightmare for an unlucky user with servers running as AWS Batch jobs backed by expensive EC2 instances.

Is there anything else we can do through mirai?

If not, I think I would need to let go of my plans for fault tolerance in crew. Instead of trying to restart a crashed server, crew would only ever launch a new server when a user submits a job. Then if crew ever detects more unfinished jobs than servers, it could report a crash and maybe error out the whole pipeline.

shikokuchuo (Owner) commented:

For something like this, if it is going to be useful for 99% of cases, then why not go ahead with the feature, but just have the option to turn it off for the 1% of times when this might happen. If you really care about the 1%, then have the default switched to off - but we tend to overestimate the probability of rare events in any case.

wlandau commented Feb 28, 2023

Yeah, I think I could let users opt in or out of fault tolerance via crew.
