
Ubuntu 20.04 mclapply "child process has died" #366

Open
mbacou opened this issue Aug 6, 2020 · 41 comments

mbacou commented Aug 6, 2020

@jeroen we've migrated our OpenCPU config from Ubuntu 16.04 to 20.04 (opencpu-server 2.2.0). It had been stable on 16.04 for the past 3+ years, but now some requests fail intermittently with the error below, and I can't find any additional log messages.

I can make the request break by increasing the number of file reads on the server (we run models that require from 1 up to 6,000 file reads per request), so I figured I must be hitting Apache or OpenCPU limits (everything works as expected in an interactive R session). I've copied my ./ocpu/info below as well.

Any pointers on how I should go about debugging this?

Thanks!

child process has died

In call:
tryCatch({
    if (length(priority)) 
        setpriority(priority)
    if (length(rlimits)) 
        set_rlimits(rlimits)
    if (length(gid)) 
        setgid(gid)
    if (length(uid)) 
        setuid(uid)
    if (length(profile)) 
        aa_change_profile(profile)
    if (length(device)) 
        options(device = device)
    graphics.off()
    options(menu.graphics = FALSE)
    serialize(withVisible(eval(orig_expr, parent.frame())), NULL)
}, error = function(e) {
    old_class <- attr(e, "class")
    structure(e, class = c(old_class, "eval_fork_error"))
}, finally = substitute(graphics.off()))
# OpenCPU: Producing and Reproducing Results
Cloud Server (version: 2.2.0)

## System
R version 4.0.2 (2020-06-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.1 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=C.UTF-8    LC_NUMERIC=C        LC_TIME=C.UTF-8    
 [4] LC_COLLATE=C.UTF-8  LC_MONETARY=C.UTF-8 LC_MESSAGES=C      
 [7] LC_PAPER=C          LC_NAME=C           LC_ADDRESS=C       
[10] LC_TELEPHONE=C      LC_MEASUREMENT=C    LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] opencpu_2.2.0

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.4.6      lattice_0.20-41   codetools_0.2-16  mime_0.9         
 [5] grid_4.0.2        jsonlite_1.7.0    unix_1.5.1        evaluate_0.14    
 [9] curl_4.3          data.table_1.13.1 raster_3.3-13     sp_1.4-2         
[13] webutils_1.1      tools_4.0.2       parallel_4.0.2    compiler_4.0.2   
[17] askpass_1.1       openssl_1.4.2    

## Configuration
[1] "/usr/lib/opencpu/library/opencpu/config/defaults.conf"
[2] "/etc/opencpu/server.conf"                             

## Libraries
[1] "/usr/local/lib/opencpu/site-library" "/usr/local/lib/R/site-library"      
[3] "/usr/lib/R/site-library"             "/usr/lib/R/library"                 
[5] "/usr/lib/opencpu/library"           

## Apps
[1] "/usr/local/lib/opencpu/apps"

## Limits
$cur
        as       core        cpu       data      fsize    memlock     nofile 
4000000000          0         63        Inf 1000000000      65536       8192 
     nproc      stack 
       500    8388608 

$max
        as       core        cpu       data      fsize    memlock     nofile 
4000000000        Inf         63        Inf 1000000000      65536       8192 
     nproc      stack 
       500        Inf 


## Apparmor
      compiled        enabled            con           mode 
        "TRUE"         "TRUE" "opencpu-exec"      "enforce" 

mbacou changed the title Ubuntu 20.04 "child process" → Ubuntu 20.04 "child process failed" Aug 6, 2020
mbacou changed the title Ubuntu 20.04 "child process failed" → Ubuntu 20.04 "child process has died" Aug 6, 2020

jeroen commented Aug 6, 2020

Are there any messages in your /var/log/kern.log or other system log files that hint why the process died?


mbacou commented Aug 6, 2020

Sorry, no. I have a long list of AppArmor warnings as below, but failed OpenCPU requests don't seem to log any new entries to /var/log/kern.log.

kernel: [ 6507.054355] audit: type=1400 audit(1596722896.561:118): apparmor="STATUS" operation="profile_replace\
" info="same as current profile, skipping" profile="unconfined" name="/usr/lib/NetworkManager/nm-dhcp-helper" pid=29330 comm="apparmor_parser"
kernel: [ 6507.054342] audit: type=1400 audit(1596722896.557:115): apparmor="STATUS" operation="profile_replace\
" info="same as current profile, skipping" profile="unconfined" name="opencpu-exec" pid=29328 comm="apparmor_parser"
kernel: [ 6507.054347] audit: type=1400 audit(1596722896.561:116): apparmor="STATUS" operation="profile_replace\
" info="same as current profile, skipping" profile="unconfined" name="opencpu-main" pid=29334 comm="apparmor_parser"
kernel: [ 6507.055016] audit: type=1400 audit(1596722896.561:121): apparmor="STATUS" operation="profile_replace\
" info="same as current profile, skipping" profile="unconfined" name="/usr/lib/snapd/snap-confine" pid=29333 comm="apparmor_parser" 


mbacou commented Aug 6, 2020

Our models do spawn additional child processes with mclapply(); I wonder if I should turn that off while debugging?


mbacou commented Aug 6, 2020

The only logs I see are in /var/log/apache2/access.log, e.g.

[06/Aug/2020:15:08:04 +0000] "POST /ocpu/library/wcapi/R/monitor/json HTTP/1.1" 400 5022 "-" "curl/7.68.0"                   
[06/Aug/2020:15:11:36 +0000] "POST /ocpu/library/wcapi/R/monitor/json HTTP/1.1" 201 1855033 "-" "curl/7.68.0"


jeroen commented Aug 6, 2020

Yeah, maybe try without mclapply. Also, in your /etc/opencpu/server.conf try raising rlimit.nproc.


mbacou commented Aug 6, 2020

No change when turning off mclapply. Tried raising the as and nproc limits as below, but no difference either.
But we get a consistent child process has died error when requesting over ~1100 file reads in one model run (regardless of which files). The same request in an interactive R session takes under 1 sec.

    "rlimit.as": 14e9,                                                                                               
    "rlimit.fsize": 1e9,                                                                                            
    "rlimit.nproc": 1000,                                                                                           
    "timelimit.get": 60,                                                                                            
    "timelimit.post": 180,                                                                                          
    "timelimit.webhook": 900,                                                                                       
    "preload": ["data.table","raster"]  


jeroen commented Aug 6, 2020

Hmm so it only happens if you open lots of files? And there is nothing in /var/log/apache2/error.log or /var/log/opencpu/error.log ?


mbacou commented Aug 6, 2020

Yes that's correct )-:

Only entries in /var/log/apache2/error.log:

Using locale: C.UTF-8                                                                                  
AppArmor available! Running OpenCPU with security profile and rlimits.                                 
Loading config from /usr/lib/opencpu/library/opencpu/config/defaults.conf                              
Loading config from /etc/opencpu/server.conf                                                           
OpenCPU cloud server ready.  

and my /var/log/opencpu/error.log is empty.

All I can say is that we have updated package versions between our 16.04 and 20.04 configs (data.table 1.12.9 vs. 1.13.1, raster 3.3.7 vs. 3.3.17). These are the main deps.

I have whitelisted all required local system paths in /etc/apparmor.d/opencpu.d/custom.

I've been randomly accessing between 500 and 1100 files, and so far get consistent failures over ~1000. This is on an 8-core / 32 GB VM.


mbacou commented Aug 6, 2020

In addition to rlimit.*, is there a limit on the number/size of /tmp files OpenCPU processes can generate?


jeroen commented Aug 6, 2020

In addition to rlimit.*, is there a limit on the number/size of /tmp files OpenCPU processes can generate?

Not that I am aware of, but Ubuntu switched to systemd in 18.04, so perhaps there are other changes somewhere...


mbacou commented Aug 7, 2020

Got a few extra lines in /var/log/kern.log after switching to mclapply(mc.silent=FALSE, ...). Might be informative (?)

kernel: [44069.123439] audit: type=1400 audit(1596760456.037:190): apparmor="DENIED" 
operation="open" profile="opencpu-exec" name="/sys/fs/cgroup/memory/memory.limit_in_bytes" 
pid=42507 comm="apache2" requested_mask="r" denied_mask="r" fsuid=33 ouid=0  


jeroen commented Aug 7, 2020

Hmmm, interesting. Try allowing that read in the AppArmor profile:

/sys/fs/cgroup/** r,


mbacou commented Aug 7, 2020

No go. I'm trying to downgrade R package versions to our prior config, just in case... Also, there's a fair amount of randomness in the error (yesterday I could pull ~1000 files, today it seems to fail at ~800). Puzzling. I'm excluding hardware issues since everything runs as usual outside of OpenCPU.


mbacou commented Aug 7, 2020

OK, finally fixed this by removing all calls to mclapply()! Not sure what's causing an issue with ocpu, but that's progress.
Adding MC_CORES=1 to /etc/opencpu/Renviron takes care of the problem for now.
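
For anyone searching later, the workaround is a single line (the path is the OpenCPU Renviron file mentioned above):

```
# /etc/opencpu/Renviron: force mclapply() and friends down to one worker
MC_CORES=1
```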


jeroen commented Aug 7, 2020

That's very strange. You're not using rJava, are you?


mbacou commented Aug 7, 2020

Not using rJava. Here are our local package's deps:

import(data.table)
import(fitdistrplus)
import(httr)
import(parallel)
importFrom(MASS,fitdistr)
importFrom(RPostgres,Postgres)
importFrom(RPostgres,dbConnect)
importFrom(cli,cat_rule)
importFrom(fitdistrplus,fitdist)
importFrom(ggplot2,element_rect)
importFrom(ggplot2,element_text)
importFrom(ggplot2,margin)
importFrom(ggplot2,theme)
importFrom(ggplot2,theme_minimal)
importFrom(gtools,even)
importFrom(gtools,odd)
importFrom(lubridate,day)
importFrom(lubridate,floor_date)
importFrom(lubridate,leap_year)
importFrom(lubridate,years)
importFrom(raster,"crs<-")
importFrom(raster,crs)
importFrom(raster,extract)
importFrom(raster,raster)
importFrom(raster,rasterOptions)
importFrom(raster,removeTmpFiles)
importFrom(raster,stack)
importFrom(raster,writeRaster)
importFrom(sp,coordinates)
importFrom(tools,file_ext)
importFrom(tools,file_path_sans_ext)
importFrom(utils,untar)
importFrom(utils,unzip)

I only have mclapply kick in over 365*3 file reads, so the error pattern at around ~1000 files sort of makes sense.

Here is a snippet of that call (calling raster::stack() and raster::extract()):

  mc.cores = if(mc.cores > 1L && length(y) > 3) mc.cores else 1L

  if(verbose) message("Extracting ", nrow(catalog), " ", toupper(code), " rasters over ",
    mc.cores, " cores. Please stand by...\n")

  tmp = mclapply(y, mc.cores=mc.cores, mc.silent=FALSE, mc.allow.recursive=FALSE, function(x) {
    r = stack(as.list(catalog[year(date)==x, file]), quick=TRUE, native=native)
    names(r) = catalog[year(date)==x, layer]
    res = extract(r, data.grp)
    return(res)
  })
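
(A sketch worth adding here, not from the original code: when a forked worker dies, mclapply() does not raise an error in the parent but returns a "try-error" object for that element, so silent worker deaths can be surfaced explicitly.)

```r
# Sketch: detect workers that died silently. With mc.cores > 1, mclapply()
# returns a "try-error" element (plus a warning) for any child that failed,
# instead of raising an error in the parent process.
library(parallel)

res <- mclapply(1:4, function(i) i^2, mc.cores = 2)
bad <- vapply(res, inherits, logical(1), what = "try-error")
if (any(bad)) stop(sum(bad), " worker(s) died or errored")
```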


jeroen commented Aug 8, 2020

Can you try raising rlimit.nproc even higher? Perhaps there is a bug in R that it isn't properly closing the parallel procs.


mbacou commented Aug 8, 2020

Jeroen, yes indeed: I had to raise "rlimit.nproc": 5000, but it's working now without MC_CORES=1.
Can't tell if it's a problem with raster::extract or an R bug (?)

Correction: it worked once, then failed consistently. I rebooted the VM; it worked again one time and failed thereafter. So I'm reverting to MC_CORES=1!


jeroen commented Aug 8, 2020

Are you using mccollect() to clean up the finished parallel jobs?


mbacou commented Aug 8, 2020

I don't think I can use mccollect with mclapply, but I added mc.cleanup=TRUE:

  mc.cores = if(mc.cores > 1 && length(y) > 3) mc.cores else 1L

  tmp = mclapply(y,
    mc.cores=mc.cores, mc.silent=FALSE, mc.allow.recursive=FALSE, mc.cleanup=TRUE,
    function(x) {
      #rasterOptions(timer=FALSE, chunksize=2e+08, datatype=switch(code, rfe="INT1U", "FLT4S"))
      r = stack(as.list(catalog[year(date)==x, file]), quick=TRUE, native=native)
      names(r) = catalog[year(date)==x, layer]
      res = extract(r, data.grp)
      return(res)
    })

I'm getting really inconsistent behavior now; even with MC_CORES=1, requests will fail.

Also how can I check ocpu's environment variables with curl, just to be sure?


jeroen commented Aug 8, 2020

Also how can I check ocpu's environment variables with curl, just to be sure?

curl https://cloud.opencpu.org/ocpu/library/base/R/Sys.getenv/print -X POST


mbacou commented Aug 8, 2020

OK thx. So yes, I can confirm: with the code above and MC_CORES=1, requests work consistently. Switching to e.g. MC_CORES=3, they all fail.


mbacou commented Aug 8, 2020

There might be recent changes in raster::extract() that make it spawn parallel processes by default now, even with point queries. Will need to check with @rhijmans.


jeroen commented Aug 8, 2020

OK, if you have time (not urgent), it would be helpful if you could create a small example for me to reproduce this. Maybe it has nothing to do with raster, and is just a problem with files not getting closed in concurrent processes?


mbacou commented Aug 8, 2020

Yeah, I thought about that as well: slow I/O with attached SSD drives since we moved from Azure to AWS. Will work on a reprex, all right. Thx!

But that wouldn't explain why it fails through OpenCPU only.

mbacou changed the title Ubuntu 20.04 "child process has died" → Ubuntu 20.04 mclapply "child process has died" Aug 8, 2020

mbacou commented Aug 8, 2020

Jeroen, here is a reprex: https://gist.github.com/mbacou/41c2fa42fa36cef2a19d9291432b8560

It reproduces the error on my Ubuntu 20.04 VM.

Use step1(dir="some location accessible to ocpu") to download & deflate a bunch of raster files (you only need to call that once).
Use step2(dir="some location accessible to ocpu", mc.cores=3) to reproduce the error in OpenCPU.

Let me know. Thx!


baohongz commented Sep 7, 2020

I ran into the same error message when calling http://104.131.40.122/ocpu/library/V8/R/. It was then fixed by turning off SELinux: https://linuxize.com/post/how-to-disable-selinux-on-centos-8/

OpenCPU error HTTP 400
child process has died

In call:
tryCatch({
    if (length(priority)) 
        setpriority(priority)
    if (length(rlimits)) 
        set_rlimits(rlimits)
    if (length(gid)) 
        setgid(gid)
    if (length(uid)) 
        setuid(uid)
    if (length(profile)) 
        aa_change_profile(profile)
    if (length(device)) 
        options(device = device)
    graphics.off()
    options(menu.graphics = FALSE)
    serialize(withVisible(eval(orig_expr, parent.frame())), NULL)
}, error = function(e) {
    old_class <- attr(e, "class")
    structure(e, class = c(old_class, "eval_fork_error"))
}, finally = substitute(graphics.off()))
CentOS Linux release 8.2.2004 (Core)
R version 4.0.2 (2020-06-22)
opencpu_2.2.0
V8_3.2.0 


mbacou commented Nov 9, 2020

FYI, on our Ubuntu 20.04 system SELinux is already disabled. I also tried to disable AppArmor with OCPU_DISABLE_APPARMOR=1 in /etc/opencpu/Renviron, but that didn't help with the child process has died bug, so I'm still using MC_CORES=1 as a workaround for now.


hai-wang commented Jun 9, 2021

Is there any update on this? We are in the middle of migrating from Ubuntu 16.04 / OpenCPU 2.0.8.1 to Ubuntu 20.04 / OpenCPU 2.2.2. During our testing with the new version, we ran into the same problem frequently with the same code. I tried all the suggestions mentioned in this thread, including setting MC_CORES=1, but the problem keeps happening. And setting MC_CORES=1 defeats the purpose of using mclapply to begin with.

@jeroen
Copy link
Member

jeroen commented Jun 10, 2021

I'm sorry I have never been able to discover the problem. It sounds like an issue with the parallel package in R.

Could you try adding this line to your /etc/opencpu/Rprofile:

assign("setup_strategy", "sequential", parallel:::defaultClusterOptions)

And see if that makes any difference?

@hai-wang

I'm sorry I have never been able to discover the problem. It sounds like an issue with the parallel package in R.

Could you try adding this line to your /etc/opencpu/Rprofile:

assign("setup_strategy", "sequential", parallel:::defaultClusterOptions)

And see if that makes any difference?

Unfortunately, it didn't make any difference. I checked these log files: /var/log/kern.log, /var/log/apache2/error.log, /var/log/opencpu/error.log, and /var/log/opencpu/apache_error.log, and there is nothing logged.

I don't know if this means anything, but if I retry the failed call many times (as many as 15 times), the call will eventually succeed.

Questions:

  1. What "child process" is it referring to? Is this an Apache process? Or is this some other process spun off by R when using mclapply? Could this be a thread instead of a process?
  2. Are there any other logs that may have a clue about why the child process died?
  3. Is there a way to increase the amount of logging to understand why the child process died?

Let me know if there's anything I can do to help find more information and get to the bottom of this.


karawoo commented Jun 29, 2021

Chiming in here, as we've run into this issue too. With 1 core everything is fine, but when we parallelize across multiple cores we see child process has died intermittently. For us, repeating the failed call usually works right away, so the issue isn't very reproducible; I think usually no more than 10 requests out of every 1000 fail. We're running in Docker as well, in case that's relevant.

Ideally we'd prefer not to limit MC_CORES to 1, because we would like to keep the performance benefits of parallelizing. @jeroen, should it theoretically be possible to have R code use mclapply() over multiple cores, or is that going to conflict with the way opencpu evaluates requests in forked processes?
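
(A defensive sketch, hypothetical code rather than a fix for this bug: the worker count can be read from the same MC_CORES variable discussed above, so a deployment can dial parallelism down without code changes, and mc.allow.recursive=FALSE makes mclapply fall back to serial evaluation when it is itself running inside a forked child.)

```r
# Sketch: take the worker count from MC_CORES (as set in /etc/opencpu/Renviron),
# defaulting to 1 when unset or malformed, so forking can be disabled per
# deployment. mc.allow.recursive=FALSE avoids forking from within a fork.
library(parallel)

n <- suppressWarnings(as.integer(Sys.getenv("MC_CORES", "1")))
if (is.na(n) || n < 1L) n <- 1L
res <- mclapply(1:8, function(i) i * i,
                mc.cores = n, mc.allow.recursive = FALSE)
```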


jeroen commented Jun 29, 2021

It really shouldn't conflict, but I have not yet been able to figure out where this bug comes from. I suspect it is related to the changes made in R 4.0 wrt how R manages the worker subprocesses.

There were several bugs in that area which would appear only under certain circumstances; some of them were fixed only very recently, for R 4.2. I'll try to find time to get this sorted out soon.


karawoo commented Jun 29, 2021

Thanks @jeroen, I can give it a try on devel and report back on whether that changes things at all.


karawoo commented Jul 1, 2021

Tried using rocker/r-ver:devel as our base image (R 4.1.0), and I still see the error there.


karawoo commented Jul 12, 2021

Another update: I tried switching to future/future.apply (using a multicore plan) and had not seen any child process has died errors in my testing so far. Edit: nope, it happens with future too.


jeroen commented Aug 4, 2021

I still haven't been able to figure out the issue, but I have disabled a feature that tries to kill orphaned processes. Perhaps it is related.

I'll push out opencpu 2.2.4.1 with this change; can you check whether it changes anything at all?


pilare commented Aug 8, 2021

@jeroen thank you for that release. I had been struggling with this since version 2.1.5, and was able to run it only on our Linux boxes; Azure VMs and AKS containers were always crashing on one of our NGS pipeline scripts. So I was thinking either the containerization engine was to blame (as I could see differences in ocpu/info) or R version 4, but this release solved my issues. I had been watching this thread for some time, trying to pin down exactly which operation has the "process has died" problem, so I could submit a similar issue with a full report.

I have one more request: please build and push to Docker Hub an opencpu/rstudio image based on that 2.2.4.1 version. We use the rstudio images on our development machines for ease of debugging and development of our packages.


jeroen commented Aug 8, 2021

@pilare OK, I'll tag a Docker release opencpu/rstudio:2.2.4.1. I'll wait for others to see if this really fixes the problem.


karawoo commented Aug 11, 2021

So far in my testing I have not seen any child process has died errors with the new version. We haven't deployed this to production yet, so it's possible things will be different in the wild, but the update looks very promising.

I don't know if this is useful information, but I've noticed that one of the calls that sometimes produced child process has died now sometimes fails with timeout reached (30.000000 sec). It's a relatively long call, and I mainly see the timeout error when I'm testing several API calls at once. We can increase the timeout limit and/or try to speed up the code, so that's not an issue in itself, but maybe it gives some hint about the underlying cause?

Edited to note that I've tested this with our codebase now using future/future.apply, not mclapply. Before the new opencpu version we were still getting child process has died errors with future, but less frequently than with mclapply.
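
(For reference, that timeout is OpenCPU's per-request limit, which is configurable; a hypothetical /etc/opencpu/server.conf fragment, with the value as an example only:)

```
{
  "timelimit.post": 90
}
```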


jeroen commented Aug 18, 2021

I've pushed a new release, opencpu 2.2.5 (which includes this fix), to CRAN and all Docker images. I've also overhauled the Docker build infrastructure so that it will again be easier to publish new releases if needed.

Hopefully the error will disappear for everyone.
