
Ubuntu 20.04 mclapply "child process has died" #366

Open
mbacou opened this issue Aug 6, 2020 · 41 comments

mbacou commented Aug 6, 2020

@jeroen we've migrated our OpenCPU config from Ubuntu 16.04 to 20.04 (opencpu-server 2.2.0). It had been stable on 16.04 for the past 3+ years, but now some requests fail intermittently with the error below, and I can't find any additional log messages.

I can make the request break by increasing the number of file reads on the server (we run models that require from 1 up to 6,000 file reads per request), so I figured I must be hitting Apache or OpenCPU limits (everything works as expected in an interactive R session). I've copied my ./ocpu/info below as well.

Any pointers on how I should go about debugging this?

Thanks!

child process has died

In call:
tryCatch({
    if (length(priority)) 
        setpriority(priority)
    if (length(rlimits)) 
        set_rlimits(rlimits)
    if (length(gid)) 
        setgid(gid)
    if (length(uid)) 
        setuid(uid)
    if (length(profile)) 
        aa_change_profile(profile)
    if (length(device)) 
        options(device = device)
    graphics.off()
    options(menu.graphics = FALSE)
    serialize(withVisible(eval(orig_expr, parent.frame())), NULL)
}, error = function(e) {
    old_class <- attr(e, "class")
    structure(e, class = c(old_class, "eval_fork_error"))
}, finally = substitute(graphics.off()))
# OpenCPU: Producing and Reproducing Results
Cloud Server (version: 2.2.0)

## System
R version 4.0.2 (2020-06-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.1 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=C.UTF-8    LC_NUMERIC=C        LC_TIME=C.UTF-8    
 [4] LC_COLLATE=C.UTF-8  LC_MONETARY=C.UTF-8 LC_MESSAGES=C      
 [7] LC_PAPER=C          LC_NAME=C           LC_ADDRESS=C       
[10] LC_TELEPHONE=C      LC_MEASUREMENT=C    LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] opencpu_2.2.0

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.4.6      lattice_0.20-41   codetools_0.2-16  mime_0.9         
 [5] grid_4.0.2        jsonlite_1.7.0    unix_1.5.1        evaluate_0.14    
 [9] curl_4.3          data.table_1.13.1 raster_3.3-13     sp_1.4-2         
[13] webutils_1.1      tools_4.0.2       parallel_4.0.2    compiler_4.0.2   
[17] askpass_1.1       openssl_1.4.2    

## Configuration
[1] "/usr/lib/opencpu/library/opencpu/config/defaults.conf"
[2] "/etc/opencpu/server.conf"                             

## Libraries
[1] "/usr/local/lib/opencpu/site-library" "/usr/local/lib/R/site-library"      
[3] "/usr/lib/R/site-library"             "/usr/lib/R/library"                 
[5] "/usr/lib/opencpu/library"           

## Apps
[1] "/usr/local/lib/opencpu/apps"

## Limits
$cur
        as       core        cpu       data      fsize    memlock     nofile 
4000000000          0         63        Inf 1000000000      65536       8192 
     nproc      stack 
       500    8388608 

$max
        as       core        cpu       data      fsize    memlock     nofile 
4000000000        Inf         63        Inf 1000000000      65536       8192 
     nproc      stack 
       500        Inf 


## Apparmor
      compiled        enabled            con           mode 
        "TRUE"         "TRUE" "opencpu-exec"      "enforce" 

mbacou changed the title Ubuntu 20.04 "child process" → Ubuntu 20.04 "child process failed" Aug 6, 2020
mbacou changed the title Ubuntu 20.04 "child process failed" → Ubuntu 20.04 "child process has died" Aug 6, 2020

jeroen commented Aug 6, 2020

Are there any messages in your /var/log/kern.log or other system log files that hint why the process died?


mbacou commented Aug 6, 2020

Sorry, no. I have a long list of AppArmor warnings as below, but failed OpenCPU requests don't seem to log any new entries to /var/log/kern.log.

kernel: [ 6507.054355] audit: type=1400 audit(1596722896.561:118): apparmor="STATUS" operation="profile_replace\
" info="same as current profile, skipping" profile="unconfined" name="/usr/lib/NetworkManager/nm-dhcp-helper" pid=29330 comm="apparmor_parser"
kernel: [ 6507.054342] audit: type=1400 audit(1596722896.557:115): apparmor="STATUS" operation="profile_replace\
" info="same as current profile, skipping" profile="unconfined" name="opencpu-exec" pid=29328 comm="apparmor_parser"
kernel: [ 6507.054347] audit: type=1400 audit(1596722896.561:116): apparmor="STATUS" operation="profile_replace\
" info="same as current profile, skipping" profile="unconfined" name="opencpu-main" pid=29334 comm="apparmor_parser"
kernel: [ 6507.055016] audit: type=1400 audit(1596722896.561:121): apparmor="STATUS" operation="profile_replace\
" info="same as current profile, skipping" profile="unconfined" name="/usr/lib/snapd/snap-confine" pid=29333 comm="apparmor_parser" 


mbacou commented Aug 6, 2020

Our models do spawn additional child processes with mclapply(); I wonder if I should turn that off while debugging?


mbacou commented Aug 6, 2020

The only logs I see are in /var/log/apache2/access.log, e.g.

[06/Aug/2020:15:08:04 +0000] "POST /ocpu/library/wcapi/R/monitor/json HTTP/1.1" 400 5022 "-" "curl/7.68.0"                   
[06/Aug/2020:15:11:36 +0000] "POST /ocpu/library/wcapi/R/monitor/json HTTP/1.1" 201 1855033 "-" "curl/7.68.0"


jeroen commented Aug 6, 2020

Yeah, maybe try without mclapply. Also, in your /etc/opencpu/server.conf try raising rlimit.nproc.


mbacou commented Aug 6, 2020

No change when turning off mclapply. Tried raising the as and nproc limits as below, but no difference either.
But we get a consistent child process has died error when requesting over ~1100 file reads in one model run (regardless of which files). The same request in an interactive R session takes under 1 sec.

    "rlimit.as": 14e9,                                                                                               
    "rlimit.fsize": 1e9,                                                                                            
    "rlimit.nproc": 1000,                                                                                           
    "timelimit.get": 60,                                                                                            
    "timelimit.post": 180,                                                                                          
    "timelimit.webhook": 900,                                                                                       
    "preload": ["data.table","raster"]  


jeroen commented Aug 6, 2020

Hmm so it only happens if you open lots of files? And there is nothing in /var/log/apache2/error.log or /var/log/opencpu/error.log ?


mbacou commented Aug 6, 2020

Yes that's correct )-:

Only entries in /var/log/apache2/error.log:

Using locale: C.UTF-8                                                                                  
AppArmor available! Running OpenCPU with security profile and rlimits.                                 
Loading config from /usr/lib/opencpu/library/opencpu/config/defaults.conf                              
Loading config from /etc/opencpu/server.conf                                                           
OpenCPU cloud server ready.  

and my /var/log/opencpu/error.log is empty.

All I can say is that we have updated package versions between our 16.04 and 20.04 configs (data.table 1.12.9 vs. 1.13.1, raster 3.3.7 vs. 3.3.17). These are the main deps.

I have whitelisted all required local system paths in /etc/apparmor.d/opencpu.d/custom.

I've been randomly accessing between 500 and 1100 files, and so far get consistent failures over ~1000. This is on an 8-core / 32 GB VM.


mbacou commented Aug 6, 2020

In addition to rlimit.*, is there a limit on the number/size of /tmp files OpenCPU processes can generate?


jeroen commented Aug 6, 2020

In addition to rlimit.*, is there a limit on the number/size of /tmp files OpenCPU processes can generate?

Not that I am aware of, but Ubuntu switched to systemd in 18.04, so perhaps there are other changes somewhere...


mbacou commented Aug 7, 2020

Got a few extra lines in /var/log/kern.log after switching to mclapply(mc.silent=FALSE, ...). Might be informative (?)

kernel: [44069.123439] audit: type=1400 audit(1596760456.037:190): apparmor="DENIED" 
operation="open" profile="opencpu-exec" name="/sys/fs/cgroup/memory/memory.limit_in_bytes" 
pid=42507 comm="apache2" requested_mask="r" denied_mask="r" fsuid=33 ouid=0  


jeroen commented Aug 7, 2020

Hmmm, interesting. Try allowing that read in the AppArmor profile:

/sys/fs/cgroup/** r,


mbacou commented Aug 7, 2020

No go. I'm trying to downgrade R package versions to our prior config, just in case... Also, there's a fair amount of randomness in the error (yesterday I could pull ~1000 files, today it seems to fail at ~800). Puzzling. I'm excluding hardware issues since everything runs as usual outside of OpenCPU.


mbacou commented Aug 7, 2020

OK, finally fixed this by removing all calls to mclapply()! Not sure what's causing an issue with ocpu, but that's progress.
Adding MC_CORES=1 to /etc/opencpu/Renviron takes care of the problem for now.
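
For anyone searching later, the workaround is a single line (the path is the OpenCPU Renviron file mentioned above):

```
# /etc/opencpu/Renviron: force mclapply() and friends down to one worker
MC_CORES=1
```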


jeroen commented Aug 7, 2020

That's very strange. You're not using rJava, are you?


mbacou commented Aug 7, 2020

Not using rJava. Here are our local package's deps:

import(data.table)
import(fitdistrplus)
import(httr)
import(parallel)
importFrom(MASS,fitdistr)
importFrom(RPostgres,Postgres)
importFrom(RPostgres,dbConnect)
importFrom(cli,cat_rule)
importFrom(fitdistrplus,fitdist)
importFrom(ggplot2,element_rect)
importFrom(ggplot2,element_text)
importFrom(ggplot2,margin)
importFrom(ggplot2,theme)
importFrom(ggplot2,theme_minimal)
importFrom(gtools,even)
importFrom(gtools,odd)
importFrom(lubridate,day)
importFrom(lubridate,floor_date)
importFrom(lubridate,leap_year)
importFrom(lubridate,years)
importFrom(raster,"crs<-")
importFrom(raster,crs)
importFrom(raster,extract)
importFrom(raster,raster)
importFrom(raster,rasterOptions)
importFrom(raster,removeTmpFiles)
importFrom(raster,stack)
importFrom(raster,writeRaster)
importFrom(sp,coordinates)
importFrom(tools,file_ext)
importFrom(tools,file_path_sans_ext)
importFrom(utils,untar)
importFrom(utils,unzip)

I only have mclapply kick in over 365*3 file reads, so the error pattern at around ~1000 files sort of makes sense.

Here is a snippet of that call (calling raster::stack() and raster::extract()):

  mc.cores = if(mc.cores > 1L && length(y) > 3) mc.cores else 1L

  if(verbose) message("Extracting ", nrow(catalog), " ", toupper(code), " rasters over ",
    mc.cores, " cores. Please stand by...\n")

  tmp = mclapply(y, mc.cores=mc.cores, mc.silent=FALSE, mc.allow.recursive=FALSE, function(x) {
    r = stack(as.list(catalog[year(date)==x, file]), quick=TRUE, native=native)
    names(r) = catalog[year(date)==x, layer]
    res = extract(r, data.grp)
    return(res)
  })
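
(A sketch worth adding here, not from the original code: when a forked worker dies, mclapply() does not raise an error in the parent but returns a "try-error" object for that element, so silent worker deaths can be surfaced explicitly.)

```r
# Sketch: detect workers that died silently. With mc.cores > 1, mclapply()
# returns a "try-error" element (plus a warning) for any child that failed,
# instead of raising an error in the parent process.
library(parallel)

res <- mclapply(1:4, function(i) i^2, mc.cores = 2)
bad <- vapply(res, inherits, logical(1), what = "try-error")
if (any(bad)) stop(sum(bad), " worker(s) died or errored")
```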


jeroen commented Aug 8, 2020

Can you try raising rlimit.nproc even higher? Perhaps there is a bug in R that it isn't properly closing the parallel procs.


mbacou commented Aug 8, 2020

Jeroen, yes indeed: I had to raise "rlimit.nproc": 5000, but it's working now without MC_CORES=1.
Can't tell if it's a problem with raster::extract or an R bug (?)

Correction: it worked once, then failed consistently. I rebooted the VM; it worked again one time and failed thereafter. So I'm reverting to MC_CORES=1!


jeroen commented Aug 8, 2020

Are you using mccollect() to clean up the finished parallel jobs?


mbacou commented Aug 8, 2020

I don't think I can use mccollect with mclapply, but I added mc.cleanup=TRUE:

  mc.cores = if(mc.cores > 1 && length(y) > 3) mc.cores else 1L

  tmp = mclapply(y,
    mc.cores=mc.cores, mc.silent=FALSE, mc.allow.recursive=FALSE, mc.cleanup=TRUE,
    function(x) {
      #rasterOptions(timer=FALSE, chunksize=2e+08, datatype=switch(code, rfe="INT1U", "FLT4S"))
      r = stack(as.list(catalog[year(date)==x, file]), quick=TRUE, native=native)
      names(r) = catalog[year(date)==x, layer]
      res = extract(r, data.grp)
      return(res)
    })

I'm getting really inconsistent behavior now; even with MC_CORES=1, requests will fail.

Also how can I check ocpu's environment variables with curl, just to be sure?


jeroen commented Aug 8, 2020

Also how can I check ocpu's environment variables with curl, just to be sure?

curl https://cloud.opencpu.org/ocpu/library/base/R/Sys.getenv/print -X POST


mbacou commented Aug 8, 2020

OK thx. So yes, I can confirm: with the code above and MC_CORES=1, requests work consistently. Switching to e.g. MC_CORES=3, they all fail.


mbacou commented Aug 8, 2020

There might be recent changes in raster::extract() that make it spawn parallel processes by default now, even with point queries. Will need to check with @rhijmans.


jeroen commented Aug 8, 2020

OK, if you have time (not urgent), it would be helpful if you could create a small example for me to reproduce this. Maybe it has nothing to do with raster, and is just a problem with files not getting closed in concurrent processes?


mbacou commented Aug 8, 2020

Yeah, I thought about that as well: slow I/O with attached SSD drives since we moved from Azure to AWS. Will work on a reprex, all right. Thx!

But that wouldn't explain why it fails through OpenCPU only.

mbacou changed the title Ubuntu 20.04 "child process has died" → Ubuntu 20.04 mclapply "child process has died" Aug 8, 2020

mbacou commented Aug 8, 2020

Jeroen, here is a reprex: https://gist.github.com/mbacou/41c2fa42fa36cef2a19d9291432b8560

It reproduces the error on my Ubuntu 20.04 VM.

Use step1(dir="some location accessible to ocpu") to download & deflate a bunch of raster files (you only need to call that once).
Use step2(dir="some location accessible to ocpu", mc.cores=3) to reproduce the error in OpenCPU.

Let me know. Thx!


baohongz commented Sep 7, 2020

I ran into the same error message when calling http://104.131.40.122/ocpu/library/V8/R/. It was then fixed by turning off SELinux: https://linuxize.com/post/how-to-disable-selinux-on-centos-8/

OpenCPU error HTTP 400
child process has died

In call:
tryCatch({
    if (length(priority)) 
        setpriority(priority)
    if (length(rlimits)) 
        set_rlimits(rlimits)
    if (length(gid)) 
        setgid(gid)
    if (length(uid)) 
        setuid(uid)
    if (length(profile)) 
        aa_change_profile(profile)
    if (length(device)) 
        options(device = device)
    graphics.off()
    options(menu.graphics = FALSE)
    serialize(withVisible(eval(orig_expr, parent.frame())), NULL)
}, error = function(e) {
    old_class <- attr(e, "class")
    structure(e, class = c(old_class, "eval_fork_error"))
}, finally = substitute(graphics.off()))
CentOS Linux release 8.2.2004 (Core)
R version 4.0.2 (2020-06-22)
opencpu_2.2.0
V8_3.2.0 


mbacou commented Nov 9, 2020

FYI, on our Ubuntu 20.04 system SELinux is already disabled. I also tried to disable AppArmor with OCPU_DISABLE_APPARMOR=1 in /etc/opencpu/Renviron, but that didn't help with the child process has died bug, so I'm still using MC_CORES=1 as a workaround for now.


hai-wang commented Jun 9, 2021

Is there any update on this? We are in the middle of migrating from Ubuntu 16.04 / OpenCPU 2.0.8.1 to Ubuntu 20.04 / OpenCPU 2.2.2. During our testing with the new version, we ran into the same problem frequently with the same code. I tried all the suggestions mentioned in this thread, including setting MC_CORES=1, but the problem keeps happening. And setting MC_CORES=1 defeats the purpose of using mclapply to begin with.

@jeroen
Copy link
Member

jeroen commented Jun 10, 2021

I'm sorry I have never been able to discover the problem. It sounds like an issue with the parallel package in R.

Could you try adding this line to your /etc/opencpu/Rprofile:

assign("setup_strategy", "sequential", parallel:::defaultClusterOptions)

And see if that makes any difference?

@hai-wang

I'm sorry I have never been able to discover the problem. It sounds like an issue with the parallel package in R.

Could you try adding this line to your /etc/opencpu/Rprofile:

assign("setup_strategy", "sequential", parallel:::defaultClusterOptions)

And see if that makes any difference?

Unfortunately, it didn't make any difference. I checked these log files: /var/log/kern.log, /var/log/apache2/error.log, /var/log/opencpu/error.log, and /var/log/opencpu/apache_error.log, and there is nothing logged.

I don't know if this means anything, but if I retry the failed call many times (as many as 15 times), the call will eventually succeed.

Questions:

  1. What "child process" is it referring to? Is this an Apache process? Or is this some other process spun off by R when using mclapply? Could this be a thread instead of a process?
  2. Are there any other logs that may have a clue about why the child process died?
  3. Is there a way to increase the amount of logging to understand why the child process died?

Let me know if there's anything I can do to help find more information and get to the bottom of this.


karawoo commented Jun 29, 2021

Chiming in here, as we've run into this issue too. With 1 core everything is fine, but when we parallelize across multiple cores we see child process has died intermittently. For us, repeating the failed call usually works right away, so the issue isn't very reproducible; I think usually no more than 10 requests out of every 1000 fail. We're running in Docker as well, in case that's relevant.

Ideally we'd prefer not to limit MC_CORES to 1, because we would like to keep the performance benefits of parallelizing. @jeroen, should it theoretically be possible to have R code use mclapply() over multiple cores, or is that going to conflict with the way opencpu evaluates requests in forked processes?
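
(A defensive sketch, hypothetical code rather than a fix for this bug: the worker count can be read from the same MC_CORES variable discussed above, so a deployment can dial parallelism down without code changes, and mc.allow.recursive=FALSE makes mclapply fall back to serial evaluation when it is itself running inside a forked child.)

```r
# Sketch: take the worker count from MC_CORES (as set in /etc/opencpu/Renviron),
# defaulting to 1 when unset or malformed, so forking can be disabled per
# deployment. mc.allow.recursive=FALSE avoids forking from within a fork.
library(parallel)

n <- suppressWarnings(as.integer(Sys.getenv("MC_CORES", "1")))
if (is.na(n) || n < 1L) n <- 1L
res <- mclapply(1:8, function(i) i * i,
                mc.cores = n, mc.allow.recursive = FALSE)
```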


jeroen commented Jun 29, 2021

It really shouldn't conflict, but I have not yet been able to figure out where this bug comes from. I suspect it is related to the changes made in R 4.0 wrt how R manages the worker subprocesses.

There were several bugs in that area which would appear only under certain circumstances; some of them were fixed only very recently, for R 4.2. I'll try to find time to get this sorted out soon.


karawoo commented Jun 29, 2021

Thanks @jeroen, I can give it a try on devel and report back on whether that changes things at all.


karawoo commented Jul 1, 2021

Tried using rocker/r-ver:devel as our base image (R 4.1.0), and I still see the error there.


karawoo commented Jul 12, 2021

Another update: I tried switching to future/future.apply (using a multicore plan) and had not seen any child process has died errors in my testing so far. Edit: nope, it happens with future too.


jeroen commented Aug 4, 2021

I still haven't been able to figure out the issue, but I have disabled a feature that tries to kill orphaned processes. Perhaps it is related.

I'll push out opencpu 2.2.4.1 with this change; can you check whether it changes anything at all?


pilare commented Aug 8, 2021

@jeroen thank you for that release. I had been struggling with this since version 2.1.5, and was able to run it only on our Linux boxes; Azure VMs and AKS containers were always crashing on one of our NGS pipeline scripts. So I was thinking either the containerization engine was to blame (as I could see differences in ocpu/info) or R version 4, but this release solved my issues. I had been watching this thread for some time, trying to pin down exactly which operation has the "process has died" problem, so I could submit a similar issue with a full report.

I have one more request: please build and push to Docker Hub an opencpu/rstudio image based on that 2.2.4.1 version. We use the rstudio images on our development machines for ease of debugging and development of our packages.


jeroen commented Aug 8, 2021

@pilare OK, I'll tag a Docker release opencpu/rstudio:2.2.4.1. I'll wait for others to see if this really fixes the problem.


karawoo commented Aug 11, 2021

So far in my testing I have not seen any child process has died errors with the new version. We haven't deployed this to production yet, so it's possible things will be different in the wild, but the update looks very promising.

I don't know if this is useful information, but I've noticed that one of the calls that sometimes produced child process has died now sometimes fails with timeout reached (30.000000 sec). It's a relatively long call, and I mainly see the timeout error when I'm testing several API calls at once. We can increase the timeout limit and/or try to speed up the code, so that's not an issue in itself, but maybe it gives some hint about the underlying cause?

Edited to note that I've tested this with our codebase now using future/future.apply, not mclapply. Before the new opencpu version we were still getting child process has died errors with future, but less frequently than with mclapply.
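
(For reference, that timeout is OpenCPU's per-request limit, which is configurable; a hypothetical /etc/opencpu/server.conf fragment, with the value as an example only:)

```
{
  "timelimit.post": 90
}
```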


jeroen commented Aug 18, 2021

I've pushed a new release, opencpu 2.2.5 (which includes this fix), to CRAN and all Docker images. I've also overhauled the Docker build infrastructure so that it will again be easier to publish new releases if needed.

Hopefully the error will disappear for everyone.
