API: change compressed output format for get_val to reduce serialisation cost #327

PietrH · 2024-10-16T09:13:46Z

Currently the get_val() helper supports fetching from OpenCPU as JSON or RDS.

In #323, Stijn found that we are crashing the session by running out of memory, possibly during serialisation. I believe base::saveRDS() -> base::serialize() might be the cause of this memory usage, assuming the crash happens while the object is written as RDS to the output stream. I've not been able to replicate this issue locally or on the RStudio Server.

There are a few open issues on opencpu for child processes that died:

I did a quick local test on a deployments table to see if outputting as feather or parquet might help:

Unit: milliseconds
           expr       min        lq      mean    median        uq       max neval
 rw_feather(df)  13.28443  14.40181  15.38457  15.07805  16.14836  22.61397   100
 rw_parquet(df)  38.05100  39.58587  41.90330  40.76735  42.62479  60.33460   100
     rw_rds(df) 263.91068 265.43494 269.96770 266.53872 270.30619 296.12905   100

It looks like both feather and parquet are faster than rds on my system. I have not benchmarked memory usage yet.

This is using lz4 compression for feather, snappy for parquet and gzip for rds.
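The benchmark script further down only defines rw_feather() and rw_rds(). As a rough sketch of what the two other round-trip helpers could look like with the compression settings named above (the function bodies and the name rw_rds_gzip are assumptions, not the code that produced the timings):

```r
# Hypothetical round-trip helpers; assumes the arrow package is installed
# and was built with snappy support.
rw_parquet <- function(df) {
  parquet_path <- tempfile(fileext = ".parquet")
  # write with snappy compression, then read straight back
  arrow::write_parquet(df, parquet_path, compression = "snappy")
  arrow::read_parquet(parquet_path)
}

rw_rds_gzip <- function(df) {
  rds_path <- tempfile(fileext = ".rds")
  # gzip is also what saveRDS() uses when compress = TRUE
  saveRDS(df, rds_path, compress = "gzip")
  readRDS(rds_path)
}
```

Both helpers return the object as read back from disk, so they can be passed to a benchmarking function in the same way as rw_feather().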

Stijn proposes using an alternative fetch method: returning the session id and writing out paged result objects to the session dir, then having the client fetch these objects and deserialise them on the client. This ties in to an existing paging branch, but Stijn mentioned this will probably require some optimisation on the database so we have a nice column to sort on.
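To make the paging idea concrete, here is a minimal base R sketch. It is not the paging branch: write_pages(), read_pages(), the file naming and the use of RDS per page are all assumptions for illustration, and in the real service the paging would more likely happen in the database query.

```r
# Hypothetical server side: sort on one column, split into pages of
# page_size rows and write each page to its own file in the session dir.
write_pages <- function(df, session_dir, page_size, sort_col) {
  df <- df[order(df[[sort_col]]), , drop = FALSE]
  n_pages <- ceiling(nrow(df) / page_size)
  # page number for every row: 1, 1, ..., 2, 2, ...
  page_id <- rep(seq_len(n_pages), each = page_size, length.out = nrow(df))
  pages <- split(df, page_id)
  paths <- file.path(session_dir, sprintf("page_%04d.rds", seq_along(pages)))
  for (i in seq_along(pages)) {
    saveRDS(pages[[i]], paths[i])
  }
  paths
}

# Hypothetical client side: read the pages one by one and bind them.
read_pages <- function(paths) {
  do.call(rbind, lapply(paths, readRDS))
}
```

Only one page has to be serialised at a time, which is the point of the proposal; the per-page format could be swapped for feather without changing the structure.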

To benchmark:

# compare memory usage and speed of different ways of storing/fetching detections


# read a detections table -------------------------------------------------

# stored result object
df <- readRDS("~/Downloads/albertkanaal.rds")
### or you could create the object via a query: ###
# df <-
#   get_acoustic_detections(animal_project_code = "2013_albertkanaal",
#                           api = FALSE) # because it doesn't work via the API, that's what we are trying to fix

# subset
df_sample <- dplyr::slice_sample(df, prop = 0.1)

# functions for feather and rds extract and load --------------------------


rw_feather <- function(df){
  feather_path <- tempfile()
  arrow::write_feather(df, feather_path, compression = "lz4")
  arrow::read_feather(feather_path)
}

rw_rds <- function(df){
  rds_path <- tempfile()
  saveRDS(df, rds_path, compress = FALSE)
  readRDS(rds_path)
}

# benchmark ---------------------------------------------------------------

bench_result <-
  bench::mark(
    rw_feather(df_sample),
    rw_rds(df_sample),
    memory = TRUE,
    filter_gc = FALSE,
    iterations = 3
  )

Blockers / Action Points

- Install arrow on Lifewatch RStudio
  - Update R to > 4.0
  - Include a C++17 compiler on Lifewatch RStudio -> gcc > 7, currently 5.4.0
- Implement query paging on etnservice

Optional:

- Implement chunked writing to file on etnservice
- Switch to using File API instead of Object API for fetching file to client

PietrH · 2024-10-16T09:57:58Z

OpenCPU supports outputting as feather and parquet, and you can pass arguments to their respective writing functions via the url:

https://github.com/opencpu/opencpu/blob/80ea353c14c8601f51ed519744149411d9cc3309/NEWS#L20-L23
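A feather counterpart to the existing rds fetch could look roughly like the sketch below. The names val_url() and get_val_feather() are hypothetical, and the URL layout is assumed to mirror the tmp/{temp_key}/R/.val/rds path used for rds, with only the format suffix changed.

```r
# Hypothetical helper: build the URL of a session's return value in a
# given output format.
val_url <- function(temp_key, format, api_domain = "https://opencpu.lifewatch.be") {
  paste(api_domain, "tmp", temp_key, "R", ".val", format, sep = "/")
}

# Hypothetical fetch function; assumes httr and arrow are installed.
get_val_feather <- function(temp_key, api_domain = "https://opencpu.lifewatch.be") {
  response <- httr::RETRY(
    verb = "GET",
    url = val_url(temp_key, "feather", api_domain),
    times = 5
  )
  # fail early on HTTP errors instead of trying to parse an error page
  httr::stop_for_status(response)
  # read_feather() accepts a raw vector, so no temporary file is needed
  arrow::read_feather(httr::content(response, as = "raw"))
}
```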

PietrH · 2024-10-16T09:59:36Z

The albertkanaal request takes just under 4GB to run on its own:

expression         min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result   memory    
  <bch:expr>       <bch> <bch:>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>   <list>    
1 "albertkanaal <… 53.3s  53.3s    0.0188    3.62GB    0.131     1     7      53.3s <tibble> <Rprofmem>

This is without writing it to rds.

PietrH · 2024-10-16T11:24:52Z

Feather uses less memory and is faster, both for reading and writing. But in OpenCPU we can only use it for tabular data (because it passes via arrow::as_arrow_table()).
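Because of that restriction, a client would need a fallback for non-tabular return values. A minimal sketch, where choose_output_format() is a hypothetical helper and is.data.frame() is only an approximation of what arrow::as_arrow_table() accepts:

```r
# Hypothetical helper: use feather for tabular results, rds for anything else.
choose_output_format <- function(x) {
  if (is.data.frame(x)) "feather" else "rds"
}
```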

However, get_acoustic_detections() fails on the POST request, not the GET request. So while this does save memory and speeds things up, it doesn't solve our problem.

An alternative would be writing to local files on the OpenCPU server, fetching the temp key, somehow listing these files, and then fetching them via GET requests over the OpenCPU files API (not documented). Example request: curl https://cloud.opencpu.org/ocpu/tmp/x05b85461/files/ch01.pdf

Realistically, once we have a URL, I could read it directly with arrow::read_feather(). However, I'm still not sure where exactly OpenCPU is failing; maybe before it can even store the object in memory?
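A sketch of the client side of that file-based route, following the URL pattern of the curl example above. session_file_url() and fetch_session_feather() are hypothetical names, and the download step goes through a temporary file rather than relying on read_feather() handling an https URL itself:

```r
# Hypothetical helper: URL of a file written to an OpenCPU session dir.
session_file_url <- function(temp_key, file_name,
                             api_root = "https://cloud.opencpu.org/ocpu") {
  paste(api_root, "tmp", temp_key, "files", file_name, sep = "/")
}

# Hypothetical fetch: download the file, then read it with arrow.
# Assumes the arrow package is installed and the file is in feather format.
fetch_session_feather <- function(temp_key, file_name, ...) {
  local_path <- tempfile(fileext = ".feather")
  # mode = "wb" keeps the binary file intact on Windows
  utils::download.file(session_file_url(temp_key, file_name, ...),
                       local_path, mode = "wb", quiet = TRUE)
  arrow::read_feather(local_path)
}
```

session_file_url() is vectorised over file_name, so a listing of page files could be turned into URLs in one call.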

PietrH · 2024-10-17T08:07:13Z

Benchmarking feather retrieval vs rds:

feather is slightly faster and uses far less memory, both on the client and on the service. It does introduce an extra dependency.

It's probably a good idea to repeat the test with a truly big dataset.

2014_demer

feather

$ expression <bch:expr> <get_acoustic_detections(animal_…
$ min        <bch:tm> 8.89s
$ median     <bch:tm> 9.3s
$ `itr/sec`  <dbl> 0.1075611
$ mem_alloc  <bch:byt> 31.3MB
$ `gc/sec`   <dbl> 0.1398294
$ n_itr      <int> 10
$ n_gc       <dbl> 13
$ total_time <bch:tm> 1.55m
$ result     <list> [<tbl_df[236920 x 20]>]
$ memory     <list> [<Rprofmem[123 x 3]>]
$ time       <list> <8.89s, 9.02s, 9.67s, 9.34s, 9.22s…
$ gc         <list> [<tbl_df[10 x 3]>]

rds

$ expression <bch:expr> <get_acoustic_detections(animal_…
$ min        <bch:tm> 9.33s
$ median     <bch:tm> 10.5s
$ `itr/sec`  <dbl> 0.09537067
$ mem_alloc  <bch:byt> 138MB
$ `gc/sec`   <dbl> 0.1525931
$ n_itr      <int> 10
$ n_gc       <dbl> 16
$ total_time <bch:tm> 1.75m
$ result     <list> [<tbl_df[236920 x 20]>]
$ memory     <list> [<Rprofmem[1834 x 3]>]
$ time       <list> <10.27s, 11.19s, 10.92s, 10.52s, 1…
$ gc         <list> [<tbl_df[10 x 3]>]

PietrH · 2024-10-17T09:03:38Z

2014_demer, 2010_phd_reubens_sync, 2012_leopoldkanaal

feather

$ expression <bch:expr> <get_acoustic_detections(animal…
$ min        <bch:tm> 1.18m
$ median     <bch:tm> 1.2m
$ `itr/sec`  <dbl> 0.01372749
$ mem_alloc  <bch:byt> 485MB
$ `gc/sec`   <dbl> 0.04118246
$ n_itr      <int> 10
$ n_gc       <dbl> 30
$ total_time <bch:tm> 12.1m
$ result     <list> [<tbl_df[3500649 x 20]>]
$ memory     <list> [<Rprofmem[2601 x 3]>]
$ time       <list> <1.27m, 1.25m, 1.24m, 1.22m, 1.21…
$ gc         <list> [<tbl_df[10 x 3]>]

PietrH · 2024-10-17T09:10:44Z

Had a child process die while testing the rds version of the query above:

# Requires httr, glue and the magrittr pipe (%>%), e.g. via library(magrittr)
get_val_rds <- function(temp_key, api_domain = "https://opencpu.lifewatch.be") {
  # request data and open connection
  response_connection <- httr::RETRY(
    verb = "GET",
    url = glue::glue(
      "{api_domain}",
      "tmp/{temp_key}/R/.val/rds",
      .sep = "/"
    ),
    times = 5
  ) %>%
    httr::content(as = "raw") %>%
    rawConnection()
  # read connection
  api_response <- response_connection %>%
    gzcon() %>%
    readRDS()
  # close connection
  close(response_connection)
  # Return OpenCPU return object
  return(api_response)
}

error:

Error: child process has died

In call:
tryCatch({
    if (length(priority)) 
        setpriority(priority)
    if (length(rlimits)) 
        set_rlimits(rlimits)
    if (length(gid)) 
        setgid(gid)
    if (length(uid)) 
        setuid(uid)
    if (length(profile)) 
        aa_change_profile(profile)
    if (length(device)) 
        options(device = device)
    graphics.off()
    options(menu.graphics = FALSE)
    serialize(withVisible(eval(orig_expr, parent.frame())), NULL)
}, error = function(e) {
    old_class <- attr(e, "class")
    structure(e, class = c(old_class, "eval_fork_error"))
}, finally = substitute(graphics.off()))

I was able to get it to run after restarting, but there does seem to be instability. I can't exclude that this instability also exists in the feather version; it might just be a coincidence that it happened during rds testing.

PietrH · 2024-10-17T10:55:47Z

Tried Google Protocol Buffers (protobuf) as implemented in {protolite}: 45117ab

As explained in the arrow FAQ: https://arrow.apache.org/faq/#how-does-arrow-relate-to-protobuf, protobuf is less suitable for large file transfers. I also noticed it filling my swap file and crashing on large dataset transfers.

I feel this avenue is not worth it. I'll stick to Apache Arrow.

PietrH added the API label Oct 16, 2024

PietrH self-assigned this Oct 16, 2024

PietrH added this to the v2.3.1 milestone Oct 16, 2024

PietrH linked a pull request Oct 17, 2024 that will close this issue

Use Apache Feather for API object transfer instead of RDS #328

Open

PietrH mentioned this issue Oct 17, 2024

Apache Arrow on Lifewatch RStudio Server: upgrade R version #329

Open

PietrH mentioned this issue Oct 31, 2024

v2.3 beta release: access ETN data from your local computer! #318

Open
