Error on modifying by reference with data.table::set() in the context of future.apply::future_apply() or furrr::future_map() #5376

Open
ramiromagno opened this issue May 4, 2022 · 8 comments

@ramiromagno

Hi,

First of all, let me thank you for developing the amazing {data.table} package.

I have a list of data tables that I am trying to modify by reference with data.table::set() inside a loop, using future.apply::future_lapply() and furrr::future_walk()/furrr::future_map().

However, I get an error when using future.apply::future_lapply() or furrr::future_walk()/furrr::future_map(). It works fine with lapply(), though.

I am not sure the problem lies with the {data.table} package itself. I will post this same issue in the {furrr} and {future.apply} issue trackers and link it here.

The error is:

Error in data.table::set(snp_pairs, i = i, j = col, value = df[[col]]) : 
  This data.table has either been loaded from disk (e.g. using readRDS()/load()) or constructed manually (e.g. using structure()). Please run setDT() or setalloccol() on it first (to pre-allocate space for new columns) before assigning by reference to it.

You will need to install {daeqtlr} first:

remotes::install_github("maialab/daeqtlr")
library(future.apply)
library(furrr)
# For now install from https://github.com/maialab/daeqtlr
library(daeqtlr)

plan(multisession)

snp_pairs <- read_snp_pairs(file = daeqtlr_example("snp_pairs.csv"))
zygosity <- read_snp_zygosity(file = daeqtlr_example("zygosity.csv"))
ae <- read_ae_ratios(file = daeqtlr_example("ae.csv"))

no_cores <- 6L
indices <- seq_len(nrow(snp_pairs))
partitioning_factor <- sort((indices)%%no_cores) + 1
snp_pairs_lst1 <- split(snp_pairs, partitioning_factor)
snp_pairs_lst2 <- split(snp_pairs, partitioning_factor)
snp_pairs_lst3 <- split(snp_pairs, partitioning_factor)

for (i in seq_along(snp_pairs_lst1)) {
  data.table::setkeyv(snp_pairs_lst1[[i]], 'dae_snp')
  data.table::setkeyv(snp_pairs_lst2[[i]], 'dae_snp')
  data.table::setkeyv(snp_pairs_lst3[[i]], 'dae_snp')
}

# Runs fine without errors.
lapply(snp_pairs_lst1,
       FUN = daeqtl_mapping,
       zygosity = zygosity,
       ae = ae)

# Fails with error:
# 
# Error in data.table::set(snp_pairs, i = i, j = col, value =
# df[[col]]) : This data.table has either been loaded from disk (e.g. using
# readRDS()/load()) or constructed manually (e.g. using structure()). Please run
# setDT() or setalloccol() on it first (to pre-allocate space for new columns)
# before assigning by reference to it.
future_lapply(snp_pairs_lst2,
              FUN = daeqtl_mapping,
              zygosity = zygosity,
              ae = ae)

# Fails with the same error as `future_lapply`
# It won't work with `future_map` either.
future_walk(snp_pairs_lst3,
            .f = daeqtl_mapping,
            zygosity = zygosity,
            ae = ae)


@ramiromagno
Author

ramiromagno commented May 4, 2022

After fiddling around, it seems that including

  n <- nrow(snp_pairs)
  # `setalloccol` is needed because of `future.apply::future_lapply()`,
  # otherwise https://github.com/Rdatatable/data.table/issues/5376.
  data.table::setalloccol(snp_pairs, extra_cols*n)

inside the source code of the mapped function, daeqtl_mapping(), makes the future_lapply() call run without errors. However, it does not change the data tables in snp_pairs_lst2 in place, as lapply() does with snp_pairs_lst1.

@ben-schwen
Member

ben-schwen commented May 4, 2022

I have no idea how the internals of future.apply work, but for parallel computing you basically have to copy the objects you want to modify to your worker nodes.
This would at least explain the

This data.table has either been loaded from disk (e.g. using readRDS()/load()) or constructed manually (e.g. using structure()).

Depending on how serialization works in future.apply, there might be a way to provide custom serialization/deserialization, although I'm not sure that's really something future.apply wants to achieve.

It is also clear why the in-place change does not work after fixing the setalloccol problem: you are modifying the data.table on your worker nodes, so the results have to be written back to the main session at some point.
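
One way to "write them back" is to have the mapped function return the modified table and reassign the result in the main session. This is only a minimal sketch, assuming {data.table} and {future.apply} are installed; the toy tables and the add_double() helper are hypothetical and are not part of {daeqtlr}.

```r
library(data.table)
library(future.apply)

plan(multisession, workers = 2)

# Toy stand-in for the list of data tables (hypothetical data).
dts <- list(a = data.table(x = 1:3), b = data.table(x = 4:6))

# Hypothetical worker function: it modifies its own copy and returns it.
add_double <- function(dt) {
  data.table::setalloccol(dt)  # re-allocate column slots on the worker's copy
  data.table::set(dt, j = "y", value = dt[["x"]] * 2L)
  dt                           # return the modified copy to the main session
}

res <- future_lapply(dts, add_double)

names(dts$a)  # the originals in the main session are untouched
dts <- res    # write the results back explicitly

plan(sequential)  # shut down the background workers
```

The explicit reassignment replaces the by-reference side effect that lapply() provides in a single R session.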

@ben-schwen
Member

Also related to #5269 which caters for the call to setalloccol.

@ramiromagno
Author

I realize now that, without a call to setalloccol(), truelength(x) returns 0 inside the mapped function. Adding a call to setalloccol() there allocates the extra column slots that set() needs to work without problems.
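
The truelength() observation can be reproduced without any parallel backend. The sketch below assumes that a serialize()/unserialize() round trip is a fair stand-in for what a worker process receives; the toy table and its column names are made up for illustration.

```r
library(data.table)

dt <- data.table(x = 1:3)
tl_before <- truelength(dt)  # a fresh data.table has over-allocated column slots

# Assumption: this round trip mimics shipping the object to a worker.
dt2 <- unserialize(serialize(dt, NULL))
tl_after <- truelength(dt2)  # reported as 0 in this thread

# Re-allocate column slots, then add a column by reference.
setalloccol(dt2)
set(dt2, j = "y", value = 4:6)
names(dt2)
```

Comparing tl_before with tl_after shows whether the over-allocation survived the round trip in the installed {data.table} version.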

@jangorecki
Member

If future.apply requires a copy of the data from your session, then in-place modification will naturally not be possible. Unless you can pass a reference to an object, I don't think there is a workaround for it. See related issues #3104 and #1336.

@HenrikBengtsson

Author of futureverse here: FWIW, any type of parallel backend can be used with futures, e.g. forked parallelization via the mclapply() framework, background R processes via a PSOCK cluster, background R processes via the callr package, etc. So, it's parallelization business as usual. This also means that one cannot make assumptions about running with shared memory or about what type of serialization is used.

It sounds like the problem here is related to the general problem of serializing a data.table object and re-using it in another R process (concurrently or later in time).

@iagogv3
Contributor

iagogv3 commented Dec 5, 2022

Could this issue be related to the fact that updating a data.table by reference using := inside a foreach loop does not seem to work?

@HenrikBengtsson

Yes, it is the same problem if you run foreach in parallel. You can update a data.table in a parallel worker, but you cannot expect the update to be reflected in the main R session.
