From 762fad5e5d1499b20db81a75cbc448c1ef6fca03 Mon Sep 17 00:00:00 2001
From: Nic Crane
Date: Mon, 3 Jan 2022 18:34:37 +0000
Subject: [PATCH] ARROW-14653: [R] head() hangs on CSV datasets > 600MB

This PR switches to using the asynchronous scanner by default when reading
in datasets. I've tested it locally on a large dataset (2.5 GB of CSV files)
and it does resolve the original issue, but due to the size of the files
involved I wasn't sure this was something I could easily write tests for.

Closes #11992 from thisisnic/ARROW-14653_head_hangs

Authored-by: Nic Crane
Signed-off-by: Nic Crane
---
 r/NEWS.md          | 1 +
 r/R/dataset-scan.R | 4 ++--
 r/man/Scanner.Rd   | 2 +-
 3 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/r/NEWS.md b/r/NEWS.md
index 90a9bba79d0b9..89e990ca2e246 100644
--- a/r/NEWS.md
+++ b/r/NEWS.md
@@ -27,6 +27,7 @@
 * Added `decimal128()` (identical to `decimal()`) as the name is more explicit and updated docs to encourage its use.
 * Source builds now by default use `pkg-config` to search for system dependencies (such as `libz`) and link to them if present. To retain the previous behaviour of downloading and building all dependencies, set `ARROW_DEPENDENCY_SOURCE=BUNDLED`.
+* Opening a dataset now uses the async scanner by default, which resolves a deadlock issue when reading in large multi-file CSV datasets

 # arrow 6.0.1

diff --git a/r/R/dataset-scan.R b/r/R/dataset-scan.R
index 03c926fb43793..b7f58bfa4bd0b 100644
--- a/r/R/dataset-scan.R
+++ b/r/R/dataset-scan.R
@@ -34,7 +34,7 @@
 #'   to keep all rows.
 #' * `use_threads`: logical: should scanning use multithreading? Default `TRUE`
 #' * `use_async`: logical: should the async scanner (performs better on
-#'   high-latency/highly parallel filesystems like S3) be used? Default `FALSE`
+#'   high-latency/highly parallel filesystems like S3) be used? Default `TRUE`
 #' * `...`: Additional arguments, currently ignored
 #' @section Methods:
 #' `ScannerBuilder` has the following methods:
@@ -73,7 +73,7 @@ Scanner$create <- function(dataset,
                            projection = NULL,
                            filter = TRUE,
                            use_threads = option_use_threads(),
-                           use_async = getOption("arrow.use_async", FALSE),
+                           use_async = getOption("arrow.use_async", TRUE),
                            batch_size = NULL,
                            fragment_scan_options = NULL,
                            ...) {
diff --git a/r/man/Scanner.Rd b/r/man/Scanner.Rd
index db6488f50127e..45184e321136a 100644
--- a/r/man/Scanner.Rd
+++ b/r/man/Scanner.Rd
@@ -22,7 +22,7 @@
 named list of expressions to keep all rows.
 \item \code{use_threads}: logical: should scanning use multithreading? Default \code{TRUE}
 \item \code{use_async}: logical: should the async scanner (performs better on
-high-latency/highly parallel filesystems like S3) be used? Default \code{FALSE}
+high-latency/highly parallel filesystems like S3) be used? Default \code{TRUE}
 \item \code{...}: Additional arguments, currently ignored
 }
}
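
The user-facing effect of the change above can be sketched as follows. This is a minimal illustration, not part of the patch: the dataset path is hypothetical, and it assumes the `arrow` R package at the patched version is installed. The `arrow.use_async` option and the `use_async` argument are the two override points shown in the diff.

```r
library(arrow)

# Open a multi-file CSV dataset; with this patch, scans (including the
# implicit scan behind head()) use the asynchronous scanner by default,
# so head() no longer deadlocks on large multi-file CSV datasets.
ds <- open_dataset("path/to/csv_dir", format = "csv")  # hypothetical path
head(ds)

# The previous synchronous behaviour can still be restored globally,
# since the default is now getOption("arrow.use_async", TRUE):
options(arrow.use_async = FALSE)

# ...or per scanner, via the use_async argument of Scanner$create():
scanner <- Scanner$create(ds, use_async = FALSE)
```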