ARROW-14653: [R] head() hangs on CSV datasets > 600MB

This PR switches to using the asynchronous scanner by default when reading in datasets. I've tested it locally on a large dataset (2.5 GB of CSV files) and it does resolve the original issue, but because of the size of the files involved I wasn't sure I could easily write tests for it.

Closes #11992 from thisisnic/ARROW-14653_head_hangs

Authored-by: Nic Crane <thisisnic@gmail.com>
Signed-off-by: Nic Crane <thisisnic@gmail.com>
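
For readers who want to see how the new default interacts with the `arrow.use_async` option, here is a minimal base-R sketch. It only mirrors the `getOption("arrow.use_async", TRUE)` expression from the `Scanner$create` signature in this diff; `resolve_use_async()` is a hypothetical helper for illustration and is not part of arrow.

```r
# Hypothetical helper: mirrors the default expression used for the
# `use_async` argument of Scanner$create() after this change.
resolve_use_async <- function() {
  getOption("arrow.use_async", TRUE)
}

# Remember the current setting so it can be restored afterwards.
old <- options(arrow.use_async = NULL)  # NULL unsets the option

# With the option unset, the new fallback applies.
unset_value <- resolve_use_async()
print(unset_value)

# Opting out: an explicitly set option takes precedence over the fallback.
options(arrow.use_async = FALSE)
opt_out_value <- resolve_use_async()
print(opt_out_value)

# Restore whatever was set before running this sketch.
options(old)
```

Under that lookup, setting `options(arrow.use_async = FALSE)` before scanning, or passing `use_async = FALSE` to `Scanner$create()`, should restore the previous synchronous behaviour for this version of the package.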
apache · Jan 3, 2022 · 762fad5
1 parent cb1897e
commit 762fad5

Showing 3 changed files with 4 additions and 3 deletions.
diff --git a/r/NEWS.md b/r/NEWS.md
@@ -27,6 +27,7 @@
 * Added `decimal128()` (~~identical to `decimal()`~~) as the name is more explicit and updated docs to encourage its use. 
 * Source builds now by default use `pkg-config` to search for system dependencies (such as `libz`) and link to them 
 if present. To retain the previous behaviour of downloading and building all dependencies, set `ARROW_DEPENDENCY_SOURCE=BUNDLED`. 
+* Opening datasets now use async scanner by default which resolves a deadlock issues related to reading in large multi-CSV datasets
 
 # arrow 6.0.1
 

diff --git a/r/R/dataset-scan.R b/r/R/dataset-scan.R
@@ -34,7 +34,7 @@
 #'    to keep all rows.
 #' * `use_threads`: logical: should scanning use multithreading? Default `TRUE`
 #' * `use_async`: logical: should the async scanner (performs better on
-#'    high-latency/highly parallel filesystems like S3) be used? Default `FALSE`
+#'    high-latency/highly parallel filesystems like S3) be used? Default `TRUE`
 #' * `...`: Additional arguments, currently ignored
 #' @section Methods:
 #' `ScannerBuilder` has the following methods:
@@ -73,7 +73,7 @@ Scanner$create <- function(dataset,
                            projection = NULL,
                            filter = TRUE,
                            use_threads = option_use_threads(),
-                           use_async = getOption("arrow.use_async", FALSE),
+                           use_async = getOption("arrow.use_async", TRUE),
                            batch_size = NULL,
                            fragment_scan_options = NULL,
                            ...) {

diff --git a/r/man/Scanner.Rd b/r/man/Scanner.Rd