From 762fad5e5d1499b20db81a75cbc448c1ef6fca03 Mon Sep 17 00:00:00 2001
From: Nic Crane
Date: Mon, 3 Jan 2022 18:34:37 +0000
Subject: [PATCH] ARROW-14653: [R] head() hangs on CSV datasets > 600MB

This PR switches to using the asynchronous scanner by default when reading
in datasets. I've tested it locally on a large dataset (2.5 GB of CSV files)
and it does resolve the original issue, but due to the size of the files
involved I wasn't sure this was something I could easily write tests for.

Closes #11992 from thisisnic/ARROW-14653_head_hangs

Authored-by: Nic Crane
Signed-off-by: Nic Crane
---
 r/NEWS.md          | 1 +
 r/R/dataset-scan.R | 4 ++--
 r/man/Scanner.Rd   | 2 +-
 3 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/r/NEWS.md b/r/NEWS.md
index 90a9bba79d0b9..89e990ca2e246 100644
--- a/r/NEWS.md
+++ b/r/NEWS.md
@@ -27,6 +27,7 @@
 * Added `decimal128()` (identical to `decimal()`) as the name is more explicit and updated docs to encourage its use.
 * Source builds now by default use `pkg-config` to search for system dependencies (such as `libz`) and link to them if present. To retain the previous behaviour of downloading and building all dependencies, set `ARROW_DEPENDENCY_SOURCE=BUNDLED`.
+* Opening a dataset now uses the async scanner by default, which resolves a deadlock issue when reading in large multi-file CSV datasets

 # arrow 6.0.1

diff --git a/r/R/dataset-scan.R b/r/R/dataset-scan.R
index 03c926fb43793..b7f58bfa4bd0b 100644
--- a/r/R/dataset-scan.R
+++ b/r/R/dataset-scan.R
@@ -34,7 +34,7 @@
 #'   to keep all rows.
 #' * `use_threads`: logical: should scanning use multithreading? Default `TRUE`
 #' * `use_async`: logical: should the async scanner (performs better on
-#'   high-latency/highly parallel filesystems like S3) be used? Default `FALSE`
+#'   high-latency/highly parallel filesystems like S3) be used? Default `TRUE`
 #' * `...`: Additional arguments, currently ignored
 #' @section Methods:
 #' `ScannerBuilder` has the following methods:
@@ -73,7 +73,7 @@ Scanner$create <- function(dataset,
                            projection = NULL,
                            filter = TRUE,
                            use_threads = option_use_threads(),
-                           use_async = getOption("arrow.use_async", FALSE),
+                           use_async = getOption("arrow.use_async", TRUE),
                            batch_size = NULL,
                            fragment_scan_options = NULL,
                            ...) {
diff --git a/r/man/Scanner.Rd b/r/man/Scanner.Rd
index db6488f50127e..45184e321136a 100644
--- a/r/man/Scanner.Rd
+++ b/r/man/Scanner.Rd
@@ -22,7 +22,7 @@
 named list of expressions to keep all rows.
 \item \code{use_threads}: logical: should scanning use multithreading? Default \code{TRUE}
 \item \code{use_async}: logical: should the async scanner (performs better on
-high-latency/highly parallel filesystems like S3) be used? Default \code{FALSE}
+high-latency/highly parallel filesystems like S3) be used? Default \code{TRUE}
 \item \code{...}: Additional arguments, currently ignored
 }
}
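
The user-facing effect of the change above can be sketched as follows. This is a minimal illustration, not part of the patch: the dataset path is hypothetical, and it assumes the `arrow` R package at the patched version is installed. The `arrow.use_async` option and the `use_async` argument are the two override points shown in the diff.

```r
library(arrow)

# Open a multi-file CSV dataset; with this patch, scans (including the
# implicit scan behind head()) use the asynchronous scanner by default,
# so head() no longer deadlocks on large multi-file CSV datasets.
ds <- open_dataset("path/to/csv_dir", format = "csv")  # hypothetical path
head(ds)

# The previous synchronous behaviour can still be restored globally,
# since the default is now getOption("arrow.use_async", TRUE):
options(arrow.use_async = FALSE)

# ...or per scanner, via the use_async argument of Scanner$create():
scanner <- Scanner$create(ds, use_async = FALSE)
```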