ARROW-17859: [C++] Use self-pipe in signal-receiving StopSource #14250
Conversation
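For context on the title: the "self-pipe trick" is a standard way to get a signal out of an async-signal-safe handler and into ordinary thread code. The sketch below is only an illustration of the general idea, not the code added by this PR: the handler writes one byte to a pipe, and a watcher thread blocked in read() does the actual stop-request work.

// Minimal self-pipe sketch (illustrative only, not Arrow's implementation).
// The signal handler restricts itself to the async-signal-safe write();
// a watcher thread blocked in read() performs the real work.
#include <unistd.h>
#include <cerrno>
#include <csignal>
#include <cstdio>
#include <atomic>
#include <thread>

static int pipe_fds[2];                         // [0] read end, [1] write end
static std::atomic<bool> stop_requested{false};

extern "C" void on_signal(int) {
  char byte = 1;
  (void)write(pipe_fds[1], &byte, 1);           // async-signal-safe
}

int main() {
  if (pipe(pipe_fds) != 0) return 1;
  std::signal(SIGINT, on_signal);

  std::thread watcher([] {
    char byte;
    // Retry if the read itself is interrupted by the incoming signal.
    while (read(pipe_fds[0], &byte, 1) != 1 && errno == EINTR) {
    }
    stop_requested.store(true);                 // a real StopSource would RequestStop() here
    std::puts("stop requested");
  });

  while (!stop_requested.load()) {              // stand-in for cancellable work
    sleep(1);
  }
  watcher.join();
  return 0;
}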
@paleolimbot Could you try R cancellation with this PR? |
Cancellation works exactly the same on this branch as it does on the master branch! It's probably not related to this PR, but it looks like attempting to call arrow from a forked process in R doesn't work. I think this is the same on the master branch (will check tomorrow).

library(arrow, warn.conflicts = FALSE)
#> Some features are not enabled in this build of Arrow. Run `arrow_info()` for more information.
tf <- tempfile()
readr::write_csv(vctrs::vec_rep(mtcars, 5e5), tf)
# try to slow down CSV reading
set_cpu_count(1)
set_io_thread_count(2)
# see if arrow actually works in a forked R process (it doesn't)
parallel::mclapply(1:4, function(...) read_csv_arrow(tf))
#> Warning in parallel::mclapply(1:4, function(...) read_csv_arrow(tf)): all
#> scheduled cores encountered errors in user code
#> [[1]]
#> [1] "Error in as.data.frame(tab) : object 'tab' not found\n"
#> attr(,"class")
#> [1] "try-error"
#> attr(,"condition")
#> <simpleError in as.data.frame(tab): object 'tab' not found>
#>
#> [[2]]
#> [1] "Error in as.data.frame(tab) : object 'tab' not found\n"
#> attr(,"class")
#> [1] "try-error"
#> attr(,"condition")
#> <simpleError in as.data.frame(tab): object 'tab' not found>
#>
#> [[3]]
#> [1] "Error in as.data.frame(tab) : object 'tab' not found\n"
#> attr(,"class")
#> [1] "try-error"
#> attr(,"condition")
#> <simpleError in as.data.frame(tab): object 'tab' not found>
#>
#> [[4]]
#> [1] "Error in as.data.frame(tab) : object 'tab' not found\n"
#> attr(,"class")
#> [1] "try-error"
#> attr(,"condition")
#> <simpleError in as.data.frame(tab): object 'tab' not found>

Created on 2022-09-27 by the reprex package (v2.0.1) |
"object 'tab' not found" is purely an R error? It looks weird. |
Force-pushed from 980da8c to c5478a5
Ok, the R error was a bug in our error "helper" that was obscuring a real, related error. This PR doesn't affect cancellation in the original process, but it does result in errors in a forked process (probably something I have to fix on that PR?):

library(arrow, warn.conflicts = FALSE)
#> Some features are not enabled in this build of Arrow. Run `arrow_info()` for more information.
tf <- tempfile()
readr::write_csv(vctrs::vec_rep(mtcars, 5e5), tf)
# try to slow down CSV reading
set_cpu_count(1)
set_io_thread_count(2)
# see if arrow actually works in a forked R process (it doesn't)
parallel::mclapply(1:2, function(...) read_csv_arrow(tf))
#> Warning in parallel::mclapply(1:2, function(...) read_csv_arrow(tf)): all
#> scheduled cores encountered errors in user code
#> [[1]]
#> [1] "Error in (function (file, delim = \",\", quote = \"\\\"\", escape_double = TRUE, : \n Invalid: Signal stop source was not set up\nCaused by error:\n! Invalid: Signal stop source was not set up\n"
#> attr(,"class")
#> [1] "try-error"
#> attr(,"condition")
#> <error/rlang_error>
#> Error:
#> ! Invalid: Signal stop source was not set up
#> Caused by error:
#> ! Invalid: Signal stop source was not set up
#> ---
#> Backtrace:
#> 1. parallel::mclapply(1:2, function(...) read_csv_arrow(tf))
#> 2. base::lapply(seq_len(cores), inner.do)
#> 3. parallel (local) FUN(X[[i]], ...)
#> 10. base::lapply(X = S, FUN = FUN, ...)
#> 11. global FUN(X[[i]], ...)
#> 16. arrow (local) `<fn>`(file = tf, delim = ",")
#>
#> [[2]]
#> [1] "Error in (function (file, delim = \",\", quote = \"\\\"\", escape_double = TRUE, : \n Invalid: Signal stop source was not set up\nCaused by error:\n! Invalid: Signal stop source was not set up\n"
#> attr(,"class")
#> [1] "try-error"
#> attr(,"condition")
#> <error/rlang_error>
#> Error:
#> ! Invalid: Signal stop source was not set up
#> Caused by error:
#> ! Invalid: Signal stop source was not set up
#> ---
#> Backtrace:
#> 1. parallel::mclapply(1:2, function(...) read_csv_arrow(tf))
#> 2. base::lapply(seq_len(cores), inner.do)
#> 3. parallel (local) FUN(X[[i]], ...)
#> 10. base::lapply(X = S, FUN = FUN, ...)
#> 11. global FUN(X[[i]], ...)
#> 16. arrow (local) `<fn>`(file = tf, delim = ",")

Created on 2022-09-28 by the reprex package (v2.0.1) |
Thanks for checking @paleolimbot. That's something that might be fixed on the C++ side. I'll think about it. |
I updated #13635 so that it just warns here instead of erroring. The gist of that is that the …

If the stop token didn't live with the filesystem it probably wouldn't be an issue, because the process in general would never be forked before a cancellable operation was invoked (although I'm sure somebody will try to use …). |
I'm getting other crashes on this PR and so it may not be related, but in case it is, there's a crash using Flight from Python from R next to a bunch of messages about the StopSource already being set up: https://github.com/apache/arrow/actions/runs/3190877134/jobs/5206531644 |
Well, in any case I'll need to rebase and also complete the TODOs above. |
Force-pushed from c5478a5 to 9e2bf2d
@paleolimbot Would you like to try out this PR again? |
Force-pushed from 9e2bf2d to 7913919
I'll investigate more when I'm back from PTO on Monday, but I did a quick check and (1) cancellation still works interactively and (2) this does seem to break forked process behaviour when using R + arrow:

library(arrow, warn.conflicts = FALSE)
#> Some features are not enabled in this build of Arrow. Run `arrow_info()` for more information.
tf <- tempfile()
readr::write_csv(vctrs::vec_rep(mtcars, 5e5), tf)
# try to slow down CSV reading
set_cpu_count(1)
set_io_thread_count(2)
# make sure we can cancel
read_csv_arrow(tf)
#> # A tibble: 16,000,000 × 11
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <dbl> <int> <dbl> <int> <dbl> <dbl> <dbl> <int> <int> <int> <int>
#> 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
#> 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
#> 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
#> 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
#> 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
#> 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
#> 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
#> 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
#> 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
#> 10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
#> # … with 15,999,990 more rows
# see if arrow works in a forked R process
parallel::mclapply(1:2, function(...) read_csv_arrow(tf))
#> Warning in parallel::mclapply(1:2, function(...) read_csv_arrow(tf)): scheduled
#> cores 1, 2 did not deliver results, all values of the jobs will be affected
#> [[1]]
#> NULL
#>
#> [[2]]
#> NULL

Created on 2022-10-11 by the reprex package (v2.0.1) |
Ok, I checked this again, and the errors that I get are:

/Users/dewey/Desktop/rscratch/arrow/cpp/src/arrow/status.cc:134: Invalid: Signal stop source was not set up
/Users/dewey/Desktop/rscratch/arrow/cpp/src/arrow/status.cc:134: Invalid: Signal stop source was not set up

In the code in the R package, I'm careful not to error (but instead warn) on the result of RegisterCancellingSignalHandler(), and so I wonder if this is coming from C++ instead? I would still prefer not to mess with signal handling in the R package since I have a limited ability to debug this myself. I'm aware that I have a poor understanding of the constraints here, but the R package could probably provide a thread-safe callable like … |
On 31/10/2022 at 14:48, Dewey Dunnington wrote:
> Ok, I checked this again, and the errors that I get are:
> /Users/dewey/Desktop/rscratch/arrow/cpp/src/arrow/status.cc:134: Invalid: Signal stop source was not set up
> /Users/dewey/Desktop/rscratch/arrow/cpp/src/arrow/status.cc:134: Invalid: Signal stop source was not set up
> In the code in the R package, I'm careful not to error (but instead warn) on the result of RegisterCancellingSignalHandler(), and so I wonder if this is coming from C++ instead?

I don't know, can you try to turn this error into an abort and get a debugger backtrace?

Also, can you describe the flow of C++ API calls that R is doing to manage the signal stop source? |
The sequence is (for better or worse): … |
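Since the exact sequence is not spelled out here and the R bindings may do it differently, here is a hedged sketch of how the public API in arrow/util/cancel.h is typically driven around a cancellable operation. RunCancellableWork is a made-up stand-in for something like a CSV read; error handling is minimal.

// Hedged sketch of the arrow/util/cancel.h flow; not necessarily what the
// R package does, and RunCancellableWork is hypothetical.
#include <arrow/result.h>
#include <arrow/status.h>
#include <arrow/util/cancel.h>
#include <csignal>
#include <vector>

// Hypothetical worker: a real one (e.g. a CSV reader) would check the token
// between chunks of work.
arrow::Status RunCancellableWork(arrow::StopToken token) {
  while (!token.IsStopRequested()) {
    // ... do a slice of work ...
  }
  return token.Poll();  // yields a Cancelled status once stop was requested
}

arrow::Status RunWithSigintCancellation() {
  // 1. Install the process-wide signal-receiving StopSource.
  ARROW_ASSIGN_OR_RAISE(arrow::StopSource* source, arrow::SetSignalStopSource());
  // 2. Make SIGINT call RequestStop() on that source.
  ARROW_RETURN_NOT_OK(arrow::RegisterCancellingSignalHandler({SIGINT}));
  // 3. Hand the source's token to the cancellable operation.
  arrow::Status st = RunCancellableWork(source->token());
  // 4. Restore signal handling and tear the stop source back down.
  arrow::UnregisterCancellingSignalHandler();
  arrow::ResetSignalStopSource();
  return st;
}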
I can look into how to do this tomorrow; off the top of my head I don't have something set up to do this. Usually I get a full backtrace with my errors because I have the extra error context turned on, and I don't know why it's not appearing here. |
Oh, and the error is an abort; it just happens in a worker process, so R reports it as a warning. |
It also should be said that |
Can you describe where the fork() call happens in that sequence? |
In the example I've been using, the sequence is (I think): … |
I did another check, and it looks like the thing that "breaks" the mclapply behaviour is doing anything that registers signal handlers on the main thread before launching the workers:

library(arrow, warn.conflicts = FALSE)
#> Some features are not enabled in this build of Arrow. Run `arrow_info()` for more information.
tf <- tempfile()
write.csv(mtcars, tf, row.names = FALSE)
# If we use the signal handling infrastructure in the main process before
# doing the forked workers thing, then the calls will fail
# read_csv_arrow(tf)
parallel::mclapply(1:2, function(...) {
read_csv_arrow(tf)
})
#> [[1]]
#> # A tibble: 32 × 11
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <dbl> <int> <dbl> <int> <dbl> <dbl> <dbl> <int> <int> <int> <int>
#> 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
#> 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
#> 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
#> 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
#> 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
#> 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
#> 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
#> 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
#> 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
#> 10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
#> # … with 22 more rows
#>
#> [[2]]
#> # A tibble: 32 × 11
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <dbl> <int> <dbl> <int> <dbl> <dbl> <dbl> <int> <int> <int> <int>
#> 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
#> 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
#> 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
#> 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
#> 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
#> 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
#> 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
#> 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
#> 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
#> 10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
#> # … with 22 more rows

Created on 2022-11-01 with reprex v2.0.2 |
I'll update this to use the at-fork facility introduced in #14594 |
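Roughly, the at-fork idea is: after fork() only the calling thread exists in the child, so anything that depends on a helper thread or a pipe (such as a self-pipe-based StopSource) has to be reinitialized there. The sketch below uses the plain POSIX pthread_atfork() facility and made-up names; the Arrow-internal helper introduced in #14594 may look different.

// Illustrative POSIX at-fork reinitialization; names are hypothetical and the
// facility from #14594 may differ.
#include <pthread.h>
#include <sys/types.h>
#include <unistd.h>

static int self_pipe[2] = {-1, -1};

static void CreateSelfPipe() { (void)pipe(self_pipe); }

static void AtForkChild() {
  // The child inherits the parent's pipe fds but none of its threads, so drop
  // the inherited pipe and create a fresh one. A real implementation would
  // also restart the watcher thread that reads from it.
  close(self_pipe[0]);
  close(self_pipe[1]);
  CreateSelfPipe();
}

int main() {
  CreateSelfPipe();
  // Run AtForkChild() automatically in every future child process.
  pthread_atfork(/*prepare=*/nullptr, /*parent=*/nullptr, /*child=*/AtForkChild);

  pid_t child = fork();
  if (child == 0) {
    // Child: self_pipe has already been recreated by AtForkChild().
    _exit(0);
  }
  return 0;
}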
Force-pushed from 7913919 to 9ef40c2
Yes, those are in the child process. |
Hmm, ok. Since I'm an R newbie, can you tell me how to run the reproducer code exactly? :-) |
What I did was:
library(arrow, warn.conflicts = FALSE)
tf <- tempfile()
write.csv(mtcars, tf, row.names = FALSE)
read_csv_arrow(tf)
parallel::mclapply(1:2, function(...) {
read_csv_arrow(tf)
})

Here's maybe some more raw output from lldb:

(edit: I fixed the backtraces to have both threads) |
Okay, I'm certainly going to use one of the Docker configurations. Can you explain how I can modify that script to just build R Arrow but not run the whole gamut of checks/tests/etc.?

Edit: … |
@paleolimbot I see that the R Arrow bindings are compiled with … |
By the way, I actually reproduce the issue on git master, so the issue is not specifically with this PR, right? |
Last finding: things work fine with …

Edit: it simply seems to be because … |
When we started this conversation I had tried both and the hanging was specific to this PR, but it's very possible that some other code change affected that (I tested with Arrow 10.0.0 via …). I am going to revisit the section of code that's hanging on the R side (specifically, I'm going to try avoiding the IO thread pool to launch the "do arrow stuff" thread) soon whilst trying to fix #14582 in a more permanent way, and perhaps that may be a good time to revisit.
Yes...I will add it to the Dockerfile later today. It's along the lines of |
Ok, I think I've found the issue. It is a problem with initialization order of static/global variables in Arrow C++: #14704 |
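The static initialization order problem referred to here, in general terms: namespace-scope globals in different translation units are constructed in an unspecified relative order, so a global whose constructor touches another global can run before that other global exists. A common cure is a function-local static, constructed on first use (and thread-safely since C++11). The sketch below uses made-up names and is not the code from #14704.

// Hypothetical illustration of the construct-on-first-use idiom.
//
// Problematic pattern: these two globals live in different .cc files, and the
// second one's constructor uses the first, whose construction order relative
// to it is unspecified:
//   ForkHandlerRegistry g_registry;                 // registry.cc
//   AtForkInstaller g_installer{&g_registry};       // installer.cc
//
// Construct-on-first-use instead:
#include <functional>
#include <utility>
#include <vector>

class ForkHandlerRegistry {
 public:
  void Add(std::function<void()> child_handler) {
    handlers_.push_back(std::move(child_handler));
  }

 private:
  std::vector<std::function<void()>> handlers_;
};

ForkHandlerRegistry& GetForkHandlerRegistry() {
  // Initialized the first time this function is called, which is necessarily
  // after the caller exists; never "too early".
  static ForkHandlerRegistry registry;
  return registry;
}

int main() {
  GetForkHandlerRegistry().Add([] { /* reinitialize state in the child */ });
  return 0;
}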
Force-pushed from 653a584 to 17b576d
I've rebased this now that #14704 is merged. @paleolimbot Can you check the cancellation behavior in R again? |
@github-actions crossbow submit -g cpp -g python -g wheel |
Revision: 17b576d

Submitted crossbow builds: ursacomputing/crossbow @ actions-9e8fdbbef9 |
I checked the R behaviour again and using this build of Arrow I can (1) cancel CSV reading and ExecPlans and (2) use Arrow from a forked process (i.e., |
Benchmark runs are scheduled for baseline = ada7e23 and contender = 94e45fa. 94e45fa is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |