Ability to get "source" / and error stack from an ArrowError to help debugging #2725
Comments
Thank you for starting this conversation. Further to the great points you raise, there are a couple of things about the current approach that work relatively well and that we should try to keep:
There are also some weaknesses of the current approach, which you allude to, that are worth highlighting:
I agree the mapping logic is cumbersome and unnecessary in most cases (when it is mostly needed to return the correct error type). I would say the errors needed are different, though the general-purpose handling of them could, I suppose, be said to be the same.
I would like to propose switching to using anyhow for a couple of reasons, but appreciate other opinions may differ.
Ecosystem Adoption
There is a wide ecosystem already using anyhow; it currently sits at 53M downloads. For reference, tokio sits at 66M. This means:
Simple and Ergonomic
It is easy to use, unobtrusive, and allows focusing on the actual logic
We can easily return different types of errors, without needing to define any custom structs, enumerations, etc. It just works and gets out of the way.
Additional Context
We can easily add additional context to an existing error, without this resulting in a breaking API change or needing to shift focus from the function to some error declaration at the top of the file, or in some other file entirely.
This will then get printed as
Backtraces
Enabling the
It is worth noting that capturing backtraces is not cheap. However, in arrow and DataFusion errors are not used for control flow but for exceptions, where the cost of a backtrace will be negligible compared to the cost of the failing network call, columnar kernel, etc.
Downcasting A given
It is hard to think of cases where it would make sense to perform this form of error handling within the context of a query engine, but perhaps we could envisage some system that can respond to
Here we can see that downcasting works, even when there are intervening operators adding additional context to aid in debugging.
I think the example of reacting to OOM in #2725 (comment) looks very nice.
@tustvold I read up on anyhow and I like what I saw. The docs on how display works make a lot of sense to me: https://docs.rs/anyhow/1.0.65/anyhow/struct.Error.html 👍
Also, judging by the number of other crates that use it (6172 when I looked), it is widely adopted across the ecosystem: https://crates.io/crates/anyhow/reverse_dependencies.
I think the next step might be to make a small POC of what "switching to using anyhow" might look like, and we could make a POC in datafusion to see what types of downstream changes would potentially be needed (I am happy to try the datafusion PR).
It also appears that the backtrace capture is controlled by an environment variable, so we can have even finer-grained control.
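For illustration, the standard library's `std::backtrace::Backtrace::capture` follows this model: it only records a backtrace when `RUST_LIB_BACKTRACE` or `RUST_BACKTRACE` is set to something other than `0`. The sketch below assumes a hypothetical `KernelError` type (not an arrow type) that stores such a backtrace.

```rust
use std::backtrace::{Backtrace, BacktraceStatus};
use std::error::Error;
use std::fmt;

/// Hypothetical error type that records where it was created.
#[derive(Debug)]
struct KernelError {
    message: String,
    backtrace: Backtrace,
}

impl KernelError {
    fn new(message: impl Into<String>) -> Self {
        Self {
            message: message.into(),
            // Returns a disabled placeholder unless RUST_LIB_BACKTRACE or
            // RUST_BACKTRACE enables capture, so the cost is opt-in.
            backtrace: Backtrace::capture(),
        }
    }
}

impl fmt::Display for KernelError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.write_str(&self.message)
    }
}

impl Error for KernelError {}

fn main() {
    let err = KernelError::new("cast failed");
    match err.backtrace.status() {
        // Only reached when capture was enabled through the environment.
        BacktraceStatus::Captured => println!("{err}\n{}", err.backtrace),
        _ => println!("{err} (set RUST_BACKTRACE=1 for a backtrace)"),
    }
}
```

Running the same binary with and without `RUST_BACKTRACE=1` switches between the two branches, with no rebuild needed.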
I think this is probably best done after #2594: not only will this avoid huge merge conflicts, but it would also provide a nice way to incrementally migrate one crate at a time, starting with the leaf crates. This will also give some time to build consensus around this as a path forward.
Notes on
Unless anybody objects, I plan to start this process in the coming days, starting with the parquet crate as it is fairly self-contained, and there have been further community requests for this functionality in #3285. I will send a note out to the mailing list shortly.
I think making a proposal in code is probably a good next step so we can see exactly what it would look like.
I think #3567 fixes the first part of this issue (getting the source).
(note I am filing this in arrow-rs rather than datafusion as the same applies to lower level arrow errors, we would love to follow the same model in both projects, and it just came up in the context of refactoring the arrow crate: #2711 (comment))
I also harbor perhaps unrealistic dreams that we can do something in arrow that is reasonable, show it works in a real set of projects, and then write / blog about it, using that as a bully pulpit to move some of the Rust error projects along more speedily.
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
When developing IOx (which uses DataFusion, which uses Arrow) when an error is encountered, I get an error like this:
This is challenging for several reasons:
util.rs in this case) is where the error was detected, not the source of the error.
grepping the source code of arrow and datafusion) I don't know what the call stack was when that code was invoked. In this case I don't know what type of plan was being converted when the error happened.
What I typically do (please don't laugh) to debug such errors is:
panic!
RUST_BACKTRACE=1 set so I can get a backtrace
While this works, it is both annoying and, I suspect, more than most users will be willing to bear, as they don't already have local checkouts of arrow and datafusion.
What I really want is for errors in Arrow (and then also DataFusion) to provide a trace like what Python provides. Stylistically:
Where the Anonymous is meant to signify a location where ? was used to propagate the error.
I can flesh out these ideas more if anyone is interested.
Items I do not care about
Note I don't want to get into a discussion about providing runtime backtrace support. I would be very happy to only have function names (ideally with line numbers) from arrow / datafusion and any other projects that add the support explicitly. I would also be fine with using a proper backtrace in the implementation, but I don't want this ticket to get bogged down like other RFCs seem to have.
I also would like to avoid requiring every error site to have a different error enum to get this feature (though whatever we do here shouldn't prevent adding new error variants for error structured reporting)
Describe the solution you'd like
What I would like is some way to annotate Arrow errors with the source location they came from, as well as any causing error, and a way to walk the chain. Perhaps we can start with some macros -- here is a crazy idea to start thinking about:
Which on error would result in an error like
We could potentially use a similar macro for annotating the result of ? (somehow)
Describe alternatives you've considered
We could wait for any of the various Rust RFCs in error handling to stabilize such as https://rust-lang.github.io/rfcs/0201-error-chaining.html or https://rust-lang.github.io/rfcs/2504-fix-error.html.
However, given how long they have been outstanding I am not going to hold my breath.
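In the meantime, a rough version of a location-annotated error chain can be built on stable Rust with `#[track_caller]` and `std::panic::Location`. The following is only a sketch under assumptions: `Located`, `LocatedExt`, `parse_len`, `read_batch` and `render_chain` are hypothetical names invented for this example and are not part of arrow, datafusion, or any proposal in this thread.

```rust
use std::error::Error;
use std::fmt;
use std::panic::Location;

/// Hypothetical wrapper pairing an error with the call site that propagated it.
#[derive(Debug)]
struct Located {
    location: &'static Location<'static>,
    source: Box<dyn Error>,
}

impl fmt::Display for Located {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "at {}:{}", self.location.file(), self.location.line())
    }
}

impl Error for Located {
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        Some(&*self.source)
    }
}

/// Hypothetical extension trait: `result.located()?` records the caller's file and line.
trait LocatedExt<T> {
    fn located(self) -> Result<T, Located>;
}

impl<T, E: Error + 'static> LocatedExt<T> for Result<T, E> {
    #[track_caller]
    fn located(self) -> Result<T, Located> {
        // Read the location before the closure; closures do not inherit #[track_caller].
        let location = Location::caller();
        self.map_err(|e| Located { location, source: Box::new(e) })
    }
}

fn parse_len(s: &str) -> Result<usize, Located> {
    let n = s.trim().parse::<usize>().located()?;
    Ok(n)
}

fn read_batch(s: &str) -> Result<usize, Located> {
    // Each propagation point adds one frame to the chain.
    let n = parse_len(s).located()?;
    Ok(n)
}

/// Renders the chain, outermost frame first, one cause per line.
fn render_chain(err: &(dyn Error + 'static)) -> String {
    let mut out = err.to_string();
    let mut cause = err.source();
    while let Some(e) = cause {
        out.push_str("\n  caused by: ");
        out.push_str(&e.to_string());
        cause = e.source();
    }
    out
}

fn main() {
    if let Err(e) = read_batch("not a number") {
        eprintln!("{}", render_chain(&e));
    }
}
```

This records only file and line at sites that opt in with `.located()`, which matches the "no runtime backtrace required" constraint above, and it does not require a new error enum per call site.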
Additional context
@yahoNanJing and @mingmwang are discussing similar things on apache/datafusion#3410 (comment), I believe
The most recent time I hit this was in https://github.com/influxdata/influxdb_iox/pull/5606
@tustvold @andygrove and I discussed error handling in Arrow as well in #2711 (comment)