-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-36026: [C++][ORC] Catch all ORC exceptions to avoid crash #40697
Conversation
I have filed https://issues.apache.org/jira/browse/ORC-1661 to follow up this issue. For now I'd propose to eagerly check TZDB availability. Please let me know what do you think. @jorisvandenbossche @pitrou @kou |
I think it would be more useful to find why the exception isn't properly caught. |
The answer might be that |
Agreed. To my surprise that Windows behaves differently when opening a non-existent file: apache/orc#1856. I will debug it on my Windows PC after work. |
If my reading of the code in that PR is correct, it is testing that it raises a arrow/cpp/src/arrow/adapters/orc/adapter.cc Lines 73 to 83 in a36cc62
|
TimezoneError is added in ORC 2.0.0, but we are still on 1.9.2. @jorisvandenbossche |
I was looking at ORC 1.9.3 as well, which already has TimezoneError? https://github.com/apache/orc/blob/a765c316adfb80b47fb8c38831f8c0e0d7294cdb/c%2B%2B/src/Timezone.cc#L684-L685 |
@jorisvandenbossche Sorry for the confusion. I thought you were talking about this change: apache/orc#1587 which is ORC 2.0.0 only. I tried to use ORC 1.9.2 (w/o the aforementioned path existence check) to open a reader on Windows w/o TZDIR set. It actually throws a |
But that ParseError is still caught and re-raised as a TimezoneError in the snippet I linked? (https://github.com/apache/orc/blob/a765c316adfb80b47fb8c38831f8c0e0d7294cdb/c%2B%2B/src/Timezone.cc#L684-L685) (and sorry for causing the confusion! because I was initially also talking about your PR for 2.0, but then afterwards also went checking the code of 1.9) |
Yes, this is what I don't understand either. I am trying to reproduce this on the arrow side. |
Instead of spending time reproducing it, perhaps ensure that all C++ exceptions are caught? This would be more robust anyway, since Orc C++ uses exceptions. |
Well, it would perfectly explain why it is crashing. And the bug report you linked to (apache/orc#1577 (comment)) actually points to the line (704) that raises the TimezoneError. So that seems to all fit correctly together? |
@jorisvandenbossche You're right. This issue does not relate to Windows and I can also reproduce this on my Mac with a wrong TZDIR env set. The ORC C++ library throws |
catch (...) { \ | ||
return Status::UnknownError("ORC error"); \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need this? (Is catch (const std::exception& e)
enough?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Usually it is enough. But just in case?
@@ -183,6 +183,22 @@ liborc::RowReaderOptions default_row_reader_options() { | |||
return options; | |||
} | |||
|
|||
// Proactively check the availability of timezone database. | |||
// Remove it once https://issues.apache.org/jira/browse/ORC-1661 has been fixed. | |||
Status check_timezone_database_availability() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this useful if we're correctly catching exceptions? I'm afraid this could get out of sync with ORC's own timezone loading code.
I'd rather not do this, and let ORC fix the issue by making the timezone file optional (what is it used for exactly?).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@wgtmac what would be the error message that you get when not doing the above now that you are catching the exceptions?
Because one reason to keep this, is to provide an informative error message (most Windows users will see this the first time, so I think it is important to give some guidance in the error message, and not just a "file not found").
(to avoid getting out of sync with ORC, we could add an ORC version check and only do this for ORC<2?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After catching all exceptions, the error message here without the check_timezone_database_availability()
is Invalid: Can't open /usr/share/zoneinfo/XXX
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(to avoid getting out of sync with ORC, we could add an ORC version check and only do this for ORC<2?)
That sounds reasonable to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd rather not do this, and let ORC fix the issue by making the timezone file optional (what is it used for exactly?).
ORC has two timestamp types: timestamp (namely timestamp_without_timezone) and timestamp_instant (namely timestamp_with_local_timezone).
- timestamp: writer keeps the writer timezone in the stripe footer, and reader uses reader timezone to recover the same wall clock time by converting from writer timezone to reader timezone. that's why the reader try to call
getLocalTimezone()
on startup. - timestamp_with_local_timezone: use UTC everywhere so it does not have any conversion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- timestamp: writer keeps the writer timezone in the stripe footer, and reader uses reader timezone to recover the same wall clock time by converting from writer timezone to reader timezone. that's why the reader try to call
getLocalTimezone()
on startup.
If we convert, we should convert to UTC instead of converting to the local timezone. That's how Arrow timestamps work (but we still need a timezone DB for the conversion anyway):
Lines 283 to 288 in 5181791
/// Timestamps with a non-empty timezone | |
/// ------------------------------------ | |
/// | |
/// If a Timestamp column has a non-empty timezone value, its epoch is | |
/// 1970-01-01 00:00:00 (January 1st 1970, midnight) in the *UTC* timezone | |
/// (the Unix epoch), regardless of the Timestamp's own timezone. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we convert, we should convert to UTC instead of converting to the local timezone.
I know that's a better idea. However, this was the design choice of ORC timestamp type which originates from Hive. So it would be better to use timestamp_instant type in favor of the old timestamp type in ORC at any time.
std::getline(ss, str, '.'); | ||
versions.emplace_back(std::stoi(str)); | ||
} | ||
if (versions.size() != 3) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So if a future ORC versions advertises e.g. "2.1.0.1", this will fail? Sounds inflexible while we're only concerned about the major number.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make sense! It may also be 2.1.0-SNAPSHOT
or whatever custom patched version. Let me change it to only care about major and minor version instead.
const char* tzdir_env_key = "TZDIR"; | ||
const char* expect_str = "IANA time zone database is unavailable but required by ORC"; | ||
auto tzdir_env_backup = std::getenv(tzdir_env_key); | ||
ARROW_EXPECT_OK(arrow::internal::SetEnvVar(tzdir_env_key, "/a/b/c/d/e")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use EnvVarGuard
so that the value doesn't leak if this test fails.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't notice that. Thanks!
@@ -173,7 +173,7 @@ class OrcStripeReader : public RecordBatchReader { | |||
int64_t batch_size_; | |||
}; | |||
|
|||
liborc::RowReaderOptions default_row_reader_options() { | |||
liborc::RowReaderOptions DefaultRowReaderOptions() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function was added by me in the past and now I have changed its name to follow the same style.
cpp/src/arrow/adapters/orc/adapter.h
Outdated
@@ -19,6 +19,7 @@ | |||
|
|||
#include <cstdint> | |||
#include <memory> | |||
#include <optional> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't required anymore, is it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch! Removed it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, just a nit
@github-actions crossbow submit -g python |
Revision: 182e1d3 Submitted crossbow builds: ursacomputing/crossbow @ actions-4f707bcb76 |
Thanks a lot @wgtmac ! |
After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit dbff1f4. There were 2 benchmark results with an error:
There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 8 possible false positives for unstable benchmarks that are known to sometimes produce them. |
Rationale for this change
When /usr/share/zoneinfo is unavailable and TZDIR env is unset, creating C++ ORC reader will crash on Windows. We need to eagerly check this and prevent followup crash.
What changes are included in this PR?
Eagerly check TZDB availability before creating ORC reader/writer.
Are these changes tested?
Yes, added a test case to make sure the check work as expected.
Are there any user-facing changes?
Users on Windows (or other cases when TZDB is not availble) will clearly see this error message instead of crash.