Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-36026: [C++][ORC] Catch all ORC exceptions to avoid crash #40697

Merged
merged 6 commits into from
Mar 26, 2024
Merged
Show file tree
Hide file tree
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
32 changes: 25 additions & 7 deletions cpp/src/arrow/adapters/orc/adapter.cc
Original file line number Diff line number Diff line change
Expand Up @@ -18,17 +18,14 @@
#include "arrow/adapters/orc/adapter.h"

#include <algorithm>
#include <cstdint>
#include <functional>
#include <filesystem>
#include <list>
#include <memory>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

#include "arrow/adapters/orc/util.h"
#include "arrow/buffer.h"
#include "arrow/builder.h"
#include "arrow/io/interfaces.h"
#include "arrow/memory_pool.h"
Expand All @@ -37,14 +34,11 @@
#include "arrow/table.h"
#include "arrow/table_builder.h"
#include "arrow/type.h"
#include "arrow/type_traits.h"
#include "arrow/util/bit_util.h"
#include "arrow/util/checked_cast.h"
#include "arrow/util/decimal.h"
#include "arrow/util/key_value_metadata.h"
#include "arrow/util/macros.h"
#include "arrow/util/range.h"
#include "arrow/util/visibility.h"
#include "orc/Exceptions.hh"

// alias to not interfere with nested orc namespace
Expand Down Expand Up @@ -80,6 +74,12 @@ namespace liborc = orc;
} \
catch (const liborc::NotImplementedYet& e) { \
return Status::NotImplemented(e.what()); \
} \
catch (const std::exception& e) { \
wgtmac marked this conversation as resolved.
Show resolved Hide resolved
return Status::Invalid(e.what()); \
} \
catch (...) { \
return Status::UnknownError("ORC error"); \
Comment on lines +81 to +82
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this? (Is catch (const std::exception& e) enough?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Usually it is enough. But just in case?

}

#define ORC_CATCH_NOT_OK(_s) \
Expand Down Expand Up @@ -183,6 +183,22 @@ liborc::RowReaderOptions default_row_reader_options() {
return options;
}

// Proactively check the availability of timezone database.
// Remove it once https://issues.apache.org/jira/browse/ORC-1661 has been fixed.
Status check_timezone_database_availability() {
Copy link
Member

@pitrou pitrou Mar 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this useful if we're correctly catching exceptions? I'm afraid this could get out of sync with ORC's own timezone loading code.

I'd rather not do this, and let ORC fix the issue by making the timezone file optional (what is it used for exactly?).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wgtmac what would be the error message that you get when not doing the above now that you are catching the exceptions?

Because one reason to keep this, is to provide an informative error message (most Windows users will see this the first time, so I think it is important to give some guidance in the error message, and not just a "file not found").
(to avoid getting out of sync with ORC, we could add an ORC version check and only do this for ORC<2?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After catching all exceptions, the error message here without the check_timezone_database_availability() is Invalid: Can't open /usr/share/zoneinfo/XXX.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(to avoid getting out of sync with ORC, we could add an ORC version check and only do this for ORC<2?)

That sounds reasonable to me.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather not do this, and let ORC fix the issue by making the timezone file optional (what is it used for exactly?).

ORC has two timestamp types: timestamp (namely timestamp_without_timezone) and timestamp_instant (namely timestamp_with_local_timezone).

  • timestamp: writer keeps the writer timezone in the stripe footer, and reader uses reader timezone to recover the same wall clock time by converting from writer timezone to reader timezone. that's why the reader try to call getLocalTimezone() on startup.
  • timestamp_with_local_timezone: use UTC everywhere so it does not have any conversion.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • timestamp: writer keeps the writer timezone in the stripe footer, and reader uses reader timezone to recover the same wall clock time by converting from writer timezone to reader timezone. that's why the reader try to call getLocalTimezone() on startup.

If we convert, we should convert to UTC instead of converting to the local timezone. That's how Arrow timestamps work (but we still need a timezone DB for the conversion anyway):

arrow/format/Schema.fbs

Lines 283 to 288 in 5181791

/// Timestamps with a non-empty timezone
/// ------------------------------------
///
/// If a Timestamp column has a non-empty timezone value, its epoch is
/// 1970-01-01 00:00:00 (January 1st 1970, midnight) in the *UTC* timezone
/// (the Unix epoch), regardless of the Timestamp's own timezone.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we convert, we should convert to UTC instead of converting to the local timezone.

I know that's a better idea. However, this was the design choice of ORC timestamp type which originates from Hive. So it would be better to use timestamp_instant type in favor of the old timestamp type in ORC at any time.

wgtmac marked this conversation as resolved.
Show resolved Hide resolved
auto tz_dir = std::getenv("TZDIR");
bool is_tzdb_avaiable = tz_dir != nullptr
? std::filesystem::exists(tz_dir)
: std::filesystem::exists("/usr/share/zoneinfo");
if (!is_tzdb_avaiable) {
return Status::Invalid(
"IANA timezone database is unavailable but required by ORC."
" Please install it to /usr/share/zoneinfo or set TZDIR env to the installed"
" directory");
}
return Status::OK();
}

} // namespace

class ORCFileReader::Impl {
Expand Down Expand Up @@ -541,6 +557,7 @@ ORCFileReader::~ORCFileReader() {}

Result<std::unique_ptr<ORCFileReader>> ORCFileReader::Open(
const std::shared_ptr<io::RandomAccessFile>& file, MemoryPool* pool) {
RETURN_NOT_OK(check_timezone_database_availability());
auto result = std::unique_ptr<ORCFileReader>(new ORCFileReader());
RETURN_NOT_OK(result->impl_->Open(file, pool));
return std::move(result);
Expand Down Expand Up @@ -807,6 +824,7 @@ ORCFileWriter::ORCFileWriter() { impl_.reset(new ORCFileWriter::Impl()); }

Result<std::unique_ptr<ORCFileWriter>> ORCFileWriter::Open(
io::OutputStream* output_stream, const WriteOptions& writer_options) {
RETURN_NOT_OK(check_timezone_database_availability());
std::unique_ptr<ORCFileWriter> result =
std::unique_ptr<ORCFileWriter>(new ORCFileWriter());
Status status = result->impl_->Open(output_stream, writer_options);
Expand Down
26 changes: 26 additions & 0 deletions cpp/src/arrow/adapters/orc/adapter_test.cc
Original file line number Diff line number Diff line change
Expand Up @@ -33,8 +33,10 @@
#include "arrow/status.h"
#include "arrow/table.h"
#include "arrow/testing/gtest_util.h"
#include "arrow/testing/matchers.h"
#include "arrow/testing/random.h"
#include "arrow/type.h"
#include "arrow/util/io_util.h"
#include "arrow/util/key_value_metadata.h"

namespace liborc = orc;
Expand Down Expand Up @@ -636,6 +638,30 @@ TEST(TestAdapterReadWrite, FieldAttributesRoundTrip) {
AssertSchemaEqual(schema, read_schema, /*check_metadata=*/true);
}

TEST(TestAdapterReadWrite, ThrowWhenTZDBUnavaiable) {
// Backup the original TZDIR env and set a wrong value by purpose to trigger the check.
const char* tzdir_env_key = "TZDIR";
const char* expect_str = "IANA timezone database is unavailable but required by ORC";
auto tzdir_env_backup = std::getenv(tzdir_env_key);
ARROW_EXPECT_OK(arrow::internal::SetEnvVar(tzdir_env_key, "/a/b/c/d/e"));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use EnvVarGuard so that the value doesn't leak if this test fails.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't notice that. Thanks!


EXPECT_OK_AND_ASSIGN(auto out_stream, io::BufferOutputStream::Create(1024));
EXPECT_THAT(
adapters::orc::ORCFileWriter::Open(out_stream.get(), adapters::orc::WriteOptions()),
Raises(StatusCode::Invalid, testing::HasSubstr(expect_str)));

EXPECT_OK_AND_ASSIGN(auto buffer, out_stream->Finish());
EXPECT_THAT(adapters::orc::ORCFileReader::Open(
std::make_shared<io::BufferReader>(buffer), default_memory_pool()),
Raises(StatusCode::Invalid, testing::HasSubstr(expect_str)));

// Restore TZDIR env.
ARROW_EXPECT_OK(arrow::internal::DelEnvVar(tzdir_env_key));
if (tzdir_env_backup) {
ARROW_EXPECT_OK(arrow::internal::SetEnvVar(tzdir_env_key, tzdir_env_backup));
}
}

// Trivial

class TestORCWriterTrivialNoWrite : public ::testing::Test {};
Expand Down
Loading