-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-41755: [C++][ORC] Ensure setting detected ORC version #41767
Conversation
Thanks for opening a pull request! If this is not a minor PR. Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project. Then could you also rename the pull request title in the following format?
or
In the case of PARQUET issues on JIRA the title also supports:
See also: |
|
return Status::Invalid( | ||
"IANA time zone database is unavailable but required by ORC." | ||
" Please install it to /usr/share/zoneinfo or set TZDIR env to the installed" | ||
" directory"); | ||
" directory. If you are using conda, simply install the package `tzdata`."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMHO, it would be annoying to non-conda users to see this irrelevant error message. Perhaps we can add this only when CONDA_PREFIX
is set but TZDB is not installed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand how that would be annoying TBH. It just adds one more way to fix the problem, and the user can choose the appropriate solution. I mean, I can do what you ask, but I feel it's implementation complexity for very little (or even negative) gain.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kou WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Either is OK.
I don't know why this is happen with conda. Can we add tzdata
to dependencies of arrow-cpp
or orc
conda packages?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know why this is happen with conda.
I described it in the issue (#41755) and in #36026: it's due to apache/orc#1882, which was the only way to avoid injecting TZDIR
as an environment variable into every conda environment using arrow on windows, which would have been very intrusive.
Can we add
tzdata
to dependencies ofarrow-cpp
ororc
conda packages?
That already happened in conda-forge/pyarrow-feedstock#122 (and backports to 15.x, 14.x, 13.x). In general, someone needs to sync the feedstocks back to the repo here, but that's unrelated to this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ORC 2.0.1 can find tzdb without setting
TZDIR
with conda (CONDA_PREFIX
), right?
Yes. If the environment variable CONDA_PREFIX
is set, the code will prefer looking in $CONDA_PREFIX/share/zoneinfo
, which is what's shipped through tzdata
.
libarrow
conda package uses ORC 2.0.1, right?
As of conda-forge/arrow-cpp-feedstock#1421 (and the various backports), yes.
So this message will not be shown for conda users. If this message is shown with ORC 2.0.1, ORC version detection may be failed.
Indeed something might be going wrong there, because conda-forge/pyarrow-feedstock#122 failed before I backported this PR (in conda-forge/arrow-cpp-feedstock#1424), despite building against orc 2.0.1 already
Can we add
tzdata
to dependencies oforc
only on Windows? If we can do it, it will solve this problem.
We can add tzdata as a dependency on orc
, but I want to make this unconditional. It's better to always rely on conda-forge's tzdata, rather than potentially out-of-date stuff in /usr/share/zoneinfo
(aside from not introducing unnecessary divergence between platforms).
In any case, the presence of tzdata
itself is not enough. We either need to ensure that ARROW_ORC_NEED_TIME_ZONE_DATABASE_CHECK
correctly gets set to false on orc >=2.0.1, or do something like this PR. I think this PR is better, because you cannot know at compile-time if the package will end up in a conda-environment, and IMO this check should match what orc
does, i.e. check CONDA_PREFIX
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed something might be going wrong there
Could you try this?
diff --git a/cpp/cmake_modules/FindorcAlt.cmake b/cpp/cmake_modules/FindorcAlt.cmake
index 289416678a..ce8cd11b4c 100644
--- a/cpp/cmake_modules/FindorcAlt.cmake
+++ b/cpp/cmake_modules/FindorcAlt.cmake
@@ -71,4 +71,5 @@ if(orcAlt_FOUND)
PROPERTIES IMPORTED_LOCATION "${ORC_STATIC_LIB}"
INTERFACE_INCLUDE_DIRECTORIES "${ORC_INCLUDE_DIR}")
endif()
+ set(orcAlt_VERSION ${ORC_VERSION})
endif()
We can add tzdata as a dependency on
orc
, but I want to make this unconditional. It's better to always rely on conda-forge's tzdata, rather than potentially out-of-date stuff in/usr/share/zoneinfo
(aside from not introducing unnecessary divergence between platforms).
It makes sense.
We either need to ensure that
ARROW_ORC_NEED_TIME_ZONE_DATABASE_CHECK
correctly gets set to false on orc >=2.0.1, or do something like this PR. I think this PR is better
We don't want to maintain this code as much as possible ORC 2.0.0 or later have the same check. So we want to choose the former.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you try this?
Confirmed in conda-forge/arrow-cpp-feedstock#1432 that this works, thank you! Do you want to open a separate PR, or should I change this PR to your one-line suggestion?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should I change this PR to your one-line suggestion?
Could you do this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
Suggested-By: Sutou Kouhei <kou@clear-code.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit ff9921f. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 114 possible false positives for unstable benchmarks that are known to sometimes produce them. |
FindorcAlt.cmake
doesn't setorcAlt_VERSION
when it finds ORC byfind_library()
/find_path()
. IforcAlt_VERSION
isn't set, ORC version detection by caller is failed.cpp/src/arrow/adapters/orc/adapter.cc
uses detected ORC version. If detected ORC version isn't correct, needless time zone database check is used.Deployed in conda-forge through conda-forge/arrow-cpp-feedstock#1424 and confirmed as working in conda-forge/pyarrow-feedstock#122