-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Python] BUG: Reading ORC segfaults on windows (if TZDIR isn't set) #36026
Comments
The same segfault still appears with orc 1.9 |
@wgtmac may this related to C++ Orc release? |
I don't think so. It seems that this issue exists in the old 1.8.x releases as well. |
Definitely not related to 1.9; the segfault happened in the exact same place before for 1.8.3. I don't have data before that. |
I am not familiar with python or pyarrow. Does it mean I can simply run the failed test on Windows to reproduce this issue? @h-vetinari |
Hey @wgtmac, thanks for taking a look! It's from a pretty "standard" build, though arrow & pyarrow are quite a handful, especially considering all the dependencies. I could publish the failing builds in a way that they're (opt-in) installable using conda/mamba/miniforge, would that help? Otherwise, you could just try building arrow & pyarrow with |
Thanks for the info! Let me spend some time to build it on Windows first. |
Still segfaulting as of 13.0.0 |
It does seem that ORC is disabled on our Windows wheels: |
@kou @jorisvandenbossche do you know if we are testing ORC enabled on Windows for Pyarrow somewhere else? |
Do we have any other Windows testing at all? I suppose not testing ORC for pyarrow on appveyor is just an oversight, since it needs to be enabled manually? (and we forgot to do that) So we could start with enabling it there, and see if we can reproduce those errors? |
Oh, let's enable it. |
I've enabled it for Appveyor and I was not able to reproduce the issue, see the successful build here: https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/47879511 |
I haven't been able to reproduce this issue neither on the Windows wheels nor the Appveyor build. |
Sorry, it seems I was missing the following to test orc on the Wheels job: ce4fe1c stacktrace
|
Segfault on windows remains as of v14 |
Still segfaulting as of v15 |
Still segfaulting also with orc 2.0.0 (arrow 15.0.1) |
I tried to reproduce it on my Windows PC following the guide https://arrow.apache.org/docs/developers/python.html#building-on-windows Following commands were executed successfully and arrow cpp library was installed.
However, when I tried to build pyarrow as below,
it complained for missing LZ4:
I can confirm that the file Do you have any idea? @h-vetinari @raulcd |
Currently it is hardcoded (if the env variable is not set) to "/usr/share/zoneinfo" (https://github.com/apache/orc/blob/0b4631bc343ed559e284ea6f824f31890f798e31/c%2B%2B/src/Timezone.cc#L33-L34) I am not super familiar about whether there are standard ways to get a "user" or "app" prefix (we use Another option would also be to allow to specify a path at run time (we do that in Arrow for the tz database that is used by the kernels with |
Thanks for the suggestion!
This sounds like a good alternative to me. |
…test (#40609) ### Rationale for this change The pyarrow orc reader always crashes when it tries to create an internal orc reader. This is caused by failing to read tz database on the local host. This also disables windows wheel build when ARROW_ORC is turned on. ### What changes are included in this PR? Download IANA timezone database on the test host and explicitly setting TZDIR to make the CIs happy. ### Are these changes tested? Make sure all python wheel windows CIs pass. ### Are there any user-facing changes? No. * GitHub Issue: #36026 Authored-by: Gang Wu <ustcwg@gmail.com> Signed-off-by: Sutou Kouhei <kou@clear-code.com>
Issue resolved by pull request 40609 |
As I mentioned above (#36026 (comment)), the PR didn't really fix this issue, it just hides it in our own tests, but users still get a segfault. For 16.0 (and until ORC fixes this), we could also add a check on our side? Eg, if on windows, check that the env variable is set / the path is available, and if not raise an informative error to the user instead of segfaulting. |
Right, can the crash be averted? If a C++ exception is raised, then it can be caught and turned into a proper error. |
If we can upgrade to ORC 2.0.0, then it will throw if the file path does not exist: https://github.com/apache/orc/blob/main/c%2B%2B/src/Timezone.cc#L678-L682 However, it seems that the current version (1.9.2) simply crashes when open is called: https://github.com/apache/orc/blob/0b4631bc343ed559e284ea6f824f31890f798e31/c%2B%2B/src/OrcFile.cc#L58-L68 |
Well, it throws a C++ exception ( |
I need to debug it on Windows. If open() returns -1 and throws the exception, then we should already catch it and see the error message. |
### What changes were proposed in this pull request? Enable TestTimezone.testMissingTZDB unit test to run on Windows. ### Why are the changes needed? When /usr/share/zoneinfo is unavailable and TZDIR env is unset, creating C++ ORC reader will crash on Windows. We need to better deal with this case. See context from the Apache Arrow community: apache/arrow#36026 and apache/arrow#40633 ### How was this patch tested? Make sure the test passes on Windows. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #1856 from wgtmac/win_tz_test. Authored-by: Gang Wu <ustcwg@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request? Enable TestTimezone.testMissingTZDB unit test to run on Windows. ### Why are the changes needed? When /usr/share/zoneinfo is unavailable and TZDIR env is unset, creating C++ ORC reader will crash on Windows. We need to better deal with this case. See context from the Apache Arrow community: apache/arrow#36026 and apache/arrow#40633 ### How was this patch tested? Make sure the test passes on Windows. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #1856 from wgtmac/win_tz_test. Authored-by: Gang Wu <ustcwg@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### Rationale for this change When /usr/share/zoneinfo is unavailable and TZDIR env is unset, creating C++ ORC reader will crash on Windows. We need to eagerly check this and prevent followup crash. ### What changes are included in this PR? Eagerly check TZDB availability before creating ORC reader/writer. ### Are these changes tested? Yes, added a test case to make sure the check work as expected. ### Are there any user-facing changes? Users on Windows (or other cases when TZDB is not availble) will clearly see this error message instead of crash. * GitHub Issue: #36026 Authored-by: Gang Wu <ustcwg@gmail.com> Signed-off-by: Antoine Pitrou <antoine@python.org>
Issue resolved by pull request 40697 |
I tried to do this, but with recent changes that guard the check on |
### What changes were proposed in this pull request? Find tzdb without having to set `TZDIR` when in a conda-environment (where `tzdata` [has](https://conda-metadata-app.streamlit.app/?q=conda-forge%2Fnoarch%2Ftzdata-2024a-h0c530f3_0.conda) a uniform location of `$CONDA_PREFIX/share/zoneinfo` across all platforms). ### Why are the changes needed? This is due to issues in arrow (see apache/arrow#36026) that cannot really be fixed there, as it assumes that orc >=2.0 knows how to find the tzdb. Having to set `TZDIR` in all user environments is an intrusive change that should be avoided, and since the cost here is checking a single environment variable, it's hopefully not too onerous for consideration. ### How was this patch tested? CI here ### Was this patch authored or co-authored using generative AI tooling? No CC wgtmac See also: #1587 Closes #1882 from h-vetinari/tzdb. Authored-by: H. Vetinari <h.vetinari@gmx.com> Signed-off-by: Gang Wu <ustcwg@gmail.com>
### What changes were proposed in this pull request? Find tzdb without having to set `TZDIR` when in a conda-environment (where `tzdata` [has](https://conda-metadata-app.streamlit.app/?q=conda-forge%2Fnoarch%2Ftzdata-2024a-h0c530f3_0.conda) a uniform location of `$CONDA_PREFIX/share/zoneinfo` across all platforms). ### Why are the changes needed? This is due to issues in arrow (see apache/arrow#36026) that cannot really be fixed there, as it assumes that orc >=2.0 knows how to find the tzdb. Having to set `TZDIR` in all user environments is an intrusive change that should be avoided, and since the cost here is checking a single environment variable, it's hopefully not too onerous for consideration. ### How was this patch tested? CI here ### Was this patch authored or co-authored using generative AI tooling? No CC wgtmac See also: #1587 Closes #1882 from h-vetinari/tzdb. Authored-by: H. Vetinari <h.vetinari@gmx.com> Signed-off-by: Gang Wu <ustcwg@gmail.com> (cherry picked from commit e89ca33) Signed-off-by: Gang Wu <ustcwg@gmail.com>
JFYI: apache/orc#1882 was merged (thanks @wgtmac!), so we'll be able to fix this in conda-forge without injecting |
Describe the bug, including details regarding any error messages, version, and platform.
orc on windows doesn't exist for a long time yet in conda-forge, and we've only recently enabled it for the C++ portion of arrow. I tried to switch it on for pyarrow now as well in conda-forge/arrow-cpp-feedstock#1086, and the test suite segfaults as soon as it gets to
test_dataset.py::test_orc_format
stacktrace
Component(s)
Format, Packaging, Python
The text was updated successfully, but these errors were encountered: