Address a subset of ARROW-2676 PR comments #2

Merged
merged 3 commits into from
Jun 18, 2018

Conversation

@cpcloud cpcloud commented Jun 18, 2018

No description provided.

@kszucs kszucs merged commit 3d427c9 into kszucs:nightly Jun 18, 2018
kszucs pushed a commit that referenced this pull request Jan 7, 2019
I am contributing to [Arrow 3731](https://issues.apache.org/jira/browse/ARROW-3731). This PR has the minimum functionality to read parquet files into an arrow::Table, which can then be converted to a tibble. Multiple parquet files can be read inside `lapply`, and then concatenated at the end.

Steps to compile
1) Build arrow and parquet c++ projects
2) In R run `devtools::load_all()`

What I could use help with:
The biggest challenge for me is my lack of experience with pkg-config. The R library has a `configure` file which uses pkg-config to figure out which C++ libraries to link to. Currently, `configure` looks up the Arrow project and links to -larrow only. We need it to also link to -lparquet. I do not know how to modify pkg-config's metadata so that it links to both -larrow and -lparquet.

Author: Jeffrey Wong <jeffreyw@netflix.com>
Author: Romain Francois <romain@purrple.cat>
Author: jeffwong-nflx <jeffreyw@netflix.com>

Closes apache#3230 from jeffwong-nflx/master and squashes the following commits:

c67fa3d <jeffwong-nflx> Merge pull request #3 from jeffwong-nflx/cleanup
1df3026 <Jeffrey Wong> don't hard code -larrow and -lparquet
8ccaa51 <Jeffrey Wong> cleanup
75ba5c9 <Jeffrey Wong> add contributor
56adad2 <jeffwong-nflx> Merge pull request #2 from romainfrancois/3731/parquet-2
7d6e64d <Romain Francois> read_parquet() only reading one parquet file, and gains a `as_tibble` argument
e936b44 <Romain Francois> need parquet on travis too
ff260c5 <Romain Francois> header was too commented, renamed to parquet.cpp
9e1897f <Romain Francois> styling etc ...
456c5d2 <Jeffrey Wong> read parquet files
22d89dd <Jeffrey Wong> hardcode -larrow and -lparquet
kszucs pushed a commit that referenced this pull request Jan 30, 2019
https://issues.apache.org/jira/browse/ARROW-3965

This creates an object which configures the BaseAllocator and Calendar used during the translation from a JDBC ResultSet to an Arrow vector.

Author: Mike Pigott <mpigott@gmail.com>
Author: Michael Pigott <mikepigott@users.noreply.github.com>

Closes apache#3133 from mikepigott/jdbc-to-arrow-config and squashes the following commits:

be95426 <Mike Pigott> ARROW-3965: JDBC-To-Arrow Config Builder javadocs.
d6c64a7 <Mike Pigott> ARROW-3965: JdbcToArrowConfigBuilder
d7ca982 <Mike Pigott> Merge branch 'master' into jdbc-to-arrow-config
789c8c8 <Michael Pigott> Merge pull request #4 from apache/master
e5b19ee <Michael Pigott> Merge pull request #3 from apache/master
3b17c29 <Michael Pigott> Merge pull request #2 from apache/master
5b1b364 <Mike Pigott> Merge branch 'master' into jdbc-to-arrow-config
881c6c8 <Michael Pigott> Merge pull request #1 from apache/master
bb3165b <Mike Pigott> Updating the function calls to use the JdbcToArrowConfig versions.
68c91e7 <Mike Pigott> Modifying the jdbcToArrowSchema and jdbcToArrowVectors methods to receive JdbcToArrowConfig objects.
8d6cf00 <Mike Pigott> Documentation for public static VectorSchemaRoot sqlToArrow(Connection connection, String query, JdbcToArrowConfig config)
4f1260c <Mike Pigott> Adding documentation for public static VectorSchemaRoot sqlToArrow(ResultSet resultSet, JdbcToArrowConfig config)
df632e3 <Mike Pigott> Updating the SQL tests to include JdbcToArrowConfig versions.
b270044 <Mike Pigott> Updated validaton & documentation, and unit tests for the new JdbcToArrowConfig.
da77cbe <Mike Pigott> Creating a configuration class for the JDBC-to-Arrow converter.
kszucs pushed a commit that referenced this pull request Feb 13, 2019
https://issues.apache.org/jira/browse/ARROW-3923

Hello!  I was reading through the JDBC source code and I noticed that a java.util.Calendar was required for creating an Arrow Schema and Arrow Vectors from a JDBC ResultSet, when none is required.

This change makes the Calendar optional.

Unit Tests:
The existing SureFire plugin configuration uses a UTC calendar for the database, which is the default Calendar in the existing code.  Therefore, no changes to the unit tests are required to provide adequate coverage for the change.

Author: Michael Pigott <mikepigott@users.noreply.github.com>
Author: Mike Pigott <mpigott@gmail.com>

Closes apache#3066 from mikepigott/jdbc-timestamp-no-calendar and squashes the following commits:

4d95da0 <Mike Pigott> ARROW-3923: Supporting a null Calendar in the config, and reverting the breaking change.
cd9a230 <Mike Pigott> Merge branch 'master' into jdbc-timestamp-no-calendar
509a1cc <Michael Pigott> Merge pull request #5 from apache/master
789c8c8 <Michael Pigott> Merge pull request #4 from apache/master
e5b19ee <Michael Pigott> Merge pull request #3 from apache/master
3b17c29 <Michael Pigott> Merge pull request #2 from apache/master
881c6c8 <Michael Pigott> Merge pull request #1 from apache/master
089cff4 <Mike Pigott> Format fixes
a58a4a5 <Mike Pigott> Fixing calendar usage.
e12832a <Mike Pigott> Allowing for timestamps without a time zone.
kszucs pushed a commit that referenced this pull request Feb 13, 2019
https://issues.apache.org/jira/browse/ARROW-3966

This change includes apache#3133, and supports a new configuration item called "Include Metadata."  If true, metadata from the JDBC ResultSetMetaData object is pulled along to the Schema Field Metadata.  For now, this includes:
* Catalog Name
* Table Name
* Column Name
* Column Type Name

Author: Mike Pigott <mpigott@gmail.com>
Author: Michael Pigott <mikepigott@users.noreply.github.com>

Closes apache#3134 from mikepigott/jdbc-column-metadata and squashes the following commits:

02f2f34 <Mike Pigott> ARROW-3966: Picking up lost change to support null calendars.
7049c36 <Mike Pigott> Merge branch 'master' into jdbc-column-metadata
e9a9b2b <Michael Pigott> Merge pull request #6 from apache/master
65741a9 <Mike Pigott> ARROW-3966: Code review feedback
cc6cc88 <Mike Pigott> ARROW-3966: Using a 1:N loop instead of a 0:N-1 loop for fewer index offsets in code.
cfb2ba6 <Mike Pigott> ARROW-3966: Using a helper method for building a UTC calendar with root locale.
2928513 <Mike Pigott> ARROW-3966: Moving the metadata flag assignment into the builder.
69022c2 <Mike Pigott> ARROW-3966: Fixing merge.
4a6de86 <Mike Pigott> Merge branch 'master' into jdbc-column-metadata
509a1cc <Michael Pigott> Merge pull request #5 from apache/master
789c8c8 <Michael Pigott> Merge pull request #4 from apache/master
e5b19ee <Michael Pigott> Merge pull request #3 from apache/master
3b17c29 <Michael Pigott> Merge pull request #2 from apache/master
d847ebc <Mike Pigott> Fixing file location
1ceac9e <Mike Pigott> Merge branch 'master' into jdbc-column-metadata
881c6c8 <Michael Pigott> Merge pull request #1 from apache/master
03091a8 <Mike Pigott> Unit tests for including result set metadata.
72d64cc <Mike Pigott> Affirming the field metadata is empty when the configuration excludes field metadata.
7b4527c <Mike Pigott> Test for the include-metadata flag in the configuration.
7e9ce37 <Mike Pigott> Merge branch 'jdbc-to-arrow-config' into jdbc-column-metadata
bb3165b <Mike Pigott> Updating the function calls to use the JdbcToArrowConfig versions.
a6fb1be <Mike Pigott> Fixing function call
5bfd6a2 <Mike Pigott> Merge branch 'jdbc-to-arrow-config' into jdbc-column-metadata
68c91e7 <Mike Pigott> Modifying the jdbcToArrowSchema and jdbcToArrowVectors methods to receive JdbcToArrowConfig objects.
b5b0cb1 <Mike Pigott> Merge branch 'jdbc-to-arrow-config' into jdbc-column-metadata
8d6cf00 <Mike Pigott> Documentation for public static VectorSchemaRoot sqlToArrow(Connection connection, String query, JdbcToArrowConfig config)
4f1260c <Mike Pigott> Adding documentation for public static VectorSchemaRoot sqlToArrow(ResultSet resultSet, JdbcToArrowConfig config)
e34a9e7 <Mike Pigott> Fixing formatting.
fe097c8 <Mike Pigott> Merge branch 'jdbc-to-arrow-config' into jdbc-column-metadata
df632e3 <Mike Pigott> Updating the SQL tests to include JdbcToArrowConfig versions.
b270044 <Mike Pigott> Updated validaton & documentation, and unit tests for the new JdbcToArrowConfig.
da77cbe <Mike Pigott> Creating a configuration class for the JDBC-to-Arrow converter.
a78c770 <Mike Pigott> Updating Javadocs.
523387f <Mike Pigott> Updating the API to support an optional 'includeMetadata' field.
5af1b5b <Mike Pigott> Separating out the field-type creation from the field creation.
kszucs pushed a commit that referenced this pull request Mar 18, 2019
I'm sure I'll need some guidance on this one from @sunchao or @liurenjie1024 but I am keen to get parquet support added for primitive types so that I can actually use DataFusion and Arrow in production at some point.

Author: Andy Grove <andygrove73@gmail.com>
Author: Neville Dipale <nevilledips@gmail.com>
Author: Andy Grove <andygrove@users.noreply.github.com>

Closes apache#3851 from andygrove/ARROW-4466 and squashes the following commits:

3158529 <Andy Grove> add test for reading small batches
549c829 <Andy Grove> Remove hard-coded batch size, fix nits
8d2df06 <Andy Grove> move schema projection function from arrow into datafusion
204db83 <Andy Grove> fix timestamp nano issue
73aa934 <Andy Grove> Remove println from test
25d34ac <Andy Grove> Make INT32/64/96 handling consistent with C++ implementation
9b1308f <Andy Grove> clean up handling of INT96 and DATE/TIME/TIMESTAMP types in schema converter
1ec815b <Andy Grove> Clean up imports
023dc25 <Andy Grove> Merge pull request #2 from nevi-me/ARROW-4466
02b2ed3 <Neville Dipale> fix int96 conversion to read timestamps correctly
2aeea24 <Andy Grove> remove println from tests
9d3047a <Andy Grove> code cleanup
639e13e <Andy Grove> null handling for int96
1503855 <Andy Grove> handle nulls for binary data
80cf303 <Andy Grove> add date support
5a3368c <Andy Grove> Remove unnecessary slice, fix null handling
306d07a <Neville Dipale> fmt
3c711a5 <Neville Dipale> immediately allocate vec
e6cbbaa <Neville Dipale> replace read_column! macro with generic
607a29f <Andy Grove> return result if there are null values
e8aa784 <Andy Grove> revert temp debug change to error messages
6457c36 <Andy Grove> use parquet::reader::schema::parquet_to_arrow_schema
c56510e <Andy Grove> projection takes slice instead of vec
7e1a98f <Andy Grove> remove println and unwrap
dddb7d7 <Andy Grove> update to use partition-aware changes from master
157512e <Andy Grove> Remove invalid TODO comment
debb2fb <Andy Grove> code cleanup
6c3b7e2 <Andy Grove> add support for all primitive parquet types
b4981ed <Andy Grove> implement more parquet column types and tests
5ce3086 <Andy Grove> revert to columnar reads
c3f71d7 <Andy Grove> add integration test
aea9f8a <Andy Grove> convert to use row iter
f46e6f7 <Andy Grove> save
eaddafb <Andy Grove> save
322fc87 <Andy Grove> add test for reading strings from parquet
3a412b1 <Andy Grove> first parquet test passes
ff3e5b7 <Andy Grove> test
10710a2 <Andy Grove> Parquet datasource
kszucs pushed a commit that referenced this pull request Aug 28, 2019
This updates the language in `install_arrow()` to follow the README revision that will land in https://github.com/apache/arrow/pull/4948/files#diff-563b2cb2c8c2d51b2ff6b177e2d84286R33.

The [Jira ticket](https://issues.apache.org/jira/browse/ARROW-6142) requested three things; this is `#2` in the list. On `#1`, I defer to the C++ installation docs, which are already included in the install_arrow message, rather than duplicating content here. `#3` is out of scope.

Closes apache#5027 from nealrichardson/no-ppa and squashes the following commits:

80b142e <Neal Richardson> s/arrow/Arrow/
44c9659 <Neal Richardson> Tweak language again
36cfe28 <Neal Richardson> Further linux install revisions
79bd7e0 <Neal Richardson> One more PPurge
63f75bd <Neal Richardson> Revise install_arrow instructions for Linux

Authored-by: Neal Richardson <neal.p.richardson@gmail.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
kszucs pushed a commit that referenced this pull request Feb 24, 2020
…comments.

The reset method allows the data structures to be re-used so they don't have to be allocated over and over again.
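
The idea behind a reset method can be sketched as follows. The commit itself is Go and its code is not shown here, so this is only an analogous C++ illustration: `ReusableBuffer` and `CountReallocations` are hypothetical names, and the sketch assumes (as mainstream standard libraries do) that `std::vector::clear()` keeps the allocated capacity.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical buffer type for illustration; not an Arrow type.
class ReusableBuffer {
 public:
  void Append(std::int64_t value) { values_.push_back(value); }

  // Drop the contents but keep the allocated storage for the next batch.
  void Reset() { values_.clear(); }

  std::size_t size() const { return values_.size(); }
  std::size_t capacity() const { return values_.capacity(); }

 private:
  std::vector<std::int64_t> values_;
};

// Fill the buffer `rounds` times, resetting before each round, and return how
// many times the underlying storage had to grow.
inline int CountReallocations(ReusableBuffer& buf, int rounds, int per_round) {
  assert(rounds >= 0 && per_round >= 0);
  int reallocations = 0;
  for (int r = 0; r < rounds; ++r) {
    buf.Reset();
    for (int i = 0; i < per_round; ++i) {
      const std::size_t before = buf.capacity();
      buf.Append(i);
      if (buf.capacity() != before) ++reallocations;
    }
  }
  return reallocations;
}
```

Only the first fill should pay for growing the storage; later rounds that append the same number of values reuse it.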

Closes apache#6430 from richardartoul/ra/merge-upstream and squashes the following commits:

5a08281 <Richard Artoul> Add license to test file
d76be05 <Richard Artoul> Add test for data reset
d102b1f <Richard Artoul> Add tests
d3e6e67 <Richard Artoul> cleanup comments
c8525ae <Richard Artoul> Add Reset method to int array (#5)
489ca25 <Richard Artoul> Fix array.setData() to retain before release (#4)
88cd05f <Richard Artoul> Add reset method to Data (#3)
6d1b277 <Richard Artoul> Add Reset() method to String array (#2)
dca2303 <Richard Artoul> Add Reset method to buffer and cleanup comments (#1)

Lead-authored-by: Richard Artoul <richard.artoul@datadoghq.com>
Co-authored-by: Richard Artoul <richardartoul@gmail.com>
Signed-off-by: Sebastien Binet <binet@cern.ch>
kszucs pushed a commit that referenced this pull request May 11, 2020
This PR enables tests for `ARROW_COMPUTE`, `ARROW_DATASET`, `ARROW_FILESYSTEM`, `ARROW_HDFS`, `ARROW_ORC`, and `ARROW_IPC` (default on). apache#7131 enabled a minimal set of tests as a starting point.

I confirmed that these tests pass locally with the current master. In the current TravisCI environment, we cannot see this result due to a lot of error messages in `arrow-utility-test`.

```
$ git log | head -1
commit ed5f534
% ctest
...
      Start  1: arrow-array-test
 1/51 Test  #1: arrow-array-test .....................   Passed    4.62 sec
      Start  2: arrow-buffer-test
 2/51 Test  #2: arrow-buffer-test ....................   Passed    0.14 sec
      Start  3: arrow-extension-type-test
 3/51 Test  #3: arrow-extension-type-test ............   Passed    0.12 sec
      Start  4: arrow-misc-test
 4/51 Test  #4: arrow-misc-test ......................   Passed    0.14 sec
      Start  5: arrow-public-api-test
 5/51 Test  #5: arrow-public-api-test ................   Passed    0.12 sec
      Start  6: arrow-scalar-test
 6/51 Test  #6: arrow-scalar-test ....................   Passed    0.13 sec
      Start  7: arrow-type-test
 7/51 Test  #7: arrow-type-test ......................   Passed    0.14 sec
      Start  8: arrow-table-test
 8/51 Test  #8: arrow-table-test .....................   Passed    0.13 sec
      Start  9: arrow-tensor-test
 9/51 Test  #9: arrow-tensor-test ....................   Passed    0.13 sec
      Start 10: arrow-sparse-tensor-test
10/51 Test #10: arrow-sparse-tensor-test .............   Passed    0.16 sec
      Start 11: arrow-stl-test
11/51 Test #11: arrow-stl-test .......................   Passed    0.12 sec
      Start 12: arrow-concatenate-test
12/51 Test #12: arrow-concatenate-test ...............   Passed    0.53 sec
      Start 13: arrow-diff-test
13/51 Test #13: arrow-diff-test ......................   Passed    1.45 sec
      Start 14: arrow-c-bridge-test
14/51 Test #14: arrow-c-bridge-test ..................   Passed    0.18 sec
      Start 15: arrow-io-buffered-test
15/51 Test #15: arrow-io-buffered-test ...............   Passed    0.20 sec
      Start 16: arrow-io-compressed-test
16/51 Test #16: arrow-io-compressed-test .............   Passed    3.48 sec
      Start 17: arrow-io-file-test
17/51 Test #17: arrow-io-file-test ...................   Passed    0.74 sec
      Start 18: arrow-io-hdfs-test
18/51 Test #18: arrow-io-hdfs-test ...................   Passed    0.12 sec
      Start 19: arrow-io-memory-test
19/51 Test #19: arrow-io-memory-test .................   Passed    2.77 sec
      Start 20: arrow-utility-test
20/51 Test #20: arrow-utility-test ...................***Failed    5.65 sec
      Start 21: arrow-threading-utility-test
21/51 Test #21: arrow-threading-utility-test .........   Passed    1.34 sec
      Start 22: arrow-compute-compute-test
22/51 Test #22: arrow-compute-compute-test ...........   Passed    0.13 sec
      Start 23: arrow-compute-boolean-test
23/51 Test #23: arrow-compute-boolean-test ...........   Passed    0.15 sec
      Start 24: arrow-compute-cast-test
24/51 Test #24: arrow-compute-cast-test ..............   Passed    0.22 sec
      Start 25: arrow-compute-hash-test
25/51 Test #25: arrow-compute-hash-test ..............   Passed    2.61 sec
      Start 26: arrow-compute-isin-test
26/51 Test #26: arrow-compute-isin-test ..............   Passed    0.81 sec
      Start 27: arrow-compute-match-test
27/51 Test #27: arrow-compute-match-test .............   Passed    0.40 sec
      Start 28: arrow-compute-sort-to-indices-test
28/51 Test #28: arrow-compute-sort-to-indices-test ...   Passed    3.33 sec
      Start 29: arrow-compute-nth-to-indices-test
29/51 Test #29: arrow-compute-nth-to-indices-test ....   Passed    1.51 sec
      Start 30: arrow-compute-util-internal-test
30/51 Test #30: arrow-compute-util-internal-test .....   Passed    0.13 sec
      Start 31: arrow-compute-add-test
31/51 Test #31: arrow-compute-add-test ...............   Passed    0.12 sec
      Start 32: arrow-compute-aggregate-test
32/51 Test #32: arrow-compute-aggregate-test .........   Passed   14.70 sec
      Start 33: arrow-compute-compare-test
33/51 Test #33: arrow-compute-compare-test ...........   Passed    7.96 sec
      Start 34: arrow-compute-take-test
34/51 Test #34: arrow-compute-take-test ..............   Passed    4.80 sec
      Start 35: arrow-compute-filter-test
35/51 Test #35: arrow-compute-filter-test ............   Passed    8.23 sec
      Start 36: arrow-dataset-dataset-test
36/51 Test #36: arrow-dataset-dataset-test ...........   Passed    0.25 sec
      Start 37: arrow-dataset-discovery-test
37/51 Test #37: arrow-dataset-discovery-test .........   Passed    0.13 sec
      Start 38: arrow-dataset-file-ipc-test
38/51 Test #38: arrow-dataset-file-ipc-test ..........   Passed    0.21 sec
      Start 39: arrow-dataset-file-test
39/51 Test #39: arrow-dataset-file-test ..............   Passed    0.12 sec
      Start 40: arrow-dataset-filter-test
40/51 Test #40: arrow-dataset-filter-test ............   Passed    0.16 sec
      Start 41: arrow-dataset-partition-test
41/51 Test #41: arrow-dataset-partition-test .........   Passed    0.13 sec
      Start 42: arrow-dataset-scanner-test
42/51 Test #42: arrow-dataset-scanner-test ...........   Passed    0.20 sec
      Start 43: arrow-filesystem-test
43/51 Test #43: arrow-filesystem-test ................   Passed    1.62 sec
      Start 44: arrow-hdfs-test
44/51 Test #44: arrow-hdfs-test ......................   Passed    0.13 sec
      Start 45: arrow-feather-test
45/51 Test #45: arrow-feather-test ...................   Passed    0.91 sec
      Start 46: arrow-ipc-read-write-test
46/51 Test #46: arrow-ipc-read-write-test ............   Passed    5.77 sec
      Start 47: arrow-ipc-json-simple-test
47/51 Test #47: arrow-ipc-json-simple-test ...........   Passed    0.16 sec
      Start 48: arrow-ipc-json-test
48/51 Test #48: arrow-ipc-json-test ..................   Passed    0.27 sec
      Start 49: arrow-json-integration-test
49/51 Test #49: arrow-json-integration-test ..........   Passed    0.13 sec
      Start 50: arrow-json-test
50/51 Test #50: arrow-json-test ......................   Passed    0.26 sec
      Start 51: arrow-orc-adapter-test
51/51 Test #51: arrow-orc-adapter-test ...............   Passed    1.92 sec

98% tests passed, 1 tests failed out of 51

Label Time Summary:
arrow-tests      =  27.38 sec (27 tests)
arrow_compute    =  45.11 sec (14 tests)
arrow_dataset    =   1.21 sec (7 tests)
arrow_ipc        =   6.20 sec (3 tests)
unittest         =  79.91 sec (51 tests)

Total Test time (real) =  79.99 sec

The following tests FAILED:
	 20 - arrow-utility-test (Failed)
Errors while running CTest
```

Closes apache#7142 from kiszk/ARROW-8754

Authored-by: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
kszucs pushed a commit that referenced this pull request May 11, 2020
…lure on big-endian platforms

This PR reads element data through an endian-independent FlatBuffers API instead of through a raw pointer. This fixes the failure of TestPlasmaSerialization.DeleteReply in plasma-serialization-tests.
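
The difference between the two access styles can be sketched without the FlatBuffers library. FlatBuffers stores scalars in little-endian order, so a big-endian host that reinterprets the raw bytes as native integers gets the wrong value. The helper names below (`LoadLittleEndian32`, `GetElement`) are hypothetical and only illustrate a byte-order-independent read; they are not the code from this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Decode a little-endian 32-bit value byte by byte. The result does not
// depend on the byte order of the host.
inline std::uint32_t LoadLittleEndian32(const unsigned char* bytes) {
  assert(bytes != nullptr);
  return static_cast<std::uint32_t>(bytes[0]) |
         (static_cast<std::uint32_t>(bytes[1]) << 8) |
         (static_cast<std::uint32_t>(bytes[2]) << 16) |
         (static_cast<std::uint32_t>(bytes[3]) << 24);
}

// Hypothetical accessor: read element `index` from a packed array of
// little-endian 32-bit values, instead of casting `data` to uint32_t*.
inline std::uint32_t GetElement(const unsigned char* data, std::size_t index) {
  return LoadLittleEndian32(data + index * sizeof(std::uint32_t));
}
```

A pointer cast such as `reinterpret_cast<const std::uint32_t*>(data)[index]` happens to give the same result on little-endian hosts, which is why a bug of this kind only shows up on big-endian platforms.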

Without this PR
```
1: [==========] Running 14 tests from 1 test case.
1: [----------] Global test environment set-up.
1: [----------] 14 tests from TestPlasmaSerialization
1: [ RUN      ] TestPlasmaSerialization.CreateRequest
1: /home/ishizaki/Arrow/arrow/cpp/src/plasma/test/serialization_tests.cc:87: file path: '/tmp/ser-test-kk8t88p9/fileXXXXXX'
1: [       OK ] TestPlasmaSerialization.CreateRequest (2 ms)
1: [ RUN      ] TestPlasmaSerialization.CreateReply
1: /home/ishizaki/Arrow/arrow/cpp/src/plasma/test/serialization_tests.cc:87: file path: '/tmp/ser-test-97gspx5v/fileXXXXXX'
1: [       OK ] TestPlasmaSerialization.CreateReply (0 ms)
1: [ RUN      ] TestPlasmaSerialization.SealRequest
1: /home/ishizaki/Arrow/arrow/cpp/src/plasma/test/serialization_tests.cc:87: file path: '/tmp/ser-test-dkksx76p/fileXXXXXX'
1: [       OK ] TestPlasmaSerialization.SealRequest (1 ms)
1: [ RUN      ] TestPlasmaSerialization.SealReply
1: /home/ishizaki/Arrow/arrow/cpp/src/plasma/test/serialization_tests.cc:87: file path: '/tmp/ser-test-oqbs9vm0/fileXXXXXX'
1: [       OK ] TestPlasmaSerialization.SealReply (0 ms)
1: [ RUN      ] TestPlasmaSerialization.GetRequest
1: /home/ishizaki/Arrow/arrow/cpp/src/plasma/test/serialization_tests.cc:87: file path: '/tmp/ser-test-d7q6h5q4/fileXXXXXX'
1: [       OK ] TestPlasmaSerialization.GetRequest (1 ms)
1: [ RUN      ] TestPlasmaSerialization.GetReply
1: /home/ishizaki/Arrow/arrow/cpp/src/plasma/test/serialization_tests.cc:87: file path: '/tmp/ser-test-sxsncs72/fileXXXXXX'
1: [       OK ] TestPlasmaSerialization.GetReply (1 ms)
1: [ RUN      ] TestPlasmaSerialization.ReleaseRequest
1: /home/ishizaki/Arrow/arrow/cpp/src/plasma/test/serialization_tests.cc:87: file path: '/tmp/ser-test-njc3g3b5/fileXXXXXX'
1: [       OK ] TestPlasmaSerialization.ReleaseRequest (0 ms)
1: [ RUN      ] TestPlasmaSerialization.ReleaseReply
1: /home/ishizaki/Arrow/arrow/cpp/src/plasma/test/serialization_tests.cc:87: file path: '/tmp/ser-test-917ybxmo/fileXXXXXX'
1: [       OK ] TestPlasmaSerialization.ReleaseReply (1 ms)
1: [ RUN      ] TestPlasmaSerialization.DeleteRequest
1: /home/ishizaki/Arrow/arrow/cpp/src/plasma/test/serialization_tests.cc:87: file path: '/tmp/ser-test-1kwauefv/fileXXXXXX'
1: [       OK ] TestPlasmaSerialization.DeleteRequest (0 ms)
1: [ RUN      ] TestPlasmaSerialization.DeleteReply
1: /home/ishizaki/Arrow/arrow/cpp/src/plasma/test/serialization_tests.cc:87: file path: '/tmp/ser-test-4ftq28pq/fileXXXXXX'
1: /home/ishizaki/Arrow/arrow/cpp/src/plasma/test/serialization_tests.cc:271: Failure
1: Value of: error_vec[0] == PlasmaError::ObjectExists
1:   Actual: false
1: Expected: true
1: [  FAILED  ] TestPlasmaSerialization.DeleteReply (1 ms)
1: [ RUN      ] TestPlasmaSerialization.EvictRequest
1: /home/ishizaki/Arrow/arrow/cpp/src/plasma/test/serialization_tests.cc:87: file path: '/tmp/ser-test-vl97870w/fileXXXXXX'
1: [       OK ] TestPlasmaSerialization.EvictRequest (0 ms)
1: [ RUN      ] TestPlasmaSerialization.EvictReply
1: /home/ishizaki/Arrow/arrow/cpp/src/plasma/test/serialization_tests.cc:87: file path: '/tmp/ser-test-3am9a6rv/fileXXXXXX'
1: [       OK ] TestPlasmaSerialization.EvictReply (1 ms)
1: [ RUN      ] TestPlasmaSerialization.DataRequest
1: /home/ishizaki/Arrow/arrow/cpp/src/plasma/test/serialization_tests.cc:87: file path: '/tmp/ser-test-plye5tmm/fileXXXXXX'
1: [       OK ] TestPlasmaSerialization.DataRequest (0 ms)
1: [ RUN      ] TestPlasmaSerialization.DataReply
1: /home/ishizaki/Arrow/arrow/cpp/src/plasma/test/serialization_tests.cc:87: file path: '/tmp/ser-test-mbu6lqsq/fileXXXXXX'
1: [       OK ] TestPlasmaSerialization.DataReply (1 ms)
1: [----------] 14 tests from TestPlasmaSerialization (9 ms total)
1:
1: [----------] Global test environment tear-down
1: [==========] 14 tests from 1 test case ran. (9 ms total)
1: [  PASSED  ] 13 tests.
1: [  FAILED  ] 1 test, listed below:
1: [  FAILED  ] TestPlasmaSerialization.DeleteReply
1:
1:  1 FAILED TEST
1: /home/ishizaki/Arrow/arrow/cpp/src/plasma
1/3 Test #1: plasma-serialization-tests .......***Failed    0.27 sec
...
3/3 Test #3: plasma-external-store-tests ......   Passed    0.46 sec
```

With this PR
```
$ ctest
Test project /home/ishizaki/Arrow/arrow/cpp/src/plasma
    Start 1: plasma-serialization-tests
1/3 Test #1: plasma-serialization-tests .......   Passed    0.26 sec
    Start 2: plasma-client-tests
2/3 Test #2: plasma-client-tests ..............   Passed   14.99 sec
    Start 3: plasma-external-store-tests
3/3 Test #3: plasma-external-store-tests ......   Passed    0.49 sec

100% tests passed, 0 tests failed out of 3

Label Time Summary:
plasma-tests    =  15.74 sec (3 tests)
unittest        =  15.74 sec (3 tests)

Total Test time (real) =  15.74 sec
```

Closes apache#7148 from kiszk/ARROW-8759

Authored-by: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
kszucs pushed a commit that referenced this pull request Apr 7, 2021
From a deadlocked run...

```
#0  0x00007f8a5d48dccd in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f8a5d486f05 in pthread_mutex_lock () from /lib64/libpthread.so.0
#2  0x00007f8a566e7e89 in arrow::internal::FnOnce<void ()>::FnImpl<arrow::Future<Aws::Utils::Outcome<Aws::S3::Model::ListObjectsV2Result, Aws::S3::S3Error> >::Callback<arrow::fs::(anonymous namespace)::TreeWalker::ListObjectsV2Handler> >::invoke() () from /arrow/r/check/arrow.Rcheck/arrow/libs/arrow.so
#3  0x00007f8a5650efa0 in arrow::FutureImpl::AddCallback(arrow::internal::FnOnce<void ()>) () from /arrow/r/check/arrow.Rcheck/arrow/libs/arrow.so
#4  0x00007f8a566e67a9 in arrow::fs::(anonymous namespace)::TreeWalker::ListObjectsV2Handler::SpawnListObjectsV2() () from /arrow/r/check/arrow.Rcheck/arrow/libs/arrow.so
#5  0x00007f8a566e723f in arrow::fs::(anonymous namespace)::TreeWalker::WalkChild(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, int) () from /arrow/r/check/arrow.Rcheck/arrow/libs/arrow.so
#6  0x00007f8a566e827d in arrow::internal::FnOnce<void ()>::FnImpl<arrow::Future<Aws::Utils::Outcome<Aws::S3::Model::ListObjectsV2Result, Aws::S3::S3Error> >::Callback<arrow::fs::(anonymous namespace)::TreeWalker::ListObjectsV2Handler> >::invoke() () from /arrow/r/check/arrow.Rcheck/arrow/libs/arrow.so
#7  0x00007f8a5650efa0 in arrow::FutureImpl::AddCallback(arrow::internal::FnOnce<void ()>) () from /arrow/r/check/arrow.Rcheck/arrow/libs/arrow.so
#8  0x00007f8a566e67a9 in arrow::fs::(anonymous namespace)::TreeWalker::ListObjectsV2Handler::SpawnListObjectsV2() () from /arrow/r/check/arrow.Rcheck/arrow/libs/arrow.so
#9  0x00007f8a566e723f in arrow::fs::(anonymous namespace)::TreeWalker::WalkChild(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, int) () from /arrow/r/check/arrow.Rcheck/arrow/libs/arrow.so
#10 0x00007f8a566e74b1 in arrow::fs::(anonymous namespace)::TreeWalker::DoWalk() () from /arrow/r/check/arrow.Rcheck/arrow/libs/arrow.so
```

The callback `ListObjectsV2Handler` is being called recursively, and because the mutex is non-reentrant, the second attempt to lock it on the same thread deadlocks.

To fix it I got rid of the mutex on `TreeWalker` by using `arrow::util::internal::TaskGroup` instead of manually tracking the #/status of in-flight requests.
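
A minimal sketch of the approach, assuming a single-threaded walk where callbacks run inline: in-flight work is counted by a small group object using atomics, so no lock is held while a callback re-enters the walker. `MiniTaskGroup`, `WalkChild`, and `WalkTree` here are hypothetical stand-ins and do not reproduce Arrow's `TaskGroup` or the real `TreeWalker`.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for a task group: counts in-flight tasks with an
// atomic rather than guarding the count with a mutex.
class MiniTaskGroup {
 public:
  // Runs the task inline, mimicking a callback that fires immediately on an
  // already-finished future. The task may call Run() again recursively.
  void Run(const std::function<void()>& task) {
    in_flight_.fetch_add(1);
    task();
    in_flight_.fetch_sub(1);
  }

  int in_flight() const { return in_flight_.load(); }

 private:
  std::atomic<int> in_flight_{0};
};

using Tree = std::map<std::string, std::vector<std::string>>;

// Visit `path`, then recurse into its children through the task group.
// `visited` is not thread-safe; this sketch runs on one thread only.
inline void WalkChild(const Tree& tree, const std::string& path,
                      MiniTaskGroup& group, std::vector<std::string>& visited) {
  group.Run([&tree, &path, &group, &visited] {
    visited.push_back(path);
    auto it = tree.find(path);
    if (it == tree.end()) return;
    for (const auto& child : it->second) {
      WalkChild(tree, child, group, visited);
    }
  });
}

inline std::vector<std::string> WalkTree(const Tree& tree,
                                         const std::string& root) {
  MiniTaskGroup group;
  std::vector<std::string> visited;
  WalkChild(tree, root, group, visited);
  assert(group.in_flight() == 0);  // every nested task has returned
  return visited;
}
```

Because `Run` never holds a lock across `task()`, the recursive re-entry seen in the backtrace cannot block on itself.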

Closes apache#9842 from westonpace/bugfix/arrow-12040

Lead-authored-by: Weston Pace <weston.pace@gmail.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
kszucs pushed a commit that referenced this pull request Jun 18, 2021
Before change:

```
Direct leak of 65536 byte(s) in 1 object(s) allocated from:
    #0 0x522f09 in
    #1 0x7f28ae5826f4 in
    #2 0x7f28ae57fa5d in
    #3 0x7f28ae58cb0f in
    #4 0x7f28ae58bda0 in
    ...
```

After change:
```
Direct leak of 65536 byte(s) in 1 object(s) allocated from:
    #0 0x522f09 in posix_memalign (/build/cpp/debug/arrow-dataset-file-csv-test+0x522f09)
    #1 0x7f28ae5826f4 in arrow::(anonymous namespace)::SystemAllocator::AllocateAligned(long, unsigned char**) /arrow/cpp/src/arrow/memory_pool.cc:213:24
    #2 0x7f28ae57fa5d in arrow::BaseMemoryPoolImpl<arrow::(anonymous namespace)::SystemAllocator>::Allocate(long, unsigned char**) /arrow/cpp/src/arrow/memory_pool.cc:405:5
    #3 0x7f28ae58cb0f in arrow::PoolBuffer::Reserve(long) /arrow/cpp/src/arrow/memory_pool.cc:717:9
    #4 0x7f28ae58bda0 in arrow::PoolBuffer::Resize(long, bool) /arrow/cpp/src/arrow/memory_pool.cc:741:7
    ...
```

Closes apache#10498 from westonpace/feature/ARROW-13027--c-fix-asan-stack-traces-in-ci

Authored-by: Weston Pace <weston.pace@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
kszucs pushed a commit that referenced this pull request Dec 20, 2024
…n timezone (apache#45051)

### Rationale for this change

If the timezone database is present on the system, but does not contain a timezone referenced in an ORC file, the ORC reader will crash with an uncaught C++ exception.

This can happen for example on Ubuntu 24.04 where some timezone aliases have been moved from the main `tzdata` package to a `tzdata-legacy` package. If `tzdata-legacy` is not installed, trying to read an ORC file that references e.g. the "US/Pacific" timezone would crash.

Here is a backtrace excerpt:
```
#12 0x00007f1a3ce23a55 in std::terminate() () from /lib/x86_64-linux-gnu/libstdc++.so.6
#13 0x00007f1a3ce39391 in __cxa_throw () from /lib/x86_64-linux-gnu/libstdc++.so.6
#14 0x00007f1a3f4accc4 in orc::loadTZDB(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
   from /tmp/arrow-HEAD.ArqTs/venv-wheel-3.12-manylinux_2_17_x86_64.manylinux2014_x86_64/lib/python3.12/site-packages/pyarrow/libarrow.so.1900
#15 0x00007f1a3f4ad392 in std::call_once<orc::LazyTimezone::getImpl() const::{lambda()#1}>(std::once_flag&, orc::LazyTimezone::getImpl() const::{lambda()#1}&&)::{lambda()#2}::_FUN() () from /tmp/arrow-HEAD.ArqTs/venv-wheel-3.12-manylinux_2_17_x86_64.manylinux2014_x86_64/lib/python3.12/site-packages/pyarrow/libarrow.so.1900
#16 0x00007f1a4298bec3 in __pthread_once_slow (once_control=0xa5ca7c8, init_routine=0x7f1a3ce69420 <__once_proxy>) at ./nptl/pthread_once.c:116
#17 0x00007f1a3f4a9ad0 in orc::LazyTimezone::getEpoch() const ()
   from /tmp/arrow-HEAD.ArqTs/venv-wheel-3.12-manylinux_2_17_x86_64.manylinux2014_x86_64/lib/python3.12/site-packages/pyarrow/libarrow.so.1900
#18 0x00007f1a3f4e76b1 in orc::TimestampColumnReader::TimestampColumnReader(orc::Type const&, orc::StripeStreams&, bool) ()
   from /tmp/arrow-HEAD.ArqTs/venv-wheel-3.12-manylinux_2_17_x86_64.manylinux2014_x86_64/lib/python3.12/site-packages/pyarrow/libarrow.so.1900
#19 0x00007f1a3f4e84ad in orc::buildReader(orc::Type const&, orc::StripeStreams&, bool, bool, bool) ()
   from /tmp/arrow-HEAD.ArqTs/venv-wheel-3.12-manylinux_2_17_x86_64.manylinux2014_x86_64/lib/python3.12/site-packages/pyarrow/libarrow.so.1900
#20 0x00007f1a3f4e8dd7 in orc::StructColumnReader::StructColumnReader(orc::Type const&, orc::StripeStreams&, bool, bool) ()
   from /tmp/arrow-HEAD.ArqTs/venv-wheel-3.12-manylinux_2_17_x86_64.manylinux2014_x86_64/lib/python3.12/site-packages/pyarrow/libarrow.so.1900
#21 0x00007f1a3f4e8532 in orc::buildReader(orc::Type const&, orc::StripeStreams&, bool, bool, bool) ()
   from /tmp/arrow-HEAD.ArqTs/venv-wheel-3.12-manylinux_2_17_x86_64.manylinux2014_x86_64/lib/python3.12/site-packages/pyarrow/libarrow.so.1900
#22 0x00007f1a3f4925e9 in orc::RowReaderImpl::startNextStripe() ()
   from /tmp/arrow-HEAD.ArqTs/venv-wheel-3.12-manylinux_2_17_x86_64.manylinux2014_x86_64/lib/python3.12/site-packages/pyarrow/libarrow.so.1900
#23 0x00007f1a3f492c9d in orc::RowReaderImpl::next(orc::ColumnVectorBatch&) ()
   from /tmp/arrow-HEAD.ArqTs/venv-wheel-3.12-manylinux_2_17_x86_64.manylinux2014_x86_64/lib/python3.12/site-packages/pyarrow/libarrow.so.1900
#24 0x00007f1a3e6b251f in arrow::adapters::orc::ORCFileReader::Impl::ReadBatch(orc::RowReaderOptions const&, std::shared_ptr<arrow::Schema> const&, long) ()
   from /tmp/arrow-HEAD.ArqTs/venv-wheel-3.12-manylinux_2_17_x86_64.manylinux2014_x86_64/lib/python3.12/site-packages/pyarrow/libarrow.so.1900
```

### What changes are included in this PR?

Catch C++ exceptions when iterating ORC batches instead of letting them slip through.
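
As a rough sketch of the pattern (not the actual patch), a batch-reading step can be wrapped so that any exception becomes an error value the caller can inspect. `SimpleStatus` and `ReadBatchSafely` are hypothetical names used only for illustration; Arrow's own status type and the ORC adapter code are not reproduced here.

```cpp
#include <cassert>
#include <functional>
#include <stdexcept>
#include <string>

// Hypothetical status type for this sketch.
struct SimpleStatus {
  bool ok;
  std::string message;
};

// Run one batch-reading step and translate any C++ exception into a status
// instead of letting it propagate and terminate the process.
inline SimpleStatus ReadBatchSafely(
    const std::function<void()>& read_next_batch) {
  assert(static_cast<bool>(read_next_batch));
  try {
    read_next_batch();
    return {true, ""};
  } catch (const std::exception& e) {
    return {false, std::string("ORC read failed: ") + e.what()};
  } catch (...) {
    return {false, "ORC read failed: unknown exception"};
  }
}
```

With a wrapper like this, a missing timezone would surface as an error message to the caller rather than reaching `std::terminate()` as in the backtrace above.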

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No.
* GitHub Issue: apache#40633

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>