ARROW-68: some fixes for errors encountered on not fully setup systems #26

Closed

Conversation

emkornfield
Contributor

My bash skills are mainly based on Stack Overflow, so hopefully these changes seem reasonable.

@emkornfield emkornfield reopened this Mar 17, 2016
@emkornfield emkornfield deleted the emk_add_nice_errors_PR branch March 17, 2016 00:15
@wesm
Member

wesm commented Mar 17, 2016

Let me know when you want me to review something (Travis CI will verify the build)

@emkornfield
Contributor Author

Thanks. Opened up a separate pull request, sorry for the spam.


wesm added a commit to wesm/arrow that referenced this pull request Sep 2, 2018
… than scalar

Column scanning and record reconstruction are independent of the Parquet file format and depend, among other things, on the data structures where the reconstructed data will end up. This is a work in progress, but the basic idea is:

- APIs for reading a batch of repetition levels (`ReadRepetitionLevels`) or definition levels (`ReadDefinitionLevels`) into a preallocated `int16_t*` buffer
- APIs for reading arrays of decoded values into preallocated memory (`ReadValues`)

These methods can only read data within a particular data page. Once you exhaust the data available in the data page (`ReadValues` returns 0), you must call `ReadNewPage`, which returns `true` if there is more data available.
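
A minimal sketch of the page-by-page read loop this implies; the `ReadDefinitionLevels`/`ReadValues`/`ReadNewPage` names come from the description above, but the exact signatures and the templated reader type are assumptions for illustration, not the actual parquet-cpp interface:

```
// Hypothetical sketch of the batch read loop described above.
// ReaderType is assumed to expose ReadDefinitionLevels, ReadValues,
// and ReadNewPage with roughly these shapes.
#include <cstdint>
#include <vector>

template <typename ReaderType>
void ScanColumn(ReaderType* reader, int batch_size) {
  std::vector<int16_t> def_levels(batch_size);
  std::vector<int32_t> values(batch_size);
  for (;;) {
    // Read levels and values from the current data page into preallocated memory.
    int64_t levels_read = reader->ReadDefinitionLevels(batch_size, def_levels.data());
    int64_t values_read = reader->ReadValues(batch_size, values.data());
    if (values_read == 0 && levels_read == 0) {
      // The current page is exhausted; ReadNewPage returns true while
      // more data pages remain in the column chunk.
      if (!reader->ReadNewPage()) break;
      continue;
    }
    // ... hand off def_levels[0..levels_read) and values[0..values_read) ...
  }
}
```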

Separately, I added a simple `Scanner` class that emulates the scalar value iteration functionality that existed previously. I used this to reimplement the `DebugPrint` method in `parquet_scanner.cc`. This currently only works for flat data.
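
For the scalar-style iteration, a rough usage sketch; `HasNext` appears in the commit log below, while `NextValue` is a hypothetical accessor used only to show the shape of the loop, not the exact `Scanner` interface:

```
// Hypothetical sketch of Scanner-style scalar iteration over a flat
// column; the value accessor and null handling are placeholders.
#include <iostream>

template <typename ScannerType>
void DebugPrintColumn(ScannerType* scanner) {
  while (scanner->HasNext()) {
    bool is_null = false;
    auto value = scanner->NextValue(&is_null);  // hypothetical accessor
    if (is_null) {
      std::cout << "NULL" << std::endl;
    } else {
      std::cout << value << std::endl;
    }
  }
}
```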

I would like to keep the `ColumnReader` low level and primitive, concerned only with providing access to the raw data in a Parquet file as fast as possible. We can devise separate algorithms for inferring nested record structure by examining the arrays of decoded values and repetition/definition levels. The major benefit of separating raw data access from structure inference is that this can be pipelined with threads: one thread decompresses and decodes values and levels, and another thread can turn batches into a nested record- or column-oriented structure.
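
As a minimal illustration of that pipelining idea (not code from this change), one thread can push decoded batches into a shared queue while another assembles them; the `Batch` struct and both loop bodies are placeholders:

```
// Toy producer/consumer pipeline: a decoder thread produces batches of
// levels and values, an assembler thread consumes them.
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

struct Batch {
  std::vector<int16_t> def_levels;
  std::vector<int32_t> values;
};

int main() {
  std::queue<Batch> pipeline;
  std::mutex mu;
  std::condition_variable cv;
  bool done = false;

  std::thread decoder([&] {
    for (int32_t i = 0; i < 8; ++i) {  // stand-in for page decoding
      Batch b{{0, 1, 1}, {i, i + 1}};
      std::lock_guard<std::mutex> lock(mu);
      pipeline.push(std::move(b));
      cv.notify_one();
    }
    std::lock_guard<std::mutex> lock(mu);
    done = true;
    cv.notify_one();
  });

  std::thread assembler([&] {
    for (;;) {
      std::unique_lock<std::mutex> lock(mu);
      cv.wait(lock, [&] { return !pipeline.empty() || done; });
      if (pipeline.empty() && done) break;
      Batch b = std::move(pipeline.front());
      pipeline.pop();
      lock.unlock();
      // ... turn b into a record- or column-oriented structure ...
    }
  });

  decoder.join();
  assembler.join();
  return 0;
}
```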

Author: Wes McKinney <wes@cloudera.com>

Closes apache#26 from wesm/PARQUET-435 and squashes the following commits:

4bf5cd4 [Wes McKinney] Fix cpplint
852f4ec [Wes McKinney] Address review comments, also be sure to use Scanner::HasNext
7ea261e [Wes McKinney] Add TODO comment
4999719 [Wes McKinney] Make ColumnReader::ReadNewPage private and call HasNext() in ReadBatch
0d2e111 [Wes McKinney] Fix function description. Change #define to constexpr
111ef13 [Wes McKinney] Incorporate review comments and add some better comments
e16f7fd [Wes McKinney] Typo
ef52404 [Wes McKinney] Fix function doc
5e95cda [Wes McKinney] Configurable scanner batch size. Do not use printf in DebugPrint
1b4eca0 [Wes McKinney] New batch read API which reads levels and values in one shot
de4d6b6 [Wes McKinney] Move column_* files into parquet/column folder
aad4a86 [Wes McKinney] Finish refactoring scanner API with shared pointers
4506748 [Wes McKinney] Refactoring, do not have shared_from_this working yet
6489b15 [Wes McKinney] Batch level/value read interface on ColumnReader. Add Scanner class for flat columns. Add a couple smoke unit tests
wesm added a commit to wesm/arrow that referenced this pull request Sep 4, 2018
… than scalar
wesm added a commit to wesm/arrow that referenced this pull request Sep 6, 2018
… than scalar
wesm added a commit to wesm/arrow that referenced this pull request Sep 7, 2018
… than scalar
wesm added a commit to wesm/arrow that referenced this pull request Sep 8, 2018
… than scalar
kou pushed a commit that referenced this pull request May 10, 2020
This PR enables tests for `ARROW_COMPUTE`, `ARROW_DATASET`, `ARROW_FILESYSTEM`, `ARROW_HDFS`, `ARROW_ORC`, and `ARROW_IPC` (default on). #7131 enabled a minimal set of tests as a starting point.

I confirmed that these tests pass locally with the current master. In the current Travis CI environment, we cannot see this result because `arrow-utility-test` produces a lot of error messages.

```
$ git log | head -1
commit ed5f534
% ctest
...
      Start  1: arrow-array-test
 1/51 Test  #1: arrow-array-test .....................   Passed    4.62 sec
      Start  2: arrow-buffer-test
 2/51 Test  #2: arrow-buffer-test ....................   Passed    0.14 sec
      Start  3: arrow-extension-type-test
 3/51 Test  #3: arrow-extension-type-test ............   Passed    0.12 sec
      Start  4: arrow-misc-test
 4/51 Test  #4: arrow-misc-test ......................   Passed    0.14 sec
      Start  5: arrow-public-api-test
 5/51 Test  #5: arrow-public-api-test ................   Passed    0.12 sec
      Start  6: arrow-scalar-test
 6/51 Test  #6: arrow-scalar-test ....................   Passed    0.13 sec
      Start  7: arrow-type-test
 7/51 Test  #7: arrow-type-test ......................   Passed    0.14 sec
      Start  8: arrow-table-test
 8/51 Test  #8: arrow-table-test .....................   Passed    0.13 sec
      Start  9: arrow-tensor-test
 9/51 Test  #9: arrow-tensor-test ....................   Passed    0.13 sec
      Start 10: arrow-sparse-tensor-test
10/51 Test #10: arrow-sparse-tensor-test .............   Passed    0.16 sec
      Start 11: arrow-stl-test
11/51 Test #11: arrow-stl-test .......................   Passed    0.12 sec
      Start 12: arrow-concatenate-test
12/51 Test #12: arrow-concatenate-test ...............   Passed    0.53 sec
      Start 13: arrow-diff-test
13/51 Test #13: arrow-diff-test ......................   Passed    1.45 sec
      Start 14: arrow-c-bridge-test
14/51 Test #14: arrow-c-bridge-test ..................   Passed    0.18 sec
      Start 15: arrow-io-buffered-test
15/51 Test #15: arrow-io-buffered-test ...............   Passed    0.20 sec
      Start 16: arrow-io-compressed-test
16/51 Test #16: arrow-io-compressed-test .............   Passed    3.48 sec
      Start 17: arrow-io-file-test
17/51 Test #17: arrow-io-file-test ...................   Passed    0.74 sec
      Start 18: arrow-io-hdfs-test
18/51 Test #18: arrow-io-hdfs-test ...................   Passed    0.12 sec
      Start 19: arrow-io-memory-test
19/51 Test #19: arrow-io-memory-test .................   Passed    2.77 sec
      Start 20: arrow-utility-test
20/51 Test #20: arrow-utility-test ...................***Failed    5.65 sec
      Start 21: arrow-threading-utility-test
21/51 Test #21: arrow-threading-utility-test .........   Passed    1.34 sec
      Start 22: arrow-compute-compute-test
22/51 Test #22: arrow-compute-compute-test ...........   Passed    0.13 sec
      Start 23: arrow-compute-boolean-test
23/51 Test #23: arrow-compute-boolean-test ...........   Passed    0.15 sec
      Start 24: arrow-compute-cast-test
24/51 Test #24: arrow-compute-cast-test ..............   Passed    0.22 sec
      Start 25: arrow-compute-hash-test
25/51 Test #25: arrow-compute-hash-test ..............   Passed    2.61 sec
      Start 26: arrow-compute-isin-test
26/51 Test #26: arrow-compute-isin-test ..............   Passed    0.81 sec
      Start 27: arrow-compute-match-test
27/51 Test #27: arrow-compute-match-test .............   Passed    0.40 sec
      Start 28: arrow-compute-sort-to-indices-test
28/51 Test #28: arrow-compute-sort-to-indices-test ...   Passed    3.33 sec
      Start 29: arrow-compute-nth-to-indices-test
29/51 Test #29: arrow-compute-nth-to-indices-test ....   Passed    1.51 sec
      Start 30: arrow-compute-util-internal-test
30/51 Test #30: arrow-compute-util-internal-test .....   Passed    0.13 sec
      Start 31: arrow-compute-add-test
31/51 Test #31: arrow-compute-add-test ...............   Passed    0.12 sec
      Start 32: arrow-compute-aggregate-test
32/51 Test #32: arrow-compute-aggregate-test .........   Passed   14.70 sec
      Start 33: arrow-compute-compare-test
33/51 Test #33: arrow-compute-compare-test ...........   Passed    7.96 sec
      Start 34: arrow-compute-take-test
34/51 Test #34: arrow-compute-take-test ..............   Passed    4.80 sec
      Start 35: arrow-compute-filter-test
35/51 Test #35: arrow-compute-filter-test ............   Passed    8.23 sec
      Start 36: arrow-dataset-dataset-test
36/51 Test #36: arrow-dataset-dataset-test ...........   Passed    0.25 sec
      Start 37: arrow-dataset-discovery-test
37/51 Test #37: arrow-dataset-discovery-test .........   Passed    0.13 sec
      Start 38: arrow-dataset-file-ipc-test
38/51 Test #38: arrow-dataset-file-ipc-test ..........   Passed    0.21 sec
      Start 39: arrow-dataset-file-test
39/51 Test #39: arrow-dataset-file-test ..............   Passed    0.12 sec
      Start 40: arrow-dataset-filter-test
40/51 Test #40: arrow-dataset-filter-test ............   Passed    0.16 sec
      Start 41: arrow-dataset-partition-test
41/51 Test #41: arrow-dataset-partition-test .........   Passed    0.13 sec
      Start 42: arrow-dataset-scanner-test
42/51 Test #42: arrow-dataset-scanner-test ...........   Passed    0.20 sec
      Start 43: arrow-filesystem-test
43/51 Test #43: arrow-filesystem-test ................   Passed    1.62 sec
      Start 44: arrow-hdfs-test
44/51 Test #44: arrow-hdfs-test ......................   Passed    0.13 sec
      Start 45: arrow-feather-test
45/51 Test #45: arrow-feather-test ...................   Passed    0.91 sec
      Start 46: arrow-ipc-read-write-test
46/51 Test #46: arrow-ipc-read-write-test ............   Passed    5.77 sec
      Start 47: arrow-ipc-json-simple-test
47/51 Test #47: arrow-ipc-json-simple-test ...........   Passed    0.16 sec
      Start 48: arrow-ipc-json-test
48/51 Test #48: arrow-ipc-json-test ..................   Passed    0.27 sec
      Start 49: arrow-json-integration-test
49/51 Test #49: arrow-json-integration-test ..........   Passed    0.13 sec
      Start 50: arrow-json-test
50/51 Test #50: arrow-json-test ......................   Passed    0.26 sec
      Start 51: arrow-orc-adapter-test
51/51 Test #51: arrow-orc-adapter-test ...............   Passed    1.92 sec

98% tests passed, 1 tests failed out of 51

Label Time Summary:
arrow-tests      =  27.38 sec (27 tests)
arrow_compute    =  45.11 sec (14 tests)
arrow_dataset    =   1.21 sec (7 tests)
arrow_ipc        =   6.20 sec (3 tests)
unittest         =  79.91 sec (51 tests)

Total Test time (real) =  79.99 sec

The following tests FAILED:
	 20 - arrow-utility-test (Failed)
Errors while running CTest
```

Closes #7142 from kiszk/ARROW-8754

Authored-by: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
rui-mo added a commit to rui-mo/arrow-1 that referenced this pull request Aug 2, 2021
* add support for int32 type

* fix format of cast varchar
zhztheplayer pushed a commit to zhztheplayer/arrow-1 that referenced this pull request Feb 8, 2022
* add support for int32 type

* fix format of cast varchar
zhztheplayer pushed a commit to zhztheplayer/arrow-1 that referenced this pull request Mar 3, 2022
* add support for int32 type

* fix format of cast varchar
rui-mo added a commit to rui-mo/arrow-1 that referenced this pull request Mar 23, 2022
* add support for int32 type

* fix format of cast varchar
jayhomn-bitquill pushed a commit to Bit-Quill/arrow that referenced this pull request Aug 10, 2022
…iour-for-getObject-getString

[JAVA] [JDBC] Change IntervalAccessor getString format