ARROW-68: some fixes for errors encountered on not fully setup systems #26

Closed

Conversation

emkornfield
Contributor

My bash skills are mainly based on Stack Overflow, so hopefully these changes seem reasonable.

@emkornfield emkornfield reopened this Mar 17, 2016
@emkornfield emkornfield deleted the emk_add_nice_errors_PR branch March 17, 2016 00:15
@wesm
Member

wesm commented Mar 17, 2016

Let me know when you want me to review something (Travis CI will verify the build)

@emkornfield
Contributor Author

Thanks. Opened up a separate pull request, sorry for the spam.


wesm added a commit to wesm/arrow that referenced this pull request Sep 2, 2018
… than scalar

Column scanning and record reconstruction are independent of the Parquet file format and depend, among other things, on the data structures where the reconstructed data will end up. This is a work in progress, but the basic idea is:

- APIs for reading a batch of repetition levels (`ReadRepetitionLevels`) or definition levels (`ReadDefinitionLevels`) into a preallocated `int16_t*` buffer
- APIs for reading arrays of decoded values into preallocated memory (`ReadValues`)

These methods can only read data within a particular data page. Once you exhaust the data available in the data page (`ReadValues` returns 0), you must call `ReadNewPage`, which returns `true` if there is more data available.
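
A minimal sketch of the page-by-page read loop this implies; the `ReadDefinitionLevels`/`ReadValues`/`ReadNewPage` names come from the description above, but the exact signatures and the templated reader type are assumptions for illustration, not the actual parquet-cpp interface:

```
// Hypothetical sketch of the batch read loop described above.
// ReaderType is assumed to expose ReadDefinitionLevels, ReadValues,
// and ReadNewPage with roughly these shapes.
#include <cstdint>
#include <vector>

template <typename ReaderType>
void ScanColumn(ReaderType* reader, int batch_size) {
  std::vector<int16_t> def_levels(batch_size);
  std::vector<int32_t> values(batch_size);
  for (;;) {
    // Read levels and values from the current data page into preallocated memory.
    int64_t levels_read = reader->ReadDefinitionLevels(batch_size, def_levels.data());
    int64_t values_read = reader->ReadValues(batch_size, values.data());
    if (values_read == 0 && levels_read == 0) {
      // The current page is exhausted; ReadNewPage returns true while
      // more data pages remain in the column chunk.
      if (!reader->ReadNewPage()) break;
      continue;
    }
    // ... hand off def_levels[0..levels_read) and values[0..values_read) ...
  }
}
```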

Separately, I added a simple `Scanner` class that emulates the scalar value iteration functionality that existed previously. I used this to reimplement the `DebugPrint` method in `parquet_scanner.cc`. This currently only works for flat data.
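
For the scalar-style iteration, a rough usage sketch; `HasNext` appears in the commit log below, while `NextValue` is a hypothetical accessor used only to show the shape of the loop, not the exact `Scanner` interface:

```
// Hypothetical sketch of Scanner-style scalar iteration over a flat
// column; the value accessor and null handling are placeholders.
#include <iostream>

template <typename ScannerType>
void DebugPrintColumn(ScannerType* scanner) {
  while (scanner->HasNext()) {
    bool is_null = false;
    auto value = scanner->NextValue(&is_null);  // hypothetical accessor
    if (is_null) {
      std::cout << "NULL" << std::endl;
    } else {
      std::cout << value << std::endl;
    }
  }
}
```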

I would like to keep the `ColumnReader` low level and primitive, concerned only with providing access to the raw data in a Parquet file as fast as possible. We can devise separate algorithms for inferring nested record structure by examining the arrays of decoded values and repetition/definition levels. The major benefit of separating raw data access from structure inference is that this can be pipelined with threads: one thread decompresses and decodes values and levels, and another thread can turn batches into a nested record- or column-oriented structure.
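
As a minimal illustration of that pipelining idea (not code from this change), one thread can push decoded batches into a shared queue while another assembles them; the `Batch` struct and both loop bodies are placeholders:

```
// Toy producer/consumer pipeline: a decoder thread produces batches of
// levels and values, an assembler thread consumes them.
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

struct Batch {
  std::vector<int16_t> def_levels;
  std::vector<int32_t> values;
};

int main() {
  std::queue<Batch> pipeline;
  std::mutex mu;
  std::condition_variable cv;
  bool done = false;

  std::thread decoder([&] {
    for (int32_t i = 0; i < 8; ++i) {  // stand-in for page decoding
      Batch b{{0, 1, 1}, {i, i + 1}};
      std::lock_guard<std::mutex> lock(mu);
      pipeline.push(std::move(b));
      cv.notify_one();
    }
    std::lock_guard<std::mutex> lock(mu);
    done = true;
    cv.notify_one();
  });

  std::thread assembler([&] {
    for (;;) {
      std::unique_lock<std::mutex> lock(mu);
      cv.wait(lock, [&] { return !pipeline.empty() || done; });
      if (pipeline.empty() && done) break;
      Batch b = std::move(pipeline.front());
      pipeline.pop();
      lock.unlock();
      // ... turn b into a record- or column-oriented structure ...
    }
  });

  decoder.join();
  assembler.join();
  return 0;
}
```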

Author: Wes McKinney <wes@cloudera.com>

Closes apache#26 from wesm/PARQUET-435 and squashes the following commits:

4bf5cd4 [Wes McKinney] Fix cpplint
852f4ec [Wes McKinney] Address review comments, also be sure to use Scanner::HasNext
7ea261e [Wes McKinney] Add TODO comment
4999719 [Wes McKinney] Make ColumnReader::ReadNewPage private and call HasNext() in ReadBatch
0d2e111 [Wes McKinney] Fix function description. Change #define to constexpr
111ef13 [Wes McKinney] Incorporate review comments and add some better comments
e16f7fd [Wes McKinney] Typo
ef52404 [Wes McKinney] Fix function doc
5e95cda [Wes McKinney] Configurable scanner batch size. Do not use printf in DebugPrint
1b4eca0 [Wes McKinney] New batch read API which reads levels and values in one shot
de4d6b6 [Wes McKinney] Move column_* files into parquet/column folder
aad4a86 [Wes McKinney] Finish refactoring scanner API with shared pointers
4506748 [Wes McKinney] Refactoring, do not have shared_from_this working yet
6489b15 [Wes McKinney] Batch level/value read interface on ColumnReader. Add Scanner class for flat columns. Add a couple smoke unit tests
wesm added a commit to wesm/arrow that referenced this pull request Sep 4, 2018
… than scalar
wesm added a commit to wesm/arrow that referenced this pull request Sep 6, 2018
… than scalar
wesm added a commit to wesm/arrow that referenced this pull request Sep 7, 2018
… than scalar
wesm added a commit to wesm/arrow that referenced this pull request Sep 8, 2018
… than scalar
kou pushed a commit that referenced this pull request May 10, 2020
This PR enables tests for `ARROW_COMPUTE`, `ARROW_DATASET`, `ARROW_FILESYSTEM`, `ARROW_HDFS`, `ARROW_ORC`, and `ARROW_IPC` (default on). #7131 enabled a minimal set of tests as a starting point.

I confirmed that these tests pass locally with the current master. In the current Travis CI environment, we cannot see this result because `arrow-utility-test` produces a lot of error messages.

```
$ git log | head -1
commit ed5f534
% ctest
...
      Start  1: arrow-array-test
 1/51 Test  #1: arrow-array-test .....................   Passed    4.62 sec
      Start  2: arrow-buffer-test
 2/51 Test  #2: arrow-buffer-test ....................   Passed    0.14 sec
      Start  3: arrow-extension-type-test
 3/51 Test  #3: arrow-extension-type-test ............   Passed    0.12 sec
      Start  4: arrow-misc-test
 4/51 Test  #4: arrow-misc-test ......................   Passed    0.14 sec
      Start  5: arrow-public-api-test
 5/51 Test  #5: arrow-public-api-test ................   Passed    0.12 sec
      Start  6: arrow-scalar-test
 6/51 Test  #6: arrow-scalar-test ....................   Passed    0.13 sec
      Start  7: arrow-type-test
 7/51 Test  #7: arrow-type-test ......................   Passed    0.14 sec
      Start  8: arrow-table-test
 8/51 Test  #8: arrow-table-test .....................   Passed    0.13 sec
      Start  9: arrow-tensor-test
 9/51 Test  #9: arrow-tensor-test ....................   Passed    0.13 sec
      Start 10: arrow-sparse-tensor-test
10/51 Test #10: arrow-sparse-tensor-test .............   Passed    0.16 sec
      Start 11: arrow-stl-test
11/51 Test #11: arrow-stl-test .......................   Passed    0.12 sec
      Start 12: arrow-concatenate-test
12/51 Test #12: arrow-concatenate-test ...............   Passed    0.53 sec
      Start 13: arrow-diff-test
13/51 Test #13: arrow-diff-test ......................   Passed    1.45 sec
      Start 14: arrow-c-bridge-test
14/51 Test #14: arrow-c-bridge-test ..................   Passed    0.18 sec
      Start 15: arrow-io-buffered-test
15/51 Test #15: arrow-io-buffered-test ...............   Passed    0.20 sec
      Start 16: arrow-io-compressed-test
16/51 Test #16: arrow-io-compressed-test .............   Passed    3.48 sec
      Start 17: arrow-io-file-test
17/51 Test #17: arrow-io-file-test ...................   Passed    0.74 sec
      Start 18: arrow-io-hdfs-test
18/51 Test #18: arrow-io-hdfs-test ...................   Passed    0.12 sec
      Start 19: arrow-io-memory-test
19/51 Test #19: arrow-io-memory-test .................   Passed    2.77 sec
      Start 20: arrow-utility-test
20/51 Test #20: arrow-utility-test ...................***Failed    5.65 sec
      Start 21: arrow-threading-utility-test
21/51 Test #21: arrow-threading-utility-test .........   Passed    1.34 sec
      Start 22: arrow-compute-compute-test
22/51 Test #22: arrow-compute-compute-test ...........   Passed    0.13 sec
      Start 23: arrow-compute-boolean-test
23/51 Test #23: arrow-compute-boolean-test ...........   Passed    0.15 sec
      Start 24: arrow-compute-cast-test
24/51 Test #24: arrow-compute-cast-test ..............   Passed    0.22 sec
      Start 25: arrow-compute-hash-test
25/51 Test #25: arrow-compute-hash-test ..............   Passed    2.61 sec
      Start 26: arrow-compute-isin-test
26/51 Test #26: arrow-compute-isin-test ..............   Passed    0.81 sec
      Start 27: arrow-compute-match-test
27/51 Test #27: arrow-compute-match-test .............   Passed    0.40 sec
      Start 28: arrow-compute-sort-to-indices-test
28/51 Test #28: arrow-compute-sort-to-indices-test ...   Passed    3.33 sec
      Start 29: arrow-compute-nth-to-indices-test
29/51 Test #29: arrow-compute-nth-to-indices-test ....   Passed    1.51 sec
      Start 30: arrow-compute-util-internal-test
30/51 Test #30: arrow-compute-util-internal-test .....   Passed    0.13 sec
      Start 31: arrow-compute-add-test
31/51 Test #31: arrow-compute-add-test ...............   Passed    0.12 sec
      Start 32: arrow-compute-aggregate-test
32/51 Test #32: arrow-compute-aggregate-test .........   Passed   14.70 sec
      Start 33: arrow-compute-compare-test
33/51 Test #33: arrow-compute-compare-test ...........   Passed    7.96 sec
      Start 34: arrow-compute-take-test
34/51 Test #34: arrow-compute-take-test ..............   Passed    4.80 sec
      Start 35: arrow-compute-filter-test
35/51 Test #35: arrow-compute-filter-test ............   Passed    8.23 sec
      Start 36: arrow-dataset-dataset-test
36/51 Test #36: arrow-dataset-dataset-test ...........   Passed    0.25 sec
      Start 37: arrow-dataset-discovery-test
37/51 Test #37: arrow-dataset-discovery-test .........   Passed    0.13 sec
      Start 38: arrow-dataset-file-ipc-test
38/51 Test #38: arrow-dataset-file-ipc-test ..........   Passed    0.21 sec
      Start 39: arrow-dataset-file-test
39/51 Test #39: arrow-dataset-file-test ..............   Passed    0.12 sec
      Start 40: arrow-dataset-filter-test
40/51 Test #40: arrow-dataset-filter-test ............   Passed    0.16 sec
      Start 41: arrow-dataset-partition-test
41/51 Test #41: arrow-dataset-partition-test .........   Passed    0.13 sec
      Start 42: arrow-dataset-scanner-test
42/51 Test #42: arrow-dataset-scanner-test ...........   Passed    0.20 sec
      Start 43: arrow-filesystem-test
43/51 Test #43: arrow-filesystem-test ................   Passed    1.62 sec
      Start 44: arrow-hdfs-test
44/51 Test #44: arrow-hdfs-test ......................   Passed    0.13 sec
      Start 45: arrow-feather-test
45/51 Test #45: arrow-feather-test ...................   Passed    0.91 sec
      Start 46: arrow-ipc-read-write-test
46/51 Test #46: arrow-ipc-read-write-test ............   Passed    5.77 sec
      Start 47: arrow-ipc-json-simple-test
47/51 Test #47: arrow-ipc-json-simple-test ...........   Passed    0.16 sec
      Start 48: arrow-ipc-json-test
48/51 Test #48: arrow-ipc-json-test ..................   Passed    0.27 sec
      Start 49: arrow-json-integration-test
49/51 Test #49: arrow-json-integration-test ..........   Passed    0.13 sec
      Start 50: arrow-json-test
50/51 Test #50: arrow-json-test ......................   Passed    0.26 sec
      Start 51: arrow-orc-adapter-test
51/51 Test #51: arrow-orc-adapter-test ...............   Passed    1.92 sec

98% tests passed, 1 tests failed out of 51

Label Time Summary:
arrow-tests      =  27.38 sec (27 tests)
arrow_compute    =  45.11 sec (14 tests)
arrow_dataset    =   1.21 sec (7 tests)
arrow_ipc        =   6.20 sec (3 tests)
unittest         =  79.91 sec (51 tests)

Total Test time (real) =  79.99 sec

The following tests FAILED:
	 20 - arrow-utility-test (Failed)
Errors while running CTest
```

Closes #7142 from kiszk/ARROW-8754

Authored-by: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
rui-mo added a commit to rui-mo/arrow-1 that referenced this pull request Aug 2, 2021
* add support for int32 type

* fix format of cast varchar
zhztheplayer pushed a commit to zhztheplayer/arrow-1 that referenced this pull request Feb 8, 2022
* add support for int32 type

* fix format of cast varchar
zhztheplayer pushed a commit to zhztheplayer/arrow-1 that referenced this pull request Mar 3, 2022
* add support for int32 type

* fix format of cast varchar
rui-mo added a commit to rui-mo/arrow-1 that referenced this pull request Mar 23, 2022
* add support for int32 type

* fix format of cast varchar
jayhomn-bitquill pushed a commit to Bit-Quill/arrow that referenced this pull request Aug 10, 2022
…iour-for-getObject-getString

[JAVA] [JDBC] Change IntervalAccessor getString format