Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-33: [C++] Implement zero-copy array slicing #56

Closed
wants to merge 1 commit into from

Conversation

fengguangyuan
Copy link

Add two methods to slice an Array. It will generate a new Array
containing a Buffer with parent pointer, offset and length.

@fengguangyuan
Copy link
Author

Some confusions follow:
1.If it's necessary to add the slicing methods into ListArray, UnionArray etc.
2.Before slice(...) returns an Array, which is better the Array contains a base class Buffer
or a derived Buffer.
e...Anyway, hope this request not too bad. : )

@fengguangyuan fengguangyuan changed the title ARROW-33: Implement zero-copy array slicing ARROW-33: [C++] Implement zero-copy array slicing Apr 1, 2016
} \
\
std::shared_ptr<NAME> slice(int32_t start, int32_t length) const{ \
int32_t uFrom = start; \
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assert start >= 0?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use the variable naming conventions from https://google.github.io/styleguide/cppguide.html#Variable_Names. Do not use Hungarian notation

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably use DCHECK_GT for the assertion

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your advices, I will modify it according to the expected format.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Eh....
Actually, this method will accept start and length when they are negative.
Thanks for your advices. I will modify the variable names.
:)

@wesm
Copy link
Member

wesm commented Apr 1, 2016

What about other Array types (like ListArray, StringArray)?

*static_cast<const PrimitiveArray*>(&other)); \
} \
\
std::shared_ptr<NAME> slice(int32_t start) const{ \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also move this implementation to primitive.cc

@wesm
Copy link
Member

wesm commented Apr 1, 2016

One major issue (I think you mentioned on JIRA) is the null bitmap — slicing may require a memory allocation and copying over the bits. Another option is to add a layer of indirection and a Bitmap class that can retrieve bits including an offset, so that zero-copy is possible.

Otherwise, if creating a new bitmap is required, then the API must change to return Status (since the allocation could fail)

@drankye
Copy link

drankye commented Apr 4, 2016

Another option is to add a layer of indirection and a Bitmap class that can retrieve bits including an offset, so that zero-copy is possible.

This option sounds better, as similar to the data buffer.

@wesm
Copy link
Member

wesm commented Apr 4, 2016

It's a performance trade-off. The downside of that approach is that a bitmap offset then potentially leaks into many algorithms

@fengguangyuan
Copy link
Author

@wesm @drankye
Yes, you mentioned features are great.
And I will try to add an indirection layer to do tricks with null bitmap.

Then I still have a confusion with the following code.
auto sliced_buf = std::make_shared<Buffer>(parent_data, int64_t(uFrom * sizeof(T)), int64_t(uLength * sizeof(T)));
For the above code. I am not sure when the slice method returns an specified Array, such as Int32Array, the Array should contain a base-class Buffer object, or a derived-class PoolBuffer object.
After all, Int32Array is created with a PoolBuffer which owns some features that Buffer dosen't.
Maybe, the new code just looks like this:
auto sliced_buf = std::make_shared<PoolBuffer>(parent_data, int64_t(uFrom * sizeof(T)), int64_t(uLength * sizeof(T)));

Anyway, I will rewirte the code based on all of your suggestions.
Thank you, everybody.
Have a nice day. :)

@wesm
Copy link
Member

wesm commented Apr 5, 2016

I'm concerned about working more on this patch until we figure out a function evaluation model. See ARROW-34. Zero-copy slicing is difficult unless you construct a view object (definitely an option).

@fengguangyuan
Copy link
Author

@wesm
OK. I will reduce the influence of this issue as much as possible.

By the way, it's hard to put the method into Primitive.cc, unless class PrimitiveArray is templatable
or provide more arguments for this method, because user may expect this method ot return a specific
Array, like Int32Array, so some more tricks should be considered to adapt to the change.

Thanks a lot.

@drankye
Copy link

drankye commented Apr 10, 2016

To summarize the question Guangyuan mentioned above: we have some Buffer types including Buffer, PoolBuffer and etc., to slice an array, what's the concrete Buffer type to be used by the resultant array? It would be good to use the same Buffer type the parent uses, but not sure how easy it's to be done.

I have a question related. I'm not sure how much benefit we would obtain by having quite a few Buffer subclasses. Might it be just to have one Buffer class, and add all the necessary methods like Append, Resize, Reserve to it? It should simplify something I guess.

@wesm
Copy link
Member

wesm commented Apr 10, 2016

Since users of Arrow will have memory originating from many places and with varying ownership semantics, forcing a particular Buffer or memory allocated strategy is not a good idea. By enabling users to implement their own Buffer subclasses and their own memory pools, they can hook the Arrow C++ library more easily into existing applications.

I'm open to other ideas ways to arrange Arrow's internal memory management, reference counting, allocation, and ownership semantics. This probably would need to take the form of a patch or the start of a patch, though.

@fengguangyuan
Copy link
Author

@drankye
Yes, maybe your question is good, I am not sure.
@wesm
Great. The user-defined buffer should be nice. Arrow has the rules, so does the users.
Arrow defines the basic measures of internal memory management etc.
An interface of Buffer or memory pool for user definition is needed, I think.

Add slice functions and test cases for ListArray.
Except the value buffer, offsets buffer and bitmap buffer take the
onzero-copy stragtegy.
@fengguangyuan fengguangyuan force-pushed the ARROW-33 branch 3 times, most recently from d9b872b to d589961 Compare May 6, 2016 08:33
@fengguangyuan
Copy link
Author

Why the CI test failed, however it occurs no failures in my environment?
Please. :(

@wesm
Copy link
Member

wesm commented May 6, 2016

The test suite is failing the valgrind memory leak checks. We enabled this to be printed in parquet (https://github.com/apache/parquet-cpp/blob/master/ci/travis_script_cpp.sh).

I'm pretty concerned about this patch because of the semantics of slicing bitmaps per my April 4 comment. I have not reviewed the code yet.

@asfgit asfgit closed this in 5439b71 Feb 6, 2017
wesm added a commit to wesm/arrow that referenced this pull request Sep 2, 2018
…ocalFileSource

I also added the `file_descriptor` API so that we can verify that dtors elsewhere successfully close open files. Closes apache#56

Author: Wes McKinney <wesm@apache.org>

Closes apache#66 from wesm/PARQUET-520 and squashes the following commits:

9d638ba [Wes McKinney] Add memory-mapping option to ParquetFileReader::OpenFile. Add --no-memory-map flag to parquet_reader
6389683 [Wes McKinney] Add Read API tests
dbf6a45 [Wes McKinney] Test some failure modes for LocalFileSource / MemoryMapSource
01a7d64 [Wes McKinney] Add a MemoryMapSource and use this by default for SerializedFileReader
wesm added a commit to wesm/arrow that referenced this pull request Sep 4, 2018
…ocalFileSource

I also added the `file_descriptor` API so that we can verify that dtors elsewhere successfully close open files. Closes apache#56

Author: Wes McKinney <wesm@apache.org>

Closes apache#66 from wesm/PARQUET-520 and squashes the following commits:

9d638ba [Wes McKinney] Add memory-mapping option to ParquetFileReader::OpenFile. Add --no-memory-map flag to parquet_reader
6389683 [Wes McKinney] Add Read API tests
dbf6a45 [Wes McKinney] Test some failure modes for LocalFileSource / MemoryMapSource
01a7d64 [Wes McKinney] Add a MemoryMapSource and use this by default for SerializedFileReader

Change-Id: I467fcda7439d36c244d74bf5fec0ae61f6b674f0
wesm added a commit to wesm/arrow that referenced this pull request Sep 6, 2018
…ocalFileSource

I also added the `file_descriptor` API so that we can verify that dtors elsewhere successfully close open files. Closes apache#56

Author: Wes McKinney <wesm@apache.org>

Closes apache#66 from wesm/PARQUET-520 and squashes the following commits:

9d638ba [Wes McKinney] Add memory-mapping option to ParquetFileReader::OpenFile. Add --no-memory-map flag to parquet_reader
6389683 [Wes McKinney] Add Read API tests
dbf6a45 [Wes McKinney] Test some failure modes for LocalFileSource / MemoryMapSource
01a7d64 [Wes McKinney] Add a MemoryMapSource and use this by default for SerializedFileReader

Change-Id: I467fcda7439d36c244d74bf5fec0ae61f6b674f0
wesm added a commit to wesm/arrow that referenced this pull request Sep 7, 2018
…ocalFileSource

I also added the `file_descriptor` API so that we can verify that dtors elsewhere successfully close open files. Closes apache#56

Author: Wes McKinney <wesm@apache.org>

Closes apache#66 from wesm/PARQUET-520 and squashes the following commits:

9d638ba [Wes McKinney] Add memory-mapping option to ParquetFileReader::OpenFile. Add --no-memory-map flag to parquet_reader
6389683 [Wes McKinney] Add Read API tests
dbf6a45 [Wes McKinney] Test some failure modes for LocalFileSource / MemoryMapSource
01a7d64 [Wes McKinney] Add a MemoryMapSource and use this by default for SerializedFileReader

Change-Id: I467fcda7439d36c244d74bf5fec0ae61f6b674f0
wesm added a commit to wesm/arrow that referenced this pull request Sep 8, 2018
…ocalFileSource

I also added the `file_descriptor` API so that we can verify that dtors elsewhere successfully close open files. Closes apache#56

Author: Wes McKinney <wesm@apache.org>

Closes apache#66 from wesm/PARQUET-520 and squashes the following commits:

9d638ba [Wes McKinney] Add memory-mapping option to ParquetFileReader::OpenFile. Add --no-memory-map flag to parquet_reader
6389683 [Wes McKinney] Add Read API tests
dbf6a45 [Wes McKinney] Test some failure modes for LocalFileSource / MemoryMapSource
01a7d64 [Wes McKinney] Add a MemoryMapSource and use this by default for SerializedFileReader

Change-Id: I467fcda7439d36c244d74bf5fec0ae61f6b674f0
jikunshang pushed a commit to jikunshang/arrow that referenced this pull request May 29, 2020
zhouyuan pushed a commit to zhouyuan/arrow that referenced this pull request Dec 22, 2021
…d fix issue in unit test (apache#56)

* Support date format with no hyphen

* Correct the unit test

* Keep previous test case

* Correct a comment
zhztheplayer pushed a commit to zhztheplayer/arrow-1 that referenced this pull request Feb 8, 2022
…d fix issue in unit test (apache#56)

* Support date format with no hyphen

* Correct the unit test

* Keep previous test case

* Correct a comment
zhztheplayer pushed a commit to zhztheplayer/arrow-1 that referenced this pull request Mar 3, 2022
…d fix issue in unit test (apache#56)

* Support date format with no hyphen

* Correct the unit test

* Keep previous test case

* Correct a comment
rui-mo pushed a commit to rui-mo/arrow-1 that referenced this pull request Mar 23, 2022
…d fix issue in unit test (apache#56)

* Support date format with no hyphen

* Correct the unit test

* Keep previous test case

* Correct a comment
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants