ARROW-33: [C++] Implement zero-copy array slicing #56
Some points of confusion follow:
} \
\
std::shared_ptr<NAME> slice(int32_t start, int32_t length) const{ \
int32_t uFrom = start; \
assert start >= 0?
Please use the variable naming conventions from https://google.github.io/styleguide/cppguide.html#Variable_Names. Do not use Hungarian notation
Probably use DCHECK_GT for the assertion
Thanks for your advice. I will modify it according to the expected format.
Eh...
Actually, this method accepts start and length even when they are negative.
Thanks for your advice. I will modify the variable names.
:)
What about other Array types (like ListArray, StringArray)?
*static_cast<const PrimitiveArray*>(&other)); \
} \
\
std::shared_ptr<NAME> slice(int32_t start) const{ \
Also move this implementation to primitive.cc
One major issue (I think you mentioned on JIRA) is the null bitmap: slicing may require a memory allocation and copying over the bits. Another option is to add a layer of indirection and a Otherwise, if creating a new bitmap is required, then the API must change to return
This option sounds better, as it is similar to how the data buffer is handled.
It's a performance trade-off. The downside of that approach is that a bitmap offset then potentially leaks into many algorithms.
@wesm @drankye Then I am still confused by the following code. Anyway, I will rewrite the code based on all of your suggestions.
I'm concerned about working more on this patch until we figure out a function evaluation model. See ARROW-34. Zero-copy slicing is difficult unless you construct a view object (definitely an option).
@wesm By the way, it's hard to put the method into Thanks a lot.
To summarize the question Guangyuan mentioned above: we have several Buffer types, including Buffer, PoolBuffer, etc. When slicing an array, what concrete Buffer type should the resulting array use? It would be good to use the same Buffer type the parent uses, but I'm not sure how easily that can be done. I have a related question: I'm not sure how much benefit we get from having quite a few Buffer subclasses. Could we just have one Buffer class and add all the necessary methods, like Append, Resize, and Reserve, to it? I guess that would simplify things.
Since users of Arrow will have memory originating from many places and with varying ownership semantics, forcing a particular Buffer or memory allocation strategy is not a good idea. By enabling users to implement their own Buffer subclasses and their own memory pools, they can hook the Arrow C++ library more easily into existing applications. I'm open to other ideas for ways to arrange Arrow's internal memory management, reference counting, allocation, and ownership semantics. This would probably need to take the form of a patch or the start of a patch, though.
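As an illustration of the point about varying ownership semantics, the sketch below shows a small base class with two subclasses that release memory differently. All names here (`Buffer` as written, `MallocBuffer`, `CallbackBuffer`) are hypothetical and simplified; this is not Arrow's actual class hierarchy.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <functional>
#include <memory>
#include <utility>

// Hypothetical minimal base class: a read-only view of bytes. Subclasses
// decide who owns the memory and how it is released.
class Buffer {
 public:
  Buffer(const uint8_t* data, int64_t size) : data_(data), size_(size) {}
  virtual ~Buffer() = default;
  Buffer(const Buffer&) = delete;             // avoid accidental double release
  Buffer& operator=(const Buffer&) = delete;

  const uint8_t* data() const { return data_; }
  int64_t size() const { return size_; }

 private:
  const uint8_t* data_;
  int64_t size_;
};

// Owns zero-initialized memory from calloc and frees it on destruction.
// (Allocation failure is not handled in this sketch.)
class MallocBuffer : public Buffer {
 public:
  explicit MallocBuffer(int64_t size)
      : MallocBuffer(static_cast<uint8_t*>(
                         std::calloc(static_cast<size_t>(size > 0 ? size : 1), 1)),
                     size) {}
  ~MallocBuffer() override { std::free(owned_); }
  uint8_t* mutable_data() { return owned_; }

 private:
  MallocBuffer(uint8_t* p, int64_t size) : Buffer(p, size), owned_(p) {}
  uint8_t* owned_;
};

// Wraps memory owned by the host application; the caller supplies a
// release callback that runs when the last reference goes away.
class CallbackBuffer : public Buffer {
 public:
  CallbackBuffer(const uint8_t* data, int64_t size, std::function<void()> release)
      : Buffer(data, size), release_(std::move(release)) {}
  ~CallbackBuffer() override {
    if (release_) release_();
  }

 private:
  std::function<void()> release_;
};
```

Code that only reads through `std::shared_ptr<Buffer>` does not need to know which subclass it holds, which is the flexibility argued for above.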
Add slice functions and test cases for ListArray. Except for the value buffer, the offsets buffer and bitmap buffer take the non-zero-copy strategy.
Branch updated from d9b872b to d589961.
Why did the CI test fail when there are no failures in my environment?
The test suite is failing the valgrind memory leak checks. We enabled this to be printed in parquet (https://github.com/apache/parquet-cpp/blob/master/ci/travis_script_cpp.sh). I'm pretty concerned about this patch because of the semantics of slicing bitmaps per my April 4 comment. I have not reviewed the code yet.
…ocalFileSource I also added the `file_descriptor` API so that we can verify that dtors elsewhere successfully close open files. Closes apache#56 Author: Wes McKinney <wesm@apache.org> Closes apache#66 from wesm/PARQUET-520 and squashes the following commits: 9d638ba [Wes McKinney] Add memory-mapping option to ParquetFileReader::OpenFile. Add --no-memory-map flag to parquet_reader 6389683 [Wes McKinney] Add Read API tests dbf6a45 [Wes McKinney] Test some failure modes for LocalFileSource / MemoryMapSource 01a7d64 [Wes McKinney] Add a MemoryMapSource and use this by default for SerializedFileReader
Enable CI for Java Datasets
…d fix issue in unit test (apache#56) * Support date format with no hyphen * Correct the unit test * Keep previous test case * Correct a comment
Add two methods to slice an Array. Each generates a new Array
containing a Buffer with a parent pointer, offset, and length.
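The description above can be sketched roughly as follows. This is a simplified illustration, not the patch itself: the `Buffer` and `Int32Array` types and the `MakeInt32Array` helper are hypothetical, null bitmaps are left out, and negative `start` or `length` values are rejected by assertion (an assumption that follows the review suggestion rather than the submitted behaviour).

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical buffer that either owns its bytes or is a zero-copy view
// holding a shared pointer to its parent plus an offset and a length.
class Buffer {
 public:
  // Root buffer: takes ownership of the bytes.
  explicit Buffer(std::vector<uint8_t> bytes)
      : storage_(std::move(bytes)),
        data_(storage_.data()),
        size_(static_cast<int64_t>(storage_.size())) {}

  // View: shares the parent's memory and keeps the parent alive.
  Buffer(std::shared_ptr<Buffer> parent, int64_t offset, int64_t length)
      : parent_(std::move(parent)), data_(nullptr), size_(length) {
    assert(offset >= 0 && length >= 0 && offset + length <= parent_->size());
    data_ = parent_->data() + offset;
  }

  Buffer(const Buffer&) = delete;
  Buffer& operator=(const Buffer&) = delete;

  const uint8_t* data() const { return data_; }
  int64_t size() const { return size_; }

 private:
  std::vector<uint8_t> storage_;    // used only by root buffers
  std::shared_ptr<Buffer> parent_;  // used only by views
  const uint8_t* data_;
  int64_t size_;
};

// Hypothetical primitive array without a null bitmap.
struct Int32Array {
  std::shared_ptr<Buffer> values;
  int32_t length;

  int32_t Value(int32_t i) const {
    int32_t v;
    std::memcpy(&v, values->data() + static_cast<int64_t>(i) * sizeof(int32_t),
                sizeof(v));
    return v;
  }

  // Slice [start, start + slice_length) without copying the values.
  Int32Array Slice(int32_t start, int32_t slice_length) const {
    assert(start >= 0 && slice_length >= 0 && start + slice_length <= length);
    const int64_t width = static_cast<int64_t>(sizeof(int32_t));
    auto view = std::make_shared<Buffer>(values, start * width,
                                         slice_length * width);
    return {view, slice_length};
  }

  // Slice from start to the end of the array.
  Int32Array Slice(int32_t start) const { return Slice(start, length - start); }
};

// Hypothetical helper that builds a root array by copying the input once.
Int32Array MakeInt32Array(const std::vector<int32_t>& v) {
  std::vector<uint8_t> bytes(v.size() * sizeof(int32_t));
  if (!v.empty()) std::memcpy(bytes.data(), v.data(), bytes.size());
  return {std::make_shared<Buffer>(std::move(bytes)),
          static_cast<int32_t>(v.size())};
}
```

Because each view holds a `shared_ptr` to its parent, a slice remains valid after the original array goes out of scope, at the cost of keeping the whole parent allocation alive.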