-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-3191: [Java] Make ArrowBuf work with arbitrary underlying memory #4151
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
What does "arbitrary memory" mean? Would this include also device memory? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what do you think about changing the API to take LONGs here instead? Or do you think this should be a separate PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should encumber this patch with that change.
Let's also have a discussion about that change on the mailing list before we make it. What use case is there where this is important?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can bring this up for discussion on the mailing list but the suggestion is to bring this in line with the C++ side of things so large message batches can be read in java if desired see (https://issues.apache.org/jira/browse/ARROW-679)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we change this to either take a char, or replicate documentation saying which bits are ignored.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, is there a use-case to store 2-byte characters in Arrow (I'm not thinking of one off the top of my head).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
eliminate or document.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please try to add javadoc to each method
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's make sure people are happy with overall shape/approach of patch before we spend time on this type of changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
consider returning a collection here (if it isn't too big a change, even even we want to change to support 64 bit size memory in a separate PR).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
generally same question about changing the API to use Longs instead of Ints for all getters/settings
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this be tied to memory? What if we want a file backed implementation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It definitely should be tied to a memory address. Whether that is backed by a memory mapped file will be up to the consumer with this change.
It just means any memory address, not just those attached to a Netty allocator. Right now, ArrowBuf must be allocated by the Netty allocator which means that if a user allocated memory some other way, he is unable to connect it to Arrow. For example, the Java Plasma client has that problem right now. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this name doesn't seem to match up well with the documentation if an actual arrowbuf is being allocated.
BryanCutler
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 on the changes, looks great @siddharthteotia ! I just had some minor comments
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you need a ++address here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should be fixed in the latest commit
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you need to decrement length and increment dstAddress?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should be fixed in latest commit
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it returns a boolean though?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes the interface in old ArrowBuf had the semantics that it will return true if ref count has dropped to 0, false otherwise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe say decrement by 1?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could it be helpful to return the new count?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess why do you need this since retain() is below?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These interfaces are borrowed from how they were earlier in ArrowBuf. I didn't want to change or remove this reference count related APIs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, then I agree not to change it. I thought it was a new api.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again here, could it be useful to return the new reference count?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
May be. I just kept the way it was in previous implementation of ArrowBuf.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess if it's overlimit then it reallocates? Maybe you could clarify
|
Thanks @jacques-n , @BryanCutler, @emkornfield , @pearu for the comments so far. I will be addressing them soon. In my latest commit, I have made all the necessary changes in java code to work with new ArrowBuf and ReferenceManager interface. More importantly, there is a wrapper buffer NettyArrowBuf interface to comply with usage in RPC and Netty related code. It will be good to get feedback on this one as well. I am also adding javadocs for the existing/new interfaces in ArrowBuf but might have missed out a few methods here and there. As of now, the java modules build fine but I have to fix test failures. That is in progress. |
|
Please update the org/apache/arrow/memory/README.md to explain how does the ReferenceManager fit into Arrow Memory Management. |
and fix build issues
usage in Netty framework
4a4ba75 to
2fbb2c5
Compare
Codecov Report
@@ Coverage Diff @@
## master #4151 +/- ##
==========================================
+ Coverage 87.77% 89.45% +1.67%
==========================================
Files 758 620 -138
Lines 92506 83442 -9064
Branches 1251 0 -1251
==========================================
- Hits 81201 74644 -6557
+ Misses 11188 8798 -2390
+ Partials 117 0 -117
Continue to review full report at Codecov.
|
|
I got a clean build with my previous commit. The latest commit just has come cleanup and I expect it to be clean. Please review the changes. We should plan to merge this soon. |
|
Can this be merged? |
BryanCutler
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
|
merged to master, thanks @siddharthteotia |
Thank you, @BryanCutler |
|
I think @pravindra and I both wanted to do another pass over this before merging. I guess we'll open up follow-on JIRAs to address additional comments. |
jacques-n
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some initial comments. Still going through it.
| * @return netty compliant {@link NettyArrowBuf} | ||
| */ | ||
| public TransferResult transferOwnership(BufferAllocator target) { | ||
| public NettyArrowBuf asNettyBuffer() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Netty should not be exposed in the public API. Please remove
| * </p> | ||
| */ | ||
| public final class ArrowBuf extends AbstractByteBuf implements AutoCloseable { | ||
| public final class ArrowBuf implements AutoCloseable { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should no longer be in the io.netty.buffer package.
| * Reference Manager manages one or more ArrowBufs that share the | ||
| * reference count for the underlying memory chunk. | ||
| */ | ||
| public interface ReferenceManager { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please provide a default implementation with noop operations
| import io.netty.buffer.ArrowBuf; | ||
| import io.netty.buffer.ArrowBuf.TransferResult; | ||
|
|
||
| public class TestBaseAllocator { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need a set of tests that prove that this jira does what it says it does. Let's create a simple allocator that doesn't use netty and then use the vector apis doing that. This should probably be a new test class (as opposed to being added to this class).
| private final BufferManager bufManager; | ||
| private final ArrowByteBufAllocator alloc; | ||
| private final boolean isEmpty; | ||
| private int readerIndex; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like these shoudn't be part of ArrowBuf and can be an addon/new class if people want to have index based behavior.
| @Override | ||
| public ByteBuffer[] nioBuffers() { | ||
| return new ByteBuffer[] {nioBuffer()}; | ||
| return isEmpty ? ByteBuffer.allocateDirect(0) : asNettyBuffer().nioBuffer(index, length); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Weird that this delegates to netty buffer since ideally ArrowBuf shouldn't be depending on Netty.
|
Apologies @jacques-n and @pravindra if I merged too soon , I thought there were no more comments |
no problem at all. we'll address issues, if any, in follow-up jiras. |
This patch has the following goals: (1) Make ArrowBuf work with any arbitrary memory. (2) Decouple the usage of data get/set in ArrowBuf and memory accounting, reference management, ownership etc. Changes (1) A ReferenceManager interface that can be provided to ArrowBuf. This allows the users to provide their own custom implementation of reference management or it can be a NO-OP. (2) All the accounting, ownership, reference related APIs have been moved to the default implementation of ReferenceManager -- BufferLedger, AllocationManager (3) ArrowBuf is now literally an abstraction over some user provided underlying memory chunk. All it needs is starting virtual address and length of data to access along with a user provided implementation of ReferenceManager. (4) ArrowBuf no longer extends or implements any of Netty's buffer interfaces. Thus all of the extra and unused APIs have been removed and it just provides simple get/set. There is quite a bit of cleanup that needs to be done since some APIs have been moved out of ArrowBuf. So the caller code needs to change. They are likely going to be boilerplate changes but I would like to do them once we have consensus on the major set of changes here and the decoupling between ArrowBuf usage and reference management. So the code doesn't compile yet because of the above mentioned reason. Secondly, there are a few things that I have removed assuming they are not being used -- like BufferManager in ArrowBuf. I am still evaluating its usage. So there a few TODOs in code for these reasons. Raising PR before the code is complete to get feedback on the important set of changes. Author: siddharth <siddharth@dremio.com> Closes apache#4151 from siddharthteotia/ARROW-3191 and squashes the following commits: 6bb9cfc <siddharth> Cleanup 2fa139c <siddharth> integration test issues 283916f <siddharth> Fix integration test issues 47a303b <siddharth> Setting io.netty.tryReflectionSetAccessible to true 76096e5 <siddharth> Refactor NettyArrowBuf 0b9d5d2 <siddharth> Fix test failures happening in non-debug mode and gandiva build errors 2fbb2c5 <siddharth> Fix some test failures and rebase 9e8beb6 <siddharth> Fix test failures, add javadoc and wrapper over ArrowBuf for usage in Netty framework 348be8b <siddharth> Change callers of ArrowBuf APIs to use ReferenceManager interface and fix build issues 68fe274 <siddharth> ARROW-3191: WIP for pointing ArrowBuf to arbitrary memory
This patch has the following goals:
(1) Make ArrowBuf work with any arbitrary memory.
(2) Decouple the usage of data get/set in ArrowBuf and memory accounting, reference management, ownership etc.
Changes
(1) A ReferenceManager interface that can be provided to ArrowBuf. This allows the users to provide their own custom implementation of reference management or it can be a NO-OP.
(2) All the accounting, ownership, reference related APIs have been moved to the default implementation of ReferenceManager -- BufferLedger, AllocationManager
(3) ArrowBuf is now literally an abstraction over some user provided underlying memory chunk. All it needs is starting virtual address and length of data to access along with a user provided implementation of ReferenceManager.
(4) ArrowBuf no longer extends or implements any of Netty's buffer interfaces. Thus all of the extra and unused APIs have been removed and it just provides simple get/set.
There is quite a bit of cleanup that needs to be done since some APIs have been moved out of ArrowBuf. So the caller code needs to change. They are likely going to be boilerplate changes but I would like to do them once we have consensus on the major set of changes here and the decoupling between ArrowBuf usage and reference management.
So the code doesn't compile yet because of the above mentioned reason. Secondly, there are a few things that I have removed assuming they are not being used -- like BufferManager in ArrowBuf. I am still evaluating its usage. So there a few TODOs in code for these reasons.
Raising PR before the code is complete to get feedback on the important set of changes.