@siddharthteotia
Contributor

This patch has the following goals:

(1) Make ArrowBuf work with any arbitrary memory.
(2) Decouple the usage of data get/set in ArrowBuf and memory accounting, reference management, ownership etc.

Changes

(1) A ReferenceManager interface that can be provided to ArrowBuf. This allows the users to provide their own custom implementation of reference management or it can be a NO-OP.
(2) All the accounting, ownership, reference related APIs have been moved to the default implementation of ReferenceManager -- BufferLedger, AllocationManager
(3) ArrowBuf is now literally an abstraction over some user-provided underlying memory chunk. All it needs is the starting virtual address and the length of the data to access, along with a user-provided implementation of ReferenceManager.
(4) ArrowBuf no longer extends or implements any of Netty's buffer interfaces. Thus all of the extra and unused APIs have been removed and it just provides simple get/set.
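
A minimal sketch of the decoupling described in changes (1) to (3), not the code in this patch. `RefManager`, `CountingRefManager`, and `BufView` are hypothetical names, and a `java.nio.ByteBuffer` stands in for the raw address and length that the real ArrowBuf works with.

```java
import java.nio.ByteBuffer;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for the ReferenceManager idea: lifetime only, no data access.
interface RefManager {
  void retain();
  boolean release(); // true when the count reaches zero
  int refCount();
}

// One possible implementation: a plain atomic counter starting at 1.
final class CountingRefManager implements RefManager {
  private final AtomicInteger count = new AtomicInteger(1);
  @Override public void retain() { count.incrementAndGet(); }
  @Override public boolean release() { return count.decrementAndGet() == 0; }
  @Override public int refCount() { return count.get(); }
}

// Hypothetical buffer view: only get/set over caller-provided memory.
final class BufView {
  private final ByteBuffer memory;
  private final RefManager refManager;

  BufView(ByteBuffer memory, RefManager refManager) {
    this.memory = memory;
    this.refManager = refManager;
  }

  int getInt(int index) { return memory.getInt(index); }
  void setInt(int index, int value) { memory.putInt(index, value); }
  RefManager refManager() { return refManager; }
}

final class BufViewDemo {
  public static void main(String[] args) {
    // The memory here comes from the JDK, not from a Netty allocator.
    BufView buf = new BufView(ByteBuffer.allocateDirect(16), new CountingRefManager());
    buf.setInt(4, 42);
    System.out.println(buf.getInt(4));
    System.out.println(buf.refManager().release());
  }
}
```

The point of the split is that `BufView` never inspects the count itself; swapping in a different `RefManager` changes lifetime handling without touching the get/set path.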

There is quite a bit of cleanup that needs to be done since some APIs have been moved out of ArrowBuf. So the caller code needs to change. They are likely going to be boilerplate changes but I would like to do them once we have consensus on the major set of changes here and the decoupling between ArrowBuf usage and reference management.

So the code doesn't compile yet because of the above-mentioned reason. Secondly, there are a few things that I have removed assuming they are not being used, like BufferManager in ArrowBuf. I am still evaluating its usage. So there are a few TODOs in the code for these reasons.

Raising PR before the code is complete to get feedback on the important set of changes.

@pearu
Contributor

pearu commented Apr 14, 2019

> Make ArrowBuf work with any arbitrary memory.

What does "arbitrary memory" mean? Would this also include device memory?

Contributor

What do you think about changing the API to take longs here instead? Or do you think this should be a separate PR?

Contributor

I don't think we should encumber this patch with that change.

Let's also have a discussion about that change on the mailing list before we make it. What use case is there where this is important?

Contributor

I can bring this up for discussion on the mailing list, but the suggestion is to bring this in line with the C++ side of things so that large message batches can be read in Java if desired (see https://issues.apache.org/jira/browse/ARROW-679).
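
One way to see why 32-bit sizes limit batch size, as an illustration with made-up numbers rather than anything taken from ARROW-679: the byte size of a buffer is a value count multiplied by a value width, and that product silently wraps in `int` arithmetic once it passes 2 GiB. `SizeOverflowDemo` and its methods are hypothetical names.

```java
final class SizeOverflowDemo {
  // Byte size computed in 32-bit arithmetic; wraps around on overflow.
  static int sizeAsInt(int valueCount, int valueWidth) {
    return valueCount * valueWidth;
  }

  // Same computation, widened to 64 bits before multiplying.
  static long sizeAsLong(int valueCount, int valueWidth) {
    return (long) valueCount * valueWidth;
  }

  public static void main(String[] args) {
    int valueCount = 600_000_000; // hypothetical batch of 8-byte values
    System.out.println(sizeAsInt(valueCount, 8));  // wrapped, wrong size
    System.out.println(sizeAsLong(valueCount, 8)); // 4800000000
    // Math.multiplyExact fails loudly instead of wrapping silently.
    try {
      Math.multiplyExact(valueCount, 8);
    } catch (ArithmeticException e) {
      System.out.println("overflow detected");
    }
  }
}
```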

Contributor

Can we change this to either take a char, or replicate the documentation saying which bits are ignored?

Contributor

Actually, is there a use case for storing 2-byte characters in Arrow? I can't think of one off the top of my head.

Contributor

eliminate or document.

Contributor

Please try to add javadoc to each method

Contributor

Let's make sure people are happy with overall shape/approach of patch before we spend time on this type of changes.

Contributor

Consider returning a collection here (if it isn't too big a change, even if we want to change to support 64-bit memory sizes in a separate PR).

Contributor

Generally, the same question about changing the API to use longs instead of ints for all getters/setters.

Contributor

should this be tied to memory? What if we want a file backed implementation?

Contributor

It definitely should be tied to a memory address. Whether that is backed by a memory mapped file will be up to the consumer with this change.
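
To illustrate that point with plain `java.nio` (a sketch, not Arrow APIs): a memory-mapped file is still read and written like memory by the code doing get/set, so file backing remains a choice of whoever supplies the memory. `MappedMemoryDemo` and the temp-file prefix are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class MappedMemoryDemo {
  // Maps `length` bytes of a fresh temp file, then writes and reads a long through the mapping.
  static long roundTrip(long value, int length) {
    try {
      Path file = Files.createTempFile("mapped-demo", ".bin");
      file.toFile().deleteOnExit();
      try (FileChannel channel = FileChannel.open(
          file, StandardOpenOption.READ, StandardOpenOption.WRITE)) {
        // Mapping READ_WRITE beyond the current file size grows the file.
        MappedByteBuffer memory = channel.map(FileChannel.MapMode.READ_WRITE, 0, length);
        memory.putLong(0, value); // same style of access as any other buffer
        return memory.getLong(0);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(roundTrip(123456789L, 64));
  }
}
```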

@jacques-n
Contributor

> > Make ArrowBuf work with any arbitrary memory.
>
> What does "arbitrary memory" mean? Would this include also device memory?

It just means any memory address, not just those attached to a Netty allocator. Right now, an ArrowBuf must be allocated by the Netty allocator, which means that a user who allocated memory some other way is unable to connect it to Arrow. For example, the Java Plasma client has that problem right now.

Contributor

This name doesn't seem to match up well with the documentation if an actual ArrowBuf is being allocated.

@BryanCutler
Member

+1 on the changes, looks great @siddharthteotia ! I just had some minor comments

Member

do you need a ++address here?

Contributor Author

should be fixed in the latest commit

Member

Do you need to decrement length and increment dstAddress?

Contributor Author

should be fixed in latest commit
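
The pattern these two review comments point at can be sketched with array indices standing in for raw addresses. This is a hypothetical helper, not the code under review: a byte-wise copy loop has to advance the source position, advance the destination position, and shrink the remaining length on every iteration. Missing any one of the three gives either repeated bytes or a loop that never ends.

```java
import java.util.Arrays;

final class CopyLoopDemo {
  // Copies `length` bytes from src starting at srcIndex to dst starting at dstIndex.
  static void copy(byte[] src, int srcIndex, byte[] dst, int dstIndex, int length) {
    while (length > 0) {
      dst[dstIndex] = src[srcIndex];
      ++srcIndex; // advance the read position
      ++dstIndex; // advance the write position
      --length;   // one fewer byte remaining
    }
  }

  public static void main(String[] args) {
    byte[] src = {1, 2, 3, 4, 5};
    byte[] dst = new byte[5];
    copy(src, 1, dst, 0, 3);
    System.out.println(Arrays.toString(dst)); // [2, 3, 4, 0, 0]
  }
}
```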

Member

it returns a boolean though?

Contributor Author

fixed.

Contributor Author

Yes, the interface in the old ArrowBuf had these semantics: it returns true if the ref count has dropped to 0, false otherwise.
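
A minimal sketch of those return-value semantics, assuming a simple atomic counter. `RefCounter` is a hypothetical class, and the argument checks are illustrative choices rather than the behavior of the old ArrowBuf.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class RefCounter {
  // The creator holds the first reference.
  private final AtomicInteger count = new AtomicInteger(1);

  void retain() {
    count.incrementAndGet();
  }

  // Decrements by `decrement`; returns true only when the count has dropped to 0.
  boolean release(int decrement) {
    if (decrement < 1) {
      throw new IllegalArgumentException("decrement must be >= 1");
    }
    int remaining = count.addAndGet(-decrement);
    if (remaining < 0) {
      throw new IllegalStateException("released more references than were held");
    }
    return remaining == 0;
  }

  // Convenience form: decrement by 1.
  boolean release() {
    return release(1);
  }
}

final class RefCounterDemo {
  public static void main(String[] args) {
    RefCounter counter = new RefCounter();
    counter.retain();                       // count: 2
    System.out.println(counter.release());  // false, one reference left
    System.out.println(counter.release());  // true, count reached 0
  }
}
```

A caller can use the boolean to decide when the underlying memory may be freed, without needing a separate call to read the count.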

Member

same here

Contributor Author

fixed.

Contributor Author

done

Member

Maybe say decrement by 1?

Contributor Author

done

Contributor Author

done

Member

could it be helpful to return the new count?

Member

Why do you need this, since retain() is below?

Contributor Author

These interfaces are borrowed from how they were earlier in ArrowBuf. I didn't want to change or remove these reference-count-related APIs.

Member

Ok, then I agree not to change it. I thought it was a new api.

Member

Again here, could it be useful to return the new reference count?

Contributor Author

Maybe. I just kept it the way it was in the previous implementation of ArrowBuf.

Member

I guess if it's over the limit then it reallocates? Maybe you could clarify.

@siddharthteotia
Contributor Author

siddharthteotia commented Apr 18, 2019

Thanks @jacques-n, @BryanCutler, @emkornfield, @pearu for the comments so far. I will be addressing them soon. In my latest commit, I have made all the necessary changes in the Java code to work with the new ArrowBuf and ReferenceManager interface. More importantly, there is a wrapper buffer, NettyArrowBuf, to comply with usage in RPC and Netty-related code. It would be good to get feedback on this one as well. I am also adding javadocs for the existing and new interfaces in ArrowBuf, but might have missed a few methods here and there.

As of now, the java modules build fine but I have to fix test failures. That is in progress.

@yuruiz

yuruiz commented Apr 23, 2019

Please update org/apache/arrow/memory/README.md to explain how the ReferenceManager fits into Arrow memory management.

@codecov-io

codecov-io commented May 4, 2019

Codecov Report

Merging #4151 into master will increase coverage by 1.67%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #4151      +/-   ##
==========================================
+ Coverage   87.77%   89.45%   +1.67%     
==========================================
  Files         758      620     -138     
  Lines       92506    83442    -9064     
  Branches     1251        0    -1251     
==========================================
- Hits        81201    74644    -6557     
+ Misses      11188     8798    -2390     
+ Partials      117        0     -117
Impacted Files Coverage Δ
python/pyarrow/jvm.py 95.86% <100%> (ø) ⬆️
cpp/src/arrow/array/builder_union.h 61.9% <0%> (-38.1%) ⬇️
cpp/src/arrow/array/builder_base.cc 76.36% <0%> (-5.99%) ⬇️
cpp/src/arrow/array/builder_dict.cc 64.41% <0%> (-3.33%) ⬇️
cpp/src/arrow/array/builder_base.h 94.11% <0%> (-2.44%) ⬇️
cpp/src/parquet/statistics.cc 87.96% <0%> (-2.3%) ⬇️
cpp/src/arrow/compute/kernel.h 62.76% <0%> (-2.07%) ⬇️
cpp/src/plasma/thirdparty/ae/ae.c 70.75% <0%> (-0.95%) ⬇️
cpp/src/arrow/array/builder_binary.h 97% <0%> (-0.81%) ⬇️
python/pyarrow/tests/test_array.py 96.31% <0%> (-0.49%) ⬇️
... and 226 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 532450d...6bb9cfc.

@kou kou changed the title ARROW-3191:[JAVA] WIP for pointing ArrowBuf to arbitrary memory ARROW-3191: [Java] WIP for pointing ArrowBuf to arbitrary memory May 4, 2019
@siddharthteotia siddharthteotia changed the title ARROW-3191: [Java] WIP for pointing ArrowBuf to arbitrary memory ARROW-3191: [Java] Make ArrowBuf work with arbitrary underlying memory May 6, 2019
@siddharthteotia
Contributor Author

I got a clean build with my previous commit. The latest commit just has some cleanup and I expect it to be clean. Please review the changes. We should plan to merge this soon.

@siddharthteotia
Contributor Author

Can this be merged?

@BryanCutler
Member

LGTM

@BryanCutler
Member

merged to master, thanks @siddharthteotia

@siddharthteotia
Contributor Author

> merged to master, thanks @siddharthteotia

Thank you, @BryanCutler

@jacques-n
Contributor

I think @pravindra and I both wanted to do another pass over this before merging. I guess we'll open up follow-on JIRAs to address additional comments.

@jacques-n
Contributor

Some initial comments. Still going through it.

  * @return netty compliant {@link NettyArrowBuf}
  */
-public TransferResult transferOwnership(BufferAllocator target) {
+public NettyArrowBuf asNettyBuffer() {

Contributor

Netty should not be exposed in the public API. Please remove

  * </p>
  */
-public final class ArrowBuf extends AbstractByteBuf implements AutoCloseable {
+public final class ArrowBuf implements AutoCloseable {

Contributor

This should no longer be in the io.netty.buffer package.

* Reference Manager manages one or more ArrowBufs that share the
* reference count for the underlying memory chunk.
*/
public interface ReferenceManager {

Contributor

Please provide a default implementation with noop operations
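
One possible shape for such a no-op default, sketched with hypothetical names (`Lifetime` stands in for ReferenceManager; the constant name and the return values chosen for the no-op are assumptions, not what this patch ships):

```java
// Hypothetical stand-in for the ReferenceManager interface.
interface Lifetime {
  void retain();
  boolean release();
  int refCount();

  // Shared no-op instance for memory whose lifetime is managed elsewhere.
  Lifetime NO_OP = new Lifetime() {
    @Override public void retain() { }
    // Never reports "freed", since this manager owns nothing.
    @Override public boolean release() { return false; }
    // Assumption: report a constant count of 1 so the buffer always looks live.
    @Override public int refCount() { return 1; }
  };
}

final class NoOpLifetimeDemo {
  public static void main(String[] args) {
    Lifetime lifetime = Lifetime.NO_OP;
    lifetime.retain();
    System.out.println(lifetime.release());
    System.out.println(lifetime.refCount());
  }
}
```

A single shared instance is enough because the no-op holds no state, so callers wrapping externally owned memory can pass it without allocating anything.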

import io.netty.buffer.ArrowBuf;
import io.netty.buffer.ArrowBuf.TransferResult;

public class TestBaseAllocator {

Contributor

We need a set of tests that prove that this jira does what it says it does. Let's create a simple allocator that doesn't use netty and then use the vector apis doing that. This should probably be a new test class (as opposed to being added to this class).

private final BufferManager bufManager;
private final ArrowByteBufAllocator alloc;
private final boolean isEmpty;
private int readerIndex;

Contributor

It seems like these shouldn't be part of ArrowBuf and can be an add-on or new class if people want to have index-based behavior.

@Override
public ByteBuffer[] nioBuffers() {
return new ByteBuffer[] {nioBuffer()};
return isEmpty ? ByteBuffer.allocateDirect(0) : asNettyBuffer().nioBuffer(index, length);

Contributor

Weird that this delegates to netty buffer since ideally ArrowBuf shouldn't be depending on Netty.

@BryanCutler
Member

Apologies @jacques-n and @pravindra if I merged too soon; I thought there were no more comments.

@pravindra
Contributor

> Apologies @jacques-n and @pravindra if I merged too soon; I thought there were no more comments.

No problem at all. We'll address issues, if any, in follow-up JIRAs.

pribor pushed a commit to GlobalWebIndex/arrow that referenced this pull request Oct 24, 2025
Author: siddharth <siddharth@dremio.com>

Closes apache#4151 from siddharthteotia/ARROW-3191 and squashes the following commits:

6bb9cfc <siddharth> Cleanup
2fa139c <siddharth> integration test issues
283916f <siddharth> Fix integration test issues
47a303b <siddharth> Setting io.netty.tryReflectionSetAccessible to true
76096e5 <siddharth> Refactor NettyArrowBuf
0b9d5d2 <siddharth> Fix test failures happening in non-debug mode and gandiva build errors
2fbb2c5 <siddharth> Fix some test failures and rebase
9e8beb6 <siddharth> Fix test failures, add javadoc and wrapper over ArrowBuf for usage in Netty framework
348be8b <siddharth> Change callers of ArrowBuf APIs to use ReferenceManager interface and fix build issues
68fe274 <siddharth> ARROW-3191: WIP for pointing ArrowBuf to arbitrary memory