Closed
Description
Describe the bug, including details regarding any error messages, version, and platform.
Arrow version
Built from main branch as of 9/21/2023
Problem description
When appending one variable-length vector to another, VectorAppender repeatedly calls reAlloc() on the target vector until its validity and offset buffers can hold the combined elements. Each of those calls also resizes the data buffer, so the data buffer can exceed the max allocation limit when a large number of small elements is appended to a vector that holds a single large element.
// Body of VectorAppender::visit
// make sure there is enough capacity
while (targetVector.getValueCapacity() < newValueCount) {
  targetVector.reAlloc(); // should only realloc validity and offset buffers.
}
while (targetVector.getDataBuffer().capacity() < newValueCapacity) {
  ((BaseVariableWidthVector) targetVector).reallocDataBuffer();
}
Steps to reproduce
The error can be reproduced using the snippet below.
@Test
public void testResizingBug() {
  var allocator = new RootAllocator();
  System.err.println("max allocation size: " + BaseValueVector.MAX_ALLOCATION_SIZE);
  // arrow.vector.max_allocation_bytes is set to 1048576 (1 MiB)
  // create a vector with a single 256 KiB string
  VarCharVector target = makeVec(1, 256 * 1024, allocator);
  // create a vector with a total of 1 KiB
  VarCharVector delta = makeVec(1024, 1, allocator);
  // we should be able to fit all the strings into a single vector using less than 1 MiB.
  // this works
  new VectorAppender(delta).visit(target, null);
  // this fails
  new VectorAppender(target).visit(delta, null);
}

private static VarCharVector makeVec(int nElements, int bytesPerElement, BufferAllocator allocator) {
  var v = new VarCharVector("test", allocator);
  v.allocateNew(nElements);
  for (int i = 0; i < nElements; i++) {
    v.setSafe(i, "A".repeat(bytesPerElement).getBytes(StandardCharsets.UTF_8));
  }
  v.setValueCount(nElements);
  return v;
}
The example produces the following error
org.apache.arrow.vector.util.OversizedAllocationException: Memory required for vector is (2097152), which is overflow or more than max allowed (1048576). You could consider using LargeVarCharVector/LargeVarBinaryVector for large strings/large bytes types
at org.apache.arrow.vector.BaseVariableWidthVector.checkDataBufferSize(BaseVariableWidthVector.java:435)
at org.apache.arrow.vector.BaseVariableWidthVector.reallocDataBuffer(BaseVariableWidthVector.java:542)
at org.apache.arrow.vector.BaseVariableWidthVector.reallocDataBuffer(BaseVariableWidthVector.java:520)
at org.apache.arrow.vector.BaseVariableWidthVector.reAlloc(BaseVariableWidthVector.java:497)
at org.apache.arrow.vector.util.VectorAppender.visit(VectorAppender.java:119)
at org.apache.arrow.vector.util.TestVectorSchemaRootAppender.testResizingBug(TestVectorSchemaRootAppender.java:68)
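The 2097152 in the message is what repeated doubling of a 256 KiB data buffer would request on the third step. The sketch below is a toy model of that arithmetic, not Arrow code: it assumes that each reAlloc() call doubles both the value capacity and the data buffer, and the class name, method names, and the starting value capacity of 8 are hypothetical choices made for illustration. It also ignores the first append in the test and only models appending 1024 one-byte values to a vector holding one 256 KiB value.

```java
// Toy model only: plain arithmetic, no Arrow classes are involved.
class ReallocSketch {
    // Mirrors the 1 MiB limit configured in the reproduction.
    static final long MAX_ALLOCATION = 1L << 20;

    /**
     * Coupled growth: every step that doubles the value capacity also doubles
     * the data buffer. Returns the first data-buffer request that exceeds the
     * limit, or -1 if the loop finishes without one.
     */
    static long coupledGrowth(long valueCapacity, long dataCapacity, long neededValues) {
        requirePositive(valueCapacity, dataCapacity);
        while (valueCapacity < neededValues) {
            long nextData = dataCapacity * 2;
            if (nextData > MAX_ALLOCATION) {
                return nextData;
            }
            dataCapacity = nextData;
            valueCapacity *= 2;
        }
        return -1;
    }

    /**
     * Decoupled growth: the data buffer doubles only until it can hold the
     * bytes actually needed. Same return convention as coupledGrowth.
     */
    static long decoupledGrowth(long dataCapacity, long neededBytes) {
        requirePositive(1, dataCapacity);
        while (dataCapacity < neededBytes) {
            long nextData = dataCapacity * 2;
            if (nextData > MAX_ALLOCATION) {
                return nextData;
            }
            dataCapacity = nextData;
        }
        return -1;
    }

    // Doubling a capacity of zero would never terminate, so reject it up front.
    private static void requirePositive(long valueCapacity, long dataCapacity) {
        if (valueCapacity <= 0 || dataCapacity <= 0) {
            throw new IllegalArgumentException("capacities must be positive");
        }
    }

    public static void main(String[] args) {
        long dataCapacity = 256 * 1024;        // one 256 KiB string
        long neededValues = 1 + 1024;          // existing value plus 1024 appended values
        long neededBytes = 256 * 1024 + 1024;  // existing bytes plus 1 KiB appended
        long assumedValueCapacity = 8;         // hypothetical starting capacity

        // 256 KiB -> 512 KiB -> 1 MiB -> request for 2 MiB exceeds the limit
        System.out.println(coupledGrowth(assumedValueCapacity, dataCapacity, neededValues)); // prints 2097152
        // A single doubling to 512 KiB is enough for the bytes actually needed
        System.out.println(decoupledGrowth(dataCapacity, neededBytes)); // prints -1
    }
}
```

In this model the oversized request appears whenever the starting value capacity needs at least three doublings to reach the required count, which is why growing the data buffer only in the second loop would stay under the limit.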
It looks like the issue is also present when appending other variable-length vector types.
I'm happy to post a PR if this looks reasonable.
Component(s)
Java