[Java] Excessive resizing in VectorAppender leads to avoidable OversizedAllocationException #37829

@hrishisd

Describe the bug, including details regarding any error messages, version, and platform.

Arrow version

Built from main branch as of 9/21/2023

Problem description

When appending one variable-length vector to another, VectorAppender repeatedly resizes the validity and offset buffers of the target vector until they can hold the combined elements. Each of those resizes also resizes the data buffer, even when the data buffer already has enough room. As a result, appending a large number of small elements to a vector that holds a single large element can push the data buffer past the max allocation limit.

// Body of VectorAppender::visit

// make sure there is enough capacity
while (targetVector.getValueCapacity() < newValueCount) {
  targetVector.reAlloc(); // problem: this also resizes the data buffer; it should only realloc the validity and offset buffers
}
while (targetVector.getDataBuffer().capacity() < newValueCapacity) {
  ((BaseVariableWidthVector) targetVector).reallocDataBuffer();
}
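
To make the arithmetic concrete, here is a minimal stand-alone sketch that does not use Arrow at all. It assumes that each resize doubles a buffer, which is consistent with the 1048576 to 2097152 jump in the error reported below. The class name `ResizeSketch`, both method names, and the starting value capacity of 8 are hypothetical and chosen only for illustration; the real initial capacity of the vector may differ.

```java
public class ResizeSketch {
    // 1 MiB limit, matching arrow.vector.max_allocation_bytes in the report.
    public static final long MAX_ALLOCATION = 1L << 20;

    /**
     * Models the current behaviour: every step that grows the value capacity
     * also doubles the data buffer. Capacities must be positive.
     * Returns the final data buffer size in bytes, or -1 if a doubling
     * would exceed MAX_ALLOCATION.
     */
    public static long coupledGrowth(long valueCapacity, long dataCapacity,
                                     long neededValues, long neededBytes) {
        while (valueCapacity < neededValues) {
            valueCapacity *= 2;
            dataCapacity *= 2; // grows even if the data buffer is already big enough
            if (dataCapacity > MAX_ALLOCATION) {
                return -1;
            }
        }
        while (dataCapacity < neededBytes) {
            dataCapacity *= 2;
            if (dataCapacity > MAX_ALLOCATION) {
                return -1;
            }
        }
        return dataCapacity;
    }

    /**
     * Models growing the value capacity and the data buffer independently.
     * Same return convention as coupledGrowth.
     */
    public static long independentGrowth(long valueCapacity, long dataCapacity,
                                         long neededValues, long neededBytes) {
        while (valueCapacity < neededValues) {
            valueCapacity *= 2; // validity and offset buffers only
        }
        while (dataCapacity < neededBytes) {
            dataCapacity *= 2;
            if (dataCapacity > MAX_ALLOCATION) {
                return -1;
            }
        }
        return dataCapacity;
    }

    public static void main(String[] args) {
        long startValues = 8;                    // hypothetical initial value capacity
        long startBytes = 256 * 1024;            // one 256 KiB string
        long neededValues = 1 + 1024;            // existing element plus 1024 appended
        long neededBytes = 256 * 1024 + 1024;    // 257 KiB of data in total

        System.out.println("coupled:     "
            + coupledGrowth(startValues, startBytes, neededValues, neededBytes));
        System.out.println("independent: "
            + independentGrowth(startValues, startBytes, neededValues, neededBytes));
    }
}
```

Under these assumptions the coupled model hits the limit on its third doubling (256 KiB, 512 KiB, 1 MiB, then 2 MiB), while the independent model ends with a 512 KiB data buffer, which is enough for the 257 KiB of combined data.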

Steps to reproduce

The error can be reproduced using the snippet below.

@Test
public void testResizingBug() {
  var allocator = new RootAllocator();
  System.err.println("max allocation size: " + BaseValueVector.MAX_ALLOCATION_SIZE);
  // arrow.vector.max_allocation_bytes is set to 1048576 (1 MiB)
  // create a vector with a single 256 KiB string
  VarCharVector target = makeVec(1, 256 * 1024, allocator);
  // create a vector of 1024 one-byte strings (1 KiB of data in total)
  VarCharVector delta = makeVec(1024, 1, allocator);
  // all the strings should fit into a single vector using less than 1 MiB
  // appending the single large string to the vector of small strings works
  new VectorAppender(delta).visit(target, null);
  // appending the many-element vector to the vector holding the large string fails
  new VectorAppender(target).visit(delta, null);
}

private static VarCharVector makeVec(int nElements, int bytesPerElement, BufferAllocator allocator) {
  var v = new VarCharVector("test", allocator);
  v.allocateNew(nElements);
  for (int i = 0; i < nElements; i++) {
    v.setSafe(i, "A".repeat(bytesPerElement).getBytes(StandardCharsets.UTF_8));
  }
  v.setValueCount(nElements);
  return v;
}

The example produces the following error:

org.apache.arrow.vector.util.OversizedAllocationException: Memory required for vector is (2097152), which is overflow or more than max allowed (1048576). You could consider using LargeVarCharVector/LargeVarBinaryVector for large strings/large bytes types

	at org.apache.arrow.vector.BaseVariableWidthVector.checkDataBufferSize(BaseVariableWidthVector.java:435)
	at org.apache.arrow.vector.BaseVariableWidthVector.reallocDataBuffer(BaseVariableWidthVector.java:542)
	at org.apache.arrow.vector.BaseVariableWidthVector.reallocDataBuffer(BaseVariableWidthVector.java:520)
	at org.apache.arrow.vector.BaseVariableWidthVector.reAlloc(BaseVariableWidthVector.java:497)
	at org.apache.arrow.vector.util.VectorAppender.visit(VectorAppender.java:119)
	at org.apache.arrow.vector.util.TestVectorSchemaRootAppender.testResizingBug(TestVectorSchemaRootAppender.java:68)

It looks like the issue is also present when appending other variable-length vector types.


I'm happy to post a PR if this looks reasonable.

Component(s)

Java
