
IncrementalIndex.add() barfs when InputRow.getDimensions() has duplicates #63

Closed

cheddar opened this issue Jan 23, 2013 · 7 comments
@cheddar
Contributor

cheddar commented Jan 23, 2013

If InputRow.getDimensions() has duplicates, IncrementalIndex.add() fails with

java.lang.ArrayIndexOutOfBoundsException: 4
at com.metamx.druid.index.v1.IncrementalIndex.add(IncrementalIndex.java:148)
at com.metamx.druid.realtime.Sink.add(Sink.java:98)
at com.metamx.druid.realtime.RealtimeManager$FireChief.run(RealtimeManager.java:176)
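
For a concrete example, a row shaped like this triggers it (a sketch; it assumes the MapBasedInputRow constructor and an existing IncrementalIndex named index, and exact package names may differ):

```java
// Repro sketch (hypothetical): a row whose dimension list repeats a name.
// Assumes MapBasedInputRow(timestamp, dimensions, event) and an existing
// IncrementalIndex "index"; exact package names may differ.
InputRow row = new MapBasedInputRow(
    System.currentTimeMillis(),
    Arrays.asList("host", "host"),                    // duplicate dimension
    ImmutableMap.<String, Object>of("host", "web01")
);
index.add(row);  // ArrayIndexOutOfBoundsException when "host" is new to the index
```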

@xvrl removed this from the Druid Release -- Next milestone Apr 14, 2014

cheddar added a commit to cheddar/druid that referenced this issue Jul 1, 2015 ("…essor: parallel query chunk processing at broker")
@michaelschiff
Contributor

I've reproduced this in a test.

The issue only occurs if the duplicate dimension has never been seen before.

Each time an InputRow is added, we initialize a 2D array whose top level has one slot per dimension, each slot holding an array of the unique values for that dimension; the top level is sized to the number of dimensions seen so far. We then loop through the row's dimensions, updating the 2D array (or an overflow buffer) with the values for each dimension.

We will see the duplicate dimension twice in this update loop. The second time, we find an index for the dimension (assigned the first time) and attempt to store its values at that index in the 2D array. However, the array was sized from the number of dimensions seen as of the last call to add(), which is one too small since we had never seen this dimension before, so we hit an index out of bounds.

In the case that the duplicate dimension has been seen before (in a previous call to add()), the 2D array is already properly sized. We find an index for the duplicate dimension both times we see it in the update loop, and simply write the same set of values at that index in the 2D array.
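
To make the failure concrete, here is a runnable sketch of the logic described above. The identifiers and values are illustrative, not the actual IncrementalIndex code:

```java
import java.util.*;

// Illustrative sketch of the described add() logic; not the real IncrementalIndex.
public class DuplicateDimSketch {
  private static final Map<String, Integer> dimensionOrder = new LinkedHashMap<>();

  static void add(List<String> rowDims) {
    // Top level sized from the dimensions seen BEFORE this row began.
    String[][] dims = new String[dimensionOrder.size()][];
    List<String[]> overflow = new ArrayList<>();

    for (String dim : rowDims) {
      Integer index = dimensionOrder.get(dim);
      if (index == null) {
        // First occurrence of a brand-new dimension: assign the next index and
        // park its values in the overflow buffer.
        dimensionOrder.put(dim, dimensionOrder.size());
        overflow.add(new String[]{"someValue"});
      } else {
        // Dimension already has an index: write into the pre-sized array. For the
        // second occurrence of a brand-new dimension, index >= dims.length, so
        // this line throws ArrayIndexOutOfBoundsException.
        dims[index] = new String[]{"someValue"};
      }
    }
  }

  public static void main(String[] args) {
    add(Arrays.asList("host", "host"));  // duplicate, never seen before -> AIOOBE
  }
}
```

If you first call add(Arrays.asList("host")), the second call sizes dims correctly and both occurrences land on the same slot, which is the benign case described above.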

It seems to me like duplicate dimensions should be an error, but I would like to clarify what the expected behavior is in this case.

80d8eedcf7422479020fa9388cd66f55dc74230d
This commit illustrates the second condition (the duplicate dimension has been seen before, and the add succeeds). By commenting out the first call to add(), you can see it fail when it attempts to access the 2D array.

@nishantmonu51
Member

+1 on raising an error for duplicate dimensions;
however, the error needs to be informative and meaningful instead of an ArrayIndexOutOfBoundsException.

@michaelschiff
Contributor

So, assuming it is agreed that this case is an error, I see two approaches:

  1. Eliminate the possibility of duplicates by having InputRow return a Set instead of a List (this has implications for dimension ordering).

  2. On each call to add(), maintain a set of the dimensions seen in this input row, throwing a descriptive error on a duplicate dimension (sketched below).

The existing lookup (does this dimension have an index?) can detect duplicates on the first occurrence of a dimension (i.e., an index is found but the array is too small, turning what is now an index out of bounds into something more descriptive). However, it is not sufficient when the duplicated dimension was seen on a previous row: the index will be found and the array will be large enough.
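
A minimal sketch of approach 2, assuming getDimensions() keeps returning a List; the helper name is hypothetical, and it would be called at the top of IncrementalIndex.add():

```java
// Hypothetical helper for approach 2; not the actual patch.
static void validateNoDuplicateDimensions(List<String> rowDims) {
  Set<String> seen = new HashSet<>();
  for (String dim : rowDims) {
    if (!seen.add(dim)) {
      // HashSet.add returns false if the element was already present.
      throw new IllegalArgumentException(
          "Dimension[" + dim + "] occurred more than once in InputRow");
    }
  }
}
```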

@michaelschiff
Contributor

#2017

Essentially, this is approach 2 without the need for the additional set.
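
For the record, my reading of how that works without the extra set (a sketch only, not the actual diff; see #2017 for the real change):

```java
// Sketch only; see #2017 for the actual change. add() already has two signals
// that a dimension repeated within the current row:
//   1. its index points past the pre-sized dims array (a brand-new dimension
//      seen a second time in the same row), or
//   2. the slot at that index was already filled earlier in this row.
if (index >= dims.length || dims[index] != null) {
  throw new IllegalStateException(
      "Dimension[" + dimension + "] occurred more than once in InputRow");
}
dims[index] = getDimVals(dimension);  // hypothetical value lookup
```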

@drcrallen
Contributor

Is this a hypothetical, or is this occurring somewhere in the wild?

michaelschiff added a commit to tubemogul/druid that referenced this issue Dec 4, 2015
drcrallen added a commit that referenced this issue Dec 4, 2015
@michaelschiff
Contributor

We can probably close this, right? #2017 is merged.

@fjy
Contributor

fjy commented Dec 7, 2015

Fixed by #2017

@fjy fjy closed this as completed Dec 7, 2015
maytasm referenced this issue in maytasm/druid Feb 26, 2020
[Backport] GREATEST/LEAST post-aggregators in SQL (apache#8719)