-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Store null columns in the segments #12279
Conversation
To reviewers, my apologies. I created a new PR because of some issue. |
.setNumericColumnSupplier(Suppliers.ofInstance(nullNumericColumn)) | ||
.setDictionaryEncodedColumnSupplier(Suppliers.ofInstance(nullDictionaryEncodedColumn)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this isn't correct, both of these statements both set the column supplier in the builder, I think you probably only need the dictionary encoded column supplier, which i think should behave correctly, and then can drop the NullNumericColumn
?
Also, does this column have no type?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah good catch. I planned to fix it but forgot until you pointed it out. I removed NullNumericColumn
. The column type is set by ColumnDescriptor
.
processing/pom.xml
Outdated
@@ -99,6 +99,10 @@ | |||
<groupId>commons-net</groupId> | |||
<artifactId>commons-net</artifactId> | |||
</dependency> | |||
<dependency> | |||
<groupId>org.apache.commons</groupId> | |||
<artifactId>commons-compress</artifactId> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what uses this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I remember that I added it because Travis complain about it. But it seems no longer needed. Removed it now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Turns out I was using a util class in this library to create an array list from an iterator. Fixed it to use Guava instead.
if (merged.hasBitmapIndexes() != otherSnapshot.hasBitmapIndexes()) { | ||
merged.setHasBitmapIndexes(false); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this line the reason that you can't use the merge function of ColumnCapabilities
? if so that seems pretty sad, I wonder if there is a way to make it work for all uses...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is actually the method that I migrated from ColumnCapabilities.merge()
and modified this line. ColumnCapabilities.merge()
is being used only for merging segments in IndexMergerV9
today, and I don't want other people to mistakenly use it after I modify this line.
b9cd68e
to
9b3f169
Compare
{ | ||
throw new RuntimeException("This method should not be called for null-only columns"); | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oops, I think you should implement makeVectorValueSelector
and makeVectorObjectSelector
here as well to return NilVectorSelector
for missing numeric and complex columns
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think they are strictly necessary in this PR since numeric columns don't use NullColumnPartSerde
yet even when it's completely empty as it's noted in the "Future work" section of the PR description. If you agree, I'd like to add them in a follow-up PR when I fix numeric columns.
If I understand correctly, this PR fixes #11386 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM except for some minor comments. Thank you for this contribution, @jihoonson
Assert.assertEquals(1, segments.size()); | ||
// only empty string dimensions are ignored currently | ||
Assert.assertEquals(ImmutableList.of("ts", "valDim"), segments.get(0).getDimensions()); | ||
Assert.assertEquals(ImmutableList.of("valMet"), segments.get(0).getMetrics()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was expecting valMet
to be not stored. did I miss something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Null metrics are always being stored as either 0s (default mode) or nulls (sql-compatible mode).
if (!Objects.equals(merged.getType(), otherSnapshot.getType()) | ||
|| !Objects.equals(merged.getComplexTypeName(), otherSnapshot.getComplexTypeName()) | ||
|| !Objects.equals(merged.getElementType(), otherSnapshot.getElementType())) { | ||
throw new ISE( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we add other info such as complex typename and element type name in the exception?
public ImmutableBitmap getBitmapForValue(@Nullable String value) | ||
{ | ||
if (NullHandling.isNullOrEquivalent(value)) { | ||
return bitmapFactory.complement(bitmapFactory.makeEmptyImmutableBitmap(), rowCountSupplier.getAsInt()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this could be nullBitmapSupplier.get()
I'm merging this PR. Thanks @clintropolis @abhishekagarwal87 for the review! |
* Store null columns in the segments * fix test * remove NullNumericColumn and unused dependency * fix compile failure * use guava instead of apache commons * split new tests * unused imports * address comments
Terminology
Description
Today, null columns are not stored at ingestion time and thus they become unknown columns once the ingestion job is done. The druid native query engine uses segment-level schema and treats unknown columns as if they were null columns; reading unknown columns returns only nulls.
Druid SQL is different. Druid is using Calcite SQL planner which requires valid column information at planning time. The column information is retrieved from datasource-level schema which is dynamically discovered by merging segment schemas. As a result, users cannot query unknown columns using SQL. This is causing a couple of issues. One of the main issues is the SQL queries failing against stream ingestion from time to time. While it creates segments, the realtime task announces a realtime segment that has all columns in the ingestion spec. This realtime task thus reports the broker even null columns for the realtime segment which can be used by the Druid SQL planner. Once the segment is handed off to a historical, it announces a historical segment that does not store any null columns in it. As a result, the same SQL query will no longer work after the segment is handed off.
Proposed solution
To make the SQL planner be aware of null columns, Druid needs to track of them. This PR proposes to store those null columns in the segment just like normal columns.
Feature flag
druid.indexer.task.storeEmptyColumns
is added. This is on by default. A new task context,storeEmptyColumns
, is added which can override the system property.Ingestion tasks
When
storeEmptyColumns
is set, the task stores every column specified inDimensionsSpec
in the segments that it creates. This applies to all kinds of ingestion except for Hadoop ingestion and Tranquility.Segment writes/reads
For null columns, Druid stores a column name, column type, number of rows, and bitmapSerdeFactory. The first two are stored in
ColumnDescriptor
and the last two are stored inNullColumnPartSerde
.NullColumnPartSerde
has a no-op serializer and a deserializer that can dynamically create a bitmap index and a dictionary. Finally, the null column names are stored at the end ofindex.drd
separately from normal columns. This is for the compatibility for older historicals. When they read a segment that has null column stored, they won't be aware of those columns but will just ignore them without exploding.Test plan
Future work
Key changed/added classes in this PR
NullColumnPartSerde
IndexIO
IndexMergerV9
This PR has: