
Store null columns in the segments #12279

Merged: 8 commits into apache:master on Mar 23, 2022
Conversation

@jihoonson (Contributor) commented Feb 24, 2022

Terminology

  • Column: a logical column that can be stored across multiple segments.
  • Null column: a column that has only nulls in it. Druid is aware of this column.
  • Unknown column: a column that Druid is not aware of. In other words, it is a column that Druid does not track at all, via segment metadata or any other mechanism.

Description

Today, null columns are not stored at ingestion time, so they become unknown columns once the ingestion job is done. The Druid native query engine uses the segment-level schema and treats unknown columns as if they were null columns; reading an unknown column returns only nulls.

Druid SQL is different. Druid uses the Calcite SQL planner, which requires valid column information at planning time. That column information comes from the datasource-level schema, which is dynamically discovered by merging segment schemas. As a result, users cannot query unknown columns using SQL. This causes a couple of issues; one of the main ones is SQL queries failing intermittently against stream ingestion. While it creates segments, the realtime task announces a realtime segment that has all columns in the ingestion spec, so it reports even null columns to the broker, and the Druid SQL planner can use them. Once the segment is handed off to a historical, the historical announces a segment that does not store any null columns. As a result, the same SQL query no longer works after handoff.
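For example, suppose a dimension was added to the ingestion spec of a streaming datasource but only null values for it have arrived so far (the datasource and column names below are hypothetical). A query like the following plans successfully while the segment is realtime, but fails with an unknown-column error after the segment is handed off:

  SELECT newly_added_dim, COUNT(*)
  FROM stream_datasource
  GROUP BY newly_added_dim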

Proposed solution

To make the SQL planner aware of null columns, Druid needs to keep track of them. This PR proposes storing those null columns in the segment just like normal columns.

Feature flag

A new system property, druid.indexer.task.storeEmptyColumns, is added and is on by default. A new task context key, storeEmptyColumns, is also added and can override the system property per task.
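For illustration, the cluster-wide default might be set in runtime.properties and then disabled for a single task through its context (a minimal sketch; the surrounding task spec is omitted):

  # runtime.properties (this is the default)
  druid.indexer.task.storeEmptyColumns=true

and, inside an individual task spec, the per-task override:

  "context": {
    "storeEmptyColumns": false
  }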

Ingestion tasks

When storeEmptyColumns is enabled, the task stores every column specified in the DimensionsSpec in the segments it creates, even when a column contains only nulls; see the sketch below. This applies to all ingestion types except Hadoop ingestion and Tranquility.
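For example, given a dimensionsSpec fragment like the following (the column names here are made up), the created segments would contain a country column even if every ingested row had a null country:

  "dimensionsSpec": {
    "dimensions": ["page", "country"]
  }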

Segment writes/reads

For null columns, Druid stores the column name, column type, number of rows, and bitmapSerdeFactory. The first two are stored in the ColumnDescriptor and the last two in NullColumnPartSerde. NullColumnPartSerde has a no-op serializer and a deserializer that can dynamically create a bitmap index and a dictionary. Finally, the names of null columns are stored at the end of index.drd, separately from normal columns, for compatibility with older historicals: when an older historical reads a segment that has null columns stored, it will not be aware of those columns and will simply ignore them instead of exploding. A conceptual sketch of the deserializer's job follows.
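The deserializer needs very little state to reconstruct the column: essentially the row count and a bitmap factory. The plain-Java sketch below illustrates the idea; the class and method names are illustrative only (not the actual Druid API), and java.util.BitSet stands in for Druid's ImmutableBitmap:

  import java.util.BitSet;

  // Illustrative stand-in for the behavior NullColumnPartSerde provides:
  // a no-op writer plus a reader that materializes an all-null column
  // from nothing but the row count.
  final class NullColumnSketch
  {
    private final int numRows;

    NullColumnSketch(int numRows)
    {
      this.numRows = numRows;
    }

    // Serializer side: nothing to write, since a null-only column has no per-row data.
    void serialize()
    {
      // intentionally a no-op
    }

    // Reader side: the null bitmap marks every row as null.
    BitSet makeNullBitmap()
    {
      BitSet bitmap = new BitSet(numRows);
      bitmap.set(0, numRows);
      return bitmap;
    }

    // Reader side: the dictionary has exactly one entry, null, at id 0.
    String lookupDictionaryValue(int id)
    {
      return null;
    }
  }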

Test plan

Future work

  • Currently, null numeric dimensions are always stored even without this change. I would call this a bug because 1) the behavior doesn't match string dimensions, and 2) all the nulls are currently stored in the segment file along with the null bitmap index and then read during query processing, which is unnecessary and inefficient.
  • Hadoop ingestion may be supported later.

Key changed/added classes in this PR
  • NullColumnPartSerde
  • IndexIO
  • IndexMergerV9

This PR has:

  • been self-reviewed.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.

@jihoonson (Contributor Author):

To reviewers, my apologies. I created a new PR because of some issue.

Comment on lines 116 to 117
.setNumericColumnSupplier(Suppliers.ofInstance(nullNumericColumn))
.setDictionaryEncodedColumnSupplier(Suppliers.ofInstance(nullDictionaryEncodedColumn));
Member:

This isn't correct; both of these statements set the column supplier in the builder. I think you probably only need the dictionary-encoded column supplier, which I think should behave correctly, and then you can drop NullNumericColumn?

Also, does this column have no type?

Contributor Author:

Ah good catch. I planned to fix it but forgot until you pointed it out. I removed NullNumericColumn. The column type is set by ColumnDescriptor.

@@ -99,6 +99,10 @@
     <groupId>commons-net</groupId>
     <artifactId>commons-net</artifactId>
   </dependency>
+  <dependency>
+    <groupId>org.apache.commons</groupId>
+    <artifactId>commons-compress</artifactId>
Member:

what uses this?

Contributor Author:

I remember adding it because Travis complained about it, but it seems to be no longer needed. Removed it now.

Contributor Author:

Turns out I was using a util class in this library to create an array list from an iterator. Fixed it to use Guava instead.

Comment on lines +982 to +984
if (merged.hasBitmapIndexes() != otherSnapshot.hasBitmapIndexes()) {
  merged.setHasBitmapIndexes(false);
}
Member:

is this line the reason that you can't use the merge function of ColumnCapabilities? if so that seems pretty sad, I wonder if there is a way to make it work for all uses...

Contributor Author:

This is actually the method I migrated from ColumnCapabilities.merge(), with this line modified. ColumnCapabilities.merge() is only used for merging segments in IndexMergerV9 today, and I don't want other people to mistakenly use it after I modify this line.

{
  throw new RuntimeException("This method should not be called for null-only columns");
}

Member:

oops, I think you should implement makeVectorValueSelector and makeVectorObjectSelector here as well to return NilVectorSelector for missing numeric and complex columns

Contributor Author:

I don't think they are strictly necessary in this PR, since numeric columns don't use NullColumnPartSerde yet even when the column is completely empty, as noted in the "Future work" section of the PR description. If you agree, I'd like to add them in a follow-up PR when I fix numeric columns.

@FrankChen021 (Member):

If I understand correctly, this PR fixes #11386

@abhishekagarwal87 (Contributor) left a comment:

LGTM except for some minor comments. Thank you for this contribution, @jihoonson

Assert.assertEquals(1, segments.size());
// only empty string dimensions are ignored currently
Assert.assertEquals(ImmutableList.of("ts", "valDim"), segments.get(0).getDimensions());
Assert.assertEquals(ImmutableList.of("valMet"), segments.get(0).getMetrics());
Contributor:

I was expecting valMet not to be stored. Did I miss something?

Contributor Author:

Null metrics are always stored, as either 0s (default mode) or nulls (SQL-compatible mode).

if (!Objects.equals(merged.getType(), otherSnapshot.getType())
    || !Objects.equals(merged.getComplexTypeName(), otherSnapshot.getComplexTypeName())
    || !Objects.equals(merged.getElementType(), otherSnapshot.getElementType())) {
  throw new ISE(
Contributor:

Can we add other info, such as the complex type name and the element type name, to the exception?

public ImmutableBitmap getBitmapForValue(@Nullable String value)
{
  if (NullHandling.isNullOrEquivalent(value)) {
    return bitmapFactory.complement(bitmapFactory.makeEmptyImmutableBitmap(), rowCountSupplier.getAsInt());
Contributor:

this could be nullBitmapSupplier.get()
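Applying that suggestion, the branch above would become (a sketch, assuming nullBitmapSupplier is a supplier of the cached all-rows bitmap already available in this class):

  if (NullHandling.isNullOrEquivalent(value)) {
    // reuse the cached bitmap that marks every row as null,
    // instead of recomputing the complement on each call
    return nullBitmapSupplier.get();
  }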

@jihoonson (Contributor Author):

I'm merging this PR. Thanks @clintropolis @abhishekagarwal87 for the review!

@jihoonson jihoonson merged commit b6eeef3 into apache:master Mar 23, 2022
TSFenwick pushed a commit to TSFenwick/druid that referenced this pull request Apr 11, 2022
* Store null columns in the segments

* fix test

* remove NullNumericColumn and unused dependency

* fix compile failure

* use guava instead of apache commons

* split new tests

* unused imports

* address comments
@abhishekagarwal87 abhishekagarwal87 added this to the 0.23.0 milestone May 11, 2022