
Store null columns in the segments #12279

Merged: 8 commits into apache:master on Mar 23, 2022
Conversation

@jihoonson (Contributor) commented Feb 24, 2022

Terminology

  • Column: a logical column that can be stored across multiple segments.
  • Null column: a column that has only nulls in it. Druid is aware of this column.
  • Unknown column: a column that Druid is not aware of. In other words, it is a column that Druid does not track at all, via segment metadata or any other mechanism.

Description

Today, null columns are not stored at ingestion time, so they become unknown columns once the ingestion job is done. The Druid native query engine uses the segment-level schema and treats unknown columns as if they were null columns; reading an unknown column returns only nulls.

Druid SQL is different. Druid uses the Calcite SQL planner, which requires valid column information at planning time. That column information comes from the datasource-level schema, which is dynamically discovered by merging segment schemas. As a result, users cannot query unknown columns using SQL. This causes a couple of issues; one of the main ones is SQL queries failing intermittently against stream ingestion. While it creates segments, the realtime task announces a realtime segment that has all columns in the ingestion spec, so it reports even null columns to the broker, and the Druid SQL planner can use them. Once the segment is handed off to a historical, the historical announces a segment that does not store any null columns. As a result, the same SQL query no longer works after handoff.
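For example, suppose a dimension was added to the ingestion spec of a streaming datasource but only null values for it have arrived so far (the datasource and column names below are hypothetical). A query like the following plans successfully while the segment is realtime, but fails with an unknown-column error after the segment is handed off:

  SELECT newly_added_dim, COUNT(*)
  FROM stream_datasource
  GROUP BY newly_added_dim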

Proposed solution

To make the SQL planner aware of null columns, Druid needs to keep track of them. This PR proposes storing those null columns in the segment just like normal columns.

Feature flag

A new system property, druid.indexer.task.storeEmptyColumns, is added and is on by default. A new task context key, storeEmptyColumns, is also added and can override the system property per task.
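For illustration, the cluster-wide default might be set in runtime.properties and then disabled for a single task through its context (a minimal sketch; the surrounding task spec is omitted):

  # runtime.properties (this is the default)
  druid.indexer.task.storeEmptyColumns=true

and, inside an individual task spec, the per-task override:

  "context": {
    "storeEmptyColumns": false
  }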

Ingestion tasks

When storeEmptyColumns is enabled, the task stores every column specified in the DimensionsSpec in the segments it creates, even when a column contains only nulls; see the sketch below. This applies to all ingestion types except Hadoop ingestion and Tranquility.
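For example, given a dimensionsSpec fragment like the following (the column names here are made up), the created segments would contain a country column even if every ingested row had a null country:

  "dimensionsSpec": {
    "dimensions": ["page", "country"]
  }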

Segment writes/reads

For null columns, Druid stores the column name, column type, number of rows, and bitmapSerdeFactory. The first two are stored in the ColumnDescriptor and the last two in NullColumnPartSerde. NullColumnPartSerde has a no-op serializer and a deserializer that can dynamically create a bitmap index and a dictionary. Finally, the names of null columns are stored at the end of index.drd, separately from normal columns, for compatibility with older historicals: when an older historical reads a segment that has null columns stored, it will not be aware of those columns and will simply ignore them instead of exploding. A conceptual sketch of the deserializer's job follows.
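The deserializer needs very little state to reconstruct the column: essentially the row count and a bitmap factory. The plain-Java sketch below illustrates the idea; the class and method names are illustrative only (not the actual Druid API), and java.util.BitSet stands in for Druid's ImmutableBitmap:

  import java.util.BitSet;

  // Illustrative stand-in for the behavior NullColumnPartSerde provides:
  // a no-op writer plus a reader that materializes an all-null column
  // from nothing but the row count.
  final class NullColumnSketch
  {
    private final int numRows;

    NullColumnSketch(int numRows)
    {
      this.numRows = numRows;
    }

    // Serializer side: nothing to write, since a null-only column has no per-row data.
    void serialize()
    {
      // intentionally a no-op
    }

    // Reader side: the null bitmap marks every row as null.
    BitSet makeNullBitmap()
    {
      BitSet bitmap = new BitSet(numRows);
      bitmap.set(0, numRows);
      return bitmap;
    }

    // Reader side: the dictionary has exactly one entry, null, at id 0.
    String lookupDictionaryValue(int id)
    {
      return null;
    }
  }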

Test plan

Future work

  • Currently, null numeric dimensions are always stored even without this change. I would call this a bug because 1) the behavior doesn't match string dimensions, and 2) all the nulls are currently stored in the segment file along with the null bitmap index and then read during query processing, which is unnecessary and inefficient.
  • Hadoop ingestion may be supported later.

Key changed/added classes in this PR
  • NullColumnPartSerde
  • IndexIO
  • IndexMergerV9

This PR has:

  • been self-reviewed.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.

@jihoonson (Contributor Author):

To reviewers, my apologies. I created a new PR because of some issue.

Comment on lines 116 to 117
.setNumericColumnSupplier(Suppliers.ofInstance(nullNumericColumn))
.setDictionaryEncodedColumnSupplier(Suppliers.ofInstance(nullDictionaryEncodedColumn));
Member:

This isn't correct; both of these statements set the column supplier in the builder. I think you probably only need the dictionary-encoded column supplier, which I think should behave correctly, and then you can drop NullNumericColumn?

Also, does this column have no type?

Contributor Author:

Ah good catch. I planned to fix it but forgot until you pointed it out. I removed NullNumericColumn. The column type is set by ColumnDescriptor.

@@ -99,6 +99,10 @@
     <groupId>commons-net</groupId>
     <artifactId>commons-net</artifactId>
   </dependency>
+  <dependency>
+    <groupId>org.apache.commons</groupId>
+    <artifactId>commons-compress</artifactId>
Member:

what uses this?

Contributor Author:

I remember adding it because Travis complained about it, but it seems to be no longer needed. Removed it now.

Contributor Author:

Turns out I was using a util class in this library to create an array list from an iterator. Fixed it to use Guava instead.

Comment on lines +982 to +984
if (merged.hasBitmapIndexes() != otherSnapshot.hasBitmapIndexes()) {
  merged.setHasBitmapIndexes(false);
}
Member:

is this line the reason that you can't use the merge function of ColumnCapabilities? if so that seems pretty sad, I wonder if there is a way to make it work for all uses...

Contributor Author:

This is actually the method I migrated from ColumnCapabilities.merge(), with this line modified. ColumnCapabilities.merge() is only used for merging segments in IndexMergerV9 today, and I don't want other people to mistakenly use it after I modify this line.

{
  throw new RuntimeException("This method should not be called for null-only columns");
}

Member:

oops, I think you should implement makeVectorValueSelector and makeVectorObjectSelector here as well to return NilVectorSelector for missing numeric and complex columns

Contributor Author:

I don't think they are strictly necessary in this PR, since numeric columns don't use NullColumnPartSerde yet even when the column is completely empty, as noted in the "Future work" section of the PR description. If you agree, I'd like to add them in a follow-up PR when I fix numeric columns.

@FrankChen021 (Member):

If I understand correctly, this PR fixes #11386

@abhishekagarwal87 (Contributor) left a comment:

LGTM except for some minor comments. Thank you for this contribution, @jihoonson

Assert.assertEquals(1, segments.size());
// only empty string dimensions are ignored currently
Assert.assertEquals(ImmutableList.of("ts", "valDim"), segments.get(0).getDimensions());
Assert.assertEquals(ImmutableList.of("valMet"), segments.get(0).getMetrics());
Contributor:

I was expecting valMet not to be stored. Did I miss something?

Contributor Author:

Null metrics are always stored, as either 0s (default mode) or nulls (SQL-compatible mode).

if (!Objects.equals(merged.getType(), otherSnapshot.getType())
    || !Objects.equals(merged.getComplexTypeName(), otherSnapshot.getComplexTypeName())
    || !Objects.equals(merged.getElementType(), otherSnapshot.getElementType())) {
  throw new ISE(
Contributor:

Can we add other info, such as the complex type name and the element type name, to the exception?

public ImmutableBitmap getBitmapForValue(@Nullable String value)
{
  if (NullHandling.isNullOrEquivalent(value)) {
    return bitmapFactory.complement(bitmapFactory.makeEmptyImmutableBitmap(), rowCountSupplier.getAsInt());
Contributor:

this could be nullBitmapSupplier.get()
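Applying that suggestion, the branch above would become (a sketch, assuming nullBitmapSupplier is a supplier of the cached all-rows bitmap already available in this class):

  if (NullHandling.isNullOrEquivalent(value)) {
    // reuse the cached bitmap that marks every row as null,
    // instead of recomputing the complement on each call
    return nullBitmapSupplier.get();
  }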

@jihoonson (Contributor Author):

I'm merging this PR. Thanks @clintropolis @abhishekagarwal87 for the review!

@jihoonson jihoonson merged commit b6eeef3 into apache:master Mar 23, 2022
TSFenwick pushed a commit to TSFenwick/druid that referenced this pull request Apr 11, 2022
* Store null columns in the segments

* fix test

* remove NullNumericColumn and unused dependency

* fix compile failure

* use guava instead of apache commons

* split new tests

* unused imports

* address comments
@abhishekagarwal87 abhishekagarwal87 added this to the 0.23.0 milestone May 11, 2022