Rows.objectToNumber: Accept decimals with output type LONG. #15999

gianm · 2024-02-28T17:58:47Z

PR #15615 added an optimization to avoid parsing numbers twice in cases where we know that they should definitely be longs or definitely be doubles. Rather than try parsing as long first, and then try parsing as double, it would use only the parsing routine specific to the requested outputType.

This caused a bug: previously, we would accept decimals like "1.0" or "1.23" as longs, by truncating them to "1". After that patch, we would treat such decimals as nulls when the outputType is set to LONG.

This patch retains the short-circuit for doubles: if outputType is DOUBLE, we only parse the string as a double. But for outputType LONG, this patch restores the old behavior: try to parse as long first, then double.
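For readers following along, the parsing order described above can be sketched roughly as follows. This is an illustration only, not the actual body of Rows.objectToNumber: the class name, the nested OutputType enum, and the tryParseLong / tryParseDouble helpers are all hypothetical stand-ins, and the real method handles more input shapes than a bare string.

```java
public class LenientNumberParsing {
  // Hypothetical stand-in for the requested output type.
  enum OutputType { LONG, DOUBLE }

  // Returns null when the string cannot be read as a number.
  static Number parse(String s, OutputType outputType) {
    if (outputType == OutputType.DOUBLE) {
      // Short-circuit: a double is wanted, so only the double parser runs.
      return tryParseDouble(s);
    }
    // LONG: try the long parser first, then fall back to the double parser.
    Long asLong = tryParseLong(s);
    if (asLong != null) {
      return asLong;
    }
    Double asDouble = tryParseDouble(s);
    if (asDouble == null) {
      return null;
    }
    // Decimals such as "1.23" are accepted and truncated toward zero.
    return Long.valueOf(asDouble.longValue());
  }

  // Hypothetical helper: null instead of an exception on bad input.
  static Long tryParseLong(String s) {
    try {
      return Long.valueOf(Long.parseLong(s));
    } catch (NumberFormatException e) {
      return null;
    }
  }

  // Hypothetical helper: null instead of an exception on bad input.
  static Double tryParseDouble(String s) {
    try {
      return Double.valueOf(Double.parseDouble(s));
    } catch (NumberFormatException e) {
      return null;
    }
  }

  public static void main(String[] args) {
    System.out.println(parse("1.23", OutputType.LONG));       // 1
    System.out.println(parse("12", OutputType.DOUBLE));       // 12.0
    System.out.println(parse("notanumber", OutputType.LONG)); // null
  }
}
```

With only the long parser in the LONG branch, "1.23" would come back as null, which is the regression described above.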

gianm · 2024-02-28T18:07:57Z

This bug is triggered when Rows.objectToNumber is called with outputType set to LONG and with a decimal string input.

The only call to Rows.objectToNumber that uses nonnull outputType is RowBasedColumnSelectorFactory, in a situation where its columnInspector has type info. Most usages of RowBasedColumnSelectorFactory don't satisfy both conditions for this bug: they either don't have type info, or they do have type info but they would not provide a string with type LONG.

AFAICT, the only call sites that meet both these criteria are ExternalSegment (used for EXTERN in SQL-based ingest) and InlineSegmentWrangler (used for inline datasources). So this bug could be triggered by an EXTERN call with a column that is declared as LONG or BIGINT but is naturally string-typed (such as a column from a TSV or CSV file); or an inline datasource where a field is declared as LONG but is provided as a string in the JSON.

pranavbhole · 2024-02-29T01:07:28Z

Need to fix tests: "notanumber" is now 0.0 instead of 0, which should be fine, I guess.

Error: org.apache.druid.data.input.impl.RowsTest.test_objectToNumber_typeDouble_noThrow
Error: Run 1: RowsTest.test_objectToNumber_typeDouble_noThrow:224 notanumber (nothrow) expected:<0> but was:<0.0>
[INFO]
Error: org.apache.druid.data.input.impl.RowsTest.test_objectToNumber_typeFloat_noThrow
Error: Run 1: RowsTest.test_objectToNumber_typeFloat_noThrow:189 notanumber (nothrow) expected:<0> but was:<0.0>

gianm · 2024-02-29T08:38:13Z

Yes, right, the test cases in replace-with-default mode were not looking for the right thing. That's fixed now. Thanks.
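As background on why "expected:<0> but was:<0.0>" counts as a failure at all: an equality assertion on boxed numbers compares by equals, and a boxed Long never equals a boxed Double even when the values match numerically. The snippet below is a standalone illustration of that JDK behavior. It assumes the old expectation was an integer-typed zero, which the log output suggests but does not confirm, and the class name is made up.

```java
public class BoxedZeroEquality {
  // Compares two boxed numbers the way an Object-based equality assertion would.
  static boolean boxedEquals(Number expected, Number actual) {
    return expected.equals(actual);
  }

  // Compares by numeric value instead, ignoring the boxed type.
  static boolean numericEquals(Number expected, Number actual) {
    return expected.doubleValue() == actual.doubleValue();
  }

  public static void main(String[] args) {
    Number longZero = Long.valueOf(0L);
    Number doubleZero = Double.valueOf(0.0);

    System.out.println(boxedEquals(longZero, doubleZero));   // false
    System.out.println(numericEquals(longZero, doubleZero)); // true
  }
}
```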

abhishekagarwal87 · 2024-03-04T10:13:12Z

processing/src/main/java/org/apache/druid/data/input/Rows.java

+    } else if (outputType == ValueType.DOUBLE) {
+      return asNumber.doubleValue();
+    } else {
+      throw new ISE("Cannot read number as type[%s]", outputType);


Suggested change

-      throw new ISE("Cannot read number as type[%s]", outputType);
+      throw new ISE("Cannot read number as type[%s] for field [%s]", outputType, name);

abhishekagarwal87 · 2024-03-04T16:30:59Z

Merged it since the comment was not a blocker.

gianm added Bug Area - Ingestion labels Feb 28, 2024

gianm added this to the 29.0.1 milestone Feb 28, 2024

gianm added 2 commits February 28, 2024 10:08

Style.

8435960

Style.

4cdb8c3

Fix tests.

48aa47e

abhishekagarwal87 reviewed Mar 4, 2024

abhishekagarwal87 approved these changes Mar 4, 2024

abhishekagarwal87 merged commit 376a41f into apache:master Mar 4, 2024
83 checks passed

gianm deleted the otn-lenient branch March 4, 2024 16:58

gianm mentioned this pull request Mar 6, 2024

[Backport] Rows.objectToNumber: Accept decimals with output type LONG. (#15999) #16062

Merged

cryptoe mentioned this pull request Mar 21, 2024

[Draft] 29.0.1 Release Notes #16183

Closed
