
The records are not aligned between spark orc reader/writer and generic orc reader/writer. #1269


Description

@openinx

I tried to write a unit test: it first generates a few generic Records and writes them to an ORC file1. A Spark reader then opens this file and reads it, and finally writes the rows to another ORC file2.
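For reference, here is a rough sketch of that round trip. It is not the exact test code: the record count, seed, and file handling are placeholders, and the builder calls follow Iceberg's usual ORC test helpers, so treat the names as assumptions.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.iceberg.Files;
import org.apache.iceberg.Schema;
import org.apache.iceberg.data.RandomGenericData;
import org.apache.iceberg.data.Record;
import org.apache.iceberg.data.orc.GenericOrcWriter;
import org.apache.iceberg.io.CloseableIterable;
import org.apache.iceberg.io.FileAppender;
import org.apache.iceberg.orc.ORC;
import org.apache.iceberg.spark.data.SparkOrcReader;
import org.apache.spark.sql.catalyst.InternalRow;

public class OrcRoundTripSketch {

  // Writes generic Records to file1 with GenericOrcWriter, then reads the same
  // file back through SparkOrcReader so the two results can be compared.
  static void roundTrip(Schema schema, File file1) throws IOException {
    List<Record> expected = RandomGenericData.generate(schema, 100, 0L);

    // Step 1: write the generic records to the first ORC file.
    try (FileAppender<Record> writer = ORC.write(Files.localOutput(file1))
        .schema(schema)
        .createWriterFunc(GenericOrcWriter::buildWriter)
        .build()) {
      writer.addAll(expected);
    }

    // Step 2: open the same file with the Spark ORC reader.
    List<InternalRow> rows = new ArrayList<>();
    try (CloseableIterable<InternalRow> reader = ORC.read(Files.localInput(file1))
        .project(schema)
        .createReaderFunc(readSchema -> new SparkOrcReader(schema, readSchema))
        .build()) {
      reader.forEach(rows::add);
    }

    // Step 3: comparing the rows field by field (as TestHelpers.assertEquals does)
    // is where the decimal(11, 2) column comes back with the wrong scale. The
    // actual test then writes the rows to a second ORC file, omitted here.
  }
}
```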

There seem to be some bugs here, because the Spark reader fails to get the same result as the generic record reader. It throws an exception like this:

Value should match expected: schema.dec_11_2 expected:<623.9> but was:<62.39>

java.lang.AssertionError: Value should match expected: schema.dec_11_2 expected:<623.9> but was:<62.39>
	at org.junit.Assert.fail(Assert.java:88)
	at org.junit.Assert.failNotEquals(Assert.java:834)
	at org.junit.Assert.assertEquals(Assert.java:118)
	at org.apache.iceberg.spark.data.TestHelpers.assertEquals(TestHelpers.java:631)
	at org.apache.iceberg.spark.data.TestHelpers.assertEquals(TestHelpers.java:641)
	at org.apache.iceberg.spark.data.TestHelpers.assertEquals(TestHelpers.java:612)
	at org.apache.iceberg.spark.data.TestHelpers.assertEquals(TestHelpers.java:599)
	at org.apache.iceberg.spark.data.TestSparkRecordOrcReaderWriter.writeAndValidate(TestSparkRecordOrcReaderWriter.java:86)
	at org.apache.iceberg.spark.data.AvroDataTest.testSimpleStruct(AvroDataTest.java:67)
	at java.lang.Thread.run(Thread.java:748)

After checking the Iceberg code, I found that the Hive decimal decreases its scale by removing trailing zeros (see here), while our GenericOrcWriter and SparkOrcWriter do not account for this case, so we mess up the scale of the decimal.
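To make the scale problem concrete, here is a small self-contained sketch. It is not the committed fix; the decimal(11, 2) column, the printed values, and the setScale call at the end are illustrative assumptions based on the trailing-zero behavior described above.

```java
import java.math.BigDecimal;

import org.apache.hadoop.hive.common.type.HiveDecimal;

public class DecimalScaleSketch {
  public static void main(String[] args) {
    // The Iceberg column is decimal(11, 2), so the generic Record carries scale 2.
    BigDecimal original = new BigDecimal("623.90");

    // HiveDecimal trims the trailing zero, so the stored value has scale 1.
    HiveDecimal hive = HiveDecimal.create(original);
    System.out.println(hive.scale());          // 1
    System.out.println(hive.unscaledValue());  // 6239

    // A reader that re-applies the schema scale (2) to the unscaled value
    // reproduces the reported failure: 6239 at scale 2 is 62.39, not 623.90.
    System.out.println(new BigDecimal(hive.unscaledValue(), 2));  // 62.39

    // One possible fix: rescale the BigDecimal back to the schema scale
    // after converting from the Hive decimal.
    System.out.println(hive.bigDecimalValue().setScale(2));       // 623.90
  }
}
```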

The unit test is here.

FYI @rdsr @rdblue @shardulm94
