Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TPCH connector return different results on prestissimo for l_quantity #24010

Closed
kewang1024 opened this issue Nov 12, 2024 · 12 comments
Closed

TPCH connector return different results on prestissimo for l_quantity #24010

kewang1024 opened this issue Nov 12, 2024 · 12 comments
Assignees
Labels

Comments

@kewang1024
Copy link
Collaborator

select l_quantity, l_orderkey from tpch.sf1.lineitem where l_orderkey = 321030;

Java

_quantity | l_orderkey 
------------+------------
       0.04 |     321030 
       0.17 |     321030 
       0.04 |     321030 
       0.04 |     321030 
       0.48 |     321030

Prestissimo

l_quantity | l_orderkey
------------+------------
        4.0 |     321030
       17.0 |     321030
        4.0 |     321030
        4.0 |     321030
       48.0 |     321030
@kewang1024 kewang1024 added the bug label Nov 12, 2024
@github-project-automation github-project-automation bot moved this to 🆕 Unprioritized in Bugs and support requests Nov 12, 2024
@kewang1024 kewang1024 changed the title TPCH table scan return different results TPCH table scan return different results on prestissimo Nov 12, 2024
@kewang1024
Copy link
Collaborator Author

kewang1024 commented Nov 12, 2024

cc: @amitkdutta, @aditi-pandit

@amitkdutta
Copy link
Contributor

CC: @pramodsatya

I just had a chat with @aditi-pandit, she confirmed its a data generation issue - rather than a table reader bug. Its coming from here
https://github.com/facebookincubator/velox/blob/main/velox/tpch/gen/TpchGen.cpp#L80

@kewang1024 kewang1024 changed the title TPCH table scan return different results on prestissimo TPCH connector return different results on prestissimo for l_quantity Nov 12, 2024
@pramodsatya
Copy link
Contributor

Yes this is a data generation issue and the velox Tpch connector seems to be generating incorrect data (when compared with Presto). For the column l_quantity, this seems to be because of the decimalToDouble function. The data generated with the Prestissimo Tpch connector for other columns in lineitem such as l_partkey and l_suppkey, and columns in other tables such as part, also do not match with that generated by Presto so this needs further investigation. We need to compare the data generated by the velox connector against the specs and against Duckdb to ascertain if the issue is with Presto or the Duckdb dbgen tool used by the velox connector. Will update the issue with my findings from this comparison.

Also as per the Tpch specs, l_quantity and other columns which are of type double in Presto should have type Decimal instead (found an old issue reg this: #3557). So the table schema generated by Presto Tpch connector also needs to be modified. I noticed that class TpchColumnType, from the airlift dependency used by the Presto Tpch connector, doesn't support Decimal types so also need to look at how Decimal support would be added here.

@amitkdutta
Copy link
Contributor

Thanks @pramodsatya. Is this one similar issue as well? #24011

@pramodsatya
Copy link
Contributor

Thanks @pramodsatya. Is this one similar issue as well? #24011

Yes @amitkdutta, the checksum mismatch here appears to be because of data mismatch between Presto and Prestissimo Tpch connectors for column o_comment, this mismatch is seen in Varchar columns from other tables as well.
@majetideepak, could you please share if the data mismatch for Varchar columns is expected and is a known issue?

@majetideepak
Copy link
Collaborator

@pramodsatya, I know the Varchar data is mismatched as well. I think it should be the same as Java. Can you confirm this against DuckDB as well?

@majetideepak
Copy link
Collaborator

majetideepak commented Nov 12, 2024

decimalToDouble

@pramodsatya can we fix this to bring parity with Presto Java? I think double would be okay since the decimal scale seems to be atmost 2.

@pramodsatya
Copy link
Contributor

@pramodsatya, I know the Varchar data is mismatched as well. I think it should be the same as Java. Can you confirm this against DuckDB as well?

@majetideepak, yes the Varchar data is mismatched too and it differs from Presto Java, this is from checking data with SF1 for following columns from three tables: p_comment in part, o_comment in orders and l_comment in lineitem. The data generated by Presto Java matches that of the spec, and the data generated by Prestissimo matches that of Duckdb; so it looks like the data generated by Prestissimo (and Duckdb) is incorrect.

@pramodsatya
Copy link
Contributor

pramodsatya commented Nov 12, 2024

decimalToDouble

@pramodsatya can we fix this to bring parity with Presto Java? I think double would be okay since the decimal scale seems to be atmost 2.

Sure, I opened a PR in velox to fix the data generated for column l_quantity. Could you please help review the change?
facebookincubator/velox#11511

Upon checking the Tpch SF1 data generated by Prestissimo and Presto and comparing them against Duckdb and the spec, it looks like there is a parity only for Varchar columns and the column l_quantity. The data mismatch for other columns in lineitem noticed before was with tiny scale factor and is not seen with SF1.

@majetideepak
Copy link
Collaborator

The data generated by Presto Java matches that of the spec, and the data generated by Prestissimo matches that of Duckdb; so it looks like the data generated by Prestissimo (and Duckdb) is incorrect.

@pramodsatya Can you clarify this further? How different are the Java and Prestissimo results after your fix? All SF should have the same data.

@pramodsatya
Copy link
Contributor

pramodsatya commented Nov 12, 2024

The data generated by Presto Java matches that of the spec, and the data generated by Prestissimo matches that of Duckdb; so it looks like the data generated by Prestissimo (and Duckdb) is incorrect.

@pramodsatya Can you clarify this further? How different are the Java and Prestissimo results after your fix? All SF should have the same data.

@majetideepak, apologies for the confusion. This comment was only with respect to the Varchar columns containing comments. After this fix, except for the Varchar columns with the field comment, rest of the columns generated with Presto and Prestissimo match for SF1. The incorrect Varchar columns generated by Duckdb's dbgen would still need to be fixed (for both SF1 and SF 0.01).
For tiny SF, there is a parity between Presto and Prestissimo for the Varchar columns with the field comment, and for other lineitem columns such as l_partkey and l_suppkey. However, the data generated with Presto Java matches the spec and the mismatch in Prestissimo appears to be because of Duckdb's dbgen tool. Tiny SF doesn't seem to be supported in Duckdb's Tpch extension; the data generated for tiny SF by Prestissimo is the same as the SF1 data for columns such as l_partkey and l_suppkey but with reduced row count. While working on the Tpcds connector as well, we had noticed the minimum scale factor supported by Duckdb was 1 and changes had to be made in Duckdb's dsdgen to generate data with tiny SF. Likely similar changes are needed for dbgen too.

Is it fine if we take up all these dbgen fixes in a separate PR?

@aditi-pandit
Copy link
Contributor

@pramodsatya : Yes, lets handle the varchar issues in #24011

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Status: Done
Development

Successfully merging a pull request may close this issue.

5 participants