[optimization] Optimize hash table build #5300

@stdpain

Describe
In the original logic, the hash table uses a vector-like structure to store the actual data. While the hash table is being built, roughly a quarter of the time can go to repeatedly copying that data, and the cost grows with the number of build columns. I changed the storage to a raw pointer to avoid the extra copy overhead. This gives a clear improvement in the hash table construction phase: in the test below, build time drops from 1s428ms to 658.288ms.
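The issue does not include the code, so the following is only a minimal sketch of the general idea, not the actual Doris implementation. It assumes fixed-width rows and a row count known before the build starts; `BuildRow`, `build_with_vector`, and `RawRowBuffer` are hypothetical names introduced for illustration. The first function shows where a growable buffer pays for copies, and the class shows a preallocated raw-pointer buffer that writes each row exactly once.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical fixed-width build row; not the real Doris row layout.
struct BuildRow {
    std::int64_t key;
    std::int64_t payload;
};

// Vector-like build: the buffer grows on demand, so each capacity
// increase reallocates and copies every row stored so far.
// Returns how many times the buffer had to grow.
std::size_t build_with_vector(const BuildRow* input, std::size_t n,
                              std::vector<BuildRow>& out) {
    std::size_t growths = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (out.size() == out.capacity()) {
            ++growths;  // the next push_back must reallocate
        }
        out.push_back(input[i]);
    }
    return growths;
}

// Raw-pointer build: one allocation sized up front, rows written in place.
// Assumes the caller knows the number of build rows in advance.
class RawRowBuffer {
public:
    explicit RawRowBuffer(std::size_t capacity)
        : data_(new BuildRow[capacity]), capacity_(capacity), size_(0) {}
    ~RawRowBuffer() { delete[] data_; }

    RawRowBuffer(const RawRowBuffer&) = delete;
    RawRowBuffer& operator=(const RawRowBuffer&) = delete;

    // Writes one row into the next free slot; returns false when full.
    bool append(const BuildRow& row) {
        if (size_ == capacity_) {
            return false;
        }
        data_[size_++] = row;
        return true;
    }

    const BuildRow* data() const { return data_; }
    std::size_t size() const { return size_; }

private:
    BuildRow* data_;
    std::size_t capacity_;
    std::size_t size_;
};
```

Calling `reserve` on a `std::vector` would also avoid the regrowth copies; the raw buffer is shown only because the issue describes switching to a raw pointer.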

Here is my test case. LINE_ORDER and LINE_ORDER_V2 are from the SSB dataset:

SELECT count(*) FROM LINE_ORDER t1 join LINE_ORDER_V2 t2 WHERE t1.LO_ORDERKEY=t2.LO_ORDERKEY;

| Type | Right Table Rows | Build Time | Probe Time | Time Cost (s) |
| --- | --- | --- | --- | --- |
| After | 6001215 | 658.288ms | 1s451ms | 4.07 |
| Before | 6001215 | 1s428ms | 1s512ms | 4.69 |
