Add external node table #4105
// Bind physical create node table info
auto pkDefinition = getDefinition(propertyDefinitions, extraInfo.pkName);
std::vector<PropertyDefinition> physicalPropertyDefinitions;
physicalPropertyDefinitions.push_back(pkDefinition.copy());
Do you need to `copy` here? `getDefinition` returns a copied one already.
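To illustrate the point of this comment, here is a minimal sketch, assuming `getDefinition` returns by value. `PropertyDefinitionSketch`, `getDefinitionSketch`, and `bindPhysicalDefinitionsSketch` are hypothetical stand-ins, not the project's actual types or functions. If the helper already hands back a fresh copy, the caller can move it into the vector instead of copying a second time.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the real PropertyDefinition (assumption).
struct PropertyDefinitionSketch {
    std::string name;
    PropertyDefinitionSketch copy() const { return PropertyDefinitionSketch{name}; }
};

// Assumed behaviour: looks up a definition by name and returns a copy by value.
PropertyDefinitionSketch getDefinitionSketch(
    const std::vector<PropertyDefinitionSketch>& definitions, const std::string& name) {
    for (const auto& definition : definitions) {
        if (definition.name == name) {
            return definition.copy();
        }
    }
    throw std::invalid_argument("unknown property: " + name);
}

std::vector<PropertyDefinitionSketch> bindPhysicalDefinitionsSketch(
    const std::vector<PropertyDefinitionSketch>& definitions, const std::string& pkName) {
    auto pkDefinition = getDefinitionSketch(definitions, pkName); // already a copy
    std::vector<PropertyDefinitionSketch> physicalDefinitions;
    physicalDefinitions.push_back(std::move(pkDefinition)); // move, no second copy
    return physicalDefinitions;
}
```

Whether the real code can drop the extra `copy()` depends on whether `pkDefinition` is used again later in the function, which the excerpt does not show.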
@@ -124,18 +125,13 @@ std::shared_ptr<Expression> Binder::createPath(const std::string& pathName,
}

static std::vector<std::string> getPropertyNames(const std::vector<TableCatalogEntry*>& entries) {
std::vector<std::string> result;
std::unordered_set<std::string> propertyNamesSet;
auto distinctVector = DistinctVector<std::string>();
The `DistinctVector` usage seems unnecessarily complicated here. Why not just use a set and copy it to a vector at the end?
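A rough sketch of the alternative suggested in this comment, under stated assumptions: `EntrySketch` is a hypothetical stand-in for `TableCatalogEntry`, and `collectPropertyNames` is an illustrative name. One caveat worth checking before adopting it is that `std::set` returns names in sorted order, which may differ from the first-seen order that `DistinctVector` presumably preserves.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-in for a catalog entry; only property names matter here.
struct EntrySketch {
    std::vector<std::string> propertyNames;
};

// Collect each property name once in a set, then copy to a vector at the end.
// Note: the result is in sorted order, not in first-seen order.
std::vector<std::string> collectPropertyNames(const std::vector<EntrySketch>& entries) {
    std::set<std::string> names;
    for (const auto& entry : entries) {
        names.insert(entry.propertyNames.begin(), entry.propertyNames.end());
    }
    return std::vector<std::string>(names.begin(), names.end());
}
```

If the binder relies on a stable, insertion-ordered list of property names, a set plus a separate result vector (or the existing `DistinctVector`) would be the safer choice.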
@@ -260,7 +260,7 @@ void Planner::planNodeIDScan(uint32_t nodePos) {
     auto newSubgraph = context.getEmptySubqueryGraph();
     newSubgraph.addQueryNode(nodePos);
     auto plan = std::make_unique<LogicalPlan>();
-    appendScanNodeTable(node->getInternalID(), node->getTableIDs(), {}, *plan);
+    appendScanNodeTable(node->getInternalID(), {}, node->getEntries(), *plan);
Better to add an inline comment to `{}`.
@@ -296,7 +296,7 @@ void Planner::planRelScan(uint32_t relPos) {
     auto plan = std::make_unique<LogicalPlan>();
     auto [boundNode, nbrNode] = getBoundAndNbrNodes(*rel, direction);
     const auto extendDirection = getExtendDirection(*rel, *boundNode);
-    appendScanNodeTable(boundNode->getInternalID(), boundNode->getTableIDs(), {}, *plan);
+    appendScanNodeTable(boundNode->getInternalID(), {}, boundNode->getEntries(), *plan);
Ditto on adding an inline comment to `{}`.
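The two review comments about `{}` could be addressed along the following lines. This is only a sketch: the types are hypothetical stand-ins, and the parameter order (node ID, properties, entries, plan) is an assumption, since the diff does not show the new signature of `appendScanNodeTable`.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, simplified stand-ins; the real planner types differ.
struct ScanSketch {
    std::string nodeID;
    std::vector<std::string> properties;
    std::vector<std::string> entries;
};

struct PlanSketch {
    std::vector<ScanSketch> scans;
};

// Assumed parameter order (nodeID, properties, entries, plan); not confirmed by the diff.
void appendScanNodeTableSketch(const std::string& nodeID,
    const std::vector<std::string>& properties, const std::vector<std::string>& entries,
    PlanSketch& plan) {
    plan.scans.push_back(ScanSketch{nodeID, properties, entries});
}

void planNodeIDScanSketch(PlanSketch& plan) {
    // The inline comment tells the reader what the empty braces stand for.
    appendScanNodeTableSketch("a._ID", {} /* properties: none, scan the ID only */,
        {"person"}, plan);
}
```

The comment costs nothing at runtime and saves the reader a trip to the function declaration to find out which argument is being left empty.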
Description
This is our initial PR to support executing Cypher directly on a relational database. It contains the logic for node tables only. Major changes include:
Nested catalog entry.
We add a new catalog entry type, `ExternalNodeTable`, which has a nested entry structure. At the parent level, it maintains the logical view of the properties, which aligns with the columns in the relational table. At the child level, it maintains another catalog entry that contains only the primary key property. This child entry aligns with our physical storage.
In the current design, we still need to materialize the primary key and use it as the join condition when we try to read a property that does not exist in storage.
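The nested structure described above might look roughly like the following sketch. All type and function names here (`PropertySketch`, `PhysicalNodeTableEntrySketch`, `ExternalNodeTableEntrySketch`, `makeExternalEntry`, the `_physical` suffix) are hypothetical and chosen for illustration; the actual catalog classes in the PR are not shown in this excerpt.

```cpp
#include <cassert>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

struct PropertySketch {
    std::string name;
    std::string type;
};

// Child entry: mirrors physical storage and holds only the primary key.
struct PhysicalNodeTableEntrySketch {
    std::string name;
    PropertySketch primaryKey;
};

// Parent entry: logical view with one property per relational column.
struct ExternalNodeTableEntrySketch {
    std::string name;
    std::vector<PropertySketch> logicalProperties;
    std::unique_ptr<PhysicalNodeTableEntrySketch> physicalEntry;

    // True only for the primary key, the one property materialized locally.
    bool isStoredLocally(const std::string& propertyName) const {
        return physicalEntry != nullptr && physicalEntry->primaryKey.name == propertyName;
    }
};

ExternalNodeTableEntrySketch makeExternalEntry(std::string name,
    std::vector<PropertySketch> columns, const std::string& pkName) {
    const PropertySketch* pk = nullptr;
    for (const auto& column : columns) {
        if (column.name == pkName) {
            pk = &column;
        }
    }
    if (pk == nullptr) {
        throw std::invalid_argument("primary key column not found: " + pkName);
    }
    // Build the child first, while `name` and `columns` are still intact.
    auto physical = std::make_unique<PhysicalNodeTableEntrySketch>(
        PhysicalNodeTableEntrySketch{name + "_physical", *pk});
    return ExternalNodeTableEntrySketch{std::move(name), std::move(columns), std::move(physical)};
}
```

In this sketch, a property for which `isStoredLocally` returns false would have to be fetched from the external table and joined back on the primary key.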
Scan external table
When we run `MATCH (a:label)` where `label` points to an external relational table, we need to scan both the external relational table and the primary key column materialized in our storage, and then perform a join on the primary key.
Some sanity benchmark numbers:
Setup
LDBC10 Comment table stored in a DuckDB database. 8 threads.
DuckDB native scanning: 0.3s.
Kuzu scanning DuckDB: 2s.
Scanning external database: copy primary key (6s) + join (3s) = 9s.
Being slower than DuckDB is expected, as we need to first materialize DuckDB's result and then re-scan it. The major overhead is in our scanning of the DuckDB result, which @acquamarin should look into to see if we can optimize it further.
Another bottleneck is the copy of the primary key. I'm fairly confident I can bring the time down to ~2s with some optimization.
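For readers unfamiliar with the scan-then-join flow described under "Scan external table", here is a minimal single-threaded sketch of a hash join on the primary key. It is not the PR's operator implementation: `ExternalRowSketch`, `JoinedRowSketch`, and `joinOnPrimaryKey` are hypothetical names, and the real system works on vectorized, multi-threaded pipelines.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// One row materialized from the external relational table (hypothetical layout).
struct ExternalRowSketch {
    std::int64_t pk;
    std::string content;
};

// Result row: the internal node offset plus the externally stored property.
struct JoinedRowSketch {
    std::uint64_t nodeOffset;
    std::int64_t pk;
    std::string content;
};

std::vector<JoinedRowSketch> joinOnPrimaryKey(const std::vector<std::int64_t>& storedPrimaryKeys,
    const std::vector<ExternalRowSketch>& externalRows) {
    // Build side: primary key column materialized in local storage -> node offset.
    std::unordered_map<std::int64_t, std::uint64_t> pkToOffset;
    pkToOffset.reserve(storedPrimaryKeys.size());
    for (std::uint64_t offset = 0; offset < storedPrimaryKeys.size(); ++offset) {
        pkToOffset.emplace(storedPrimaryKeys[offset], offset);
    }
    // Probe side: rows scanned from the external table.
    std::vector<JoinedRowSketch> result;
    for (const auto& row : externalRows) {
        auto it = pkToOffset.find(row.pk);
        if (it != pkToOffset.end()) {
            result.push_back(JoinedRowSketch{it->second, row.pk, row.content});
        }
    }
    return result;
}
```

The sketch also makes the two cost components in the benchmark visible: filling the build side corresponds to handling the materialized primary key, and the probe loop corresponds to the join over the scanned external rows.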
Fixes # (issue)
Contributor agreement