Use FDW to query multiple servers as shards #320
Conversation
Hiiii, @oppenheimer01 welcome!🎊 Thanks for taking the effort to make our project better! 🙌 Keep making such awesome contributions!
Hi, @oppenheimer01 thanks for your contribution!
The ic-singlenode-test reports some errors.
Got it.
This commit mainly meets the needs of users who want to query multiple clusters as external shards. FDW treats each data source as a whole without knowing its internal structure, which keeps requesters and data sources properly decoupled and maintains generality.
Add a new catalog table, pg_foreign_table_seg, to enable multiple shards in a foreign table. The foreign table is treated as a sharded table with strewn locus, and each QE scanning the foreign table gets a shard from pg_foreign_table_seg.
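As a rough sketch of the idea (the column names below are assumptions for illustration, not the exact catalog definition from this PR), the new catalog maps each foreign table to its shards, one row per shard, recording which foreign server serves it:

```sql
-- Hypothetical view of pg_foreign_table_seg; column names are made up.
-- ftsrelid    : OID of the sharded foreign table
-- ftssegindex : ordinal of the shard within the foreign table
-- ftsserver   : OID of the foreign server holding this shard
SELECT ftsrelid::regclass AS foreign_table,
       ftssegindex        AS shard_no,
       ftsserver          AS server_oid
FROM   pg_foreign_table_seg
ORDER  BY ftsrelid, ftssegindex;
```

Each QE assigned to scan the foreign table then picks up one of these rows and connects only to that shard's server.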
Because the size of the computing cluster and the number of shards of the foreign table may differ, a flexible gang is used to generate the same number of scan nodes as there are foreign table shards.
Because the data bandwidth between data centers is limited, we need to reduce FDW data transfer as much as possible. Pushing execution nodes down to the remote end wherever possible reduces data transmission.
If all tables of a subtree are distributed across the same collection of foreign servers, the subtree can be pushed down. But in MPP FDW we must also know that a table only joins shards on the same foreign server, so a new system attribute, gp_foreign_server, was added to foreign tables. If the user adds "t1.gp_foreign_server = t2.gp_foreign_server" to the join condition, the join can be pushed down.
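For example (the table names are made up for illustration; only the gp_foreign_server attribute comes from this PR), a join between two sharded foreign tables becomes eligible for pushdown once the user asserts co-location via the new system attribute:

```sql
-- Without the co-location qualifier, each shard of t1 might need to be
-- joined against every shard of t2, so the join must run locally.
-- With it, the planner knows matching rows live on the same foreign
-- server, and the whole join can be shipped to each remote server.
SELECT t1.id, t1.val, t2.val
FROM   orders_fdw   t1
JOIN   lineitem_fdw t2
  ON   t1.id = t2.order_id
 AND   t1.gp_foreign_server = t2.gp_foreign_server;
```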
We can only push down the first stage of two-stage aggregation. Multi-stage aggregation uses intermediate (transition) types. Some of these are external types that can be output externally, such as those of count, min, max, and sum; for these, the intermediate and final types are identical. Others are more complex internal types: for avg, the intermediate type differs from the final type and must be converted with a final function. Since the local node in FDW acts as a standard client exchanging data with the remote server, these internal types cannot be transmitted, so aggregate functions such as avg are not pushed down for now.
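A hedged illustration of the distinction (the table and column are hypothetical, and the rewrite is a workaround a user could apply, not necessarily what this PR's planner does):

```sql
-- count/sum have transition states that are plain bigint/numeric values,
-- so each remote server can compute a partial result and the local node
-- simply combines the partials.
SELECT count(*), sum(amount) FROM orders_fdw;      -- first stage pushable

-- avg's transition state is an internal type that cannot cross the
-- client protocol, so it is not pushed down. Assuming amount is numeric,
-- an equivalent query built from shippable partials is:
SELECT sum(amount) / count(amount) AS avg_amount FROM orders_fdw;
```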
fix #ISSUE_Number
Change logs
Describe your change clearly, including what problem is being solved or what feature is being added.
If it breaks backward or forward compatibility, please clarify.
Why are the changes needed?
Describe why the changes are necessary.
Does this PR introduce any user-facing change?
If yes, please clarify the previous behavior and the change this PR proposes.
How was this patch tested?
Please detail how the changes were tested, including manual tests and any relevant unit or integration tests.
Contributor's Checklist
Here are some reminders and a checklist to go through before/when submitting your pull request; please check them:
make installcheck
make -C src/test installcheck-cbdb-parallel
Ping the cloudberrydb/dev team for review and approval when your PR is ready🥳