This roadmap outlines the key milestones and focus areas for open-gpdb in 2024/2025. Our main goals are:
stability: fix coredumps and bugs
backport existing features from other projects
adopt pg/gp extensions
Features and Enhancements
Add Rows Out statistics to EXPLAIN ANALYZE so that skew can be calculated for plan steps (see the skew sketch after this list)
Add additional debug info to make it easier to understand what is going on during query execution
Fast temporary tables for GP - do not create catalog entries for temporary objects
GpShrink - like gpexpand, but for decreasing the cluster size
try_cast as in MSSQL - return NULL instead of raising an error when a cast fails (see the sketch below)
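
For the Rows Out / skew item above, a minimal sketch of the kind of per-segment calculation those statistics are meant to support; `t` is a hypothetical distributed table, and here skew is estimated at the table level rather than per plan step:

```sql
-- Count rows per segment and the share each segment holds; a large spread
-- between segments indicates data skew. Per-step skew would apply the same
-- idea to the Rows Out numbers of each plan node.
SELECT gp_segment_id,
       count(*)                                             AS rows_on_segment,
       round(count(*)::numeric / sum(count(*)) OVER (), 3)  AS fraction_of_total
FROM t                  -- hypothetical distributed table
GROUP BY gp_segment_id
ORDER BY rows_on_segment DESC;
```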
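For the try_cast item, a minimal sketch of the intended semantics (MSSQL's TRY_CAST returns NULL instead of raising an error on a failed conversion). The function below is an illustrative PL/pgSQL emulation for the text-to-integer case only, not the proposed implementation:

```sql
CREATE OR REPLACE FUNCTION try_cast_int(v text) RETURNS integer AS $$
BEGIN
    RETURN v::integer;          -- succeeds for valid integer input
EXCEPTION WHEN others THEN
    RETURN NULL;                -- swallow the conversion error, as TRY_CAST does
END;
$$ LANGUAGE plpgsql IMMUTABLE;

SELECT try_cast_int('42');      -- 42
SELECT try_cast_int('abc');     -- NULL instead of an "invalid input syntax" error
```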
Stability fixes
Do not fail when restoring unlogged tables
SIGSEGV in write_message_to_server_log()
Disable FAULT_INJECTOR in production build
Cloud fixes
Yezzey now supports non-default TABLESPACE
Cloud database roles mdb_admin and mdb_superuser - needed for cloud setups where gpadmin cannot be used
Prometheus metrics for yezzey/yproxy
Extensions
pg_cron
pgaudit
sr_plan - similar to Oracle outlines
relation access statistics
database table size statistics
pg_hint_plan
clever list of candidates for ANALYZE/VACUUM - based on how much data has changed since the last ANALYZE/VACUUM, decide whether the relation should be analyzed/vacuumed or not (see the sketch below)
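
For the ANALYZE/VACUUM candidates item, a rough sketch of the ranking idea, assuming the standard pg_stat_all_tables counters are available on the cluster:

```sql
-- Relations with many modifications since the last ANALYZE, or many dead
-- tuples awaiting VACUUM, float to the top of the candidate list.
SELECT schemaname,
       relname,
       n_mod_since_analyze,     -- rows inserted/updated/deleted since last ANALYZE
       n_dead_tup,              -- dead rows not yet reclaimed by VACUUM
       last_analyze,
       last_vacuum
FROM pg_stat_all_tables
ORDER BY n_mod_since_analyze DESC, n_dead_tup DESC
LIMIT 20;
```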
Waiting queue
Auto Scaling: Currently, cdbhash() uses the number of segments as a hash parameter. This leads to scaling issues with gpexpand on a large cluster. We would like to implement an access method similar to AO that materializes the table metadata and decouples the hash range from a particular segment. That would allow us to bind hash ranges to segments instantly by offloading the file to S3 and declaring it bound to the new segment. It could even be left in the cache on the original segment until it is evicted.
As part of our autoscaling strategy, we will be deprecating heap storage for non-catalog tables. This will allow for zero-cost table addition and removal.
Coordination service for sharing tables between clusters, fully S3-backed. It enables reading and writing from multiple clusters, eliminates the need for standby clusters, and opens up new data usage opportunities. Metadata is served by the coordination service when the source cluster is unavailable, and the service also maintains writer locks during writes.
Materialized views incrementally maintained (a.k.a. projections). One of the key strengths of GP is its ability to join large tables. However, distributed hash joins usually lead to significant network utilization for data movement. To mitigate this effect, we can consider materialized views using different distribution keys, which can make the Motion unnecessary. To enable this feature, we will need to introduce a new relation type, the projection - a specific kind of materialized view - and instruct the planner to include it in its considerations (a sketch follows after this list). See also https://github.com/sraoss/pg_ivm and [Feature] Dynamic Tables apache/cloudberry#706.
Caching. Currently, Yezzey relies fully on object storage cache, but some S3 implementations charge customers according to the number of GET requests. To reduce these numbers, we need proper local caching in place.
ANN - approximate nearest neighbor integration for AI. Currently, pgvector with HNSW has taken over the world. We need to integrate this feature into GP as well (see the sketch after this list).
QUIC - QUIC Motion, or at least Motion Compression
Coordination-less mode - Given the metadata table service from (2), we should be able to plan queries on each cluster node
Backup, Offload, and Table-Sharing Storage - Currently, Yezzey stores data in different object storage buckets. However, we can think about creating a backup bucket and using offload buckets in the same way, as homogeneous parts of the storage system. This would allow us to convert a local table to a Yezzey table instantly.
Built-in Time Travel - SELECTing a table at a specific point in the near past should be no problem, unless the visibility map is heavily modified. Also, dropping a table should be no problem; the table is stored in backup in exactly the same format, making table drops equivalent to table evictions from the cache.
Graceful segment shutdown. Do not break running transactions/queries during a planned switchover from a primary segment to its mirror. Instead, save the locks, wait for running queries to complete, switch over to the mirror, open a new connection, and transfer the locks to the newly established connection. Then continue launching queries in the current transaction.
Standby cluster. Support streaming changes from a primary GP cluster to a standby GP cluster.
More than one mirror per segment. This is especially important for large clusters, where failures occur more frequently and a segment could be lost completely.
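
For the projections item above, a hedged sketch of the idea using today's materialized view DDL (Greenplum-style DISTRIBUTED BY; all table and view names are hypothetical). The roadmap item is about making such copies incrementally maintained and picked up by the planner automatically:

```sql
-- Base tables distributed on different keys: joining them on customer_id
-- normally requires a Redistribute Motion of "sales".
CREATE TABLE customers (customer_id int, region text) DISTRIBUTED BY (customer_id);
CREATE TABLE sales (sale_id int, customer_id int, amount numeric) DISTRIBUTED BY (sale_id);

-- A "projection": the same sales data, redistributed by the join key.
CREATE MATERIALIZED VIEW sales_by_customer AS
    SELECT * FROM sales
DISTRIBUTED BY (customer_id);

-- Joining the projection is co-located, so no Motion is needed for the sales data.
SELECT c.region, sum(s.amount)
FROM sales_by_customer s
JOIN customers c USING (customer_id)
GROUP BY c.region;
```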
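For the ANN item, a sketch of the pgvector/HNSW usage this integration would enable, written in pgvector's upstream syntax (not yet available in GP; names and dimensions are illustrative):

```sql
CREATE EXTENSION IF NOT EXISTS vector;

-- Table of embeddings; vector(3) is just a toy dimensionality.
CREATE TABLE items (id bigint, embedding vector(3));

-- HNSW index for approximate nearest-neighbor search (L2 distance).
CREATE INDEX ON items USING hnsw (embedding vector_l2_ops);

-- Top-5 nearest neighbors of a query vector.
SELECT id
FROM items
ORDER BY embedding <-> '[1, 2, 3]'
LIMIT 5;
```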