feat: propose new features for the Postgres matrix #4

Open
wants to merge 2 commits into
base: main
5 changes: 5 additions & 0 deletions postgres/spec/feature_matrix.md
@@ -39,6 +39,7 @@
| cpccf | Connection pool custom config | boolean | The operator allows the user to supply a custom connection pool configuration for the connection pool service. | Only applies if [pgcl/conpl] is true.<br /> | |
| tlssp | TLS Support | boolean | PostgreSQL connections can be secured with Postgres SSL/TLS support. | | ✓ |
| tlscu | TLS user-provided certificates | boolean | Operators may choose by default to generate self-signed SSL certificates.<br />They may also offer the option to specify the CA and certificates that users want Postgres clusters to use. | | |
| mtlsrep | Mutual TLS support for PostgreSQL replicas | boolean | Operators exclusively rely on TLS certificate authentication and authorization to connect the managed replicas in the cluster. | | |
| crtmg | CertManager integration | boolean | The operator integrates with CertManager in order to generate the certificate to be used with Postgres. | Only applies if [pgcl/tlscu] is true.<br /> | |
| insql | Initialization from SQL scripts | boolean | After the database cluster creation, the operator will run automatically one or more user-supplied scripts for initial DDL or data (possibly limited in size) creation.<br />The operator must properly inform the user of the execution result of the scripts. | | |
| inext | Initialization from external source | boolean | After the database cluster creation, the operator will automatically connect to an external data source (like an object storage or a public repo) and fetch the DDL/data.<br />The operator must properly inform the user of the execution result of the scripts. | | |
@@ -117,6 +118,7 @@
| cudas | Custom dashboards | boolean | In order to display the captured Postgres metrics, the operator provides specialized Postgres dashboards for the users. | | |
| cuale | Custom alerts | boolean | The operator provides bundled specific Postgres alerts to be triggered on the Postgres metrics processed.<br />E.g. there is an alert for transaction wraparound or for unused replication slots. | | |
| exdel | Exposed decorated logs | boolean | The operator provides a mechanism to expose all the logs of the managed Postgres instances to a centralized logging tool.<br />The logs must be decorated with extra metadata in order to provide semantic meaning, including the Pod name and namespace, the cluster name, the role of the Postgres instance (e.g. primary, replica, standby-leader, etc.) and the timestamp that will be available to be used to filter logs entries.<br />There is no need to configure the tool in order to obtain required extra metadata from the logs. | | |
| audit | Audit logs | boolean | The operator provides an integrated way to seamlessly export logs for auditing purposes. | Provide more information about the technology (e.g. pgaudit extension) used for this purpose.<br /> | |
| explg | Export logs | boolean | The operator allows the user to configure an external sink for the Postgres logs (e.g. a SaaS service). | | |
| oo11y | Operator Observability | boolean | The operator is itself a source of telemetry data, potentially including metrics, traces and logs, about its own performance. | | |

@@ -130,17 +132,20 @@
| isbom | Software Bill of Materials | boolean | The operator releases include the SBOM (Software Bill of Materials), a detailed description of all the components, modules, and their dependencies. | SBOM is expected to be in accordance to the [Kubernetes SIG BOM](https://github.com/kubernetes-sigs/bom).<br /> | |
| fgopp | Fine-grained RBAC permissions | boolean | The operator uses a separate serviceaccount that has RBAC permissions that only require the access that is actually needed to create and manage the Kubernetes resources, not more. | | |
| noprm | No or justified privileged mode | boolean | The operator-provided containers do not require privileged mode.<br />The container processes do not run as root. | Reasonable exceptions to this rule can be made for features that require or do not diminish the container's security, e.g. when using eBPF.<br /> | |
| rofs | Read-only file system for image containers | boolean | The root file system of the image containers provided by the operator is read-only, enforcing immutability of the binaries. | | |


## Day 2 Operations – [day2](#day2)
| **ID** | **NAME** | **TYPE** | **DESCRIPTION** | **VENDOR COMPLIANCE** | **MAIN CATEGORY** |
|---|---|---|---|---|---|
| amiup | Automated minor upgrades | boolean | The operator can perform a minor version upgrade of a Postgres cluster automatically.<br />This operation can be managed by the user declaratively. | The operator must provide proper information to the user as to the status and final result of the operation.<br />The operator should provide ongoing status information, and perform the operation with the minimum downtime required.<br />Provide information about the update strategy (i.e. restart of the pods or rolling update followed by a switchover or a restart).<br /> | ✓ |
| cfgchg | PostgreSQL configuration changes | boolean | The operator automatically handles reloads and, where required, restarts of PostgreSQL following any change to the configuration requested by the user.<br />This includes seamless coordination of restarts of the instances after changes to the [hot-standby sensitive parameters](https://www.postgresql.org/docs/current/hot-standby.html#HOT-STANDBY-ADMIN), namely `max_connections`, `max_prepared_transactions`, `max_locks_per_transaction`, `max_wal_senders`, and `max_worker_processes`. | The operator must provide proper information to the user as to the status and final result of the operation, as well as provide ways for the user to control when and how to trigger those changes.<br />Declare how changes of hot-standby sensitive parameters are handled by the operator.<br /> | ✓ |
| amaup | Automated major upgrades | boolean | The operator can perform a major version upgrade of a Postgres cluster automatically.<br />This operation can be managed by the user declaratively. | The operator must provide proper information to the user as to the status and final result of the operation.<br />The operator should provide ongoing status information, and perform the operation with the minimum downtime required.<br /> | ✓ |
| crest | Controlled cluster restart | boolean | Sometimes Postgres needs to be restarted (e.g. changing of a parameter that requires restart).<br />The operator provides means to perform this operation automatically and in a controlled manner (rolling restart) so that the cluster faces a minimal downtime only. | | |
| ociup | Container images upgrade | boolean | Similarly to the controlled restart operation, the operator is capable of updating the running container images (which require a pod restart) automatically and with minimal cluster impact. | | |
| swtch | Switchover | boolean | If HA capabilities are provided, the operator also provides a mechanism for manual switchover.<br />The user may specify the configuration declaratively and the operator will perform the desired switchover automatically, by demoting the current primary, promoting a replica, and updating the endpoints/services as required. | | |
| sqlmi | SQL Migrations | boolean | The operator provides managed SQL migration capabilities.<br />The user may specify SQL scripts that contain migrations (DDL changes, etc) to be deployed to a given database, having the operator apply them automatically. | The operator must report back to the user detailed information about the results of the execution(s) of the script(s) provided by the user.<br /> | |
| fence | Fencing of Postgres instances | boolean | The operator provides a way to stop PostgreSQL instances while keeping the pods running in order to enable investigation of the content of the data directories.<br />This can be very useful for production support and diagnostics, especially in data corruption due to storage issues.<br />Fencing can be requested on a single instance, a set of them or the entire cluster, in a declarative way. The operator must provide a way to resume. | | |
| oday2 | Other Day 2 Operations | string_array | The operator provides support for other managed Day 2 operations. | All the mentioned additional day 2 operations need to be possible via declarative configuration and the operator to fully execute them without further user intervention.<br /> | |


39 changes: 39 additions & 0 deletions postgres/spec/feature_matrix.yaml
@@ -207,6 +207,11 @@ categories:
description: |
Operators may choose by default to generate self-signed SSL certificates.
They may also offer the option to specify the CA and certificates that users want Postgres clusters to use.
- id: mtlsrep
Collaborator

This is not crucial, but so far the convention has been four-character IDs for categories and five-character IDs for features. There is no strong reason to keep it, but if you think IDs like this one and the others in this commit can be made to fit the convention, that would be better.

Contributor Author

We can call it sslr
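
The ID convention mentioned in this thread (four characters for category IDs, five for feature IDs) could be checked mechanically. The following is only a sketch: the function name `nonconforming_ids` is hypothetical, the input is assumed to be the already-parsed content of `feature_matrix.yaml` (a list of categories, each with an `id` and a `features` list), and the placement of `mtlsrep` under `pgcl` in the sample is an assumption.

```python
# Hypothetical helper, not part of the repository.
# Convention taken from the review comment: 4-character category IDs,
# 5-character feature IDs.
CATEGORY_ID_LEN = 4
FEATURE_ID_LEN = 5


def nonconforming_ids(categories):
    """Return (kind, id) pairs whose length does not follow the convention.

    `categories` is assumed to mirror the parsed YAML: a list of dicts,
    each with an "id" and an optional "features" list of dicts with an "id".
    """
    problems = []
    for category in categories:
        if len(category["id"]) != CATEGORY_ID_LEN:
            problems.append(("category", category["id"]))
        for feature in category.get("features", []):
            if len(feature["id"]) != FEATURE_ID_LEN:
                problems.append(("feature", feature["id"]))
    return problems


# Sample data using IDs that appear in this pull request.
# The category assigned to "mtlsrep" is an assumption for illustration.
sample = [
    {"id": "day2", "features": [{"id": "amiup"}, {"id": "cfgchg"}, {"id": "fence"}]},
    {"id": "pgcl", "features": [{"id": "tlscu"}, {"id": "mtlsrep"}]},
]

print(nonconforming_ids(sample))
```

A check like this could run in CI so that the convention, if kept, does not depend on reviewers noticing it.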

name: Mutual TLS support for PostgreSQL replicas
type: boolean
description: |
Operators exclusively rely on TLS certificate authentication and authorization to connect the managed replicas in the cluster.
- id: crtmg
name: CertManager integration
type: boolean
@@ -630,6 +635,18 @@ categories:
The operator provides a mechanism to expose all the logs of the managed Postgres instances to a centralized logging tool.
The logs must be decorated with extra metadata in order to provide semantic meaning, including the Pod name and namespace, the cluster name, the role of the Postgres instance (e.g. primary, replica, standby-leader, etc.) and the timestamp that will be available to be used to filter logs entries.
There is no need to configure the tool in order to obtain required extra metadata from the logs.
- id: stdout
name: JSON logs in stdout
type: boolean
description: |
Each container in a pod should directly export logs in JSON format to the standard output channel as recommended and expected by Kubernetes.
- id: audit
name: Audit logs
type: boolean
description: |
The operator provides an integrated way to seamlessly export logs for auditing purposes.
vendor_compliance: |
Provide more information about the technology (e.g. pgaudit extension) used for this purpose.
- id: explg
name: Export logs
type: boolean
@@ -682,6 +699,11 @@ categories:
The container processes do not run as root.
vendor_compliance: |
Reasonable exceptions to this rule can be made for features that require or do not diminish the container's security, e.g. when using eBPF.
- id: rofs
name: Read-only file system for image containers
type: boolean
description: |
The root file system of the image containers provided by the operator is read-only, enforcing immutability of the binaries.
- id: day2
name: Day 2 Operations
features:
@@ -696,6 +718,16 @@ categories:
The operator should provide ongoing status information, and perform the operation with the minimum downtime required.
Provide information about the update strategy (i.e. restart of the pods or rolling update followed by a switchover or a restart).
main: true
- id: cfgchg
name: PostgreSQL configuration changes
type: boolean
Collaborator

I'd propose making this feature a bit more generic (broader in scope). A restart may be needed for many reasons beyond a change to hot-standby parameters (a frequent case is adding an extension to shared_preload_libraries). For this reason, I'd suggest removing any reference to particular parameters and keeping the description generic, along the lines of "changes to configuration that may require reload or restart" (not a proposed wording, just an illustration of the idea).

The current description also raises some doubt (at least for me) about whether restarts happen as soon as a configuration change is requested. Whether that is a good thing is debatable, compared with leaving more control to the user. For this reason, I'd suggest rewording the description in terms of whether the operator provides a fully automated way to proceed with a configuration reload or restart, giving adequate information to the user, without specifying whether it is triggered automatically or by the user. The comments field of the vendor submission is a great place to add that information.

Collaborator

Actually, if these thoughts are taken into account, this feature would be quite close to day2/crest, possibly becoming a duplicate. Maybe all these ideas can be merged into a single feature, potentially improving the existing day2/crest?

Contributor Author

@gbartolini gbartolini May 29, 2023

I don't agree here. Ensuring and coordinating the restart within a cluster is a feature, an important feature that removes the need for a human operator. For example, if you want to raise max_connections and you have replicas, you need to ensure that this operation is performed first on the standbys and then, last, on the primary. If you decrease the value, it is the opposite.

Ideally, a Kubernetes operator should simulate what a human being would do in this case, but do it in an automated and reliable way, without requiring human intervention - in order to prioritize self-healing and high availability. Then, obviously, you can configure and request human intervention, but I believe that these features should be highlighted, as configuration changes should be as transparent as they can possibly be to the end user and an operator should handle that as part of the resource lifecycle.
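
The ordering rule described in this comment (standbys first when raising a hot-standby sensitive parameter such as max_connections, primary first when lowering it) can be illustrated with a small sketch. The function name `restart_order` and the instance names are hypothetical, and this is not how any particular operator implements it.

```python
# Illustrative sketch only; names are hypothetical.
def restart_order(primary, standbys, old_value, new_value):
    """Return the order in which instances should be restarted after a
    change to a hot-standby sensitive parameter.

    Raising the value: standbys first, primary last.
    Lowering the value: primary first, then the standbys.
    Unchanged value: nothing to restart.
    """
    if new_value > old_value:
        return list(standbys) + [primary]
    if new_value < old_value:
        return [primary] + list(standbys)
    return []


# Raising max_connections from 100 to 200 on a three-instance cluster.
print(restart_order("pg-1", ["pg-2", "pg-3"], old_value=100, new_value=200))
# Lowering it back from 200 to 100.
print(restart_order("pg-1", ["pg-2", "pg-3"], old_value=200, new_value=100))
```

The sketch only captures the ordering; a real operator would also wait for each instance to become ready before moving to the next one.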

Collaborator

Oh, I think we agree on the intent (to show an operator provides capabilities associated with controlling different aspects of the cluster's lifecycle, like a controlled restart operation) but maybe we're confused by the wording.

In particular, my concern is not that automation should be used to perform a careful and correct restart, but rather that the restart is triggered "automatically" (whenever a change is requested) instead of giving the user control over when it runs (after which it runs automatically). In other words, I may want to keep a restart operation on hold until a better time (e.g. 3am, in a traffic valley) rather than have it launched as soon as I edit my max_connections parameter (which I may or may not realize will immediately trigger that event).

So I agree that it's a good thing for this feature to reflect that the restart operation is fully automated, but I'd encourage wording that does not assume the automated operation is triggered immediately, and that allows for it being offered as an option or for the default being the opposite.

But I'll leave the final decision to you. Having given my opinion here, I'll just merge the latest patch that you send :) as your judgment should be well represented here.
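
The idea of holding an automated restart until a better time (such as the 3am example above) could be modelled as a maintenance-window check. This is a sketch under assumptions: the function `restart_allowed`, its parameters, and the notion of a configurable window are hypothetical and are not part of the feature matrix.

```python
from datetime import datetime, time


# Hypothetical policy check, for illustration only.
def restart_allowed(now, window_start, window_end, immediate=False):
    """Decide whether a pending restart may be launched at `now`.

    immediate=True models restarting as soon as the change is requested.
    Otherwise the restart is held until `now` falls inside the window
    [window_start, window_end), which may wrap past midnight.
    """
    if immediate:
        return True
    current = now.time()
    if window_start <= window_end:
        return window_start <= current < window_end
    # Window wraps past midnight, e.g. 23:00 to 01:00.
    return current >= window_start or current < window_end


# A change requested at 14:00 is held; at 03:30 it may proceed.
print(restart_allowed(datetime(2023, 5, 29, 14, 0), time(3, 0), time(4, 0)))
print(restart_allowed(datetime(2023, 5, 29, 3, 30), time(3, 0), time(4, 0)))
```

Either default (immediate or deferred) fits this shape, which is why the thread suggests the feature description should not assume one of them.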

Contributor Author

Ok to soften the wording and suggest that users have the choice.

Contributor Author

Please check

description: |
The operator automatically handles reloads and, where required, restarts of PostgreSQL following any change to the configuration requested by the user.
This includes seamless coordination of restarts of the instances after changes to the [hot-standby sensitive parameters](https://www.postgresql.org/docs/current/hot-standby.html#HOT-STANDBY-ADMIN), namely `max_connections`, `max_prepared_transactions`, `max_locks_per_transaction`, `max_wal_senders`, and `max_worker_processes`.
vendor_compliance: |
The operator must provide proper information to the user as to the status and final result of the operation.
Declare how changes of hot-standby sensitive parameters are handled by the operator.
main: true
- id: amaup
name: Automated major upgrades
type: boolean
@@ -731,6 +763,13 @@ categories:
The user may specify SQL scripts that contain migrations (DDL changes, etc) to be deployed to a given database, having the operator apply them automatically.
vendor_compliance: |
The operator must report back to the user detailed information about the results of the execution(s) of the script(s) provided by the user.
- id: fence
name: Fencing of Postgres instances
type: boolean
description: |
The operator provides a way to stop PostgreSQL instances while keeping the pods running in order to enable investigation of the content of the data directories.
This can be very useful for production support and diagnostics, especially in data corruption due to storage issues.
Fencing can be requested on a single instance, a set of them or the entire cluster, in a declarative way. The operator must provide a way to resume.
- id: oday2
name: Other Day 2 Operations
type: string_array