Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cleanup routine using table partitioning #659

Open
eskebab opened this issue Feb 11, 2025 · 7 comments
Open

Cleanup routine using table partitioning #659

eskebab opened this issue Feb 11, 2025 · 7 comments
Assignees

Comments

@eskebab
Copy link
Contributor

eskebab commented Feb 11, 2025

Problem
We want to keep data for as long as necessary to do proper analysis. When log entries are considered useless for troubleshooting purposes, they should be deleted or partitioned

Solution
We implement database table partitioning using built-in partitioning in postgres. We can use existing cron.schedule function IF we are able to get next month from context

@eskebab eskebab changed the title cleanup routine Cleanup routine Feb 13, 2025
@eskebab eskebab changed the title Cleanup routine Cleanup routine using table partitioning Feb 25, 2025
@eskebab eskebab self-assigned this Feb 28, 2025
@eskebab
Copy link
Contributor Author

eskebab commented Mar 4, 2025

Following these guidelines for creating a partitioned table:
https://www.postgresql.org/docs/current/ddl-partitioning.html

primary key must include partitioned column

To create a unique or primary key constraint on a partitioned table, the partition keys must not include any expressions or function calls and the constraint's columns must include all of the partition key columns. This limitation exists because the individual indexes making up the constraint can only directly enforce uniqueness within their own partitions; therefore, the partition structure itself must guarantee that there are not duplicates in different partitions.

create index on range column

Create an index on the key column(s), as well as any other indexes you might want, on the partitioned table. (The key index is not strictly necessary, but in most scenarios it is helpful.) This automatically creates a matching index on each partition, and any partitions you create or attach later will also have such an index. An index or unique constraint declared on a partitioned table is “virtual” in the same way that the partitioned table is: the actual data is in child indexes on the individual partition tables.

Ensure that the enable_partition_pruning configuration parameter is not disabled in postgresql.conf. If it is, queries will not be optimized as desired.

@eskebab
Copy link
Contributor Author

eskebab commented Mar 4, 2025

CREATE TABLE events.trace_log_y2024m03 PARTITION OF events.trace_log
FOR VALUES FROM ('2025-03-01') TO ('2025-04-01');

creates a range partition where this insert will fall into partition:
INSERT INTO events.trace_log(
cloudeventid, resource, eventtype, consumer, "time", subscriptionid, responsecode, subscriberendpoint, activity)
VALUES ('ec10976a-e507-4ef3-9ee8-1e44a52501be', 'resource', 'eventtype', 'consumer', '2025-03-01', 1, 200, 'http://localhost', 'TEST');

while this entry will create an error, given that the above partition is the only available partition:
INSERT INTO events.trace_log(
cloudeventid, resource, eventtype, consumer, "time", subscriptionid, responsecode, subscriberendpoint, activity)
VALUES ('ec10976a-e507-4ef3-9ee8-1e44a52501be', 'resource', 'eventtype', 'consumer', '2025-04-01', 1, 200, 'http://localhost', 'TEST');

ERROR: no partition of relation "trace_log" found for row
Partition key of the failing row contains ("time") = (2025-04-01 00:00:00+02).

@eskebab
Copy link
Contributor Author

eskebab commented Mar 4, 2025

If one tries to create a partition that overlaps with another, this error will be given:
partition "trace_log_y2024m04" would overlap partition "trace_log_y2024m03"

@eskebab
Copy link
Contributor Author

eskebab commented Mar 11, 2025

since creating a table partition is an idempotent operation, using a background service in .NET could be a good alternative, because we would then have control over the range of the coming month/partition.

@eskebab
Copy link
Contributor Author

eskebab commented Mar 17, 2025

@SandGrainOne

I tried creating some partitions of various sizes. First, I created a partition with a duration of 1 month. Then I created one that has a six month range. I then created one only 24 hours long. This worked fine as long as there was no overlap and the names were unique.

I then tested inserting data into all three partitions with success

example executions

1 day:
CREATE TABLE events.trace_log_y2024m10d01 PARTITION OF events.trace_log
FOR VALUES FROM ('2025-10-01') TO ('2025-10-02');

6 months:
CREATE TABLE events.trace_log_y2024m04m10 PARTITION OF events.trace_log
FOR VALUES FROM ('2025-04-01') TO ('2025-10-01');

1 month:
CREATE TABLE events.trace_log_y2024m04 PARTITION OF events.trace_log
FOR VALUES FROM ('2025-03-01') TO ('2025-04-01')

@eskebab
Copy link
Contributor Author

eskebab commented Mar 19, 2025

Decided: look at alertManager or exception handling to prevent failure from happening silently.
Create partition for remainder of 2025.

@eskebab
Copy link
Contributor Author

eskebab commented Mar 19, 2025

funksjon pg_partition_tree() i Postgres for å hente ut partisjoner som tilhører en tabell:

select * from pg_partition_tree('events.trace_log'::regclass) where parentrelid is not null;

relid parentrelid isleaf level
"events.trace_log_y2024m03" "events.trace_log" true 1
"events.trace_log_y2024m0410" "events.trace_log" true 1
"events.trace_log_y2024m10d01" "events.trace_log" true 1

Med riktig navnekonvensjon kan vi da lage en spørring som kan sjekke om vi har en partisjon klar for neste periode. Dette kunne ha vært en background service som kjører ved ny deploy og lager en metrikk for om neste periode er klar eller ikke og eventuelt en counter for antall partisjoner, hvis det er interessant.

for mer info:
https://www.postgresql.org/docs/current/functions-admin.html

jeg antar at isleaf og level er mest interessant hvis man har flere nivåer, partisjoner av partisjonene osv.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant