Understand slow upgrade process #306
In analysing the logs provided to us by the user, we could not determine exactly why the upgrade process is slow. One hypothesis is that the background jobs which were running concurrently with the upgrade slowed it down: the background jobs take locks on tables which the upgrade process also needs to lock, so lock contention makes everything slow. We also know that the more objects the database contains, the longer the upgrade script takes.

Benchmarking setup

We set up a test system on an AWS m5.2xlarge with a 100GB gp3 volume with IOPS and throughput set to the maximum (16'000 IOPS and 1000MB/s). On this system we installed the latest timescaledb (2.7.0) and set up a database in it (

We then created 1000 metrics, each with 1 series, and for each series inserted 250 time series values with timestamps in the far future, such that each value ends up in a different chunk. The timestamps in the far future ensure that compression is not applied to the values. This results in 250'000 chunks, and ~500'000 objects in the database in total.

-- SQL script to create 250'000 chunks, each with 1 value
begin;
select _prom_catalog.get_or_create_metric_table_name(format('my_metric_%s', m))
from generate_series(1, 1000) m;
commit;
call _prom_catalog.finalize_metric_creation();
do $block$
declare
_metric text;
_series_id bigint;
begin
for _metric in
(
select format('my_metric_%s', m)
from generate_series(1, 1000) m
)
loop
-- create 1 series per metric
RAISE LOG 'creating series for %', _metric;
select _prom_catalog.get_or_create_series_id(
format('{"__name__": "%s", "namespace":"dev", "node": "brain"}', _metric)::jsonb
) into strict _series_id
;
RAISE LOG 'inserting future data for %', _metric;
-- in the future - not compressed
execute format(
$$
insert into prom_data.%I
select
'2035-01-01'::timestamptz + (interval '9 hour' * x),
x + 0.1,
%s
from generate_series(1, 250) x
$$, _metric, _series_id
);
commit;
end loop;
end;
$block$;

Simple upgrade

We ran Promscale 0.11.0 against the database (this time with the
We see that the "takeover" process takes ~3 minutes, and the

Upgrade with background job contention, but no compression

We reset the database to the previous "fresh" state (by making a file-system level copy of the data directory before running any benchmarks, and restoring this - literally:
We then restarted postgres, let the Promscale maintenance jobs run, and triggered the promscale 0.11.0 upgrade process. The resulting Promscale Connector logs can be seen here:
Not much different from before. These maintenance jobs were not doing any actual work, because all chunks are in the distant future.
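To support the observation that the jobs had nothing to do, one can check the job statistics and the compression state of the chunks. The following queries are a sketch against standard TimescaleDB views, not taken from the original benchmark:

-- Confirm that the maintenance jobs actually ran
SELECT job_id, last_run_status, last_run_duration, total_runs
FROM timescaledb_information.job_stats;

-- Confirm that nothing was compressed: with all data in the far future,
-- every chunk of the prom_data hypertables should still report is_compressed = false
SELECT is_compressed, count(*)
FROM timescaledb_information.chunks
WHERE hypertable_schema = 'prom_data'
GROUP BY is_compressed;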
The next benchmark aims to determine the impact of maintenance jobs doing actual compression work while the upgrade process runs. In order to measure this more accurately, we require a dataset with more data per chunk, so that the compression process has some work to do. This benchmark is on the same machine as before.

In order to produce the test data, we first disabled the promscale maintenance jobs with:

SELECT alter_job(job_id, scheduled => false)
FROM timescaledb_information.jobs
WHERE proc_name = 'execute_maintenance_job';

and then ran the following script:

do $block$
declare
_metric text;
_series_id bigint;
begin
for _metric in
(
select format('my_metric_%s', m)
from generate_series(1, 1000) m
)
loop
-- create 1 series per metric
RAISE LOG 'creating series for %', _metric;
select _prom_catalog.get_or_create_series_id(
format('{"__name__": "%s", "namespace":"dev", "node": "brain"}', _metric)::jsonb
) into strict _series_id
;
RAISE LOG 'inserting data for %', _metric;
-- in the past - will be compressed by background job
execute format(
$$
insert into prom_data.%I
select
'2022-01-01'::timestamptz + (interval '30 seconds' * x),
x + 0.1,
%s
from generate_series(1, 250000) x
$$, _metric, _series_id
);
commit;
end loop;
end;
$block$;

This results in the following chunk and object counts:
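As a rough sketch (not the exact queries used in the benchmark), the chunk and object counts can be obtained like this:

-- Number of chunks backing the prom_data hypertables
SELECT count(*) AS chunk_count
FROM timescaledb_information.chunks
WHERE hypertable_schema = 'prom_data';

-- Rough proxy for the number of objects in the database (tables, indexes, sequences, views, ...)
SELECT count(*) AS object_count
FROM pg_class;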
Baseline

As a baseline, we would like to understand how long the upgrade takes with this data set, and without maintenance job contention. The following logs show how long the upgrade process took:
With maintenance job contention

If we start the maintenance jobs before triggering the upgrade, we see the following in the promscale connector logs:
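(The maintenance jobs had been disabled while generating the test data, so they have to be re-enabled before this run. A sketch of how that can be done; the job id in the run_job call is a placeholder:)

-- Re-enable the previously disabled maintenance jobs
SELECT alter_job(job_id, scheduled => true)
FROM timescaledb_information.jobs
WHERE proc_name = 'execute_maintenance_job';

-- Optionally run a job immediately instead of waiting for the scheduler
-- (replace 1000 with an actual job_id from timescaledb_information.jobs)
CALL run_job(1000);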
If we start the upgrade first, just before the maintenance jobs are going to trigger, we see the following:
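During either of these runs the lock-contention hypothesis can be checked directly by looking at which backends are blocked and by whom. The query below is a sketch (not part of the original report) built on pg_blocking_pids():

-- Show blocked backends together with the backends blocking them
SELECT blocked.pid    AS blocked_pid,
       blocked.query  AS blocked_query,
       blocking.pid   AS blocking_pid,
       blocking.query AS blocking_query
FROM pg_stat_activity AS blocked
JOIN LATERAL unnest(pg_blocking_pids(blocked.pid)) AS b(pid) ON true
JOIN pg_stat_activity AS blocking ON blocking.pid = b.pid;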
Upgrade during different phases of maintenance jobs

The promscale maintenance jobs go through two main phases:
- retention: dropping chunks which have fallen outside the retention period
- compression: compressing the remaining eligible chunks
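A sketch of how to see which phase is currently running (the filter patterns are assumptions, not taken from the original report):

-- Look at what the TimescaleDB background workers are currently executing;
-- retention typically shows up as chunk drops, compression as compress_chunk calls
SELECT pid, backend_type, state, query_start, query
FROM pg_stat_activity
WHERE backend_type LIKE '%TimescaleDB%'
   OR query ILIKE '%compress_chunk%'
   OR query ILIKE '%drop_chunks%';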
It is possible that the contention behaviour differs depending on the phase in which the upgrade is triggered.

Upgrade triggered during retention
Upgrade triggered during compression
In #274 a user provided a report of the upgrade from Promscale 0.10.0 to 0.11.0 taking ~60 minutes. This is longer than expected.
In this issue I will gather findings from trying to understand why it is slow.