Understand slow upgrade process #306
In analysing the logs provided to us by the user, we could not determine exactly why the upgrade process is slow. One hypothesis is that the background jobs which were running concurrently with the upgrade slowed it down: the background jobs take locks on tables which the upgrade process also needs to lock, so lock contention makes everything slow. We also know that the more objects the database contains, the longer the upgrade script takes.

Benchmarking setup

We set up a test system on an AWS m5.2xlarge with a 100GB gp3 volume with IOPS and throughput set to the maximum (16'000 IOPS and 1000MB/s). On this system we installed the latest timescaledb (2.7.0) and set up a database in it (

We then created 1000 metrics, each with 1 series, and for each series inserted 250 time series values with timestamps in the far future, such that each value ends up in a different chunk. The timestamps in the far future ensure that compression is not applied to the values. This results in 250'000 chunks, and ~500'000 objects in the database in total.

-- SQL script to create 250'000 chunks, each with 1 value
begin;
select _prom_catalog.get_or_create_metric_table_name(format('my_metric_%s', m))
from generate_series(1, 1000) m;
commit;
call _prom_catalog.finalize_metric_creation();
do $block$
declare
_metric text;
_series_id bigint;
begin
for _metric in
(
select format('my_metric_%s', m)
from generate_series(1, 1000) m
)
loop
-- create 1 series per metric
RAISE LOG 'creating series for %', _metric;
select _prom_catalog.get_or_create_series_id(
format('{"__name__": "%s", "namespace":"dev", "node": "brain"}', _metric)::jsonb
) into strict _series_id
;
RAISE LOG 'inserting future data for %', _metric;
-- in the future - not compressed
execute format(
$$
insert into prom_data.%I
select
'2035-01-01'::timestamptz + (interval '9 hour' * x),
x + 0.1,
%s
from generate_series(1, 250) x
$$, _metric, _series_id
);
commit;
end loop;
end;
$block$;

Simple upgrade

We ran Promscale 0.11.0 against the database (this time with the
We see that the "takeover" process takes ~3 minutes, and the

Upgrade with background job contention, but no compression

We reset the database to the previous "fresh" state (by making a file-system level copy of the data directory before running any benchmarks, and restoring this - literally:
We then restarted postgres, let the Promscale maintenance jobs run, and triggered the promscale 0.11.0 upgrade process. The resulting Promscale Connector logs can be seen here:
Not much different from before. These maintenance jobs were not doing any actual work, because all chunks are in the distant future.
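To support the observation that the jobs had nothing to do, one can check the job statistics and the compression state of the chunks. The following queries are a sketch against standard TimescaleDB views, not taken from the original benchmark:

-- Confirm that the maintenance jobs actually ran
SELECT job_id, last_run_status, last_run_duration, total_runs
FROM timescaledb_information.job_stats;

-- Confirm that nothing was compressed: with all data in the far future,
-- every chunk of the prom_data hypertables should still report is_compressed = false
SELECT is_compressed, count(*)
FROM timescaledb_information.chunks
WHERE hypertable_schema = 'prom_data'
GROUP BY is_compressed;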
The next benchmark aims to determine the impact of maintenance jobs doing actual compression work while the upgrade process runs. In order to measure this more accurately, we require a dataset with more data per chunk, so that the compression process has some work to do. This benchmark is on the same machine as before.

In order to produce the test data, we first disabled the promscale maintenance jobs with:

SELECT alter_job(job_id, scheduled => false)
FROM timescaledb_information.jobs
WHERE proc_name = 'execute_maintenance_job';

and then ran the following script:

do $block$
declare
_metric text;
_series_id bigint;
begin
for _metric in
(
select format('my_metric_%s', m)
from generate_series(1, 1000) m
)
loop
-- create 1 series per metric
RAISE LOG 'creating series for %', _metric;
select _prom_catalog.get_or_create_series_id(
format('{"__name__": "%s", "namespace":"dev", "node": "brain"}', _metric)::jsonb
) into strict _series_id
;
RAISE LOG 'inserting data for %', _metric;
-- in the past - will be compressed by background job
execute format(
$$
insert into prom_data.%I
select
'2022-01-01'::timestamptz + (interval '30 seconds' * x),
x + 0.1,
%s
from generate_series(1, 250000) x
$$, _metric, _series_id
);
commit;
end loop;
end;
$block$;

This results in the following chunk and object counts:
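As a rough sketch (not the exact queries used in the benchmark), the chunk and object counts can be obtained like this:

-- Number of chunks backing the prom_data hypertables
SELECT count(*) AS chunk_count
FROM timescaledb_information.chunks
WHERE hypertable_schema = 'prom_data';

-- Rough proxy for the number of objects in the database (tables, indexes, sequences, views, ...)
SELECT count(*) AS object_count
FROM pg_class;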
Baseline

As a baseline, we would like to understand how long the upgrade takes with this data set, and without maintenance job contention. The following logs show how long the upgrade process took:
With maintenance job contention

If we start the maintenance jobs before triggering the upgrade, we see the following in the promscale connector logs:
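(The maintenance jobs had been disabled while generating the test data, so they have to be re-enabled before this run. A sketch of how that can be done; the job id in the run_job call is a placeholder:)

-- Re-enable the previously disabled maintenance jobs
SELECT alter_job(job_id, scheduled => true)
FROM timescaledb_information.jobs
WHERE proc_name = 'execute_maintenance_job';

-- Optionally run a job immediately instead of waiting for the scheduler
-- (replace 1000 with an actual job_id from timescaledb_information.jobs)
CALL run_job(1000);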
If we start the upgrade first, just before the maintenance jobs are going to trigger, we see the following:
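During either of these runs the lock-contention hypothesis can be checked directly by looking at which backends are blocked and by whom. The query below is a sketch (not part of the original report) built on pg_blocking_pids():

-- Show blocked backends together with the backends blocking them
SELECT blocked.pid    AS blocked_pid,
       blocked.query  AS blocked_query,
       blocking.pid   AS blocking_pid,
       blocking.query AS blocking_query
FROM pg_stat_activity AS blocked
JOIN LATERAL unnest(pg_blocking_pids(blocked.pid)) AS b(pid) ON true
JOIN pg_stat_activity AS blocking ON blocking.pid = b.pid;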
Upgrade during different phases of maintenance jobs

The promscale maintenance jobs go through two main phases:
- retention: dropping chunks which have fallen outside the retention period
- compression: compressing the remaining eligible chunks
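A sketch of how to see which phase is currently running (the filter patterns are assumptions, not taken from the original report):

-- Look at what the TimescaleDB background workers are currently executing;
-- retention typically shows up as chunk drops, compression as compress_chunk calls
SELECT pid, backend_type, state, query_start, query
FROM pg_stat_activity
WHERE backend_type LIKE '%TimescaleDB%'
   OR query ILIKE '%compress_chunk%'
   OR query ILIKE '%drop_chunks%';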
It is possible that the contention behaviour differs depending on the phase in which the upgrade is triggered.

Upgrade triggered during retention
Upgrade triggered during compression
In #274 a user provided a report of the upgrade from Promscale 0.10.0 to 0.11.0 taking ~60 minutes. This is longer than expected.
In this issue I will gather findings from trying to understand why it is slow.