---
title: Low-downtime migrations with dual-write and backfill
excerpt: Migrate a hypertable or entire database with low downtime
products: [cloud, self_hosted]
keywords: [backups, restore]
tags: [recovery, logical backup, pg_dump, pg_restore]
---

# Dual-write and backfill

Dual-write and backfill is a migration strategy to move a large amount of
time-series data (100 GB to 4 TB+) with low downtime (on the order of minutes).

Roughly, it consists of three steps:

1. Clone schema and relational data from source to target
1. Dual-write to source and target
1. Backfill time-series data

Dual-write and backfill can be used with any source database type, as long as
the source can provide data in CSV format. It can be used to move data from a
Postgres source to TimescaleDB, or from TimescaleDB to TimescaleDB. If the
source and target databases are Postgres, they can be of different versions, as
long as the target version is greater than the source version. If both source
and target use TimescaleDB, the TimescaleDB version must be the same.

Dual-write and backfill works well when:

1. The bulk of the (on-disk) data is in time-series tables.
1. Writes by the application do not reference historical time-series data.
1. There is no requirement for transactional consistency (that is, it is
   possible to filter the time-series data by time and retain data integrity).
1. No `UPDATE` or `DELETE` queries are run on time-series data in the source
   database during the migration process (or, if they are, it happens in a
   controlled manner, such that the affected data can either be ignored or
   re-backfilled).
1. Either the relational (non-time-series) data is small enough to be copied
   from source to target in an acceptable amount of downtime, or the relational
   data changes infrequently enough that it can be copied asynchronously while
   the application continues to run.

## Migration process

In detail, the migration process consists of the following steps:

1. Set up a second database
1. Modify the application to write to a secondary database
1. Migrate schema and relational data from source to target
1. Start the application in dual-write mode
1. Determine the consistency time `T`
1. Backfill time-series data from source to target
1. Enable retention and compression policies
1. Validate that all data is present in target database
1. Validate that target database can handle production load
1. Switch application to treat target database as primary (potentially
   continuing to write into the source database as a backup)

### 1. Set up a second database

[Create a database service in Timescale][create-service].

[create-service]: /use-timescale/:currentVersion:/services/create-a-service/

### 2. Modify the application to write to a secondary database

How exactly to do this depends on the language your application is written in,
and on how your ingestion and application code function. In the simplest case,
you execute two inserts in parallel. In the general case, you need to decide
how to handle a failure to write to either the old or the new database, and
what mechanism you can build to recover from such a failure.

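For illustration only, here is a minimal sketch of the idea in shell, using
`psql` and a hypothetical `metrics` table (both assumptions, not part of the
procedure). A real application would implement the same logic in its own
language and decide how failed writes are retried:

```
# Minimal dual-write sketch (bash + psql). The "metrics" table and the
# failure-handling approach are illustrative assumptions.
insert_row() {
  local values="$1"

  # Write to both databases in parallel.
  psql -X -d "$SOURCE" -c "insert into metrics values ($values);" &
  local source_pid=$!
  psql -X -d "$TARGET" -c "insert into metrics values ($values);" &
  local target_pid=$!

  # Record failed writes so that the affected rows can be replayed later.
  wait "$source_pid" || echo "$values" >> failed_source_writes.log
  wait "$target_pid" || echo "$values" >> failed_target_writes.log
}

insert_row "'2024-01-05 12:00:00+00', 'device-1', 42.0"
```
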
You may also want to execute the same read queries against both the old and new
databases in order to evaluate the correctness and performance of the results.
Bear in mind that the new database does not contain all of the data until the
backfill is complete, so expect the results to differ for some period of time
(potentially a number of days).

### 3. Set up schema and migrate relational data to new database

How exactly you perform this step depends on whether you're migrating from
plain PostgreSQL, from TimescaleDB, or from some other database.

#### From TimescaleDB

Dump the database roles from the source database:

```
pg_dumpall -d "$SOURCE" \
    --quote-all-identifiers \
    --roles-only \
    --file=roles.sql
```

Fix up the dumped roles by removing the `postgres` role and stripping role
attributes and `GRANTED BY` clauses which cannot be applied on the target:

```
sed -i -E \
    -e '/CREATE ROLE "postgres";/d' \
    -e '/ALTER ROLE "postgres"/d' \
    -e 's/(NO)*SUPERUSER//g' \
    -e 's/(NO)*REPLICATION//g' \
    -e 's/(NO)*BYPASSRLS//g' \
    -e 's/GRANTED BY "[^"]*"//g' \
    roles.sql
```

Dump all plain tables and the TimescaleDB catalog from the source database:

```
pg_dump -d "$SOURCE" \
    --format=plain \
    --quote-all-identifiers \
    --no-tablespaces \
    --no-owner \
    --no-privileges \
    --exclude-table-data='_timescaledb_internal.*' \
    --file=dump.sql
```

Note the following about the flags passed to `pg_dump`:

1. `--no-tablespaces` is required because Timescale does not support
   tablespaces other than the default. This is a limitation.
1. `--no-owner` is required because the `tsdbadmin` user is not a superuser and
   cannot assign ownership in all cases. This flag means that everything is
   owned by the `tsdbadmin` user in the target, regardless of ownership in the
   source. This is a limitation.
1. `--no-privileges` is required because the `tsdbadmin` user is not a
   superuser and cannot assign privileges in all cases. This flag means that
   privileges assigned to other users must be reassigned in the target database
   as a manual clean-up task. This is a limitation.
1. `--exclude-table-data='_timescaledb_internal.*'` dumps the structure of the
   hypertable chunks, but not their data. This creates empty chunks on the
   target, ready for the backfill process.

Additionally:

1. If the source database has the timescaledb extension installed in a schema
   other than `public`, it causes issues on Timescale. Edit the dump file to
   remove any references to the non-public schema. The extension must be in the
   `public` schema on Timescale. This is a limitation.
1. If any background jobs are owned by the `postgres` user, they need to be
   owned by the `tsdbadmin` user on the target database. Edit the dump file
   accordingly.

Load the roles and schema into the target database, and disable all background
jobs. Background jobs are disabled so that they don't interfere with the
backfill: for example, a retention policy could delete data before it has been
backfilled, and a compression policy could compress chunks which the backfill
still needs to write into.

```
psql -X -d "$TARGET" \
    -v ON_ERROR_STOP=1 \
    --echo-errors \
    -f roles.sql \
    -c 'select public.timescaledb_pre_restore();' \
    -f dump.sql \
    -f - <<'EOF'
begin;
select public.timescaledb_post_restore();

-- disable all background jobs
select public.alter_job(id::integer, scheduled=>false)
from _timescaledb_config.bgw_job
where id >= 1000;

commit;
EOF
```

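As an optional sanity check (not part of the original dump and restore flow),
you can confirm that the hypertables now exist on the target:

```
psql -X -d "$TARGET" \
    -c "select hypertable_schema, hypertable_name
        from timescaledb_information.hypertables;"
```
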
#### From plain PostgreSQL

TODO

#### From some other database

TODO

### 4. Start application in dual-write mode

With the target database set up, your application can now be started in
dual-write mode.

### 5. Determine the consistency time `T`

After dual-writes have been executing for a while, the target hypertable
contains data in three time ranges: missing writes, late-arriving data, and the
consistency range.

#### Missing writes

If the application is made up of multiple writers, and these writers did not
all simultaneously start writing into the target hypertable, there is a period
of time in which not all writes have made it into the target hypertable. This
period starts when the first writer begins dual-writing and ends when the last
writer begins dual-writing.

#### Late-arriving data

Some applications have late-arriving data: measurements which have a timestamp
in the past, but which haven't been written yet (for example, from devices
which had intermittent connectivity issues). The window of late-arriving data
is between the present moment and the maximum lateness.

#### Consistency range

The consistency range is the range in which there are no missing writes and in
which all data has arrived, that is, between the end of the missing writes
range and the beginning of the late-arriving data range.

The length of these ranges is defined by the properties of the application;
there is no one-size-fits-all way to determine them. The consistency time `T`
is an arbitrarily chosen time in the consistency range.

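As a purely illustrative example (the timestamp and the one-hour lateness
window below are assumptions, not outputs of any tool): if the last writer
began dual-writing at `2024-01-05 12:00` UTC and data arrives at most one hour
late, the consistency range is:

```
-- Illustrative values only: substitute your own dual-write start time and
-- maximum lateness.
select timestamptz '2024-01-05 12:00:00+00' as consistency_range_start,
       now() - interval '1 hour'            as consistency_range_end;
-- Any time in this range (for example, '2024-01-05 12:00:00+00') can be
-- chosen as the consistency time T.
```
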
### 6. Backfill data from source to target

If your source database is using TimescaleDB, we recommend using our backfill
tool, `timescaledb-backfill`.

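A typical sequence of `timescaledb-backfill` invocations is sketched below,
using the consistency time `T` from the previous step as the cut-off. The
subcommands and flags shown are based on the tool's documentation and may
differ between versions, so verify them before use:

```
# Sketch only: verify subcommands and flags against the timescaledb-backfill
# documentation for your version. The --until value is the consistency time T.
timescaledb-backfill stage  --source "$SOURCE" --target "$TARGET" \
    --until '2024-01-05T12:00:00+00:00'
timescaledb-backfill copy   --source "$SOURCE" --target "$TARGET"
timescaledb-backfill verify --source "$SOURCE" --target "$TARGET"
timescaledb-backfill clean  --target "$TARGET"
```
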
If your source database is not using TimescaleDB, we recommend dumping the data
from your source database on a per-table basis into CSV format, and restoring
those CSVs into the target database using the `timescaledb-parallel-copy` tool.

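For example, assuming a Postgres-compatible source and a hypothetical `metrics`
table, the dump can be produced with `psql`'s `\copy`. The
`timescaledb-parallel-copy` flags shown are indicative only, so check them
against the tool's documentation:

```
# Dump rows up to the consistency time T into a CSV file (assumes a
# Postgres-compatible source and a hypothetical "metrics" table).
psql -X -d "$SOURCE" -c "\copy (select * from metrics where time < '2024-01-05 12:00:00+00') to 'metrics.csv' csv"

# Load the CSV into the target in parallel. Flag names are indicative only;
# consult the timescaledb-parallel-copy documentation for your version.
timescaledb-parallel-copy \
    --connection "$TARGET" \
    --table metrics \
    --file metrics.csv \
    --workers 4
```
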
### 7. Enable retention and compression policies

Re-enable all retention and compression policies. If the backfill process took
long enough for there to be a significant amount of retention or compression
work outstanding, it may be preferable to run the jobs manually, in order to
control the pacing of the work, and to re-enable the policies only once they
have caught up.

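For example, re-enabling the jobs mirrors the statement used to disable them
during the restore, and `run_job` can be used to run an individual job by hand
(the job id `1015` below is a placeholder; look up real ids in the
`timescaledb_information.jobs` view):

```
-- Re-enable all user-defined background jobs (the inverse of the statement
-- used to disable them during the restore).
select public.alter_job(id::integer, scheduled=>true)
from _timescaledb_config.bgw_job
where id >= 1000;

-- Optionally, run a single job by hand to control the pacing of the catch-up
-- work. 1015 is a placeholder job id.
call run_job(1015);
```
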
### 8. Validate that all data is present in target database

One possible approach is to compare row counts on a chunk-by-chunk basis. You
can run `select count(*) ...`, which is exact but potentially costly.
Alternatively, you can run `ANALYZE` on both the source and target chunks and
then compare the `reltuples` column of the `pg_class` table for each chunk.
This is not exact, but is much cheaper.

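As a sketch, assuming a hypothetical `metrics` hypertable with a `time` column,
run the same statements on both source and target and compare the output:

```
-- Exact but potentially costly: row count up to the consistency time T.
select count(*) from metrics where time < '2024-01-05 12:00:00+00';

-- Cheaper but approximate: per-chunk row estimates from planner statistics
-- (kept up to date by autovacuum/autoanalyze, or by running ANALYZE).
select ch.chunk_schema, ch.chunk_name, c.reltuples::bigint as approx_rows
from timescaledb_information.chunks ch
join pg_class c on c.relname = ch.chunk_name
join pg_namespace n on n.oid = c.relnamespace and n.nspname = ch.chunk_schema
where ch.hypertable_name = 'metrics'
order by ch.chunk_name;
```
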
### 9. Validate that target database can handle production load

Assuming dual-writes have been in place, the target database should already be
handling production write traffic. Now is the right time to determine whether
the new database can serve all production traffic (both reads _and_ writes).
How exactly to do this is application-specific and up to you to determine.

### 10. Switch production workload to new database

Once you've validated that all the data is present, and that the new database
can handle the production workload, the final step is to switch to the new
database as your primary. You may want to continue writing to the old database
for a period, until you are certain that the new database is holding up to all
production traffic.