Replication is failing after first schema migration #919

Closed
mlissner opened this issue Dec 9, 2018 · 4 comments


mlissner commented Dec 9, 2018

Not sure why I missed this at first, but as usual, the WAL logs are piling up and we've got issues.

So...what do we have?

The publisher just has this over and over:

2018-12-08 22:09:08.007 PST [2178] replica@courtlistener LOG:  could not send data to client: Connection reset by peer
2018-12-08 22:09:08.007 PST [2178] replica@courtlistener CONTEXT:  slot "replicasubscription", output plugin "pgoutput", in the commit callback, associated LSN 710/DA9D8F58
2018-12-08 22:09:13.202 PST [2193] replica@courtlistener LOG:  starting logical decoding for slot "replicasubscription"
2018-12-08 22:09:13.202 PST [2193] replica@courtlistener DETAIL:  streaming transactions committing after 710/DA85E448, reading WAL from 710/DA7C49A0
2018-12-08 22:09:13.202 PST [2193] replica@courtlistener LOG:  logical decoding found consistent point at 710/DA7C49A0
2018-12-08 22:09:13.202 PST [2193] replica@courtlistener DETAIL:  There are no running transactions.
2018-12-08 22:09:18.436 PST [2202] replica@courtlistener LOG:  starting logical decoding for slot "replicasubscription"
2018-12-08 22:09:18.436 PST [2202] replica@courtlistener DETAIL:  streaming transactions committing after 710/DA85E448, reading WAL from 710/DA7C49A0
2018-12-08 22:09:18.437 PST [2202] replica@courtlistener LOG:  logical decoding found consistent point at 710/DA7C49A0
2018-12-08 22:09:18.437 PST [2202] replica@courtlistener DETAIL:  There are no running transactions.
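
For the record, here's roughly how I'd check how much WAL that stuck slot is pinning on the publisher. This is just a sketch, and it assumes the PostgreSQL 10+ function names:

-- Run on the publisher: how much WAL is being retained because the slot can't advance.
SELECT slot_name,
       active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots
WHERE slot_name = 'replicasubscription';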

The subscriber has this over and over:

2018-12-09 05:59:45 UTC::@:[13373]:LOG: logical replication apply worker for subscription "replicasubscription" has started
2018-12-09 05:59:45 UTC::@:[13373]:ERROR: null value in column "recap_sequence_number" violates not-null constraint
2018-12-09 05:59:45 UTC::@:[13373]:DETAIL: Failing row contains (48064261, 2018-12-07 04:48:40.388377+00, 2018-12-07 04:48:40.388402+00, null, 576, , 4571214, null, null).
2018-12-09 05:59:45 UTC::@:[6342]:LOG: worker process: logical replication worker for subscription 18390 (PID 13373) exited with exit code 1
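
(The 18390 in that last line is the subscription's OID. Just as a sanity check, it maps back to the subscription like this on the subscriber, using the standard pg_subscription catalog:)

-- Run on the subscriber: confirm which subscription OID 18390 refers to.
SELECT oid, subname, subenabled
FROM pg_subscription
WHERE oid = 18390;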

And the migration was:

BEGIN;
ALTER TABLE "search_docketentry" ADD COLUMN "pacer_sequence_number" smallint NULL;
ALTER TABLE "search_docketentry" ALTER COLUMN "pacer_sequence_number" DROP DEFAULT;
ALTER TABLE "search_docketentry" ADD COLUMN "recap_sequence_number" varchar(50) DEFAULT '' NOT NULL;
ALTER TABLE "search_docketentry" ALTER COLUMN "recap_sequence_number" DROP DEFAULT;
ALTER TABLE "search_docketentry" ALTER COLUMN "entry_number" DROP NOT NULL;
ALTER TABLE "search_recapdocument" ALTER COLUMN "document_number" SET DEFAULT '';
ALTER TABLE "search_recapdocument" ALTER COLUMN "document_number" DROP DEFAULT;
ALTER TABLE "search_docketentry" DROP CONSTRAINT "search_docketentry_docket_id_12fd448b9aa007ca_uniq";
CREATE INDEX "search_docketentry_recap_sequence_number_1c82e51988e2d89f_idx" ON "search_docketentry" ("recap_sequence_number", "entry_number");
CREATE INDEX "search_docketentry_eb19fcf7" ON "search_docketentry" ("pacer_sequence_number");
CREATE INDEX "search_docketentry_bff4d47b" ON "search_docketentry" ("recap_sequence_number");
CREATE INDEX "search_docketentry_recap_sequence_number_d700f0391e8213a_like" ON "search_docketentry" ("recap_sequence_number" varchar_pattern_ops);

COMMIT;

BEGIN;
ALTER TABLE "search_docketentry"
ALTER COLUMN "pacer_sequence_number" TYPE integer;

COMMIT;

So it's not altogether clear what went wrong here. The error says we're getting a null value for the recap_sequence_number field, which I guess could happen if the replica finished migrating before the master did, and the master then sent row data from before its own migration (minus the new column).
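
(If it helps, the failing row can be inspected on the master with something like the query below. I'm assuming the leading 48064261 in the DETAIL line is the primary key.)

-- Run on the master: see what the source row looks like there.
SELECT id, entry_number, recap_sequence_number, docket_id
FROM search_docketentry
WHERE id = 48064261;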

The documentation isn't clear on this point. First it says:

Logical replication is robust when schema definitions change in a live database: When the schema is changed on the publisher and replicated data starts arriving at the subscriber but does not fit into the table schema, replication will error until the schema is updated.

Which kind of hints that you should do a migration on the publisher first, and only later on the subscriber. But then it says:

In many cases, intermittent errors can be avoided by applying additive schema changes to the subscriber first.

Our changes were mostly additive, and the ones to recap_sequence_number certainly were. So...WTF. I guess that second sentence should be ignored henceforth.

Solutions...

After talking through this with the wise folks on IRC, there were two suggestions:

  1. Skip the failing row as described in the conflict documentation (there's a rough sketch of what that looks like below, after this list).

    This seems fine, except that there were probably a lot of failing rows. How many should be skipped, and furthermore, if we do skip them, how badly effed up does the DB become?

  2. Make the column nullable temporarily to allow the data to be ingested. Then, once the replication is fixed, migrate the schema back into sync.

    This feels more surgical, and I think it's the right solution. The data won't get messed up and we won't have to skip who-knows-how-many rows. Elders on IRC agree.
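
For completeness, here's a rough sketch of what option 1 would look like, per the conflict docs. The origin name and the target LSN are guesses that would need to be confirmed: the origin is usually pg_<subscription OID>, so pg_18390 given the worker log above, and the target has to be an LSN just past the failing commit, which the publisher log puts at 710/DA9D8F58.

-- Run on the subscriber. Everything up to the given LSN gets skipped,
-- i.e. those transactions are simply lost on the replica.
SELECT pg_replication_origin_advance('pg_18390', '710/DA9D8F58'::pg_lsn);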

Going to try the second route.


mlissner commented Dec 9, 2018

So, the changes I'll make are:

ALTER TABLE search_docketentry ALTER COLUMN recap_sequence_number DROP NOT NULL;

Then, I expect the replica to catch up, which could take a while. Once caught up, I run:

UPDATE search_docketentry SET recap_sequence_number = '' where recap_sequence_number IS NULL;

That'll get things into the right spot data-wise. Then, I fix the schema again with:

ALTER TABLE search_docketentry ALTER COLUMN recap_sequence_number SET NOT NULL;

Off we go.
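
While that runs, I'll keep an eye on the apply worker with something like this on the subscriber (assuming the standard pg_stat_subscription view):

-- Run on the subscriber: progress of the apply worker for our subscription.
SELECT subname, pid, received_lsn, latest_end_lsn, last_msg_receipt_time
FROM pg_stat_subscription
WHERE subname = 'replicasubscription';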


mlissner commented Dec 9, 2018

OK, made it nullable. The command completed nearly instantly. CPU is spiking on the replica, and the errors seem to have gone away.


mlissner commented Dec 9, 2018

Ran the update; it seemed fine:

courtlistener=> select count(*) from search_docketentry where recap_sequence_number is null;
 count 
-------
   118
(1 row)
courtlistener=> UPDATE search_docketentry SET recap_sequence_number = '' where recap_sequence_number IS NULL;
UPDATE 118


mlissner commented Dec 9, 2018

Ran the final alter command. It took a while to run (setting a NOT NULL constraint requires a full table scan and an exclusive lock). Anyway, it worked.

ALTER TABLE search_docketentry ALTER COLUMN recap_sequence_number SET NOT NULL;

Seems we're all good here.

mlissner closed this as completed Dec 9, 2018