Replication is failing after first schema migration #919
Comments
So, the changes I'll make are:
ALTER TABLE search_docketentry ALTER COLUMN recap_sequence_number DROP NOT NULL;
Then, I expect the replica to catch up, which could take a while. Once caught up, I run:
UPDATE search_docketentry SET recap_sequence_number = '' WHERE recap_sequence_number IS NULL;
That'll get things into the right spot data-wise. Then, I fix the schema again with:
ALTER TABLE search_docketentry ALTER COLUMN recap_sequence_number SET NOT NULL;
Off we go.
OK, made it nullable. The command completed nearly instantly. CPU is spiking on the replica, and the errors seem to have gone away.
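For anyone repeating this, here is a rough sketch of how the state could be double-checked on the subscriber after dropping the constraint. It assumes PostgreSQL 10 or later (where the pg_stat_subscription view exists) and that the table lives in a schema on the default search path; neither is stated in this thread.

```sql
-- Run on the subscriber. Assumes PostgreSQL 10+.

-- Confirm the column is now nullable (is_nullable should read 'YES').
SELECT column_name, is_nullable, data_type
FROM information_schema.columns
WHERE table_name = 'search_docketentry'
  AND column_name = 'recap_sequence_number';

-- Confirm the apply worker is alive and receiving data.
-- A NULL pid means no worker is currently running for that subscription.
SELECT subname, pid, received_lsn, latest_end_lsn, last_msg_receipt_time
FROM pg_stat_subscription;
```

If received_lsn keeps advancing between runs, the replica is working through its backlog.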
Ran the update, it seemed fine:
courtlistener=> select count(*) from search_docketentry where recap_sequence_number is null;
count
-------
118
(1 row)
courtlistener=> UPDATE search_docketentry SET recap_sequence_number = '' where recap_sequence_number IS NULL;
UPDATE 118
Ran the final alter command. It took a while to run (setting a NOT NULL constraint requires a full table scan while the table is locked). Anyway, it worked.
Seems we're all good here.
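As a side note on the table scan and lock mentioned above: on PostgreSQL 12 or later there is a way to keep the heavy lock short. This is only a sketch, not what was run here. It assumes PostgreSQL 12+, and the constraint name is made up for illustration.

```sql
-- Assumes PostgreSQL 12+. The constraint name is hypothetical.

-- 1. Add the check without validating existing rows (brief lock, no scan).
ALTER TABLE search_docketentry
    ADD CONSTRAINT recap_sequence_number_not_null
    CHECK (recap_sequence_number IS NOT NULL) NOT VALID;

-- 2. Validate it. This scans the table, but under a weaker lock
--    that still allows normal reads and writes.
ALTER TABLE search_docketentry
    VALIDATE CONSTRAINT recap_sequence_number_not_null;

-- 3. SET NOT NULL can now rely on the validated check and skip its own scan.
ALTER TABLE search_docketentry
    ALTER COLUMN recap_sequence_number SET NOT NULL;

-- 4. The check constraint is redundant at this point.
ALTER TABLE search_docketentry
    DROP CONSTRAINT recap_sequence_number_not_null;
```

On older versions the plain SET NOT NULL used above is the only option, and it will scan the table under an exclusive lock.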
Not sure why I missed this at first, but as usual, the WAL files are piling up and we've got issues.
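A quick way to see how much WAL is being held back is to look at the replication slots on the publisher. This is a sketch that assumes PostgreSQL 10 or later (for the pg_current_wal_lsn and pg_wal_lsn_diff function names); the version isn't stated in this issue.

```sql
-- Run on the publisher. Assumes PostgreSQL 10+.
-- retained_wal is roughly how much WAL each slot is forcing the server to keep.
SELECT slot_name,
       active,
       restart_lsn,
       confirmed_flush_lsn,
       pg_size_pretty(
           pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
       ) AS retained_wal
FROM pg_replication_slots;
```

A slot whose retained_wal keeps growing between runs usually means its subscriber isn't confirming changes.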
So...what do we have?
The publisher just has this over and over
The subscriber has this over and over
And the migration was:
So it's not altogether clear what went wrong here. The error says that we've got a null value coming in for the recap_sequence_number field, which I guess could happen if the replica finished migrating before the master and the master sent data for the rest of that row (minus the new column).
The documentation isn't clear on this point. First it says:
Which kind of hints that you should do a migration on the publisher first, and only later on the subscriber. But then it says:
Our changes were mostly additive, and the ones to recap_sequence_number certainly were. So...WTF. I guess that second sentence should be ignored henceforth.
Solutions...
After talking through this with the wise folks on IRC, there were two suggestions:
Skip the failing row as described in the conflict documentation.
This seems fine, except for the fact that there were probably a lot of failing rows. How many should be skipped, and furthermore, if we do skip them, how badly effed up does the DB become?
Make the column nullable temporarily to allow the data to be ingested. Then, once the replication is fixed, migrate the schema back into sync.
This feels more surgical, and I think it's the right solution. The data won't get messed up and we won't have to skip who-knows-how-many rows. Elders on IRC agree.
Going to try the second route.