
Test deleting column from logical replication #1164

Closed
mlissner opened this issue Jan 30, 2020 · 6 comments

@mlissner (Member)

The documentation is entirely unclear on how to delete a logically replicated column. Do you do it:

  • on the publisher first

    You'll be sending incomplete tuples to the subscriber. The subscriber will attempt to add these to the DB, but won't be able to unless it has a default value for the not-yet-deleted column, so it'll probably start failing and flailing around.

  • on the subscriber first

    If you do it on the subscriber first, you'll be getting tuples with extra values that the subscriber doesn't know what to do with. Maybe it'll just ignore them? Unclear. The docs say:

    Columns of a table are also matched by name. The order of columns in the subscriber table does not need to match that of the publisher.

    But who knows what to make of that.

    Note that I think it's best practice to make additive changes on the subscriber first.

So this will follow the approach of the tests I did in #1115. Basically, I'll create two DBs with Docker, make them replicate, and then try making these tweaks while a watch command is busily adding data to the system.
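For reference, here's roughly the setup I'm assuming (a sketch mirroring #1115; the table and publication names, column definitions, and connection string are placeholders, though the logs below confirm the subscription really is named test_sub):

```sql
-- Publisher (port 5432): create the table and publish it.
CREATE TABLE test_table (id serial PRIMARY KEY, test_col varchar(10));
CREATE PUBLICATION test_pub FOR TABLE test_table;

-- Subscriber (port 5433): matching table definition, then subscribe.
CREATE TABLE test_table (id serial PRIMARY KEY, test_col varchar(10));
CREATE SUBSCRIPTION test_sub
    CONNECTION 'host=publisher port=5432 dbname=test user=postgres'
    PUBLICATION test_pub;
```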

@mlissner (Member Author) commented Jan 30, 2020

OK, the first test was to set up two replicating docker images on ports 5432 and 5433 and then use a watch command to send them INSERT commands every quarter second. The setup was nothing fancy, and pretty much mirrored what I did in #1115.

Once set up, I began by deleting the column from the subscriber. Once deleted, I started getting this log error on the subscriber:

2020-01-30 23:07:01.323 UTC [144] ERROR:  logical replication target relation "public.test_table" is missing some replicated columns
2020-01-30 23:07:01.325 UTC [1] LOG:  background worker "logical replication worker" (PID 144) exited with exit code 1
2020-01-30 23:07:01.327 UTC [152] LOG:  logical replication apply worker for subscription "test_sub" has started
2020-01-30 23:07:01.333 UTC [152] ERROR:  logical replication target relation "public.test_table" is missing some replicated columns

And mirroring it on the publisher:

2020-01-30 23:07:01.332 UTC [5764] LOG:  starting logical decoding for slot "test_sub"
2020-01-30 23:07:01.332 UTC [5764] DETAIL:  Streaming transactions committing after 0/16C2228, reading WAL from 0/16BEFD0.
2020-01-30 23:07:01.332 UTC [5764] LOG:  logical decoding found consistent point at 0/16BEFD0
2020-01-30 23:07:01.332 UTC [5764] DETAIL:  There are no running transactions.

No surprises there, I suppose. So then I deleted the column from the publisher. With that done, the watch command started erroring (fine, no surprise), and this message started showing up in the subscriber logs:

2020-01-30 23:08:57.090 UTC [1] LOG:  background worker "logical replication worker" (PID 179) exited with exit code 1
2020-01-30 23:09:02.096 UTC [181] LOG:  logical replication apply worker for subscription "test_sub" has started
2020-01-30 23:09:02.102 UTC [181] ERROR:  logical replication target relation "public.test_table" is missing some replicated columns
2020-01-30 23:09:02.104 UTC [1] LOG:  background worker "logical replication worker" (PID 181) exited with exit code 1

So yeah, that shit didn't work. It's pretty clear that although we're sending good data now, the fact that we accumulated data on the publisher while the subscriber lacked the column means that that data needs to get flushed. And the only way to flush it is to have all the right places to put it in the subscriber.

Fixing this can be done by adding the column back to the subscriber, waiting for the changes to flush, and then removing it again.
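A sketch of that recovery, with test_col standing in for whatever the dropped column was:

```sql
-- On the subscriber: re-add the column so the backlog can apply.
ALTER TABLE test_table ADD COLUMN test_col varchar(10);

-- Once replication has caught up, and the column has also been
-- dropped on the publisher, drop it again:
ALTER TABLE test_table DROP COLUMN test_col;
```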

@mlissner (Member Author)

OK, the next test was to delete the column from the publisher first. This worked better than I thought it would. Weirdly, the subscriber had no issues and the publisher was fine. What's strange, though, is looking at the column on the subscriber that is no longer getting data from the publisher. It was a varchar(n) field, and it just contains blanks. What? So replication is setting those values???

I'm going to try this with a non-nullable integer field and see what tomfoolery it gets up to in that instance.

@mlissner (Member Author)

OK, this test took a few tweaks to get right, but I began by adding a new column on both sides with:

alter table test_table add column test_int_col INTEGER DEFAULT 1 NOT NULL;

And then running on both:

alter table test_table alter column test_int_col drop default;

I then set up the watch to do the following inside the 5432 docker image:

watch -n 0.25 'psql -U postgres -d test -c "insert into test_table values (DEFAULT, '\''2'\'');"'

That got things rolling, so I dropped the column:

alter table test_table drop column test_int_col;

Then we got this in the publisher's logs:

2020-01-30 23:24:44.210 UTC [18787] ERROR:  INSERT has more expressions than target columns at character 41
2020-01-30 23:24:44.210 UTC [18787] STATEMENT:  insert into test_table values (DEFAULT, '2');

So I killed the watch and started a fresh one to run:

watch -n 0.25 'psql -U postgres -d test -c "insert into test_table default values;"'

That just inserts rows where only the auto-incrementing PK field gets a value at this point.

At that point, I got the following errors on the subscriber:

2020-01-30 23:25:02.354 UTC [230] ERROR:  null value in column "test_int_col" violates not-null constraint
2020-01-30 23:25:02.354 UTC [230] DETAIL:  Failing row contains (1639, null).

Which is fair. We had a NOT NULL column on the subscriber that was getting null values from the publisher. I was then able to fix these by dropping the column on the subscriber as well.

After that point things flushed and we were in business.
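For checking whether the backlog has actually flushed, the monitoring views look useful (a sketch; these views exist as of PG 10, but I'm not asserting this is the canonical check):

```sql
-- On the subscriber: per-subscription apply progress.
SELECT subname, received_lsn, latest_end_lsn FROM pg_stat_subscription;

-- On the publisher: how far the replication slot has confirmed.
SELECT slot_name, confirmed_flush_lsn FROM pg_replication_slots;
```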

@mlissner (Member Author)

So...takeaways:

  1. If you drop it at the subscriber first, you're looking for trouble. The publisher will keep sending data with those tuples and the subscriber won't know what to do until it can flush the tuples, which it can't do without adding the column back.

  2. If you drop at the publisher first, you can sometimes keep sending data to the subscriber, if all the columns have default values or are nullable. If not, then things will fail until you drop the correct columns on the subscriber as well.

Welp, this is annoying because adding columns should be done at the subscriber first, and dropping should be done at the publisher first. Great.
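Spelled out as SQL, the safe orderings look like this (column names are placeholders):

```sql
-- Adding a column: subscriber first, then publisher.
ALTER TABLE test_table ADD COLUMN new_col integer;   -- 1. on subscriber
ALTER TABLE test_table ADD COLUMN new_col integer;   -- 2. on publisher

-- Dropping a column: publisher first, then subscriber.
ALTER TABLE test_table DROP COLUMN old_col;          -- 1. on publisher
ALTER TABLE test_table DROP COLUMN old_col;          -- 2. on subscriber
-- (If old_col on the subscriber is NOT NULL without a default,
-- replication will error until the subscriber-side drop happens.)
```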

@dhaenoor

If you delete columns from tables that use REPLICA IDENTITY FULL, you will run into trouble. If the tuple on the master looks like this:
|01|first_text_column|second_clmn|
and the row on the replica looks like this:
|01|first_text_column|
you'll get the "missing some replicated columns" error.
In the opposite case, with the tuple on the master like this:
|01|first_text_column|
and on the replica like this:
|01|first_text_column|second_clmn|
you will never see any UPDATE or DELETE results on the replica.
Try it with log_min_messages=debug1 and then watch the log file for this:
DEBUG: logical replication did not find row for update in replication target relation "your_table_name"
or this:
DEBUG: logical replication could not find row for delete in replication target relation "your_table_name"

Before deleting columns, you should set a PRIMARY KEY or REPLICA IDENTITY USING INDEX.
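A sketch of that advice, assuming id is a usable key and test_table_id_key is an existing unique index (both names are placeholders):

```sql
-- Either give the table a primary key:
ALTER TABLE test_table ADD PRIMARY KEY (id);

-- Or point the replica identity at an existing unique,
-- non-partial index over NOT NULL columns:
ALTER TABLE test_table REPLICA IDENTITY USING INDEX test_table_id_key;
```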

@mlissner (Member Author)

Yeah, that makes sense, but sure looks nasty.
