Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Check before logicalrep for rebalancer, error if needed #6754

Merged
merged 5 commits into from
Mar 21, 2023

Conversation

agedemenli
Copy link
Contributor

DESCRIPTION: Check before logicalrep for rebalancer, error if needed

Check if we can use logical replication or not, in case of shard transfer mode = auto, before executing the shard moves. If we can't, error out. Before this PR, we used to error out in the middle of shard moves:

set citus.shard_count = 4; -- just to get the error sooner
select citus_remove_node('localhost',9702);

create table t1 (a int primary key);
select create_distributed_table('t1','a');
create table t2 (a bigint);
select create_distributed_table('t2','a');

select citus_add_node('localhost',9702);
select rebalance_table_shards();
NOTICE:  Moving shard 102008 from localhost:9701 to localhost:9702 ...
NOTICE:  Moving shard 102009 from localhost:9701 to localhost:9702 ...
NOTICE:  Moving shard 102012 from localhost:9701 to localhost:9702 ...
ERROR:  cannot use logical replication to transfer shards of the relation t2 since it doesn't have a REPLICA IDENTITY or PRIMARY KEY

Now we check and error out in the beginning, without moving the shards.

fixes: #6727

@agedemenli agedemenli requested a review from JelteF March 9, 2023 10:49
@onderkalaci
Copy link
Member

Can we also squeeze in this https://github.com/citusdata/citus-enterprise/issues/394? I have seen customers hit that

@JelteF
Copy link
Contributor

JelteF commented Mar 9, 2023

@onderkalaci I think that's different enough that it should have a separate fix. Main reason is that force_logical will not work for UNLOGGED tables either. And also because it should be done in multiple places, e.g. also for create_distributed_table_concurrently.

@onderkalaci
Copy link
Member

@onderkalaci I think that's different enough that it should have a separate fix. Main reason is that force_logical will not work for UNLOGGED tables either. And also because it should be done in multiple places, e.g. also for create_distributed_table_concurrently.

Yeah, I tried anyway as you are touching relevant places :p

Comment on lines +1844 to +1859
if (transferMode == TRANSFER_MODE_AUTOMATIC)
{
/*
* If the shard transfer mode is set to auto, we should check beforehand
* if we are able to use logical replication to transfer shards or not.
* We throw an error if any of the tables do not have a replica identity, which
* is required for logical replication to replicate UPDATE and DELETE commands.
*/
PlacementUpdateEvent *placementUpdate = NULL;
foreach_ptr(placementUpdate, placementUpdateList)
{
Oid relationId = RelationIdForShard(placementUpdate->shardId);
List *colocatedTableList = ColocatedTableList(relationId);
VerifyTablesHaveReplicaIdentity(colocatedTableList);
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This check should be done even before planning the moves and before calling EnsureReferenceTablesExistOnAllNodesExtended. We do this already for the background rebalancer:

const char shardTransferMode = LookupShardTransferMode(shardReplicationModeOid);
List *colocatedTableList = NIL;
Oid relationId = InvalidOid;
foreach_oid(relationId, options->relationIdList)
{
colocatedTableList = list_concat(colocatedTableList,
ColocatedTableList(relationId));
}
Oid colocatedTableId = InvalidOid;
foreach_oid(colocatedTableId, colocatedTableList)
{
EnsureTableOwner(colocatedTableId);
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think what we actually want to check is the move list itself. Otherwise we would be throwing the logical replication error even for the balanced tables (if they don't have a replica identity or primary key), but we actually were not going to move their shards. If we move the check to somewhere before planning the moves, then we might even throw that error instead of "no moves available" notice message. I think we should do the same for the background rebalancer. We should call VerifyTablesHaveReplicaIdentity for each element in placementUpdateList.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does seem useful to indicate to users that their future rebalances will fail by throwing the error even when there are no moves, because otherwise they will find out when they actually do need to rebalance. I don't feel strongly about this though.

@codecov
Copy link

codecov bot commented Mar 9, 2023

Codecov Report

Merging #6754 (3457f89) into main (aa465b6) will increase coverage by 0.00%.
The diff coverage is 100.00%.

❗ Current head 3457f89 differs from pull request most recent head 3bda153. Consider uploading reports for the commit 3bda153 to get more accurate results

@@           Coverage Diff            @@
##             main    #6754    +/-   ##
========================================
  Coverage   93.16%   93.16%            
========================================
  Files         260      259     -1     
  Lines       56103    55924   -179     
========================================
- Hits        52267    52102   -165     
+ Misses       3836     3822    -14     

@agedemenli agedemenli force-pushed the check-beforehand-if-logicalrep branch from 925a70b to 92779b9 Compare March 9, 2023 16:49
@agedemenli
Copy link
Contributor Author

@JelteF @marcocitus shall we merge this?

@agedemenli agedemenli requested review from JelteF and marcocitus March 15, 2023 08:06
@agedemenli agedemenli enabled auto-merge (squash) March 21, 2023 13:25
@agedemenli agedemenli merged commit 2713e01 into main Mar 21, 2023
@agedemenli agedemenli deleted the check-beforehand-if-logicalrep branch March 21, 2023 13:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Check shard replication mode before moving shards
4 participants