-
Notifications
You must be signed in to change notification settings - Fork 668
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve isolation tests for metadata syncing #5682
Conversation
Hi @hanefi, Thanks, the tests you added looks good. On top of the missing items you have as TODO, we could consider the following as well: Few more ideas could be:
|
762fc55
to
8c340fa
Compare
Parsed test spec with 3 sessions | ||
|
||
starting permutation: s1-begin s2-begin s1-start-metadata-sync s2-alter-table s1-rollback s2-rollback s3-compare-snapshot | ||
create_distributed_function |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think not a critical thing, but seeing create_distributed_function
on each permutation felt unexpected at first. I guess, it shows up because it is the last step of the setup. Can we somehow avoid this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went over the statements in the setup phase, and the outputs are more clear now.
SELECT create_distributed_table('new_dist_table', 'id'); | ||
<waiting ...> | ||
step s1-rollback: | ||
ROLLBACK; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we ROLLBACK in the tests? Wouldn't it be better to COMMIT
and s3-compare-snapshot
? Such that we ensure the current operation is done successfully? '
Valid for all rollback steps in the test -- unless there are any specific ROLLBACKs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I commit everything now. It makes more sense to check consistent metadata snapshots after committing all the changes.
Why do we ROLLBACK in the tests?
The teardown phase were missing some statements that made it quite hard to test everything.
For example a type that was not dropped in a permutation or in the teardown section breaks future permutations. However I have an extensive list of objects to drop in the teardown section now and it is all good.
|
||
step "s3-compare-snapshot" | ||
{ | ||
SELECT bool_and(result::text[] @> activate_node_snapshot() AND result::text[] <@ activate_node_snapshot()) AS same_snapshot_across_cluster |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if my below suggestion is simpler than this, but at least I find it easier to read and seems does a full line by line comparison across all nodes:
SELECT count(*) = 0
FROM
(
(
SELECT unnest(activate_node_snapshot())
EXCEPT
SELECT unnest(string_to_array(RESULT, 'CANNOTBEUSEDSTRING')) AS unnested_result
FROM run_command_on_workers($$SELECT array_to_string(activate_node_snapshot(), 'CANNOTBEUSEDSTRING')$$)
GROUP BY unnested_result
)
UNION
(
SELECT unnest(string_to_array(RESULT, 'CANNOTBEUSEDSTRING')) AS unnested_result
FROM run_command_on_workers($$SELECT array_to_string(activate_node_snapshot(), 'CANNOTBEUSEDSTRING')$$)
GROUP BY unnested_result
EXCEPT
SELECT unnest(activate_node_snapshot())
)
) AS foo;
One difference I found is that the coordinator generates more lines for things like:
SET citus.enable_ddl_propagation TO 'on'
SET citus.enable_ddl_propagation TO 'off'
SET ROLE onderkalaci
I think that is a separate discussion, the above query still does a full control for
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I simplified your suggested queries. Let me know if you think they are ok.
Basically, I removed conversions from text[] array into string, and vice-versa. We can already use text[] values returned by the run_command_on_workers()
UDF.
permutation "s1-begin" "s2-begin" "s1-start-metadata-sync" "s2-create-type" "s1-commit" "s2-commit" "s3-compare-snapshot" "s2-drop-type" | ||
permutation "s1-begin" "s2-begin" "s1-start-metadata-sync" "s2-create-dist-func" "s1-commit" "s2-commit" "s3-compare-snapshot" "s2-drop-dist-func" | ||
|
||
// the following operation creates a distributed deadlock and gets cancelled |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll check this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added one more deadlock case when dropping a distributed function concurrent with a start_metadata_sync
operation.
I now have a broken isolation test I am sharing all the lines before the diff so that I can show the information on the whole permutation. starting permutation: s1-add-node-2 s2-begin s2-create-append-table s1-remove-node-2 s2-commit s2-select
node_name|node_port
---------------------------------------------------------------------
localhost| 57637
(1 row)
step s1-add-node-2:
SELECT 1 FROM master_add_node('localhost', 57638);
?column?
---------------------------------------------------------------------
1
(1 row)
step s2-begin:
BEGIN;
step s2-create-append-table:
SET citus.shard_replication_factor TO 1;
CREATE TABLE dist_table (x int, y int);
SELECT create_distributed_table('dist_table', 'x', 'append');
SELECT 1 FROM master_create_empty_shard('dist_table');
create_distributed_table
---------------------------------------------------------------------
(1 row)
?column?
---------------------------------------------------------------------
1
(1 row)
step s1-remove-node-2:
SELECT * FROM master_remove_node('localhost', 57638);
<waiting ...>
step s2-commit:
COMMIT;
step s1-remove-node-2: <... completed>
-ERROR: cannot remove or disable the node localhost:xxxxx because because it contains the only shard placement for shard xxxxx
+master_remove_node
+---------------------------------------------------------------------
+
+(1 row)
+
step s2-select:
SELECT * FROM dist_table;
x|y
---------------------------------------------------------------------
(0 rows)
master_remove_node
---------------------------------------------------------------------
-(2 rows)
+(1 row)
|
ALTER SEQUENCE pg_catalog.pg_dist_node_nodeid_seq RESTART 123000; | ||
ALTER SEQUENCE pg_catalog.pg_dist_shardid_seq RESTART 123000; | ||
|
||
SELECT 1 FROM master_add_node('localhost', 57637); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we add / remove nodes in the setup / tear down? Are they strictly necessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No they are not. I used to have one node with metadata, and one without. This is no longer the case now.
My original plan was to test citus_add_node
, start_metadata_sync_to_node
, trigger_metadata_sync
and I no longer need to add and drop the nodes at every permutation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you agree with my comments, we can merge and continue with DROP bug(s). Otherwise, let's discuss further
permutation "s1-begin" "s2-begin" "s1-start-metadata-sync" "s2-create-type" "s1-commit" "s2-commit" "s3-compare-snapshot" | ||
permutation "s1-begin" "s2-begin" "s1-start-metadata-sync" "s2-create-dist-func" "s1-commit" "s2-commit" "s3-compare-snapshot" | ||
|
||
// the following operations create a distributed deadlock and gets cancelled |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
my suggestion is to remove these two tests from this PR and merge this. And then dive into why DROP objects fails in general
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For future reference, here are the permutations that caused deadlocks that I removed from the file:
// the following operations create a distributed deadlock and gets cancelled
permutation "s2-create-type" "s1-begin" "s2-begin" "s1-start-metadata-sync" "s2-drop-type" "s1-commit" "s2-commit" "s3-compare-snapshot"
permutation "s2-create-dist-func" "s1-begin" "s2-begin" "s1-start-metadata-sync" "s2-drop-dist-func" "s1-commit" "s2-commit" "s3-compare-snapshot"
|
||
step "s2-attach-partition" | ||
{ | ||
ALTER TABLE dist_partitioned_table ATTACH PARTITION dist_partitioned_table_p1 FOR VALUES FROM (1) TO (9); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
one final request: Can you also please add CREATE PARTITION OF
test? That goes from a slightly different code-path, so would be good to have that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We already cover this. See s2-create-partition-of
step.
permutation "s1-begin" "s2-begin" "s1-start-metadata-sync" "s2-drop-table" "s1-commit" "s2-commit" "s3-compare-snapshot" | ||
permutation "s1-begin" "s2-begin" "s1-start-metadata-sync" "s2-create-dist-table" "s1-commit" "s2-commit" "s3-compare-snapshot" | ||
permutation "s1-begin" "s2-begin" "s1-start-metadata-sync" "s2-create-ref-table" "s1-commit" "s2-commit" "s3-compare-snapshot" | ||
permutation "s1-begin" "s2-begin" "s1-start-metadata-sync" "s2-attach-partition" "s1-commit" "s2-commit" "s3-compare-snapshot" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we add "s1-start-metadata-sync" vs "s2-start-metadata-sync" as well? We might have such test in another file. If so, we can skip addinng that here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I created another section above for these tests.
This commit introduces several test cases for concurrent operations that change metadata, and a concurrent metadata sync operation. The overall structure is as follows: - Session#1 starts metadata syncing in a transaction block - Session#2 does an operation that change metadata - Both sessions are committed - Another session checks whether the metadata are the same accross all nodes in the cluster.
8868398
to
2e5ca8b
Compare
@onderkalaci I addressed your comments, and I did some more changes:
Please take another quick look before I merge this one. |
thanks for these further simplifications, we are good to merge |
This PR aims to cover some concurrent operations with metadata syncing.
The list of operations are tracked in #1199
I added tests that capture the following classes of queries
start_metadata_syncing
that are blockedALTER TABLE
DROP TABLE
create_distributed_table()
create_reference_table()
ATTACH PARTITION
DETACH PARTITION
CREATE TABLE .. PARTITION OF
DROP TYPE
commands for distributed types.TODO: