
Improve isolation tests for metadata syncing #5682

Merged 1 commit into master on Feb 11, 2022
Conversation

hanefi
Member

@hanefi hanefi commented Feb 4, 2022

This PR aims to cover some concurrent operations with metadata syncing.

The list of operations are tracked in #1199

I added tests that capture the following classes of queries:

  • concurrent operations with start_metadata_syncing that are blocked
    • ALTER TABLE
    • DROP TABLE
    • create_distributed_table()
    • create_reference_table()
    • ATTACH PARTITION
    • DETACH PARTITION
    • CREATE TABLE .. PARTITION OF
    • Adding foreign keys from dist to ref tables
    • Dropping foreign keys from dist to ref tables
  • Operations that are not blocked
    • creating distributed types
    • Creating distributed functions
  • Operations that are cancelled by distributed deadlock detection
    • DROP TYPE commands for distributed types.
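
The blocked scenarios above follow the usual isolation tester pattern: session 1 holds start_metadata_sync_to_node open in a transaction while session 2 runs the DDL. A minimal sketch of one such permutation (step names, the worker address, and object names here are illustrative, not necessarily the ones in the spec file):

```sql
session "s1"
step "s1-begin"               { BEGIN; }
// assumes a worker at localhost:57637; holds locks that block concurrent metadata writes
step "s1-start-metadata-sync" { SELECT start_metadata_sync_to_node('localhost', 57637); }
step "s1-commit"              { COMMIT; }

session "s2"
step "s2-begin"       { BEGIN; }
step "s2-alter-table" { ALTER TABLE dist_table ADD COLUMN z int; }
step "s2-commit"      { COMMIT; }

// s2-alter-table should show as <waiting ...> until s1 commits
permutation "s1-begin" "s2-begin" "s1-start-metadata-sync" "s2-alter-table" "s1-commit" "s2-commit"
```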

TODO:

@onderkalaci
Member

Hi @hanefi,

Thanks, the tests you added look good. On top of the missing items you have as TODO, we could consider a few more ideas as well:

Parsed test spec with 3 sessions

starting permutation: s1-begin s2-begin s1-start-metadata-sync s2-alter-table s1-rollback s2-rollback s3-compare-snapshot
create_distributed_function
Member

Not a critical thing, but seeing create_distributed_function on each permutation felt unexpected at first. I guess it shows up because it is the last step of the setup. Can we somehow avoid this?

Member Author

I went over the statements in the setup phase, and the outputs are clearer now.

SELECT create_distributed_table('new_dist_table', 'id');
<waiting ...>
step s1-rollback:
ROLLBACK;
Member

Why do we ROLLBACK in the tests? Wouldn't it be better to COMMIT and then s3-compare-snapshot, so that we ensure the current operation completed successfully?

This applies to all rollback steps in the test -- unless any of them is meant to be a ROLLBACK specifically.

Member Author

@hanefi hanefi Feb 10, 2022


I commit everything now. It makes more sense to check consistent metadata snapshots after committing all the changes.

Why do we ROLLBACK in the tests?

The teardown phase was missing some statements, which made it quite hard to test everything.
For example, a type that was not dropped in a permutation or in the teardown section breaks future permutations. However, I now have an extensive list of objects to drop in the teardown section, and it is all good.
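
A teardown section along those lines might look like the following (a sketch; the object names are illustrative, not the exact ones in the spec file):

```sql
teardown
{
    -- drop everything a permutation may have left behind, so that
    -- later permutations and specs start from consistent metadata
    DROP TABLE IF EXISTS dist_table, new_dist_table, dist_partitioned_table CASCADE;
    DROP TYPE IF EXISTS my_dist_type CASCADE;
    DROP FUNCTION IF EXISTS my_dist_func(int) CASCADE;
}
```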


step "s3-compare-snapshot"
{
SELECT bool_and(result::text[] @> activate_node_snapshot() AND result::text[] <@ activate_node_snapshot()) AS same_snapshot_across_cluster
Member

Not sure if my suggestion below is simpler than this, but I find it easier to read, and it seems to do a full line-by-line comparison across all nodes:

SELECT count(*) = 0
FROM 
(
    (
    	SELECT unnest(activate_node_snapshot())
     		EXCEPT 
     	SELECT unnest(string_to_array(RESULT, 'CANNOTBEUSEDSTRING')) AS unnested_result
     	FROM run_command_on_workers($$SELECT array_to_string(activate_node_snapshot(), 'CANNOTBEUSEDSTRING')$$)
     	GROUP BY unnested_result
     ) 
  UNION
    (
    	SELECT unnest(string_to_array(RESULT, 'CANNOTBEUSEDSTRING')) AS unnested_result
     	FROM run_command_on_workers($$SELECT array_to_string(activate_node_snapshot(), 'CANNOTBEUSEDSTRING')$$)
     	GROUP BY unnested_result
  			EXCEPT
    	SELECT unnest(activate_node_snapshot())
    )
) AS foo;

One difference I found is that the coordinator generates more lines for things like:

 SET citus.enable_ddl_propagation TO 'on'
 SET citus.enable_ddl_propagation TO 'off'
 SET ROLE onderkalaci

I think that is a separate discussion; the above query still does a full check.

Member Author

I simplified your suggested queries. Let me know if you think they are ok.

Basically, I removed the conversions from text[] arrays into strings and vice versa. We can use the text[] values returned by the run_command_on_workers() UDF directly.
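
A simplified comparison along those lines, using the text[] values directly, could look like the following (a sketch, not necessarily the exact query that was merged):

```sql
SELECT count(*) = 0 AS same_snapshot_across_cluster
FROM
(
    (
        -- lines present on the coordinator but missing on some worker
        SELECT unnest(activate_node_snapshot())
            EXCEPT
        SELECT unnest(result::text[])
        FROM run_command_on_workers($$SELECT activate_node_snapshot()$$)
    )
  UNION
    (
        -- lines present on a worker but missing on the coordinator
        SELECT unnest(result::text[])
        FROM run_command_on_workers($$SELECT activate_node_snapshot()$$)
            EXCEPT
        SELECT unnest(activate_node_snapshot())
    )
) AS snapshot_diff;
```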

permutation "s1-begin" "s2-begin" "s1-start-metadata-sync" "s2-create-type" "s1-commit" "s2-commit" "s3-compare-snapshot" "s2-drop-type"
permutation "s1-begin" "s2-begin" "s1-start-metadata-sync" "s2-create-dist-func" "s1-commit" "s2-commit" "s3-compare-snapshot" "s2-drop-dist-func"

// the following operation creates a distributed deadlock and gets cancelled
Member

I'll check this

Member Author

I added one more deadlock case when dropping a distributed function concurrently with a start_metadata_sync operation.

@hanefi
Member Author

hanefi commented Feb 10, 2022

I now have a broken isolation test isolation_create_table_vs_add_remove_node that I am trying to fix.

I am sharing all the lines before the diff so that I can show the whole permutation.

starting permutation: s1-add-node-2 s2-begin s2-create-append-table s1-remove-node-2 s2-commit s2-select
node_name|node_port
---------------------------------------------------------------------
localhost|    57637
(1 row)

step s1-add-node-2:
 SELECT 1 FROM master_add_node('localhost', 57638);

?column?
---------------------------------------------------------------------
       1
(1 row)

step s2-begin:
 BEGIN;

step s2-create-append-table:
 SET citus.shard_replication_factor TO 1;
 CREATE TABLE dist_table (x int, y int);
 SELECT create_distributed_table('dist_table', 'x', 'append');
 SELECT 1 FROM master_create_empty_shard('dist_table');

create_distributed_table
---------------------------------------------------------------------

(1 row)

?column?
---------------------------------------------------------------------
       1
(1 row)

step s1-remove-node-2:
 SELECT * FROM master_remove_node('localhost', 57638);
 <waiting ...>
step s2-commit: 
 COMMIT;


 step s1-remove-node-2: <... completed>
-ERROR:  cannot remove or disable the node localhost:xxxxx because because it contains the only shard placement for shard xxxxx
+master_remove_node
+---------------------------------------------------------------------
+
+(1 row)
+

step s2-select:
 SELECT * FROM dist_table;

x|y
---------------------------------------------------------------------
(0 rows)

master_remove_node
---------------------------------------------------------------------

-(2 rows)
+(1 row)

ALTER SEQUENCE pg_catalog.pg_dist_node_nodeid_seq RESTART 123000;
ALTER SEQUENCE pg_catalog.pg_dist_shardid_seq RESTART 123000;

SELECT 1 FROM master_add_node('localhost', 57637);
Member

Why do we add/remove nodes in the setup/teardown? Are they strictly necessary?

Member Author

No, they are not. I used to have one node with metadata and one without, but this is no longer the case.

My original plan was to test citus_add_node, start_metadata_sync_to_node, and trigger_metadata_sync, and I no longer need to add and drop the nodes at every permutation.

Member

@onderkalaci onderkalaci left a comment

If you agree with my comments, we can merge and continue with the DROP bug(s). Otherwise, let's discuss further.

permutation "s1-begin" "s2-begin" "s1-start-metadata-sync" "s2-create-type" "s1-commit" "s2-commit" "s3-compare-snapshot"
permutation "s1-begin" "s2-begin" "s1-start-metadata-sync" "s2-create-dist-func" "s1-commit" "s2-commit" "s3-compare-snapshot"

// the following operations create a distributed deadlock and get cancelled
Member

My suggestion is to remove these two tests from this PR and merge it, and then dive into why dropping objects fails in general.

Member Author

For future reference, here are the permutations that caused deadlocks that I removed from the file:

// the following operations create a distributed deadlock and get cancelled
permutation "s2-create-type" "s1-begin" "s2-begin" "s1-start-metadata-sync" "s2-drop-type" "s1-commit" "s2-commit" "s3-compare-snapshot"
permutation "s2-create-dist-func" "s1-begin" "s2-begin" "s1-start-metadata-sync" "s2-drop-dist-func" "s1-commit" "s2-commit" "s3-compare-snapshot"


step "s2-attach-partition"
{
ALTER TABLE dist_partitioned_table ATTACH PARTITION dist_partitioned_table_p1 FOR VALUES FROM (1) TO (9);
Member

One final request: can you also please add a CREATE TABLE .. PARTITION OF test? That goes through a slightly different code path, so it would be good to have that covered.

Member Author

We already cover this; see the s2-create-partition-of step.

permutation "s1-begin" "s2-begin" "s1-start-metadata-sync" "s2-drop-table" "s1-commit" "s2-commit" "s3-compare-snapshot"
permutation "s1-begin" "s2-begin" "s1-start-metadata-sync" "s2-create-dist-table" "s1-commit" "s2-commit" "s3-compare-snapshot"
permutation "s1-begin" "s2-begin" "s1-start-metadata-sync" "s2-create-ref-table" "s1-commit" "s2-commit" "s3-compare-snapshot"
permutation "s1-begin" "s2-begin" "s1-start-metadata-sync" "s2-attach-partition" "s1-commit" "s2-commit" "s3-compare-snapshot"
Member

Should we add "s1-start-metadata-sync" vs "s2-start-metadata-sync" as well? We might have such a test in another file. If so, we can skip adding it here.

Member Author

I created another section above for these tests.

This commit introduces several test cases for concurrent operations that
change metadata, and a concurrent metadata sync operation.

The overall structure is as follows:
- Session#1 starts metadata syncing in a transaction block
- Session#2 does an operation that changes metadata
- Both sessions commit
- Another session checks whether the metadata is the same across all
  nodes in the cluster.
@hanefi hanefi marked this pull request as ready for review February 10, 2022 23:04
@hanefi
Member Author

hanefi commented Feb 10, 2022

@onderkalaci I addressed your comments, and I did some more changes:

  1. I moved the test to later in the schedule as I no longer need to add or remove nodes.
  2. I added more permutations with comments so that it will be easier to understand how to fix the tests if they fail. For example, I check for metadata consistency before running any of the queries so if an earlier test failed and broke metadata, it is easier to understand that.

Please take another quick look before I merge this one.

@onderkalaci
Member

Please take another quick look before I merge this one.

Thanks for these further simplifications, we are good to merge.

@hanefi hanefi merged commit 986d8cf into master Feb 11, 2022
@hanefi hanefi deleted the metadata-iso-tests branch February 11, 2022 13:05