
Remove dependency on placement caches from DROP #6401

Closed

Conversation

hanefi (Member) commented Oct 6, 2022

When dropping objects, citus_drop_trigger should not depend on the validity of the caches for placements, as it may cause error messages from time to time. With this commit, we remove the dependency on the validity of the caches for shard placements.

However, citus_drop_trigger still depends on WorkerNodeCache in 3 different codepaths listed below:

- master_remove_distributed_table_metadata_from_workers
 -> MasterRemoveDistributedTableMetadataFromWorkers
  -> SendCommandToWorkersWithMetadata
   -> SendCommandToMetadataWorkersParams
    -> TargetWorkerSetNodeList
     -> ActivePrimaryNonCoordinatorNodeList
      -> FilterActiveNodeListFunc
       -> GetWorkerNodeHash

- citus_drop_all_shards
 -> DropShards
  -> DropTaskList
   -> ShardPlacementListIncludingOrphanedPlacementsViaCatalog
    -> LookupNodeForGroup
     -> PrepareWorkerNodeCache

- master_remove_partition_metadata
 -> DeleteColocationGroupIfNoTablesBelong
  -> DeleteColocationGroup
   -> SyncDeleteColocationGroupToNodes
    -> SendCommandToWorkersWithMetadataViaSuperUser
     -> SendCommandToMetadataWorkersParams
      -> TargetWorkerSetNodeList
       -> ActivePrimaryNonCoordinatorNodeList
        -> FilterActiveNodeListFunc
         -> GetWorkerNodeHash

Related: #5372 #6041

hanefi (Member, Author) commented Oct 6, 2022

Apparently I broke DROP commands here. Let me post some updates on the failures.

DROP commands complete just fine, but with some extra NOTICE messages saying that the shards do not exist.

 DROP TABLE data_load_test1, data_load_test2;

 ... redacted some output ...

+NOTICE:  table "data_load_test1_360021" does not exist, skipping
+CONTEXT:  SQL statement "SELECT citus_drop_all_shards(v_obj.objid, v_obj.schema_name, v_obj.object_name, drop_shards_metadata_only := false)"
+PL/pgSQL function citus_drop_trigger() line 25 at PERFORM
+NOTICE:  table "data_load_test1_360021" does not exist, skipping
+CONTEXT:  SQL statement "SELECT citus_drop_all_shards(v_obj.objid, v_obj.schema_name, v_obj.object_name, drop_shards_metadata_only := false)"
+PL/pgSQL function citus_drop_trigger() line 25 at PERFORM
+NOTICE:  table "data_load_test1_360022" does not exist, skipping
+CONTEXT:  SQL statement "SELECT citus_drop_all_shards(v_obj.objid, v_obj.schema_name, v_obj.object_name, drop_shards_metadata_only := false)"
+PL/pgSQL function citus_drop_trigger() line 25 at PERFORM

Later in the same test, we see that those shards still exist on the worker nodes; they were not dropped.

 -- There should be no table on the worker node
 \c - - :public_worker_1_host :worker_1_port
 SELECT relname FROM pg_class WHERE relname LIKE 'data_load_test%';
         relname         
 ------------------------
-(0 rows)
+ data_load_test1_360021
+ data_load_test1_360022
+ data_load_test1_360023
+ data_load_test1_360024
+ data_load_test2_360025
+ data_load_test2_360026
+ data_load_test2_360027
+ data_load_test2_360028
+(8 rows)

I guess I somehow sent the DROP commands for the shards to the coordinator instead of the correct worker nodes. I will continue investigating.
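To check where each shard's DROP should have been routed, a query like the following can list the recorded placement of every affected shard (a sketch, not part of this PR; it assumes the standard Citus metadata tables and the test table names above):

```sql
-- List which worker each shard placement is recorded on, by joining
-- the Citus metadata tables on the coordinator.
SELECT s.logicalrelid, p.shardid, n.nodename, n.nodeport
FROM pg_dist_shard s
JOIN pg_dist_placement p USING (shardid)
JOIN pg_dist_node n ON n.groupid = p.groupid
WHERE s.logicalrelid::text LIKE 'data_load_test%'
ORDER BY p.shardid;
```

Comparing this against the nodes the DROP commands were actually sent to should show whether the catalog-based placement lookup resolved to the wrong group.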

Test failures are from this CI run

hanefi (Member, Author) commented Oct 6, 2022

This PR also aims to fix the errors that we see in the codepath shared in #5780 (comment)

I haven't tested it fully, but I will try the repro steps and the other examples shared in the issue linked above.

hanefi (Member, Author) commented Feb 2, 2023

Superseded by #6687

@hanefi hanefi closed this Feb 2, 2023
@hanefi hanefi deleted the remove-cache-usage-from-drops branch April 19, 2023 02:03