start_metadata_sync_to_node concurrently with DROP TABLE may error #4366

Closed
Tracked by #5304
marcocitus opened this issue Nov 30, 2020 · 3 comments

marcocitus commented Nov 30, 2020

Regression test output shows a concurrency bug:

--- /home/circleci/project/src/test/regress/expected/master_copy_shard_placement.out.modified	2020-11-30 18:25:03.511311567 +0000
+++ /home/circleci/project/build-13/src/test/regress/results/master_copy_shard_placement.out.modified	2020-11-30 18:25:03.515311569 +0000
@@ -100,25 +100,21 @@
 SELECT count(*) FROM history;
  count 
 -------
      2
 (1 row)
 
 -- test we can not replicate MX tables
 SET citus.shard_replication_factor TO 1;
 SET citus.replication_model TO 'streaming';
 SELECT start_metadata_sync_to_node('localhost', :worker_1_port);
- start_metadata_sync_to_node
----------------------------------------------------------------------
-
-(1 row)
-
+ERROR:  relation with OID 34467 does not exist
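The error indicates a race: a table is enumerated for the metadata snapshot and then dropped by a concurrent session before start_metadata_sync_to_node looks up its metadata cache entry again. A minimal sketch of the kind of interleaving involved, assuming a worker on port 57637 and a hypothetical distributed table dist_table (not the actual regression schedule):

-- session A: start syncing metadata to a worker
SELECT start_metadata_sync_to_node('localhost', 57637);

-- session B: concurrently drop a distributed table
DROP TABLE dist_table;

-- If the DROP commits after session A has listed dist_table for the
-- metadata snapshot but before it fetches the table's cache entry,
-- session A errors out with "relation with OID ... does not exist".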
marcocitus added the bug label Nov 30, 2020
@marcocitus

Stack trace (the ERROR is raised from GetCitusTableCacheEntry, reached via DetachPartitionCommandList while MetadataDropCommands builds the metadata snapshot):

#0  0x00007f9eb45d091e in GetCitusTableCacheEntry (distributedRelationId=<optimized out>) at metadata/metadata_cache.c:931
931                             ereport(ERROR, (errmsg("relation with OID %u does not exist",
(gdb) bt
#0  0x00007f9eb45d091e in GetCitusTableCacheEntry (distributedRelationId=<optimized out>) at metadata/metadata_cache.c:931
#1  0x00007f9eb45d099f in CitusTableList () at metadata/metadata_cache.c:467
#2  0x00007f9eb45d20f9 in DetachPartitionCommandList () at metadata/metadata_sync.c:1453
#3  0x00007f9eb45d2521 in MetadataDropCommands () at metadata/metadata_sync.c:592
#4  0x00007f9eb45d3161 in SyncMetadataSnapshotToNode (workerNode=workerNode@entry=0x559e1f209ab8, raiseOnError=raiseOnError@entry=true) at metadata/metadata_sync.c:266
#5  0x00007f9eb45d3722 in StartMetadataSyncToNode (nodeNameString=0x559e1f209640 "localhost", nodePort=nodePort@entry=57637) at metadata/metadata_sync.c:161
#6  0x00007f9eb45d3767 in start_metadata_sync_to_node (fcinfo=<optimized out>) at metadata/metadata_sync.c:97
#7  0x0000559e1ccfe45d in ExecInterpExpr (state=0x559e1f19ba90, econtext=0x559e1f19b790, isnull=0x7ffe1eed9437) at execExprInterp.c:699
#8  0x0000559e1ccfb13e in ExecInterpExprStillValid (state=0x559e1f19ba90, econtext=0x559e1f19b790, isNull=0x7ffe1eed9437) at execExprInterp.c:1802
#9  0x0000559e1cd378be in ExecEvalExprSwitchContext (isNull=0x7ffe1eed9437, econtext=0x559e1f19b790, state=0x559e1f19ba90) at ../../../src/include/executor/executor.h:316
#10 ExecProject (projInfo=0x559e1f19ba88) at ../../../src/include/executor/executor.h:350
#11 ExecResult (pstate=<optimized out>) at nodeResult.c:136
#12 0x0000559e1cd0b6f9 in ExecProcNodeFirst (node=0x559e1f19b678) at execProcnode.c:450
#13 0x0000559e1cd034c7 in ExecProcNode (node=0x559e1f19b678) at ../../../src/include/executor/executor.h:248
#14 ExecutePlan (estate=estate@entry=0x559e1f19b440, planstate=0x559e1f19b678, use_parallel_mode=<optimized out>, operation=operation@entry=CMD_SELECT, 
    sendTuples=sendTuples@entry=true, numberTuples=numberTuples@entry=0, direction=ForwardScanDirection, dest=0x559e1f258f48, execute_once=true) at execMain.c:1646
#15 0x0000559e1cd041c9 in standard_ExecutorRun (queryDesc=queryDesc@entry=0x559e1f14aeb0, direction=direction@entry=ForwardScanDirection, count=count@entry=0, 
    execute_once=execute_once@entry=true) at execMain.c:364
#16 0x00007f9eb45c87d9 in CitusExecutorRun (queryDesc=0x559e1f14aeb0, direction=ForwardScanDirection, count=0, execute_once=<optimized out>) at executor/multi_executor.c:217
#17 0x00007f9eb379968b in pgss_ExecutorRun (queryDesc=0x559e1f14aeb0, direction=ForwardScanDirection, count=0, execute_once=<optimized out>) at pg_stat_statements.c:1043
#18 0x0000559e1cd0428c in ExecutorRun (queryDesc=queryDesc@entry=0x559e1f14aeb0, direction=direction@entry=ForwardScanDirection, count=count@entry=0, 
    execute_once=<optimized out>) at execMain.c:306
#19 0x0000559e1ceae72f in PortalRunSelect (portal=portal@entry=0x559e1f05f7f0, forward=forward@entry=true, count=0, count@entry=9223372036854775807, 
    dest=dest@entry=0x559e1f258f48) at pquery.c:912
#20 0x0000559e1ceaff05 in PortalRun (portal=portal@entry=0x559e1f05f7f0, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true, run_once=run_once@entry=true, 
    dest=dest@entry=0x559e1f258f48, altdest=altdest@entry=0x559e1f258f48, qc=0x7ffe1eed99d0) at pquery.c:756
#21 0x0000559e1ceac00b in exec_simple_query (query_string=query_string@entry=0x559e1ef6fe60 "SELECT start_metadata_sync_to_node('localhost', 57637);") at postgres.c:1239
#22 0x0000559e1ceae017 in PostgresMain (argc=<optimized out>, argv=argv@entry=0x559e1f0180c8, dbname=<optimized out>, username=<optimized out>) at postgres.c:4315
#23 0x0000559e1ce176d2 in BackendRun (port=port@entry=0x559e1f00e120) at postmaster.c:4536
#24 0x0000559e1ce1aa01 in BackendStartup (port=port@entry=0x559e1f00e120) at postmaster.c:4220
#25 0x0000559e1ce1ac4a in ServerLoop () at postmaster.c:1739
#26 0x0000559e1ce1c189 in PostmasterMain (argc=<optimized out>, argv=<optimized out>) at postmaster.c:1412
#27 0x0000559e1cd614fb in main (argc=5, argv=0x559e1ef68d90) at main.c:210

marcocitus changed the title from "Random failures in start_metadata_sync_to_node in regression test" to "start_metadata_sync_to_node concurrently with DROP TABLE may error" on Nov 30, 2020
marcocitus added the mx label May 12, 2021
@onderkalaci

I think nowadays this has become a deadlock:

ERROR:  deadlock detected
DETAIL:  Process 5011 waits for ShareLock on relation 16458 of database 13236; blocked by process 5022.
Process 5022 waits for AccessShareLock on relation 16937 of database 13236; blocked by process 5011.
HINT:  See server log for query details.
CONTEXT:  SQL statement "SELECT master_remove_distributed_table_metadata_from_workers(v_obj.objid, v_obj.schema_name, v_obj.object_name)"
PL/pgSQL function citus_drop_trigger() line 15 at PERFORM
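The HINT points at the server log; as a side note, while such a deadlock is building up, the ungranted relation locks of both backends can be inspected from a third session via the standard pg_locks view (plain PostgreSQL, nothing Citus-specific):

-- which backends are waiting for relation locks, and on which relations
SELECT pid, mode, relation::regclass AS relation
FROM pg_locks
WHERE locktype = 'relation' AND NOT granted;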

@onderkalaci

This was fixed via #5730, with an isolation test.
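For context, concurrency scenarios like this are exercised with the PostgreSQL isolation tester. A purely illustrative sketch of such a spec, not the actual test added in #5730 (the table name, port, and step names are made up):

setup
{
	CREATE TABLE dist_table (id int PRIMARY KEY);
	SELECT create_distributed_table('dist_table', 'id');
}

teardown
{
	DROP TABLE IF EXISTS dist_table;
}

session "s1"
step "s1-start-metadata-sync" { SELECT start_metadata_sync_to_node('localhost', 57637); }

session "s2"
step "s2-begin"  { BEGIN; }
step "s2-drop"   { DROP TABLE dist_table; }
step "s2-commit" { COMMIT; }

# run the DROP and the metadata sync concurrently
permutation "s2-begin" "s2-drop" "s1-start-metadata-sync" "s2-commit"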
