
Evaluate restricting router planner to a particular database model #692

Closed
marcocitus opened this issue Aug 1, 2016 · 8 comments · Fixed by #6793

Comments


marcocitus commented Aug 1, 2016

Currently the router planner can plan practically anything that can be executed by changing table names into shard names, provided that the shards are in the same physical location. This is very powerful, but it has significant usability implications.

First-time users often start with 2 worker nodes and a replication factor of 2. In this case, the router planner can run practically any SQL query that hits a single shard in each table. When users start expanding their cluster, moving shards, or changing filter values, this may no longer be the case, potentially causing an outage or forcing the user to redesign their application.

Evaluate restricting the router planner to a particular data model by default such that it only permits queries on shards that are guaranteed to be plannable given the database model (e.g. ensure that all tables in the query are co-located and have the same distribution column filter). This would prevent queries from breaking when the physical layout of shards or the specific values of filters change. We can keep the current, unrestricted behaviour as an option, as we do with subquery pushdown.
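
As a rough illustration of the restricted model (a sketch with hypothetical table names, not taken from the examples below): a query would be accepted only when all distributed tables in it are co-located and share the same distribution column filter, e.g.:

-- hypothetical: orders and order_items are hash-distributed on customer_id
-- and co-located; both carry the same customer_id filter, so the query
-- always targets a single pair of co-located shards and stays
-- router-plannable regardless of where those shards are placed.
SELECT *
FROM orders JOIN order_items USING (customer_id, order_id)
WHERE customer_id = 42;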

An alternative approach could be to add some logic to the workers to always allow router queries. For example, if each shard that is not physically located on a worker had a postgres_fdw instead, then we can effectively remove the restriction that shards need to be on the same worker. This would have more general benefits, such as supporting broadcast joins, but comes with its own usability and performance implications.

Some examples of confusing behaviour:

The following query requires the task-tracker executor when one filter value changes, because we're joining tables that are not always co-located:

postgres=# SELECT * FROM review JOIN customer USING (customer_id) WHERE product_id = 1 AND customer_id = 1;
 customer_id | product_id | score | name 
-------------+------------+-------+------
(0 rows)

postgres=# SELECT * FROM review JOIN customer USING (customer_id) WHERE product_id = 1 AND customer_id = 2;
ERROR:  cannot use real time executor with repartition jobs
HINT:  Set citus.task_executor_type to "task-tracker".
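
For completeness, the workaround the HINT points at looks like this; it only switches executors for the session and does not make the plan independent of shard placement:

postgres=# SET citus.task_executor_type TO 'task-tracker';
postgres=# SELECT * FROM review JOIN customer USING (customer_id) WHERE product_id = 1 AND customer_id = 2;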

Union is simultaneously supported and "unsupported" (we should revise our error messages):

postgres=# SELECT * FROM customer WHERE customer_id = 1 UNION SELECT * FROM customer WHERE customer_id = 5;
 customer_id | name 
-------------+------
(0 rows)

postgres=# SELECT * FROM customer WHERE customer_id = 1 UNION SELECT * FROM customer WHERE customer_id = 4;
ERROR:  cannot perform distributed planning on this query
DETAIL:  Union, Intersect, or Except are currently unsupported

Running the shard rebalancer might cause the same query to suddenly become "unsupported":

postgres=# SELECT * FROM customer WHERE customer_id = 1 UNION SELECT * FROM customer WHERE customer_id = 7;
 customer_id | name
-------------+------
(0 rows)
postgres=# SELECT rebalance_table_shards('customer');
NOTICE:  Moving shard 102032 from localhost:9701 to localhost:9700 ...
 rebalance_table_shards 
------------------------

(1 row)

postgres=# SELECT * FROM customer WHERE customer_id = 1 UNION SELECT * FROM customer WHERE customer_id = 7;
ERROR:  cannot perform distributed planning on this query
DETAIL:  Union, Intersect, or Except are currently unsupported

anarazel commented Aug 1, 2016 via email

marcocitus commented Aug 1, 2016

An example would be if all tables have a tenant_id column by which the tables are partitioned. We could then require all CTEs, subqueries, etc. to have the same restriction on tenant_id.
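
For example (a sketch with made-up table names), a query in which the CTE, the join, and the outer filters all restrict tenant_id to the same value would be allowed:

-- hypothetical tenant_id-distributed, co-located tables; every part of
-- the query filters on tenant_id = 17, so all referenced shards are
-- guaranteed to be co-located under the proposed model.
WITH recent_events AS (
    SELECT * FROM events WHERE tenant_id = 17
)
SELECT u.user_id, count(*)
FROM recent_events e JOIN users u USING (tenant_id, user_id)
WHERE u.tenant_id = 17
GROUP BY u.user_id;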


anarazel commented Aug 1, 2016

On 2016-08-01 11:23:48 -0700, Marco Slot wrote:

An example would be if all tables have a tenant_id column by which the tables are partitioned. We could then require all CTEs, subqueries, etc. to have the same restriction on tenant_id.

That'll prevent "manually broadcasted"/reference type/1-shard tables from being involved in a router query.


marcocitus commented Aug 1, 2016

That'll prevent "manually broadcasted"/reference type/1-shard tables from being involved in a router query.

There's a question of whether the default database model should allow such tables (e.g., Citus Cloud doesn't). In any case, it seems this decision can be made per table, and we don't need to require a filter on single-shard tables. We may need to make replicated single-shard tables and co-located tables explicit in the metadata to implement these restrictions properly.
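
A sketch of how that could look (hypothetical tables; this assumes the metadata marks countries as a replicated single-shard table, along the lines of what reference tables later provided): the single-shard table would not need any distribution column filter, while the distributed table still would:

-- hypothetical: orders is distributed on customer_id; countries is a
-- replicated single-shard table present on every node, so no filter is
-- required on it for the query to remain router-plannable.
SELECT o.order_id, c.name
FROM orders o JOIN countries c ON o.country_code = c.code
WHERE o.customer_id = 42;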


metdos commented Aug 5, 2016

It is hard to understand what is and is not supported.

For example, these queries work:

SELECT l_orderkey FROM lineitem where l_orderkey = 2 or l_orderkey = 32

SELECT l_orderkey FROM lineitem WHERE l_orderkey in 
(SELECT l_orderkey FROM lineitem where l_orderkey = 2);

But this one fails:

SELECT l_orderkey FROM lineitem WHERE l_orderkey in
 (SELECT l_orderkey FROM lineitem where l_orderkey = 2 or l_orderkey = 32);

Also, the error message is not very meaningful:

ERROR:  cannot perform distributed planning on this query
DETAIL:  Join types other than inner/outer joins are currently unsupported


mtuncer commented Aug 16, 2016

Having a join-on-the-partition-column requirement looks very restrictive and is difficult to check when you consider all the kinds of queries we want to support.


mtuncer commented Aug 16, 2016

We discussed this with @ozgune, @anarazel, and @marcocitus; here are the notes.

@mtuncer stated there are two outstanding usability issues:
1. The user runs a query successfully, then changes something in the system, and the query is no longer supported. One example is the shard rebalancer changing shard locations.

2. The user runs a query, makes a slight change in the filter, and the query is no longer supported; the error message does not give any hint about what has happened. @metdos's last comment describes this.

@anarazel said he is for better error/warning messages instead of changing (reducing) query coverage.

@marcocitus suggested an alternative: run a heuristic to determine whether a query could still be supported despite an environmental change.

We can use the co-location property of the pruned-out shards to determine whether we can reliably support the query. If not, we can display a warning message along the lines of "you got lucky this time; this query might not be supported next time."
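
As a rough illustration of the kind of check such a heuristic would perform (the shard IDs below are placeholders), one can already verify manually whether two pruned shards still share a placement by consulting the metadata:

-- placeholder shard IDs: the query is only router-plannable if a row
-- comes back, i.e. the two pruned shards still have a placement on the
-- same worker node.
SELECT a.nodename, a.nodeport
FROM pg_dist_shard_placement a
JOIN pg_dist_shard_placement b USING (nodename, nodeport)
WHERE a.shardid = 102032 AND b.shardid = 102033;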

Next Steps:

  • outline the heuristic algorithms
  • list heuristic coverage, stating what is and is not covered.
  • provide high level design and time estimates.

@samay-sharma, @begriffs, @metdos, please feel free to join the discussion here. We will hold another session when all parties are present.


mtuncer commented Sep 6, 2016

@marcocitus you were going to outline the heuristic algorithm, if I recall correctly. Any updates on that?

onurctirtir added a commit that referenced this issue Mar 27, 2023
Soon I will be making some changes related to #692 in the router planner, and those changes require updating ~5/6 tests related to router planning. To make those test files runnable by run_test.py multiple times, we also need to make some other tests (which they run in parallel with, or badly depend on) ready for run_test.py.
onurctirtir added a commit that referenced this issue Mar 28, 2023
…ed tables via router planner (#6793)

Today we allow planning queries that reference non-colocated tables if the shards that the query targets are placed on the same node. However, this may not remain the case, e.g., after rebalancing shards, because those shards are no longer guaranteed to be on the same node. This commit adds the citus.enable_non_colocated_router_query_pushdown GUC, which can be used to disallow planning such queries via the router planner when it is set to false. Note that the default value for this GUC will be "true" for 11.3, but we will change it to "false" in 12.0 so as not to introduce a breaking change in a minor release.

Closes #692.

Moreover, allowing such queries to go through the router planner also causes an incorrect plan to be generated for DML queries that reference distributed tables sharded with different replication factor settings. For this reason, #6779 can be closed after the default value for this GUC is changed to "false", hence not now.

DESCRIPTION: Adds the `citus.enable_non_colocated_router_query_pushdown` GUC to ensure a consistent distributed plan is generated for queries that reference non-colocated distributed tables (when set to "false"; the default is "true").