
Evaluate restricting router planner to a particular database model #692

Closed
marcocitus opened this issue Aug 1, 2016 · 8 comments · Fixed by #6793

Comments


marcocitus commented Aug 1, 2016

Currently the router planner can plan practically anything that can be executed by changing table names into shard names, provided that the shards are in the same physical location. This is very powerful, but it has significant usability implications.

First-time users often start with 2 worker nodes and a replication factor of 2. In this case, the router planner can run practically any SQL query that hits a single shard in each table. When users start expanding their cluster, moving shards, or changing filter values, this may no longer be the case, potentially causing an outage or forcing the user to redesign their application.

Evaluate restricting the router planner to a particular data model by default such that it only permits queries on shards that are guaranteed to be plannable given the database model (e.g. ensure that all tables in the query are co-located and have the same distribution column filter). This would prevent queries from breaking when the physical layout of shards or the specific values of filters change. We can keep the current, unrestricted behaviour as an option, as we do with subquery pushdown.
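
As a rough illustration of the restricted model (a sketch with hypothetical table names, not taken from the examples below): a query would be accepted only when all distributed tables in it are co-located and share the same distribution column filter, e.g.:

-- hypothetical: orders and order_items are hash-distributed on customer_id
-- and co-located; both carry the same customer_id filter, so the query
-- always targets a single pair of co-located shards and stays
-- router-plannable regardless of where those shards are placed.
SELECT *
FROM orders JOIN order_items USING (customer_id, order_id)
WHERE customer_id = 42;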

An alternative approach could be to add some logic to the workers to always allow router queries. For example, if each shard that is not physically located on a worker had a postgres_fdw instead, then we can effectively remove the restriction that shards need to be on the same worker. This would have more general benefits, such as supporting broadcast joins, but comes with its own usability and performance implications.

Some examples of confusing behaviour:

The following query requires the task-tracker executor when one filter value changes, because we're joining tables that are not always co-located:

postgres=# SELECT * FROM review JOIN customer USING (customer_id) WHERE product_id = 1 AND customer_id = 1;
 customer_id | product_id | score | name 
-------------+------------+-------+------
(0 rows)

postgres=# SELECT * FROM review JOIN customer USING (customer_id) WHERE product_id = 1 AND customer_id = 2;
ERROR:  cannot use real time executor with repartition jobs
HINT:  Set citus.task_executor_type to "task-tracker".
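
For completeness, the workaround the HINT points at looks like this; it only switches executors for the session and does not make the plan independent of shard placement:

postgres=# SET citus.task_executor_type TO 'task-tracker';
postgres=# SELECT * FROM review JOIN customer USING (customer_id) WHERE product_id = 1 AND customer_id = 2;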

Union is simultaneously supported and "unsupported" (we should revise our error messages):

postgres=# SELECT * FROM customer WHERE customer_id = 1 UNION SELECT * FROM customer WHERE customer_id = 5;
 customer_id | name 
-------------+------
(0 rows)

postgres=# SELECT * FROM customer WHERE customer_id = 1 UNION SELECT * FROM customer WHERE customer_id = 4;
ERROR:  cannot perform distributed planning on this query
DETAIL:  Union, Intersect, or Except are currently unsupported

Running the shard rebalancer might cause the same query to suddenly become "unsupported":

postgres=# SELECT * FROM customer WHERE customer_id = 1 UNION SELECT * FROM customer WHERE customer_id = 7;
 customer_id | name
-------------+------
(0 rows)
postgres=# SELECT rebalance_table_shards('customer');
NOTICE:  Moving shard 102032 from localhost:9701 to localhost:9700 ...
 rebalance_table_shards 
------------------------

(1 row)

postgres=# SELECT * FROM customer WHERE customer_id = 1 UNION SELECT * FROM customer WHERE customer_id = 7;
ERROR:  cannot perform distributed planning on this query
DETAIL:  Union, Intersect, or Except are currently unsupported

anarazel commented Aug 1, 2016 via email

marcocitus commented Aug 1, 2016

An example would be if all tables have a tenant_id column by which the tables are partitioned. We could then require all CTEs, subqueries, etc. to have the same restriction on tenant_id.
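
For example (a sketch with made-up table names), a query in which the CTE, the join, and the outer filters all restrict tenant_id to the same value would be allowed:

-- hypothetical tenant_id-distributed, co-located tables; every part of
-- the query filters on tenant_id = 17, so all referenced shards are
-- guaranteed to be co-located under the proposed model.
WITH recent_events AS (
    SELECT * FROM events WHERE tenant_id = 17
)
SELECT u.user_id, count(*)
FROM recent_events e JOIN users u USING (tenant_id, user_id)
WHERE u.tenant_id = 17
GROUP BY u.user_id;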


anarazel commented Aug 1, 2016

On 2016-08-01 11:23:48 -0700, Marco Slot wrote:

An example would be if all tables have a tenant_id column by which the tables are partitioned. We could then require all CTEs, subqueries, etc. to have the same restriction on tenant_id.

That'll prevent "manually broadcasted"/reference type/1-shard tables from being involved in a router query.


marcocitus commented Aug 1, 2016

That'll prevent "manually broadcasted"/reference type/1-shard tables from being involved in a router query.

There's a question of whether the default database model should allow such tables (e.g., Citus Cloud doesn't). In any case, it seems this decision can be made per table, and we don't need to require a filter on single-shard tables. We may need to make replicated single-shard tables and co-located tables explicit in the metadata to implement these restrictions properly.
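
A sketch of how that could look (hypothetical tables; this assumes the metadata marks countries as a replicated single-shard table, along the lines of what reference tables later provided): the single-shard table would not need any distribution column filter, while the distributed table still would:

-- hypothetical: orders is distributed on customer_id; countries is a
-- replicated single-shard table present on every node, so no filter is
-- required on it for the query to remain router-plannable.
SELECT o.order_id, c.name
FROM orders o JOIN countries c ON o.country_code = c.code
WHERE o.customer_id = 42;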


metdos commented Aug 5, 2016

It is hard to understand what is and is not supported.

For example, these queries work:

SELECT l_orderkey FROM lineitem where l_orderkey = 2 or l_orderkey = 32

SELECT l_orderkey FROM lineitem WHERE l_orderkey in 
(SELECT l_orderkey FROM lineitem where l_orderkey = 2);

But this one fails:

SELECT l_orderkey FROM lineitem WHERE l_orderkey in
 (SELECT l_orderkey FROM lineitem where l_orderkey = 2 or l_orderkey = 32);

Also, the error message is not very meaningful:

ERROR:  cannot perform distributed planning on this query
DETAIL:  Join types other than inner/outer joins are currently unsupported


mtuncer commented Aug 16, 2016

Having a join-on-the-partition-column requirement looks very restrictive and is difficult to check when you consider all the kinds of queries we want to support.


mtuncer commented Aug 16, 2016

We discussed this with @ozgune, @anarazel, and @marcocitus; here are the notes.

@mtuncer stated there are two outstanding usability issues:
1. The user runs a query successfully, then changes something in the system, and the query is no longer supported. One example is the shard rebalancer changing shard locations.

2. The user runs a query, makes a slight change in the filter, and the query is no longer supported; the error message does not give any hint about what has happened. @metdos's last comment describes this.

@anarazel said he is for better error/warning messages instead of changing (reducing) query coverage.

@marcocitus suggested an alternative: run a heuristic to determine whether a query could still be supported despite an environmental change.

We can use the co-location property of the pruned-out shards to determine whether we can reliably support the query. If not, we can display a warning message along the lines of "you got lucky this time; this query might not be supported next time."
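
As a rough illustration of the kind of check such a heuristic would perform (the shard IDs below are placeholders), one can already verify manually whether two pruned shards still share a placement by consulting the metadata:

-- placeholder shard IDs: the query is only router-plannable if a row
-- comes back, i.e. the two pruned shards still have a placement on the
-- same worker node.
SELECT a.nodename, a.nodeport
FROM pg_dist_shard_placement a
JOIN pg_dist_shard_placement b USING (nodename, nodeport)
WHERE a.shardid = 102032 AND b.shardid = 102033;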

Next Steps:

  • outline the heuristic algorithms
  • list heuristic coverage, stating what is and is not covered.
  • provide high level design and time estimates.

@samay-sharma, @begriffs, @metdos, please feel free to join the discussion here. We will hold another session when all parties are present.


mtuncer commented Sep 6, 2016

@marcocitus you were going to outline the heuristic algorithm, if I recall correctly. Any updates on that?

onurctirtir added a commit that referenced this issue Mar 27, 2023
Soon I will be making some changes related to #692 in the router planner, and those changes require updating ~5/6 tests related to router planning. To make those test files runnable by run_test.py multiple times, we also need to make some other tests (which they run in parallel with, or badly depend on) ready for run_test.py.
onurctirtir added a commit that referenced this issue Mar 28, 2023
…ed tables via router planner (#6793)

Today we allow planning queries that reference non-colocated tables if the shards that the query targets are placed on the same node. However, this may not remain the case, e.g., after rebalancing shards, because those shards are no longer guaranteed to be on the same node. This commit adds the citus.enable_non_colocated_router_query_pushdown GUC, which can be used to disallow planning such queries via the router planner when it is set to false. Note that the default value for this GUC will be "true" for 11.3, but we will change it to "false" in 12.0 so as not to introduce a breaking change in a minor release.

Closes #692.

Moreover, allowing such queries to go through the router planner also causes an incorrect plan to be generated for DML queries that reference distributed tables sharded with different replication factor settings. For this reason, #6779 can be closed after the default value for this GUC is changed to "false", hence not now.

DESCRIPTION: Adds the `citus.enable_non_colocated_router_query_pushdown` GUC to ensure a consistent distributed plan is generated for queries that reference non-colocated distributed tables (when set to "false"; the default is "true").