Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wrong results of Left Anti Semi (Not-In) Join in nestloop and hashjoin #50

Closed
my-ship-it opened this issue Jul 24, 2023 · 3 comments · Fixed by #130
Closed

Wrong results of Left Anti Semi (Not-In) Join in nestloop and hashjoin #50

my-ship-it opened this issue Jul 24, 2023 · 3 comments · Fixed by #130
Assignees

Comments

@my-ship-it
Copy link
Contributor

--
-- Test left anti semi (not-in) join
-- Join on the distribution key of both sides.
--
begin;
\pset null '<NULL>'
create table t1_lasj(c1 int) distributed by (c1);
create table t2_lasj_has_null(c1n int) distributed by (c1n);
insert into t1_lasj values (generate_series (1,10));
insert into t2_lasj_has_null values (1), (2), (3), (null), (5), (6), (7);
analyze t1_lasj;
analyze t2_lasj_has_null;
select c1n from t2_lasj_has_null where c1n is null or c1n > 0;
  c1n   
--------
      2
      3
 <NULL>
      7
      1
      5
      6
(7 rows)

-- Hash left anti semi (not-in) join
explain(costs off) select c1 from t1_lasj where c1 not in (select c1n from t2_lasj_has_null where c1n is null or c1n > 0) and c1 is not null;
                        QUERY PLAN                        
----------------------------------------------------------
 Gather Motion 3:1  (slice1; segments: 3)
   ->  Hash Left Anti Semi (Not-In) Join
         Hash Cond: (t1_lasj.c1 = t2_lasj_has_null.c1n)
         ->  Seq Scan on t1_lasj
               Filter: (c1 IS NOT NULL)
         ->  Hash
               ->  Seq Scan on t2_lasj_has_null
                     Filter: ((c1n IS NULL) OR (c1n > 0))
 Optimizer: Postgres query optimizer
(9 rows)

select c1 from t1_lasj where c1 not in (select c1n from t2_lasj_has_null where c1n is null or c1n > 0) and c1 is not null;
 c1 
----
  9
 10
(2 rows)

set local enable_hashjoin = off;
set local enable_nestloop = on;
-- Nested loop left anti semi (not-in) join
explain(costs off) select c1 from t1_lasj where c1 not in (select c1n from t2_lasj_has_null where c1n is null or c1n > 0) and c1 is not null;
                        QUERY PLAN                        
----------------------------------------------------------
 Gather Motion 3:1  (slice1; segments: 3)
   ->  Nested Loop Left Anti Semi (Not-In) Join
         Join Filter: (t1_lasj.c1 = t2_lasj_has_null.c1n)
         ->  Seq Scan on t1_lasj
               Filter: (c1 IS NOT NULL)
         ->  Materialize
               ->  Seq Scan on t2_lasj_has_null
                     Filter: ((c1n IS NULL) OR (c1n > 0))
 Optimizer: Postgres query optimizer
(9 rows)

select c1 from t1_lasj where c1 not in (select c1n from t2_lasj_has_null where c1n is null or c1n > 0) and c1 is not null;
 c1 
----
  9
 10
(2 rows)

reset enable_hashjoin;
reset enable_nestloop;
\pset null ''
abort;

ORCA correct results:

--
-- Test left anti semi (not-in) join
-- Join on the distribution key of both sides.
--
begin;
\pset null '<NULL>'
create table t1_lasj(c1 int) distributed by (c1);
create table t2_lasj_has_null(c1n int) distributed by (c1n);
insert into t1_lasj values (generate_series (1,10));
insert into t2_lasj_has_null values (1), (2), (3), (null), (5), (6), (7);
analyze t1_lasj;
analyze t2_lasj_has_null;
select c1n from t2_lasj_has_null where c1n is null or c1n > 0;
  c1n   
--------
      2
      3
 <NULL>
      7
      5
      6
      1
(7 rows)

-- Hash left anti semi (not-in) join
explain(costs off) select c1 from t1_lasj where c1 not in (select c1n from t2_lasj_has_null where c1n is null or c1n > 0) and c1 is not null;
                           QUERY PLAN                           
----------------------------------------------------------------
 Gather Motion 3:1  (slice1; segments: 3)
   ->  Hash Left Anti Semi (Not-In) Join
         Hash Cond: (t1_lasj.c1 = t2_lasj_has_null.c1n)
         ->  Seq Scan on t1_lasj
               Filter: (NOT (c1 IS NULL))
         ->  Hash
               ->  Broadcast Motion 3:3  (slice2; segments: 3)
                     ->  Seq Scan on t2_lasj_has_null
                           Filter: ((c1n IS NULL) OR (c1n > 0))
 Optimizer: Pivotal Optimizer (GPORCA)
(10 rows)

select c1 from t1_lasj where c1 not in (select c1n from t2_lasj_has_null where c1n is null or c1n > 0) and c1 is not null;
 c1 
----
(0 rows)

set local enable_hashjoin = off;
set local enable_nestloop = on;
-- Nested loop left anti semi (not-in) join
explain(costs off) select c1 from t1_lasj where c1 not in (select c1n from t2_lasj_has_null where c1n is null or c1n > 0) and c1 is not null;
                           QUERY PLAN                           
----------------------------------------------------------------
 Gather Motion 3:1  (slice1; segments: 3)
   ->  Hash Left Anti Semi (Not-In) Join
         Hash Cond: (t1_lasj.c1 = t2_lasj_has_null.c1n)
         ->  Seq Scan on t1_lasj
               Filter: (NOT (c1 IS NULL))
         ->  Hash
               ->  Broadcast Motion 3:3  (slice2; segments: 3)
                     ->  Seq Scan on t2_lasj_has_null
                           Filter: ((c1n IS NULL) OR (c1n > 0))
 Optimizer: Pivotal Optimizer (GPORCA)
(10 rows)

select c1 from t1_lasj where c1 not in (select c1n from t2_lasj_has_null where c1n is null or c1n > 0) and c1 is not null;
 c1 
----
(0 rows)

reset enable_hashjoin;
reset enable_nestloop;
\pset null ''
abort;

See GPDB: https://github.com/greenplum-db/gpdb/issues/15662.

@lss602726449
Copy link
Contributor

lss602726449 commented Aug 9, 2023

the problem can reproduct easily.
I find that in postgres optimizer with hashjoin, this interesting thing will happen.(the first wrong without Broadcast)

postgres=# explain select c1 from t1_lasj where c1 not in (select c1n from t2_lasj_has_null where c1n is null or c1n > 0) and c1 is not null;
                                    QUERY PLAN
----------------------------------------------------------------------------------
 Gather Motion 3:1  (slice1; segments: 3)  (cost=1.05..2.16 rows=3 width=4)
   ->  Hash Left Anti Semi (Not-In) Join  (cost=1.05..2.11 rows=1 width=4)
         Hash Cond: (t1_lasj.c1 = t2_lasj_has_null.c1n)
         ->  Seq Scan on t1_lasj  (cost=0.00..1.03 rows=3 width=4)
               Filter: (c1 IS NOT NULL)
         ->  Hash  (cost=1.03..1.03 rows=2 width=4)
               ->  Seq Scan on t2_lasj_has_null  (cost=0.00..1.03 rows=2 width=4)
                     Filter: ((c1n IS NULL) OR (c1n > 0))
 Optimizer: Postgres query optimizer
(9 rows)
postgres=# select c1 from t1_lasj where c1 not in (select c1n from t2_lasj_has_null where c1n is null or c1n > 0) and c1 is not null;  --hashnode.hashkeys_null =false;
 c1 
----
  9
 10
(2 rows)
postgres=# explain select c1 from t1_lasj where c1 not in (select c1n from t2_lasj_has_null) and c1 is not null;
                                           QUERY PLAN

-------------------------------------------------------------------------------------------
------
 Gather Motion 3:1  (slice1; segments: 3)  (cost=1.20..2.31 rows=3 width=4)
   ->  Hash Left Anti Semi (Not-In) Join  (cost=1.20..2.26 rows=1 width=4)
         Hash Cond: (t1_lasj.c1 = t2_lasj_has_null.c1n)
         ->  Seq Scan on t1_lasj  (cost=0.00..1.03 rows=3 width=4)
               Filter: (c1 IS NOT NULL)
         ->  Hash  (cost=1.12..1.12 rows=7 width=4)
               ->  Broadcast Motion 3:3  (slice2; segments: 3)  (cost=0.00..1.12 rows=7 wid
th=4)
                     ->  Seq Scan on t2_lasj_has_null  (cost=0.00..1.02 rows=2 width=4)
 Optimizer: Postgres query optimizer
(9 rows)
postgres=# select c1 from t1_lasj where c1 not in (select c1n from t2_lasj_has_null) and c1 is not null;--hashnode.hashkeys_null = true ;
 c1 
----
(0 rows)

@avamingli
Copy link
Contributor

This issue has been resolved, I have a pr for GPDB https://github.com/greenplum-db/gpdb/pull/15663 wait for comments.

@avamingli
Copy link
Contributor

Pick it back to CBDB in #130

yjhjstz pushed a commit to yjhjstz/cloudberry that referenced this issue Dec 27, 2024
Gpload regression case apache#50 would fail,
if OS user that installed gpdb cluster is not gpadmin.
Since it specified a hardcode user gpadmin,
we need to use PGUSER instead to fix it.

Co-authored-by: wuchengwen <wcw190496@alibaba-inc.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants