Support ALTER TABLE SET DISTRIBUTED BY for external tables #818

Merged on Dec 30, 2024 (11 commits)

avamingli
Contributor

Fixes #ISSUE_Number

What does this PR do?

Type of Change

  • Bug fix (non-breaking change)
  • New feature (non-breaking change)
  • Breaking change (fix or feature with breaking changes)
  • Documentation update

Breaking Changes

Test Plan

  • Unit tests added/updated
  • Integration tests added/updated
  • Passed make installcheck
  • Passed make -C src/test installcheck-cbdb-parallel

Impact

Performance:

User-facing changes:

Dependencies:

Checklist

Additional Context

CI Skip Instructions


@my-ship-it added the cherry-pick label (cherry-pick upstream commits) on Dec 26, 2024
@my-ship-it previously approved these changes on Dec 26, 2024
@gfphoenix78 previously approved these changes on Dec 27, 2024
SmartKeyerror and others added 10 commits on December 27, 2024 20:42

Unlike upstream PostgreSQL, Greenplum tries to pull up a correlated EXISTS sublink that
contains an aggregate, for example:

```
postgres=# create table t(a int, b int);
postgres=# create table s (a int, b int);
postgres=# explain (costs off)
select * from t where exists (select sum(s.a) from s where s.a = t.a group by s.a);
                QUERY PLAN
------------------------------------------
 Gather Motion 3:1  (slice1; segments: 3)
   ->  Hash Join
         Hash Cond: (t.a = s.a)
         ->  Seq Scan on t
         ->  Hash
               ->  HashAggregate
                     Group Key: s.a
                     ->  Seq Scan on s
 Optimizer: Postgres query optimizer
(9 rows)
```

To make this possible, Greenplum changed the behavior of `convert_EXISTS_sublink_to_join()` and
`simplify_EXISTS_query()`, so that `simplify_EXISTS_query()` also simplifies an EXISTS sublink
that contains an aggregate node. It resets `targetList` and other clauses to empty, like this:

```
query->targetList = NIL;
query->distinctClause = NIL;
query->sortClause = NIL;
query->hasDistinctOn = false;
```

But when the EXISTS sublink is uncorrelated and also contains an aggregate node, we can't
reset `targetList`. Otherwise, we cannot build a plan for an aggregate node that has no
aggregate column, as in #11849.

This patch fixes the issue by **not** simplifying an uncorrelated EXISTS sublink in some
situations.
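
The guard described above can be sketched in isolation. This is a minimal model, not the planner code: `SketchQuery` and `sketch_simplify_exists()` are hypothetical names, list fields are reduced to element counts, and the exact condition (aggregate present and subquery uncorrelated) is an assumption drawn from this commit message, which only says "in some situations".

```c
#include <stdbool.h>

/* Hypothetical, heavily reduced stand-in for the planner's Query node.
 * List fields are modelled as plain element counts. */
typedef struct SketchQuery
{
	bool hasAggs;           /* subquery contains an aggregate */
	bool correlated;        /* subquery references the outer query */
	bool hasDistinctOn;
	int  targetListLen;     /* stand-in for List *targetList */
	int  distinctClauseLen; /* stand-in for List *distinctClause */
	int  sortClauseLen;     /* stand-in for List *sortClause */
} SketchQuery;

/* Returns true if the EXISTS subquery was simplified, false if left intact. */
static bool
sketch_simplify_exists(SketchQuery *query)
{
	/* Assumed guard: an uncorrelated subquery with an aggregate keeps its
	 * target list, so the aggregate node still has a column to compute. */
	if (query->hasAggs && !query->correlated)
		return false;

	query->targetListLen = 0;     /* models query->targetList = NIL */
	query->distinctClauseLen = 0; /* models query->distinctClause = NIL */
	query->sortClauseLen = 0;     /* models query->sortClause = NIL */
	query->hasDistinctOn = false;
	return true;
}
```

In this model a correlated subquery with an aggregate is still simplified, which matches the pull-up behavior shown in the plan above, while the uncorrelated case is returned unchanged.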

Access methods created with internal handlers have no entries in `pg_depend`.
That is reasonable, since internal handlers cannot be dropped, but
`gpcheckcat` complains about it.

This commit excludes that case and fixes the use of the deprecated `getName()`.
It could also serve as a good gpcheckcat test case for access methods with
internal handlers.
… remove macro IC_PROXY_LOG_LEVEL

Currently `ic_proxy_log(LOG, ...)` never logs anything, because `IC_PROXY_LOG_LEVEL`
is hard-coded to WARNING. This should be under a GUC's control. This commit
puts the ic-proxy log level under the control of the `gp_log_interconnect` GUC and
removes the `IC_PROXY_LOG_LEVEL` macro.

This test case has been flaky for a while. While it is being worked
on and discussed, it either leads developers to ignore red tests
or blocks other work from going in. For now, add a FIXME and disable the
test case for Orca.

Discussion: https://groups.google.com/a/greenplum.org/g/gpdb-dev/c/u3-D7isdvmM

If a DynamicBitmapScan was under a nested loop join (NLJ), memory usage grew very
quickly and the system soon terminated due to OOM. Additionally,
performance was slower than an equivalent plan using regular Bitmap
Scans. This only affected Orca, since only Orca uses Dynamic Scans.

The same change had already been made for regular index scans, but not for
bitmap scans.

This is a follow-up to 24140214f21, which was reverted for being too
verbose by default, which caused CI failures. This commit reintroduces the
`gp_log_interconnect` GUC to ic-proxy logging and follows the
specifications outlined in cdbvars.c. The `ic_proxy_log` macro is replaced
with `elog` for WARNINGs/ERRORs and `elogif` for other events.

The scheme used:
TERSE[LOG]:     Server main loop events, signals
VERBOSE[LOG]:   Startup, (un)register, close/shutdown, EOF receipt etc.
DEBUG[DEBUG1]:  Special packet (BYE/HELLO) transfer, pause/resume events
DEBUG[DEBUG3]:  Less interesting packet transfers, caching
DEBUG[DEBUG5]:  Very low-level tracing info
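
The scheme above can be read as a small lookup table: each event class needs a minimum `gp_log_interconnect` setting and is emitted at a fixed log level. The sketch below is a self-contained model of that idea only. All `Sketch*` and `SK_*` names are hypothetical, and treating the settings as cumulative (DEBUG also enables VERBOSE and TERSE events) is an assumption, not something this commit message states.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical mirrors of the GUC settings and log levels in the scheme. */
typedef enum { SK_VERBOSITY_OFF, SK_VERBOSITY_TERSE,
               SK_VERBOSITY_VERBOSE, SK_VERBOSITY_DEBUG } SketchVerbosity;
typedef enum { SK_LOG, SK_DEBUG1, SK_DEBUG3, SK_DEBUG5 } SketchElevel;

/* Event classes, one per row of the scheme. */
typedef enum { SK_EV_MAIN_LOOP, SK_EV_LIFECYCLE, SK_EV_SPECIAL_PACKET,
               SK_EV_PACKET, SK_EV_TRACE } SketchEvent;

typedef struct
{
	SketchVerbosity required; /* minimum setting that enables the event */
	SketchElevel    elevel;   /* level the message is emitted at */
} SketchRule;

static const SketchRule sketch_rules[] = {
	[SK_EV_MAIN_LOOP]      = { SK_VERBOSITY_TERSE,   SK_LOG },    /* TERSE[LOG] */
	[SK_EV_LIFECYCLE]      = { SK_VERBOSITY_VERBOSE, SK_LOG },    /* VERBOSE[LOG] */
	[SK_EV_SPECIAL_PACKET] = { SK_VERBOSITY_DEBUG,   SK_DEBUG1 }, /* DEBUG[DEBUG1] */
	[SK_EV_PACKET]         = { SK_VERBOSITY_DEBUG,   SK_DEBUG3 }, /* DEBUG[DEBUG3] */
	[SK_EV_TRACE]          = { SK_VERBOSITY_DEBUG,   SK_DEBUG5 }, /* DEBUG[DEBUG5] */
};

/* Returns true if the event passes the verbosity gate; on success the
 * level to log at is stored in *elevel when elevel is not NULL. */
static bool
sketch_should_emit(SketchEvent ev, SketchVerbosity current, SketchElevel *elevel)
{
	const SketchRule *rule = &sketch_rules[ev];

	if (current < rule->required)
		return false;
	if (elevel != NULL)
		*elevel = rule->elevel;
	return true;
}
```

In the real code the gate is expressed with `elogif`, as the commit message says; a message that passes the gate at a DEBUG level would presumably still be subject to the server's usual minimum log level settings.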

This commit amends commit d63731d. The fallback needs to happen inside the
rolled-back transaction, as that is where the stale cache entry is not
cleaned up properly.

External tables have a distribution policy, but it does not dictate the
actual data distribution. It is only used when unloading data, to
compare against the source table's distribution policy.

Therefore, to support ALTER TABLE SET DISTRIBUTED BY for external
tables, we don't need to reorganize the table; we only need
to make sure the catalog change happens.
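
The reasoning above can be illustrated with a toy model, not the actual ALTER TABLE code path. `SketchTable` and `sketch_set_distributed_by()` are hypothetical names; the distribution key is reduced to a single column number and "reorganizing" is reduced to counting rows that would be moved.

```c
#include <stdbool.h>

/* Hypothetical toy model of a table and its catalog distribution policy. */
typedef struct SketchTable
{
	bool is_external;  /* true for an external table */
	int  policy_key;   /* distribution key column recorded in the catalog */
	int  stored_rows;  /* rows physically held by the table */
	int  rows_moved;   /* rows redistributed so far (work done) */
} SketchTable;

/* Models ALTER TABLE ... SET DISTRIBUTED BY (new_key). */
static void
sketch_set_distributed_by(SketchTable *table, int new_key)
{
	/* Only a regular table has data placement that depends on the policy,
	 * so only it pays the cost of moving rows. */
	if (!table->is_external)
		table->rows_moved += table->stored_rows;

	/* The catalog change happens in both cases. */
	table->policy_key = new_key;
}
```

For the external table the model only touches `policy_key`, which is the point of the commit: the policy is metadata that matters when unloading, not a description of where rows live.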

@avamingli merged commit cf204de into apache:main on Dec 30, 2024
22 checks passed
@avamingli deleted the cp_1226 branch on December 30, 2024 08:26