Use interface for storage interactions in Appendonly TAM #2

Open
wants to merge 18 commits into main from yezzey_ao_smgr

Conversation

reshke
Owner

@reshke reshke commented Sep 30, 2024

fix #ISSUE_Number


Change logs

Describe your change clearly, including what problem is being solved or what feature is being added.

If it breaks backward or forward compatibility, please clarify.

Why are the changes needed?

Describe why the changes are necessary.

Does this PR introduce any user-facing change?

If yes, please clarify the previous behavior and the change this PR proposes.

How was this patch tested?

Please detail how the changes were tested, including manual tests and any relevant unit or integration tests.

Contributor's Checklist

Here are some reminders and checklist items to review before/when submitting your pull request:

  • Make sure your Pull Request has a clear title and commit message. You can use the git-commit template as a reference.
  • Sign the Contributor License Agreement as prompted for your first-time contribution (one-time setup).
  • Read the coding contribution guide, including our code conventions, workflow, and more.
  • Link the related GitHub Issues or Discussions (if any).
  • Document the changes.
  • Add tests for the change.
  • Pass make installcheck.
  • Pass make -C src/test installcheck-cbdb-parallel.
  • Feel free to request a review and approval from the cloudberrydb/dev team when your PR is ready 🥳

@reshke reshke force-pushed the yezzey_ao_smgr branch 11 times, most recently from af2d94b to 0aa8b43 on September 30, 2024 15:56
reshke added 2 commits October 8, 2024 09:18
…ly_print_compaction

While recently debugging an unrelated issue, I happened to notice that the `Debug_appendonly_print_compaction` output is gone in CBDB (it was very helpful in GPDB6).
So, I propose to bring it back.
This has no actual effect, but allows extensions that use the `geqo_enable`
variable to compile.
wenchaozhang-123 and others added 15 commits October 10, 2024 16:48
…d_values is null.

Previously, if a tag's allowed_values was null, adding or altering a tag value on an object
would error out. This commit fixes that unreasonable behavior: we can now add
or alter any tag value on an object when the tag's allowed_values is null.

Authored-by: Zhang Wenchao zwcpostgres@gmail.com
This commit introduces a new feature to ensure WAL (Write-Ahead Log) synchronization
between the primary and standby before performing the replica check. Key changes include:

1. Add a new method wait_for_wal_sync() to poll and wait for WAL sync.
2. Implement the WAL sync check using pg_current_wal_lsn() and pg_stat_replication.
3. Call wait_for_wal_sync() before running the replica check.

This enhancement improves the reliability of the replica check by ensuring
that all changes have been replicated before comparison, reducing false
positives due to replication lag.
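
The underlying check can be a simple LSN comparison. A minimal sketch of the kind of query wait_for_wal_sync() might poll, using the standard pg_current_wal_lsn() function and pg_stat_replication view (the exact query used by the check may differ):

```
-- On the primary: true once the standby has replayed everything written so far.
SELECT pg_current_wal_lsn() = replay_lsn AS in_sync
FROM pg_stat_replication;
```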
Since GPDB vacuums auxiliary tables together with the base
relation in one distributed transaction, unlike
PostgreSQL where one transaction is opened and closed
for each auxiliary relation, vacuuming the auxiliary TOAST
table should not be dispatched.
This enables reporting for the table scan phase of index builds on AOCO
tables. This directly affects:
(1) pg_stat_progress_create_index.blocks_total
(2) pg_stat_progress_create_index.blocks_done

Sample:

Connect in utility mode to a segment, while CREATE INDEX is running on
an AOCO table:

postgres=# select pid, command, phase, blocks_total, blocks_done from pg_stat_progress_create_index;
  pid   |   command    |             phase              | blocks_total | blocks_done
--------+--------------+--------------------------------+--------------+-------------
 644771 | CREATE INDEX | building index: scanning table |         9779 |        6832

This commit introduces:

(1) Accounting in AOCSScanDescData to track the total number of bytes
read across all segment files. It is updated whenever a new block is
read. If the block is compressed, its compressed length is added
(uncompressed length otherwise).

(2) open_all_datumstreamread_segfiles() is touched since it also reads
blocks. Its interface is simplified by passing the scan
descriptor, which is also used to get at the total bytes counter.

(3) Reporting code in aoco_index_build_range_scan() to report, in
increments of BLCKSZ, the number of blocks scanned so far. The total
number of blocks to be scanned is also published up front and is
determined by summing the compressed eof values of all segfiles. The
total number of blocks depends on whether a block directory is being
built as part of the scan. If it is being built, then the whole table
needs to be scanned, whereas only index columns need to be scanned
otherwise. A new function has been introduced to summarize the
filesegtotals which accepts column projection info:
GetAOCSSSegFilesTotalsWithProj().

Note: We currently only support reporting for non-concurrent and serial
builds.
* IndexVacuumInfo.analyze_only was not initialized, leading to possible
no-ops in index_vacuum_cleanup(), if the garbage value for it was
non-zero.

* IndexVacuumInfo.num_heap_tuples should be set to the AO table's old
reltuples value (value prior to kicking off vacuum). Please see
IndexVacuumInfo and lazy_vacuum_index() for details. As explained in
the code comments, num_heap_tuples will always be an estimate during
ambulkdelete(). Since dedd4075c9f, we have been using the index rel's
reltuples instead. This is wrong (e.g. in scenarios where the index has
been created after data manipulation). So, correctly use the table's
reltuples value to provide the estimate.
Given the scenario of updating stats during VACUUM:

1) coordinator vacuums and updates stats of its own;
2) then coordinator dispatches vacuum to segments;
3) coordinator combines stats received from segments
   to overwrite the stats of its own.

Because upstream introduced a feature that can skip the
full index scan during cleanup of B-tree indexes when
possible (refer to: postgres/postgres@857f9c3),
there was a case in a QD-QEs distributed deployment where
some QEs could skip the full index scan and stop updating
statistics, resulting in the QD being unable to collect all
QEs' stats and thus overwriting index->reltuples with a partially
accumulated value. More interestingly, it usually happened
starting from the 3rd consecutive VACUUM after
fresh inserts, due to the above index-scan-skipping criteria.

This patch prevents the above issue in two ways:
on the QE, do not skip the full index scan; report current statistics
to the QD as requested;
on the QD, update statistics only when all QEs'
data has been collected completely; if the number of reporting QEs doesn't match
the total number of dispatched QEs, no stats update happens.
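
A hypothetical way to sanity-check this patch (a sketch; 'ao_t' and 'ao_t_idx' are assumed names for an AO table and its index): after several consecutive VACUUMs, the index's reltuples should stay consistent with the table rather than dropping to a partially accumulated value.

```
-- 'ao_t' and 'ao_t_idx' are assumed names.
VACUUM ao_t;
VACUUM ao_t;
VACUUM ao_t;
SELECT relname, reltuples FROM pg_class
WHERE relname IN ('ao_t', 'ao_t_idx');
```
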
Since we have already checked visibility while scanning tuples
in aocs compaction, the visibility check code in aocs compaction
is unnecessary and can be removed.

Co-authored-by: linxu.hlx <linxu.hlx@alibaba-inc.com>
When this option is passed, instead of vacuuming the main tables, look up the auxiliary tables of
all indicated AO tables and vacuum those.

If this option is given without a table name, vacuum ONLY all AO auxiliary tables.

Also add regress test coverage of this behavior.
We used to have the pg_appendonly row for the AO/CO table stored
in its relcache entry, but that somehow got lost during the PG12
merge (it is still in 6X). Now we bring it back and utilize it
wherever possible.

This commit is basically a partial cherry-pick of ef92bcf151f99c9152a10c8750eee0d4dde39bc1.
But because of many changes in the code base, and because we don't
need the other changes in that commit, we are not cherry-picking it wholesale.
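
For illustration, the cached row corresponds to the pg_appendonly catalog entry that can be inspected directly (a sketch; 'ao_t' is an assumed AO table name):

```
SELECT segrelid, blkdirrelid, visimaprelid
FROM pg_appendonly
WHERE relid = 'ao_t'::regclass;
```
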
…nsert()

to reduce unnecessary catalog scans, as the required segrelid
was already obtained from appendonly_insert_init().
Enable system view "pg_stat_progress_vacuum" to report VACUUM progress
for ao_row and ao_column tables.

Here are details of how each field in this view is interpreted/updated
for append-optimized tables:

-----
phase
-----

Since the mechanism of VACUUM for append-optimized tables is significantly
different from VACUUM for heap tables, this commit adds three additional
vacuum progress phases dedicated to append-optimized tables.

Possible VACUUM phases for append-optimized tables include the following:

- PROGRESS_VACUUM_PHASE_AO_PRE_CLEANUP (append-optimized pre-cleanup)
- PROGRESS_VACUUM_PHASE_AO_COMPACT (append-optimized compact)
- PROGRESS_VACUUM_PHASE_AO_POST_CLEANUP (append-optimized post-cleanup)
- PROGRESS_VACUUM_PHASE_VACUUM_INDEX (vacuuming indexes)

VACUUM of an append-optimized table progresses through these phases in the following order:

- append-optimized pre-cleanup
- append-optimized compact
- append-optimized post-cleanup

"vacuuming index" is a "sub-phases" that could happen as part of
pre-cleanup phase and/or post-cleanup phase, it happens if:

- the relation has index, AND
- there are awaiting-drop segment files that are invisible to all.

While a segment can only be marked as awaiting-drop during VACUUM
compact phase, it can be recycled during either post-cleanup phase or
pre-clean phase of a later vacuum run.

-----------
heap_blks_*
-----------
We divide byte size by the heap block size to get the heap_blks_* numbers for
AO tables (e.g., 81,920,000 bytes on disk / BLCKSZ of 8,192 = 10,000
heap-equivalent blocks). Since append-optimized tables have variable-length
block sizes, measuring heap-equivalent blocks instead of using the number of
varblocks better helps users to compare and calibrate.

---------------
heap_blks_total
---------------
- Collected at the beginning of the pre-cleanup phase by adding up the
  on-disk file sizes of all segment files for the append-optimized
relation and converting that size into a number of heap-equivalent blocks.
- The value should not change while VACUUM progresses.

-----------------
heap_blks_scanned
-----------------
- Collected during the compact phase.
- ao_row: updated every time we finish scanning a segment file.
- ao_column: updated every time the number advances.
- SPECIAL NOTE for AO: heap_blks_scanned can be smaller than or equal to
  heap_blks_total at the end of VACUUM, because for AO tables we need
not scan blocks after the logical EOF of a segment file.

------------------
heap_blks_vacuumed
------------------
- Collected whenever we truncate a segment file, which could happen
  during the pre-cleanup phase, the post-cleanup phase, both, or neither.
- SPECIAL NOTE for AO: heap_blks_vacuumed could be either smaller or
  larger than heap_blks_scanned. For AO tables the bottom line is that at
the end of VACUUM the sum of heap_blks_vacuumed and heap_blks_scanned
must be smaller than or equal to heap_blks_total.

------------------
index_vacuum_count
------------------
- Collected whenever we recycle a dead segment file, which could
  happen during the pre-cleanup phase, the post-cleanup phase, both, or
neither.

---------------
max_dead_tuples
---------------
- Collected at the beginning of the pre-cleanup phase. This is the total
  number of tuples before the logical EOF of all segment files.
- The value should not change while VACUUM progresses.

---------------
num_dead_tuples
---------------
- Collected during the compact phase.
- ao_row: updated every time we discard a dead tuple.
- ao_column: updated every time we move a tuple, if the number of dead
  tuples advances.
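
While such a VACUUM is running, the reported progress can be observed with a query along these lines (a sketch; the column list follows the standard pg_stat_progress_vacuum view described above):

```
SELECT pid, relid::regclass, phase, heap_blks_total,
       heap_blks_scanned, heap_blks_vacuumed,
       index_vacuum_count, max_dead_tuples, num_dead_tuples
FROM pg_stat_progress_vacuum;
```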

Reviewed-by: Huansong Fu <fuhuansong@gmail.com>

Remove tests that are not here
This commit fixes the following compiler warnings:

vacuum_ao.c:738:1: warning: ‘init_vacrelstats’ was used with no prototype before its definition [-Wmissing-prototypes]
  738 | init_vacrelstats()
      | ^~~~~~~~~~~~~~~~
CPartPruneStepsBuilder.cpp: In member function ‘PartitionedRelPruneInfo* gpdxl::CPartPruneStepsBuilder::CreatePartPruneInfoForOneLevel(gpdxl::CDXLNode*)’:
CPartPruneStepsBuilder.cpp:83:29: warning: comparison of integer expressions of different signedness: ‘gpos::ULONG’ {aka ‘unsigned int’} and ‘int’ [-Wsign-compare]
   83 |         for (ULONG i = 0; i < pinfo->nparts; ++i)
      |                           ~~^~~~~~~~~~~~~~~
The following fields of the pg_stat_all_tables view are now
available for append-optimized tables:

n_live_tup
n_dead_tup
last_vacuum
vacuum_count

Note that n_dead_tup only accounts for dead tuples in
non-awaiting-drop segment files.
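
For example, a check along these lines now returns meaningful values for an AO table (a sketch; 'ao_t' is an assumed table name):

```
SELECT n_live_tup, n_dead_tup, last_vacuum, vacuum_count
FROM pg_stat_all_tables
WHERE relname = 'ao_t';
```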

Reviewed-by: Hansong Fu <fuhuansong@gmail.com>
Calls to FaultInjector_InjectFaultIfSet that provide tableName as an
argument currently break early if the tableName does not match the
hash entry (set by gp_inject_fault) for the fault. This is problematic
because the fault must be set with a specific table name to be triggered at all.

This commit modifies the behavior to only check for matching tableName
if the hash entry has a tableName set. This allows for more targeted
testing.

Example:

Setting the fault without a tableName will trigger it for any table:

```
postgres=# SELECT gp_inject_fault('AppendOnlyStorageRead_ReadNextBlock_success', 'skip', '', '', '', 1, 100, 0, dbid)
    FROM gp_segment_configuration WHERE content = 1 AND role = 'p';
 gp_inject_fault
-----------------
 Success:
(1 row)

postgres=# copy alter_attach_t1 to '/dev/null';
COPY 100000
postgres=# copy alter_attach_t2 to '/dev/null';
COPY 100000
postgres=# SELECT gp_inject_fault('AppendOnlyStorageRead_ReadNextBlock_success', 'status', dbid)
    FROM gp_segment_configuration WHERE content = 1 AND role = 'p';
                                                                                                                   gp_inject_fault
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Success: fault name:'AppendOnlyStorageRead_ReadNextBlock_success' fault type:'skip' ddl statement:'' database name:'' table name:'' start occurrence:'1' end occurrence:'100' extra arg:'0' fault injection state:'completed'  num times hit:'100' +
```

Setting the fault with a given tableName will trigger it only for that table.

```
postgres=# SELECT gp_inject_fault('AppendOnlyStorageRead_ReadNextBlock_success', 'skip', '', '', 'alter_attach_t1', 1, 100, 0, dbid)
    FROM gp_segment_configuration WHERE content = 1 AND role = 'p';
 gp_inject_fault
-----------------
 Success:
(1 row)

postgres=# copy alter_attach_t1 to '/dev/null';
COPY 100000
postgres=# copy alter_attach_t2 to '/dev/null';
COPY 100000
postgres=# SELECT gp_inject_fault('AppendOnlyStorageRead_ReadNextBlock_success', 'status', dbid)
    FROM gp_segment_configuration WHERE content = 1 AND role = 'p';
                                                                                                                          gp_inject_fault
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Success: fault name:'AppendOnlyStorageRead_ReadNextBlock_success' fault type:'skip' ddl statement:'' database name:'' table name:'alter_attach_t1' start occurrence:'1' end occurrence:'100' extra arg:'0' fault injection state:'triggered'  num times hit:'51' +
```

Authored-by: Brent Doil <bdoil@vmware.com>
Change the vacuum row progress expected output

Fix the vacuum_progress_column expected output

Also some minor refactoring done around src/backend/catalog/pg_attribute_encoding.c
@reshke reshke force-pushed the yezzey_ao_smgr branch 2 times, most recently from 17bd67d to 9223daf on October 11, 2024 13:28
We present an appendonly SMGR interface and use it for storage
operations.

Use of this interface allows implementing out-of-core extensions that
store data not locally but in some external storage.
reshke pushed a commit that referenced this pull request Jan 7, 2025
We've heard a couple of reports of people having trouble with
multi-gigabyte-sized query-texts files.  It occurred to me that on
32-bit platforms, there could be an issue with integer overflow
of calculations associated with the total query text size.
Address that with several changes:

1. Limit pg_stat_statements.max to INT_MAX / 2 not INT_MAX.
The hashtable code will bound it to that anyway unless "long"
is 64 bits.  We still need overflow guards on its use, but
this helps.

2. Add a check to prevent extending the query-texts file to
more than MaxAllocHugeSize.  If it got that big, qtext_load_file
would certainly fail, so there's not much point in allowing it.
Without this, we'd need to consider whether extent, query_offset,
and related variables shouldn't be off_t not size_t.

3. Adjust the comparisons in need_gc_qtexts() to be done in 64-bit
arithmetic on all platforms.  It appears possible that under duress
those multiplications could overflow 32 bits, yielding a false
conclusion that we need to garbage-collect the texts file, which
could lead to repeatedly garbage-collecting after every hash table
insertion.

Per report from Bruno da Silva.  I'm not convinced that these
issues fully explain his problem; there may be some other bug that's
contributing to the query-texts file becoming so large in the first
place.  But it did get that big, so #2 is a reasonable defense,
and #3 could explain the reported performance difficulties.

(See also commit 8bbe4cb, which addressed some related bugs.
The second Discussion: link is the thread that led up to that.)

This issue is old, and is primarily a problem for old platforms,
so back-patch.

Discussion: https://postgr.es/m/CAB+Nuk93fL1Q9eLOCotvLP07g7RAv4vbdrkm0cVQohDVMpAb9A@mail.gmail.com
Discussion: https://postgr.es/m/5601D354.5000703@BlueTreble.com
reshke pushed a commit that referenced this pull request Jan 7, 2025
We used to rename index-backed constraints in the EXCHANGE PARTITION
command in 6X. Now we don't. We've decided to keep that behavior
in 7X after looking into the opposing arguments:

Argument #1. It might cause problems during upgrade.
- Firstly, we won't be using legacy syntax in the dump scripts, so we
just need to worry about the existing tables produced by EXCHANGE
PARTITION, i.e. whether or not they can be restored correctly.
- For upgrading from 6X->7X, since those tables already have constraint
and index names matching the table names, we should be OK.
- For upgrading 7X->onward, since we implement EXCHANGE PARTITION
simply as a combination of upstream-syntax commands (see
AtExecGPExchangePartition()), pg_upgrade should be able to handle
them. We've verified that manually, and the automated test should
cover that too. In fact, this gives another point that we shouldn't
do our own hacky things as part of EXCHANGE PARTITION, which might
confuse pg_upgrade.

Argument #2. It might surprise the users and their existing workloads.
- The index-backed constraint names are all implicitly generated and
shouldn't be directly used in most cases.
- They are not the only thing that will appear changed. E.g. the
normal indexes (e.g. B-tree ones) will be too. So the decision to
change one should be made together with changing all of them.

For more details, see: https://docs.google.com/document/d/1enJdKYxkk5WFRV1WoqIgLgRxCGxOqI2nglJVE_Wglec
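
For reference, a hypothetical exchange using the legacy syntax (table name, partition value, and exchange table are assumptions); the implicitly generated, index-backed constraint names on the exchanged tables are left as-is:

```
-- Assumed names: partitioned table 'sales', standalone table 'sales_jan'.
ALTER TABLE sales EXCHANGE PARTITION FOR ('2024-01-01') WITH TABLE sales_jan;
```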
reshke pushed a commit that referenced this pull request Jan 24, 2025
## Problem
An error occurs in the Python lib when a plpython function is executed.
After our analysis: in the user's cluster, a plpython UDF
was running over an unstable network and got a timeout error:
`failed to acquire resources on one or more segments`.
Then another plpython UDF was run in the same session, and that UDF
failed with a GC error.

Here is the core dump:
```
2023-11-24 10:15:18.945507 CST,,,p2705198,th2081832064,,,,0,,,seg-1,,,,,"LOG","00000","3rd party error log:
    #0 0x7f7c68b6d55b in frame_dealloc /home/cc/repo/cpython/Objects/frameobject.c:509:5
    #1 0x7f7c68b5109d in gen_send_ex /home/cc/repo/cpython/Objects/genobject.c:108:9
    #2 0x7f7c68af9ddd in PyIter_Next /home/cc/repo/cpython/Objects/abstract.c:3118:14
    #3 0x7f7c78caa5c0 in PLy_exec_function /home/cc/repo/gpdb6/src/pl/plpython/plpy_exec.c:134:11
    #4 0x7f7c78cb5ffb in plpython_call_handler /home/cc/repo/gpdb6/src/pl/plpython/plpy_main.c:387:13
    #5 0x562f5e008bb5 in ExecMakeTableFunctionResult /home/cc/repo/gpdb6/src/backend/executor/execQual.c:2395:13
    #6 0x562f5e0dddec in FunctionNext_guts /home/cc/repo/gpdb6/src/backend/executor/nodeFunctionscan.c:142:5
    #7 0x562f5e0da094 in FunctionNext /home/cc/repo/gpdb6/src/backend/executor/nodeFunctionscan.c:350:11
    #8 0x562f5e03d4b0 in ExecScanFetch /home/cc/repo/gpdb6/src/backend/executor/execScan.c:84:9
    #9 0x562f5e03cd8f in ExecScan /home/cc/repo/gpdb6/src/backend/executor/execScan.c:154:10
    #10 0x562f5e0da072 in ExecFunctionScan /home/cc/repo/gpdb6/src/backend/executor/nodeFunctionscan.c:380:9
    #11 0x562f5e001a1c in ExecProcNode /home/cc/repo/gpdb6/src/backend/executor/execProcnode.c:1071:13
    #12 0x562f5dfe6377 in ExecutePlan /home/cc/repo/gpdb6/src/backend/executor/execMain.c:3202:10
    #13 0x562f5dfe5bf4 in standard_ExecutorRun /home/cc/repo/gpdb6/src/backend/executor/execMain.c:1171:5
    #14 0x562f5dfe4877 in ExecutorRun /home/cc/repo/gpdb6/src/backend/executor/execMain.c:992:4
    #15 0x562f5e857e69 in PortalRunSelect /home/cc/repo/gpdb6/src/backend/tcop/pquery.c:1164:4
    #16 0x562f5e856d3f in PortalRun /home/cc/repo/gpdb6/src/backend/tcop/pquery.c:1005:18
    #17 0x562f5e84607a in exec_simple_query /home/cc/repo/gpdb6/src/backend/tcop/postgres.c:1848:10
```

## Reproduce
We can use a simple procedure to reproduce the above problem:
- set timeout GUC: `gpconfig -c gp_segment_connect_timeout -v 5` and `gpstop -ari`
- prepare function:
```
CREATE EXTENSION plpythonu;
CREATE OR REPLACE FUNCTION test_func() RETURNS SETOF int AS
$$
plpy.execute("select pg_backend_pid()")

for i in range(0, 5):
    yield (i)

$$ LANGUAGE plpythonu;
```
- exit from the current psql session.
- stop the postmaster of segment: `gdb -p "the pid of segment postmaster"`
- enter a psql session.
- call `SELECT test_func();` and get error
```
gpadmin=# select test_func();
ERROR:  function "test_func" error fetching next item from iterator (plpy_elog.c:121)
DETAIL:  Exception: failed to acquire resources on one or more segments
CONTEXT:  Traceback (most recent call last):
PL/Python function "test_func"
```
- quit gdb and make postmaster runnable.
- call  `SELECT test_func();` again and get panic
```
gpadmin=# SELECT test_func();
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!> 
```

## Analysis
- There is an SPI call in test_func(): `plpy.execute()`.
- Then the coordinator starts a subtransaction via PLy_spi_subtransaction_begin();
- Meanwhile, if the segment cannot receive the instruction from the coordinator,
  the subtransaction begin procedure fails.
- BUT! The Python interpreter does not know that an error happened and
  does not clean up its environment.
- Then the next plpython UDF in the same session will fail due to the corrupted
  Python environment.

## Solution
- Use try-catch to catch the exception raised by PLy_spi_subtransaction_begin()
- Set the Python error indicator via PLy_spi_exception_set()


Co-authored-by: Chen Mulong <chenmulong@gmail.com>