
check_restored_datadir_content failure in pg_regress #559

Closed
LizardWizzard opened this issue Sep 7, 2021 · 24 comments · Fixed by #558, #6666 or #6712
Assignees
Labels
a/reliability Area: relates to reliability of the service a/test/flaky Area: related to flaky tests c/storage/pageserver Component: storage: pageserver t/bug Issue Type: Bug

Comments

@LizardWizzard (Contributor)

Found in main

batch_pg_regress/test_pg_regress.py:58: in test_pg_regress
    check_restored_datadir_content(zenith_cli, test_output_dir, pg)
fixtures/zenith_fixtures.py:1011: in check_restored_datadir_content
    assert (mismatch, error) == ([], [])
E   AssertionError: assert (['pg_xact/0000'], []) == ([], [])
E     At index 0 diff: ['pg_xact/0000'] != []
E     Full diff:
E     - ([], [])
E     + (['pg_xact/0000'], [])
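For context, the fixture walks the original datadir and compares each file against the restored copy, returning the (mismatch, error) pair seen in the traceback above. A minimal sketch of that kind of comparison (a hypothetical stand-in, not the actual code in fixtures/zenith_fixtures.py):

```python
# Minimal sketch of a datadir content comparison, in the spirit of
# check_restored_datadir_content (hypothetical stand-in, not the real fixture).
import filecmp
import os
import tempfile

def compare_datadirs(original: str, restored: str):
    """Recursively compare two data directories.

    Returns (mismatch, errors): files whose contents differ, and files
    that could not be compared. A clean restore yields ([], []).
    """
    mismatch, errors = [], []
    for dirpath, _dirnames, filenames in os.walk(original):
        rel = os.path.relpath(dirpath, original)
        for name in filenames:
            relpath = os.path.normpath(os.path.join(rel, name))
            a = os.path.join(original, relpath)
            b = os.path.join(restored, relpath)
            if not os.path.exists(b):
                errors.append(relpath)
            elif not filecmp.cmp(a, b, shallow=False):
                mismatch.append(relpath)
    return mismatch, errors

# Reproduce the failure shape from the traceback: a single differing byte
# in pg_xact/0000 shows up in the mismatch list.
orig = tempfile.mkdtemp()
rest = tempfile.mkdtemp()
for root in (orig, rest):
    os.makedirs(os.path.join(root, "pg_xact"))
with open(os.path.join(orig, "pg_xact", "0000"), "wb") as f:
    f.write(b"\x00" * 8192)
with open(os.path.join(rest, "pg_xact", "0000"), "wb") as f:
    f.write(b"\x01" + b"\x00" * 8191)

print(compare_datadirs(orig, rest))  # → (['pg_xact/0000'], [])
```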

@lubennikovaav is this because of the recent changes, or is it a spurious error?

@lubennikovaav (Contributor)

It is presumably a race in the test; it should be fixed by #558.

@lubennikovaav lubennikovaav linked a pull request Sep 7, 2021 that will close this issue
@hlinnaka (Contributor)

@hlinnaka hlinnaka reopened this May 28, 2022
@hlinnaka hlinnaka added the a/reliability Area: relates to reliability of the service label May 28, 2022
@SomeoneToIgnore (Contributor)

Noticed it twice since #1872 (comment) today.

We've revamped the SK <-> PS WAL streaming and made the PS connect to the SK more eagerly than it did before with callmemaybe; that might be very related.

@arssher arssher added the a/test/flaky Area: related to flaky tests label Dec 12, 2022
@arssher (Contributor)

arssher commented Dec 12, 2022

@knizhnik (Contributor)

Looks like the problem is that RecordTransactionAbort(bool isSubXact) does not perform XLogFlush:

 * We do not flush XLOG to disk here, since the default assumption after a
 * crash would be that we aborted, anyway.  For the same reason, we don't
 * need to worry about interlocking against checkpoint start.

So it is entirely possible that the last aborted transactions never reach the pageserver.
I do not think this will cause any real problems, except rare test failures because of a pg_xact file mismatch.
I see the following ways of addressing this problem:

  1. Somehow force Postgres to flush WAL on a termination request. This requires not only calling XLogFlush but also waiting until the walsender has propagated these changes.
  2. Disable the check entirely: do not compare the restored and original directories.
  3. Do not compare pg_xact segments.
  4. Smart comparison of pg_xact segments (for example, ignore aborted transactions, or just the last aborted transaction).

It is not good to introduce changes in Postgres core just to make our tests pass, so I do not like option 1.
But disabling or complicating this check is not an exciting proposal either.
Any better suggestions?
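Option 4 could look roughly like the following. This is a hypothetical sketch, not fixture code; it assumes pg_xact's CLOG encoding of two status bits per transaction (0b00 in progress, 0b01 committed, 0b10 aborted) and treats aborted-vs-in-progress as the only tolerated difference:

```python
# Sketch of option 4: a "smart" comparison of pg_xact segments that
# tolerates transactions recorded as aborted on one side but still
# in-progress (never flushed) on the other. pg_xact stores 2 status
# bits per xact: 0b00 = in progress, 0b01 = committed, 0b10 = aborted.
XACT_IN_PROGRESS = 0b00
XACT_ABORTED = 0b10

def segments_equivalent(local: bytes, restored: bytes) -> bool:
    if len(local) != len(restored):
        return False
    for a, b in zip(local, restored):
        if a == b:
            continue
        # Inspect the four 2-bit status fields in this byte.
        for shift in (0, 2, 4, 6):
            sa = (a >> shift) & 0b11
            sb = (b >> shift) & 0b11
            if sa == sb:
                continue
            # The only tolerated difference: aborted vs in-progress.
            if {sa, sb} != {XACT_IN_PROGRESS, XACT_ABORTED}:
                return False
    return True

# An aborted-only difference is ignored; a committed-vs-in-progress
# difference is a real mismatch.
a = bytes([0b01_10_01_01])  # one aborted xact among committed ones
b = bytes([0b01_00_01_01])  # same byte, but that xact never flushed
print(segments_equivalent(a, b))                          # → True
print(segments_equivalent(bytes([0b01]), bytes([0b00])))  # → False
```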

@arssher (Contributor)

arssher commented Dec 27, 2022

We can do option 1, but issue the flush manually from Python code; we already have neon_xlogflush in neontest.c for exactly this purpose. And yes, we'd need to add a wait for propagation. Or does this aborted xact emerge immediately before/during shutdown? Do you have an idea which xact this is?

If this turns out not to be enough, I'd probably just skip the pg_xact comparison.

@arssher (Contributor)

arssher commented Jun 12, 2023

And again:
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-4468/5244173359/index.html#suites/e90f9f55d45ab2a087333a860583a7c3/750933fb3b3439ce

@knizhnik do you have an idea why it happens in test_multixact? We can add flushing of WAL, but I don't understand offhand why it would help here.

@koivunej (Member)

koivunej commented Aug 4, 2023

@awestover (Contributor)

awestover commented Aug 4, 2023

> We can do "Somehow force Postgres to flush WAL on termination request. But it is not only necessary to call XLogFlush, but also wait until walsender will be able to propagate this changes", but issue fsync manually from python code; we already have neon_xlogflush in neontest.c for exactly this purpose. And yes, we'd need to add wait for propagation. Or this aborted xact emerges immediately before/during shutdown? Do you have an idea what is an xact this is?
> If this turns out to be not enough, I'd probably just skip pg_xact comparison.

I'm going to see if I can figure this out

@awestover (Contributor)

awestover commented Aug 4, 2023

FYI: if you want to iterate faster and hopefully uncover the flakiness sooner, you can comment out all but the relevant test in neon/vendor/postgresv14/src/test/regress/parallel_schedule, as follows:

suggested parallel_schedule
# ----------
# src/test/regress/parallel_schedule
#
# By convention, we put no more than twenty tests in any one parallel group;
# this limits the number of connections needed to run the tests.
# ----------

# # run tablespace by itself, and first, because it forces a checkpoint;
# # we'd prefer not to have checkpoints later in the tests because that
# # interferes with crash-recovery testing.
# test: tablespace

# # ----------
# # The first group of parallel tests
# # ----------
# test: boolean char name varchar text int2 int4 int8 oid float4 float8 bit numeric txid uuid enum money rangetypes pg_lsn regproc

# # ----------
# # The second group of parallel tests
# # strings depends on char, varchar and text
# # numerology depends on int2, int4, int8, float4, float8
# # multirangetypes depends on rangetypes
# # multirangetypes shouldn't run concurrently with type_sanity
# # ----------
# test: strings numerology point lseg line box path polygon circle date time timetz timestamp timestamptz interval inet macaddr macaddr8 multirangetypes create_function_0

# # ----------
# # Another group of parallel tests
# # geometry depends on point, lseg, box, path, polygon and circle
# # horology depends on interval, timetz, timestamp, timestamptz
# # opr_sanity depends on create_function_0
# # ----------
# test: geometry horology tstypes regex type_sanity opr_sanity misc_sanity comments expressions unicode xid mvcc

# # ----------
# # These four each depend on the previous one
# # ----------
# test: create_function_1
# test: create_type
# test: create_table
# test: create_function_2

# # ----------
# # Load huge amounts of data
# # We should split the data files into single files and then
# # execute two copy tests parallel, to check that copy itself
# # is concurrent safe.
# # ----------
# test: copy copyselect copydml insert insert_conflict

# # ----------
# # More groups of parallel tests
# # ----------
# test: create_misc create_operator create_procedure create_schema
# # These depend on create_misc and create_operator
# test: create_index create_index_spgist create_view index_including index_including_gist

# # ----------
# # Another group of parallel tests
# # ----------
# test: create_aggregate create_function_3 create_cast constraints triggers select inherit typed_table vacuum drop_if_exists updatable_views roleattributes create_am hash_func errors infinite_recurse

# # ----------
# # sanity_check does a vacuum, affecting the sort order of SELECT *
# # results. So it should not run parallel to other tests.
# # ----------
# test: sanity_check

# # ----------
# # Another group of parallel tests
# # Note: the ignore: line does not run random, just mark it as ignorable
# # ----------
# ignore: random
# test: select_into select_distinct select_distinct_on select_implicit select_having subselect union case join aggregates transactions random portals arrays btree_index hash_index update delete namespace 
test: prepared_xacts

# # ----------
# # Another group of parallel tests
# # ----------
# test: brin gin gist spgist privileges init_privs security_label collate matview lock replica_identity rowsecurity object_address tablesample groupingsets drop_operator password identity generated join_hash

# # ----------
# # Additional BRIN tests
# # ----------
# test: brin_bloom brin_multi

# # ----------
# # Another group of parallel tests
# # ----------
# test: create_table_like alter_generic alter_operator misc async dbsize misc_functions sysviews tsrf tid tidscan tidrangescan collate.icu.utf8 incremental_sort

# # rules cannot run concurrently with any test that creates
# # a view or rule in the public schema
# # collate.*.utf8 tests cannot be run in parallel with each other
# test: rules psql psql_crosstab amutils stats_ext collate.linux.utf8

# # run by itself so it can run parallel workers
# test: select_parallel
# test: write_parallel
# test: vacuum_parallel

# # no relation related tests can be put in this group
# test: publication subscription

# # ----------
# # Another group of parallel tests
# # ----------
# test: select_views portals_p2 foreign_key cluster dependency guc bitmapops combocid tsearch tsdicts foreign_data window xmlmap functional_deps advisory_lock indirect_toast equivclass

# # ----------
# # Another group of parallel tests (JSON related)
# # ----------
# test: json jsonb json_encoding jsonpath jsonpath_encoding jsonb_jsonpath

# # ----------
# # Another group of parallel tests
# # NB: temp.sql does a reconnect which transiently uses 2 connections,
# # so keep this parallel group to at most 19 tests
# # ----------
# test: plancache limit plpgsql copy2 temp domain rangefuncs prepare conversion truncate alter_table sequence polymorphism rowtypes returning largeobject with xml

# # ----------
# # Another group of parallel tests
# # ----------
# test: partition_join partition_prune reloptions hash_part indexing partition_aggregate partition_info tuplesort explain compression memoize

# # event triggers cannot run concurrently with any test that runs DDL
# # oidjoins is read-only, though, and should run late for best coverage
# test: event_trigger oidjoins
# # this test also uses event triggers, so likewise run it by itself
# test: fast_default

# # run stats by itself because its delay may be insufficient under heavy load
# test: stats

koivunej added a commit that referenced this issue Aug 10, 2023
`pg_regress` is flaky: #559

Consolidated `CHECKPOINT` into `check_restored_datadir_content` and added a
wait for `wait_for_last_flush_lsn`.

Some recently introduced flakiness was fixed with #4948.

---------

Co-authored-by: Joonas Koivunen <joonas@neon.tech>
@koivunej (Member)

koivunej commented Sep 8, 2023

@jcsp (Collaborator)

jcsp commented Feb 7, 2024

@jcsp jcsp added t/bug Issue Type: Bug c/storage/pageserver Component: storage: pageserver labels Feb 7, 2024
@jcsp (Collaborator)

jcsp commented Feb 7, 2024

Looking at the current situation:

  • We call CHECKPOINT in postgres before flushing to the pageserver, which is intended to make sure that the on-disk state is fully written, even if postgres wouldn't usually block on it (e.g. in the incomplete-transaction case).
  • The test ignores the presence/absence of pg_xact files, but if a file is present in both the on-disk state and the restored state, it is still compared.
  • The test is failing because the files are different.

hlinnaka added a commit that referenced this issue Feb 7, 2024
This test occasionally fails with a difference in "pg_xact/0000" file
between the local and restored datadirs. My hypothesis is that
something changed in the database between the last explicit checkpoint
and the shutdown. I suspect autovacuum, it could certainly create
transactions.

To fix, be more precise about the point in time that we compare. Shut
down the endpoint first, then read the last LSN (i.e. the shutdown
checkpoint's LSN), from the local disk with pg_controldata. And use
exactly that LSN in the basebackup.

Closes #559
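The pg_controldata part of that fix could be sketched like this (hypothetical helper; the sample output is abbreviated, but the field name matches pg_controldata's real output format):

```python
# Sketch of the approach in the commit message above: after shutting down
# the endpoint, extract the shutdown checkpoint's LSN from pg_controldata
# output and request the basebackup at exactly that LSN.
import re

SAMPLE_CONTROLDATA = """\
pg_control version number:            1300
Database cluster state:               shut down
Latest checkpoint location:           0/169C3C8
Latest checkpoint's REDO location:    0/169C3C8
"""

def last_checkpoint_lsn(controldata_output: str) -> str:
    """Parse the 'Latest checkpoint location' LSN out of pg_controldata output."""
    m = re.search(r"^Latest checkpoint location:\s*([0-9A-F]+/[0-9A-F]+)",
                  controldata_output, re.MULTILINE)
    if m is None:
        raise ValueError("no checkpoint location in pg_controldata output")
    return m.group(1)

print(last_checkpoint_lsn(SAMPLE_CONTROLDATA))  # → 0/169C3C8
```

In the real test this string would come from running the pg_controldata binary against the stopped endpoint's data directory, and the resulting LSN would be passed to the basebackup request.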
@hlinnaka (Contributor)

hlinnaka commented Feb 7, 2024

This is still failing occasionally https://neon-github-public-dev.s3.amazonaws.com/reports/pr-6659/7814559995/index.html#/testresult/23e61beb205b42f6

Hmm, the test is supposed to write a diff of the file to a .filediff file, which should then be included as an attachment in the Allure report, but I don't see it there. Weird. When I simulate the failure locally by modifying the file, it does create a .filediff file, and the allure_attach_from_dir() function does attach it. I wonder what's going on there?

hlinnaka added a commit that referenced this issue Feb 9, 2024
This test occasionally fails with a difference in "pg_xact/0000" file
between the local and restored datadirs. My hypothesis is that something
changed in the database between the last explicit checkpoint and the
shutdown. I suspect autovacuum, it could certainly create transactions.

To fix, be more precise about the point in time that we compare. Shut
down the endpoint first, then read the last LSN (i.e. the shutdown
checkpoint's LSN), from the local disk with pg_controldata. And use
exactly that LSN in the basebackup.

Closes #559.

I'm proposing this as an alternative to
#6662.
@hlinnaka (Contributor)

hlinnaka commented Feb 9, 2024

This was hopefully fixed by #6666. If it still recurs, please reopen.

@jcsp (Collaborator)

jcsp commented Feb 9, 2024

https://neon-github-public-dev.s3.amazonaws.com/reports/pr-6647/7843577515/index.html#suites/158be07438eb5188d40b466b6acfaeb3/526276cf23cb8ed5

The modified basebackup invocation that uses an explicit LSN is leading to a timeout waiting for that LSN.
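The timeout mechanics are presumably of this shape (hypothetical helper names; the real wait happens inside the pageserver): basebackup at an explicit LSN has to wait until ingested WAL reaches that LSN, and if the shutdown checkpoint record never arrives, the wait can only end by timing out.

```python
# Sketch of an LSN wait loop with a timeout. If the WAL stream stalls
# short of the requested LSN, the loop never sees the target and the
# caller observes a timeout.
import time

def wait_for_lsn(get_last_record_lsn, target_lsn: int, timeout_s: float) -> bool:
    """Poll until the ingested LSN reaches target_lsn or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if get_last_record_lsn() >= target_lsn:
            return True
        time.sleep(0.01)
    return False

print(wait_for_lsn(lambda: 250, target_lsn=200, timeout_s=0.05))  # → True
print(wait_for_lsn(lambda: 100, target_lsn=200, timeout_s=0.05))  # → False
```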

@jcsp (Collaborator)

jcsp commented Feb 13, 2024

This seems to be failing more frequently since #6666. It is the most frequently failing flaky test over the last 4 days.

The next proposed change is #6712

arssher pushed a commit that referenced this issue Feb 15, 2024
This test occasionally fails with a difference in "pg_xact/0000" file
between the local and restored datadirs. My hypothesis is that
something changed in the database between the last explicit checkpoint
and the shutdown. I suspect autovacuum, it could certainly create
transactions.

To fix, be more precise about the point in time that we compare. Shut
down the endpoint first, then read the last LSN (i.e. the shutdown
checkpoint's LSN), from the local disk with pg_controldata. And use
exactly that LSN in the basebackup.

Closes #559
arssher pushed a commit that referenced this issue Feb 15, 2024
arssher pushed a commit that referenced this issue Mar 11, 2024