
check_restored_datadir_content failure in pg_regress #559

Closed
LizardWizzard opened this issue Sep 7, 2021 · 24 comments · Fixed by #558, #6666 or #6712
Assignees
Labels
a/reliability Area: relates to reliability of the service a/test/flaky Area: related to flaky tests c/storage/pageserver Component: storage: pageserver t/bug Issue Type: Bug

Comments

@LizardWizzard (Contributor)

Found in main

batch_pg_regress/test_pg_regress.py:58: in test_pg_regress
    check_restored_datadir_content(zenith_cli, test_output_dir, pg)
fixtures/zenith_fixtures.py:1011: in check_restored_datadir_content
    assert (mismatch, error) == ([], [])
E   AssertionError: assert (['pg_xact/0000'], []) == ([], [])
E     At index 0 diff: ['pg_xact/0000'] != []
E     Full diff:
E     - ([], [])
E     + (['pg_xact/0000'], [])
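For context, the fixture walks the original datadir and compares each file against the restored copy, returning the (mismatch, error) pair seen in the traceback above. A minimal sketch of that kind of comparison (a hypothetical stand-in, not the actual code in fixtures/zenith_fixtures.py):

```python
# Minimal sketch of a datadir content comparison, in the spirit of
# check_restored_datadir_content (hypothetical stand-in, not the real fixture).
import filecmp
import os
import tempfile

def compare_datadirs(original: str, restored: str):
    """Recursively compare two data directories.

    Returns (mismatch, errors): files whose contents differ, and files
    that could not be compared. A clean restore yields ([], []).
    """
    mismatch, errors = [], []
    for dirpath, _dirnames, filenames in os.walk(original):
        rel = os.path.relpath(dirpath, original)
        for name in filenames:
            relpath = os.path.normpath(os.path.join(rel, name))
            a = os.path.join(original, relpath)
            b = os.path.join(restored, relpath)
            if not os.path.exists(b):
                errors.append(relpath)
            elif not filecmp.cmp(a, b, shallow=False):
                mismatch.append(relpath)
    return mismatch, errors

# Reproduce the failure shape from the traceback: a single differing byte
# in pg_xact/0000 shows up in the mismatch list.
orig = tempfile.mkdtemp()
rest = tempfile.mkdtemp()
for root in (orig, rest):
    os.makedirs(os.path.join(root, "pg_xact"))
with open(os.path.join(orig, "pg_xact", "0000"), "wb") as f:
    f.write(b"\x00" * 8192)
with open(os.path.join(rest, "pg_xact", "0000"), "wb") as f:
    f.write(b"\x01" + b"\x00" * 8191)

print(compare_datadirs(orig, rest))  # → (['pg_xact/0000'], [])
```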

@lubennikovaav is this because of the recent changes, or is it a spurious error?

@lubennikovaav (Contributor)

It is presumably a race in the test; it should be fixed by #558.

@lubennikovaav lubennikovaav linked a pull request Sep 7, 2021 that will close this issue
@hlinnaka (Contributor)

@hlinnaka hlinnaka reopened this May 28, 2022
@hlinnaka hlinnaka added the a/reliability Area: relates to reliability of the service label May 28, 2022
@SomeoneToIgnore (Contributor)

Noticed it twice since #1872 (comment) today.

We've revamped the SK <-> PS WAL streaming and made the PS connect to the SK more eagerly than it did before with callmemaybe; that might be very related.

@arssher arssher added the a/test/flaky Area: related to flaky tests label Dec 12, 2022
@arssher (Contributor)

arssher commented Dec 12, 2022

@knizhnik (Contributor)

Looks like the problem is that RecordTransactionAbort(bool isSubXact) does not perform XLogFlush:

 * We do not flush XLOG to disk here, since the default assumption after a
 * crash would be that we aborted, anyway.  For the same reason, we don't
 * need to worry about interlocking against checkpoint start.

So it is entirely possible that the last aborted transactions never reach the pageserver.
I do not think this will cause any real problems, except rare test failures because of a pg_xact file mismatch.
I see the following ways of addressing this problem:

  1. Somehow force Postgres to flush WAL on a termination request. This requires not only calling XLogFlush but also waiting until the walsender has propagated these changes.
  2. Disable the check entirely: do not compare the restored and original directories.
  3. Do not compare pg_xact segments.
  4. Smart comparison of pg_xact segments (for example, ignore aborted transactions, or just the last aborted transaction).

It is not good to introduce changes in Postgres core just to make our tests pass, so I do not like option 1.
But disabling or complicating this check is not an exciting proposal either.
Any better suggestions?
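Option 4 could look roughly like the following. This is a hypothetical sketch, not fixture code; it assumes pg_xact's CLOG encoding of two status bits per transaction (0b00 in progress, 0b01 committed, 0b10 aborted) and treats aborted-vs-in-progress as the only tolerated difference:

```python
# Sketch of option 4: a "smart" comparison of pg_xact segments that
# tolerates transactions recorded as aborted on one side but still
# in-progress (never flushed) on the other. pg_xact stores 2 status
# bits per xact: 0b00 = in progress, 0b01 = committed, 0b10 = aborted.
XACT_IN_PROGRESS = 0b00
XACT_ABORTED = 0b10

def segments_equivalent(local: bytes, restored: bytes) -> bool:
    if len(local) != len(restored):
        return False
    for a, b in zip(local, restored):
        if a == b:
            continue
        # Inspect the four 2-bit status fields in this byte.
        for shift in (0, 2, 4, 6):
            sa = (a >> shift) & 0b11
            sb = (b >> shift) & 0b11
            if sa == sb:
                continue
            # The only tolerated difference: aborted vs in-progress.
            if {sa, sb} != {XACT_IN_PROGRESS, XACT_ABORTED}:
                return False
    return True

# An aborted-only difference is ignored; a committed-vs-in-progress
# difference is a real mismatch.
a = bytes([0b01_10_01_01])  # one aborted xact among committed ones
b = bytes([0b01_00_01_01])  # same byte, but that xact never flushed
print(segments_equivalent(a, b))                          # → True
print(segments_equivalent(bytes([0b01]), bytes([0b00])))  # → False
```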

@arssher (Contributor)

arssher commented Dec 27, 2022

We can do option 1, but issue the flush manually from Python code; we already have neon_xlogflush in neontest.c for exactly this purpose. And yes, we'd need to add a wait for propagation. Or does this aborted xact emerge immediately before/during shutdown? Do you have an idea which xact this is?

If this turns out not to be enough, I'd probably just skip the pg_xact comparison.

@arssher (Contributor)

arssher commented Jun 12, 2023

And again:
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-4468/5244173359/index.html#suites/e90f9f55d45ab2a087333a860583a7c3/750933fb3b3439ce

@knizhnik do you have an idea why it happens in test_multixact? We can add flushing of WAL, but I don't understand offhand why it would help here.

@koivunej (Member)

koivunej commented Aug 4, 2023

@awestover (Contributor)

awestover commented Aug 4, 2023

> We can do "Somehow force Postgres to flush WAL on termination request. But it is not only necessary to call XLogFlush, but also wait until walsender will be able to propagate this changes", but issue fsync manually from python code; we already have neon_xlogflush in neontest.c for exactly this purpose. And yes, we'd need to add wait for propagation. Or this aborted xact emerges immediately before/during shutdown? Do you have an idea what is an xact this is?
> If this turns out to be not enough, I'd probably just skip pg_xact comparison.

I'm going to see if I can figure this out

@awestover (Contributor)

awestover commented Aug 4, 2023

FYI: if you want to iterate faster and hopefully uncover the flakiness sooner, you can comment out all but the relevant test in neon/vendor/postgresv14/src/test/regress/parallel_schedule, as follows:

suggested parallel_schedule
# ----------
# src/test/regress/parallel_schedule
#
# By convention, we put no more than twenty tests in any one parallel group;
# this limits the number of connections needed to run the tests.
# ----------

# # run tablespace by itself, and first, because it forces a checkpoint;
# # we'd prefer not to have checkpoints later in the tests because that
# # interferes with crash-recovery testing.
# test: tablespace

# # ----------
# # The first group of parallel tests
# # ----------
# test: boolean char name varchar text int2 int4 int8 oid float4 float8 bit numeric txid uuid enum money rangetypes pg_lsn regproc

# # ----------
# # The second group of parallel tests
# # strings depends on char, varchar and text
# # numerology depends on int2, int4, int8, float4, float8
# # multirangetypes depends on rangetypes
# # multirangetypes shouldn't run concurrently with type_sanity
# # ----------
# test: strings numerology point lseg line box path polygon circle date time timetz timestamp timestamptz interval inet macaddr macaddr8 multirangetypes create_function_0

# # ----------
# # Another group of parallel tests
# # geometry depends on point, lseg, box, path, polygon and circle
# # horology depends on interval, timetz, timestamp, timestamptz
# # opr_sanity depends on create_function_0
# # ----------
# test: geometry horology tstypes regex type_sanity opr_sanity misc_sanity comments expressions unicode xid mvcc

# # ----------
# # These four each depend on the previous one
# # ----------
# test: create_function_1
# test: create_type
# test: create_table
# test: create_function_2

# # ----------
# # Load huge amounts of data
# # We should split the data files into single files and then
# # execute two copy tests parallel, to check that copy itself
# # is concurrent safe.
# # ----------
# test: copy copyselect copydml insert insert_conflict

# # ----------
# # More groups of parallel tests
# # ----------
# test: create_misc create_operator create_procedure create_schema
# # These depend on create_misc and create_operator
# test: create_index create_index_spgist create_view index_including index_including_gist

# # ----------
# # Another group of parallel tests
# # ----------
# test: create_aggregate create_function_3 create_cast constraints triggers select inherit typed_table vacuum drop_if_exists updatable_views roleattributes create_am hash_func errors infinite_recurse

# # ----------
# # sanity_check does a vacuum, affecting the sort order of SELECT *
# # results. So it should not run parallel to other tests.
# # ----------
# test: sanity_check

# # ----------
# # Another group of parallel tests
# # Note: the ignore: line does not run random, just mark it as ignorable
# # ----------
# ignore: random
# test: select_into select_distinct select_distinct_on select_implicit select_having subselect union case join aggregates transactions random portals arrays btree_index hash_index update delete namespace 
test: prepared_xacts

# # ----------
# # Another group of parallel tests
# # ----------
# test: brin gin gist spgist privileges init_privs security_label collate matview lock replica_identity rowsecurity object_address tablesample groupingsets drop_operator password identity generated join_hash

# # ----------
# # Additional BRIN tests
# # ----------
# test: brin_bloom brin_multi

# # ----------
# # Another group of parallel tests
# # ----------
# test: create_table_like alter_generic alter_operator misc async dbsize misc_functions sysviews tsrf tid tidscan tidrangescan collate.icu.utf8 incremental_sort

# # rules cannot run concurrently with any test that creates
# # a view or rule in the public schema
# # collate.*.utf8 tests cannot be run in parallel with each other
# test: rules psql psql_crosstab amutils stats_ext collate.linux.utf8

# # run by itself so it can run parallel workers
# test: select_parallel
# test: write_parallel
# test: vacuum_parallel

# # no relation related tests can be put in this group
# test: publication subscription

# # ----------
# # Another group of parallel tests
# # ----------
# test: select_views portals_p2 foreign_key cluster dependency guc bitmapops combocid tsearch tsdicts foreign_data window xmlmap functional_deps advisory_lock indirect_toast equivclass

# # ----------
# # Another group of parallel tests (JSON related)
# # ----------
# test: json jsonb json_encoding jsonpath jsonpath_encoding jsonb_jsonpath

# # ----------
# # Another group of parallel tests
# # NB: temp.sql does a reconnect which transiently uses 2 connections,
# # so keep this parallel group to at most 19 tests
# # ----------
# test: plancache limit plpgsql copy2 temp domain rangefuncs prepare conversion truncate alter_table sequence polymorphism rowtypes returning largeobject with xml

# # ----------
# # Another group of parallel tests
# # ----------
# test: partition_join partition_prune reloptions hash_part indexing partition_aggregate partition_info tuplesort explain compression memoize

# # event triggers cannot run concurrently with any test that runs DDL
# # oidjoins is read-only, though, and should run late for best coverage
# test: event_trigger oidjoins
# # this test also uses event triggers, so likewise run it by itself
# test: fast_default

# # run stats by itself because its delay may be insufficient under heavy load
# test: stats

koivunej added a commit that referenced this issue Aug 10, 2023
`pg_regress` is flaky: #559

Consolidated `CHECKPOINT` into `check_restored_datadir_content` and added a
wait for `wait_for_last_flush_lsn`.

Some recently introduced flakiness was fixed with #4948.

---------

Co-authored-by: Joonas Koivunen <joonas@neon.tech>
@koivunej (Member)

koivunej commented Sep 8, 2023

@jcsp (Collaborator)

jcsp commented Feb 7, 2024

@jcsp jcsp added t/bug Issue Type: Bug c/storage/pageserver Component: storage: pageserver labels Feb 7, 2024
@jcsp (Collaborator)

jcsp commented Feb 7, 2024

Looking at the current situation:

  • We call CHECKPOINT in postgres before flushing to the pageserver, which is intended to make sure that the on-disk state is fully written, even if postgres wouldn't usually block on it (e.g. in the incomplete-transaction case).
  • The test ignores the presence/absence of pg_xact files, but if a file is present in both the on-disk state and the restored state, it is still compared.
  • The test is failing because the files are different.

hlinnaka added a commit that referenced this issue Feb 7, 2024
This test occasionally fails with a difference in "pg_xact/0000" file
between the local and restored datadirs. My hypothesis is that
something changed in the database between the last explicit checkpoint
and the shutdown. I suspect autovacuum, it could certainly create
transactions.

To fix, be more precise about the point in time that we compare. Shut
down the endpoint first, then read the last LSN (i.e. the shutdown
checkpoint's LSN), from the local disk with pg_controldata. And use
exactly that LSN in the basebackup.

Closes #559
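The pg_controldata part of that fix could be sketched like this (hypothetical helper; the sample output is abbreviated, but the field name matches pg_controldata's real output format):

```python
# Sketch of the approach in the commit message above: after shutting down
# the endpoint, extract the shutdown checkpoint's LSN from pg_controldata
# output and request the basebackup at exactly that LSN.
import re

SAMPLE_CONTROLDATA = """\
pg_control version number:            1300
Database cluster state:               shut down
Latest checkpoint location:           0/169C3C8
Latest checkpoint's REDO location:    0/169C3C8
"""

def last_checkpoint_lsn(controldata_output: str) -> str:
    """Parse the 'Latest checkpoint location' LSN out of pg_controldata output."""
    m = re.search(r"^Latest checkpoint location:\s*([0-9A-F]+/[0-9A-F]+)",
                  controldata_output, re.MULTILINE)
    if m is None:
        raise ValueError("no checkpoint location in pg_controldata output")
    return m.group(1)

print(last_checkpoint_lsn(SAMPLE_CONTROLDATA))  # → 0/169C3C8
```

In the real test this string would come from running the pg_controldata binary against the stopped endpoint's data directory, and the resulting LSN would be passed to the basebackup request.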
@hlinnaka (Contributor)

hlinnaka commented Feb 7, 2024

This is still failing occasionally https://neon-github-public-dev.s3.amazonaws.com/reports/pr-6659/7814559995/index.html#/testresult/23e61beb205b42f6

Hmm, the test is supposed to write a diff of the file to a .filediff file, which should then be included as an attachment in the Allure report, but I don't see it there. Weird. When I simulate the failure locally by modifying the file, it does create a .filediff file, and the allure_attach_from_dir() function does attach it. I wonder what's going on there?

hlinnaka added a commit that referenced this issue Feb 9, 2024
This test occasionally fails with a difference in "pg_xact/0000" file
between the local and restored datadirs. My hypothesis is that something
changed in the database between the last explicit checkpoint and the
shutdown. I suspect autovacuum, it could certainly create transactions.

To fix, be more precise about the point in time that we compare. Shut
down the endpoint first, then read the last LSN (i.e. the shutdown
checkpoint's LSN), from the local disk with pg_controldata. And use
exactly that LSN in the basebackup.

Closes #559.

I'm proposing this as an alternative to
#6662.
@hlinnaka (Contributor)

hlinnaka commented Feb 9, 2024

This was hopefully fixed by #6666. If it still recurs, please reopen.

@jcsp (Collaborator)

jcsp commented Feb 9, 2024

https://neon-github-public-dev.s3.amazonaws.com/reports/pr-6647/7843577515/index.html#suites/158be07438eb5188d40b466b6acfaeb3/526276cf23cb8ed5

The modified basebackup invocation that uses an explicit LSN is leading to a timeout waiting for that LSN.
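The timeout mechanics are presumably of this shape (hypothetical helper names; the real wait happens inside the pageserver): basebackup at an explicit LSN has to wait until ingested WAL reaches that LSN, and if the shutdown checkpoint record never arrives, the wait can only end by timing out.

```python
# Sketch of an LSN wait loop with a timeout. If the WAL stream stalls
# short of the requested LSN, the loop never sees the target and the
# caller observes a timeout.
import time

def wait_for_lsn(get_last_record_lsn, target_lsn: int, timeout_s: float) -> bool:
    """Poll until the ingested LSN reaches target_lsn or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if get_last_record_lsn() >= target_lsn:
            return True
        time.sleep(0.01)
    return False

print(wait_for_lsn(lambda: 250, target_lsn=200, timeout_s=0.05))  # → True
print(wait_for_lsn(lambda: 100, target_lsn=200, timeout_s=0.05))  # → False
```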

@jcsp (Collaborator)

jcsp commented Feb 13, 2024

This seems to be failing more frequently since #6666. It is the most frequently failing flaky test over the last 4 days.

The next proposed change is #6712

arssher pushed a commit that referenced this issue Feb 15, 2024
This test occasionally fails with a difference in "pg_xact/0000" file
between the local and restored datadirs. My hypothesis is that
something changed in the database between the last explicit checkpoint
and the shutdown. I suspect autovacuum, it could certainly create
transactions.

To fix, be more precise about the point in time that we compare. Shut
down the endpoint first, then read the last LSN (i.e. the shutdown
checkpoint's LSN), from the local disk with pg_controldata. And use
exactly that LSN in the basebackup.

Closes #559
arssher pushed a commit that referenced this issue Feb 15, 2024
arssher pushed a commit that referenced this issue Mar 11, 2024