Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ZTS: dnode stress test #15608

Merged
merged 1 commit into from
Dec 11, 2023
Merged

ZTS: dnode stress test #15608

merged 1 commit into from
Dec 11, 2023

Conversation

tonyhutter
Copy link
Contributor

Motivation and Context

Add dnode stress test to test dirty dnode corruption: #15526

Description

Test dirty dnode corruption.

Marking this as WIP for now, as I don't know if the timing on the EC2 instances will trigger the bug. If it triggers the bug, then I will revert the changes to the TEST file before merging.

How Has This Been Tested?

Ran locally on centos 9 VM

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

@tonyhutter tonyhutter changed the title ZTS: dnode stress test WIP: ZTS: dnode stress test Nov 29, 2023
@tonyhutter tonyhutter added Component: Test Suite Indicates an issue with the test framework or a test case Status: Work in Progress Not yet ready for general review labels Nov 29, 2023
@ericloewe
Copy link

The test seems to be failing on FreeBSD due to cp not supporting --reflink or other GNU-style options. Some people have been able to reproduce this with standard FreeBSD cp, though more rarely, so I'd suggest using cp without the extra options when run on FreeBSD.

@robn
Copy link
Member

robn commented Nov 29, 2023

If you're going to use cp, probably want to check its from coreutils 9.x; earlier versions didn't call lseek().

Might want to explicitly set zfs_dmu_offset_next_sync=1, to aid with the timing.

I do wonder if a better way to do this would be with a couple of helper programs: one that creates a file and lobs N random bytes into it, and another that just does lseek(SEEK_DATA) on the file, and looks for ENXIO. Then its not subject to the timing and vagaries of cp, and will work on FreeBSD too.

@tonyhutter
Copy link
Contributor Author

tonyhutter commented Nov 30, 2023

@robn I tried writing a little C program (below) but was unable to reproduce the error. Run the C program inside your pool. Warning: creates lots of test files in your directory.

#include <unistd.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
char *bigbuf;
int *file;

int main (int argc, char **argv)
{
    char filename[255];
    off_t hole_offset;
    int i;
    
    int count;
    int bigbuf_count;

    if (argc != 3) {
        printf("usage ./main file_count file_size_in_bytes\n");
        exit(1);
    };
    
    count = atoi(argv[1]);
    bigbuf_count = atoi(argv[2]);

    pid_t mypid = getpid();
    file = calloc(count, sizeof(int));
    bigbuf = malloc(bigbuf_count);

    /* Make the initial file */
    file[0] = open("/dev/urandom", O_RDONLY);
    read(file[0], bigbuf, bigbuf_count);
    close(file[0]);
    sprintf(filename,"reproducer_%d_%d", mypid, 0);
    file[0] = open(filename, O_RDWR | O_CREAT, 0666);
    write(file[0], bigbuf, bigbuf_count);
    close(file[0]);

    for (i = 0; i < count; i++) {
        sprintf(filename,"reproducer_%d_%d", mypid, i);
        file[i] = open(filename, O_RDONLY, 0666);

        /* Check SEEK_DATA on new file */
        hole_offset = 0;
        hole_offset = lseek(file[i], hole_offset, SEEK_DATA);
        if (hole_offset != 0) {
            printf("offset1 wrong %lu (expected %d) on %s\n", hole_offset, 0, filename);
            exit(1);
        }

        /* Make a copy */
        sprintf(filename,"reproducer_%d_%d", mypid, i + 1);
        file[i+1] = open(filename, O_RDWR | O_CREAT, 0666);
        read(file[i], bigbuf, bigbuf_count);
        write(file[i+1], bigbuf, bigbuf_count);

        /* Check SEEK_HOLE on new file */
        hole_offset = 0;
        hole_offset = lseek(file[i+1], hole_offset, SEEK_HOLE);
        if (hole_offset != bigbuf_count) {
            printf("offset2 wrong %lu (expected %d) on %s\n", hole_offset, bigbuf_count, filename);
            exit(1);
        }
        close(file[i]);
        close(file[i+1]);
    }
    free(file);
    free(bigbuf);

    return 0;
}
gcc -O2 -D_GNU_SOURCE -Wall main.c -o main
./main 1000 1048576

@robn
Copy link
Member

robn commented Dec 1, 2023

@tonyhutter Try this: https://gist.github.com/robn/58dd57196dc7123c65736e10922d4d10

$ seekflood
usage: seekflood <nfiles> <threads>
$ seekflood 500 8
E: 8 seek failures; test failed

Typically hard for me to hit, but the hack I did in #15615 to add a sleep to force the window wider helped show it off.

@tonyhutter
Copy link
Contributor Author

@robn thanks for seekflood! I'm able to successfully reproduce the bug on it with the older ZFS versions as expected. I'll roll it into this PR.

@tonyhutter tonyhutter force-pushed the dnode_stress branch 6 times, most recently from a28bcaf to d49b42e Compare December 5, 2023 21:14
@tonyhutter tonyhutter changed the title WIP: ZTS: dnode stress test ZTS: dnode stress test Dec 5, 2023
@tonyhutter tonyhutter removed the Status: Work in Progress Not yet ready for general review label Dec 5, 2023
@tonyhutter
Copy link
Contributor Author

This is ready for review. I removed my older code and just had the test case call @robn's seekflood.c reproducer, which works well even on the underpowered EC2 instances. This PR deliberately doesn't have the dirty dnode fix so it will reproduce the problem.

As you can see, the test is correcty passing on Centos 7-8, and correctly failing on Centos 9 and Fedora 37-39.

Note, do not pull this in after it's approved. I currently have the TEST file set to only run this test. I will remove the TEST file change before merging.

@tonyhutter tonyhutter added the Status: Code Review Needed Ready for review and testing label Dec 5, 2023
@tonyhutter
Copy link
Contributor Author

@grahamperrin thanks I've pushed your changes.

Copy link
Contributor

@behlendorf behlendorf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.

TEST Outdated Show resolved Hide resolved
tests/zfs-tests/tests/functional/cp_files/cp_stress.ksh Outdated Show resolved Hide resolved
@behlendorf behlendorf removed the Status: Code Review Needed Ready for review and testing label Dec 7, 2023
@behlendorf behlendorf added the Status: Accepted Ready to integrate (reviewed, tested) label Dec 7, 2023
@tonyhutter
Copy link
Contributor Author

Rebased on master

Add a test for the dirty dnode SEEK_HOLE/SEEK_DATA bug described in
openzfs#15526

The bug was fixed in openzfs#15571 and
was backported to 2.2.2 and 2.1.14.  This test case is just to
make sure it does not come back.

seekflood.c originally written by Rob Norris.

Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Copy link
Member

@robn robn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @tonyhutter, for this and all adjacent work here. You kinda kicked this whole mess off ;)

@tonyhutter
Copy link
Contributor Author

This was rebased on master (and thus correctly passes now). I think it's ready to merge.

@behlendorf behlendorf merged commit b1748ea into openzfs:master Dec 11, 2023
24 of 26 checks passed
ixhamza pushed a commit to truenas/zfs that referenced this pull request Jan 30, 2024
Add a test for the dirty dnode SEEK_HOLE/SEEK_DATA bug described in
openzfs#15526

The bug was fixed in openzfs#15571 and
was backported to 2.2.2 and 2.1.14.  This test case is just to
make sure it does not come back.

seekflood.c originally written by Rob Norris.

Reviewed-by: Graham Perrin <grahamperrin@freebsd.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes openzfs#15608
behlendorf pushed a commit that referenced this pull request Feb 13, 2024
Add a test for the dirty dnode SEEK_HOLE/SEEK_DATA bug described in
#15526

The bug was fixed in #15571 and
was backported to 2.2.2 and 2.1.14.  This test case is just to
make sure it does not come back.

seekflood.c originally written by Rob Norris.

Reviewed-by: Graham Perrin <grahamperrin@freebsd.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #15608
lundman pushed a commit to openzfsonwindows/openzfs that referenced this pull request Mar 13, 2024
Add a test for the dirty dnode SEEK_HOLE/SEEK_DATA bug described in
openzfs#15526

The bug was fixed in openzfs#15571 and
was backported to 2.2.2 and 2.1.14.  This test case is just to
make sure it does not come back.

seekflood.c originally written by Rob Norris.

Reviewed-by: Graham Perrin <grahamperrin@freebsd.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes openzfs#15608
lundman pushed a commit to openzfsonwindows/openzfs that referenced this pull request Mar 13, 2024
Add a test for the dirty dnode SEEK_HOLE/SEEK_DATA bug described in
openzfs#15526

The bug was fixed in openzfs#15571 and
was backported to 2.2.2 and 2.1.14.  This test case is just to
make sure it does not come back.

seekflood.c originally written by Rob Norris.

Reviewed-by: Graham Perrin <grahamperrin@freebsd.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes openzfs#15608
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Component: Test Suite Indicates an issue with the test framework or a test case Status: Accepted Ready to integrate (reviewed, tested)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants