Add network integration test for quick post sapling sync testing #1229

Merged · 21 commits · Nov 24, 2020

Conversation

yaahc (Contributor) commented Oct 28, 2020

Motivation

This change implements the second of three tests described in the basic network integration testing RFC. It is meant to be a quick check of sync behavior immediately after checkpointing ends.

Solution

This test is implemented by persisting a copy of the database that has been synced to the Sapling checkpoint height. That starting point is then used to sync the next few thousand blocks immediately after Sapling activation.

Related Issues

Unresolved Questions

This design is currently an incomplete sketch of just the initial pieces of the final integration test: syncing to a certain point, backing up the database, and starting from the backup if one already exists. This doesn't help us in CI as it is currently implemented: GitHub Actions and Google Cloud Build both have little to no support for persistent data on test runners, so we would have to copy the entire database over the network before each test run. There are a couple of solutions that I'm currently investigating:

VM + Custom Job Queue

Set up a persistent VM to run these tests, plus some basic CI infrastructure that can dispatch and queue test jobs to that VM. The VM would own the persisted copy of the up-to-Sapling database. Test jobs would be sent from GitHub Actions and/or Google Cloud Build, telling the remote VM to download the precompiled zebrad image and run the test. Once the test completes, the VM would reply to the runner that requested the job, passing back the test results and letting its process exit.

Custom FUSE

Alternatively, we might be able to set up a simple FUSE filesystem that we can mount in our GitHub Actions or Google Cloud Build containers. The FUSE would implement only the minimum API needed to represent the single database file used by sled, with copy-on-write semantics: if sled reads a block that hasn't been updated locally, that block is read from the database file over the network; if sled writes to an existing block or creates a new one, the new block data is stored in memory or in another local file, rather than being written back to the database file on the network.

Edit: I've been doing more research on this option. It looks like all we'd need is a web server to store the synced sled database files; then we could read pieces of those files with HTTP range requests (https://developer.mozilla.org/en-US/docs/Web/HTTP/Range_requests). We'd also have a binary we run in CI to mount the FUSE in the temp dir. Based on which parts of the database file it needs to read, it would either use the locally cached data (from previous writes) or do an HTTP fetch to get the needed piece of the database over the network for first-time reads of old data.

teor2345 (Contributor) commented:
I'd like to document the alternatives we've considered, so we retain this knowledge for when we modify tests in future.

Can we write a narrowly-scoped design RFC, focusing on restartable tests with cached state?
@yaahc or @dconnolly, is that something we could do?
Would you like me to help write it up?

(I don't have all the context here, so I'd need pointers to tickets, PRs, or other discussions.)

@teor2345 teor2345 mentioned this pull request Oct 28, 2020
dconnolly (Contributor) commented:

All of this is infra, work, and setup that can be handled for us by our cloud provider.

dconnolly left a review comment:

These currently time out.

(screenshot)
yaahc (Contributor, Author) commented Oct 28, 2020

> These currently time out.

That's to be expected. This was just me trying to set something up that would download the chain, but that takes around 8 hours, so it's definitely not going to work from within a GitHub Actions job without the cached data.

dconnolly (Contributor) commented:

> These currently time out.
>
> That's to be expected. This was just me trying to set something up that would download the chain, but that takes around 8 hours, so it's definitely not going to work from within a GitHub Actions job without the cached data.

Right, so this should be behind a flag/environment variable.

teor2345 (Contributor) commented Oct 29, 2020

> These currently time out.
>
> That's to be expected. This was just me trying to set something up that would download the chain, but that takes around 8 hours, so it's definitely not going to work from within a GitHub Actions job without the cached data.
>
> Right, so this should be behind a flag/environment variable.

We're currently using #[ignore] for off-by-default tests, and then CI runs all ignored tests.
We also have a ZEBRA_SKIP_NETWORK_TESTS env var that disables all network tests.

So I think another env var that enables the cached data tests would be a good idea.
Let's log a similar message to ZEBRA_SKIP_NETWORK_TESTS.
(Unfortunately there's no way to change a Rust test result to "skipped" from inside the test function.)

Can we commit a zebrad config file for the initial cached data setup?
Then we can run zebrad with that config file in the setup scripts.
We'll need a config file for mainnet and testnet (they have different Sapling activation heights).
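A committed config pair might look roughly like this. This is a sketch only: the section and field names below are assumptions for illustration and should be checked against the actual zebrad config schema.

```toml
# zebrad-cached-state-mainnet.toml (illustrative; verify field names against
# the real zebrad configuration format)
[network]
network = "Mainnet"

[state]
cache_dir = "/var/cache/zebrad-mainnet"
```

The testnet twin would differ only in `network = "Testnet"` and the cache directory, since the two chains have different Sapling activation heights and separate cached databases.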

There isn't much point in doing the cached data setup inside the acceptance test framework.
It makes the logging harder to see, and it adds complexity to the process.

yaahc (Contributor, Author) commented Oct 30, 2020

> There isn't much point in doing the cached data setup inside the acceptance test framework.
> It makes the logging harder to see, and it adds complexity to the process.

I don't agree on this point. There are advantages to doing it alongside our acceptance tests, or at least within Rust code: we have a fairly well-polished framework for creating tests, interacting with the zebrad daemon, and inspecting its output, and keeping all of our tests in one place and one style reduces the maintenance burden. The issues with logging are not inherent to the test framework, and they're easy to bypass: we just use std::io::stdout / std::io::stderr to write directly to the real file descriptors, rather than println! / eprintln!, which write to the captured stdout/stderr when running tests.

Alternatively, we could move them from tests to binaries or examples, which would let us keep the same structure while taking the test framework's capture of stderr/stdout out of the equation. We'd need to do some code reorganization in that case, though, which is why I'd prefer to just write them as acceptance tests.

> We're currently using #[ignore] for off-by-default tests, and then CI runs all ignored tests.

What tests do we have that we're ignoring but still running in CI? My feeling is that we should only use #[ignore] for tests we never want to run in general and only want to run individually. So if any part of our CI says "run all ignored tests", we should probably change that, rather than using environment variables to selectively ignore other tests.

@yaahc yaahc requested review from dconnolly and teor2345 October 30, 2020 19:45
@yaahc yaahc self-assigned this Nov 11, 2020
@dconnolly dconnolly added this to the First Alpha Release milestone Nov 12, 2020
@dconnolly dconnolly self-assigned this Nov 16, 2020
@dconnolly dconnolly force-pushed the sapling-sync-test branch 2 times, most recently from b5d9438 to 02f109b on November 23, 2020
@dconnolly dconnolly added the A-devops (Pipelines, CI/CD and Dockerfiles), A-infrastructure (Infrastructure changes), and A-rust (Updates to Rust code) labels Nov 23, 2020
@dconnolly dconnolly merged commit 487ee6d into main Nov 24, 2020
@dconnolly dconnolly deleted the sapling-sync-test branch November 24, 2020 16:04