Add network integration test for quick post sapling sync testing #1229

Merged · 21 commits · Nov 24, 2020

Conversation

yaahc (Contributor) commented Oct 28, 2020

Motivation

This change implements the second of three tests described in the basic network integration testing RFC. It is meant to be a quick check of sync behavior immediately after checkpointing ends.

Solution

This test is implemented by persisting a copy of the database that has been synced to the Sapling checkpoint height. That starting point is then used to sync the next few thousand blocks immediately after Sapling activation.

Related Issues

Unresolved Questions

This design is currently an incomplete sketch of just the initial pieces of the final integration test: syncing to a certain point, backing up the database, and starting from the backup if one already exists. This doesn't help us in CI as it is currently implemented: GitHub Actions and Google Cloud Build both have little to no support for persistent data on test runners, so we would have to copy the entire database over the network before each test run. There are a couple of solutions that I'm currently investigating:

VM + Custom Job Queue

Set up a persistent VM to run these tests, plus some basic CI infrastructure that can dispatch and queue test jobs to that VM. The VM would own the persisted copy of the up-to-Sapling database. Test jobs would be sent from GitHub Actions and/or Google Cloud Build, telling the remote VM to download the precompiled zebrad image and run the test. Once the test completes, the VM would reply to the runner that requested the job, passing back the test results and letting its process exit.

Custom FUSE

Alternatively, we might be able to set up a simple FUSE filesystem that we can mount in our GitHub Actions or Google Cloud Build containers. The FUSE would implement only the minimum API needed to represent the single database file used by sled, with copy-on-write semantics: if sled reads a block that hasn't been updated locally, that block is read from the database file over the network; if sled writes to an existing block or creates a new one, the new block data is stored in memory or in another local file, rather than being written back to the database file on the network.

Edit: I've been doing more research on this option. It looks like all we'd need is a web server to store the synced sled database files; then we could read pieces of those files with HTTP range requests (https://developer.mozilla.org/en-US/docs/Web/HTTP/Range_requests). We'd also have a binary we run in CI to mount the FUSE in the temp dir. Based on which parts of the database file it needs to read, it would either use the locally cached data (from previous writes) or do an HTTP fetch to get the needed piece of the database over the network for first-time reads of old data.

teor2345 (Contributor) commented:
I'd like to document the alternatives we've considered, so we retain this knowledge for when we modify tests in future.

Can we write a narrowly-scoped design RFC, focusing on restartable tests with cached state?
@yaahc or @dconnolly, is that something we could do?
Would you like me to help write it up?

(I don't have all the context here, so I'd need pointers to tickets, PRs, or other discussions.)

@teor2345 teor2345 mentioned this pull request Oct 28, 2020
dconnolly (Contributor) commented:

All of this is infra, work, and setup that can be handled for us by our cloud provider.

dconnolly left a review comment:

These currently time out.

(screenshot)
yaahc (Contributor, Author) commented Oct 28, 2020

> These currently time out.

That's to be expected. This was just me trying to set something up that would download the chain, but that takes around 8 hours, so it's definitely not going to work from within a GitHub Actions job without the cached data.

dconnolly (Contributor) commented:

> These currently time out.
>
> That's to be expected. This was just me trying to set something up that would download the chain, but that takes around 8 hours, so it's definitely not going to work from within a GitHub Actions job without the cached data.

Right, so this should be behind a flag/environment variable.

teor2345 (Contributor) commented Oct 29, 2020

> These currently time out.
>
> That's to be expected. This was just me trying to set something up that would download the chain, but that takes around 8 hours, so it's definitely not going to work from within a GitHub Actions job without the cached data.
>
> Right, so this should be behind a flag/environment variable.

We're currently using #[ignore] for off-by-default tests, and then CI runs all ignored tests.
We also have a ZEBRA_SKIP_NETWORK_TESTS env var that disables all network tests.

So I think another env var that enables the cached data tests would be a good idea.
Let's log a similar message to ZEBRA_SKIP_NETWORK_TESTS.
(Unfortunately there's no way to change a Rust test result to "skipped" from inside the test function.)

Can we commit a zebrad config file for the initial cached data setup?
Then we can run zebrad with that config file in the setup scripts.
We'll need a config file for mainnet and testnet (they have different Sapling activation heights).
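A committed config pair might look roughly like this. This is a sketch only: the section and field names below are assumptions for illustration and should be checked against the actual zebrad config schema.

```toml
# zebrad-cached-state-mainnet.toml (illustrative; verify field names against
# the real zebrad configuration format)
[network]
network = "Mainnet"

[state]
cache_dir = "/var/cache/zebrad-mainnet"
```

The testnet twin would differ only in `network = "Testnet"` and the cache directory, since the two chains have different Sapling activation heights and separate cached databases.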

There isn't much point in doing the cached data setup inside the acceptance test framework.
It makes the logging harder to see, and it adds complexity to the process.

yaahc (Contributor, Author) commented Oct 30, 2020

> There isn't much point in doing the cached data setup inside the acceptance test framework.
> It makes the logging harder to see, and it adds complexity to the process.

I don't agree on this point. There are advantages to doing it alongside our acceptance tests, or at least within Rust code: we have a fairly well-polished framework for creating tests, interacting with the zebrad daemon, and inspecting its output, and keeping all of our tests in one place and one style reduces the maintenance burden. The issues with logging are not inherent to the test framework, and they're easy to bypass: we just use std::io::stdout / std::io::stderr to write directly to the real file descriptors, rather than println! / eprintln!, which write to the captured stdout/stderr when running tests.

Alternatively, we could move them from tests to binaries or examples, which would let us keep the same structure while taking the test framework's capture of stderr/stdout out of the equation. We'd need to do some code reorganization in that case, though, which is why I'd prefer to just write them as acceptance tests.

> We're currently using #[ignore] for off-by-default tests, and then CI runs all ignored tests.

What tests do we have that we're ignoring but still running in CI? My feeling is that we should only use #[ignore] for tests we never want to run in general and only want to run individually. So if any part of our CI says "run all ignored tests", we should probably change that, rather than using environment variables to selectively ignore other tests.

@yaahc yaahc requested review from dconnolly and teor2345 October 30, 2020 19:45
@yaahc yaahc self-assigned this Nov 11, 2020
@dconnolly dconnolly added this to the First Alpha Release milestone Nov 12, 2020
@dconnolly dconnolly self-assigned this Nov 16, 2020
@dconnolly dconnolly force-pushed the sapling-sync-test branch 2 times, most recently from b5d9438 to 02f109b on November 23, 2020
@dconnolly dconnolly added the A-devops (Pipelines, CI/CD and Dockerfiles), A-infrastructure (Infrastructure changes), and A-rust (Updates to Rust code) labels Nov 23, 2020
@dconnolly dconnolly merged commit 487ee6d into main Nov 24, 2020
@dconnolly dconnolly deleted the sapling-sync-test branch November 24, 2020 16:04