refactor(test): dockerize tests and run sync in detached mode #3459

Merged: 135 commits merged into main from docker-test-refactor on Feb 16, 2022

Conversation

@gustavovalverde (Member) commented Feb 3, 2022

Motivation

  • Inherit changes from refactor(cd): improve Docker and gcloud usage without Cloud Build #3431
  • Improve machine types and disks for CPU-heavy workloads
  • Make overall tests faster
  • Unneeded `*patch.yml` workflows were causing a lot of noise on the GitHub dashboard and in PRs
  • Images were being built on GCP and then tested; this could easily reuse images and cache from GitHub Actions instead

Fixes: #2411 #3344
Partially: #1592 #2653 #2995 #3447

Solution

  • Use C2 machines with SSD
  • Change the state disk to SSD
  • Deploy and run the sync in a single step (on deploy) with gcloud, instead of deploying the machine, SSHing into it, building, running, and waiting
  • Merge state regeneration and the sync test into a single YAML file
  • Use the same Dockerfile for CD and test (another PR will merge both)
  • Use a modular workflow with dependencies, reusing the same image for tests

Review

@teor2345 @conradoplg @dconnolly

Follow Up Work

- Use a more ENV configurable Dockerfile
- Remove cloudbuild dependency
- Use compute optimized machine types
- Use SSD instead of normal hard drives
- Move Sentry endpoint to secrets
- Use a single yml for auto & manual deploy
- Migrate to Google Artifact Registry
Also substitute defaults with parameters sent through the workflow_dispatch
Caching from the latest image is one of the main reasons to add this extra tag. Before this commit, the inline cache was not being used.
The inline cache exporter only supports `min` cache mode. To enable `max` cache mode, push the image and the cache separately by using the registry cache exporter.

This also allows for smaller release images.
We're leveraging the registry for caching, instead of relying on the 10 GB limit of GitHub Actions cache storage.
Use interactive shells for manual and test deployments. This allows greater flexibility if troubleshooting is needed inside the machines.
Instead of SSHing into a VM to build and test, build in GHA (to have the logs available), run the workspace tests in GHA, and run only the sync tests in GCP.

Use a container VM with Zebra's image directly on it, and pass the parameters needed to run the sync past the mandatory checkpoint.
Compiling might be slow because different steps compile the same code 2-4 times due to the variations.

Using the SHA from the base creates confusion, and it does not match the SHA shown and used on GitHub.

We have to keep both, because manual runs with `workflow_dispatch` do not have a PR SHA.
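
The fallback described above can be sketched in Python. This is only an illustration, not the workflow's actual YAML: `pick_commit_sha` and `commit_sha_from_environment` are hypothetical names, and the sketch assumes the job reads the standard `GITHUB_EVENT_PATH` and `GITHUB_SHA` variables that GitHub Actions sets.

```python
import json
import os


def pick_commit_sha(event, github_sha):
    """Prefer the PR head SHA; fall back to the SHA GitHub set for the run."""
    # Manual workflow_dispatch runs carry no "pull_request" object.
    pull_request = event.get("pull_request")
    if pull_request:
        return pull_request["head"]["sha"]
    return github_sha


def commit_sha_from_environment():
    """Read the event payload and SHA that GitHub Actions exposes to a job."""
    with open(os.environ["GITHUB_EVENT_PATH"], encoding="utf-8") as handle:
        event = json.load(handle)
    return pick_commit_sha(event, os.environ["GITHUB_SHA"])
```

Keeping the selection in a pure function makes it easy to check both the pull request case and the manual case without a real runner.
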
There's no impact on this workflow when a change is made to the Dockerfile.
… commands afterwards in a different step.

As the GHA TTY is not working as expected, and the workarounds do not play nicely with `gcloud compute ssh` (see actions/runner#241 (comment)), we decided to get the container name from the logs, log in directly to the container, and run the cargo command from there.
@teor2345 (Contributor) commented:

@gustavovalverde it looks like RAM is very low on the regenerate-state-disks VM. This might be happening because the disk performance is very slow. (Or the disk could be slow due to VM swapping.)

Do you want to try a c2-standard-8 instead?

Here are the important parts of the failed logs:

Zebra starts downloading some new blocks, and checkpoints them:

Feb 14 23:10:06.609  INFO ***net="Main"***: zebrad::commands::start: estimated progress to chain tip sync_percent=6.750 remaining_sync_blocks=1469497
Feb 14 23:10:45.471  INFO ***net="Main"***:sync: zebrad::components::sync: extending tips tips.len=1 in_flight=2000 lookahead_limit=2000 state_tip=Some(Height(106374))
Feb 14 23:10:53.391  INFO ***net="Main"***:sync: zebrad::components::sync: waiting for pending blocks tips.len=1 in_flight=2473 lookahead_limit=2000
Feb 14 23:10:54.072  INFO ***net="Main"***:sync:extend_tips:checkpoint: zebra_consensus::checkpoint: verified checkpoint range block_count=340 current_range=(Excluded(Height(108533)), Included(Height(108873)))

It takes 4 minutes to write 700 blocks to disk:
(Normal speed is tens of thousands of blocks per minute on my machine.)

Feb 14 23:10:59.262  WARN ***net="Main"***:sync: zebrad::components::sync: error downloading and verifying block e=VerifierError(Elapsed(()))
Feb 14 23:10:59.262  INFO ***net="Main"***:sync: zebrad::components::sync: waiting to restart sync timeout=67s state_tip=Some(Height(106374))
Feb 14 23:11:06.610  INFO ***net="Main"***: zebrad::commands::start: estimated progress to chain tip sync_percent=6.750 remaining_sync_blocks=1469497
Feb 14 23:12:06.267  INFO ***net="Main"***:sync: zebrad::components::sync: starting sync, obtaining new tips state_tip=Some(Height(106374))
Feb 14 23:12:06.612  INFO ***net="Main"***: zebrad::commands::start: estimated progress to chain tip sync_percent=6.750 remaining_sync_blocks=1469498
Feb 14 23:13:06.613  INFO ***net="Main"***: zebrad::commands::start: estimated progress to chain tip sync_percent=6.750 remaining_sync_blocks=1469499
Feb 14 23:14:06.614  INFO ***net="Main"***: zebrad::commands::start: estimated progress to chain tip sync_percent=6.789 remaining_sync_blocks=1468878
Feb 14 23:15:06.616  INFO ***net="Main"***: zebrad::commands::start: estimated progress to chain tip sync_percent=6.798 remaining_sync_blocks=1468765
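
A rough way to check the write rate quoted above is to compare `remaining_sync_blocks` between two progress lines. The following is a minimal Python sketch, not part of Zebra or this PR: the function names are hypothetical, and the year 2022 is an assumption taken from the PR dates because the log timestamps omit it.

```python
import re
from datetime import datetime

# Captures the timestamp and the remaining_sync_blocks field of a progress line.
PROGRESS = re.compile(
    r"^(?P<ts>\w{3} \d{1,2} \d{2}:\d{2}:\d{2}\.\d{3}).*"
    r"remaining_sync_blocks=(?P<remaining>\d+)"
)

ASSUMED_YEAR = "2022"  # assumption: the logs do not include a year


def parse_progress(lines):
    """Return (timestamp, remaining_blocks) pairs for lines that report progress."""
    points = []
    for line in lines:
        match = PROGRESS.search(line)
        if match:
            stamp = datetime.strptime(
                ASSUMED_YEAR + " " + match.group("ts"), "%Y %b %d %H:%M:%S.%f"
            )
            points.append((stamp, int(match.group("remaining"))))
    return points


def blocks_per_minute(points):
    """Average rate between the first and last progress points."""
    (start, remaining_start), (end, remaining_end) = points[0], points[-1]
    minutes = (end - start).total_seconds() / 60
    return (remaining_start - remaining_end) / minutes


sample = [
    'Feb 14 23:11:06.610  INFO ***net="Main"***: zebrad::commands::start: estimated progress to chain tip sync_percent=6.750 remaining_sync_blocks=1469497',
    'Feb 14 23:12:06.267  INFO ***net="Main"***:sync: zebrad::components::sync: starting sync, obtaining new tips state_tip=Some(Height(106374))',
    'Feb 14 23:15:06.616  INFO ***net="Main"***: zebrad::commands::start: estimated progress to chain tip sync_percent=6.798 remaining_sync_blocks=1468765',
]
```

For the two progress lines in `sample`, the remaining count drops by 732 blocks in about four minutes, which matches the "700 blocks in 4 minutes" estimate.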

Zebra's Peer Set isn't getting requests, so it seems like it might be the disk that's slow:

Feb 14 23:15:06.850  INFO ***net="Main"***:crawl_and_dial: zebra_network::peer_set::inventory_registry: dropped lagged inventory advertisements count=41

Still writing blocks, 8 minutes to write 1200 blocks:

Feb 14 23:15:14.850  INFO ***net="Main"***:crawl_and_dial: zebra_network::peer_set::candidate_set: timeout waiting for peer service readiness or peer responses
Feb 14 23:16:06.617  INFO ***net="Main"***: zebrad::commands::start: estimated progress to chain tip sync_percent=6.802 remaining_sync_blocks=1468693
Feb 14 23:17:06.618  INFO ***net="Main"***: zebrad::commands::start: estimated progress to chain tip sync_percent=6.816 remaining_sync_blocks=1468453
Feb 14 23:18:06.619  INFO ***net="Main"***: zebrad::commands::start: estimated progress to chain tip sync_percent=6.821 remaining_sync_blocks=1468367
Feb 14 23:19:06.620  INFO ***net="Main"***: zebrad::commands::start: estimated progress to chain tip sync_percent=6.831 remaining_sync_blocks=1468221
Feb 14 23:20:06.621  INFO ***net="Main"***: zebrad::commands::start: estimated progress to chain tip sync_percent=6.839 remaining_sync_blocks=1468090
Feb 14 23:21:06.623  INFO ***net="Main"***: zebrad::commands::start: estimated progress to chain tip sync_percent=6.844 remaining_sync_blocks=1468007
Feb 14 23:22:06.624  INFO ***net="Main"***: zebrad::commands::start: estimated progress to chain tip sync_percent=6.855 remaining_sync_blocks=1467851
Feb 14 23:23:06.625  INFO ***net="Main"***: zebrad::commands::start: estimated progress to chain tip sync_percent=6.881 remaining_sync_blocks=1467450

The network part of the syncer got too far ahead, so it waited for a while:

Feb 14 23:23:29.400  INFO ***net="Main"***:sync:obtain_tips: zebrad::components::sync::downloads: synced block height too far ahead of the tip: dropped downloaded block. Hint: Try increasing the value of the lookahead_limit field in the sync section of the configuration file. hash=block::Hash("000000000039e5dfa40c8213067bcb288479559b595630f38f6eef36caaf261b") block_height=Height(1565894) tip_height=Some(Height(108873)) max_lookahead_height=Height(114873) lookahead_limit=2000
Feb 14 23:23:31.413  WARN ***net="Main"***:sync: zebrad::components::sync: error downloading and verifying block e=AboveLookaheadHeightLimit
Feb 14 23:23:31.413  INFO ***net="Main"***:sync: zebrad::components::sync: waiting to restart sync timeout=67s state_tip=Some(Height(108873))
Feb 14 23:24:06.627  INFO ***net="Main"***: zebrad::commands::start: estimated progress to chain tip sync_percent=6.909 remaining_sync_blocks=1467005
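
The numbers in the dropped-block message can be reproduced with a small Python sketch. This is an inference from that single log line, not Zebra's actual code: the multiplier of 3 is assumed only because 114873 - 108873 equals 3 * 2000, the strict `>` comparison is also an assumption, and the function names are hypothetical.

```python
# Assumption: inferred from one log line (114873 - 108873 == 3 * 2000).
ASSUMED_LOOKAHEAD_MULTIPLIER = 3


def max_lookahead_height(tip_height, lookahead_limit,
                         multiplier=ASSUMED_LOOKAHEAD_MULTIPLIER):
    """Highest block height the syncer would accept ahead of the state tip."""
    return tip_height + multiplier * lookahead_limit


def is_too_far_ahead(block_height, tip_height, lookahead_limit):
    """True when a downloaded block is above the assumed lookahead ceiling."""
    # Assumption: a block exactly at the ceiling is still accepted.
    return block_height > max_lookahead_height(tip_height, lookahead_limit)
```

With the values from the log (`tip_height` 108873, `lookahead_limit` 2000), the block at height 1565894 is far above the ceiling, so it is dropped while the slow state writes catch up.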

https://github.com/ZcashFoundation/zebra/runs/5192084284?check_suite_focus=true#step:9:1286

And then the process was killed for running out of memory.

Happy to jump on a call or chat if that would help diagnose things. We're so close! 🤔

This allows following the logs in the GitHub Actions terminal while the GCP container is still running.

Delete the instance once log following ends, whether it succeeds or fails.
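
The "always delete the instance" guarantee can be sketched in Python with `try`/`finally`. This is an analogy, not the workflow from this PR, which is written in GitHub Actions YAML: the function name is hypothetical, and the instance, zone, and container arguments are placeholders supplied by the caller.

```python
import subprocess


def follow_logs_then_delete(instance, zone, container, run=subprocess.run):
    """Stream container logs over SSH, then delete the VM no matter what."""
    follow = [
        "gcloud", "compute", "ssh", instance, "--zone", zone,
        "--command", f"docker logs --follow {container}",
    ]
    delete = [
        "gcloud", "compute", "instances", "delete", instance,
        "--zone", zone, "--quiet",
    ]
    try:
        # Blocks until the container exits or the SSH session fails.
        run(follow, check=True)
    finally:
        # Runs on success and on failure, so the VM never lingers.
        run(delete, check=True)
```

The `run` parameter exists only so the sketch can be exercised with a stand-in instead of real `gcloud` calls.
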
@gustavovalverde gustavovalverde changed the title refactor(test): overall pipeline improvements refactor(test): dockerize tests and run sync in detached mode with logs stream Feb 15, 2022
@gustavovalverde gustavovalverde changed the title refactor(test): dockerize tests and run sync in detached mode with logs stream refactor(test): dockerize tests and run sync in detached mode Feb 15, 2022

@teor2345 (Contributor) left a comment:

Looks great, let's merge!

mergify bot added a commit that referenced this pull request Feb 15, 2022
@mergify mergify bot merged commit fe2edca into main Feb 16, 2022
@mergify mergify bot deleted the docker-test-refactor branch February 16, 2022 00:54
Labels: A-devops (Area: Pipelines, CI/CD and Dockerfiles)
Projects: None yet
Development: Successfully merging this pull request may close these issues:
  • Check that Google Cloud builds actually run the tests
3 participants