
Optimize migration by adding ability to read or extract payloads from state #5386

Merged
merged 14 commits into master from fxamacker/extract-payloads-from-state-extraction on Feb 22, 2024

Conversation

Member

@fxamacker fxamacker commented Feb 13, 2024

Closes #5231
Added flags to the execution state extraction program:

  • --input-payload-filename

  • --output-payload-filename (if specified without --extract-payloads-by-address, all payloads are exported)

  • --extract-payloads-by-address filters payloads to export for specified addresses (accounts)

The new flags don't affect migration or other existing functionality of the state extraction program. These options only affect the program's input and output.

In other words, this can be used to extract migrated payloads or extract as-is payloads for specified accounts.
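
For context, here is a rough sketch of how flags like these might be registered on the command with cobra/pflag. This is not the PR's exact code; the variable names, help text, and package layout are illustrative only:

```go
package extract

import "github.com/spf13/cobra"

// Illustrative flag variables; the real cmd.go defines its own set.
var (
	flagInputPayloadFileName  string
	flagOutputPayloadFileName string
	flagAddresses             string
)

var Cmd = &cobra.Command{
	Use:   "execution-state-extract",
	Short: "extract execution state (sketch)",
	Run:   func(*cobra.Command, []string) { /* migration + extraction elided */ },
}

func init() {
	Cmd.Flags().StringVar(&flagInputPayloadFileName, "input-payload-filename", "",
		"read payloads from this file instead of checkpoint files")
	Cmd.Flags().StringVar(&flagOutputPayloadFileName, "output-payload-filename", "",
		"write payloads to this file; without --extract-payloads-by-address, all payloads are exported")
	Cmd.Flags().StringVar(&flagAddresses, "extract-payloads-by-address", "",
		"comma-separated addresses whose payloads should be exported")
}
```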

There are 2 use cases:

  • high priority (for Cadence 1.0 migration): save time during development, testing, or troubleshooting by focusing on a small subset of accounts. This PR was opened on Feb 13 after a Feb 12 meeting made this use case a high priority.

  • low priority (for reducing migration duration): produce an intermediate file containing all payloads migrated with atree inlining and Cadence v0.42. Then use that file as input to the Cadence 1.0 migration to reduce total duration of migrations. This use case was confirmed to be low priority, see Feb 14 discord chat.

The 2nd use case became low priority because enabling atree inlining already reduces migration duration on its own. Migrations with atree inlining enabled are faster than do-nothing migrations given the same input files. Atree inlining reduces node counts, payload counts, total data size, etc., making the later stages of migration faster and less memory-intensive.

NOTE: we want all payloads for all accounts to use atree inlining during migration because

  1. It produces maximum benefits (reduced memory, storage, node count, and payload count).
  2. Given the same input file, migration with atree inlining is faster than migration without it, because there are fewer nodes, etc. to build during the later stages of migration; it is likewise faster than a do-nothing migration.

@fxamacker fxamacker added enhancement New feature or request Performance Execution Cadence Execution Team spork improvements Improvements needed for the spork process. labels Feb 13, 2024
@fxamacker fxamacker self-assigned this Feb 13, 2024
Member

@turbolent turbolent left a comment


Nice! Thank you for adding this 🙏

Added two flags to execution state extraction program:

--extract-payloads-by-address produces a file of payloads for
specified accounts or for all accounts instead of checkpoint files

--use-payload-as-input uses payload file as input instead of
checkpoint files

The two new flags don't affect migration and other existing
functionality of state extraction program.  These two options
only affect input and output of state extraction program.

In other words, this can be used to extract migrated payloads
or extract as-is payloads for specified accounts.
@fxamacker fxamacker force-pushed the fxamacker/extract-payloads-from-state-extraction branch from 2383569 to 85527b8 Compare February 13, 2024 17:18
Co-authored-by: Bastian Müller <bastian@turbolent.com>

codecov-commenter commented Feb 13, 2024

Codecov Report

Attention: 220 lines in your changes are missing coverage. Please review.

Comparison is base (27793f4) 55.97% compared to head (3c2d313) 56.02%.

Files Patch % Lines
...execution-state-extract/execution_state_extract.go 47.55% 66 Missing and 9 partials ⚠️
cmd/util/cmd/extract-payloads-by-address/cmd.go 59.01% 50 Missing and 25 partials ⚠️
cmd/util/ledger/util/payload_file.go 64.49% 33 Missing and 16 partials ⚠️
cmd/util/cmd/execution-state-extract/cmd.go 82.20% 14 Missing and 7 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5386      +/-   ##
==========================================
+ Coverage   55.97%   56.02%   +0.04%     
==========================================
  Files        1024     1026       +2     
  Lines       99327    99859     +532     
==========================================
+ Hits        55602    55943     +341     
- Misses      39483    39617     +134     
- Partials     4242     4299      +57     
Flag Coverage Δ
unittests 56.02% <62.19%> (+0.04%) ⬆️


Refactored payload file related functionality to be more reusable.

Added flags:
--input-payload-filename
--output-payload-filename
This utility can be used to create a subset of execution state
which can save time during development, testing, and
support/troubleshooting.
@fxamacker fxamacker enabled auto-merge February 15, 2024 17:44
@fxamacker fxamacker added this pull request to the merge queue Feb 15, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Feb 15, 2024
@fxamacker fxamacker added this pull request to the merge queue Feb 15, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Feb 15, 2024
Member

@zhangchiqing zhangchiqing left a comment


Nice optimization. I left a few comments about data validation to discuss.

return fmt.Errorf("cannot generate payloads file: %w", err)
}

log.Info().Msgf("Exported %d payloads out of %d payloads", exportedPayloadCount, len(payloads))
Member

We break the state extraction into 2 steps: the first step outputs all payloads into a file after the first migration, and the second step reads all the payloads from that file and continues running the remaining migrations and state extraction.

I wonder how we can make the process verifiable in case someone is trying to reproduce the same process and wants to validate each intermediate step? Basically we need a way to guarantee and prove that the payloads did not change when reading them back and continuing with the remaining migrations and state extraction.

One idea I had is to compute some hash (or checksum) of all the payloads before writing them to disk, and validate it after reading and decoding the payloads.

Thoughts?

Member Author

> I wonder how we can make the process verifiable in case someone is trying to reproduce the same process and wants to validate each intermediate step? Basically we need a way to guarantee and prove that the payloads did not change when reading them back and continuing with the remaining migrations and state extraction.

Making this verifiable "in case someone is trying to reproduce the same process" is not the primary use case now (this PR is to help speed up development/testing) but it makes sense to add a hash for other use cases. 👍

> One idea I had is to compute some hash (or checksum) of all the payloads before writing them to disk, and validate it after reading and decoding the payloads.
>
> Thoughts?

Due to parallelism, the payloads within the output file may not be in the same sequence each time this program is used. In practice, the payload file would still produce the same checkpoint when used as input (as long as the sequence of payloads inside is the only thing that changed and the contents of all payloads are unchanged).

I'll merge this PR and open a separate PR to address the "reproduce the same process" use case by:

  • making the payloads have a deterministic sequence in the output payload file
  • computing a hash over all the payloads and adding it to the payload file (see the sketch below)
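
A minimal sketch of that follow-up idea (a hypothetical helper, not this PR's code): sort payloads by key so that the parallel, nondeterministic extraction order doesn't matter, then hash length-prefixed key/value pairs. The writer would store the digest in the payload file and the reader would recompute and compare it after decoding.

```go
package payloadhash

import (
	"bytes"
	"crypto/sha256"
	"encoding/binary"
	"sort"
)

// Payload is a simplified stand-in for the ledger payload type,
// used only to illustrate the checksum idea.
type Payload struct {
	Key   []byte
	Value []byte
}

// payloadsChecksum returns a digest that is independent of the order in which
// payloads were produced: it sorts by key, then hashes length-prefixed
// key/value pairs.
func payloadsChecksum(payloads []Payload) [sha256.Size]byte {
	sorted := make([]Payload, len(payloads))
	copy(sorted, payloads)
	sort.Slice(sorted, func(i, j int) bool {
		return bytes.Compare(sorted[i].Key, sorted[j].Key) < 0
	})

	h := sha256.New()
	var lenBuf [8]byte
	for _, p := range sorted {
		binary.BigEndian.PutUint64(lenBuf[:], uint64(len(p.Key)))
		h.Write(lenBuf[:])
		h.Write(p.Key)
		binary.BigEndian.PutUint64(lenBuf[:], uint64(len(p.Value)))
		h.Write(lenBuf[:])
		h.Write(p.Value)
	}

	var sum [sha256.Size]byte
	copy(sum[:], h.Sum(nil))
	return sum
}
```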

require.NoError(t, err)
require.Equal(t, len(payloads), numOfPayloadWritten)

payloadsFromFile, err := util.ReadPayloadFile(zerolog.Nop(), payloadFileName)
Member

@zhangchiqing zhangchiqing Feb 15, 2024


I think this test case assumes the util that reads the payload file is the same as the util that produces it.

This test case is necessary; however, in the real use case we will be using two utils of different versions, and I think it's necessary to verify that too.

One idea is to commit a payload file fixture produced by the previous version, and add a test case to read the payload file and verify it with the current version. Thoughts?
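
A rough sketch of that fixture idea (the fixture path is hypothetical; the ReadPayloadFile call mirrors the test snippet above, and the import paths are assumed from the file layout shown in the coverage report):

```go
package util_test

import (
	"testing"

	"github.com/rs/zerolog"
	"github.com/stretchr/testify/require"

	"github.com/onflow/flow-go/cmd/util/ledger/util"
)

func TestReadPayloadFileFixture(t *testing.T) {
	// Fixture committed to the repo, written by a previous version of the
	// payload file writer (hypothetical path).
	const fixturePath = "testdata/payloads_v1.fixture"

	// The current reader must still understand files written by older versions.
	payloads, err := util.ReadPayloadFile(zerolog.Nop(), fixturePath)
	require.NoError(t, err)

	// Compare against known expected payloads (elided here).
	require.NotEmpty(t, payloads)
}
```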

Member Author

> One idea is to commit a payload file fixture produced by the previous version, and add a test case to read the payload file and verify it with the current version. Thoughts?

There is only one version of the payload file. Not sure what is meant by "a payload file fixture produced by previous version".

If you are referring to the payload encoding version payloadEncodingVersion, we only use payloadEncodingVersion = 1; no other version is supported.

@@ -78,6 +109,18 @@ func run(*cobra.Command, []string) {
return
}

if len(flagBlockHash) == 0 && len(flagStateCommitment) == 0 && len(flagInputPayloadFileName) == 0 {
Member

I think we should remove this check, and the block hash and state commitment should always be verified.

Because even though the check was performed when the payload file was generated, the payload itself does not include any information about the previously performed checks, and we are skipping them here.

Member Author

> I think we should remove this check, and the block hash and state commitment should always be verified.

Maybe we should keep this check because it ensures that the user provides one of these:

  • either flagBlockHash or flagStateCommitment for checkpoint file input
  • flagInputPayloadFileName for payload file input

If none of these is provided, the program has no input to extract state from, so it should exit early.
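
Roughly, the check being discussed could live in a small helper like this (a sketch, not the PR's code; the flag spellings in the error message are guessed from the variable names):

```go
package extract

import "fmt"

// validateInputSource sketches the early-exit check discussed above: the
// state extraction program needs at least one input source, either a block
// hash or state commitment (checkpoint-file input) or an input payload file.
func validateInputSource(blockHash, stateCommitment, inputPayloadFileName string) error {
	if blockHash == "" && stateCommitment == "" && inputPayloadFileName == "" {
		return fmt.Errorf(
			"either --block-hash, --state-commitment, or --input-payload-filename must be provided")
	}
	return nil
}
```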

if len(exportedAddresses) == 0 {
	outputMsg = fmt.Sprintf("exporting all payloads to %s", flagOutputPayloadFileName)
} else {
	outputMsg = fmt.Sprintf(
Member

Now an output payload file can possibly include only a subset of all the addresses, which means a payload file can contain only a subset of the world state. This is useful for debugging, but it also introduces a chance of a wrong payload file being used for state extraction, one that only includes a subset of the world state.

Yes, we have the payload count encoded in the payload file, but it's only used for validating file integrity, not for ensuring the payloads cover the full world state.

Should we add something to the payload file to ensure that a payload file generated from a subset of the world state will never be used for state extraction?

Thoughts?

Member Author

> This is useful for debugging, but it also introduces a chance of a wrong payload file being used for state extraction, one that only includes a subset of the world state.

Great point! 👍

I'll modify the payload file format to include a flag indicating whether the payloads represent partial or full state. I'll also add a flag to the state extraction program to explicitly allow partial state as input, so we reduce the chance of user error.

These changes will be in a separate PR, combined with the payload file hash mentioned earlier.
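
A sketch of what that follow-up could look like (the format details and flag name are hypothetical, not necessarily what eventually shipped):

```go
package payloadfile

import "fmt"

// fileHeader sketches the extra metadata discussed above: a marker recording
// whether the file holds the full world state or only selected accounts.
type fileHeader struct {
	PartialState bool
	PayloadCount uint64
}

// checkPartialStateAllowed would run in the state extraction program before
// using a payload file as input. allowPartial would correspond to a
// hypothetical --allow-partial-state-from-payload-file flag.
func checkPartialStateAllowed(h fileHeader, allowPartial bool) error {
	if h.PartialState && !allowPartial {
		return fmt.Errorf(
			"payload file contains partial state; pass --allow-partial-state-from-payload-file to use it as input")
	}
	return nil
}
```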

Co-authored-by: Leo Zhang <zhangchiqing@gmail.com>
@fxamacker fxamacker added this pull request to the merge queue Feb 21, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Feb 21, 2024
@fxamacker fxamacker enabled auto-merge February 22, 2024 00:04
@fxamacker fxamacker added this pull request to the merge queue Feb 22, 2024
Merged via the queue into master with commit cb9a2f3 Feb 22, 2024
49 of 51 checks passed
@fxamacker fxamacker deleted the fxamacker/extract-payloads-from-state-extraction branch February 22, 2024 00:45
@turbolent
Member

👏👌
