Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Archive config.txt after loading from disk #18270

Open
5 of 15 tasks
mhess-swl opened this issue Mar 11, 2025 · 2 comments · May be fixed by #18427
Open
5 of 15 tasks

Archive config.txt after loading from disk #18270

mhess-swl opened this issue Mar 11, 2025 · 2 comments · May be fixed by #18427
Assignees
Milestone

Comments

@mhess-swl
Copy link
Contributor

mhess-swl commented Mar 11, 2025

Background

A specific sequence of events causes a restarting/reconnecting node to ISS:

  • Set addressBook.force=true in the config
  • Start the network; process transactions across a 15min boundary
  • Restart a node

The restarting node, which finds a config.txt present in its working directory, will generate synthetic node "overrides" and write them to state. The resulting node metadata is the same, but the entity counter is increased by the number of nodes in the network; thus the resulting state on the restarted node, for the round immediately following the loaded state, will have a different entity count and hash, resulting in an ISS.

We need to fix this in release 0.60+ by archiving the config.txt file as with the other network .json files after loading the network from disk during startup; an absent config.txt file will not trigger the extra overrides to be written to state.

Acceptance Criteria

  1. Test is written simulating the bug conditions
  2. Written test is passing
  3. @tomzhenghedera's restart/reconnect tests on engnet1 don't result in an ISS

Dependencies

None

Definition of Ready (DoR) Checklist

  • Clear acceptance criteria
  • Clear and detailed description
  • Dependencies identified
  • Links to documentation
  • Should be completable in 2-3 Days
  • Initial draft of Low-level design document
  • At least high level test plan
  • Groomed/Estimated

Definition of Done (DoD) Checklist

  • Acceptance Criteria complete
  • No Codacy issues greater than minor (in new code)
  • JavaDocs updated/created
  • Code commented
  • Unit tests created/updated
  • 80% test code coverage (in new code)
  • Happy Path and major negative cases in HAPI tests as applicable
@mhess-swl mhess-swl self-assigned this Mar 11, 2025
@mhess-swl mhess-swl added Bug An error that causes the feature to behave differently than what was expected based on design. and removed Bug An error that causes the feature to behave differently than what was expected based on design. labels Mar 11, 2025
@mhess-swl mhess-swl added this to the v0.60 milestone Mar 11, 2025
@mhess-swl
Copy link
Contributor Author

Archiving config.txt in this case doesn't actually fix this problem.

When a node calls AddressBookTransplantSchema#restart during startup migration, it uses the migration context to get the startup override network. If forceUseOfConfigAddressBook is enabled–which is true in current engnet tests–the override network will return the contents of config.txt as a network instead of an empty override. However, we can see in the ISS state differences that the state already has all needed node definitions at this point; therefore we're not really creating any new entities, yet we still increment the entity counter.

If we do archive config.txt then RosterTransplantSchema.super.restart(ctx, rosterStoreFactory) returns false because of the absence of config.txt (as the fallback override), ultimately resulting in an exception during the restart migration.

We can solve this issue instead by not inserting any "new" node definitions from config.txt in AddressBookTransplantSchema#restart. This can be a stopgap measure until we retire config.txt as a network fallback completely.

Finally, note that this issue shouldn't affect the startup sequence of networks where forceUseOfConfigAddressBook is not enabled.

@mhess-swl
Copy link
Contributor Author

Caused a JRS test to fail.

Potentially irrelevant following #18339

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant