Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow not storing ephemeral node hashes #2568

Merged
merged 4 commits into from
May 17, 2022

Conversation

pav-kv
Copy link
Contributor

@pav-kv pav-kv commented Jul 5, 2021

This change adds --tree_ids_with_no_ephemeral_nodes flag to the sequencer,
which allows disabling the writes of ephemeral nodes for a subset of Trillian
trees. The plan is to disable it unconditionally in the next release, but the
flag helps to test the new behaviour safely on a subset of trees.

The stored ephemeral nodes are not used because all the higher level logic uses
compact ranges, operating only with the "perfect" nodes. Since commit df47465 on
Jul 13, 2021 ephemeral nodes are never read from the tree storage.

Not storing ephemeral nodes also has a positive performance effect. Previously,
every sequencing transaction for a typical tree of hundreds of millions entries
would write at least 4 tiles at the tree edge, because it "touched" all the
ephemeral nodes. Now, it will only touch perfect nodes, which for a typical
transaction are in the bottom 1-2 tiles.

Relatedly, now that only non-ephemeral node updates trigger a tile write, every
tile can be written at most 256 times, whereas previously tiles could be written
millions of times (e.g. the root tile of the tree). It led to issues like #1188
which is the result of needing to scan / query millions of rows.

Checklist

@pav-kv pav-kv requested a review from a team as a code owner July 5, 2021 16:58
@pav-kv pav-kv changed the title Do not store and load ephemeral node hashes Do not store/load ephemeral node hashes Jul 5, 2021
@pav-kv pav-kv marked this pull request as draft July 5, 2021 17:08
@pav-kv
Copy link
Contributor Author

pav-kv commented Jul 5, 2021

Checking whether it's legit.

@Martin2112
Copy link
Contributor

This didn't work before compact ranges and the problems only showed up at particular tree sizes. Need to be sure it's well tested.

@Martin2112
Copy link
Contributor

Also when the issue occurred it wasn't always using the latest version, among other changes.

@pav-kv pav-kv force-pushed the rm_ephemeral_nodes branch 2 times, most recently from 6f43e42 to 48e1a2a Compare July 14, 2021 17:02
@pav-kv
Copy link
Contributor Author

pav-kv commented Jul 19, 2021

It's covered by tests already (and they pass). I factored out #2579 which only removes the reading of ephemeral nodes, and the tests pass. Since not reading them is ok, there is no point in writing them, hence this PR.

@pav-kv pav-kv changed the title Do not store/load ephemeral node hashes Do not store ephemeral node hashes Jul 19, 2021
@pav-kv pav-kv marked this pull request as ready for review July 19, 2021 15:52
Copy link
Member

@AlCutter AlCutter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just wondering about release strategy for this set of changes - perhaps this is large enough to merit a bump to 1.4, and maybe there should even be a 1.4.0 with the "don't read ephemeral nodes" change and then a 1.4.1 with this "don't write ephemeral nodes" included so that folks can be happy that they're good on 1.4.0 before trying 1.4.1, because there'll be no roll-back past the "don't read ephemeral nodes" version in case of trouble?

@pav-kv
Copy link
Contributor Author

pav-kv commented Jul 20, 2021

Yes, there can be special effects depending on how this is rolled out (specifically, a new job writes in new format, and an old job tries reading it the old way while being rolled out). I'll leave this unmerged for now and see how far I can get without it. The 2 separate releases SGTM.

@pav-kv
Copy link
Contributor Author

pav-kv commented Apr 8, 2022

I made more research on the effect of this change, and it looks like submitting this PR is backwards compatible with versions after #305 [Jan 2017, v0.6], when there was an introduction of rehashing for fetching proofs at arbitrary tree sizes. This is when ephemeral nodes stopped being read from tree storage, and writing them became unnecessary. It is safe to remove the writing now.

Upd: There is a later commit, see below.

@pav-kv
Copy link
Contributor Author

pav-kv commented Apr 11, 2022

I made more research :) It looks like the last commit that stopped reading ephemeral nodes was df47465, which was only released in v1.4.0. Still, @AlCutter's comment applies: there is a release that stops reading, the next release can stop writing.

@pav-kv
Copy link
Contributor Author

pav-kv commented Apr 11, 2022

If this update is rolled out after df47465 (or v1.4.0) has been rolled out, there won't be any issues.

If this update is rolled out with skipping df47465 (or v1.4.0), the worst case effect would be a temporary inability to serve some proofs correctly (while the rollout is ongoing). Specifically, the conditions that would trigger it:

  1. A new (rolled out) logsigner node completed a sequencing run (hence written some tiles, but haven't updated ephemeral nodes).
  2. An old (not rolled out) logserver node receives an (inclusion|consistency) proof request. It may try to read ephemeral nodes from storage if:
  3. The requested tree size equals the current tree size stored in Trillian.

The amount of time in which this can happen is limited by:

  • The rollout time.
  • The period of time while the requested tree size equals the current tree size. Trillian updates the tree frequently (multiple times per second), so it's likely that the request will return the correct result after a couple of retries.

If this effect is completely undesirable (and there is no way to rollout v1.4.0 first), there is still a way to avoid it:

  1. Roll out logserver first, so that all the reading of ephemeral nodes disappears.
  2. Roll out logsigner second, so that all the writes are compatible with all the reads.

They are not used for reads, everything is based on compact ranges which
don't use the ephemeral nodes.
@pav-kv pav-kv force-pushed the rm_ephemeral_nodes branch 2 times, most recently from 0d757ca to f3fa5a8 Compare May 16, 2022 23:23
@pav-kv pav-kv changed the title Do not store ephemeral node hashes Allow not storing ephemeral node hashes May 16, 2022
@pav-kv pav-kv requested review from mhutchinson and removed request for Martin2112 May 16, 2022 23:38
@pav-kv
Copy link
Contributor Author

pav-kv commented May 16, 2022

@AlCutter @mhutchinson I ended up with a safer approach. This PR now introduces a flag that can enable the new behaviour only for certain trees. We can roll this out for a few trees first, and, once made sure it works, fully enable it.

So, I propose to cut a release v1.4.2 with this flag now, and later cut v1.5.0 with this flag removed, and the new behaviour enabled by default. Or maybe we can skip v1.4.2, but I think it's good to have so that other operators may do the transition more safely too.

WDYT?

CHANGELOG.md Outdated Show resolved Hide resolved
@pav-kv pav-kv merged commit d62733d into google:master May 17, 2022
@pav-kv pav-kv deleted the rm_ephemeral_nodes branch May 17, 2022 12:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants