Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Atree Register Inlining and Data Deduplication #342

Merged
merged 56 commits into from
Oct 4, 2023

Conversation

fxamacker
Copy link
Member

@fxamacker fxamacker commented Sep 14, 2023

Updates epics #292 onflow/flow-go#1744
Updates #296 #297 #347 #348 onflow/cadence#1854
Required by Cadence Integration with Atree onflow/cadence#2809
Required by Atree Storage Migration:

NOTE: some of the above updated issues will be closed by this PR:

  • after integration with Cadence (in onflow/cadence repo) and
  • after long-running smoke tests run longer (some have been running locally since Oct 5 and Oct 6).

Problem

Memory use on execution nodes, register count, and growth rate can be improved.

Currently, every array or map are stored in its own slab and parent slab refers to child array or map by SlabID.

The current approach can lead to many small slabs, especially for Cadence data structures with multiple nested levels.

Another aspect is duplicate data, especially when encoding nested Cadence composites. The design of atree inlining already requries changing the encoding format, so use this opportunity to also deduplicate data when encoding nested composites.

Changes

Reduce register count by inlining and optimize encoding size by deduplicating Cadence composite types within slabs.

  1. Inlining. This commit inlines child array/map in parent slab when:
  • child array/map fits in one slab (root slab is data slab)
  • encoded size of inlined child array/map is less than the max inline size limit enforced by parent
  1. Data deduplication. This commit optimizes encoding size by:
  • reusing inlined array types
  • reusing seed, digests, and field names of inlined composite types

Also update debugging code to handle inlined array/map element.

TODO

In this PR

  • add more tests
  • cleanup some of the tests

Atree Inlining Part 2

Open new PR to add support for inlinability of child values returned by iterator (requires PR #345)

Cadence Integration

Open new PR in onflow/cadence to integrate with Cadence.


  • Targeted PR against main branch
  • Linked to Github issue with discussion and accepted design OR link to spec that describes this work
  • Code follows the standards mentioned here
  • Updated relevant documentation
  • Re-reviewed Files changed in the Github PR explorer
  • Added appropriate labels

Currently, every array or map are stored in its own slab and
parent slab refers to child array or map by SlabID.

The current approach can lead to many small slabs, especially for
Cadence data structures with multiple nested levels.

This commit inlines child array/map in parent slab when:
- child array/map fits in one slab (root slab is data slab)
- encoded size of inlined child array/map is less than the
  max inline size limit enforced by parent

This commit optimizes encoding size by:
- reusing inlined array types
- reusing seed, digests, and field names of inlined composite types

Also update debugging code to handle inlined array/map element.
@fxamacker fxamacker added enhancement New feature or request performance improvement breaking change changes to API that can break programs using Atree's API storage breaking change breaks existing stored data (requires storage migration) labels Sep 14, 2023
@fxamacker fxamacker self-assigned this Sep 14, 2023
@fxamacker fxamacker marked this pull request as draft September 14, 2023 17:35
@fxamacker fxamacker changed the title Atree Inlining (part 1 of 2): inline child arra and /map data slab into parent slab Atree Inlining (part 1 of 2): inline child array and /map data slab into parent slab Sep 14, 2023
@fxamacker fxamacker changed the title Atree Inlining (part 1 of 2): inline child array and /map data slab into parent slab Atree Inlining (part 1 of 2): inline child array and map data slab into parent slab Sep 14, 2023
@fxamacker fxamacker changed the title Atree Inlining (part 1 of 2): inline child array and map data slab into parent slab Inline child array and map data slab into parent slab Sep 14, 2023
@fxamacker fxamacker marked this pull request as ready for review September 15, 2023 00:19
Currently, in parentUpdater callback, parent array/map resets same
child value.

Child value ID should match overwritten SlabIDStorable
or Slab.SlabID().

This commit adds check to make sure same child value is being reset.
@codecov-commenter
Copy link

codecov-commenter commented Sep 17, 2023

Codecov Report

Attention: 1110 lines in your changes are missing coverage. Please review.

Comparison is base (1e6ec55) 64.93% compared to head (5f1de0a) 62.62%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #342      +/-   ##
==========================================
- Coverage   64.93%   62.62%   -2.31%     
==========================================
  Files          14       15       +1     
  Lines        8811    10559    +1748     
==========================================
+ Hits         5721     6613     +892     
- Misses       2356     3004     +648     
- Partials      734      942     +208     
Files Coverage Δ
encode.go 80.72% <100.00%> (+4.25%) ⬆️
storable.go 59.37% <ø> (ø)
basicarray.go 49.38% <70.00%> (+0.01%) ⬆️
storage.go 73.09% <52.00%> (-0.60%) ⬇️
typeinfo.go 60.32% <60.32%> (ø)
map_debug.go 43.65% <49.34%> (-8.51%) ⬇️
array_debug.go 47.44% <49.04%> (-4.67%) ⬇️
array.go 68.10% <62.09%> (-1.97%) ⬇️
map.go 65.30% <58.52%> (-1.78%) ⬇️

... and 1 file with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@turbolent
Copy link
Member

Fantastic work Faye! 👏

Some notes from the first review session:

  • It would be good to document that the "parent pointers" are not stored physically, but are actually functions, in memory (and when they are set). Maybe add some high-level comment to the code and indicate it in the diagrams; they usually show "stored" physical pointers ("storage IDs" (now "slab ID" / "value ID", right?))
  • Maybe document (if not already) that the inlining approach is to not inline all slabs that are "inlinable" in the sense of fitting inside of the parent, but that inlinability is determined "as a whole" for a whole value. The example was that, a small array is inlinable if it is just a single data slab, but a larger array consisting of a meta slab and some data slabs is not inlinable, even when the meta slab could fit into the parent
  • Maybe document (if not already), that "slab ID" == "value ID", i.e. they share a namespace, and an ID might be used for a slab, but it might also not. Currently, a slab ID could be checked to exist
  • As discussed, we might also want to update the child' parent callback when the child is inserted into the container (array and map), e.g. in the Set and Insert operations.

I'll do a proper code review after the next review session. So far the work looks great, well done! 👏

Copy link
Contributor

@ramtinms ramtinms left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚢 the logic and key parts of the changes look good to me, I did not check all the lines as we have a very good coverage by the tests.
Besides the suggestions I put in the code, I suggest also adding this logic to the stress test or tests to check for register allocation leaks (register falls behind), especially when we continuously move from large mutable to inlineable ones.

p.s. (cc: @janezpodhostnik ) I think we need higher-level testing for applications like the FVM environment but that's out of the scope of this repo. we might re-use the NFT tracker reporter to check the health before and after migration. and simulate some transactions to check the changes (especially re-sizing logic).

type EquatableStorable interface {
Storable
// Equal returns true if the given storable is equal to this storable.
Equal(Storable) bool
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

storage.go Outdated
Comment on lines 39 to 40
copy(id[:], sid.address[:])
copy(id[8:], sid.index[:])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we just add a warning that if we change the size of the address or index, we could have a problem here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better yet, consider using unsafe.Sizeof for the definition of ValueID above, in the range expression here. See for example: https://goplay.tools/snippet/EtdTvPNnIvF

typeinfo.go Outdated
"github.com/fxamacker/cbor/v2"
)

type TypeInfo interface {
Encode(*cbor.StreamEncoder) error
IsComposite() bool
ID() string
// TODO: maybe add a copy function because decoded TypeInfo can be shared by multiple slabs if not copied.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this still a relevant TODO?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this still a relevant TODO?

Yes, good catch. The TODO was a precaution that was postponed to get PR reviews started on the other parts.

It's done in 72d3614. TypeInfo is copied in the commit as a precaution because TypeInfo can be shared among inlined slabs referring to the same type and any changes to one may affect all others if not copied.

map.go Outdated
firstKey: elements.firstKey(),
}

// TODO: does extra data needs to be copied?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this still a valid TODO?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this still a valid TODO?

Yes 👍 See comment above and this is also done in the same commit 72d3614.

array.go Outdated Show resolved Hide resolved
array.go Show resolved Hide resolved
array.go Show resolved Hide resolved
array.go Outdated Show resolved Hide resolved
typeinfo.go Outdated Show resolved Hide resolved
This commit adds callback notification in the set or inserted child
element in array.  So if child element is modified after Array.Set()
or Array.Insert(), changes to child element is properly reflected
in the parent array.

This commit doesn't appear to be needed by current version of Cadence
but it helps reduce Atree's dependence on an implementation detail
of Cadence.
This commit adds callback notification in the set child element in map.
So if child element is modified after OrderedMap.Set() , changes to
child element is properly reflected in the parent map.

This commit doesn't appear to be needed by current version of Cadence
but it helps reduce Atree's dependence on an implementation detail
of Cadence.
Currently, deduplication feature in PR 342 (not merged yet)
does not support Cadence attachments.

Add support for Cadence attachments and other future
use cases by using type ID and sorted field names
instead of field count.

While at it, also refactor encoding composite type to
reduce traversal and type assertion.
Decoded ExtraData (including TypeInfo) can be shared by all
inlined slabs.  This commit creates a new ExtraData with copied
TypeInfo for inlined slabs to prevent accidental mutation.
Decoded ExtraData (including keys) can be shared by all
inlined composite referring to the same type.

This commit copies key for inlined composite to prevent
accidental mutation.
@turbolent
Copy link
Member

I had a look at adding the container/child address matching checks to Cadence: Just before performing a mutating operation on a container (array/map), we already transfer the child, so checking the invariant does not really help much: if we insert the assertion we can also check the transfer.

On the atree side, it's not easy to check this invariant: children are values at first, and getting the storable (which is needed to get the potential storage ID) is delayed until the very "end", e.g. deep down when the container/parent is not available anymore.

Maybe there isn't much we can do here

@fxamacker
Copy link
Member Author

@turbolent Thanks for detailed review and meeting notes!

I think everything discussed so far have been addressed in this PR. 🎉

  • More comments were added about how atree inlining works in commit a75e388

  • Commit 63ea7a7 handles values with outdated parent callbacks (value was removed or overwritten in parent container):

    • the callback is no-op when callback detects value is no longer a child in its old parent.
    • value unsets outdated parent callback
  • Commit 8d02bbc renamed existingValue.

  • Commit 042eb68 unlined removed or overwritten child (if it is inlined)

  • I double-checked that atree doesn't make assumption about value returned from StoredValue().

  • About "invariant that all values stored in an account always reference values in the same account"

    • it isn't easy to check in code as you mentioned. However, we check this in tests, for example, atree.CheckStorageHealth() checks that all child or referenced slabs have the same address and atree validation functions also checks inlined slabs in memory have the same address as its parent.
  • I opened these issues to add more integration tests:

BTW, can you please take a look at PR #345? That is the 2nd of 2 atree inlining PRs.

Copy link
Member

@turbolent turbolent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fantastic work Faye! 👏

My review focused mainly on the changes to array and ordered map, I didn't review the tests, but they seem very thorough (18 kLOC wow!) 👏

As mentioned above, we can address remaining work in follow-up PRs

@fxamacker fxamacker changed the base branch from main to feature/array-map-inlining October 4, 2023 16:09
@fxamacker fxamacker merged commit 0fc8f74 into feature/array-map-inlining Oct 4, 2023
11 checks passed
@fxamacker fxamacker changed the title Inline child array and map data slab into parent slab Atree Register Inlining and Data Deduplication Oct 6, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking change changes to API that can break programs using Atree's API enhancement New feature or request improvement performance storage breaking change breaks existing stored data (requires storage migration)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants