cache.Putter duplicates RefCnt #4525

Closed
ldeffenb opened this issue Dec 19, 2023 · 1 comment · Fixed by #4546
Labels: issue, needs-triaging

@ldeffenb (Collaborator)
Context

1.18.2

Summary

Rapidly downloading a non-local reference can result in chunkStore RefCnt inconsistencies relative to the cache records. This happens because there is no mutex protection in the Cache.Putter method.

func (c *Cache) Putter(store internal.Storage) storage.Putter {
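To make the race concrete, here is a toy, self-contained illustration of the check-then-act pattern described above (it is not the bee source; chunkStore and cache are invented names for the sketch). An unguarded Has check followed by a reference-counted Put lets every concurrent caller bump the refCnt, while the keyed cache entry is effectively written only once:

// Toy illustration (not the bee implementation): an unguarded Has check
// followed by a reference-counted Put lets every concurrent caller bump
// refCnt, even though only one cache entry is ever kept.
package main

import (
	"fmt"
	"sync"
)

type chunkStore struct {
	mu     sync.Mutex
	refCnt map[string]int
}

func (s *chunkStore) Put(addr string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.refCnt[addr]++ // every Put bumps the reference count
}

type cache struct {
	mu      sync.Mutex
	entries map[string]struct{}
	store   *chunkStore
}

// Put mimics the problematic check-then-act: the Has check and the
// chunk-store Put are not covered by a single lock.
func (c *cache) Put(addr string) {
	c.mu.Lock()
	_, has := c.entries[addr]
	c.mu.Unlock()
	if has {
		return
	}
	// RACE: other goroutines can pass the check above before we get here.
	c.store.Put(addr) // refCnt incremented once per caller that slipped through
	c.mu.Lock()
	c.entries[addr] = struct{}{} // keyed record: written once, overwritten harmlessly
	c.mu.Unlock()
}

func main() {
	cs := &chunkStore{refCnt: map[string]int{}}
	ca := &cache{entries: map[string]struct{}{}, store: cs}

	var wg sync.WaitGroup
	for i := 0; i < 6; i++ { // six concurrent "downloads" of the same chunk
		wg.Add(1)
		go func() {
			defer wg.Done()
			ca.Put("714655aa")
		}()
	}
	wg.Wait()
	// Typically prints refCnt 6 with a single cache entry.
	fmt.Println("refCnt:", cs.refCnt["714655aa"], "entries:", len(ca.entries))
}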

Expected behavior

If a chunk is retrieved multiple times in quick succession, I'd still expect it to be inserted into the cache, with a corresponding chunkStore RefCnt increment, only ONCE.

Actual behavior

I've been analyzing the RefCnts in the ChunkStore and added extensive logging (called chunkTrace) to my bee node to track whenever the RefCnt is incremented or decremented, as well as when actual chunks are initially stored to and finally removed from sharky. As an extension of this logging, I've recently started working on a RefCnt analyzer/fixer, and much to my surprise, I have a bunch of chunks whose RefCnts are incorrect versus what is represented by the cache, reserve, and pins (ignoring upload for the moment).

I db-nuked a node in testnet and started it up with even more extensive logging that reveals why ChunkStore.Put() and .Delete() were invoked. In just a few hours of running, I have RefCnts that are higher than justified, and in every case it was a download retrieval CachePutter that incorrectly incremented the RefCnt for a chunk while storing only a single entry in the actual cache store.

I requested a download of a specific reference, maybe two, multiple times in quick succession; I don't remember exactly why (I suspect it was the OSM map browser), but that's what the logs show. Since the two chunks making up that reference were not local, they were fetched from the swarm concurrently for each of my SIX download requests.

When these six retrievals (times 2 chunks) completed, the netstore.Download method invoked the cache putter to save the chunks. As you can see in the logs below, this all occurred in less than 1 millisecond for each of the two chunks.

err := db.Cache().Put(db.cacheLimiter.ctx, ch)

First the API resolving the reference:

bee-1.18.2-B.out:"time"="2023-12-18 23:09:54.407428" "level"="debug" "logger"="node/api" "v"=1 "msg"="resolve name: parsing bzz address successful" "string"="2f0d6bd9afb1d95f37b241f2b79f258c0af577288cb320a226ab46fecfd66011" "address"="2f0d6bd9afb1d95f37b241f2b79f258c0af577288cb320a226ab46fecfd66011"
bee-1.18.2-B.out:"time"="2023-12-18 23:09:54.407428" "level"="debug" "logger"="node/api" "v"=1 "msg"="resolve name: parsing bzz address successful" "string"="2f0d6bd9afb1d95f37b241f2b79f258c0af577288cb320a226ab46fecfd66011" "address"="2f0d6bd9afb1d95f37b241f2b79f258c0af577288cb320a226ab46fecfd66011"
bee-1.18.2-B.out:"time"="2023-12-18 23:09:54.407428" "level"="debug" "logger"="node/api" "v"=1 "msg"="resolve name: parsing bzz address successful" "string"="2f0d6bd9afb1d95f37b241f2b79f258c0af577288cb320a226ab46fecfd66011" "address"="2f0d6bd9afb1d95f37b241f2b79f258c0af577288cb320a226ab46fecfd66011"
bee-1.18.2-B.out:"time"="2023-12-18 23:09:54.409145" "level"="debug" "logger"="node/api" "v"=1 "msg"="resolve name: parsing bzz address successful" "string"="2f0d6bd9afb1d95f37b241f2b79f258c0af577288cb320a226ab46fecfd66011" "address"="2f0d6bd9afb1d95f37b241f2b79f258c0af577288cb320a226ab46fecfd66011"
bee-1.18.2-B.out:"time"="2023-12-18 23:09:54.409163" "level"="debug" "logger"="node/api" "v"=1 "msg"="resolve name: parsing bzz address successful" "string"="2f0d6bd9afb1d95f37b241f2b79f258c0af577288cb320a226ab46fecfd66011" "address"="2f0d6bd9afb1d95f37b241f2b79f258c0af577288cb320a226ab46fecfd66011"
bee-1.18.2-B.out:"time"="2023-12-18 23:09:54.409147" "level"="debug" "logger"="node/api" "v"=1 "msg"="resolve name: parsing bzz address successful" "string"="2f0d6bd9afb1d95f37b241f2b79f258c0af577288cb320a226ab46fecfd66011" "address"="2f0d6bd9afb1d95f37b241f2b79f258c0af577288cb320a226ab46fecfd66011"

Then the successful retrieval of the first chunk:

bee-1.18.2-B.out:"time"="2023-12-18 23:09:54.674035" "level"="debug" "logger"="node/retrieval" "msg"="retrieved chunk" "chunk_address"="714655aa19478dadfdb869f7f4def0c0409fecdc41107915b7a017bb7ed94cff" "peer_address"="675d2d8f31ba851a5a3eaad5cbbeeca94c02b0fe666dcb2acef95d9bf4c73e3f" "peer_proximity"=3
bee-1.18.2-B.out:"time"="2023-12-18 23:09:54.674618" "level"="debug" "logger"="node/storer" "msg"="chunkTrace: chunkStore.Put new chunk NOstamp!" "why"="CachePutter(Cache(Download))" "address"="714655aa19478dadfdb869f7f4def0c0409fecdc41107915b7a017bb7ed94cff" "loc"="14:242(1013)"

And the redundant increments on that chunk:

bee-1.18.2-B.out:"time"="2023-12-18 23:09:54.674677" "level"="debug" "logger"="node/storer" "msg"="chunkTrace: chunkStore.Put increment chunk NOstamp!" "why"="CachePutter(Cache(Download))" "address"="714655aa19478dadfdb869f7f4def0c0409fecdc41107915b7a017bb7ed94cff" "loc"="14:242(1013)" "refCnt"=2
bee-1.18.2-B.out:"time"="2023-12-18 23:09:54.674721" "level"="debug" "logger"="node/storer" "msg"="chunkTrace: chunkStore.Put increment chunk NOstamp!" "why"="CachePutter(Cache(Download))" "address"="714655aa19478dadfdb869f7f4def0c0409fecdc41107915b7a017bb7ed94cff" "loc"="14:242(1013)" "refCnt"=3
bee-1.18.2-B.out:"time"="2023-12-18 23:09:54.674773" "level"="debug" "logger"="node/storer" "msg"="chunkTrace: chunkStore.Put increment chunk NOstamp!" "why"="CachePutter(Cache(Download))" "address"="714655aa19478dadfdb869f7f4def0c0409fecdc41107915b7a017bb7ed94cff" "loc"="14:242(1013)" "refCnt"=4
bee-1.18.2-B.out:"time"="2023-12-18 23:09:54.674822" "level"="debug" "logger"="node/storer" "msg"="chunkTrace: chunkStore.Put increment chunk NOstamp!" "why"="CachePutter(Cache(Download))" "address"="714655aa19478dadfdb869f7f4def0c0409fecdc41107915b7a017bb7ed94cff" "loc"="14:242(1013)" "refCnt"=5
bee-1.18.2-B.out:"time"="2023-12-18 23:09:54.674876" "level"="debug" "logger"="node/storer" "msg"="chunkTrace: chunkStore.Put increment chunk NOstamp!" "why"="CachePutter(Cache(Download))" "address"="714655aa19478dadfdb869f7f4def0c0409fecdc41107915b7a017bb7ed94cff" "loc"="14:242(1013)" "refCnt"=6

Then the second chunk was successfully retrieved:

bee-1.18.2-B.out:"time"="2023-12-18 23:09:54.932027" "level"="debug" "logger"="node/retrieval" "msg"="retrieved chunk" "chunk_address"="eed32a1fc98f4e6dc980a022b8c429fafcea04ffb71ff74b85b5aa9e3ce73561" "peer_address"="f320fb80e8c3872fd7605eb44dba93ae192b23ab0a78d6e8c0523565072e8e18" "peer_proximity"=3
bee-1.18.2-B.out:"time"="2023-12-18 23:09:54.932539" "level"="debug" "logger"="node/storer" "msg"="chunkTrace: chunkStore.Put new chunk NOstamp!" "why"="CachePutter(Cache(Download))" "address"="eed32a1fc98f4e6dc980a022b8c429fafcea04ffb71ff74b85b5aa9e3ce73561" "loc"="1:246(840)"

And also redundantly incremented:

bee-1.18.2-B.out:"time"="2023-12-18 23:09:54.932628" "level"="debug" "logger"="node/storer" "msg"="chunkTrace: chunkStore.Put increment chunk NOstamp!" "why"="CachePutter(Cache(Download))" "address"="eed32a1fc98f4e6dc980a022b8c429fafcea04ffb71ff74b85b5aa9e3ce73561" "loc"="1:246(840)" "refCnt"=2
bee-1.18.2-B.out:"time"="2023-12-18 23:09:54.932722" "level"="debug" "logger"="node/storer" "msg"="chunkTrace: chunkStore.Put increment chunk NOstamp!" "why"="CachePutter(Cache(Download))" "address"="eed32a1fc98f4e6dc980a022b8c429fafcea04ffb71ff74b85b5aa9e3ce73561" "loc"="1:246(840)" "refCnt"=3
bee-1.18.2-B.out:"time"="2023-12-18 23:09:54.932775" "level"="debug" "logger"="node/storer" "msg"="chunkTrace: chunkStore.Put increment chunk NOstamp!" "why"="CachePutter(Cache(Download))" "address"="eed32a1fc98f4e6dc980a022b8c429fafcea04ffb71ff74b85b5aa9e3ce73561" "loc"="1:246(840)" "refCnt"=4
bee-1.18.2-B.out:"time"="2023-12-18 23:09:54.932842" "level"="debug" "logger"="node/storer" "msg"="chunkTrace: chunkStore.Put increment chunk NOstamp!" "why"="CachePutter(Cache(Download))" "address"="eed32a1fc98f4e6dc980a022b8c429fafcea04ffb71ff74b85b5aa9e3ce73561" "loc"="1:246(840)" "refCnt"=5
bee-1.18.2-B.out:"time"="2023-12-18 23:09:54.932921" "level"="debug" "logger"="node/storer" "msg"="chunkTrace: chunkStore.Put increment chunk NOstamp!" "why"="CachePutter(Cache(Download))" "address"="eed32a1fc98f4e6dc980a022b8c429fafcea04ffb71ff74b85b5aa9e3ce73561" "loc"="1:246(840)" "refCnt"=6

In actual fact, these weren't redundant fetches of a single reference from the API, but the internal manifest resolution fetching one of the nodes along the /8/12... path, as seen in these API completion logs:

bee-1.18.2-B.out:"time"="2023-12-18 23:09:56.610923" "level"="info" "logger"="node/api" "msg"="api access" "ip"="192.168.10.177" "method"="GET" "host"="192.168.10.36:1633" "uri"="/bzz/2f0d6bd9afb1d95f37b241f2b79f258c0af577288cb320a226ab46fecfd66011/8/128/85.png" "proto"="HTTP/1.1" "status"=200 "size"=38560 "duration"="2.201763364s" "referrer"="http://192.168.10.36:1633/bzz/2f0d6bd9afb1d95f37b241f2b79f258c0af577288cb320a226ab46fecfd66011/" "user-agent"="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36"
bee-1.18.2-B.out:"time"="2023-12-18 23:09:56.639080" "level"="info" "logger"="node/api" "msg"="api access" "ip"="192.168.10.177" "method"="GET" "host"="192.168.10.36:1633" "uri"="/bzz/2f0d6bd9afb1d95f37b241f2b79f258c0af577288cb320a226ab46fecfd66011/8/127/83.png" "proto"="HTTP/1.1" "status"=200 "size"=47064 "duration"="2.229934697s" "referrer"="http://192.168.10.36:1633/bzz/2f0d6bd9afb1d95f37b241f2b79f258c0af577288cb320a226ab46fecfd66011/" "user-agent"="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36"
bee-1.18.2-B.out:"time"="2023-12-18 23:09:56.754622" "level"="info" "logger"="node/api" "msg"="api access" "ip"="192.168.10.177" "method"="GET" "host"="192.168.10.36:1633" "uri"="/bzz/2f0d6bd9afb1d95f37b241f2b79f258c0af577288cb320a226ab46fecfd66011/8/128/83.png" "proto"="HTTP/1.1" "status"=200 "size"=28438 "duration"="2.345477204s" "referrer"="http://192.168.10.36:1633/bzz/2f0d6bd9afb1d95f37b241f2b79f258c0af577288cb320a226ab46fecfd66011/" "user-agent"="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36"
bee-1.18.2-B.out:"time"="2023-12-18 23:09:56.755814" "level"="info" "logger"="node/api" "msg"="api access" "ip"="192.168.10.177" "method"="GET" "host"="192.168.10.36:1633" "uri"="/bzz/2f0d6bd9afb1d95f37b241f2b79f258c0af577288cb320a226ab46fecfd66011/8/127/84.png" "proto"="HTTP/1.1" "status"=200 "size"=48705 "duration"="2.348396845s" "referrer"="http://192.168.10.36:1633/bzz/2f0d6bd9afb1d95f37b241f2b79f258c0af577288cb320a226ab46fecfd66011/" "user-agent"="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36"
bee-1.18.2-B.out:"time"="2023-12-18 23:09:56.756660" "level"="info" "logger"="node/api" "msg"="api access" "ip"="192.168.10.177" "method"="GET" "host"="192.168.10.36:1633" "uri"="/bzz/2f0d6bd9afb1d95f37b241f2b79f258c0af577288cb320a226ab46fecfd66011/8/128/84.png" "proto"="HTTP/1.1" "status"=200 "size"=42980 "duration"="2.349252757s" "referrer"="http://192.168.10.36:1633/bzz/2f0d6bd9afb1d95f37b241f2b79f258c0af577288cb320a226ab46fecfd66011/" "user-agent"="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36"
bee-1.18.2-B.out:"time"="2023-12-18 23:09:57.041609" "level"="info" "logger"="node/api" "msg"="api access" "ip"="192.168.10.177" "method"="GET" "host"="192.168.10.36:1633" "uri"="/bzz/2f0d6bd9afb1d95f37b241f2b79f258c0af577288cb320a226ab46fecfd66011/8/127/85.png" "proto"="HTTP/1.1" "status"=200 "size"=50954 "duration"="2.634205123s" "referrer"="http://192.168.10.36:1633/bzz/2f0d6bd9afb1d95f37b241f2b79f258c0af577288cb320a226ab46fecfd66011/" "user-agent"="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36"

Steps to reproduce

Find a reference not already local in your node and fire off quick, concurrent /bytes (or it may have been /bzz) retrievals of that reference (see the sketch below). If the timing is right and quick enough, you'll end up with a chunk that will never leave your sharky, even after it has long since been flushed from the cache.
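For anyone trying to reproduce this, here is a hypothetical driver that fires six concurrent /bytes requests, mirroring the six downloads in the logs above; the node URL and the reference are placeholders for your own setup:

// Hypothetical reproduction driver: fire several concurrent /bytes requests
// for a reference that is not yet local to the node. The node URL and the
// reference below are placeholders.
package main

import (
	"fmt"
	"io"
	"net/http"
	"sync"
)

func main() {
	const n = 6 // mirrors the six concurrent downloads seen in the logs above
	ref := "2f0d6bd9afb1d95f37b241f2b79f258c0af577288cb320a226ab46fecfd66011"
	url := "http://localhost:1633/bytes/" + ref

	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			resp, err := http.Get(url)
			if err != nil {
				fmt.Printf("request %d: %v\n", i, err)
				return
			}
			defer resp.Body.Close()
			body, _ := io.ReadAll(resp.Body)
			fmt.Printf("request %d: status %d, %d bytes\n", i, resp.StatusCode, len(body))
		}(i)
	}
	wg.Wait()
}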

But there are NO tools or logging in the stock bee node that will ever let you see this effect, other than your sharky slowly growing larger and larger over time with no apparent explanation.

Possible solution

Protect the Cache.Putter from concurrent execution between the Has check and actually storing the chunk in the cache. This would ensure that concurrent invocations cannot all find the Has check empty and then each redundantly put the single keyed cacheEntry record while invoking ChunkStore().Put multiple times, incrementing the RefCnt several times even though only a single decrement happens when the cacheEntry is finally flushed. A sketch of such a guard follows the code reference below.

func (c *Cache) Putter(store internal.Storage) storage.Putter {
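As a hedged sketch only (the actual fix landed via #4546 and may differ), here is what the guard looks like as a method on the toy cache type from the illustration under Summary: the Has check and the chunk-store Put happen under one lock, so only the first of the concurrent callers increments the RefCnt.

// Hedged sketch of the proposed guard on the toy cache type shown earlier
// (not the actual fix): the check and the put become a single atomic step.
func (c *cache) PutGuarded(addr string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if _, has := c.entries[addr]; has {
		return // already cached: no extra refCnt increment
	}
	c.store.Put(addr)            // refCnt incremented exactly once
	c.entries[addr] = struct{}{} // cache entry recorded under the same lock
}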

@ldeffenb (Collaborator, Author) commented Dec 19, 2023

I just also realized that the Retrieval.RetrieveChunk singleflight, keyed on the chunk reference (with an optional _origin suffix, as would apply in this case), practically guarantees that multiple concurrent downloads of a single reference chunk will all complete within microseconds of each other. That is probably why I'm seeing so many of these so soon after browsing the OSM map manifest through this newly nuked node (see the sketch after the snippet below).

v, _, err := s.singleflight.Do(ctx, flightRoute, func(ctx context.Context) (swarm.Chunk, error) {
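For context, a minimal sketch of that behavior using the public golang.org/x/sync/singleflight package (bee calls its own context-aware variant, as the snippet above shows, so the signatures differ): all callers sharing the flight key are released with the same result at essentially the same instant, and each then hands the chunk to the cache putter inside the same tiny race window.

// Minimal singleflight sketch: one fetch runs, six callers return together.
package main

import (
	"fmt"
	"sync"
	"time"

	"golang.org/x/sync/singleflight"
)

func main() {
	var g singleflight.Group
	var wg sync.WaitGroup

	for i := 0; i < 6; i++ { // six concurrent retrievals of the same chunk
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			// Only one fetch actually runs; the rest wait on the same flight.
			v, _, shared := g.Do("714655aa_origin", func() (interface{}, error) {
				time.Sleep(200 * time.Millisecond) // simulated swarm retrieval
				return "chunk data", nil
			})
			// Every caller proceeds from here within microseconds of the others.
			fmt.Printf("caller %d got %v at %s (shared=%v)\n",
				i, v, time.Now().Format("15:04:05.000000"), shared)
		}(i)
	}
	wg.Wait()
}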

I've got 85 chunks with incorrect RefCnts out of 1,621,219 chunks in this freshly nuked database. Most of the chunks are in the reserve (1,937,226 entries; yes, more than my total sharky chunks, because a single chunk with multiple stamps counts multiple times in the reserve) versus the cache (33,387).
