cache.Putter duplicates RefCnt #4525

Closed
ldeffenb opened this issue Dec 19, 2023 · 1 comment · Fixed by #4546
Labels: issue, needs-triaging

@ldeffenb (Collaborator)
Context

1.18.2

Summary

Rapidly downloading a non-local reference can result in chunkStore RefCnt inconsistencies relative to the cache records. This happens because there is no mutex protection in the Cache.Putter method.

func (c *Cache) Putter(store internal.Storage) storage.Putter {
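To make the race concrete, here is a toy, self-contained illustration of the check-then-act pattern described above (it is not the bee source; chunkStore and cache are invented names for the sketch). An unguarded Has check followed by a reference-counted Put lets every concurrent caller bump the refCnt, while the keyed cache entry is effectively written only once:

// Toy illustration (not the bee implementation): an unguarded Has check
// followed by a reference-counted Put lets every concurrent caller bump
// refCnt, even though only one cache entry is ever kept.
package main

import (
	"fmt"
	"sync"
)

type chunkStore struct {
	mu     sync.Mutex
	refCnt map[string]int
}

func (s *chunkStore) Put(addr string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.refCnt[addr]++ // every Put bumps the reference count
}

type cache struct {
	mu      sync.Mutex
	entries map[string]struct{}
	store   *chunkStore
}

// Put mimics the problematic check-then-act: the Has check and the
// chunk-store Put are not covered by a single lock.
func (c *cache) Put(addr string) {
	c.mu.Lock()
	_, has := c.entries[addr]
	c.mu.Unlock()
	if has {
		return
	}
	// RACE: other goroutines can pass the check above before we get here.
	c.store.Put(addr) // refCnt incremented once per caller that slipped through
	c.mu.Lock()
	c.entries[addr] = struct{}{} // keyed record: written once, overwritten harmlessly
	c.mu.Unlock()
}

func main() {
	cs := &chunkStore{refCnt: map[string]int{}}
	ca := &cache{entries: map[string]struct{}{}, store: cs}

	var wg sync.WaitGroup
	for i := 0; i < 6; i++ { // six concurrent "downloads" of the same chunk
		wg.Add(1)
		go func() {
			defer wg.Done()
			ca.Put("714655aa")
		}()
	}
	wg.Wait()
	// Typically prints refCnt 6 with a single cache entry.
	fmt.Println("refCnt:", cs.refCnt["714655aa"], "entries:", len(ca.entries))
}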

Expected behavior

If a chunk is retrieved multiple times in quick succession, I'd still expect it to be inserted into the cache, with a corresponding chunkStore RefCnt increment, only ONCE.

Actual behavior

I've been analyzing the RefCnts in the ChunkStore and added extensive logging (called chunkTrace) to my bee node to track whenever the RefCnt is incremented or decremented, as well as when actual chunks are initially stored to and finally removed from sharky. As an extension of this logging, I've recently started working on a RefCnt analyzer/fixer, and much to my surprise, I have a bunch of chunks whose RefCnts are incorrect versus what is represented by the cache, reserve, and pins (ignoring upload for the moment).

I db-nuked a node in testnet and started it up with even more extensive logging that reveals why ChunkStore.Put() and .Delete() were invoked. In just a few hours of running, I have RefCnts that are higher than justified, and in every case it was a download retrieval CachePutter that incorrectly incremented the RefCnt for a chunk while storing only a single entry in the actual cache store.

I requested a download of a specific reference, maybe two, multiple times in quick succession; I don't remember exactly why (I suspect it was the OSM map browser), but that's what the logs show. Since the two chunks making up that reference were not local, they were fetched from the swarm concurrently for each of my SIX download requests.

When these six retrievals (times 2 chunks) completed, the netstore.Download method invoked the cache putter to save the chunks. As you can see in the logs below, this all occurred in less than 1 millisecond for each of the two chunks.

err := db.Cache().Put(db.cacheLimiter.ctx, ch)

First the API resolving the reference:

bee-1.18.2-B.out:"time"="2023-12-18 23:09:54.407428" "level"="debug" "logger"="node/api" "v"=1 "msg"="resolve name: parsing bzz address successful" "string"="2f0d6bd9afb1d95f37b241f2b79f258c0af577288cb320a226ab46fecfd66011" "address"="2f0d6bd9afb1d95f37b241f2b79f258c0af577288cb320a226ab46fecfd66011"
bee-1.18.2-B.out:"time"="2023-12-18 23:09:54.407428" "level"="debug" "logger"="node/api" "v"=1 "msg"="resolve name: parsing bzz address successful" "string"="2f0d6bd9afb1d95f37b241f2b79f258c0af577288cb320a226ab46fecfd66011" "address"="2f0d6bd9afb1d95f37b241f2b79f258c0af577288cb320a226ab46fecfd66011"
bee-1.18.2-B.out:"time"="2023-12-18 23:09:54.407428" "level"="debug" "logger"="node/api" "v"=1 "msg"="resolve name: parsing bzz address successful" "string"="2f0d6bd9afb1d95f37b241f2b79f258c0af577288cb320a226ab46fecfd66011" "address"="2f0d6bd9afb1d95f37b241f2b79f258c0af577288cb320a226ab46fecfd66011"
bee-1.18.2-B.out:"time"="2023-12-18 23:09:54.409145" "level"="debug" "logger"="node/api" "v"=1 "msg"="resolve name: parsing bzz address successful" "string"="2f0d6bd9afb1d95f37b241f2b79f258c0af577288cb320a226ab46fecfd66011" "address"="2f0d6bd9afb1d95f37b241f2b79f258c0af577288cb320a226ab46fecfd66011"
bee-1.18.2-B.out:"time"="2023-12-18 23:09:54.409163" "level"="debug" "logger"="node/api" "v"=1 "msg"="resolve name: parsing bzz address successful" "string"="2f0d6bd9afb1d95f37b241f2b79f258c0af577288cb320a226ab46fecfd66011" "address"="2f0d6bd9afb1d95f37b241f2b79f258c0af577288cb320a226ab46fecfd66011"
bee-1.18.2-B.out:"time"="2023-12-18 23:09:54.409147" "level"="debug" "logger"="node/api" "v"=1 "msg"="resolve name: parsing bzz address successful" "string"="2f0d6bd9afb1d95f37b241f2b79f258c0af577288cb320a226ab46fecfd66011" "address"="2f0d6bd9afb1d95f37b241f2b79f258c0af577288cb320a226ab46fecfd66011"

Then the successful retrieval of the first chunk:

bee-1.18.2-B.out:"time"="2023-12-18 23:09:54.674035" "level"="debug" "logger"="node/retrieval" "msg"="retrieved chunk" "chunk_address"="714655aa19478dadfdb869f7f4def0c0409fecdc41107915b7a017bb7ed94cff" "peer_address"="675d2d8f31ba851a5a3eaad5cbbeeca94c02b0fe666dcb2acef95d9bf4c73e3f" "peer_proximity"=3
bee-1.18.2-B.out:"time"="2023-12-18 23:09:54.674618" "level"="debug" "logger"="node/storer" "msg"="chunkTrace: chunkStore.Put new chunk NOstamp!" "why"="CachePutter(Cache(Download))" "address"="714655aa19478dadfdb869f7f4def0c0409fecdc41107915b7a017bb7ed94cff" "loc"="14:242(1013)"

And the redundant increments on that chunk:

bee-1.18.2-B.out:"time"="2023-12-18 23:09:54.674677" "level"="debug" "logger"="node/storer" "msg"="chunkTrace: chunkStore.Put increment chunk NOstamp!" "why"="CachePutter(Cache(Download))" "address"="714655aa19478dadfdb869f7f4def0c0409fecdc41107915b7a017bb7ed94cff" "loc"="14:242(1013)" "refCnt"=2
bee-1.18.2-B.out:"time"="2023-12-18 23:09:54.674721" "level"="debug" "logger"="node/storer" "msg"="chunkTrace: chunkStore.Put increment chunk NOstamp!" "why"="CachePutter(Cache(Download))" "address"="714655aa19478dadfdb869f7f4def0c0409fecdc41107915b7a017bb7ed94cff" "loc"="14:242(1013)" "refCnt"=3
bee-1.18.2-B.out:"time"="2023-12-18 23:09:54.674773" "level"="debug" "logger"="node/storer" "msg"="chunkTrace: chunkStore.Put increment chunk NOstamp!" "why"="CachePutter(Cache(Download))" "address"="714655aa19478dadfdb869f7f4def0c0409fecdc41107915b7a017bb7ed94cff" "loc"="14:242(1013)" "refCnt"=4
bee-1.18.2-B.out:"time"="2023-12-18 23:09:54.674822" "level"="debug" "logger"="node/storer" "msg"="chunkTrace: chunkStore.Put increment chunk NOstamp!" "why"="CachePutter(Cache(Download))" "address"="714655aa19478dadfdb869f7f4def0c0409fecdc41107915b7a017bb7ed94cff" "loc"="14:242(1013)" "refCnt"=5
bee-1.18.2-B.out:"time"="2023-12-18 23:09:54.674876" "level"="debug" "logger"="node/storer" "msg"="chunkTrace: chunkStore.Put increment chunk NOstamp!" "why"="CachePutter(Cache(Download))" "address"="714655aa19478dadfdb869f7f4def0c0409fecdc41107915b7a017bb7ed94cff" "loc"="14:242(1013)" "refCnt"=6

Then the second chunk was successfully retrieved:

bee-1.18.2-B.out:"time"="2023-12-18 23:09:54.932027" "level"="debug" "logger"="node/retrieval" "msg"="retrieved chunk" "chunk_address"="eed32a1fc98f4e6dc980a022b8c429fafcea04ffb71ff74b85b5aa9e3ce73561" "peer_address"="f320fb80e8c3872fd7605eb44dba93ae192b23ab0a78d6e8c0523565072e8e18" "peer_proximity"=3
bee-1.18.2-B.out:"time"="2023-12-18 23:09:54.932539" "level"="debug" "logger"="node/storer" "msg"="chunkTrace: chunkStore.Put new chunk NOstamp!" "why"="CachePutter(Cache(Download))" "address"="eed32a1fc98f4e6dc980a022b8c429fafcea04ffb71ff74b85b5aa9e3ce73561" "loc"="1:246(840)"

And also redundantly incremented:

bee-1.18.2-B.out:"time"="2023-12-18 23:09:54.932628" "level"="debug" "logger"="node/storer" "msg"="chunkTrace: chunkStore.Put increment chunk NOstamp!" "why"="CachePutter(Cache(Download))" "address"="eed32a1fc98f4e6dc980a022b8c429fafcea04ffb71ff74b85b5aa9e3ce73561" "loc"="1:246(840)" "refCnt"=2
bee-1.18.2-B.out:"time"="2023-12-18 23:09:54.932722" "level"="debug" "logger"="node/storer" "msg"="chunkTrace: chunkStore.Put increment chunk NOstamp!" "why"="CachePutter(Cache(Download))" "address"="eed32a1fc98f4e6dc980a022b8c429fafcea04ffb71ff74b85b5aa9e3ce73561" "loc"="1:246(840)" "refCnt"=3
bee-1.18.2-B.out:"time"="2023-12-18 23:09:54.932775" "level"="debug" "logger"="node/storer" "msg"="chunkTrace: chunkStore.Put increment chunk NOstamp!" "why"="CachePutter(Cache(Download))" "address"="eed32a1fc98f4e6dc980a022b8c429fafcea04ffb71ff74b85b5aa9e3ce73561" "loc"="1:246(840)" "refCnt"=4
bee-1.18.2-B.out:"time"="2023-12-18 23:09:54.932842" "level"="debug" "logger"="node/storer" "msg"="chunkTrace: chunkStore.Put increment chunk NOstamp!" "why"="CachePutter(Cache(Download))" "address"="eed32a1fc98f4e6dc980a022b8c429fafcea04ffb71ff74b85b5aa9e3ce73561" "loc"="1:246(840)" "refCnt"=5
bee-1.18.2-B.out:"time"="2023-12-18 23:09:54.932921" "level"="debug" "logger"="node/storer" "msg"="chunkTrace: chunkStore.Put increment chunk NOstamp!" "why"="CachePutter(Cache(Download))" "address"="eed32a1fc98f4e6dc980a022b8c429fafcea04ffb71ff74b85b5aa9e3ce73561" "loc"="1:246(840)" "refCnt"=6

In actual fact, these weren't redundant fetches of a single reference from the API, but the internal manifest resolution fetching one of the nodes along the /8/12... path, as seen in these API completion logs:

bee-1.18.2-B.out:"time"="2023-12-18 23:09:56.610923" "level"="info" "logger"="node/api" "msg"="api access" "ip"="192.168.10.177" "method"="GET" "host"="192.168.10.36:1633" "uri"="/bzz/2f0d6bd9afb1d95f37b241f2b79f258c0af577288cb320a226ab46fecfd66011/8/128/85.png" "proto"="HTTP/1.1" "status"=200 "size"=38560 "duration"="2.201763364s" "referrer"="http://192.168.10.36:1633/bzz/2f0d6bd9afb1d95f37b241f2b79f258c0af577288cb320a226ab46fecfd66011/" "user-agent"="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36"
bee-1.18.2-B.out:"time"="2023-12-18 23:09:56.639080" "level"="info" "logger"="node/api" "msg"="api access" "ip"="192.168.10.177" "method"="GET" "host"="192.168.10.36:1633" "uri"="/bzz/2f0d6bd9afb1d95f37b241f2b79f258c0af577288cb320a226ab46fecfd66011/8/127/83.png" "proto"="HTTP/1.1" "status"=200 "size"=47064 "duration"="2.229934697s" "referrer"="http://192.168.10.36:1633/bzz/2f0d6bd9afb1d95f37b241f2b79f258c0af577288cb320a226ab46fecfd66011/" "user-agent"="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36"
bee-1.18.2-B.out:"time"="2023-12-18 23:09:56.754622" "level"="info" "logger"="node/api" "msg"="api access" "ip"="192.168.10.177" "method"="GET" "host"="192.168.10.36:1633" "uri"="/bzz/2f0d6bd9afb1d95f37b241f2b79f258c0af577288cb320a226ab46fecfd66011/8/128/83.png" "proto"="HTTP/1.1" "status"=200 "size"=28438 "duration"="2.345477204s" "referrer"="http://192.168.10.36:1633/bzz/2f0d6bd9afb1d95f37b241f2b79f258c0af577288cb320a226ab46fecfd66011/" "user-agent"="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36"
bee-1.18.2-B.out:"time"="2023-12-18 23:09:56.755814" "level"="info" "logger"="node/api" "msg"="api access" "ip"="192.168.10.177" "method"="GET" "host"="192.168.10.36:1633" "uri"="/bzz/2f0d6bd9afb1d95f37b241f2b79f258c0af577288cb320a226ab46fecfd66011/8/127/84.png" "proto"="HTTP/1.1" "status"=200 "size"=48705 "duration"="2.348396845s" "referrer"="http://192.168.10.36:1633/bzz/2f0d6bd9afb1d95f37b241f2b79f258c0af577288cb320a226ab46fecfd66011/" "user-agent"="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36"
bee-1.18.2-B.out:"time"="2023-12-18 23:09:56.756660" "level"="info" "logger"="node/api" "msg"="api access" "ip"="192.168.10.177" "method"="GET" "host"="192.168.10.36:1633" "uri"="/bzz/2f0d6bd9afb1d95f37b241f2b79f258c0af577288cb320a226ab46fecfd66011/8/128/84.png" "proto"="HTTP/1.1" "status"=200 "size"=42980 "duration"="2.349252757s" "referrer"="http://192.168.10.36:1633/bzz/2f0d6bd9afb1d95f37b241f2b79f258c0af577288cb320a226ab46fecfd66011/" "user-agent"="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36"
bee-1.18.2-B.out:"time"="2023-12-18 23:09:57.041609" "level"="info" "logger"="node/api" "msg"="api access" "ip"="192.168.10.177" "method"="GET" "host"="192.168.10.36:1633" "uri"="/bzz/2f0d6bd9afb1d95f37b241f2b79f258c0af577288cb320a226ab46fecfd66011/8/127/85.png" "proto"="HTTP/1.1" "status"=200 "size"=50954 "duration"="2.634205123s" "referrer"="http://192.168.10.36:1633/bzz/2f0d6bd9afb1d95f37b241f2b79f258c0af577288cb320a226ab46fecfd66011/" "user-agent"="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36"

Steps to reproduce

Find a reference not already local in your node and fire off quick, concurrent /bytes (or it may have been /bzz) retrievals of that reference (see the sketch below). If the timing is right and quick enough, you'll end up with a chunk that will never leave your sharky, even after it has long since been flushed from the cache.
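For anyone trying to reproduce this, here is a hypothetical driver that fires six concurrent /bytes requests, mirroring the six downloads in the logs above; the node URL and the reference are placeholders for your own setup:

// Hypothetical reproduction driver: fire several concurrent /bytes requests
// for a reference that is not yet local to the node. The node URL and the
// reference below are placeholders.
package main

import (
	"fmt"
	"io"
	"net/http"
	"sync"
)

func main() {
	const n = 6 // mirrors the six concurrent downloads seen in the logs above
	ref := "2f0d6bd9afb1d95f37b241f2b79f258c0af577288cb320a226ab46fecfd66011"
	url := "http://localhost:1633/bytes/" + ref

	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			resp, err := http.Get(url)
			if err != nil {
				fmt.Printf("request %d: %v\n", i, err)
				return
			}
			defer resp.Body.Close()
			body, _ := io.ReadAll(resp.Body)
			fmt.Printf("request %d: status %d, %d bytes\n", i, resp.StatusCode, len(body))
		}(i)
	}
	wg.Wait()
}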

But there are NO tools or logging in the stock bee node that will ever let you see this effect, other than your sharky slowly growing larger and larger over time with no apparent explanation.

Possible solution

Protect the Cache.Putter from concurrent execution between the Has check and actually storing the chunk in the cache. This would ensure that concurrent invocations cannot all find the Has check empty and then each redundantly put the single keyed cacheEntry record while invoking ChunkStore().Put multiple times, incrementing the RefCnt several times even though only a single decrement happens when the cacheEntry is finally flushed. A sketch of such a guard follows the code reference below.

func (c *Cache) Putter(store internal.Storage) storage.Putter {
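As a hedged sketch only (the actual fix landed via #4546 and may differ), here is what the guard looks like as a method on the toy cache type from the illustration under Summary: the Has check and the chunk-store Put happen under one lock, so only the first of the concurrent callers increments the RefCnt.

// Hedged sketch of the proposed guard on the toy cache type shown earlier
// (not the actual fix): the check and the put become a single atomic step.
func (c *cache) PutGuarded(addr string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if _, has := c.entries[addr]; has {
		return // already cached: no extra refCnt increment
	}
	c.store.Put(addr)            // refCnt incremented exactly once
	c.entries[addr] = struct{}{} // cache entry recorded under the same lock
}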

@ldeffenb (Collaborator, Author) commented Dec 19, 2023

I just also realized that the Retrieval.RetrieveChunk singleflight, keyed on the chunk reference (with an optional _origin suffix, as would apply in this case), practically guarantees that multiple concurrent downloads of a single reference chunk will all complete within microseconds of each other. That is probably why I'm seeing so many of these so soon after browsing the OSM map manifest through this newly nuked node (see the sketch after the snippet below).

v, _, err := s.singleflight.Do(ctx, flightRoute, func(ctx context.Context) (swarm.Chunk, error) {
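For context, a minimal sketch of that behavior using the public golang.org/x/sync/singleflight package (bee calls its own context-aware variant, as the snippet above shows, so the signatures differ): all callers sharing the flight key are released with the same result at essentially the same instant, and each then hands the chunk to the cache putter inside the same tiny race window.

// Minimal singleflight sketch: one fetch runs, six callers return together.
package main

import (
	"fmt"
	"sync"
	"time"

	"golang.org/x/sync/singleflight"
)

func main() {
	var g singleflight.Group
	var wg sync.WaitGroup

	for i := 0; i < 6; i++ { // six concurrent retrievals of the same chunk
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			// Only one fetch actually runs; the rest wait on the same flight.
			v, _, shared := g.Do("714655aa_origin", func() (interface{}, error) {
				time.Sleep(200 * time.Millisecond) // simulated swarm retrieval
				return "chunk data", nil
			})
			// Every caller proceeds from here within microseconds of the others.
			fmt.Printf("caller %d got %v at %s (shared=%v)\n",
				i, v, time.Now().Format("15:04:05.000000"), shared)
		}(i)
	}
	wg.Wait()
}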

I've got 85 chunks with incorrect RefCnts out of 1,621,219 chunks in this freshly nuked database. Most of the chunks are in the reserve (1,937,226 entries; yes, more than my total sharky chunks, because a single chunk with multiple stamps counts multiple times in the reserve) versus the cache (33,387).
