Context
bee version: 1.18.2
Summary
Rapidly downloading a non-local reference can result in chunkStore RefCnt inconsistencies relative to the cache records. This happens because there is no mutex protection in the Cache.Putter method.
bee/pkg/storer/internal/cache/cache.go, line 69 (commit 1a9e1be)
Expected behavior
If a chunk is retrieved multiple times in quick succession, I'd still expect it to be inserted into the cache, with a corresponding chunkStore RefCnt increment, only ONCE.
Actual behavior
I've been analyzing the RefCnts in the ChunkStore and added extensive logging (called chunkTrace) to my bee node to track whenever the RefCnt is incremented or decremented, as well as when actual chunks are initially stored to, and finally removed from, sharky. As an extension of this logging, I've recently started working on a RefCnt analyzer/fixer and, much to my surprise, I have a bunch of chunks whose RefCnts don't match what is represented by the cache, reserve, and pins (ignoring upload for the moment).
I db nuked a node in testnet and started it up with even more extensive logging that reveals why ChunkStore.Put() and .Delete() were invoked. In just a few hours of running, I have RefCnts that are higher than is justified, and in every case it was a Download Retrieval CachePutter that incorrectly incremented the RefCnt for a chunk while storing only a single entry in the actual cache store.
I requested a download of a specific reference (maybe two) multiple times in quick succession. I don't remember why (I suspect it was the OSM map browser), but that's what the logs show. Since the 2 chunks making up that reference were not local, they were fetched from the swarm, concurrently, for each of my SIX download requests.
When these six retrievals (times 2 chunks) completed, the netstore.Download method invoked the cache putter to save the chunks. As you can see in the logs below, this all occurred in less than 1 millisecond for each of the two chunks.
bee/pkg/storer/netstore.go, line 94 (commit 1a9e1be)
In actual fact, these weren't redundant fetches of a single reference from the API, but the internal manifest resolution fetching one of the nodes along the /8/12... path, as seen in the API completion logs.
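To make the suspected interleaving concrete, here's a minimal, self-contained sketch of the race. This is NOT bee's actual code; chunkStore, cachePutter, and all field names here are invented for illustration. It models a Has-then-Put sequence with no lock spanning the two steps: run it a few times and the refCnt for the single chunk frequently ends up greater than 1, even though only one cache entry is ever written.

package main

import (
	"fmt"
	"sync"
)

// chunkStore is a stand-in for the real reference-counted ChunkStore.
type chunkStore struct {
	mu     sync.Mutex
	refCnt map[string]int
}

func (s *chunkStore) Put(addr string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.refCnt[addr]++ // every Put bumps the reference count
}

// cachePutter mimics an unguarded Has-then-Put cache putter.
type cachePutter struct {
	entries sync.Map // cacheEntry records, keyed by address
	store   *chunkStore
}

func (c *cachePutter) Put(addr string) {
	// Has: if the cache entry already exists, do nothing.
	if _, ok := c.entries.Load(addr); ok {
		return
	}
	// Without a lock spanning the check above and the writes below, several
	// goroutines can all observe "not present" and all call store.Put.
	c.store.Put(addr)
	c.entries.Store(addr, struct{}{}) // only one keyed cacheEntry survives
}

func main() {
	store := &chunkStore{refCnt: make(map[string]int)}
	cache := &cachePutter{store: store}

	const concurrentDownloads = 6 // mirrors the six near-simultaneous retrievals
	var wg sync.WaitGroup
	for i := 0; i < concurrentDownloads; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			cache.Put("chunk-abc") // same chunk arriving from every retrieval
		}()
	}
	wg.Wait()

	fmt.Println("cache entries: 1, refCnt:", store.refCnt["chunk-abc"])
}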
Steps to reproduce
Find a reference that is not already local in your node and fire off quick, concurrent /bytes (or it may have been /bzz) retrievals of that reference, for example with a small loop like the sketch below. If the timing is right and quick enough, you'll end up with a chunk that will never leave your sharky, even long after it has been flushed from the cache.
But there are NO tools or logging in the stock bee node that will ever let you see this effect, other than your sharky slowly growing larger and larger over time with no apparent explanation.
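A sketch of that reproduction step: the reference is a placeholder you must replace with one that is not yet local, and localhost:1633 with /bytes is the stock bee API default, so adjust if your node listens elsewhere.

package main

import (
	"fmt"
	"io"
	"net/http"
	"sync"
)

func main() {
	// Placeholder: any reference that is NOT already stored locally on the node.
	const ref = "replace-with-a-non-local-reference"
	// Default bee API address; adjust host/port if your node is configured differently.
	url := "http://localhost:1633/bytes/" + ref

	const parallel = 6 // roughly matches the six concurrent downloads seen in the logs
	var wg sync.WaitGroup
	for i := 0; i < parallel; i++ {
		wg.Add(1)
		go func(n int) {
			defer wg.Done()
			resp, err := http.Get(url)
			if err != nil {
				fmt.Println("request", n, "failed:", err)
				return
			}
			defer resp.Body.Close()
			body, _ := io.ReadAll(resp.Body)
			fmt.Println("request", n, "status", resp.StatusCode, "bytes", len(body))
		}(i)
	}
	wg.Wait()
}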
Possible solution
Protect Cache.Putter from concurrent execution between the Has check and actually storing the chunk in the cache; a minimal sketch of this guard follows the code reference below. This will ensure that concurrent invocations can't all sail past a Has that reports the chunk as absent and each redundantly write the same (keyed) cacheEntry record while invoking ChunkStore().Put multiple times, which increments the RefCnt multiple times yet yields only a single decrement when the cacheEntry is finally flushed.
bee/pkg/storer/internal/cache/cache.go, line 69 (commit 1a9e1be)
I just also realized that the Retrieval.RetrieveChunk singleflight, keyed on the chunk reference (with an optional _origin suffix, as would be the case here), practically guarantees that multiple concurrent downloads of a single reference's chunks will all complete within microseconds of each other, which is probably why I'm seeing so many of these so soon after browsing the OSM map manifest through this newly nuked node.
I've got 85 chunks with incorrect RefCnts out of 1,621,219 chunks in this freshly nuked database. Most of those chunks are in the reserve (1,937,226; yes, more than my total sharky chunks, because a single chunk with multiple stamps counts multiple times in the reserve) vs the cache (33,387).
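As a minimal sketch of the proposed guard, continuing the illustrative types from the race sketch in the Actual behavior section above (this is not the real cache.go code): hold a mutex across the Has check and both writes so only the first concurrent caller ever reaches ChunkStore().Put.

// guardedPutter is the same illustrative putter as in the earlier sketch, but
// with a mutex spanning the Has check and the stores so check-then-act is atomic.
type guardedPutter struct {
	mu      sync.Mutex
	entries map[string]struct{} // cacheEntry records, keyed by address
	store   *chunkStore         // same illustrative chunkStore type as above
}

func (c *guardedPutter) Put(addr string) {
	c.mu.Lock()
	defer c.mu.Unlock()

	// Has: with the lock held, only one of the concurrent callers sees "absent".
	if _, ok := c.entries[addr]; ok {
		return
	}
	c.store.Put(addr)            // RefCnt incremented exactly once
	c.entries[addr] = struct{}{} // single keyed cacheEntry, as before
}

A single global mutex serializes all cache puts; a per-address lock, or reusing the existing singleflight pattern keyed on the chunk address, would avoid that contention while still closing the window, but either way the Has check and the ChunkStore().Put must happen under the same guard.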