Context
1.18.2
Summary
If a chunk is already in the cache when it is evicted from the reserve, it ends up with an orphan refCnt when it is eventually removed from the cache.
Expected behavior
When all references to a chunk are gone, the refCnt should go to zero and the chunk should be removed.
Actual behavior
Consider the following sequence:
1. A retrieval puts the chunk into the cache, creating it in the ChunkStore with refCnt = 1.
2. PullSync later discovers the same chunk and puts it into the reserve, incrementing the refCnt to 2.
3. The chunk's batch is evicted from the reserve, and the eviction moves the chunk into the cache via cache.ShallowCopy. ShallowCopy does not touch the ChunkStore (it expects the chunk to already be there), and because the chunk is already in the cache, nothing accounts for the reference the reserve gives up:
bee/pkg/storer/internal/reserve/reserve.go
Line 366 in ad2602c
bee/pkg/storer/internal/cache/cache.go
Line 176 in ad2602c
// ShallowCopy creates cache entries with the expectation that the chunk already exists in the chunkstore.
4. The chunk is flushed out of the cache due to age, decrementing the refCnt to 1.
5. There are no longer any actual references to the chunk, but it is now stuck in the ChunkStore because of the orphaned refCnt of 1.
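To make the bookkeeping concrete, here is a minimal, self-contained model of the refCnt arithmetic. This is not the real bee ChunkStore API; the refCountingStore type and its methods are made up purely to illustrate how the count drifts away from the number of actual references.

```go
// Simplified model of the refCnt bookkeeping described above.
package main

import "fmt"

// refCountingStore counts one reference per Put and releases one per Delete.
type refCountingStore struct {
	refCnt map[string]int
}

func (s *refCountingStore) Put(addr string) { s.refCnt[addr]++ }

func (s *refCountingStore) Delete(addr string) {
	if s.refCnt[addr] > 0 {
		s.refCnt[addr]--
	}
	if s.refCnt[addr] == 0 {
		delete(s.refCnt, addr) // the chunk data would be reclaimed here
	}
}

func main() {
	store := &refCountingStore{refCnt: map[string]int{}}
	addr := "41cd72a7..." // the first chunk from the logs below

	store.Put(addr) // 1. retrieval caches the chunk       -> refCnt 1
	store.Put(addr) // 2. pullsync adds it to the reserve  -> refCnt 2

	// 3. batch eviction: ShallowCopy hands the reserve's reference to the
	//    cache without calling Put or Delete. The chunk is already cached,
	//    so only one real reference remains while refCnt is still 2.

	store.Delete(addr) // 4. the cache flushes its entry (RemoveOldest) -> refCnt 1

	// 5. nothing references the chunk any more, but refCnt never reaches 0.
	fmt.Println("leaked refCnt:", store.refCnt[addr]) // prints: leaked refCnt: 1
}
```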
Here are the logs of two chunks going through this sequence on one of my nodes. The node did a rather large eviction pass at 03:05 which moved a bunch of chunks from the reserve to the cache, triggering the RemoveOldest. Unfortunately, I didn't have logs on the actual chunks being evicted (but rest assured, I'm adding them now!).
First, retrieval put the chunk into the cache:
bee-1.18.2-G.out:"time"="2023-12-20 04:56:48.881592" "level"="debug" "logger"="node/retrieval" "msg"="retrieved chunk" "chunk_address"="41cd72a73bd69adb33e02d82060381e324da8c2b917f068cfa121bf43cdf973b" "peer_address"="556d9ddb7434a00dae1a4b13a71b51ffc93a4d6ec24164df3673cc496b0b32a8" "peer_proximity"=3
bee-1.18.2-G.out:"time"="2023-12-20 04:56:48.881901" "level"="debug" "logger"="node/storer" "msg"="chunkTrace: chunkStore.Put new chunk NOstamp!" "why"="CachePutter(RetrievalCache)" "address"="41cd72a73bd69adb33e02d82060381e324da8c2b917f068cfa121bf43cdf973b" "loc"="6:57511(4104)"
Some time later, PullSync discovered that this chunk should be in the reserve:
bee-1.18.2-G.out:"time"="2023-12-20 05:51:18.561658" "level"="debug" "logger"="node/storer" "msg"="chunkTrace: chunkStore.Put increment chunk" "why"="Reserve(ReservePutter(Pullsync.Sync))" "address"="41cd72a73bd69adb33e02d82060381e324da8c2b917f068cfa121bf43cdf973b" "loc"="6:57511(4104)" "refCnt"=2 "batch_id"="19806cdb6f4c8582adfa496bee64dc3482903434db52e2b9f9e0ccbd25f587d7" "index"="000041cd00000000"
The chunk was evicted from the reserve (no detailed logs), and the influx of evicted chunks into the cache removed the original cache reference:
bee-1.18.2-I.out:"time"="2023-12-21 03:05:21.744250" "level"="debug" "logger"="node/storer" "msg"="chunkTrace: chunkStore.Delete decrement chunk" "why"="Cache.RemoveOldest" "address"="41cd72a73bd69adb33e02d82060381e324da8c2b917f068cfa121bf43cdf973b" "loc"="6:57511(4104)" "refCnt"=1
A second chunk went through exactly the same sequence:
bee-1.18.2-G.out:"time"="2023-12-20 04:54:44.307266" "level"="debug" "logger"="node/retrieval" "msg"="retrieved chunk" "chunk_address"="4799a9ed309683bca768e707d58609a8b297a21674c5ec052a3b7f8e600bf873" "peer_address"="5461338e8c1939477eb72a6835a41128baa8750358342c5a8ca2c13940fd1b57" "peer_proximity"=3
bee-1.18.2-G.out:"time"="2023-12-20 04:54:44.307631" "level"="debug" "logger"="node/storer" "msg"="chunkTrace: chunkStore.Put new chunk NOstamp!" "why"="CachePutter(RetrievalCache)" "address"="4799a9ed309683bca768e707d58609a8b297a21674c5ec052a3b7f8e600bf873" "loc"="8:57267(4104)"
bee-1.18.2-G.out:"time"="2023-12-20 05:51:11.651364" "level"="debug" "logger"="node/storer" "msg"="chunkTrace: chunkStore.Put increment chunk" "why"="Reserve(ReservePutter(Pullsync.Sync))" "address"="4799a9ed309683bca768e707d58609a8b297a21674c5ec052a3b7f8e600bf873" "loc"="8:57267(4104)" "refCnt"=2 "batch_id"="19806cdb6f4c8582adfa496bee64dc3482903434db52e2b9f9e0ccbd25f587d7" "index"="0000479900000000"
bee-1.18.2-I.out:"time"="2023-12-21 03:05:19.879231" "level"="debug" "logger"="node/storer" "msg"="chunkTrace: chunkStore.Delete decrement chunk" "why"="Cache.RemoveOldest" "address"="4799a9ed309683bca768e707d58609a8b297a21674c5ec052a3b7f8e600bf873" "loc"="8:57267(4104)" "refCnt"=1
Here are the Grafana metrics for the eviction:
And the non-detailed log of the batch being evicted:
Steps to reproduce
Retrieve a chunk so that it is put into the cache. Wait and hope for pullsync to discover the same chunk and put it into the reserve. Wait for the chunk's batch to be evicted from the reserve, and then for the chunk to be flushed from the cache. At this point, the chunk will still be in the ChunkStore with a refCnt of 1 even though there is no longer any actual need to keep it.
Possible solution
The cache.ShallowCopy method needs to detect that the chunk is already in the cache and decrement the refCnt, to account for the reference that the source of the newly cached chunks no longer holds. Either here:
bee/pkg/storer/internal/cache/cache.go
Line 201 in ad2602c
but more likely here, so that the refCnt decrement (ChunkStore().Delete()) can go into the batch:
bee/pkg/storer/internal/cache/cache.go
Line 218 in ad2602c
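Something like the following is the shape of the fix I have in mind. This is only a sketch: the batchedStore interface and its methods are hypothetical stand-ins for bee's internal storage types, not the real signatures in pkg/storer/internal.

```go
// Hypothetical sketch of the ShallowCopy fix; the types below stand in for
// bee's internal storage abstractions and do not match the real API.
package cachefix

import "context"

type Address = string

// batchedStore is a stand-in for the transactional store ShallowCopy writes to.
type batchedStore interface {
	HasCacheEntry(addr Address) (bool, error)            // is the address already in the cache index?
	PutCacheEntry(addr Address) error                    // create the new cache entry
	DeleteChunk(ctx context.Context, addr Address) error // ChunkStore().Delete(): drops one refCnt
}

// shallowCopy moves evicted chunks into the cache. For addresses that are
// already cached it releases the extra ChunkStore reference instead of
// creating a second cache entry, keeping refCnt equal to the number of
// real references.
func shallowCopy(ctx context.Context, store batchedStore, addrs ...Address) error {
	for _, addr := range addrs {
		exists, err := store.HasCacheEntry(addr)
		if err != nil {
			return err
		}
		if exists {
			// The reserve's reference is going away and the cache already
			// holds one, so decrement within the same batch.
			if err := store.DeleteChunk(ctx, addr); err != nil {
				return err
			}
			continue
		}
		if err := store.PutCacheEntry(addr); err != nil {
			return err
		}
	}
	return nil
}
```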
But I also think some mutex locking may be needed here, in case a single chunk is concurrently evicted from multiple stamp batches in the reserve. Both ShallowCopy calls may see the chunk already in the cache and both will decrement the refCnt; then again, that is actually the behavior we should see in that case, since each eviction gives up its own reference. Never mind... but I still feel there is a possible race condition if multiple ShallowCopy calls execute concurrently.
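If locking does turn out to be necessary, the simplest mitigation would be to serialize ShallowCopy calls behind a cache-wide mutex, along the lines of the sketch below (again hypothetical, reusing the types from the sketch above rather than bee's actual Cache type):

```go
// Continues the hypothetical sketch above: serialize ShallowCopy calls so the
// "already in the cache?" check and the refCnt decrement cannot interleave
// with a concurrent eviction of the same chunk.
package cachefix

import (
	"context"
	"sync"
)

type lockedCache struct {
	mu sync.Mutex
}

// ShallowCopy runs the sketched shallowCopy while holding a cache-wide lock.
func (c *lockedCache) ShallowCopy(ctx context.Context, store batchedStore, addrs ...Address) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	return shallowCopy(ctx, store, addrs...)
}
```

A finer-grained per-address lock would also work; the cache-wide lock is just the simplest way to illustrate the idea.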
A follow-up comment on the issue: these headaches come from the fact that expired chunks are cached, and because moving chunks around between these stores is expensive, we introduced an optimization called "ShallowCopy". IMO, expired chunks should not be cached. If a user wants this behavior, they can retrieve the chunks just before the batch expires so that the network temporarily caches them.