
Reserve eviction messes up refCnt on previously-cached items #4530

Closed
ldeffenb opened this issue Dec 21, 2023 · 1 comment · Fixed by #4567

@ldeffenb
Collaborator

Context

1.18.2

Summary

If a chunk is already in the cache when it is evicted from the reserve, it ends up with an orphan refCnt when it is eventually removed from the cache.

Expected behavior

When all references to a chunk are gone, the refCnt should go to zero and the chunk should be removed.

Actual behavior

Consider the following sequence:

  1. A chunk is retrieved and placed into the cache, storing the chunk and setting the refCnt to 1.
  2. The same chunk is later discovered by pullsync and placed into the reserve, incrementing the refCnt to 2.
  3. The chunk is evicted from the reserve and placed into the cache via cacheCb/ShallowCopy. This does NOT change the refCnt (see the toy model after this list).
    if err := r.cacheCb(ctx, store, moveToCache...); err != nil {

    // ShallowCopy creates cache entries with the expectation that the chunk already exists in the chunkstore.
  4. The chunk is flushed out of the cache due to age, decrementing the refCnt to 1.
  5. There are no longer any actual references to the chunk, but it is now stuck in the ChunkStore due to the orphaned refCnt of 1.
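
To make the accounting concrete, here is a minimal, self-contained toy model of the sequence above. The names (store, cachePut, reserveEvict, and so on) are assumptions for illustration only, not Bee's actual storer code:

```go
package main

import "fmt"

type store struct {
	refCnt map[string]int  // toy stand-in for the chunkstore reference count
	cache  map[string]bool // at most one cache entry per address
}

func (s *store) cachePut(addr string) { // step 1: retrieval -> cache
	if !s.cache[addr] {
		s.cache[addr] = true
		s.refCnt[addr]++
	}
}

func (s *store) reservePut(addr string) { // step 2: pullsync -> reserve
	s.refCnt[addr]++
}

// reserveEvict models cacheCb/ShallowCopy: the cache entry is (re)created,
// but refCnt is left untouched on the assumption that the reserve's
// reference simply becomes the cache's reference.
func (s *store) reserveEvict(addr string) { // step 3
	s.cache[addr] = true
}

func (s *store) cacheRemoveOldest(addr string) { // step 4: cache flush by age
	if s.cache[addr] {
		delete(s.cache, addr)
		s.refCnt[addr]--
	}
}

func main() {
	s := &store{refCnt: map[string]int{}, cache: map[string]bool{}}
	addr := "41cd72a7..."

	s.cachePut(addr)          // refCnt = 1
	s.reservePut(addr)        // refCnt = 2
	s.reserveEvict(addr)      // refCnt still 2, but only one cache entry exists
	s.cacheRemoveOldest(addr) // refCnt = 1

	fmt.Println(s.refCnt[addr]) // 1 -- no references remain, yet the chunk is stuck in the chunkstore
}
```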

Here are the logs of two chunks going through this sequence on one of my nodes. The node did a rather large eviction pass at 03:05, which moved a bunch of chunks from the reserve to the cache, causing the RemoveOldest. Unfortunately, I didn't have logs on the actual chunks being evicted (but rest assured, I'm adding them now!).

First retrieval put the chunk into the cache:

bee-1.18.2-G.out:"time"="2023-12-20 04:56:48.881592" "level"="debug" "logger"="node/retrieval" "msg"="retrieved chunk" "chunk_address"="41cd72a73bd69adb33e02d82060381e324da8c2b917f068cfa121bf43cdf973b" "peer_address"="556d9ddb7434a00dae1a4b13a71b51ffc93a4d6ec24164df3673cc496b0b32a8" "peer_proximity"=3
bee-1.18.2-G.out:"time"="2023-12-20 04:56:48.881901" "level"="debug" "logger"="node/storer" "msg"="chunkTrace: chunkStore.Put new chunk NOstamp!" "why"="CachePutter(RetrievalCache)" "address"="41cd72a73bd69adb33e02d82060381e324da8c2b917f068cfa121bf43cdf973b" "loc"="6:57511(4104)"

Some time later, pullsync discovered that this chunk should be in the reserve:

bee-1.18.2-G.out:"time"="2023-12-20 05:51:18.561658" "level"="debug" "logger"="node/storer" "msg"="chunkTrace: chunkStore.Put increment chunk" "why"="Reserve(ReservePutter(Pullsync.Sync))" "address"="41cd72a73bd69adb33e02d82060381e324da8c2b917f068cfa121bf43cdf973b" "loc"="6:57511(4104)" "refCnt"=2 "batch_id"="19806cdb6f4c8582adfa496bee64dc3482903434db52e2b9f9e0ccbd25f587d7" "index"="000041cd00000000"

The chunk was evicted from the reserve (no detailed logs), and the influx of evicted chunks to the cache removed the original cache reference.

bee-1.18.2-I.out:"time"="2023-12-21 03:05:21.744250" "level"="debug" "logger"="node/storer" "msg"="chunkTrace: chunkStore.Delete decrement chunk" "why"="Cache.RemoveOldest" "address"="41cd72a73bd69adb33e02d82060381e324da8c2b917f068cfa121bf43cdf973b" "loc"="6:57511(4104)" "refCnt"=1

This second chunk went through exactly the same sequence:

bee-1.18.2-G.out:"time"="2023-12-20 04:54:44.307266" "level"="debug" "logger"="node/retrieval" "msg"="retrieved chunk" "chunk_address"="4799a9ed309683bca768e707d58609a8b297a21674c5ec052a3b7f8e600bf873" "peer_address"="5461338e8c1939477eb72a6835a41128baa8750358342c5a8ca2c13940fd1b57" "peer_proximity"=3
bee-1.18.2-G.out:"time"="2023-12-20 04:54:44.307631" "level"="debug" "logger"="node/storer" "msg"="chunkTrace: chunkStore.Put new chunk NOstamp!" "why"="CachePutter(RetrievalCache)" "address"="4799a9ed309683bca768e707d58609a8b297a21674c5ec052a3b7f8e600bf873" "loc"="8:57267(4104)"

bee-1.18.2-G.out:"time"="2023-12-20 05:51:11.651364" "level"="debug" "logger"="node/storer" "msg"="chunkTrace: chunkStore.Put increment chunk" "why"="Reserve(ReservePutter(Pullsync.Sync))" "address"="4799a9ed309683bca768e707d58609a8b297a21674c5ec052a3b7f8e600bf873" "loc"="8:57267(4104)" "refCnt"=2 "batch_id"="19806cdb6f4c8582adfa496bee64dc3482903434db52e2b9f9e0ccbd25f587d7" "index"="0000479900000000"

bee-1.18.2-I.out:"time"="2023-12-21 03:05:19.879231" "level"="debug" "logger"="node/storer" "msg"="chunkTrace: chunkStore.Delete decrement chunk" "why"="Cache.RemoveOldest" "address"="4799a9ed309683bca768e707d58609a8b297a21674c5ec052a3b7f8e600bf873" "loc"="8:57267(4104)" "refCnt"=1

Here are the Grafana metrics for the eviction:
(Grafana metrics screenshot)
And the non-detailed log of the batch being evicted:

bee-1.18.2-I.out:"time"="2023-12-21 03:05:35.956261" "level"="debug" "logger"="node/storer" "msg"="reserve eviction" "uptoBin"=3 "evicted"=250046 "batchID"="45ef9e72cad000467d828bd82b769f63ff848f216013c1866c0270533bc879e3" "new_size"=3944636

Steps to reproduce

Retrieve a chunk to put it in the cache. Wait and hope for pullsync to discover the same chunk and put it in the reserve. Wait longer for the chunk to flush from the cache. At this point, the chunk will still be in the ChunkStore with a refCnt of 1, even though there is no longer any actual need to keep the chunk.

Possible solution

The cache.ShallowCopy method needs to detect that the chunk is already in the cache and decrement the refCnt, to account for the reference being released by the source of the newly cached chunks. Either here:

entry := &cacheEntry{Address: addr, AccessTimestamp: now().UnixNano()}

but more likely here, so that the refCnt decrement (ChunkStore().Delete()) can go into the batch:
err = batch.Put(entry)

I also think some mutex locking may be needed here, in case a single chunk is concurrently evicted from multiple stamp batches in the reserve. Both ShallowCopy calls might see the chunk already in the cache and both decrement the refCnt, but then again, that is actually the behavior we should see in this case. Still, I suspect there is a possible race condition if multiple ShallowCopy calls execute concurrently.
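
For illustration, here is a minimal sketch of that idea, continuing the toy model above (again, the names and structure are assumptions, not the actual cache package): when an address is already cached, the incoming reference from the reserve is released under a lock, so the count reaches zero once the cache entry is eventually flushed.

```go
package main

import (
	"fmt"
	"sync"
)

type store struct {
	mu     sync.Mutex
	refCnt map[string]int
	cache  map[string]bool
}

// shallowCopy takes over references handed off by the reserve. If an address
// is already cached, the incoming reference is redundant and is released
// (the refCnt decrement that ChunkStore().Delete() would perform in Bee).
func (s *store) shallowCopy(addrs ...string) {
	s.mu.Lock() // serialize concurrent shallowCopy calls on the same addresses
	defer s.mu.Unlock()
	for _, addr := range addrs {
		if s.cache[addr] {
			s.refCnt[addr]-- // already cached: drop the extra reference
			continue
		}
		s.cache[addr] = true // new cache entry takes over the reserve's reference
	}
}

func (s *store) cacheRemoveOldest(addr string) { // cache flush by age
	if s.cache[addr] {
		delete(s.cache, addr)
		s.refCnt[addr]--
	}
}

func main() {
	addr := "41cd72a7..."
	// State just before the eviction in the sequence above: cached once,
	// referenced by both the cache and the reserve (refCnt = 2).
	s := &store{refCnt: map[string]int{addr: 2}, cache: map[string]bool{addr: true}}

	s.shallowCopy(addr)       // already cached: refCnt drops to 1
	s.cacheRemoveOldest(addr) // later cache flush: refCnt drops to 0

	fmt.Println(s.refCnt[addr]) // 0 -- the chunk can now be removed from the chunkstore
}
```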

@istae
Member

istae commented Jan 10, 2024

These headaches come from the fact that expired chunks are cached, and because moving chunks around between these stores is expensive, we introduced an optimization called "ShallowCopy". IMO, expired chunks should not be cached. If a user wants this behavior, they can retrieve the chunks just before the batch expiration so that the network temporarily caches them.
