
Reserve eviction messes up refCnt on previously-cached items #4530

Closed
ldeffenb opened this issue Dec 21, 2023 · 1 comment · Fixed by #4567

@ldeffenb
Collaborator

Context

1.18.2

Summary

If a chunk is already in the cache when it is evicted from the reserve, it ends up with an orphan refCnt when it is eventually removed from the cache.

Expected behavior

When all references to a chunk are gone, the refCnt should go to zero and the chunk should be removed.

Actual behavior

Consider the following sequence:

  1. A chunk is retrieved and placed into the cache, storing the chunk and setting the refCnt to 1.
  2. The same chunk is later discovered by pullsync and placed into the reserve, incrementing the refCnt to 2.
  3. The chunk is evicted from the reserve and placed into the cache via cacheCb/ShallowCopy. This does NOT change the refCnt (see the toy model after this list).
    if err := r.cacheCb(ctx, store, moveToCache...); err != nil {

    // ShallowCopy creates cache entries with the expectation that the chunk already exists in the chunkstore.
  4. The chunk is flushed out of the cache due to age, decrementing the refCnt to 1.
  5. There are no longer any actual references to the chunk, but it is now stuck in the ChunkStore due to the orphaned refCnt of 1.
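
To make the accounting concrete, here is a minimal, self-contained toy model of the sequence above. The names (store, cachePut, reserveEvict, and so on) are assumptions for illustration only, not Bee's actual storer code:

```go
package main

import "fmt"

type store struct {
	refCnt map[string]int  // toy stand-in for the chunkstore reference count
	cache  map[string]bool // at most one cache entry per address
}

func (s *store) cachePut(addr string) { // step 1: retrieval -> cache
	if !s.cache[addr] {
		s.cache[addr] = true
		s.refCnt[addr]++
	}
}

func (s *store) reservePut(addr string) { // step 2: pullsync -> reserve
	s.refCnt[addr]++
}

// reserveEvict models cacheCb/ShallowCopy: the cache entry is (re)created,
// but refCnt is left untouched on the assumption that the reserve's
// reference simply becomes the cache's reference.
func (s *store) reserveEvict(addr string) { // step 3
	s.cache[addr] = true
}

func (s *store) cacheRemoveOldest(addr string) { // step 4: cache flush by age
	if s.cache[addr] {
		delete(s.cache, addr)
		s.refCnt[addr]--
	}
}

func main() {
	s := &store{refCnt: map[string]int{}, cache: map[string]bool{}}
	addr := "41cd72a7..."

	s.cachePut(addr)          // refCnt = 1
	s.reservePut(addr)        // refCnt = 2
	s.reserveEvict(addr)      // refCnt still 2, but only one cache entry exists
	s.cacheRemoveOldest(addr) // refCnt = 1

	fmt.Println(s.refCnt[addr]) // 1 -- no references remain, yet the chunk is stuck in the chunkstore
}
```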

Here are the logs of two chunks going through this sequence on one of my nodes. The node did a rather large eviction pass at 03:05, which moved a bunch of chunks from the reserve to the cache, causing the RemoveOldest. Unfortunately, I didn't have logs on the actual chunks being evicted (but rest assured, I'm adding them now!).

First retrieval put the chunk into the cache:

bee-1.18.2-G.out:"time"="2023-12-20 04:56:48.881592" "level"="debug" "logger"="node/retrieval" "msg"="retrieved chunk" "chunk_address"="41cd72a73bd69adb33e02d82060381e324da8c2b917f068cfa121bf43cdf973b" "peer_address"="556d9ddb7434a00dae1a4b13a71b51ffc93a4d6ec24164df3673cc496b0b32a8" "peer_proximity"=3
bee-1.18.2-G.out:"time"="2023-12-20 04:56:48.881901" "level"="debug" "logger"="node/storer" "msg"="chunkTrace: chunkStore.Put new chunk NOstamp!" "why"="CachePutter(RetrievalCache)" "address"="41cd72a73bd69adb33e02d82060381e324da8c2b917f068cfa121bf43cdf973b" "loc"="6:57511(4104)"

Some time later, pullsync discovered that this chunk should be in the reserve:

bee-1.18.2-G.out:"time"="2023-12-20 05:51:18.561658" "level"="debug" "logger"="node/storer" "msg"="chunkTrace: chunkStore.Put increment chunk" "why"="Reserve(ReservePutter(Pullsync.Sync))" "address"="41cd72a73bd69adb33e02d82060381e324da8c2b917f068cfa121bf43cdf973b" "loc"="6:57511(4104)" "refCnt"=2 "batch_id"="19806cdb6f4c8582adfa496bee64dc3482903434db52e2b9f9e0ccbd25f587d7" "index"="000041cd00000000"

The chunk was evicted from the reserve (no detailed logs), and the influx of evicted chunks to the cache removed the original cache reference.

bee-1.18.2-I.out:"time"="2023-12-21 03:05:21.744250" "level"="debug" "logger"="node/storer" "msg"="chunkTrace: chunkStore.Delete decrement chunk" "why"="Cache.RemoveOldest" "address"="41cd72a73bd69adb33e02d82060381e324da8c2b917f068cfa121bf43cdf973b" "loc"="6:57511(4104)" "refCnt"=1

This second chunk went through exactly the same sequence:

bee-1.18.2-G.out:"time"="2023-12-20 04:54:44.307266" "level"="debug" "logger"="node/retrieval" "msg"="retrieved chunk" "chunk_address"="4799a9ed309683bca768e707d58609a8b297a21674c5ec052a3b7f8e600bf873" "peer_address"="5461338e8c1939477eb72a6835a41128baa8750358342c5a8ca2c13940fd1b57" "peer_proximity"=3
bee-1.18.2-G.out:"time"="2023-12-20 04:54:44.307631" "level"="debug" "logger"="node/storer" "msg"="chunkTrace: chunkStore.Put new chunk NOstamp!" "why"="CachePutter(RetrievalCache)" "address"="4799a9ed309683bca768e707d58609a8b297a21674c5ec052a3b7f8e600bf873" "loc"="8:57267(4104)"

bee-1.18.2-G.out:"time"="2023-12-20 05:51:11.651364" "level"="debug" "logger"="node/storer" "msg"="chunkTrace: chunkStore.Put increment chunk" "why"="Reserve(ReservePutter(Pullsync.Sync))" "address"="4799a9ed309683bca768e707d58609a8b297a21674c5ec052a3b7f8e600bf873" "loc"="8:57267(4104)" "refCnt"=2 "batch_id"="19806cdb6f4c8582adfa496bee64dc3482903434db52e2b9f9e0ccbd25f587d7" "index"="0000479900000000"

bee-1.18.2-I.out:"time"="2023-12-21 03:05:19.879231" "level"="debug" "logger"="node/storer" "msg"="chunkTrace: chunkStore.Delete decrement chunk" "why"="Cache.RemoveOldest" "address"="4799a9ed309683bca768e707d58609a8b297a21674c5ec052a3b7f8e600bf873" "loc"="8:57267(4104)" "refCnt"=1

Here are the Grafana metrics for the eviction:
(Grafana metrics screenshot)
And the non-detailed log of the batch being evicted:

bee-1.18.2-I.out:"time"="2023-12-21 03:05:35.956261" "level"="debug" "logger"="node/storer" "msg"="reserve eviction" "uptoBin"=3 "evicted"=250046 "batchID"="45ef9e72cad000467d828bd82b769f63ff848f216013c1866c0270533bc879e3" "new_size"=3944636

Steps to reproduce

Retrieve a chunk to put it in the cache. Wait and hope for pullsync to discover the same chunk and put it in the reserve. Wait longer for the chunk to flush from the cache. At this point, the chunk will still be in the ChunkStore with a refCnt of 1, even though there is no longer any actual need to keep the chunk.

Possible solution

The cache.ShallowCopy method needs to detect that the chunk is already in the cache and decrement the refCnt, to account for the reference being released by the source of the newly cached chunks. Either here:

entry := &cacheEntry{Address: addr, AccessTimestamp: now().UnixNano()}

but more likely here, so that the refCnt decrement (ChunkStore().Delete()) can go into the batch:
err = batch.Put(entry)

I also think some mutex locking may be needed here, in case a single chunk is concurrently evicted from multiple stamp batches in the reserve. Both ShallowCopy calls might see the chunk already in the cache and both decrement the refCnt, but then again, that is actually the behavior we should see in this case. Still, I suspect there is a possible race condition if multiple ShallowCopy calls execute concurrently.
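
For illustration, here is a minimal sketch of that idea, continuing the toy model above (again, the names and structure are assumptions, not the actual cache package): when an address is already cached, the incoming reference from the reserve is released under a lock, so the count reaches zero once the cache entry is eventually flushed.

```go
package main

import (
	"fmt"
	"sync"
)

type store struct {
	mu     sync.Mutex
	refCnt map[string]int
	cache  map[string]bool
}

// shallowCopy takes over references handed off by the reserve. If an address
// is already cached, the incoming reference is redundant and is released
// (the refCnt decrement that ChunkStore().Delete() would perform in Bee).
func (s *store) shallowCopy(addrs ...string) {
	s.mu.Lock() // serialize concurrent shallowCopy calls on the same addresses
	defer s.mu.Unlock()
	for _, addr := range addrs {
		if s.cache[addr] {
			s.refCnt[addr]-- // already cached: drop the extra reference
			continue
		}
		s.cache[addr] = true // new cache entry takes over the reserve's reference
	}
}

func (s *store) cacheRemoveOldest(addr string) { // cache flush by age
	if s.cache[addr] {
		delete(s.cache, addr)
		s.refCnt[addr]--
	}
}

func main() {
	addr := "41cd72a7..."
	// State just before the eviction in the sequence above: cached once,
	// referenced by both the cache and the reserve (refCnt = 2).
	s := &store{refCnt: map[string]int{addr: 2}, cache: map[string]bool{addr: true}}

	s.shallowCopy(addr)       // already cached: refCnt drops to 1
	s.cacheRemoveOldest(addr) // later cache flush: refCnt drops to 0

	fmt.Println(s.refCnt[addr]) // 0 -- the chunk can now be removed from the chunkstore
}
```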

@istae
Member

istae commented Jan 10, 2024

These headaches come from the fact that expired chunks are cached, and because moving chunks around between these stores is expensive, we introduced an optimization called "ShallowCopy". IMO, expired chunks should not be cached. If a user wants this behavior, they can retrieve the chunks just before the batch expiration so that the network temporarily caches them.
