cache: fix bug where TTLs were ignored leading to leaked memory in client agents #9978

Merged
banks merged 3 commits into master from cache-fixes on Apr 8, 2021

Conversation

@banks banks (Member) commented on Apr 7, 2021

While testing Consul at scale we found an issue where agent cache entries might never expire.

It occurs for cache types that cancel requests and so return from Fetch when Close is called.

Currently, only streaming cache types have this behaviour, but it is a supported behaviour for future cache types too. The race here occurs very frequently (i.e. virtually all the time):

  1. Value expires in the cache; the entry is removed and then Close is called on the cache type.
  2. This immediately cancels a context, which causes the goroutine running the current background Fetch call to return.
  3. The cache handles the Fetch return in its fetch method and immediately re-inserts the resulting error into the cache entry that was just removed, touches the TTL, and lets it live again!

Never removing the cache entry even after clients have stopped requesting the data effectively causes a permanent memory leak, which could be exacerbated by heavy churn in the watches or tokens used during a client agent's lifetime.

Even if we move streaming types away from the cache, it's worth fixing this so that the Close behaviour is safe to use in future cache types without this leak.

This mostly affects streaming, since streaming will immediately return from Fetch calls when the state is Closed on eviction, which causes the race condition every time.

However, this also affects all other cache types if the fetch call happens to return between the eviction and the next pass around the Get loop by any client.

There is a separate bug that allows cache items to be evicted even when there are active clients, which is the trigger here.
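
For illustration, here is a minimal Go sketch of the guard this change adds to the cache's fetch completion path. The Cache, entry, and storeFetchResult names are simplified stand-ins for this sketch rather than the real agent/cache types; the point is the presence check before writing the result back, mirroring the `if _, ok := c.entries[key]; !ok` guard in the diff below.

```go
package cache

import (
	"sync"
	"time"
)

// Simplified stand-ins for the real agent/cache types.
type entry struct {
	value  interface{}
	expiry time.Time
}

type Cache struct {
	mu      sync.Mutex
	entries map[string]*entry
}

// storeFetchResult models the tail of the background fetch: previously the
// result (even an error caused by Close cancelling the request) was
// unconditionally re-inserted, re-arming the TTL of an entry that had just
// been evicted.
func (c *Cache) storeFetchResult(key string, value interface{}, ttl time.Duration) {
	c.mu.Lock()
	defer c.mu.Unlock()

	// The fix: if the entry was evicted while the fetch was in flight,
	// drop the result instead of resurrecting the entry.
	if _, ok := c.entries[key]; !ok {
		return
	}

	c.entries[key] = &entry{value: value, expiry: time.Now().Add(ttl)}
}
```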
@banks banks added theme/streaming Related to Streaming connections between server and client backport/1.9 labels Apr 7, 2021
@banks banks added this to the 1.9.x milestone Apr 7, 2021
@banks banks requested a review from a team April 7, 2021 13:42
@github-actions github-actions bot added the theme/agent-cache Agent Cache label Apr 7, 2021
@dnephin dnephin (Contributor) left a comment:
LGTM

@@ -723,6 +723,22 @@ func (c *Cache) fetch(key string, r getOptions, allowNew bool, attempt uint, ign
// Set our entry
c.entriesLock.Lock()

if _, ok := c.entries[key]; !ok {
@dnephin dnephin (Contributor) commented on this line:
I wonder if it is possible for a new entry to be added to the entries map while the fetch is happening. In that case we would still leak a goroutine, right? We could compare the entry here, maybe by using the Expiry, which is a pointer.

Suggested change:
- if _, ok := c.entries[key]; !ok {
+ if cachedEntry, ok := c.entries[key]; !ok || cachedEntry.Expiry != entry.Expiry {

Probably not a big deal since the follow-up PR makes this effectively impossible.

@banks banks (Member, Author) replied:
🤔 I think if we get a double race, and the entry expires and is then legitimately recreated by a new client Get during fetch, then the behavior is still correct (i.e. not leaky), because by definition the client that recreated the entry is still interested in keeping it alive, right? So it's not really being leaked in that case, but rather kept because there is still a client that cares about it.

But maybe I'm missing a possible edge case?
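
For illustration, a sketch of the stricter generation check discussed in this thread, continuing the same hypothetical package and types as the sketch in the description above; comparing the *entry pointer here plays the role that comparing the Expiry pointer plays in the suggested change.

```go
// storeFetchResultStrict only writes the result back if the entry is the
// exact generation this fetch was started for. If the key was evicted and
// then recreated by a fresh client Get, the pointers differ and the stale
// result is dropped. With the simpler presence check, that recreated entry
// is kept alive instead, which, as noted above, is also fine because a live
// client asked for it.
func (c *Cache) storeFetchResultStrict(key string, started *entry, value interface{}, ttl time.Duration) {
	c.mu.Lock()
	defer c.mu.Unlock()

	current, ok := c.entries[key]
	if !ok || current != started {
		return
	}
	c.entries[key] = &entry{value: value, expiry: time.Now().Add(ttl)}
}
```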

@banks banks merged commit ee04d45 into master Apr 8, 2021
@banks banks deleted the cache-fixes branch April 8, 2021 10:08
@hashicorp-ci (Contributor) commented:
🍒 If backport labels were added before merging, cherry-picking will start automatically.

To retroactively trigger a backport after merging, add backport labels and re-run https://circleci.com/gh/hashicorp/consul/347134.

@hashicorp-ci (Contributor) commented:
🍒✅ Cherry pick of commit ee04d45 onto release/1.9.x succeeded!

hashicorp-ci pushed a commit that referenced this pull request Apr 8, 2021
cache: fix bug where TTLs were ignored leading to leaked memory in client agents (#9978)

* Fix bug in cache where TTLs are effectively ignored

* Add changelog entry

* Update .changelog/9978.txt
Labels
theme/agent-cache (Agent Cache), theme/streaming (Related to Streaming connections between server and client)
3 participants