Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add objects list caching for boltdb-shipper index store to reduce object storage list api calls #5160

Merged

Conversation

sandeepsukhani
Copy link
Contributor

What this PR does / why we need it:
We, as of now, do a LIST calls per table when we need to find its objects.
If someone has a lot of tables cached locally or has query readiness set to a large number of days, it results in many list calls because each querier tries to sync tables every 5 mins by default.
This PR reduces the number of LIST calls we make when using hosted object stores(S3, GCS, Azure Blob Storage and Swift) as a shared store for boltdb-shipper.

The idea is to do a flat listing of objects supported by hosted object stores mentioned above and cache it until it goes stale.

Special notes for your reviewer:
Since caching requires a flat listing supported only by hosted object stores, I have added a prefixedObjectClient, making the implementation somewhat cleaner. prefixedObjectClient takes care of adding/removing configured object prefix to the keys. Without prefixedObjectClient, we will have to also make the caching client aware of object prefixes.

Checklist

  • Tests updated

@sandeepsukhani sandeepsukhani requested a review from a team as a code owner January 17, 2022 10:49
@sandeepsukhani sandeepsukhani force-pushed the boltdb-shipper-list-caching branch from 1d20707 to 5bd0127 Compare January 17, 2022 11:02
@sandeepsukhani sandeepsukhani force-pushed the boltdb-shipper-list-caching branch from 5bd0127 to 9d2a103 Compare January 19, 2022 06:43
Copy link
Contributor

@chaudum chaudum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great stuff!

return nil, nil, fmt.Errorf("invalid prefix %s", prefix)
}

if !c.cacheBuiltAt.Add(cacheTimeout).After(time.Now()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could also be written as:

Suggested change
if !c.cacheBuiltAt.Add(cacheTimeout).After(time.Now()) {
if time.Since(c.cacheBuiltAt) > cacheTimeout {

which I find easier to read.

Comment on lines 56 to 68
select {
case c.rebuildCacheChan <- struct{}{}:
c.err = nil
c.err = c.buildCache(ctx)
<-c.rebuildCacheChan
if c.err != nil {
level.Error(util_log.Logger).Log("msg", "failed to build cache", "err", c.err)
}
default:
for !c.cacheBuiltAt.Add(cacheTimeout).After(time.Now()) && c.err == nil {
time.Sleep(time.Millisecond)
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a hard time understanding why you chose to use a channel here. I assume to block concurrent access on List(). First call is building the cache while all others wait until cache is built?

Copy link
Contributor Author

@sandeepsukhani sandeepsukhani Jan 19, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, just first or one of the concurrent calls to list should get to build the cache while others wait for it to finish successfully or with error. I will add a comment to make it clearer.

Comment on lines +115 to +118
c.tablesMtx.Lock()
defer c.tablesMtx.Unlock()

c.tables = map[string]*table{}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we decrease the lock time by assigning c.tables at the very end?

Suggested change
c.tablesMtx.Lock()
defer c.tablesMtx.Unlock()
c.tables = map[string]*table{}
new_tables := map[string]*table{}
...
c.tablesMtx.Lock()
defer c.tablesMtx.Unlock()
c.tables = new_tables
c.cacheBuiltAt = time.Now()
return nil

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We want to keep it locked until we build the cache to avoid returning stale results. Most of these list calls happen async so I am refreshing the cache on demand instead of running a goroutine refreshing it every min since we usually do these operations every 5 mins in index-gateway and 10 mins in compactor by default.

objects, commonPrefixes, err := cachedObjectClient.List(context.Background(), "", "")
require.NoError(t, err)
require.Equal(t, 1, objectClient.listCallsCount)
require.Equal(t, objects, []chunk.StorageObject{})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Arguments of the Equal function are in "incorrect" order:

Suggested change
require.Equal(t, objects, []chunk.StorageObject{})
require.Equal(t, []chunk.StorageObject{}, objects)

The function interface is

func Equal(t TestingT, expected interface{}, actual interface{}, msgAndArgs ...interface{})

This isn't a problem as long as expected and actual are equal, but the test error message is misleading in case they aren't.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Guess this is not only a problem in your test, but we have that all over the place.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, sorry I messed up the order. Fixed it.

}
default:
for time.Since(c.cacheBuiltAt) >= cacheTimeout && c.err == nil {
time.Sleep(time.Millisecond)
Copy link
Contributor

@cyriltovena cyriltovena Jan 19, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For loop and time.Sleep is a no no !

You want to use the promise pattern instead. Not sure if we can avoid a lock/ RW lock.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a sync.WaitGroup to make all the goroutines attempting to build the cache to wait until the operation gets over. Can you please check now whether it looks good?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep looks good.

Copy link
Contributor

@cyriltovena cyriltovena left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@sandeepsukhani sandeepsukhani merged commit 49ffe52 into grafana:main Jan 20, 2022
@slim-bean
Copy link
Collaborator

Fixes #5018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants