add objects list caching for boltdb-shipper index store to reduce object storage list api calls #5160
Conversation
Great stuff!
	return nil, nil, fmt.Errorf("invalid prefix %s", prefix)
}

if !c.cacheBuiltAt.Add(cacheTimeout).After(time.Now()) {
Could also be written as:
- if !c.cacheBuiltAt.Add(cacheTimeout).After(time.Now()) {
+ if time.Since(c.cacheBuiltAt) > cacheTimeout {
which I find easier to read.
select {
case c.rebuildCacheChan <- struct{}{}:
	c.err = nil
	c.err = c.buildCache(ctx)
	<-c.rebuildCacheChan
	if c.err != nil {
		level.Error(util_log.Logger).Log("msg", "failed to build cache", "err", c.err)
	}
default:
	for !c.cacheBuiltAt.Add(cacheTimeout).After(time.Now()) && c.err == nil {
		time.Sleep(time.Millisecond)
	}
}
I have a hard time understanding why you chose to use a channel here. I assume to block concurrent access on List(). First call is building the cache while all others wait until cache is built?
Yeah, just the first or one of the concurrent calls to List() should get to build the cache while the others wait for it to finish, either successfully or with an error. I will add a comment to make it clearer.
c.tablesMtx.Lock()
defer c.tablesMtx.Unlock()

c.tables = map[string]*table{}
Could we decrease the lock time by assigning c.tables at the very end?
- c.tablesMtx.Lock()
- defer c.tablesMtx.Unlock()
- c.tables = map[string]*table{}
+ new_tables := map[string]*table{}
+ ...
+ c.tablesMtx.Lock()
+ defer c.tablesMtx.Unlock()
+ c.tables = new_tables
c.cacheBuiltAt = time.Now()
return nil
We want to keep it locked until we build the cache to avoid returning stale results. Most of these list calls happen async, so I am refreshing the cache on demand instead of running a goroutine refreshing it every minute, since we usually do these operations every 5 mins in the index-gateway and 10 mins in the compactor by default.
objects, commonPrefixes, err := cachedObjectClient.List(context.Background(), "", "")
require.NoError(t, err)
require.Equal(t, 1, objectClient.listCallsCount)
require.Equal(t, objects, []chunk.StorageObject{})
Arguments of the Equal function are in "incorrect" order:
- require.Equal(t, objects, []chunk.StorageObject{})
+ require.Equal(t, []chunk.StorageObject{}, objects)
The function interface is func Equal(t TestingT, expected interface{}, actual interface{}, msgAndArgs ...interface{}). This isn't a problem as long as expected and actual are equal, but the test error message is misleading in case they aren't.
Guess this is not only a problem in your test, but we have that all over the place.
yeah, sorry I messed up the order. Fixed it.
}
default:
	for time.Since(c.cacheBuiltAt) >= cacheTimeout && c.err == nil {
		time.Sleep(time.Millisecond)
A for loop with time.Sleep is a no-no! You want to use the promise pattern instead. Not sure if we can avoid a lock / RW lock.
I added a sync.WaitGroup to make all the goroutines attempting to build the cache wait until the operation is over. Can you please check now whether it looks good?
yep looks good.
LGTM
Fixes #5018
What this PR does / why we need it:
As of now, we do a LIST call per table when we need to find its objects. If someone has a lot of tables cached locally or has query readiness set to a large number of days, this results in many LIST calls, because each querier tries to sync tables every 5 mins by default.
This PR reduces the number of LIST calls we make when using hosted object stores (S3, GCS, Azure Blob Storage and Swift) as a shared store for boltdb-shipper. The idea is to do a flat listing of objects, which the hosted object stores mentioned above support, and cache it until it goes stale.
Special notes for your reviewer:
Since caching requires a flat listing supported only by hosted object stores, I have added a prefixedObjectClient, making the implementation somewhat cleaner. prefixedObjectClient takes care of adding/removing the configured object prefix to/from the keys. Without prefixedObjectClient, we would also have to make the caching client aware of object prefixes.