-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
create smaller unique files from boltdb shipper and other code refactorings #2261
create smaller unique files from boltdb shipper and other code refactorings #2261
Conversation
d1a2af1
to
16ce8df
Compare
Codecov Report
@@ Coverage Diff @@
## master #2261 +/- ##
==========================================
+ Coverage 61.55% 62.70% +1.15%
==========================================
Files 160 162 +2
Lines 13612 13847 +235
==========================================
+ Hits 8379 8683 +304
+ Misses 4607 4488 -119
- Partials 626 676 +50
|
pkg/loki/modules.go
Outdated
@@ -34,7 +34,8 @@ import ( | |||
"github.com/grafana/loki/pkg/querier" | |||
"github.com/grafana/loki/pkg/querier/queryrange" | |||
loki_storage "github.com/grafana/loki/pkg/storage" | |||
"github.com/grafana/loki/pkg/storage/stores/local" | |||
"github.com/grafana/loki/pkg/storage/stores/shipper" | |||
shipper_uploads "github.com/grafana/loki/pkg/storage/stores/shipper/uploads" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there a collision that requires this? (I didn't look super close just aw it in the diff)
2a21e61
to
89ec425
Compare
@cyriltovena Thanks for the feedback! I have addressed them all. |
return t.err | ||
} | ||
|
||
t.dbsMtx.RLock() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the code for waiting initialization should be before this. I know it is the case already but it should included in this function and not in the table manager, because If I use table without tablemanager I can get deadlocked here.
Another option is to make Table not exported in this package.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
May be easier to just rename Table to table.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have made public methods to give a guarantee that usage of them is concurrency safe. The Table
is not meant to be used without the table manager. Having public methods(for concurrency guarantees) on a private type would look weird. It would be better to move readiness check here from table manager given we want to support using Table
without table manager.
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Either you make Table
=> table
or you move the readiness check here. Up to you !
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have pushed the code. Thanks!
for { | ||
select { | ||
case <-syncTicker.C: | ||
err := tm.uploadTables(context.Background()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think you should be using a context.Background here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the issue when you cancel this ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we would ever want to cancel that context, not even while stopping to make sure we stop the service only after uploading the files. We could have a context with a minimum of 10 mins timeout to avoid waiting on it forever. What do you think? Is that what you are looking for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My problem is mostly when using context.Background you're blocking the waitgroup and you'll likely be killed and you don't know when. I'm wondering if this could be a problem ? Basically being stopped before gracefully stopping.
First I think you should cancel the loop because the upload happen again anyways after stop. Am I wrong ?
Second 10min would get just shutdown by Kubernetes see https://cloud.google.com/blog/products/gcp/kubernetes-best-practices-terminating-with-grace
At this point, Kubernetes waits for a specified time called the termination grace period. By default, this is 30 seconds. It’s important to note that this happens in parallel to the preStop hook and the SIGTERM signal. Kubernetes does not wait for the preStop hook to finish.
Basically you're going to get SIGKILL after 30s or more depending on what we have configured, for sure not 10min.
Let's change the Close function to give a 1m timeout, and log if it has been cancelled or success, so we can see if we get killed before finishing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I am not wrong our ingesters don't get killed within 30s of receiving SIGTERM because we do chunk transfers and it could take a couple of minutes. I checked the config and it seems we configure termination grace period to 80 minutes, see
deployment.mixin.spec.template.spec.withTerminationGracePeriodSeconds(4800), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this context should have a timeout and should be configured to a value less than the termination grace period and we should have explicit info level logging on when a table upload starts and completed.
I don't like the idea of hiding a table upload into the background without any visibility, many people run Loki with much more aggressive shutdown requirements and it shoudl be clear from the logs if table uploads succeeded or if the process was killed before they completed, have a timeout less than the grace period would ensure we log that upload failed before we gave up and shutdown.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After some more discussion with @sandeepsukhani , we already try to flush chunks forever on shutdown it makes sense to try to upload tables forever on shutdown.
I would like to see the explicit logging of both started and succeeded (as well as error) so that someone can determine from their logs if all tables were uploaded.
45785bc
to
8d6996d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks Good To Get Merged To Me!
Signed-off-by: Sandeep Sukhani <sandeep.d.sukhani@gmail.com>
Signed-off-by: Cyril Tovena <cyril.tovena@gmail.com>
Signed-off-by: Cyril Tovena <cyril.tovena@gmail.com>
Signed-off-by: Cyril Tovena <cyril.tovena@gmail.com>
8d6996d
to
7e14254
Compare
e0da42d
to
863038a
Compare
What this PR does / why we need it:
This PR does major overhaul of boltdb shipper code. Most of the functionality remains the same except it would shard index files by creating a new one every 15 mins.
The overhaul includes splitting upload and downloads code to a separate sub-package.
This PR still needs some final touches and tests.
Checklist