Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

create smaller unique files from boltdb shipper and other code refactorings #2261

Merged
merged 19 commits into from
Jul 28, 2020

Conversation

sandeepsukhani
Copy link
Contributor

@sandeepsukhani sandeepsukhani commented Jun 24, 2020

What this PR does / why we need it:
This PR does major overhaul of boltdb shipper code. Most of the functionality remains the same except it would shard index files by creating a new one every 15 mins.
The overhaul includes splitting upload and downloads code to a separate sub-package.
This PR still needs some final touches and tests.

Checklist

  • Tests updated

@sandeepsukhani sandeepsukhani force-pushed the boltdb-shipper-overhaul branch 5 times, most recently from d1a2af1 to 16ce8df Compare June 27, 2020 09:57
@codecov-commenter
Copy link

codecov-commenter commented Jun 27, 2020

Codecov Report

Merging #2261 into master will increase coverage by 1.15%.
The diff coverage is 60.25%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2261      +/-   ##
==========================================
+ Coverage   61.55%   62.70%   +1.15%     
==========================================
  Files         160      162       +2     
  Lines       13612    13847     +235     
==========================================
+ Hits         8379     8683     +304     
+ Misses       4607     4488     -119     
- Partials      626      676      +50     
Impacted Files Coverage Δ
pkg/loki/loki.go 0.00% <0.00%> (ø)
pkg/storage/stores/shipper/metrics.go 0.00% <0.00%> (ø)
pkg/storage/stores/shipper/shipper_index_client.go 0.00% <0.00%> (ø)
pkg/storage/stores/shipper/table_client.go 41.93% <ø> (ø)
pkg/loki/modules.go 10.86% <44.44%> (ø)
pkg/storage/store.go 62.02% <60.00%> (-0.48%) ⬇️
pkg/storage/stores/shipper/downloads/table.go 65.00% <65.00%> (ø)
...kg/storage/stores/shipper/uploads/table_manager.go 67.52% <67.52%> (ø)
pkg/storage/stores/shipper/uploads/table.go 68.62% <68.62%> (ø)
.../storage/stores/shipper/downloads/table_manager.go 69.76% <69.76%> (ø)
... and 11 more

@@ -34,7 +34,8 @@ import (
"github.com/grafana/loki/pkg/querier"
"github.com/grafana/loki/pkg/querier/queryrange"
loki_storage "github.com/grafana/loki/pkg/storage"
"github.com/grafana/loki/pkg/storage/stores/local"
"github.com/grafana/loki/pkg/storage/stores/shipper"
shipper_uploads "github.com/grafana/loki/pkg/storage/stores/shipper/uploads"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a collision that requires this? (I didn't look super close just aw it in the diff)

@sandeepsukhani sandeepsukhani force-pushed the boltdb-shipper-overhaul branch 5 times, most recently from 2a21e61 to 89ec425 Compare July 8, 2020 13:33
@sandeepsukhani
Copy link
Contributor Author

@cyriltovena Thanks for the feedback! I have addressed them all.
Please feel free to point out any problems that you see, if any.

return t.err
}

t.dbsMtx.RLock()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the code for waiting initialization should be before this. I know it is the case already but it should included in this function and not in the table manager, because If I use table without tablemanager I can get deadlocked here.

Another option is to make Table not exported in this package.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May be easier to just rename Table to table.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have made public methods to give a guarantee that usage of them is concurrency safe. The Table is not meant to be used without the table manager. Having public methods(for concurrency guarantees) on a private type would look weird. It would be better to move readiness check here from table manager given we want to support using Table without table manager.
What do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Either you make Table => table or you move the readiness check here. Up to you !

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have pushed the code. Thanks!

for {
select {
case <-syncTicker.C:
err := tm.uploadTables(context.Background())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you should be using a context.Background here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the issue when you cancel this ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we would ever want to cancel that context, not even while stopping to make sure we stop the service only after uploading the files. We could have a context with a minimum of 10 mins timeout to avoid waiting on it forever. What do you think? Is that what you are looking for?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My problem is mostly when using context.Background you're blocking the waitgroup and you'll likely be killed and you don't know when. I'm wondering if this could be a problem ? Basically being stopped before gracefully stopping.

First I think you should cancel the loop because the upload happen again anyways after stop. Am I wrong ?

Second 10min would get just shutdown by Kubernetes see https://cloud.google.com/blog/products/gcp/kubernetes-best-practices-terminating-with-grace

At this point, Kubernetes waits for a specified time called the termination grace period. By default, this is 30 seconds. It’s important to note that this happens in parallel to the preStop hook and the SIGTERM signal. Kubernetes does not wait for the preStop hook to finish.

Basically you're going to get SIGKILL after 30s or more depending on what we have configured, for sure not 10min.

Let's change the Close function to give a 1m timeout, and log if it has been cancelled or success, so we can see if we get killed before finishing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I am not wrong our ingesters don't get killed within 30s of receiving SIGTERM because we do chunk transfers and it could take a couple of minutes. I checked the config and it seems we configure termination grace period to 80 minutes, see

deployment.mixin.spec.template.spec.withTerminationGracePeriodSeconds(4800),

Copy link
Collaborator

@slim-bean slim-bean Jul 23, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this context should have a timeout and should be configured to a value less than the termination grace period and we should have explicit info level logging on when a table upload starts and completed.

I don't like the idea of hiding a table upload into the background without any visibility, many people run Loki with much more aggressive shutdown requirements and it shoudl be clear from the logs if table uploads succeeded or if the process was killed before they completed, have a timeout less than the grace period would ensure we log that upload failed before we gave up and shutdown.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After some more discussion with @sandeepsukhani , we already try to flush chunks forever on shutdown it makes sense to try to upload tables forever on shutdown.

I would like to see the explicit logging of both started and succeeded (as well as error) so that someone can determine from their logs if all tables were uploaded.

@sandeepsukhani sandeepsukhani marked this pull request as ready for review July 23, 2020 13:02
@sandeepsukhani sandeepsukhani force-pushed the boltdb-shipper-overhaul branch from 45785bc to 8d6996d Compare July 24, 2020 06:10
Copy link
Collaborator

@slim-bean slim-bean left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks Good To Get Merged To Me!

@sandeepsukhani sandeepsukhani force-pushed the boltdb-shipper-overhaul branch from 8d6996d to 7e14254 Compare July 28, 2020 07:25
@sandeepsukhani sandeepsukhani force-pushed the boltdb-shipper-overhaul branch from e0da42d to 863038a Compare July 28, 2020 10:48
@sandeepsukhani sandeepsukhani merged commit e59adcc into grafana:master Jul 28, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants