delete series api and purger to purge data requested for deletion #2103
Conversation
Forgot to make it a draft PR while I work on tests 🤦‍♂️
Force-pushed from b736036 to 1d23841
Overall this looks good. I have a few comments.
Force-pushed from 671706a to 3e35a6d
Force-pushed from d3446bc to be92f93
A few more minor nits.
Force-pushed from 5ec0a39 to 01e2422
Force-pushed from e4fab35 to 5511880
Force-pushed from 0428f10 to 022c0e7
I have some minor nits but overall this LGTM
Force-pushed from 022c0e7 to 9232530
@@ -104,6 +104,20 @@ Where default_value is the value to use if the environment variable is undefined
# The table_manager_config configures the Cortex table-manager.
[table_manager: <table_manager_config>]

data_purger:
Do you mind adding a new root block to the doc generator, so that the data_purger_config will have its own dedicated section in this reference file? It's a 3-line config entry at the top of tools/doc-generator/main.go (look at the other root blocks as a reference).
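For reference, a hedged sketch of what such a root block entry could look like; the rootBlock field names and the purger config type here are assumptions for illustration, not copied from tools/doc-generator/main.go:

```go
package main

import (
	"fmt"
	"reflect"
)

// rootBlock mirrors the kind of entry the comment refers to; the field names
// below are assumptions, not the actual doc-generator code.
type rootBlock struct {
	name       string
	structType reflect.Type
	desc       string
}

// purgerConfig is a stand-in for the real purger Config struct.
type purgerConfig struct {
	Enable bool `yaml:"enable"`
}

func main() {
	// A new root block gives the purger its own dedicated section in the
	// generated config reference, alongside the existing root blocks.
	block := rootBlock{
		name:       "purger_config",
		structType: reflect.TypeOf(purgerConfig{}),
		desc:       "The purger_config configures the purger which deletes data requested for deletion.",
	}
	fmt.Printf("%s -> %s: %s\n", block.name, block.structType.Name(), block.desc)
}
```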
We try to keep consistency between the YAML config and CLI flags. The YAML config block is called data_purger while the CLI flags have the prefix -purger. Have you considered renaming the YAML config block from data_purger to purger?
Right, I missed doing that somehow. Pushed the changes. Thanks!
Good job Sandeep! I've left some questions and comments you may want to look at, some more important than others.
option (gogoproto.marshaler_all) = true;
option (gogoproto.unmarshaler_all) = true;

message Plan {
The messages in this proto file are not immediately clear to me, and I would appreciate some comments to understand how they are supposed to be used and how they work.
(Update: it made a bit more sense after reading purger.go, but still, some comments would be nice.)
pkg/cortex/modules.go
Outdated
if err != nil {
	return err
}
This is not needed, because the named return variable err is used here. It's a pattern we use in many places in this file.
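For illustration, a minimal self-contained sketch of the named-return pattern being referred to; the cortex/dataPurger types and the newDataPurger constructor are hypothetical stand-ins, not the actual modules.go code:

```go
package main

import (
	"errors"
	"fmt"
)

// dataPurger and newDataPurger are hypothetical stand-ins for this sketch.
type dataPurger struct{}

func newDataPurger() (*dataPurger, error) {
	return nil, errors.New("not implemented")
}

type cortex struct {
	dataPurger *dataPurger
}

// initDataPurger uses a named return value, so a bare `return` propagates err
// to the caller and an explicit `if err != nil { return err }` is redundant.
func (t *cortex) initDataPurger() (err error) {
	t.dataPurger, err = newDataPurger()
	return
}

func main() {
	fmt.Println((&cortex{}).initDataPurger())
}
```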
	return
}

adminRouter := t.server.HTTP.PathPrefix(cfg.HTTPPrefix + "/api/v1/admin/tsdb").Subrouter()
Is this some well-known API endpoint? Isn't "tsdb" going to be confusing as this is only used by chunks storage?
While we want to keep APIs the same as what Prometheus provides, I agree with your point. I also think we will eventually have to support the delete API for blocks storage as well, so for consistency with the Prometheus APIs I think we can keep it like this. Thoughts?
Sounds good.
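For illustration, a hedged example of calling the endpoint registered on that subrouter, assuming Prometheus-style parameters (match[], start, end); the host/port, the /api/prom prefix (the default HTTP prefix) and the tenant ID value are assumptions, not taken from this PR:

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
)

func main() {
	params := url.Values{}
	params.Add("match[]", `{job="example"}`)
	params.Add("start", "2020-01-01T00:00:00Z")
	params.Add("end", "2020-01-02T00:00:00Z")

	// "/api/prom" stands in for cfg.HTTPPrefix; the admin subrouter path is
	// appended to it as in the snippet above.
	u := "http://localhost:9009/api/prom/api/v1/admin/tsdb/delete_series?" + params.Encode()

	req, err := http.NewRequest(http.MethodPost, u, nil)
	if err != nil {
		panic(err)
	}
	// Cortex expects a tenant ID header when multi-tenancy (auth) is enabled.
	req.Header.Set("X-Scope-OrgID", "fake")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```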
pkg/cortex/modules.go
Outdated
All: {
	deps: []moduleName{Querier, Ingester, Distributor, TableManager},
	deps: []moduleName{Querier, Ingester, Distributor, TableManager, DataPurger},
Can multiple data-purgers run at the same time? (If someone runs multiple single-binary instances of Cortex)
Yes, but I think the chances of that are low. We can address that issue in future PRs if needed. For now we can make it clear in the docs that there should be only one purger running at a time.
store: boltdb

purger:
  object_store_type: filesystem
Isn't this redundant? Can't we use object_store from schema, or does it make sense to use a different value here?
Since we can have different schemas for different time ranges, we would have to infer which schema the user wants to use. If we go with the active/latest one, then moving to a different store by adding a new schema would abandon old schemas, which is why I wanted it to be explicit. Does that make sense?
I see, didn't realize that. Thanks!
pkg/chunk/purger/purger.go
Outdated
if deleteRequest.CreatedAt.Add(24 * time.Hour).After(model.Now()) {
	continue
}
Should this check be first, to avoid debug-logging in the previous check? [nit]
@@ -0,0 +1,32 @@
syntax = "proto3";

package purgeplan;
Why an extra package just for the protobuf file?
I wanted to keep it isolated, but since this won't be needed outside of the purger package, I think I can move it there.
pkg/chunk/purger/purger.go
Outdated
executePlansChan             chan deleteRequestWithLogger
workerJobChan                chan workerJob
workerJobExecutionStatusChan chan workerJobExecutionStatus
I don't see why workerJobExecutionStatusChan is needed at all. The same cleanup can be done by the worker itself after it is finished with the plan; why does it need to be done on a separate goroutine?
Furthermore, as it is now, when Stop is called, the goroutine reading from workerJobExecutionStatusChan will exit quickly (because it's not doing anything that takes a long time) while workers are still executing the plan; by the time they finish, there is no channel reader anymore, so the result of plan execution will not be logged or handled in any way.
While there can be multiple plans, a worker takes care of just one plan, so there has to be another entity that does the cleanup after all the plans have executed successfully. I will take care of the exit issue that you pointed out.
> While there can be multiple plans, a worker takes care of just one plan, so there has to be another entity that does the cleanup after all the plans have executed successfully.

I understand this, but I think the worker can simply check whether it did the last piece on its own, without going through another goroutine.
Since we are building per-day plans, the last one could be the smallest and could finish quickly, which means we might mark a delete request completed before all of its plans have executed successfully.
We also do not want to mark a delete request completed when any of its plans other than the last one fails to execute.
That's fine... I think your checks are fine, I'm just objecting to using another goroutine and an extra channel. The worker can do the same checks, and remove the entry from the map only if there is no other plan running. If you extract that code to a new method as suggested elsewhere, I would just let the worker call the method instead of spawning a new goroutine.
Sorry, I got your point. I have done the changes and pushed them. Thanks!
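To make the suggestion concrete, here is a minimal sketch of the alternative discussed in this thread, where the worker that finishes the last plan of a request does the cleanup itself; the names (pendingPlansCount, executePlan, etc.) are assumptions based on the discussion, not the PR's actual code:

```go
package main

import (
	"fmt"
	"sync"
)

// deleteRequest and purger are hypothetical stand-ins for this sketch.
type deleteRequest struct{ id string }

type purger struct {
	mtx               sync.Mutex
	pendingPlansCount map[string]int // request id -> plans still running
}

// executePlan is a stand-in for executing one per-day delete plan.
func (p *purger) executePlan(req deleteRequest, day int) {
	fmt.Printf("executed plan for request %s, day %d\n", req.id, day)
}

// worker executes one plan and then checks whether it was the last plan of the
// request; only then does it mark the request processed, so a small last plan
// finishing early cannot complete the request prematurely.
func (p *purger) worker(req deleteRequest, day int, wg *sync.WaitGroup) {
	defer wg.Done()
	p.executePlan(req, day)

	p.mtx.Lock()
	p.pendingPlansCount[req.id]--
	last := p.pendingPlansCount[req.id] == 0
	p.mtx.Unlock()

	if last {
		fmt.Printf("request %s fully processed, updating delete store\n", req.id)
	}
}

func main() {
	req := deleteRequest{id: "req-1"}
	p := &purger{pendingPlansCount: map[string]int{req.id: 3}}

	var wg sync.WaitGroup
	for day := 0; day < 3; day++ {
		wg.Add(1)
		go p.worker(req, day, &wg)
	}
	wg.Wait()
}
```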
@pstibrany Thank you so much for the review!
Force-pushed from 4dbfeb9 to 54296da
The design of the DataPurger feels very Java-like to me, with a static number of workers and goroutines and communication through shared maps like inProcessRequestIDs and pendingPlansCount.
What about changing it such that the purger selects the next job to work on, then spawns NumWorkers goroutines to handle all of its plans, waits until they are finished, and then updates the request in the database with the result? It seems to me that the entire handling of a single request could be much simplified that way, instead of tracking the number of in-flight plans. What do you think?
Update: notes from Slack communication (see the sketch after this list):
- By "static" I meant that they are created once and then run forever, regardless of whether they have work to do or not.
- An alternative would be to have a method that handles an entire request. This method would spawn NumWorkers goroutines to handle the plans for the request, and once all plans are finished, the goroutines would die and the method would finish the request in the database.
- That way, there is no need for a shared map with the number of in-flight plans, and perhaps also no need for a shared map of users.

In any case, thanks for addressing my feedback!
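A minimal sketch of the per-request handling described in the notes above, using hypothetical plan/executePlan names; one method owns the whole request, spawns the workers, waits, and then marks the request processed, so no shared in-flight-plan map is needed:

```go
package main

import (
	"fmt"
	"sync"
)

// plan stands in for one per-day delete plan of a single delete request.
type plan struct{ day int }

// executePlan is a stand-in for executing one plan.
func executePlan(pl plan) {
	fmt.Printf("executed plan for day %d\n", pl.day)
}

// handleDeleteRequest owns a single request end to end: it spawns numWorkers
// goroutines for the request's plans, waits for them to finish, and only then
// marks the request processed. Nothing outlives this call, so no shared map of
// in-flight plans is needed.
func handleDeleteRequest(plans []plan, numWorkers int) {
	planCh := make(chan plan)

	var wg sync.WaitGroup
	for i := 0; i < numWorkers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for pl := range planCh {
				executePlan(pl)
			}
		}()
	}

	for _, pl := range plans {
		planCh <- pl
	}
	close(planCh)
	wg.Wait()

	fmt.Println("all plans done, marking request processed in the delete store")
}

func main() {
	handleDeleteRequest([]plan{{day: 0}, {day: 1}, {day: 2}}, 2)
}
```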
pkg/chunk/purger/purger.go
Outdated
	logger log.Logger
}

type workerJobExecutionStatus struct {
I think we can get rid of this struct now.
Can this be resolved @sandeepsukhani?
Approving as the current design will work as well. According to Sandeep, more design changes are coming in subsequent PRs.
Force-pushed from 54296da to c7221b8
LGTM Sandeep, good work. I have a bunch of nits mostly, after which I'll give the LGTM.
delete_store manages delete requests and purge plan records in stores. The purger builds delete plans (delete requests sharded by day) and executes them in parallel. Only one request per user is in the execution phase at a time. Delete requests get picked up for deletion only after they are more than a day old.
Signed-off-by: Sandeep Sukhani <sandeep.d.sukhani@gmail.com>
…e only component that needs it Signed-off-by: Sandeep Sukhani <sandeep.d.sukhani@gmail.com>
Force-pushed from 23fe072 to 3c74dc7
What this PR does:
Adds support for deleting series. This PR just includes the following pieces from PR #1906
Implementation details
New entities
- DeleteStore - Handles storing, updating and fetching delete requests.
- DataPurger - Builds delete plans for requests older than 24h and executes them using a configurable worker pool.
- StorageClient - Used by DataPurger for storing delete plans in protobuf format.

Delete Request Lifecycle
A delete request can be in one of the following states at a time:
- Received - No action has been taken on the request yet.
- BuildingPlan - The request has been picked up for processing and plans are being built for it.
- Deleting - Plans are already built and delete operations are running.
- Processed - All requested data has been deleted.

Delete requests keep moving forward from one state to another. They are moved from Received to BuildingPlan only after they are 24+ hours old; from there they keep moving forward as they get processed. We do not want to make this period configurable since users might set it to an extremely low value, and we do not want to get into the rabbit hole of deleting live data from ingesters.

Workers performing delete operations could die in the middle, which can cause issues in queries if chunks are deleted first, or leave stale chunks if the index is deleted first. To avoid such issues, a delete plan is built which includes the labels and chunks that are supposed to be deleted. To perform deletion in parallel, delete plans are sharded by day, i.e. each plan includes all the labels and chunks that are supposed to be deleted for one day.
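To make the lifecycle concrete, here is a small sketch of the states and their forward-only transitions; the constant names mirror the states listed above but are not necessarily the exact identifiers used in the code:

```go
package main

import "fmt"

// DeleteRequestStatus sketches the lifecycle described above.
type DeleteRequestStatus string

const (
	StatusReceived     DeleteRequestStatus = "received"
	StatusBuildingPlan DeleteRequestStatus = "buildingPlan"
	StatusDeleting     DeleteRequestStatus = "deleting"
	StatusProcessed    DeleteRequestStatus = "processed"
)

// nextStatus returns the state a request moves to; requests only move forward.
func nextStatus(s DeleteRequestStatus) DeleteRequestStatus {
	switch s {
	case StatusReceived:
		return StatusBuildingPlan // only once the request is 24+ hours old
	case StatusBuildingPlan:
		return StatusDeleting
	case StatusDeleting:
		return StatusProcessed
	default:
		return s
	}
}

func main() {
	s := StatusReceived
	for s != StatusProcessed {
		s = nextStatus(s)
		fmt.Println(s)
	}
}
```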
New APIs
- /delete_series - Accepts a Prometheus-style delete request for deleting series.
- /get_all_delete_requests - Gets all delete requests.

Tradeoffs
Notes
Checklist
- CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]