delete series api and purger to purge data requested for deletion #2103
Conversation
Forgot to make it a draft PR while I work on tests 🤦‍♂️
Force-pushed from b736036 to 1d23841
Overall this looks good. I have a few comments.
Force-pushed from 671706a to 3e35a6d
Force-pushed from d3446bc to be92f93
A few more minor nits.
Force-pushed from 5ec0a39 to 01e2422
Force-pushed from e4fab35 to 5511880
Force-pushed from 0428f10 to 022c0e7
I have some minor nits but overall this LGTM
Force-pushed from 022c0e7 to 9232530
@@ -104,6 +104,20 @@ Where default_value is the value to use if the environment variable is undefined
# The table_manager_config configures the Cortex table-manager.
[table_manager: <table_manager_config>]

data_purger:
Do you mind adding a new root block to the doc generator, so that the data_purger_config will have its own dedicated section in this reference file? It's a 3-line config entry at the top of tools/doc-generator/main.go (look at the other root blocks as a reference).
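For reference, a hedged sketch of what such a root block entry could look like; the rootBlock field names and the purger config type here are assumptions for illustration, not copied from tools/doc-generator/main.go:

```go
package main

import (
	"fmt"
	"reflect"
)

// rootBlock mirrors the kind of entry the comment refers to; the field names
// below are assumptions, not the actual doc-generator code.
type rootBlock struct {
	name       string
	structType reflect.Type
	desc       string
}

// purgerConfig is a stand-in for the real purger Config struct.
type purgerConfig struct {
	Enable bool `yaml:"enable"`
}

func main() {
	// A new root block gives the purger its own dedicated section in the
	// generated config reference, alongside the existing root blocks.
	block := rootBlock{
		name:       "purger_config",
		structType: reflect.TypeOf(purgerConfig{}),
		desc:       "The purger_config configures the purger which deletes data requested for deletion.",
	}
	fmt.Printf("%s -> %s: %s\n", block.name, block.structType.Name(), block.desc)
}
```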
We try to keep consistency between the YAML config and CLI flags. The YAML config block is called data_purger while the CLI flags have the prefix -purger. Have you considered renaming the YAML config block from data_purger to purger?
Right, I missed doing that somehow. Pushed the changes. Thanks!
Good job Sandeep! I've left some questions and comments you may want to look at, some more important than others.
option (gogoproto.marshaler_all) = true;
option (gogoproto.unmarshaler_all) = true;

message Plan {
The messages in this proto file are not immediately clear to me, and I would appreciate some comments to understand how they are supposed to be used and how they work.
(Update: it made a bit more sense after reading purger.go, but still, some comments would be nice.)
pkg/cortex/modules.go
Outdated
if err != nil {
	return err
}
This is not needed, because the named return variable err is used here. It's a pattern we use in many places in this file.
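For illustration, a minimal self-contained sketch of the named-return pattern being referred to; the cortex/dataPurger types and the newDataPurger constructor are hypothetical stand-ins, not the actual modules.go code:

```go
package main

import (
	"errors"
	"fmt"
)

// dataPurger and newDataPurger are hypothetical stand-ins for this sketch.
type dataPurger struct{}

func newDataPurger() (*dataPurger, error) {
	return nil, errors.New("not implemented")
}

type cortex struct {
	dataPurger *dataPurger
}

// initDataPurger uses a named return value, so a bare `return` propagates err
// to the caller and an explicit `if err != nil { return err }` is redundant.
func (t *cortex) initDataPurger() (err error) {
	t.dataPurger, err = newDataPurger()
	return
}

func main() {
	fmt.Println((&cortex{}).initDataPurger())
}
```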
	return
}

adminRouter := t.server.HTTP.PathPrefix(cfg.HTTPPrefix + "/api/v1/admin/tsdb").Subrouter()
Is this some well-known API endpoint? Isn't "tsdb" going to be confusing as this is only used by chunks storage?
While we want to keep APIs the same as what Prometheus provides, I agree with your point. I also think we will eventually have to support the delete API for blocks storage as well, so for consistency with the Prometheus APIs I think we can keep it like this. Thoughts?
Sounds good.
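For illustration, a hedged example of calling the endpoint registered on that subrouter, assuming Prometheus-style parameters (match[], start, end); the host/port, the /api/prom prefix (the default HTTP prefix) and the tenant ID value are assumptions, not taken from this PR:

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
)

func main() {
	params := url.Values{}
	params.Add("match[]", `{job="example"}`)
	params.Add("start", "2020-01-01T00:00:00Z")
	params.Add("end", "2020-01-02T00:00:00Z")

	// "/api/prom" stands in for cfg.HTTPPrefix; the admin subrouter path is
	// appended to it as in the snippet above.
	u := "http://localhost:9009/api/prom/api/v1/admin/tsdb/delete_series?" + params.Encode()

	req, err := http.NewRequest(http.MethodPost, u, nil)
	if err != nil {
		panic(err)
	}
	// Cortex expects a tenant ID header when multi-tenancy (auth) is enabled.
	req.Header.Set("X-Scope-OrgID", "fake")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```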
pkg/cortex/modules.go
Outdated
All: {
	deps: []moduleName{Querier, Ingester, Distributor, TableManager},
	deps: []moduleName{Querier, Ingester, Distributor, TableManager, DataPurger},
Can multiple data-purgers run at the same time? (If someone runs multiple single-binary instances of Cortex)
Yes, but I think the chances of that are low. We can address that issue in future PRs if needed. For now we can make it clear in the docs that there should be only one purger running at a time.
store: boltdb

purger:
  object_store_type: filesystem
Isn't this redundant? Can't we use object_store from schema, or does it make sense to use a different value here?
Since we can have different schemas for different time ranges, we would have to infer which schema the user wants to use. If we go with the active/latest one, then moving to a different store by adding a new schema would abandon old schemas, which is why I wanted it to be explicit. Does that make sense?
I see, didn't realize that. Thanks!
pkg/chunk/purger/purger.go
Outdated
if deleteRequest.CreatedAt.Add(24 * time.Hour).After(model.Now()) {
	continue
}
Should this check be first, to avoid debug-logging in the previous check? [nit]
@@ -0,0 +1,32 @@
syntax = "proto3";

package purgeplan;
Why an extra package just for the protobuf file?
I wanted to keep it isolated, but since this won't be needed outside of the purger package, I think I can move it there.
pkg/chunk/purger/purger.go
Outdated
executePlansChan             chan deleteRequestWithLogger
workerJobChan                chan workerJob
workerJobExecutionStatusChan chan workerJobExecutionStatus
I don't see why workerJobExecutionStatusChan is needed at all. The same cleanup can be done by the worker itself after it is finished with the plan; why does it need to be done on a separate goroutine?
Furthermore, as it is now, when Stop is called, the goroutine reading from workerJobExecutionStatusChan will exit quickly (because it's not doing anything that takes a long time) while workers are still executing the plan; by the time they finish, there is no channel reader anymore, so the result of plan execution will not be logged or handled in any way.
While there can be multiple plans, a worker takes care of just one plan, so there has to be another entity that does the cleanup after all the plans have executed successfully. I will take care of the exit issue that you pointed out.
> While there can be multiple plans, a worker takes care of just one plan, so there has to be another entity that does the cleanup after all the plans have executed successfully.

I understand this, but I think the worker can simply check whether it did the last piece on its own, without going through another goroutine.
Since we are building per-day plans, the last one could be the smallest and could finish quickly, which means we might mark a delete request completed before all of its plans have executed successfully.
We also do not want to mark a delete request completed when any of its plans other than the last one fails to execute.
That's fine... I think your checks are fine, I'm just objecting to using another goroutine and an extra channel. The worker can do the same checks, and remove the entry from the map only if there is no other plan running. If you extract that code to a new method as suggested elsewhere, I would just let the worker call the method instead of spawning a new goroutine.
Sorry, I got your point. I have done the changes and pushed them. Thanks!
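To make the suggestion concrete, here is a minimal sketch of the alternative discussed in this thread, where the worker that finishes the last plan of a request does the cleanup itself; the names (pendingPlansCount, executePlan, etc.) are assumptions based on the discussion, not the PR's actual code:

```go
package main

import (
	"fmt"
	"sync"
)

// deleteRequest and purger are hypothetical stand-ins for this sketch.
type deleteRequest struct{ id string }

type purger struct {
	mtx               sync.Mutex
	pendingPlansCount map[string]int // request id -> plans still running
}

// executePlan is a stand-in for executing one per-day delete plan.
func (p *purger) executePlan(req deleteRequest, day int) {
	fmt.Printf("executed plan for request %s, day %d\n", req.id, day)
}

// worker executes one plan and then checks whether it was the last plan of the
// request; only then does it mark the request processed, so a small last plan
// finishing early cannot complete the request prematurely.
func (p *purger) worker(req deleteRequest, day int, wg *sync.WaitGroup) {
	defer wg.Done()
	p.executePlan(req, day)

	p.mtx.Lock()
	p.pendingPlansCount[req.id]--
	last := p.pendingPlansCount[req.id] == 0
	p.mtx.Unlock()

	if last {
		fmt.Printf("request %s fully processed, updating delete store\n", req.id)
	}
}

func main() {
	req := deleteRequest{id: "req-1"}
	p := &purger{pendingPlansCount: map[string]int{req.id: 3}}

	var wg sync.WaitGroup
	for day := 0; day < 3; day++ {
		wg.Add(1)
		go p.worker(req, day, &wg)
	}
	wg.Wait()
}
```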
@pstibrany Thank you so much for the review!
Force-pushed from 4dbfeb9 to 54296da
The design of the DataPurger feels very Java-like to me, with a static number of workers and goroutines and communication through shared maps like inProcessRequestIDs and pendingPlansCount.
What about changing it such that the purger selects the next job to work on, then spawns NumWorkers goroutines to handle all of its plans, waits until they are finished, and then updates the request in the database with the result? It seems to me that the entire handling of a single request could be much simplified that way, instead of tracking the number of in-flight plans. What do you think?
Update: notes from Slack communication (see the sketch after this list):
- By "static" I meant that they are created once and then run forever, regardless of whether they have work to do or not.
- An alternative would be to have a method that handles an entire request. This method would spawn NumWorkers goroutines to handle the plans for the request, and once all plans are finished, the goroutines would die and the method would finish the request in the database.
- That way, there is no need for a shared map with the number of in-flight plans, and perhaps also no need for a shared map of users.

In any case, thanks for addressing my feedback!
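A minimal sketch of the per-request handling described in the notes above, using hypothetical plan/executePlan names; one method owns the whole request, spawns the workers, waits, and then marks the request processed, so no shared in-flight-plan map is needed:

```go
package main

import (
	"fmt"
	"sync"
)

// plan stands in for one per-day delete plan of a single delete request.
type plan struct{ day int }

// executePlan is a stand-in for executing one plan.
func executePlan(pl plan) {
	fmt.Printf("executed plan for day %d\n", pl.day)
}

// handleDeleteRequest owns a single request end to end: it spawns numWorkers
// goroutines for the request's plans, waits for them to finish, and only then
// marks the request processed. Nothing outlives this call, so no shared map of
// in-flight plans is needed.
func handleDeleteRequest(plans []plan, numWorkers int) {
	planCh := make(chan plan)

	var wg sync.WaitGroup
	for i := 0; i < numWorkers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for pl := range planCh {
				executePlan(pl)
			}
		}()
	}

	for _, pl := range plans {
		planCh <- pl
	}
	close(planCh)
	wg.Wait()

	fmt.Println("all plans done, marking request processed in the delete store")
}

func main() {
	handleDeleteRequest([]plan{{day: 0}, {day: 1}, {day: 2}}, 2)
}
```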
pkg/chunk/purger/purger.go
Outdated
	logger log.Logger
}

type workerJobExecutionStatus struct {
I think we can get rid of this struct now.
Can this be resolved @sandeepsukhani?
Approving as the current design will work as well. According to Sandeep, more design changes are coming in subsequent PRs.
Force-pushed from 54296da to c7221b8
LGTM Sandeep, good work. I have a bunch of nits mostly, after which I'll give the LGTM.
delete_store manages delete requests and purge plan records in stores. The purger builds delete plans (delete requests sharded by day) and executes them in parallel. Only one request per user is in the execution phase at a time. Delete requests get picked up for deletion only after they are more than a day old.
Signed-off-by: Sandeep Sukhani <sandeep.d.sukhani@gmail.com>
…e only component that needs it Signed-off-by: Sandeep Sukhani <sandeep.d.sukhani@gmail.com>
Force-pushed from 23fe072 to 3c74dc7
What this PR does:
Adds support for deleting series. This PR just includes the following pieces from PR #1906
Implementation details
New entities
- DeleteStore - Handles storing, updating and fetching delete requests.
- DataPurger - Builds delete plans for requests older than 24h and executes them using a configurable worker pool.
- StorageClient - Used by DataPurger for storing delete plans in protobuf format.

Delete Request Lifecycle
A delete request can be in one of the following states at a time:
- Received - No action has been taken on the request yet.
- BuildingPlan - The request has been picked up for processing and plans are being built for it.
- Deleting - Plans are already built and delete operations are running.
- Processed - All requested data has been deleted.

Delete requests keep moving forward from one state to another. They are moved from Received to BuildingPlan only after they are 24+ hours old; from there they keep moving forward as they get processed. We do not want to make this period configurable since users might set it to an extremely low value, and we do not want to get into the rabbit hole of deleting live data from ingesters.

Workers performing delete operations could die in the middle, which can cause issues in queries if chunks are deleted first, or leave stale chunks if the index is deleted first. To avoid such issues, a delete plan is built which includes the labels and chunks that are supposed to be deleted. To perform deletion in parallel, delete plans are sharded by day, i.e. each plan includes all the labels and chunks that are supposed to be deleted for one day.
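To make the lifecycle concrete, here is a small sketch of the states and their forward-only transitions; the constant names mirror the states listed above but are not necessarily the exact identifiers used in the code:

```go
package main

import "fmt"

// DeleteRequestStatus sketches the lifecycle described above.
type DeleteRequestStatus string

const (
	StatusReceived     DeleteRequestStatus = "received"
	StatusBuildingPlan DeleteRequestStatus = "buildingPlan"
	StatusDeleting     DeleteRequestStatus = "deleting"
	StatusProcessed    DeleteRequestStatus = "processed"
)

// nextStatus returns the state a request moves to; requests only move forward.
func nextStatus(s DeleteRequestStatus) DeleteRequestStatus {
	switch s {
	case StatusReceived:
		return StatusBuildingPlan // only once the request is 24+ hours old
	case StatusBuildingPlan:
		return StatusDeleting
	case StatusDeleting:
		return StatusProcessed
	default:
		return s
	}
}

func main() {
	s := StatusReceived
	for s != StatusProcessed {
		s = nextStatus(s)
		fmt.Println(s)
	}
}
```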
New APIs
- /delete_series - Accepts a Prometheus-style delete request for deleting series.
- /get_all_delete_requests - Gets all delete requests.

Tradeoffs
Notes
Checklist
- CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]