feat(v2): dlq recovery #3595
Conversation
func (m *Metastore) stopping(_ error) error {
	close(m.done)
	m.wg.Wait()
	m.dlq.Stop()
	return m.Shutdown()
}
Stop is called twice: here and in m.Shutdown, where I believe it should live. Let's remove it from here.
NB: I guess we should stop the DLQ before raft, just to minimize the chances of concurrent recovery.
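For illustration, a minimal sketch of the suggested shape, assuming Shutdown owns both the DLQ recovery handle and the raft instance (types and field names here are placeholders, not the actual metastore API):

package metastore

import (
	"sync"

	"github.com/hashicorp/raft"
)

// Illustrative types only; the real Metastore carries many more fields.
type dlqRecovery interface{ Stop() }

type Metastore struct {
	done chan struct{}
	wg   sync.WaitGroup
	dlq  dlqRecovery
	raft *raft.Raft
}

// stopping only signals and waits for background work.
func (m *Metastore) stopping(_ error) error {
	close(m.done)
	m.wg.Wait()
	return m.Shutdown()
}

// Shutdown owns the teardown order: DLQ recovery stops before raft,
// minimizing the chance of a recovery tick racing with raft shutdown.
func (m *Metastore) Shutdown() error {
	m.dlq.Stop()
	return m.raft.Shutdown().Error()
}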
👍 applied
const pathAnon = tenant.DefaultTenantID
const pathBlock = "block.bin"
const pathMetaPB = "meta.pb"
const pathDLQ = "dlq"
Let's find a place for the constants (we already have them defined in other places)
created a separate tiny package
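For illustration, such a package could be as small as this (the package name and constant names are assumptions, not necessarily what the PR settled on):

// Package block centralizes object-storage path components that were
// previously duplicated across packages.
package block

const (
	DirNameDLQ   = "dlq"
	FileNameMeta = "meta.pb"
	FileNameData = "block.bin"
)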
func (r *Recovery) recoverLoop(ctx context.Context) {
	ticker := time.NewTicker(r.cfg.Period)
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			r.recoverTick(ctx)
		}
	}
}

func (r *Recovery) recoverTick(ctx context.Context) {
	err := r.bucket.Iter(ctx, pathDLQ, func(metaPath string) error {
		if ctx.Err() != nil {
			return ctx.Err()
		}
		r.recover(ctx, metaPath)
		return nil
	}, objstore.WithRecursiveIter)
	if err != nil {
		level.Error(r.l).Log("msg", "failed to iterate over dlq", "err", err)
	}
}
NB: One danger I see is that we mix old and new blocks in compaction jobs, which may increase the read amplification factor. I think we need to implement some logic to prevent this – just one more rule for the compaction planner. It’s not really critical, as we don't expect this to happen frequently.
/cc @aleks-p
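A rough illustration of the kind of planner rule meant here, assuming jobs are planned from batches of block metadata that carry a creation timestamp (the type, field, and threshold are hypothetical):

package planner

import (
	"sort"
	"time"
)

// BlockMeta is an illustrative stand-in for the real block metadata.
type BlockMeta struct {
	ID        string
	CreatedAt time.Time
}

// maxJobTimeSpread is a hypothetical knob: blocks whose creation times
// differ by more than this are not planned into the same compaction job,
// so a late DLQ-recovered block is not mixed with much newer data.
const maxJobTimeSpread = 12 * time.Hour

// splitByAge groups blocks into job candidates whose creation times stay
// within maxJobTimeSpread of the oldest block in the group.
func splitByAge(blocks []BlockMeta) [][]BlockMeta {
	sort.Slice(blocks, func(i, j int) bool {
		return blocks[i].CreatedAt.Before(blocks[j].CreatedAt)
	})
	var jobs [][]BlockMeta
	var cur []BlockMeta
	for _, b := range blocks {
		if len(cur) > 0 && b.CreatedAt.Sub(cur[0].CreatedAt) > maxJobTimeSpread {
			jobs = append(jobs, cur)
			cur = nil
		}
		cur = append(cur, b)
	}
	if len(cur) > 0 {
		jobs = append(jobs, cur)
	}
	return jobs
}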
I'll add a little more context here.
Basically, when we try to add the same md entry twice, there are three cases (a rough sketch follows the list):
- We still hold the entry in memory in the metastore. That's simple: we just reject it.
- We've already compacted the block and we still have tombstone for it. That's simple: we just reject it.
- We've already compacted the block and removed it from object storage and don't have tombstone for it. Somewhat tricky – I propose the following:
- Handle the md entry as usual: add it to the index, add it to the compaction job, create a tombstone afterwards, etc. The entry is visible to readers.
- Ignore "object not found" in compactors and queriers (reflect it in logs and metrics). Clearly, even if the object is missing for any other reason, this is probably the best strategy; better than failing the whole query/job outright.
- Ensure that we only remove tombstone if the object is actually deleted from object storage.
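A rough sketch of those three cases as code, with hypothetical index and tombstone helpers (none of these names come from the actual codebase):

package metastore

import "errors"

var ErrBlockExists = errors.New("metastore: block entry already exists")

// Illustrative interfaces for the sketch only.
type blockIndex interface {
	Has(blockID string) bool
	Insert(blockID string) error
}

type tombstoneSet interface {
	Has(blockID string) bool
}

// addBlockEntry walks the three cases described above.
func addBlockEntry(idx blockIndex, ts tombstoneSet, blockID string) error {
	switch {
	case idx.Has(blockID):
		// Case 1: the entry is still held in memory; reject the duplicate.
		return ErrBlockExists
	case ts.Has(blockID):
		// Case 2: the block was already compacted and its tombstone is
		// still present; reject the duplicate.
		return ErrBlockExists
	default:
		// Case 3: compacted, deleted, tombstone gone. Treat it as a new
		// entry; compactors and queriers are expected to tolerate
		// "object not found" downstream.
		return idx.Insert(blockID)
	}
}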
79e57d2 to 31b3aab
LGTM
func isRaftLeadershipError(err error) bool {
	return errors.Is(err, raft.ErrLeadershipLost) ||
		errors.Is(err, raft.ErrNotLeader) ||
		errors.Is(err, raft.ErrLeadershipTransferInProgress) ||
		errors.Is(err, raft.ErrRaftShutdown)
}
nit: we could probably refactor out this function to the raftleader package
pyroscope/pkg/experiment/metastore/metastore_fsm.go, lines 251 to 256 in 363d453:

func shouldRetryCommand(err error) bool {
	return errors.Is(err, raft.ErrLeadershipLost) ||
		errors.Is(err, raft.ErrNotLeader) ||
		errors.Is(err, raft.ErrLeadershipTransferInProgress) ||
		errors.Is(err, raft.ErrRaftShutdown)
}
@@ -294,7 +294,7 @@ func assertChanReceived(t *testing.T, c chan struct{}, timeout time.Duration, msg
 	select {
 	case <-c:
 	case <-time.After(timeout):
-		t.Fatalf(msg)
+		t.Fatalf("%s", msg)
nit: I'm curious if a simple t.Fatal(msg) would do the trick
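For context, t.Fatal takes ...any, so passing msg directly works and also sidesteps go vet's warning about a non-constant format string, which is presumably what motivated the "%s" change. A minimal sketch of the helper with that variant (the package name is assumed):

package metastore_test

import (
	"testing"
	"time"
)

func assertChanReceived(t *testing.T, c chan struct{}, timeout time.Duration, msg string) {
	t.Helper()
	select {
	case <-c:
	case <-time.After(timeout):
		// t.Fatal formats its arguments like fmt.Println, so no format
		// directive is needed here.
		t.Fatal(msg)
	}
}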
Co-authored-by: Anton Kolesnikov <anton.e.kolesnikov@gmail.com>
No description provided.