This repository has been archived by the owner on Mar 4, 2024. It is now read-only.

src/log.c:87: refsTryInsert: Assertion `next_slot->term != term' failed. #470

Closed
MathieuBordere opened this issue Aug 23, 2023 · 9 comments
Labels: Bug (Confirmed to be a bug)

@MathieuBordere
Contributor

Detected in Jepsen: https://github.com/canonical/jepsen.dqlite/actions/runs/5953700449/job/16148585347
jepsen-data-bank-partition,disk-failure(2).zip

MathieuBordere added the Bug (Confirmed to be a bug) label on Aug 23, 2023
@cole-miller
Contributor

raft/src/log.c, lines 84 to 87 at 1e5465d:

/* It should never happen that two entries with the same index and term
* get appended. So no existing slot in this bucket must track an entry
* with the same term as the given one. */
assert(next_slot->term != term);

@cole-miller
Contributor

cole-miller commented Aug 23, 2023

  • n1 becomes leader in term 3; it has an existing log entry at index 3969 from term 2, and it replicates a barrier entry (term 3) at index 3970
  • n3 receives an append request from n1 for just the barrier entry, and refuses it because it doesn't have an entry at index 3969 (based on logLastIndex)
  • n1 tries again, sending a request for two entries (indices 3969 and 3970)
  • this time, n3 does find at least a trace of an entry at index 3969, also with term 2 (presumably not one at 3970): there is a slot tracking such an entry in the refcount hashmap for its in-memory log

It seems like the refcount hashmap on n3 is getting out of sync with the value reported by logLastIndex...
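
For reference, here is a minimal sketch of the invariant involved. The names and structure below are invented and deliberately simplified (this is not the real refsTryInsert, which judging by the snippet works on buckets of slots in a hash table keyed by index), but the duplicate-term check is the same idea as the assertion at log.c:87.

#include <assert.h>
#include <stdlib.h>

/* Invented, simplified stand-in for one bucket of the in-memory log's
 * refcount hashmap: a chain of slots, one per term that has an entry at a
 * given index. This is not the real raft data structure. */
struct entry_ref
{
    unsigned long long term; /* Term of the tracked entry. */
    unsigned short count;    /* Outstanding references to that entry. */
    struct entry_ref *next;  /* Next slot for the same index. */
};

/* Insert a reference for an entry with the given term into a bucket. The
 * same (index, term) pair must never be appended twice, so no existing slot
 * in the bucket may already track an entry with this term; this is the
 * invariant asserted at log.c:87. */
static void refs_insert(struct entry_ref **bucket, unsigned long long term)
{
    struct entry_ref *slot;
    for (slot = *bucket; slot != NULL; slot = slot->next) {
        assert(slot->term != term); /* the check that fails in this issue */
    }
    slot = malloc(sizeof *slot);
    assert(slot != NULL);
    slot->term = term;
    slot->count = 1;
    slot->next = *bucket;
    *bucket = slot;
}

int main(void)
{
    struct entry_ref *bucket_3969 = NULL;

    /* An entry at index 3969 with term 2 is appended in memory. */
    refs_insert(&bucket_3969, 2);

    /* If a stale slot for (3969, term 2) survives whatever cleanup follows,
     * a later append of the same entry hits the assertion, just like the
     * Jepsen run above. */
    refs_insert(&bucket_3969, 2); /* aborts */

    return 0;
}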

@cole-miller
Contributor

Ahh, and importantly:

  • n3 is the leader in term 2, and its disk write for the entry at 3969 fails, causing it to step down

So I think the cleanup after the disk write failure must be buggy, missing a refcount decrement maybe
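
As a toy model of that hypothesis (all names invented, nothing here is the real raft code): a failed disk write has to undo both the entry itself and the reference taken for it, and skipping the second step leaves the refcount table disagreeing with logLastIndex.

#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Toy model of the follower state that a failed disk write must unwind. */
struct mini_log
{
    unsigned long long last_index; /* What logLastIndex() would report. */
    bool ref_at_3969;              /* "A refcount slot exists for (3969, term 2)". */
};

/* The follower accepts an AppendEntries for index 3969 / term 2: both the
 * entry and its refcount slot are added before the disk write starts. */
static void accept_append(struct mini_log *l)
{
    l->last_index = 3969;
    l->ref_at_3969 = true;
}

/* The disk write fails. A correct cleanup rolls back both pieces of state;
 * the suspected bug is that only the entry (last_index) is rolled back. */
static void on_write_failure(struct mini_log *l, bool buggy)
{
    l->last_index = 3968;
    if (!buggy) {
        l->ref_at_3969 = false;
    }
}

int main(void)
{
    struct mini_log l = {3968, false};

    accept_append(&l);
    on_write_failure(&l, true /* model the suspected missing decrement */);

    printf("last_index=%llu, ref_at_3969=%d\n", l.last_index,
           (int)l.ref_at_3969);

    /* The two views now disagree, exactly the out-of-sync state described
     * above: logLastIndex says there is no entry at 3969, but the refcount
     * table still tracks one, so the leader's retry trips the assertion. */
    assert(!(l.last_index < 3969 && l.ref_at_3969)); /* fails */

    return 0;
}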

cole-miller self-assigned this on Aug 23, 2023
@cole-miller
Contributor

It's possible that this was exposed by #460

@freeekanayaka
Contributor

Ahh, and importantly:

  • n3 is the leader in term 2, and its disk write for the entry at 3969 fails, causing it to step down

So I think the cleanup after the disk write failure must be buggy, missing a refcount decrement maybe

My 2c: when struct raft_io reports a disk write failure (i.e. raft_io->append()'s callback fires with a non-zero status), struct raft should somehow schedule a raft_io->append retry for later, rather than rolling back its own in-memory log and relying on the leader to resend everything and retrigger the whole flow. Rolling back is always hard and it's easy to miss details, while retrying should be relatively straightforward (or at least it limits any rollback logic to the internals of struct raft_io).

In case the write failure is ENOSPC, the follower's struct raft could also take note of that and stop accepting new AppendEntries messages until the situation resolves.
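
A rough sketch of that control flow, with entirely invented names (this is not the raft or raft_io API; the only assumption is that the append completion callback receives a non-zero status on failure): on failure the in-memory log is left untouched, ENOSPC blocks new AppendEntries, and the write is simply attempted again later.

#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical model of a pending disk write and its retry state. */
struct pending_write
{
    unsigned long long first_index; /* First index covered by the write. */
    unsigned n_entries;             /* Number of entries in the write. */
    unsigned retries;               /* How many retries have been scheduled. */
    bool blocked_on_enospc;         /* Refuse new AppendEntries while set. */
};

/* Pretend disk write: fails with ENOSPC the first two times. */
static int disk_write(struct pending_write *w)
{
    return w->retries < 2 ? -ENOSPC : 0;
}

/* Completion handler, in the spirit of an append completion callback. */
static void write_done(struct pending_write *w, int status)
{
    if (status == 0) {
        w->blocked_on_enospc = false;
        printf("persisted entries [%llu, %llu]\n", w->first_index,
               w->first_index + w->n_entries - 1);
        return;
    }
    /* Do not touch the in-memory log or its refcounts: the entries stay
     * where they are, so there is nothing to roll back and no chance of
     * leaving a stale refcount slot behind. */
    if (status == -ENOSPC) {
        w->blocked_on_enospc = true; /* stop accepting AppendEntries */
    }
    w->retries++;
    printf("write failed (%d), scheduling retry #%u\n", status, w->retries);
}

int main(void)
{
    struct pending_write w = {3969, 1, 0, false};
    int rv;

    /* In a real implementation the retry would be driven by a timer on the
     * event loop; a plain loop is enough to show the flow. */
    do {
        rv = disk_write(&w);
        write_done(&w, rv);
    } while (rv != 0);

    return 0;
}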

@cole-miller
Contributor

Hmm, couldn't the same goal be achieved by refusing to append the entries to the in-memory log until they've been persisted?

In any case, I'd like to fully understand what's going on to cause this specific broken invariant; there might be more to the story than what I've been able to deduce so far...
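
A toy model of that alternative ordering (invented names; in the real code the in-memory log also feeds replication, so this glosses over a lot): the entry only becomes visible in the in-memory bookkeeping once the disk write has succeeded, so a failed write leaves nothing to roll back.

#include <stdio.h>

/* Toy model of "persist first, then expose in memory". */
struct follower
{
    unsigned long long in_memory_last; /* What the in-memory log reports. */
    unsigned long long on_disk_last;   /* What has actually been persisted. */
};

static int persist(unsigned long long index)
{
    (void)index;
    return -1; /* simulate a failed disk write */
}

/* Handle one AppendEntries entry: write it to disk first, and only update
 * the in-memory log (and, in the real code, the refcount table) on success. */
static void handle_entry(struct follower *f, unsigned long long index)
{
    if (persist(index) != 0) {
        /* Nothing was added to the in-memory log, so there is nothing to
         * roll back and no stale refcount slot to forget about. */
        printf("write of %llu failed; in-memory log untouched\n", index);
        return;
    }
    f->on_disk_last = index;
    f->in_memory_last = index;
}

int main(void)
{
    struct follower f = {3968, 3968};
    handle_entry(&f, 3969);
    printf("in_memory_last=%llu on_disk_last=%llu\n", f.in_memory_last,
           f.on_disk_last);
    return 0;
}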

@cole-miller
Contributor

This seems to be somewhat reproducible in the scheduled runs, e.g. https://github.com/canonical/jepsen.dqlite/actions/runs/5977971227/job/16219112428

@cole-miller
Contributor

@cole-miller
Contributor

(fixed by #483)
