This repository has been archived by the owner on Mar 4, 2024. It is now read-only.

src/log.c:87: refsTryInsert: Assertion `next_slot->term != term' failed. #470

Closed
MathieuBordere opened this issue Aug 23, 2023 · 9 comments
Labels: Bug (Confirmed to be a bug)

@MathieuBordere
Contributor

Detected in Jepsen: https://github.com/canonical/jepsen.dqlite/actions/runs/5953700449/job/16148585347
jepsen-data-bank-partition,disk-failure(2).zip

MathieuBordere added the Bug (Confirmed to be a bug) label on Aug 23, 2023
@cole-miller
Contributor

raft/src/log.c, lines 84 to 87 at 1e5465d:

/* It should never happen that two entries with the same index and term
* get appended. So no existing slot in this bucket must track an entry
* with the same term as the given one. */
assert(next_slot->term != term);

@cole-miller
Contributor

cole-miller commented Aug 23, 2023

  • n1 becomes leader in term 3; it has an existing log entry at index 3969 from term 2, and it replicates a barrier entry (term 3) at index 3970
  • n3 receives an append request from n1 for just the barrier entry, and refuses it because it doesn't have an entry at index 3969 (based on logLastIndex)
  • n1 tries again, sending a request for two entries (indices 3969 and 3970)
  • this time, n3 does find at least a trace of an entry at index 3969, also with term 2 (presumably not one at 3970): there is a slot tracking such an entry in the refcount hashmap for its in-memory log

It seems like the refcount hashmap on n3 is getting out of sync with the value reported by logLastIndex...
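
For reference, here is a minimal sketch of the invariant involved. The names and structure below are invented and deliberately simplified (this is not the real refsTryInsert, which judging by the snippet works on buckets of slots in a hash table keyed by index), but the duplicate-term check is the same idea as the assertion at log.c:87.

#include <assert.h>
#include <stdlib.h>

/* Invented, simplified stand-in for one bucket of the in-memory log's
 * refcount hashmap: a chain of slots, one per term that has an entry at a
 * given index. This is not the real raft data structure. */
struct entry_ref
{
    unsigned long long term; /* Term of the tracked entry. */
    unsigned short count;    /* Outstanding references to that entry. */
    struct entry_ref *next;  /* Next slot for the same index. */
};

/* Insert a reference for an entry with the given term into a bucket. The
 * same (index, term) pair must never be appended twice, so no existing slot
 * in the bucket may already track an entry with this term; this is the
 * invariant asserted at log.c:87. */
static void refs_insert(struct entry_ref **bucket, unsigned long long term)
{
    struct entry_ref *slot;
    for (slot = *bucket; slot != NULL; slot = slot->next) {
        assert(slot->term != term); /* the check that fails in this issue */
    }
    slot = malloc(sizeof *slot);
    assert(slot != NULL);
    slot->term = term;
    slot->count = 1;
    slot->next = *bucket;
    *bucket = slot;
}

int main(void)
{
    struct entry_ref *bucket_3969 = NULL;

    /* An entry at index 3969 with term 2 is appended in memory. */
    refs_insert(&bucket_3969, 2);

    /* If a stale slot for (3969, term 2) survives whatever cleanup follows,
     * a later append of the same entry hits the assertion, just like the
     * Jepsen run above. */
    refs_insert(&bucket_3969, 2); /* aborts */

    return 0;
}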

@cole-miller
Contributor

Ahh, and importantly:

  • n3 is the leader in term 2, and its disk write for the entry at 3969 fails, causing it to step down

So I think the cleanup after the disk write failure must be buggy, missing a refcount decrement maybe
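
As a toy model of that hypothesis (all names invented, nothing here is the real raft code): a failed disk write has to undo both the entry itself and the reference taken for it, and skipping the second step leaves the refcount table disagreeing with logLastIndex.

#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Toy model of the follower state that a failed disk write must unwind. */
struct mini_log
{
    unsigned long long last_index; /* What logLastIndex() would report. */
    bool ref_at_3969;              /* "A refcount slot exists for (3969, term 2)". */
};

/* The follower accepts an AppendEntries for index 3969 / term 2: both the
 * entry and its refcount slot are added before the disk write starts. */
static void accept_append(struct mini_log *l)
{
    l->last_index = 3969;
    l->ref_at_3969 = true;
}

/* The disk write fails. A correct cleanup rolls back both pieces of state;
 * the suspected bug is that only the entry (last_index) is rolled back. */
static void on_write_failure(struct mini_log *l, bool buggy)
{
    l->last_index = 3968;
    if (!buggy) {
        l->ref_at_3969 = false;
    }
}

int main(void)
{
    struct mini_log l = {3968, false};

    accept_append(&l);
    on_write_failure(&l, true /* model the suspected missing decrement */);

    printf("last_index=%llu, ref_at_3969=%d\n", l.last_index,
           (int)l.ref_at_3969);

    /* The two views now disagree, exactly the out-of-sync state described
     * above: logLastIndex says there is no entry at 3969, but the refcount
     * table still tracks one, so the leader's retry trips the assertion. */
    assert(!(l.last_index < 3969 && l.ref_at_3969)); /* fails */

    return 0;
}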

cole-miller self-assigned this on Aug 23, 2023
@cole-miller
Contributor

It's possible that this was exposed by #460

@freeekanayaka
Contributor

Ahh, and importantly:

  • n3 is the leader in term 2, and its disk write for the entry at 3969 fails, causing it to step down

So I think the cleanup after the disk write failure must be buggy, missing a refcount decrement maybe

My 2c: when struct raft_io reports a disk write failure (i.e. raft_io->append()'s callback fires with a non-zero status), struct raft should somehow schedule a raft_io->append retry for later, rather than rolling back its own in-memory log and relying on the leader to resend everything and retrigger the whole flow. Rolling back is always hard and it's easy to miss details, while retrying should be relatively straightforward (or at least it limits any rollback logic to the internals of struct raft_io).

In case the write failure is ENOSPC, the follower's struct raft could also take note of that and stop accepting new AppendEntries messages until the situation resolves.
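
A rough sketch of that control flow, with entirely invented names (this is not the raft or raft_io API; the only assumption is that the append completion callback receives a non-zero status on failure): on failure the in-memory log is left untouched, ENOSPC blocks new AppendEntries, and the write is simply attempted again later.

#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical model of a pending disk write and its retry state. */
struct pending_write
{
    unsigned long long first_index; /* First index covered by the write. */
    unsigned n_entries;             /* Number of entries in the write. */
    unsigned retries;               /* How many retries have been scheduled. */
    bool blocked_on_enospc;         /* Refuse new AppendEntries while set. */
};

/* Pretend disk write: fails with ENOSPC the first two times. */
static int disk_write(struct pending_write *w)
{
    return w->retries < 2 ? -ENOSPC : 0;
}

/* Completion handler, in the spirit of an append completion callback. */
static void write_done(struct pending_write *w, int status)
{
    if (status == 0) {
        w->blocked_on_enospc = false;
        printf("persisted entries [%llu, %llu]\n", w->first_index,
               w->first_index + w->n_entries - 1);
        return;
    }
    /* Do not touch the in-memory log or its refcounts: the entries stay
     * where they are, so there is nothing to roll back and no chance of
     * leaving a stale refcount slot behind. */
    if (status == -ENOSPC) {
        w->blocked_on_enospc = true; /* stop accepting AppendEntries */
    }
    w->retries++;
    printf("write failed (%d), scheduling retry #%u\n", status, w->retries);
}

int main(void)
{
    struct pending_write w = {3969, 1, 0, false};
    int rv;

    /* In a real implementation the retry would be driven by a timer on the
     * event loop; a plain loop is enough to show the flow. */
    do {
        rv = disk_write(&w);
        write_done(&w, rv);
    } while (rv != 0);

    return 0;
}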

@cole-miller
Contributor

Hmm, couldn't the same goal be achieved by refusing to append the entries to the in-memory log until they've been persisted?

In any case, I'd like to fully understand what's going on to cause this specific broken invariant; there might be more to the story than what I've been able to deduce so far...
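
A toy model of that alternative ordering (invented names; in the real code the in-memory log also feeds replication, so this glosses over a lot): the entry only becomes visible in the in-memory bookkeeping once the disk write has succeeded, so a failed write leaves nothing to roll back.

#include <stdio.h>

/* Toy model of "persist first, then expose in memory". */
struct follower
{
    unsigned long long in_memory_last; /* What the in-memory log reports. */
    unsigned long long on_disk_last;   /* What has actually been persisted. */
};

static int persist(unsigned long long index)
{
    (void)index;
    return -1; /* simulate a failed disk write */
}

/* Handle one AppendEntries entry: write it to disk first, and only update
 * the in-memory log (and, in the real code, the refcount table) on success. */
static void handle_entry(struct follower *f, unsigned long long index)
{
    if (persist(index) != 0) {
        /* Nothing was added to the in-memory log, so there is nothing to
         * roll back and no stale refcount slot to forget about. */
        printf("write of %llu failed; in-memory log untouched\n", index);
        return;
    }
    f->on_disk_last = index;
    f->in_memory_last = index;
}

int main(void)
{
    struct follower f = {3968, 3968};
    handle_entry(&f, 3969);
    printf("in_memory_last=%llu on_disk_last=%llu\n", f.in_memory_last,
           f.on_disk_last);
    return 0;
}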

@cole-miller
Contributor

This seems to be somewhat reproducible in the scheduled runs, e.g. https://github.com/canonical/jepsen.dqlite/actions/runs/5977971227/job/16219112428

@cole-miller
Contributor

@cole-miller
Contributor

(fixed by #483)
