nsqd: per-topic WAL #625

mreiferson · 2015-08-09T18:11:31Z

This introduces a write-ahead log for nsqd topics using https://github.com/mreiferson/wal. This is the first part of #510.

At a high level, the WAL is a filesystem backed FIFO queue with monotonically increasing IDs, optional periodic flushing semantics (off by default), and the ability to open a "cursor" at any index within its retention window. Many cursors can be open simultaneously at different indices.

Each topic has its own WAL and each channel has its own cursor into it. The WAL replaces the per-topic messagePump() goroutine that previously copied messages to channels. Instead, we use Go's low-level sync.Cond to "wake up" any open cursors so they can advance when events are appended.
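The wake-up mechanism described above can be sketched roughly as follows. This is a minimal in-memory illustration, not the code in this PR or in the wal package: the `Log` and `Cursor` types and their methods are hypothetical names, and a slice stands in for the on-disk log.

```go
package main

import (
	"fmt"
	"sync"
)

// Log is a hypothetical in-memory stand-in for a per-topic WAL: an
// append-only slice guarded by a mutex, with a condition variable that
// is broadcast on every append.
type Log struct {
	mu      sync.Mutex
	cond    *sync.Cond
	entries [][]byte
}

func NewLog() *Log {
	l := &Log{}
	l.cond = sync.NewCond(&l.mu)
	return l
}

// Append stores data, returns its index, and wakes every waiting cursor.
func (l *Log) Append(data []byte) uint64 {
	l.mu.Lock()
	l.entries = append(l.entries, data)
	idx := uint64(len(l.entries) - 1)
	l.mu.Unlock()
	l.cond.Broadcast()
	return idx
}

// Cursor reads entries in order, starting from the index it was opened at.
type Cursor struct {
	log  *Log
	next uint64
}

func (l *Log) OpenCursor(idx uint64) *Cursor {
	return &Cursor{log: l, next: idx}
}

// Next blocks until an entry exists at the cursor position, then
// returns it and advances the cursor.
func (c *Cursor) Next() []byte {
	c.log.mu.Lock()
	defer c.log.mu.Unlock()
	for c.next >= uint64(len(c.log.entries)) {
		c.log.cond.Wait() // releases the lock while waiting
	}
	data := c.log.entries[c.next]
	c.next++
	return data
}

func main() {
	l := NewLog()
	var wg sync.WaitGroup
	// One cursor per (hypothetical) channel, all reading the same log.
	for _, name := range []string{"channel-a", "channel-b"} {
		wg.Add(1)
		go func(name string, c *Cursor) {
			defer wg.Done()
			for i := 0; i < 2; i++ {
				fmt.Printf("%s read %q\n", name, c.Next())
			}
		}(name, l.OpenCursor(0))
	}
	l.Append([]byte("m1"))
	l.Append([]byte("m2"))
	wg.Wait()
}
```

Each cursor re-checks its own position after waking, so a single Broadcast serves any number of cursors at different indices without a goroutine copying messages between them.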

There are quite a few remaining items to figure out here, namely:

mreiferson · 2015-08-09T18:21:35Z

This is a good sign so far:

mreiferson · 2015-08-09T18:34:13Z

Updated description, let me know if any of this sounds insane @jehiah.

/cc @stephensearles @judwhite

mreiferson · 2015-08-09T18:55:23Z

depth, FIN accounting, retention windows

The current plan is for each topic to maintain a RangeSet (thanks @rolyatmax). A RangeSet maintains a list of contiguous ranges of message IDs. For the topic, it will store the ranges of IDs of messages that have been published. For each channel, it will store the ranges of IDs of messages that have been FINd (for that channel). Thus, the difference in "count" between the two sets should be the current depth of a given channel. Topics will no longer have non-zero depth unless there are no channels present, in which case it is the count of its RangeSet.
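To make the depth accounting concrete, here is a small sketch of a RangeSet and the count difference described above. The type, its `Add` and `Count` methods, and the sample IDs are assumptions for illustration; the actual RangeSet in this branch may look different.

```go
package main

import "fmt"

// Range is an inclusive, contiguous run of message IDs.
type Range struct{ Low, High uint64 }

// RangeSet keeps sorted, non-overlapping, non-adjacent ranges.
// Hypothetical sketch, not the type used in this PR.
type RangeSet struct{ Ranges []Range }

// Add inserts id, merging with neighbouring ranges where they touch.
func (rs *RangeSet) Add(id uint64) {
	for i, r := range rs.Ranges {
		switch {
		case id >= r.Low && id <= r.High:
			return // already present
		case id+1 == r.Low:
			rs.Ranges[i].Low = id // extend this range downwards
			return
		case id == r.High+1:
			rs.Ranges[i].High = id // extend this range upwards
			// Merge with the next range if the two now touch.
			if i+1 < len(rs.Ranges) && rs.Ranges[i+1].Low == id+1 {
				rs.Ranges[i].High = rs.Ranges[i+1].High
				rs.Ranges = append(rs.Ranges[:i+1], rs.Ranges[i+2:]...)
			}
			return
		case id < r.Low:
			// Insert a new single-ID range before position i.
			rs.Ranges = append(rs.Ranges, Range{})
			copy(rs.Ranges[i+1:], rs.Ranges[i:])
			rs.Ranges[i] = Range{id, id}
			return
		}
	}
	rs.Ranges = append(rs.Ranges, Range{id, id})
}

// Count returns the total number of IDs covered by the set.
func (rs *RangeSet) Count() uint64 {
	var n uint64
	for _, r := range rs.Ranges {
		n += r.High - r.Low + 1
	}
	return n
}

func main() {
	var published, finished RangeSet
	for id := uint64(0); id < 10; id++ {
		published.Add(id) // topic: IDs that have been published
	}
	for _, id := range []uint64{0, 1, 2, 5, 7, 6} {
		finished.Add(id) // channel: IDs that have been FIN'd, in any order
	}
	fmt.Println("finished ranges:", finished.Ranges)
	fmt.Println("channel depth:", published.Count()-finished.Count())
}
```

Because FINs arrive out of order, the channel's set is usually a handful of ranges rather than one, which is still far cheaper to store than individual IDs.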

I don't really know what to do about BackendDepth - what do you think @jehiah?

This is also the mechanism we can use to determine what the retention window for a topic should be. By scanning each channel's RangeSets, we can determine the lowest non-contiguous index. The lowest index across channels is the retention point for the topic. Since we'll probably provide some configurable options around how much data we keep around, the largest of these values will be chosen as the actual retention point.
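A rough sketch of the per-channel scan might look like the following. It assumes each channel's FIN'd IDs are available as sorted, non-overlapping ranges and that `start` is the first ID still held in the log; the function names are hypothetical, and the configurable retention options mentioned above are left out.

```go
package main

import "fmt"

// Range is an inclusive, contiguous run of message IDs.
type Range struct{ Low, High uint64 }

// lowestUnfinished returns the first ID at or after start that is not
// covered by ranges, which must be sorted and non-overlapping.
func lowestUnfinished(start uint64, ranges []Range) uint64 {
	next := start
	for _, r := range ranges {
		if r.Low > next {
			break // gap: next has not been FIN'd yet
		}
		if r.High >= next {
			next = r.High + 1
		}
	}
	return next
}

// retentionPoint returns the lowest unfinished ID across all channels.
// Entries below it are no longer needed by any channel. With no
// channels, nothing can be dropped, so start is returned.
func retentionPoint(start uint64, channels map[string][]Range) uint64 {
	first := true
	var point uint64
	for _, ranges := range channels {
		if idx := lowestUnfinished(start, ranges); first || idx < point {
			point, first = idx, false
		}
	}
	if first {
		return start
	}
	return point
}

func main() {
	channels := map[string][]Range{
		"channel-a": {{0, 4}, {7, 9}}, // 5 is the first ID not yet FIN'd
		"channel-b": {{0, 2}},         // 3 is the first ID not yet FIN'd
	}
	fmt.Println("retention point:", retentionPoint(0, channels))
}
```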

Dieterbe · 2015-08-10T10:29:46Z

any thoughts on how the performance of this compares for single channel, 100% of msg going into diskqueue vs into WAL, and 100% of reads going out of diskqueue vs WAL ?

(i presume once multiple channels come into play, the shared WAL vs individual diskqueues makes the WAL significantly better)

also: will there be (optional) sync-ack semantics? so that the producer only gets an ack when the data has been synced to disk. needless to say this comes with a performance tradeoff but can be worth it if message size is big enough.

mreiferson · 2015-08-10T13:58:02Z

@Dieterbe working on correctness right now 😁. Performance should be better.

Dieterbe · 2015-08-10T14:36:16Z

@mreiferson is ack-on-sync something you think should make it into nsqd as well?
i think it would be great if my producers could know a message is safely stored in the diskqueue/WAL and synced to disk. (i'm looking into nsqd for a new project and kinda need this feature. if you think it sounds good i would happily work on this and PR it)

mreiferson · 2015-08-10T14:39:10Z

@Dieterbe that already is the case in both the current production implementation (IF the message overflowed to disk) and this PR. If it were not the case there would be no back-pressure!

Dieterbe · 2015-08-10T15:00:47Z

@mreiferson It doesn't look like it.. or there is a misunderstanding.
(d *diskQueue) Put(data []byte) writes to d.writeChan and returns when it reads from d.writeResponseChan. and those ops only have a count++ and a writeOne() (which only syncs when we open a new file) in between, the sync is only enforced at the next iteration of the loop. so Put() returns before the sync to disk happens (unless a new file is opened), so message loss can still happen. which is what i'm trying to avoid. am i missing something? (also, sorry to distract slightly from your WAL work)

mreiferson · 2015-08-10T15:29:58Z

Sorry, I misunderstood what you meant by "sync" - no, there is no mechanism to deliver fsync acks to producers.

The complexity doesn't feel like something that would be worth it. The edge cases where you would lose messages, although they certainly exist, are probably better handled via replication.

You could also run with --sync-every=1 but that would also be quite aggressive.

mreiferson · 2015-08-10T15:31:23Z

NOTE: there is still back-pressure from the periodic fsync because DiskQueue cannot make progress during a sync and individual writes are synchronized with the PUB.

stephensearles · 2015-08-10T16:24:16Z

Nice work! I'll dig into it a bit more later this week

Dieterbe · 2015-08-20T09:39:17Z

@mreiferson: does this mean that

1. if this is enabled, we have guaranteed FIFO message ordering semantics?
2. will consumers be able to seek to arbitrary positions? for example if i have a log of the last 24h, and my consumer hasn't consumed anything in the last 4 hours, but it always prefers data 5 minutes old and more recent, and i know that i have a message per second, will my consumer be able to say "seek to 300 messages before the end of the queue, and consume from there until the end"? (and then later that consumer, or a different one, will make sure to also read the entire 4h range) something like this would be really useful to me.

mreiferson · 2015-08-20T15:44:19Z

1. No, not exactly. On a per-node basis, despite the WAL itself being FIFO, there are still requeues that will be redelivered based on their expiration. And, across a cluster in aggregate, there is no coordination and no guaranteed ordering.
2. That will not be in this first pass, but is a feature we could consider providing down the road based on this work.

Dieterbe · 2015-08-20T15:49:39Z

oh yes sure, i just meant single nsqd and assuming no requeues. a bit of a bold assumption i know.. let's call it "best case scenario" ;-)
cool thanks. i thought it wouldn't be possible due to messages having varying sizes so you wouldn't know where to seek to unless you start reading from the beginning. would this require adding an extra index datastructure or something? or is that already included in the design?

mreiferson · 2015-08-20T15:53:19Z

(2) It's already required to be able to seek to a certain index
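For readers wondering how an index lookup can work with variable-size messages, one common approach is to record the byte offset of each entry as it is appended. The sketch below illustrates that general idea only; `indexedLog`, the 4-byte length prefix, and the in-memory buffer are assumptions, and nothing here describes how the wal package actually lays out its data.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// indexedLog stores variable-size records back to back, each prefixed
// with a 4-byte big-endian length, and remembers where each one starts.
// Hypothetical sketch: data stands in for an on-disk segment.
type indexedLog struct {
	data    []byte
	offsets []uint64 // offsets[i] is the byte offset of record i
}

// Append writes msg and returns the index assigned to it.
func (l *indexedLog) Append(msg []byte) uint64 {
	l.offsets = append(l.offsets, uint64(len(l.data)))
	var hdr [4]byte
	binary.BigEndian.PutUint32(hdr[:], uint32(len(msg)))
	l.data = append(l.data, hdr[:]...)
	l.data = append(l.data, msg...)
	return uint64(len(l.offsets) - 1)
}

// ReadAt returns record idx without scanning the records before it.
func (l *indexedLog) ReadAt(idx uint64) ([]byte, bool) {
	if idx >= uint64(len(l.offsets)) {
		return nil, false
	}
	off := l.offsets[idx]
	size := uint64(binary.BigEndian.Uint32(l.data[off : off+4]))
	return l.data[off+4 : off+4+size], true
}

func main() {
	var l indexedLog
	for i := 0; i < 1000; i++ {
		l.Append([]byte(fmt.Sprintf("message-%d", i)))
	}
	// Start 300 records before the end, as in the question above.
	start := uint64(len(l.offsets)) - 300
	msg, _ := l.ReadAt(start)
	fmt.Printf("%d: %s\n", start, msg)
}
```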

mreiferson added the feature label Aug 9, 2015

mreiferson mentioned this pull request Aug 10, 2015

Refactor guid, message, queue #626

Closed

mreiferson changed the title ~~nsqd: per-topic WAL~~ [dev] nsqd: per-topic WAL Aug 10, 2015

mreiferson force-pushed the wal_625 branch from 56a8028 to 6bb055f Compare September 5, 2015 17:40

mreiferson force-pushed the wal_625 branch 3 times, most recently from edbb15e to ffa6b45 Compare September 19, 2015 20:34

mreiferson force-pushed the wal_625 branch from ffa6b45 to 81a4d7e Compare October 4, 2015 21:33

mreiferson force-pushed the wal_625 branch 2 times, most recently from 2898aec to dc5221b Compare October 18, 2015 16:59

Dieterbe mentioned this pull request Nov 5, 2015

NMT doesn't properly handle out of order data grafana/metrictank#41

Closed

mreiferson force-pushed the wal_625 branch from fc72e02 to af55c2a Compare December 19, 2015 16:22

ploxiln mentioned this pull request Apr 15, 2016

nsqd: retry health check #594

Closed

mreiferson force-pushed the wal_625 branch from af55c2a to 66d1a7f Compare April 16, 2016 17:19

mreiferson mentioned this pull request Apr 18, 2016

nsqd: per-topic message IDs #741

Merged

mreiferson added 24 commits February 9, 2019 09:51

event

a909133

DPUB

7b964f7

meh

e96afc3

renaming

4214198

timestamp

ee8d24c

meh

8229906

pause

03d59a8

point to master

b33d673

cleanup channel.flush

540975b

racey todo

66a5902

topic pausing

600245a

topic's don't need a rangeset

98ba552

meh

f939f80

rebase

8ab61a9

meh

1834f19

compile

4733b5a

rebase

9787a35

rebase

fc8b9c7

rebase

10e381b

rebase

63f9155

stray test cruft

44d72d1

test: workaround go 1.10 test caching

4440c6d

flakey metadata test

3a29c98

rebase

673072e

mreiferson force-pushed the wal_625 branch from e854c2e to 673072e Compare February 9, 2019 18:09

ploxiln mentioned this pull request Jun 25, 2019

nsqd: set memoryMsgChan as nil when --mem-queue-size=0 #1159

Merged

mreiferson mentioned this pull request Jun 11, 2020

nsqd: raft-based HA message persistence #1169

Closed

mreiferson mentioned this pull request Jun 21, 2020

nsqd: switch to monotonic clocks for ID generation #1249

Closed

ploxiln mentioned this pull request Oct 22, 2021

nsqd: there seems to be a possibility that the message may be lost #1386

Closed

ploxiln mentioned this pull request Dec 9, 2021

when set mem-queue-size=0, maybe lost a message. #1391

Closed
