
Add initial WAL implementation and tests #3569

Merged 20 commits into master on Aug 19, 2015
Conversation

@pauldix (Member) commented Aug 5, 2015

This implements a WAL that is backed by the filesystem and disconnected from BoltDB.

Partitions

The WAL is broken into different partitions. The default number of partitions is 5. Each partition consists of a number of segment files. By default these files will get up to 2MB in size before a new segment file is opened. The files are numbered and start at 1. The number indicates the order in which the files should be read on startup to ensure data is recovered in the same order it was written.
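For illustration, here is a minimal sketch of replaying numbered segment files in order on startup; the directory layout, file extension, and helper names are assumptions, not necessarily what the PR uses.

package main

import (
	"fmt"
	"path/filepath"
	"sort"
	"strconv"
	"strings"
)

func main() {
	// Hypothetical partition directory; the real path and extension may differ.
	paths, err := filepath.Glob("/var/opt/influxdb/wal/1/*.wal")
	if err != nil {
		panic(err)
	}
	// Sort by the numeric file name so segments are replayed in write order.
	sort.Slice(paths, func(i, j int) bool {
		return segmentNum(paths[i]) < segmentNum(paths[j])
	})
	for _, p := range paths {
		fmt.Println("replaying segment:", p) // a real WAL would decode entries here
	}
}

func segmentNum(path string) int {
	n, _ := strconv.Atoi(strings.TrimSuffix(filepath.Base(path), ".wal"))
	return n
}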

Partitions are flushed and compacted individually. One of the goals with having multiple partitions was to be able to flush only a portion of the WAL at a time.
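As a rough sketch of what routing a write to one of the partitions could look like (the hashing scheme here is an assumption, not necessarily the PR's):

package main

import (
	"fmt"
	"hash/fnv"
)

const partitionCount = 5 // default number of WAL partitions

// partitionID maps a series key to a partition so the same series
// always lands in the same partition.
func partitionID(seriesKey string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(seriesKey))
	return h.Sum64() % partitionCount
}

func main() {
	fmt.Println(partitionID("cpu,host=server01,region=us-west"))
}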

Compaction

The WAL does not flush everything in a partition when it comes time to flush. It will only flush series that are over a given threshold (32KB by default). Here's the order of steps for compaction:

  1. Lock the partition to determine which series will be flushed and move those series to a flush cache. This operates only on in-memory structures and should be very fast.
  2. Start a new segment file so that all the old files can be compacted without blocking writes to this partition.
  3. Write any series that are over the flush threshold to the index (BoltDB).
  4. Read all the old segment files, pulling out entries for series that weren't flushed to the index, and write them to a temporary compaction file.
  5. Delete the old segment files as they are read.
  6. Rename the compaction file so that it is segment file number 1.

The reason for having a flush cache in addition to the regular cache is so that queries can be answered while a compaction is occurring. Writes can also come in for series that are in the middle of being flushed; they will simply get written into the new segment file and added to the in-memory cache for the next flush.
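A rough sketch of step 1 and the two caches described above; the types and field names are illustrative, not the PR's actual code.

package wal

import "sync"

// partition is an illustrative stand-in for the PR's partition type.
type partition struct {
	mu         sync.Mutex
	cache      map[string][][]byte // series key -> encoded entries awaiting flush
	flushCache map[string][][]byte // series currently being flushed; still readable by queries
}

// compact moves series over readySize into the flush cache (step 1), then
// would roll to a new segment, flush to BoltDB, and rewrite the remaining
// entries into a compaction file that becomes segment 1 (steps 2-6, omitted).
func (p *partition) compact(readySize int) {
	p.mu.Lock()
	p.flushCache = make(map[string][][]byte)
	for key, entries := range p.cache {
		if entriesSize(entries) >= readySize {
			p.flushCache[key] = entries
			delete(p.cache, key)
		}
	}
	p.mu.Unlock()

	// Steps 2-6 happen outside the lock so new writes go to the fresh
	// segment file and the regular cache while the flush is in progress.
}

func entriesSize(entries [][]byte) int {
	n := 0
	for _, e := range entries {
		n += len(e)
	}
	return n
}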

Task List

  • WAL writes to disk
  • Keep entries in memory cache with a method to get a Cursor on them
  • Read WAL segments from disk on startup to populate cache
  • Recover from corrupted WAL segment files
  • Periodically compact WAL based on max series size, percentage of series ready to flush, or max partition size
  • Gradually delay writes when the WAL is getting backed up
  • Force flush on max memory size for WAL partitions
  • Force flush if no writes have happened in some configurable interval of time
  • Integrate with BZ engine
  • Ensure that two points in the same series can't have the same time in the WAL
  • Add saving series and fields to WAL and update bz1 to store metadata in compressed format
  • Recover from incomplete compaction
  • Add option to force flush on startup
  • Add logging options
  • Make default WAL directory different than shard dir

DefaultSegmentSize = 2 * 1024 * 1024 // 2MB, default size of a segment file in a partitions
DefaultReadySeriesSize = 32 * 1024 // 32KB, series is ready to flush once it reaches this size
DefaultCompactionThreshold = 0.5 // flush and compact a partition once this ratio of keys are over the flush size
DefaultMaxSeriesSize = 2 * 1024 * 1024 // if any key in a WAL partition reaches this size, we'll force compaction of the partition
A contributor commented on this code:

Can you put the comments above the identifiers? :)
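For illustration, the same constants with the comments moved above the identifiers:

const (
	// DefaultSegmentSize is the default size (2MB) of a segment file in a partition.
	DefaultSegmentSize = 2 * 1024 * 1024

	// DefaultReadySeriesSize is the size (32KB) at which a series is ready to flush.
	DefaultReadySeriesSize = 32 * 1024

	// DefaultCompactionThreshold is the ratio of keys over the flush size at which
	// a partition is flushed and compacted.
	DefaultCompactionThreshold = 0.5

	// DefaultMaxSeriesSize forces compaction of a partition if any key reaches this size.
	DefaultMaxSeriesSize = 2 * 1024 * 1024
)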

@otoolep (Contributor) commented Aug 19, 2015

Took a first pass, mostly very small stuff. I don't have the full picture yet for this, but tried to look for common mistakes we make during coding etc. Generally looks pretty clean to me though.

Did have one real question about two fmt.Println statements which don't look right, and it seems we might lose some important logging.

@pauldix (Member Author) commented Aug 19, 2015

Updated to fix @otoolep's comment about logging. I'm going to merge this in so I can cut RC1, since it's important to get this out for people to test against. However, keep making comments on this PR and I'll address them in a separate one.

pauldix added a commit that referenced this pull request on Aug 19, 2015: Add initial WAL implementation and tests
@pauldix merged commit c31b88d into master on Aug 19, 2015
@pauldix mentioned this pull request on Aug 19, 2015
@pauldix (Member Author) commented Aug 19, 2015

@jhorwit2 just cut 0.9.3-rc1. Interested in hearing how it performs for you: https://groups.google.com/forum/#!topic/influxdb/T6mF4_WeNkM

@jhorwit2 (Contributor) commented:
@pauldix the RPM link doesn't work :I

I'll try to run some tests tomorrow.

@otoolep (Contributor) commented Aug 19, 2015

@jhorwit2 -- please try the link below for the RPM:

https://s3.amazonaws.com/influxdb/influxdb-0.9.3_rc1-1.x86_64.rpm

@jhorwit2 (Contributor) commented:
Thanks, are WAL flushes still supposed to cause latency in writes?

@jhorwit2 (Contributor) commented:
Decided to give it a quick shot. I am noticing that WAL flushes are sometimes taking a while (5+ seconds), which causes writes to time out with a 500. Besides that issue, I'm seeing better performance off the bat.

@jhorwit2 (Contributor) commented:
Also, I noticed the config has /var/log/influxdb/wal as a new option, but I don't see anything at that location. Is that normal??

@pauldix (Member Author) commented Aug 19, 2015

@jhorwit2 Are you sure you upgraded? What version does it say it is? WAL flushes shouldn't cause writes to time out at all. Writes should still happen while flush and compact are running.

If that changed then it has to be due to something I merged into the WAL branch earlier today, because I tested it extensively and definitely wasn't getting any delay due to WAL flush.

The new location should have the WAL files, though it does a complete flush of a given partition if you haven't written data to it in 5 minutes. There should still be directories in there for the partitions.
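A minimal sketch of that idle-flush behavior, assuming a hypothetical lastWriteTime field and flushAll callback (not the PR's actual names):

package wal

import (
	"sync"
	"time"
)

const flushColdInterval = 5 * time.Minute // hypothetical name for the idle-flush interval

type idlePartition struct {
	mu            sync.Mutex
	lastWriteTime time.Time
}

// flushIfCold would be called periodically by a background goroutine and
// flushes the whole partition (not just series over the threshold) if no
// writes have arrived within flushColdInterval.
func (p *idlePartition) flushIfCold(flushAll func()) {
	p.mu.Lock()
	cold := time.Since(p.lastWriteTime) > flushColdInterval
	p.mu.Unlock()
	if cold {
		flushAll()
	}
}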

@pauldix (Member Author) commented Aug 19, 2015

@jhorwit2 Actually, if you upgraded but have old data around, then you're probably still writing to a b1 engine, which doesn't include this update. You'll need to either start with a fresh database or wait for a new shard group to get created for current data, which will use the new engine + WAL.

@kenzodeluxe commented:
I can confirm that the performance enhancement appears to be great. I'm currently writing 10k points/sec at peak without any issues (it appears Influx is sweating way less than before) and have done so for a few hours. Before the change, it would stop working after ~30 min at the latest.

@pauldix (Member Author) commented Aug 19, 2015

@kenzodeluxe What kind of hardware? Are you writing a single data point per request? I'd expect you to be able to get 10x that if you batch points together in requests of 1k to 10k points each.
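As a rough sketch of that kind of batching, assuming the 0.9.x line-protocol /write endpoint (the URL, database name, and measurement here are placeholders):

package main

import (
	"bytes"
	"fmt"
	"net/http"
	"time"
)

func main() {
	var batch bytes.Buffer
	now := time.Now().UnixNano()
	// Build one request containing 5,000 points instead of 5,000 single-point requests.
	for i := 0; i < 5000; i++ {
		fmt.Fprintf(&batch, "cpu,host=server%02d value=%d %d\n", i%10, i, now+int64(i))
	}
	resp, err := http.Post("http://localhost:8086/write?db=mydb", "text/plain", &batch)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}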

@jhorwit2 (Contributor) commented:
I did some preliminary testing last weekend on your branch and was getting up to 50k/s with pretty standard points (a few tags and one float field). This was on my home Linux box: 4 cores, 16GB RAM, SSD.

I'm updating our test box (32GB RAM, 16 cores, SSD) this morning and I'll run more tests.

@pauldix you were right about using the old b1. I made a new database and things seem better.

@kenzodeluxe commented:
@pauldix This is currently a VM running on a PowerEdge R720 with 4GB memory and 4 cores (for testing); the requests are batched, but we had reduced the number of points per batch based on our past experience (with the actual hardware, running natively). With 90k entries per batch, I see about 3s to write the data, which is great. Once we're done with our preliminary testing we'll move this over to actual hardware (and I can share further details then).

length := make([]byte, 8)
for {
	// get the length of the compressed seriesAndFields blob
	_, err := f.Read(length)
A contributor commented on this code:

Read() can read less than len(length). I think you want to use io.ReadFull(f, length).
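For reference, a minimal sketch of the suggested fix; the file name is hypothetical and error handling is simplified:

package main

import (
	"io"
	"os"
)

func main() {
	f, err := os.Open("00001.wal") // hypothetical segment file
	if err != nil {
		panic(err)
	}
	defer f.Close()

	length := make([]byte, 8)
	// io.ReadFull keeps reading until the buffer is full, whereas a single
	// f.Read call may legitimately return fewer than len(length) bytes.
	if _, err := io.ReadFull(f, length); err != nil {
		if err == io.EOF || err == io.ErrUnexpectedEOF {
			return // end of segment or a truncated entry
		}
		panic(err)
	}
	// decode the length and read the compressed seriesAndFields blob here
}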

@pauldix (Member Author) replied:

fixed in 370f008

@jhorwit2 (Contributor) commented:
@pauldix - I'm seeing some really weird things with queries. I'm getting non-deterministic results from my select queries: I sometimes get back something super small like 15, and sometimes I get 2000+ (which is what I expect).

It appears to affect only pretty recent data, which in our use case isn't good because we use this for near real-time insights; the last few minutes of data are spotty.

@jhorwit2 (Contributor) commented:
You can see some of that here. These are screenshots after a refresh.

[three screenshots of the same query returning different results after successive refreshes]

@jhorwit2 (Contributor) commented:
@pauldix - ran some tests with this script.

Times are in milliseconds.

metrics: 18:32:14.247656 timer requests
metrics: 18:32:14.247679   count:            1200
metrics: 18:32:14.247696   min:         145.223
metrics: 18:32:14.247719   max:         783.381
metrics: 18:32:14.247736   mean:        550.923
metrics: 18:32:14.247755   stddev:      126.184
metrics: 18:32:14.247770   median:      586.401
metrics: 18:32:14.247785   75%:         636.971
metrics: 18:32:14.247799   95%:         691.052
metrics: 18:32:14.247814   99%:         735.061
metrics: 18:32:14.247829   99.9%:       783.269
metrics: 18:32:14.247846   1-min rate:         18.40
metrics: 18:32:14.247862   5-min rate:         16.67
metrics: 18:32:14.247878   15-min rate:        16.24
metrics: 18:32:14.247894   mean rate:          19.76

My rate was about ~18.40 requests per second over a minute, which at a batch size of 5k is ~92k points per second. 0.9.2 definitely couldn't handle this scale, even with a tweaked config.

This was on my home box, which is a quad core with 16GB RAM and an SSD. (I didn't get a chance to spin up a server today; hopefully I will tomorrow.)

Here are the numbers for 8k batches at the same request rate per second (so ~147k points/s):

metrics: 18:39:25.595658   count:            1200
metrics: 18:39:25.595677   min:         208.609
metrics: 18:39:25.595694   max:         1248.26
metrics: 18:39:25.595748   mean:        791.554
metrics: 18:39:25.595771   stddev:      176.45
metrics: 18:39:25.595788   median:      848.963
metrics: 18:39:25.595805   75%:         909.971
metrics: 18:39:25.595822   95%:         971.757
metrics: 18:39:25.595839   99%:         1021.95
metrics: 18:39:25.595856   99.9%:       1246.05
metrics: 18:39:25.595873   1-min rate:         18.41
metrics: 18:39:25.595890   5-min rate:         16.67
metrics: 18:39:25.595908   15-min rate:        16.24
metrics: 18:39:25.595926   mean rate:          19.69

@kenzodeluxe commented:
influxd was killed by the OOM killer yesterday; I increased available memory to 64GB and it has been running stably since then. I had 5 timeouts on writes that took longer than 5s during a period of heavy querying from Grafana. Without that load, a batch of 10k points usually takes ~250ms to process (according to the logs).

@pkittenis commented:
Posted this on the mailing list, re-posting here. In short, there are still write timeouts with many clients sending small-ish batches of metrics.

To replicate the write timeouts, the following load test client command can be used with the influxdb graphite plugin listening on TCP port 2003. The graphite plugin batch size is 102400, though adjusting it makes no difference.

# 100 persistent concurrent connections sending batches of 10 metrics with no delay between batches
$ python graphite_test_client.py -c 100 -m 10 -d 0 127.0.0.1 3600
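The gist's Python client isn't reproduced here; purely as an illustration of the traffic pattern, a minimal sender of graphite plaintext lines over TCP port 2003 might look like this (the host, metric name, and pacing are placeholders):

package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	conn, err := net.Dial("tcp", "127.0.0.1:2003")
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	for i := 0; ; i++ {
		// graphite plaintext protocol: "<metric path> <value> <unix timestamp>\n"
		fmt.Fprintf(conn, "servers.host01.load %d %d\n", i%100, time.Now().Unix())
		time.Sleep(10 * time.Millisecond)
	}
}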

Effect on influxdb:

[graphite] 2015/08/21 10:19:53 failed to write point batch to database "graphite": timeout
[graphite] 2015/08/21 10:19:58 failed to write point batch to database "graphite": timeout
[graphite] 2015/08/21 10:20:03 failed to write point batch to database "graphite": timeout
[graphite] 2015/08/21 10:20:08 failed to write point batch to database "graphite": timeout
[graphite] 2015/08/21 10:20:13 failed to write point batch to database "graphite": timeout
[graphite] 2015/08/21 10:20:19 failed to write point batch to database "graphite": timeout
[graphite] 2015/08/21 10:20:24 failed to write point batch to database "graphite": timeout
[graphite] 2015/08/21 10:20:29 failed to write point batch to database "graphite": timeout
[graphite] 2015/08/21 10:20:34 failed to write point batch to database "graphite": timeout
[graphite] 2015/08/21 10:20:39 failed to write point batch to database "graphite": timeout
[graphite] 2015/08/21 10:20:44 failed to write point batch to database "graphite": timeout

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                  
28019 influxdb  20   0  586m 285m 6948 S 100.6  1.8   6:20.88 influxd                                                                 

To install the load test client dependencies, run pip install -r requirements.txt after cloning the gist.

Hope this helps.

@beckettsean (Contributor) commented:
@pkittenis Are your test results from 0.9.3-rc1 or another version of the database?

@pkittenis commented:
@beckettsean - 0.9.3-rc1, yes.

@otoolep deleted the pd-wal branch on October 28, 2015.