
Add initial WAL implementation and tests #3569

Merged 20 commits into master on Aug 19, 2015
Conversation

@pauldix (Member) commented Aug 5, 2015

This implements a WAL that is backed by the filesystem and disconnected from BoltDB.

Partitions

The WAL is broken into different partitions. The default number of partitions is 5. Each partition consists of a number of segment files. By default these files will get up to 2MB in size before a new segment file is opened. The files are numbered and start at 1. The number indicates the order in which the files should be read on startup to ensure data is recovered in the same order it was written.
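For illustration, here is a minimal sketch of replaying numbered segment files in order on startup; the directory layout, file extension, and helper names are assumptions, not necessarily what the PR uses.

package main

import (
	"fmt"
	"path/filepath"
	"sort"
	"strconv"
	"strings"
)

func main() {
	// Hypothetical partition directory; the real path and extension may differ.
	paths, err := filepath.Glob("/var/opt/influxdb/wal/1/*.wal")
	if err != nil {
		panic(err)
	}
	// Sort by the numeric file name so segments are replayed in write order.
	sort.Slice(paths, func(i, j int) bool {
		return segmentNum(paths[i]) < segmentNum(paths[j])
	})
	for _, p := range paths {
		fmt.Println("replaying segment:", p) // a real WAL would decode entries here
	}
}

func segmentNum(path string) int {
	n, _ := strconv.Atoi(strings.TrimSuffix(filepath.Base(path), ".wal"))
	return n
}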

Partitions are flushed and compacted individually. One of the goals with having multiple partitions was to be able to flush only a portion of the WAL at a time.
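As a rough sketch of what routing a write to one of the partitions could look like (the hashing scheme here is an assumption, not necessarily the PR's):

package main

import (
	"fmt"
	"hash/fnv"
)

const partitionCount = 5 // default number of WAL partitions

// partitionID maps a series key to a partition so the same series
// always lands in the same partition.
func partitionID(seriesKey string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(seriesKey))
	return h.Sum64() % partitionCount
}

func main() {
	fmt.Println(partitionID("cpu,host=server01,region=us-west"))
}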

Compaction

The WAL does not flush everything in a partition when it comes time to flush. It will only flush series that are over a given threshold (32KB by default). Here's the order of steps for compaction:

  1. Lock the partition to determine which series will be flushed and move those series to a flush cache. This operates only on in-memory structures and should be very fast.
  2. Start a new segment file so that all the old files can be compacted without blocking writes to this partition.
  3. Write any series that are over the flush threshold to the index (BoltDB).
  4. Read all the old segment files, pulling out entries for series that weren't flushed to the index, and write them to a temporary compaction file.
  5. Delete the old segment files as they are read.
  6. Rename the compaction file so that it is segment file number 1.

The reason for having a flush cache in addition to the regular cache is so that queries can be answered while a compaction is occurring. Writes can also come in for series that are in the middle of being flushed; they will simply get written into the new segment file and added to the in-memory cache for the next flush.
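A rough sketch of step 1 and the two caches described above; the types and field names are illustrative, not the PR's actual code.

package wal

import "sync"

// partition is an illustrative stand-in for the PR's partition type.
type partition struct {
	mu         sync.Mutex
	cache      map[string][][]byte // series key -> encoded entries awaiting flush
	flushCache map[string][][]byte // series currently being flushed; still readable by queries
}

// compact moves series over readySize into the flush cache (step 1), then
// would roll to a new segment, flush to BoltDB, and rewrite the remaining
// entries into a compaction file that becomes segment 1 (steps 2-6, omitted).
func (p *partition) compact(readySize int) {
	p.mu.Lock()
	p.flushCache = make(map[string][][]byte)
	for key, entries := range p.cache {
		if entriesSize(entries) >= readySize {
			p.flushCache[key] = entries
			delete(p.cache, key)
		}
	}
	p.mu.Unlock()

	// Steps 2-6 happen outside the lock so new writes go to the fresh
	// segment file and the regular cache while the flush is in progress.
}

func entriesSize(entries [][]byte) int {
	n := 0
	for _, e := range entries {
		n += len(e)
	}
	return n
}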

Task List

  • WAL writes to disk
  • Keep entries in memory cache with a method to get a Cursor on them
  • Read WAL segments from disk on startup to populate cache
  • Recover from corrupted WAL segment files
  • Periodically compact WAL based on max series size, percentage of series ready to flush, or max partition size
  • Gradually delay writes when the WAL is getting backed up
  • Force flush on max memory size for WAL partitions
  • Force flush if no writes have happened in some configurable interval of time
  • Integrate with BZ engine
  • Ensure that two points in the same series can't have the same time in the WAL
  • Add saving series and fields to WAL and update bz1 to store metadata in compressed format
  • Recover from incomplete compaction
  • Add option to force flush on startup
  • Add logging options
  • Make default WAL directory different than shard dir

DefaultSegmentSize = 2 * 1024 * 1024 // 2MB, default size of a segment file in a partitions
DefaultReadySeriesSize = 32 * 1024 // 32KB, series is ready to flush once it reaches this size
DefaultCompactionThreshold = 0.5 // flush and compact a partition once this ratio of keys are over the flush size
DefaultMaxSeriesSize = 2 * 1024 * 1024 // if any key in a WAL partition reaches this size, we'll force compaction of the partition
A contributor commented on this code:

Can you put the comments above the identifiers? :)
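For illustration, the same constants with the comments moved above the identifiers:

const (
	// DefaultSegmentSize is the default size (2MB) of a segment file in a partition.
	DefaultSegmentSize = 2 * 1024 * 1024

	// DefaultReadySeriesSize is the size (32KB) at which a series is ready to flush.
	DefaultReadySeriesSize = 32 * 1024

	// DefaultCompactionThreshold is the ratio of keys over the flush size at which
	// a partition is flushed and compacted.
	DefaultCompactionThreshold = 0.5

	// DefaultMaxSeriesSize forces compaction of a partition if any key reaches this size.
	DefaultMaxSeriesSize = 2 * 1024 * 1024
)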

@otoolep (Contributor) commented Aug 19, 2015

Took a first pass, mostly very small stuff. I don't have the full picture yet for this, but tried to look for common mistakes we make during coding etc. Generally looks pretty clean to me though.

Did have one real question about two fmt.Println statements which don't look right, and it seems we might lose some important logging.

@pauldix (Member Author) commented Aug 19, 2015

Updated to fix @otoolep's comment about logging. I'm going to merge this in so I can cut RC1, since it's important to get this out for people to test against. However, keep making comments on this PR and I'll address them in a separate one.

pauldix added a commit that referenced this pull request on Aug 19, 2015: Add initial WAL implementation and tests
@pauldix merged commit c31b88d into master on Aug 19, 2015
@pauldix mentioned this pull request on Aug 19, 2015
@pauldix (Member Author) commented Aug 19, 2015

@jhorwit2 just cut 0.9.3-rc1. Interested in hearing how it performs for you: https://groups.google.com/forum/#!topic/influxdb/T6mF4_WeNkM

@jhorwit2 (Contributor) commented:
@pauldix the RPM link doesn't work :I

I'll try to run some tests tomorrow.

@otoolep (Contributor) commented Aug 19, 2015

@jhorwit2 -- please try the link below for the RPM:

https://s3.amazonaws.com/influxdb/influxdb-0.9.3_rc1-1.x86_64.rpm

@jhorwit2 (Contributor) commented:
Thanks, are WAL flushes still supposed to cause latency in writes?

@jhorwit2 (Contributor) commented:
Decided to give it a quick shot. I am noticing that WAL flushes are sometimes taking a while (5+ seconds), which causes writes to time out with a 500. Besides that issue, I'm seeing better performance off the bat.

@jhorwit2 (Contributor) commented:
Also, I noticed the config has /var/log/influxdb/wal as a new option, but I don't see anything at that location. Is that normal??

@pauldix (Member Author) commented Aug 19, 2015

@jhorwit2 Are you sure you upgraded? What version does it say it is? WAL flushes shouldn't cause writes to time out at all. Writes should still happen while flush and compact are running.

If that changed then it has to be due to something I merged into the WAL branch earlier today, because I tested it extensively and definitely wasn't getting any delay due to WAL flush.

The new location should have the WAL files, though it does a complete flush of a given partition if you haven't written data to it in 5 minutes. There should still be directories in there for the partitions.
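A minimal sketch of that idle-flush behavior, assuming a hypothetical lastWriteTime field and flushAll callback (not the PR's actual names):

package wal

import (
	"sync"
	"time"
)

const flushColdInterval = 5 * time.Minute // hypothetical name for the idle-flush interval

type idlePartition struct {
	mu            sync.Mutex
	lastWriteTime time.Time
}

// flushIfCold would be called periodically by a background goroutine and
// flushes the whole partition (not just series over the threshold) if no
// writes have arrived within flushColdInterval.
func (p *idlePartition) flushIfCold(flushAll func()) {
	p.mu.Lock()
	cold := time.Since(p.lastWriteTime) > flushColdInterval
	p.mu.Unlock()
	if cold {
		flushAll()
	}
}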

@pauldix (Member Author) commented Aug 19, 2015

@jhorwit2 Actually, if you upgraded but have old data around, then you're probably still writing to a b1 engine, which doesn't include this update. You'll need to either start with a fresh database or wait for a new shard group to get created for current data, which will use the new engine + WAL.

@kenzodeluxe commented:
I can confirm that the performance enhancement appears to be great. I'm currently writing 10k points/sec at peak without any issues (it appears Influx is sweating way less than before) and have done so for a few hours. Before the change, it would stop working after ~30 min at the latest.

@pauldix (Member Author) commented Aug 19, 2015

@kenzodeluxe What kind of hardware? Are you writing a single data point per request? I'd expect you to be able to get 10x that if you batch points together in requests of 1k to 10k points each.
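As a rough sketch of that kind of batching, assuming the 0.9.x line-protocol /write endpoint (the URL, database name, and measurement here are placeholders):

package main

import (
	"bytes"
	"fmt"
	"net/http"
	"time"
)

func main() {
	var batch bytes.Buffer
	now := time.Now().UnixNano()
	// Build one request containing 5,000 points instead of 5,000 single-point requests.
	for i := 0; i < 5000; i++ {
		fmt.Fprintf(&batch, "cpu,host=server%02d value=%d %d\n", i%10, i, now+int64(i))
	}
	resp, err := http.Post("http://localhost:8086/write?db=mydb", "text/plain", &batch)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}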

@jhorwit2 (Contributor) commented:
I did some preliminary testing last weekend on your branch and was getting up to 50k/s with pretty standard points (a few tags and one float field). This was on my home Linux box: 4 cores, 16GB RAM, SSD.

I'm updating our test box (32GB RAM, 16 cores, SSD) this morning and I'll run more tests.

@pauldix you were right about using the old b1. I made a new database and things seem better.

@kenzodeluxe commented:
@pauldix This is currently a VM running on a PowerEdge R720 with 4GB memory and 4 cores (for testing); the requests are batched, but we had reduced the number of points per batch based on our past experience (with the actual hardware, running natively). With 90k entries per batch, I see about 3s to write the data, which is great. Once we're done with our preliminary testing we'll move this over to actual hardware (and I can share further details then).

length := make([]byte, 8)
for {
	// get the length of the compressed seriesAndFields blob
	_, err := f.Read(length)
A contributor commented on this code:

Read() can read less than len(length). I think you want to use io.ReadFull(f, length).
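For reference, a minimal sketch of the suggested fix; the file name is hypothetical and error handling is simplified:

package main

import (
	"io"
	"os"
)

func main() {
	f, err := os.Open("00001.wal") // hypothetical segment file
	if err != nil {
		panic(err)
	}
	defer f.Close()

	length := make([]byte, 8)
	// io.ReadFull keeps reading until the buffer is full, whereas a single
	// f.Read call may legitimately return fewer than len(length) bytes.
	if _, err := io.ReadFull(f, length); err != nil {
		if err == io.EOF || err == io.ErrUnexpectedEOF {
			return // end of segment or a truncated entry
		}
		panic(err)
	}
	// decode the length and read the compressed seriesAndFields blob here
}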

@pauldix (Member Author) replied:

fixed in 370f008

@jhorwit2 (Contributor) commented:
@pauldix - I'm seeing some really weird things with queries. I'm getting non-deterministic results from my select queries: I sometimes get back something super small like 15, and sometimes I get 2000+ (which is what I expect).

It appears to affect only pretty recent data, which in our use case isn't good because we use this for near real-time insights; the last few minutes of data are spotty.

@jhorwit2 (Contributor) commented:
You can see some of that here. These are screenshots after a refresh.

[three screenshots of the same query returning different results after successive refreshes]

@jhorwit2 (Contributor) commented:
@pauldix - ran some tests with this script.

Times are in milliseconds.

metrics: 18:32:14.247656 timer requests
metrics: 18:32:14.247679   count:            1200
metrics: 18:32:14.247696   min:         145.223
metrics: 18:32:14.247719   max:         783.381
metrics: 18:32:14.247736   mean:        550.923
metrics: 18:32:14.247755   stddev:      126.184
metrics: 18:32:14.247770   median:      586.401
metrics: 18:32:14.247785   75%:         636.971
metrics: 18:32:14.247799   95%:         691.052
metrics: 18:32:14.247814   99%:         735.061
metrics: 18:32:14.247829   99.9%:       783.269
metrics: 18:32:14.247846   1-min rate:         18.40
metrics: 18:32:14.247862   5-min rate:         16.67
metrics: 18:32:14.247878   15-min rate:        16.24
metrics: 18:32:14.247894   mean rate:          19.76

My rate was about ~18.40 requests per second over a minute, which at a batch size of 5k is ~92k points per second. 0.9.2 definitely couldn't handle this scale, even with a tweaked config.

This was on my home box, which is a quad core with 16GB RAM and an SSD. (I didn't get a chance to spin up a server today; hopefully I will tomorrow.)

Here are the numbers for 8k batches at the same request rate per second (so ~147k points/s):

metrics: 18:39:25.595658   count:            1200
metrics: 18:39:25.595677   min:         208.609
metrics: 18:39:25.595694   max:         1248.26
metrics: 18:39:25.595748   mean:        791.554
metrics: 18:39:25.595771   stddev:      176.45
metrics: 18:39:25.595788   median:      848.963
metrics: 18:39:25.595805   75%:         909.971
metrics: 18:39:25.595822   95%:         971.757
metrics: 18:39:25.595839   99%:         1021.95
metrics: 18:39:25.595856   99.9%:       1246.05
metrics: 18:39:25.595873   1-min rate:         18.41
metrics: 18:39:25.595890   5-min rate:         16.67
metrics: 18:39:25.595908   15-min rate:        16.24
metrics: 18:39:25.595926   mean rate:          19.69

@kenzodeluxe commented:
influxd was killed by the OOM killer yesterday; I increased available memory to 64GB and it has been running stably since then. I had 5 timeouts on writes that took longer than 5s during a period of heavy querying from Grafana. Without that load, a batch of 10k points usually takes ~250ms to process (according to the logs).

@pkittenis commented:
Posted this on the mailing list, re-posting here. In short, there are still write timeouts with many clients sending small-ish batches of metrics.

To replicate the write timeouts, the following load test client command can be used with the influxdb graphite plugin listening on TCP port 2003. The graphite plugin batch size is 102400, though adjusting it makes no difference.

# 100 persistent concurrent connections sending batches of 10 metrics with no delay between batches
$ python graphite_test_client.py -c 100 -m 10 -d 0 127.0.0.1 3600
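The gist's Python client isn't reproduced here; purely as an illustration of the traffic pattern, a minimal sender of graphite plaintext lines over TCP port 2003 might look like this (the host, metric name, and pacing are placeholders):

package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	conn, err := net.Dial("tcp", "127.0.0.1:2003")
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	for i := 0; ; i++ {
		// graphite plaintext protocol: "<metric path> <value> <unix timestamp>\n"
		fmt.Fprintf(conn, "servers.host01.load %d %d\n", i%100, time.Now().Unix())
		time.Sleep(10 * time.Millisecond)
	}
}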

Effect on influxdb:

[graphite] 2015/08/21 10:19:53 failed to write point batch to database "graphite": timeout
[graphite] 2015/08/21 10:19:58 failed to write point batch to database "graphite": timeout
[graphite] 2015/08/21 10:20:03 failed to write point batch to database "graphite": timeout
[graphite] 2015/08/21 10:20:08 failed to write point batch to database "graphite": timeout
[graphite] 2015/08/21 10:20:13 failed to write point batch to database "graphite": timeout
[graphite] 2015/08/21 10:20:19 failed to write point batch to database "graphite": timeout
[graphite] 2015/08/21 10:20:24 failed to write point batch to database "graphite": timeout
[graphite] 2015/08/21 10:20:29 failed to write point batch to database "graphite": timeout
[graphite] 2015/08/21 10:20:34 failed to write point batch to database "graphite": timeout
[graphite] 2015/08/21 10:20:39 failed to write point batch to database "graphite": timeout
[graphite] 2015/08/21 10:20:44 failed to write point batch to database "graphite": timeout

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                  
28019 influxdb  20   0  586m 285m 6948 S 100.6  1.8   6:20.88 influxd                                                                 

To install the load test client dependencies, run pip install -r requirements.txt after cloning the gist.

Hope this helps.

@beckettsean (Contributor) commented:
@pkittenis Are your test results from 0.9.3-rc1 or another version of the database?

@pkittenis commented:
@beckettsean - 0.9.3-rc1, yes.

@otoolep deleted the pd-wal branch on October 28, 2015.