[0.12] Retention policy cleanup does not remove series #6457

Closed
cnelissen opened this issue Apr 23, 2016 · 3 comments · Fixed by #6485

@cnelissen

I have a dataset that contains an unbounded number of endpoints that need to be queried, namely guest devices on a transient network (think hotel lobby or coffee shop). I am trying to store a small number of metrics over a very large and constantly changing set of devices. At any given time I have approximately 50,000 unique devices connected to the system, and I add about 500-1,000 unique devices per minute. I need to be able to display metrics for a particular device, so each unique device becomes a unique series, i.e.:

measurement: "connected-devices"
tags: clientId, macAddress, accessPointId, etc
values: downloadBytes, uploadBytes, signalStrength, etc
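
For illustration, a single point in this schema would look roughly like this in line protocol (the tag and field values below are hypothetical):

connected-devices,clientId=abc123,macAddress=aa:bb:cc:dd:ee:ff,accessPointId=ap-07 downloadBytes=123456i,uploadBytes=7890i,signalStrength=-61i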

I have created a 1hr retention policy to store the per-device data so the number of series does not become unwieldy:

CREATE RETENTION POLICY "rp-1h" ON "db1" DURATION 1h REPLICATION 1 SHARD DURATION 2m

I have set up the retention check-interval to run every 5 minutes, so I should be evicting old data fairly regularly and thereby keeping the number of series to a minimum (50,000 initial + (1,000 per minute * 60 minutes) = ~110,000 series total for the retention policy).
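
For reference, a 5-minute check interval corresponds to something like the following in the [retention] section of influxdb.conf (an illustrative sketch; defaults vary by version):

[retention]
  enabled = true
  check-interval = "5m"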

The issue is that the number of series increases linearly until the system runs out of memory and eventually crashes. The retention policy runs as expected but the number of series that the system reports does not decrease along with the removal of the shards.

[retention] 2016/04/22 23:56:45 retention policy shard deletion check commencing
[retention] 2016/04/22 23:56:45 deleted shard group 617 from database db1, retention policy rp-1h
[retention] 2016/04/22 23:56:45 deleted shard group 618 from database db1, retention policy rp-1h
[retention] 2016/04/22 23:56:45 deleted shard group 616 from database db1, retention policy rp-1h
[retention] 2016/04/22 23:56:46 shard ID 615 from database db1, retention policy rp-1h, deleted
[retention] 2016/04/22 23:56:47 shard ID 614 from database db1, retention policy rp-1h, deleted

Number of series

time                             numSeries
2016-04-22T23:56:00              200760
2016-04-22T23:57:00              200950

If I restart the system it will immediately drop tens of thousands of series, even though the retention policy check has just completed.

After restart:

time                             numSeries
2016-04-22T23:58:45              104727

Expected behavior:
The retention policy should drop shards older than 1 hour every 5 minutes, which should reduce the number of series accordingly and leave the series count relatively stable over a long duration.

Actual behavior:
The retention policy runs but the number of series remains unchanged, eventually overloading the system. Restarting the system reduces the number of series to the expected level.

Here is a Chronograf chart showing the unexpected rise in the number of series:

graph

(Retention policy check runs every 5 minutes)

This was generated using the following query:

SELECT MAX(numSeries) AS numSeries FROM "database" WHERE "database" = 'db1' AND tmpltime() GROUP BY time(1m)
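
As a rough cross-check of what the index itself holds (independent of these stats, which presumably come from the _internal monitoring database), the series for the measurement can be listed directly from the influx CLI; the output is large with this many series, so it is mainly useful for spot checks:

USE db1
SHOW SERIES FROM "connected-devices"
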
@sofixa

sofixa commented Apr 25, 2016

+1 to this.
On 0.11, the retention policy doesn't seem to have any effect when applied post factum (with already existing data).

I have

[retention] 2016/04/25 10:26:25 retention policy shard deletion check commencing

entries in the logs, but the number of series in the database continues to mount (it was ~6 million when I added the retention policy last Tuesday; now it's over 8 million).

I haven't tried restarting, because I am scared it will never start up properly.

@cnelissen
Author

You are having a different issue, @sofixa. A new retention policy does not apply retroactively to existing data; it only applies to new data that is expressly written to it. Otherwise you have to set the new retention policy as the default and then migrate the existing data into it.

The structural order of InfluxDB is something like:

    Database
        -> Retention Policy
            -> Measurement
                -> Series

Measurements are contained within a retention policy, which is also apparent in how queries are actually performed. The fully qualified form of a SELECT query is:

SELECT * FROM "my_db"."my_retention_policy"."my_measurement_name" WHERE ...
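
For the migration path, a rough sketch in InfluxQL, assuming the existing points live in db1's default retention policy (named "default" in 0.x) under a "connected-devices" measurement; GROUP BY * keeps the tags as tags instead of flattening them into fields:

SELECT * INTO "db1"."rp-1h"."connected-devices" FROM "db1"."default"."connected-devices" GROUP BY *

New points then need to be written with the retention policy specified explicitly (for example via the rp parameter on the HTTP /write endpoint), or the new policy has to be made the default with ALTER RETENTION POLICY.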

@cnelissen
Author

I let the system run over the weekend, and as of this morning the number of series had crept up to over 500K. Immediately following a restart, the number of series is down to around 200K.

graph

jwilder added a commit that referenced this issue Apr 27, 2016
When a shard is closed and removed due to retention policy enforcement,
the series contained in the shard would still exist in the index, causing
a memory leak. Restarting the server would cause them not to be loaded.

Fixes #6457
@jwilder jwilder added this to the 0.13.0 milestone Apr 27, 2016
@jwilder jwilder self-assigned this Apr 28, 2016
jwilder added a commit that referenced this issue Apr 28, 2016
When a shard is closed and removed due to retention policy enforcement,
the series contained in the shard would still exist in the index, causing
a memory leak. Restarting the server would cause them not to be loaded.

Fixes #6457