[0.12] Retention policy cleanup does not remove series #6457

Closed
cnelissen opened this issue Apr 23, 2016 · 3 comments · Fixed by #6485

@cnelissen

I have a dataset that contains an unbounded number of endpoints that need to be queried, namely guest devices on a transient network (think hotel lobby or coffee shop). I am trying to store a small number of metrics over a very large and constantly changing set of devices. At any given time I have approximately 50,000 unique devices connected to the system, and I add about 500-1,000 unique devices per minute. I need to be able to display metrics for a particular device, so each unique device becomes a unique series, i.e.:

measurement: "connected-devices"
tags: clientId, macAddress, accessPointId, etc
values: downloadBytes, uploadBytes, signalStrength, etc
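
For illustration, a single point in this schema would look roughly like this in line protocol (the tag and field values below are hypothetical):

connected-devices,clientId=abc123,macAddress=aa:bb:cc:dd:ee:ff,accessPointId=ap-07 downloadBytes=123456i,uploadBytes=7890i,signalStrength=-61i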

I have created a 1hr retention policy to store the per-device data so the number of series does not become unwieldy:

CREATE RETENTION POLICY "rp-1h" ON "db1" DURATION 1h REPLICATION 1 SHARD DURATION 2m

I have set up the retention check-interval to run every 5 minutes, so I should be evicting old data fairly regularly and thereby keeping the number of series to a minimum (50,000 initial + (1,000 per minute * 60 minutes) = ~110,000 series total for the retention policy).
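
For reference, a 5-minute check interval corresponds to something like the following in the [retention] section of influxdb.conf (an illustrative sketch; defaults vary by version):

[retention]
  enabled = true
  check-interval = "5m"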

The issue is that the number of series increases linearly until the system runs out of memory and eventually crashes. The retention policy runs as expected but the number of series that the system reports does not decrease along with the removal of the shards.

[retention] 2016/04/22 23:56:45 retention policy shard deletion check commencing
[retention] 2016/04/22 23:56:45 deleted shard group 617 from database db1, retention policy rp-1h
[retention] 2016/04/22 23:56:45 deleted shard group 618 from database db1, retention policy rp-1h
[retention] 2016/04/22 23:56:45 deleted shard group 616 from database db1, retention policy rp-1h
[retention] 2016/04/22 23:56:46 shard ID 615 from database db1, retention policy rp-1h, deleted
[retention] 2016/04/22 23:56:47 shard ID 614 from database db1, retention policy rp-1h, deleted

Number of series

time                             numSeries
2016-04-22T23:56:00              200760
2016-04-22T23:57:00              200950

If I restart the system it will immediately drop tens of thousands of series, even though the retention policy check has just completed.

After restart:

time                             numSeries
2016-04-22T23:58:45              104727

Expected behavior:
The retention policy should drop shards older than 1 hour every 5 minutes, which should reduce the number of series accordingly and leave the series count relatively stable over a long duration.

Actual behavior:
The retention policy runs but the number of series remains unchanged, eventually overloading the system. Restarting the system reduces the number of series to the expected level.

Here is a Chronograf chart showing the unexpected rise in the number of series:

graph

(Retention policy check runs every 5 minutes)

This was generated using the following query:

SELECT MAX(numSeries) AS numSeries FROM "database" WHERE "database" = 'db1' AND tmpltime() GROUP BY time(1m)
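
As a rough cross-check of what the index itself holds (independent of these stats, which presumably come from the _internal monitoring database), the series for the measurement can be listed directly from the influx CLI; the output is large with this many series, so it is mainly useful for spot checks:

USE db1
SHOW SERIES FROM "connected-devices"
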
@sofixa

sofixa commented Apr 25, 2016

+1 to this.
On 0.11, the retention policy doesn't seem to have any effect when applied post factum (with already existing data).

I have

[retention] 2016/04/25 10:26:25 retention policy shard deletion check commencing

entries in the logs, but the number of series in the database continues to mount (it was ~6 million when I added the retention policy last Tuesday; now it's over 8 million).

I haven't tried restarting, because I am scared it will never start up properly.

@cnelissen
Author

You are having a different issue, @sofixa. A new retention policy does not apply retroactively to existing data; it only applies to new data that is expressly written to it. Otherwise you have to set the new retention policy as the default and then migrate the existing data into it.

The structural order of InfluxDB is something like:

    Database
        -> Retention Policy
            -> Measurement
                -> Series

Measurements are contained within a retention policy, which is also apparent in how queries are actually performed. The fully qualified form of a SELECT query is:

SELECT * FROM "my_db"."my_retention_policy"."my_measurement_name" WHERE ...
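
For the migration path, a rough sketch in InfluxQL, assuming the existing points live in db1's default retention policy (named "default" in 0.x) under a "connected-devices" measurement; GROUP BY * keeps the tags as tags instead of flattening them into fields:

SELECT * INTO "db1"."rp-1h"."connected-devices" FROM "db1"."default"."connected-devices" GROUP BY *

New points then need to be written with the retention policy specified explicitly (for example via the rp parameter on the HTTP /write endpoint), or the new policy has to be made the default with ALTER RETENTION POLICY.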

@cnelissen
Author

I let the system run over the weekend, and as of this morning the number of series had crept up to over 500K. Immediately following a restart, the number of series is down to around 200K.

graph

jwilder added a commit that referenced this issue Apr 27, 2016
When a shard is closed and removed due to retention policy enforcement,
the series contained in the shard would still exist in the index, causing
a memory leak. Restarting the server would cause them not to be loaded.

Fixes #6457
@jwilder jwilder added this to the 0.13.0 milestone Apr 27, 2016
@jwilder jwilder self-assigned this Apr 28, 2016
jwilder added a commit that referenced this issue Apr 28, 2016
When a shard is closed and removed due to retention policy enforcement,
the series contained in the shard would still exist in the index, causing
a memory leak. Restarting the server would cause them not to be loaded.

Fixes #6457