TSM files not closed when shard deleted #9786

Closed
jacobmarble opened this issue Apr 28, 2018 · 2 comments · Fixed by #9792 or #9866

@jacobmarble
Member

InfluxDB current master (28 Apr 2018), running on current macOS.

The service removes, but does not close, TSM files when a shard is dropped while the TSM files are in use by a query.

  • start influxdb
  • create a 1h retention policy
  • write hundreds of megabytes to the new RP
  • confirm that select sum(foo) from m takes a long time to complete
  • run that query in an infinite loop from 10 goroutines
  • drop shard X
  • shard directory and contained files disappear from filesystem as viewed via ls
  • lsof -p PID shows shard .tsm files still open
  • stop query load
  • lsof still shows .tsm files open
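
The gap between ls and lsof in the steps above matches ordinary Unix behaviour: removing a path only unlinks the directory entry, and the file stays alive while any descriptor is open. The sketch below is a minimal illustration of that, not InfluxDB code. The helper name readAfterRemove and the file name example.tsm are made up for the example, and it assumes a Unix-like OS such as macOS or Linux (on Windows, removing an open file typically fails instead).

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// readAfterRemove writes a small file, opens it, removes it from its
// directory, and then reads it through the still-open handle. It reports
// whether the path is still visible and what was read.
// Hypothetical helper for illustration only.
func readAfterRemove() (bool, string, error) {
	dir, err := os.MkdirTemp("", "shard")
	if err != nil {
		return false, "", err
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "example.tsm") // hypothetical file name
	if err := os.WriteFile(path, []byte("payload"), 0o600); err != nil {
		return false, "", err
	}

	f, err := os.Open(path)
	if err != nil {
		return false, "", err
	}
	// Without this Close the descriptor stays open after the path is gone,
	// which is the state lsof reports in the issue.
	defer f.Close()

	// Unlink the directory entry while the descriptor is still open.
	if err := os.Remove(path); err != nil {
		return false, "", err
	}
	_, statErr := os.Stat(path)
	visible := statErr == nil

	buf := make([]byte, 16)
	n, err := f.Read(buf)
	if err != nil {
		return visible, "", err
	}
	return visible, string(buf[:n]), nil
}

func main() {
	visible, data, err := readAfterRemove()
	fmt.Println("path visible:", visible, "data:", data, "err:", err)
}
```

On a Unix-like system the path should no longer be visible while the read still succeeds, so the disk space is only released once the descriptor is closed.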
@jacobmarble
Member Author

jacobmarble commented Apr 28, 2018

Found three ways that TSM files are removed: delete shard, retention policy, and compaction.

This bug affects delete shard and retention policy, which both remove the entire shard.

Compaction doesn't remove the entire shard; it operates within a shard. When compaction converts n old TSM files into 1 new TSM file, it tries to close the old files, and adds any currently-referenced TSM file to a purger, which closes and removes the file later.
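
To make the purger idea concrete, here is a rough sketch of a reference-counted file plus a purger that only closes files no query is holding. It is not the tsm1 implementation: trackedFile, purger, newPurger, and closeIfIdle are hypothetical names, and closing is modelled with a boolean flag rather than a real descriptor.

```go
package main

import (
	"fmt"
	"sync"
)

// trackedFile stands in for an open TSM file. Hypothetical type.
type trackedFile struct {
	mu     sync.Mutex
	name   string
	refs   int  // number of queries currently reading the file
	closed bool // stands in for "descriptor closed and path removed"
}

// Ref records that a query has started using the file.
func (f *trackedFile) Ref() {
	f.mu.Lock()
	f.refs++
	f.mu.Unlock()
}

// Unref records that a query has finished with the file.
func (f *trackedFile) Unref() {
	f.mu.Lock()
	f.refs--
	f.mu.Unlock()
}

// closeIfIdle marks the file closed only when no query holds a reference.
func (f *trackedFile) closeIfIdle() bool {
	f.mu.Lock()
	defer f.mu.Unlock()
	if f.refs > 0 {
		return false
	}
	f.closed = true // real code would close the descriptor and remove the path here
	return true
}

// purger holds files that could not be closed at removal time.
type purger struct {
	mu      sync.Mutex
	pending map[string]*trackedFile
}

func newPurger() *purger {
	return &purger{pending: make(map[string]*trackedFile)}
}

// add queues files for a later close-and-remove attempt.
func (p *purger) add(files ...*trackedFile) {
	p.mu.Lock()
	defer p.mu.Unlock()
	for _, f := range files {
		p.pending[f.name] = f
	}
}

// purge closes every pending file that is idle and returns how many
// files are still queued afterwards.
func (p *purger) purge() int {
	p.mu.Lock()
	defer p.mu.Unlock()
	for name, f := range p.pending {
		if f.closeIfIdle() {
			delete(p.pending, name)
		}
	}
	return len(p.pending)
}

func main() {
	busy := &trackedFile{name: "old-1.tsm"}
	busy.Ref() // a running query holds the file

	p := newPurger()
	p.add(busy)
	fmt.Println("still queued:", p.purge()) // referenced, so it stays queued

	busy.Unref()
	fmt.Println("still queued:", p.purge()) // idle now, so it is closed
}
```

In a real service purge would presumably run on a timer or after each query finishes; the sketch leaves scheduling out.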

That approach could work here, but perhaps should be managed at the Store level. A hung query could cause the open file to stay open until the service is shut down, and Store.Close() can be used to force queued files to be closed and removed.
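
One way to picture the Store-level variant is a queue that is swept normally while the service runs and drained unconditionally on shutdown. The sketch below assumes a simplified store type with hypothetical queuedFile, enqueue, drain, and Sweep names; it only shares the name Close() with the real Store and is not its actual API.

```go
package main

import (
	"fmt"
	"sync"
)

// queuedFile is a hypothetical handle for a TSM file awaiting deletion.
type queuedFile struct {
	path   string
	inUse  func() bool  // reports whether a query still reads the file
	remove func() error // closes the descriptor and removes the path
}

// store is a simplified stand-in for a Store-level delete queue.
type store struct {
	mu    sync.Mutex
	queue []queuedFile
}

// enqueue defers deletion of the given files.
func (s *store) enqueue(files ...queuedFile) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.queue = append(s.queue, files...)
}

// drain removes queued files. With force=false, files still in use stay
// queued; with force=true, every file is removed regardless. It returns
// the number of files left in the queue and the first error seen.
// A file whose remove call fails is dropped from the queue in this sketch.
func (s *store) drain(force bool) (int, error) {
	s.mu.Lock()
	defer s.mu.Unlock()

	var kept []queuedFile
	var firstErr error
	for _, f := range s.queue {
		if !force && f.inUse() {
			kept = append(kept, f)
			continue
		}
		if err := f.remove(); err != nil && firstErr == nil {
			firstErr = fmt.Errorf("remove %s: %w", f.path, err)
		}
	}
	s.queue = kept
	return len(kept), firstErr
}

// Sweep is the normal path: close whatever is no longer referenced.
func (s *store) Sweep() (int, error) {
	return s.drain(false)
}

// Close is the shutdown path: force every queued file closed, even if a
// hung query still claims to be using it.
func (s *store) Close() error {
	_, err := s.drain(true)
	return err
}

func main() {
	removed := 0
	s := &store{}
	s.enqueue(queuedFile{
		path:   "hung.tsm",
		inUse:  func() bool { return true }, // simulates a hung query
		remove: func() error { removed++; return nil },
	})

	left, _ := s.Sweep()
	fmt.Println("queued after sweep:", left, "removed:", removed)

	_ = s.Close()
	fmt.Println("queued after close:", len(s.queue), "removed:", removed)
}
```

Keeping the forced path in Close() means a hung query can delay cleanup but cannot outlive the process's own shutdown sequence.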

Not sure whether drop shard X should return a partial error when TSM files are queued rather than actually closed and removed. It would be nice to implement both drop shard and retention policy enforcement by simply queueing all shard deletes.

@jacobmarble
Member Author

@rbetts the bug exists in 1.4.3 and 1.3.9
