Delete A Range Of Keys
Note: the techniques described on this page are obsolete. For users of RocksDB 5.18+, the native DeleteRange() function is a better alternative for all known use cases.
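For reference, a minimal sketch of the native call, assuming an already-open rocksdb::DB* and illustrative key bounds:

```cpp
#include "rocksdb/db.h"

// Minimal sketch: delete all keys in [begin, end) with the native
// DeleteRange(). `db` is assumed to be an open rocksdb::DB*; the key
// bounds are illustrative.
rocksdb::Status DeleteUserRange(rocksdb::DB* db) {
  rocksdb::Slice begin("user_42");  // inclusive start of the range
  rocksdb::Slice end("user_43");    // exclusive end of the range
  // A single atomic write; no per-key tombstones are needed.
  return db->DeleteRange(rocksdb::WriteOptions(),
                         db->DefaultColumnFamily(), begin, end);
}
```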
In many cases, people want to drop a range of keys. For example, MyRocks encodes the rows of a table by prefixing keys with the table ID, so that dropping a table means dropping all keys with that prefix. For another example, if we store different attributes of a user under keys of the format [user_id][attribute_id], then when a user deletes their account, we need to delete all keys prefixed with [user_id].
The standard way of deleting these keys is to iterate through all of them and issue Delete() for each one (a minimal sketch follows the list below). This approach works when the number of keys to delete is small. However, it has two potential drawbacks:
- the space occupied by the data is not reclaimed immediately; we have to wait for compaction to clean up the data. This is usually a concern when the range to delete takes up a significant share of the database's space.
- the chunk of tombstones may slow down iterators.
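For illustration, a minimal sketch of the scan-and-delete loop described above; the function name and prefix handling are assumptions for the example:

```cpp
#include <memory>

#include "rocksdb/db.h"

// Sketch: delete every key with a given prefix, one Delete() at a time.
// `db` is assumed to be an open rocksdb::DB*; the prefix is illustrative,
// e.g. the [user_id] part of a [user_id][attribute_id] key.
void DeletePrefix(rocksdb::DB* db, const rocksdb::Slice& prefix) {
  std::unique_ptr<rocksdb::Iterator> it(
      db->NewIterator(rocksdb::ReadOptions()));
  for (it->Seek(prefix); it->Valid() && it->key().starts_with(prefix);
       it->Next()) {
    db->Delete(rocksdb::WriteOptions(), it->key());  // one tombstone per key
  }
}
```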
There are two other ways to delete keys in a range:
The first way is to issue DeleteFilesInRange() on the range. This call removes all SST files whose keys fall entirely within the range to delete. For a large range, it immediately reclaims most of the space, so it is a good solution to problem 1. One caveat is that, after the operation, some keys in the range may still exist in the database; if you want to remove all of them, you should follow up with other operations, which can be done at a slower pace. Also be aware that DeleteFilesInRange() removes files regardless of existing snapshots, so you should no longer expect to read data from the range through existing snapshots.
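A minimal sketch of this call, assuming an open rocksdb::DB* and illustrative key bounds; DeleteFilesInRange() is declared in rocksdb/convenience.h:

```cpp
#include "rocksdb/convenience.h"
#include "rocksdb/db.h"

// Sketch: reclaim most of the space for [begin, end) up front. The key
// bounds are illustrative; `db` is an open rocksdb::DB*.
rocksdb::Status DropTableFiles(rocksdb::DB* db) {
  rocksdb::Slice begin("table_7");
  rocksdb::Slice end("table_8");
  // Deletes only SST files that lie entirely inside the range; keys that
  // share a file with data outside the range survive and need follow-up.
  return rocksdb::DeleteFilesInRange(db, db->DefaultColumnFamily(),
                                     &begin, &end);
}
```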
Another way is to apply a compaction filter together with CompactRange(). You can write a compaction filter that filters out keys in the deleted range; when you want to delete the range, call CompactRange() over it. When the compaction finishes, the keys will have been dropped. We recommend overriding CompactionFilter::IgnoreSnapshots() to return true to make sure keys are dropped even if you have outstanding snapshots; otherwise, you may not be able to fully remove all keys in the range from the system. This approach also reclaims space, but it incurs more I/O than DeleteFilesInRange(). On the other hand, DeleteFilesInRange() alone cannot remove all the data in the range, so a better recipe is to first apply DeleteFilesInRange() and then issue CompactRange() with the compaction filter.
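A sketch of such a filter, with an illustrative class name and range bounds; the filter instance is assumed to be installed via options.compaction_filter when the database is opened:

```cpp
#include <string>

#include "rocksdb/compaction_filter.h"
#include "rocksdb/db.h"

// Sketch: a compaction filter that drops keys in a deleted range.
class RangeDeleteFilter : public rocksdb::CompactionFilter {
 public:
  RangeDeleteFilter(std::string begin, std::string end)
      : begin_(std::move(begin)), end_(std::move(end)) {}

  bool Filter(int /*level*/, const rocksdb::Slice& key,
              const rocksdb::Slice& /*value*/, std::string* /*new_value*/,
              bool* /*value_changed*/) const override {
    // Returning true drops the key during compaction.
    return key.compare(begin_) >= 0 && key.compare(end_) < 0;
  }

  // Drop keys even when outstanding snapshots exist (see the caveat above).
  bool IgnoreSnapshots() const override { return true; }

  const char* Name() const override { return "RangeDeleteFilter"; }

 private:
  std::string begin_, end_;
};

// With options.compaction_filter pointing at a RangeDeleteFilter when the
// DB was opened, force a compaction over the range:
rocksdb::Status CompactDeletedRange(rocksdb::DB* db) {
  rocksdb::Slice begin("table_7"), end("table_8");
  return db->CompactRange(rocksdb::CompactRangeOptions(), &begin, &end);
}
```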
Problem 2 is harder to solve. One way is to apply DeleteFilesInRange() + CompactRange() so that all keys and tombstones in the range are dropped. This works for large ranges, but it is too expensive if we frequently drop small ranges: DeleteFilesInRange() is unlikely to delete any file, and CompactRange() will rewrite far more data than needed because it must compact every SST file containing any key in the deletion range. For the use cases where CompactRange() is too expensive, there are still two ways to reduce the harm (a configuration sketch for the second follows the list):
- if you never overwrite existing keys, you can use DB::SingleDelete() instead of Delete() to kill tombstones sooner. A SingleDelete tombstone is dropped as soon as it meets the original key, rather than being carried down to the last level.
- use NewCompactOnDeletionCollectorFactory() to speed up compaction when there are chunks of tombstones.
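A configuration sketch for the collector factory; the window size and trigger are illustrative values, not tuned recommendations:

```cpp
#include "rocksdb/options.h"
#include "rocksdb/utilities/table_properties_collectors.h"

// Sketch: flag an SST file for compaction when a sliding window of
// entries contains enough tombstones.
rocksdb::Options MakeOptions() {
  rocksdb::Options options;
  const size_t window_size = 10000;     // entries per sliding window
  const size_t num_dels_trigger = 500;  // deletions that trigger compaction
  options.table_properties_collector_factories.emplace_back(
      rocksdb::NewCompactOnDeletionCollectorFactory(window_size,
                                                    num_dels_trigger));
  return options;
}
```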