How to select only keys without values? #68
Comments
Hi @florish, there is only one difficulty with it: normally, deleted entries are removed from the [...] If I add a [...]
Ah! That's good to know.

Meanwhile, I've implemented a quite naive (and non-streaming) keys listing myself, which is pretty fast (testing with a database with ± 20K records), but certainly does not account for the tombstone issue.

Not sure if it helps, but here's the code (just replace [...]):

    @doc "Lists all keys in database without loading any values."
    @spec keys() :: [any()]
    def keys() do
      CubDB.with_snapshot(__MODULE__, fn %{btree: btree} ->
        {btree.root, btree.store}
        |> collect_keys([])
        |> Enum.reverse()
      end)
    end

    # Branch node: load each child node and recurse into it.
    defp collect_keys({{:b, children}, store}, acc) do
      children
      |> Enum.reduce(acc, fn {_key, loc}, acc ->
        child = CubDB.Store.get_node(store, loc)
        collect_keys({child, store}, acc)
      end)
    end

    # Leaf node: collect the keys only; the value locations are never read.
    defp collect_keys({{:l, children}, _store}, acc) do
      children
      |> Enum.reduce(acc, fn {key, _loc}, acc ->
        [key | acc]
      end)
    end

Maybe there's a way to avoid having to check (and load) the value when all you need to know is whether the record is a tombstone or not?
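To make the tombstone cost concrete, here is a minimal sketch on a toy in-memory tree rather than on CubDB itself. Everything in it is an assumption made for illustration: the `KeysSketch` module, the plain map used as a store, and the `{:v, value}` / `:d` node shapes for stored values and tombstones. It shows why a tombstone-aware key listing has to read every value node when the deletion marker lives in the value.

```elixir
defmodule KeysSketch do
  # Toy model only. `store` is a plain map from location to node; the node
  # shapes below are assumptions for this sketch, not a documented format:
  #   {:b, [{key, loc}]}  branch node pointing at child nodes
  #   {:l, [{key, loc}]}  leaf node pointing at value nodes
  #   {:v, term}          stored value
  #   :d                  tombstone left behind by a delete

  def live_keys(store, root_loc) do
    store
    |> collect(Map.fetch!(store, root_loc), [])
    |> Enum.reverse()
  end

  # Branch node: recurse into every child.
  defp collect(store, {:b, children}, acc) do
    Enum.reduce(children, acc, fn {_key, loc}, acc ->
      collect(store, Map.fetch!(store, loc), acc)
    end)
  end

  # Leaf node: the value node must be read just to learn whether the entry
  # is a tombstone, which is the expensive part for large values.
  defp collect(store, {:l, children}, acc) do
    Enum.reduce(children, acc, fn {key, loc}, acc ->
      case Map.fetch!(store, loc) do
        :d -> acc
        {:v, _value} -> [key | acc]
      end
    end)
  end
end

store = %{
  0 => {:b, [{"a", 1}, {"c", 2}]},
  1 => {:l, [{"a", 10}, {"b", 11}]},
  2 => {:l, [{"c", 12}, {"d", 13}]},
  10 => {:v, "value a"},
  11 => :d,
  12 => {:v, "value c"},
  13 => {:v, "value d"}
}

IO.inspect(KeysSketch.live_keys(store, 0))
# prints ["a", "c", "d"]
```

The key `"b"` is skipped, but only after its value node has been fetched; with multi-megabyte values that fetch is what a keys-only listing would want to avoid.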
One solution would be to mark entries as deleted in the branch node (for example by setting the location to -1) instead of using a special [...]

Assuming that such a strategy is what I end up implementing, one option would be to release it as part of a major release, and document that upon upgrading one has to run at least one compaction to upgrade the data format. Possibly, an intermediate version could support both formats, to facilitate upgrading.
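The idea of marking deletions through the location can be sketched on a toy in-memory tree. This is only an illustration under assumptions: the `MarkedKeysSketch` module, the map-based store, and the choice to put the `-1` marker in the node entry that points at the value are all hypothetical, not CubDB's implementation. The point is that deleted entries can then be skipped without reading any value node.

```elixir
defmodule MarkedKeysSketch do
  # Hypothetical convention for this sketch: an entry whose location is -1
  # is deleted, so no value node needs to be read to detect it.
  @deleted_loc -1

  def live_keys(store, root_loc) do
    store
    |> collect(Map.fetch!(store, root_loc), [])
    |> Enum.reverse()
  end

  # Branch node: recurse into every child node.
  defp collect(store, {:b, children}, acc) do
    Enum.reduce(children, acc, fn {_key, loc}, acc ->
      collect(store, Map.fetch!(store, loc), acc)
    end)
  end

  # Leaf node: decide from the location alone; the store is not consulted.
  defp collect(_store, {:l, children}, acc) do
    Enum.reduce(children, acc, fn
      {_key, @deleted_loc}, acc -> acc
      {key, _loc}, acc -> [key | acc]
    end)
  end
end

# The value nodes are deliberately absent from this toy store: if the
# traversal tried to read one, Map.fetch!/2 would raise.
store = %{
  0 => {:b, [{"a", 1}, {"c", 2}]},
  1 => {:l, [{"a", 10}, {"b", -1}]},
  2 => {:l, [{"c", 12}, {"d", 13}]}
}

IO.inspect(MarkedKeysSketch.live_keys(store, 0))
# prints ["a", "c", "d"]
```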
Sounds good! Maybe adding something like [...]

Just let me know if you want me to test some ideas on real-world data!
Yes, definitely. I want to introduce a header to the file anyway, which would allow for more customization (format versioning, and possibly options like page size and Btree fan-out, which are fixed at the moment).
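As a rough illustration of what such a header could carry, here is a sketch using Elixir binary pattern matching. The layout is entirely hypothetical (the `"SKCH"` magic bytes, the field widths, and the `HeaderSketch` module are invented for this example and are not CubDB's file format). Data that does not start with the magic bytes is reported as `:legacy`, which is one way an intermediate version could tell the old and new formats apart.

```elixir
defmodule HeaderSketch do
  # Hypothetical 12-byte layout, invented for this sketch:
  #   4 bytes magic "SKCH", 2 bytes format version,
  #   4 bytes page size, 2 bytes Btree fan-out (all big-endian unsigned)

  def encode(version, page_size, fan_out) do
    <<"SKCH", version::16, page_size::32, fan_out::16>>
  end

  # Data starting with the magic bytes is parsed as a header.
  def decode(<<"SKCH", version::16, page_size::32, fan_out::16, _rest::binary>>) do
    {:ok, %{version: version, page_size: page_size, fan_out: fan_out}}
  end

  # Anything else is treated as the old, header-less format.
  def decode(_other), do: :legacy
end

file_start = HeaderSketch.encode(2, 1024, 32) <> "rest of the file"

IO.inspect(HeaderSketch.decode(file_start))
IO.inspect(HeaderSketch.decode("old data"))
```

The values `2`, `1024`, and `32` are arbitrary sample inputs, not defaults taken from the project.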
Hi @lucaong, let me start by saying that this is not a bug, just a question which could turn into a feature request.

As mentioned in issue #67, I'm working with large amounts of data in `CubDB`. Sometimes for a single `key`, a `value` can have a size of 10 megabytes or more. (Whether or not this is a good idea in itself is a good question, but out of scope for this issue ;) )

In order to do a periodic cleanup of stale data, what I want to do is list all `keys` currently present in my `CubDB` database. I noticed that this is taking quite long (a couple of seconds at least) even for a database with just 30 records (each 10MB+ in size).

My conclusion is that the reason for this is that `CubDB` has no way of listing only the `key` part of a record; the `value` is always loaded, too.

I've been hacking around a bit and noticed that the `%CubDB.Btree{}` struct does seem to have the keys present. Example with some random UUID keys: [...]

While I can take this data out of a `Btree` struct myself, it feels a bit hackish to use this internal data structure just to get a list of keys without having to load hundreds of megabytes of data into memory.

Is it correct that there is currently no public API (e.g. `CubDB.keys/2`, or `CubDB.keys/3` for `:min_key` and `:max_key` support) available to list only the record keys without loading all values?