Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expected Usage Question #37

Open
ccollie opened this issue Sep 17, 2024 · 13 comments
Open

Expected Usage Question #37

ccollie opened this issue Sep 17, 2024 · 13 comments
Labels
question Further information is requested

Comments

@ccollie
Copy link

ccollie commented Sep 17, 2024

I have a library that needs to index Prometheus like metrics (e.g. call_latency(instance="101", region="us-east1", env="staging"}
Each metric is indexed by name as well as labels (region="us-east1"), with each entry being a roaring bitmap of the metric ids containing the label.

One reason for looking at an ART is because of path compression, since there is great redundancy in label names. The proposed schema is
"${label}_${value}" -> bitmap # all metrics with the key value pair
"${label} -> bitmap # all metrics with a given label, irrespective of value

I've tried to implement this in BLART but it seems like this is not supported as I get Prefix errors when adding "region" then "region=us-east1" (or the inverse).

My assumption was that it functioned as a trie, which means my use case would be possible. Is this a valid assumption or am I using the API incorrectly (try_insert)

@ccollie
Copy link
Author

ccollie commented Sep 17, 2024

p.s. I realize i could use the prefix iterator to gather all metrics with a given label. I also wanted a separate key to account for labels with high cardinality values (which admittedly is not a good idea)

@Gab-Menezes
Copy link
Collaborator

Hey @ccollie, since ART is a prefix tree when you try to insert a key that is prefix of other this will result in a error during insertion. For example "hell" is prefix of "hello", so when inserting there is not enough bytes to distinguish between this two.

If you know your labels are ascii only use a CString (you can convert from String <-> CString), this allows you to use the insert method, since CString have a null byte at the end it's impossible for them to be prefix of each other. If the strings need to be utf-8, the paper recommends using sort strings ucol_getSortKey, unfortunatly icu4x doesn't implement somthing similar to this, so you need to use the C bindings.

@Gab-Menezes
Copy link
Collaborator

The trait tha dictates if a type is safe to not use the try_* methods is NonPrefixBytes. As you can see this trait is implement for types that are sure to not generate prefixes.
If you somehow find a way to encode your utf-8 strings you can use BytesMapping to map your keys. Or create a type and implement AsBytes for this type

@ccollie
Copy link
Author

ccollie commented Sep 17, 2024

Metric names and values are utf-8. I'm hesitant to bring in something like ICU for this though.
What do you think of my idea of using prefix searching to aggregate values ? In that case, keys would be like

region=us-east-1
region=us-west-1
environment=staging | production | test etc

@ccollie
Copy link
Author

ccollie commented Sep 17, 2024

Another thought. I can suffix my keys with a non-utf8 sentinel byte if that makes it function similar to Cstring

@Gab-Menezes
Copy link
Collaborator

Gab-Menezes commented Sep 17, 2024

IDK if you are going to have more keys than this, but it's obvious that this is ascii only so CString would work. But yeah using prefix iterator is a good idea, recently @declanvk reimplemented it and it's even faster, it's not yet published, but it will at some point.
Using a sentinel byte is a good idea and to be type safe I would recommend using BytesMapping.

@Gab-Menezes
Copy link
Collaborator

Gab-Menezes commented Sep 17, 2024

Also in the new version we will have a range iterator, we also have a fuzzy iterator, they may help you with your task. Also happy to see someone giving this crate a shot.
If you are nightly user you can enable the nightly flag and enjoy a massive speed up, also recommend compile with target-cpu=native to use simd

@ccollie
Copy link
Author

ccollie commented Sep 17, 2024

Also in the new version we will have a range iterator, we also have a fuzzy iterator, they may help you with your task. Also happy to see someone giving this crate a shot. If you are nightly user you can enable the nightly flag and enjoy a massive speed up, also recommend compile with taget-cpu=native to use simd

Very nice ! I'm also a nightly user, and since it's a metric library, I'm targeting simd as well.

@ccollie
Copy link
Author

ccollie commented Sep 17, 2024

Last question: is there a cheap way to get a count for entries with a given prefix ? If so I'd like to raise it as a feature request.

@declanvk
Copy link
Owner

Another thought. I can suffix my keys with a non-utf8 sentinel byte if that makes it function similar to Cstring

Yeah typically if you're going to have variable length keys, you're going to need some sort of suffix value that is not present in any other part of the key. I like your idea of using non-UTF8 sentinel byte, or you could append a \0 if you're confident that character will never be present in your inputs.

This does make the string a sorta subset of UTF-8.

Last question: is there a cheap way to get a count for entries with a given prefix ? If so I'd like to raise it as a feature request.

At the moment no, the internal nodes of the trie don't track "number of leaf node descendants" as a field. The check for number of entries would need to essentially count the leaf nodes. Feel free to cut a separate ticket as a feature request (doesn't need to be too detailed), I'd be happy to discuss that feature more. I think it would be a tradeoff between increasing the size of the trie inner nodes and being able to speed up "num children" functions for the various iterators

it's not yet published, but it will at some point.

I'm working on this this week, I'll try to cut a release quickly!

@Gab-Menezes
Copy link
Collaborator

Gab-Menezes commented Sep 17, 2024

@declanvk was faster than me haha, but yeah we don't keep track of this number only at the root, but with the new implementation doing a count at the prefix iter should be way faster, since it would be basically iterating over a linked list

@ccollie
Copy link
Author

ccollie commented Sep 17, 2024

Last question: is there a cheap way to get a count for entries with a given prefix ? If so I'd like to raise it as a feature request.

At the moment no, the internal nodes of the trie don't track "number of leaf node descendants" as a field. The check for number of entries would need to essentially count the leaf nodes. Feel free to cut a separate ticket as a feature request (doesn't need to be too detailed), I'd be happy to discuss that feature more. I think it would be a tradeoff between increasing the size of the trie inner nodes and being able to speed up "num children" functions for the various iterators

I wasn't expecting tracking descendents, only that we iterate nodes and sum its children. I'll put in a request...

@declanvk declanvk added the question Further information is requested label Sep 20, 2024
@declanvk
Copy link
Owner

I released version 0.3.0, let me know what you think! Any other problems or questions are welcome. I was also thinking about some usability improvements on the tree APIs have the where K: Borrow<Q> + AsBytes, Q: AsBytes bounds, if either of you had thoughts about that

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Further information is requested
Projects
None yet
Development

No branches or pull requests

3 participants