Add length of encoding to String indexed columns #188
Comments
I think the second option would be easier to understand.
Thanks for the feedback! Personally, I think the second approach is also better in terms of readability and implementation.
I think that passing the argument through
df.write.format("qbeast")
  .option("columnsToIndex", "column:length_hashing...")
  .option("columnStats", """{ "column_length": 10 }""")
  .save(...)
If the user does not provide
Correct me if I am wrong (@Adricu8), but another feature that could be added to this is the ability to split the column value into sequences of characters of equal length. In this case, we would produce a different dimension for each sequence and query each one independently. For example, for a web domain such as "www.something.com", which is much longer than 11 characters, we can split it into two different groups:
The goal is to implement this behaviour under the hood, but first we need to test whether it works correctly. Perhaps, instead of splitting into groups of equal length, we need to make the partition after each "."? (Or after some special character that users can configure, such as:
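To make the two splitting strategies above easier to compare (groups of equal length versus splitting at a separator), here is a minimal Scala sketch. The object and function names (`HostnameSplitSketch`, `fixedChunks`, `bySeparator`) are hypothetical and are not part of qbeast; the sketch only shows how a value would be cut into pieces, not how each piece would be indexed.

```scala
object HostnameSplitSketch {
  // Hypothetical helper: cut the value into consecutive groups of at most `size` characters.
  // The last group may be shorter than `size`.
  def fixedChunks(value: String, size: Int): List[String] =
    value.grouped(size).toList

  // Hypothetical helper: cut the value at every occurrence of a configurable separator.
  def bySeparator(value: String, separator: Char = '.'): List[String] =
    value.split(separator).toList
}

// Usage with the example from the discussion:
// HostnameSplitSketch.fixedChunks("www.something.com", 11)  // List("www.somethi", "ng.com")
// HostnameSplitSketch.bySeparator("www.something.com")      // List("www", "something", "com")
```

With a group length of 11, the example domain falls into two groups, while splitting at "." yields three groups that follow the structure of the hostname.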
Regarding how to index a string column containing a hostname: we need to design a way to index a hostname column that supports the following use cases:
We want to try two different approaches and compare the performance:
Constraints:
I think we can test option 2) right away, where we split the columns outside of qbeast.
I've created a PR on my own fork to work on that: osopardo1#2. It is supposed to be merged with #186, so we can have a clearer
A couple of comments:
But then, we have to output a positive Double, so we proceed by: Could this affect the ordering?
This issue is no longer relevant for the current String Indexing implementation in #215. Closing it.
With Qbeast it is possible to index String columns with a Hash. This is an easy solution to transform text values into numbers, but it becomes a drawback when ordering the data.
The value of the Murmur3Hash does not respect the alphabetical order of the string, meaning that values x = "a" and y = "b" could have arbitrary hashes like H(x) = 30 and H(y) = 4.
A possible solution is to pre-process the string before indexing, giving it a fixed length that allows converting those bytes into Doubles (11 characters is the maximum that we can fit).
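To illustrate the idea of turning a fixed-length prefix into an order-preserving number, here is a minimal Scala sketch. It is not the qbeast implementation: `StringEncodingSketch` and `encodePrefix` are hypothetical names, the input is assumed to be ASCII, and the bytes are read as a plain base-256 fraction. With that simple scheme only about 6 bytes fit exactly into a Double, so the sketch defaults to 6 and does not reproduce the 11-character packing mentioned above.

```scala
object StringEncodingSketch {
  // Hypothetical helper: map the first `length` bytes of `s` to a Double in [0, 1).
  // Shorter strings are padded with zero bytes, so a prefix sorts before its extensions.
  // Assumption: ASCII input, where byte order matches alphabetical order.
  def encodePrefix(s: String, length: Int = 6): Double = {
    val bytes = s.getBytes("UTF-8").take(length).padTo(length, 0.toByte)
    // Read the bytes as digits of a base-256 fraction: b0/256 + b1/256^2 + ...
    bytes.foldRight(0.0)((b, acc) => ((b & 0xff) + acc) / 256.0)
  }
}

// Usage: sorting by the encoded value keeps alphabetical order,
// unlike sorting by a hash of the string.
// Seq("b", "a", "c").sortBy(s => StringEncodingSketch.encodePrefix(s))  // List("a", "b", "c")
```

Strings that share the same first `length` bytes map to the same value, so the encoding preserves order only up to the chosen prefix length.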
Or we can either:
This new transformer or feature could be specified with the Spark DataFrame API when writing:
Or if we choose to add a Spark option: