Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add support for concatenated field type #4773

Merged
merged 13 commits into from
Apr 12, 2024
Merged

Conversation

trinity-1686a
Copy link
Contributor

Description

fix #4548
implement concatenated field type: a field which search inside other fields, in a more efficient way than querying each sub-field independently. Currently doesn't support having date and ip sub-fields. These can be added later, but may require some trickery on the search path to get working properly.

How was this PR tested?

added integration test testing we use the right tokenizer, and are able to search for supported types (int, bool, text, things inside json field, and in dynamic field)

Concatenate fields don't support fast fields, and are never stored. They uses their own tokenizer, independantly of the
tokenizer configured on the individual fields.
At query time, concatenate fields don't support range queries.
Only the following types are supported inside a concatenate field: text, bool, i64, u64, json. Other types are reject
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Only the following types are supported inside a concatenate field: text, bool, i64, u64, json. Other types are reject
Only the following types are supported inside a concatenate field: text, bool, i64, u64, json. Other types are rejected


for value in json_obj_values {
for concate_dynamic_field in &self.concate_dynamic_fields {
document.add_field_value(*concate_dynamic_field, value.clone());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you know a simple trick to remove the last clone?

I haven't found anything simple in the library.
It feels sad if we have a single concate_dynamic_fields (the most common case)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't find a good way to do that, so I created a custom iterator which does the trick (ZipCloneable). Let me know if you think that's better or not

text: val.clone(),
tokens: vec![Token {
offset_from: 0,
offset_to: 0,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is wrong.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i've set the offset_to to 1, is that alright, or should that be text.len()?

@@ -54,6 +55,124 @@ pub enum LeafType {
Text(QuickwitTextOptions),
}

fn value_to_pretokenized<T: ToString>(val: T) -> PreTokenizedString {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a discussion somewhere about why we are going for the string route? I think we discussed it but forgot the rationale.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we had that discussion. I based my work on

For numeric types or dates, we should coerce them to string.

(from the ticket).
The rational I see would be that if we want to do point search (not range &co), casting everything to string is the simplest way, and work rather well. Maybe we could leverange json field instead, but I don't think there is much benefit, and you end up doing twice as many requests to storage when you don't know if a term is an integer or a string.

@trinity-1686a trinity-1686a requested a review from fulmicoton April 3, 2024 15:19
Copy link
Collaborator

@fulmicoton fulmicoton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add unit tests.

@fulmicoton fulmicoton requested review from PSeitz and fulmicoton and removed request for fulmicoton April 5, 2024 07:38
@fulmicoton
Copy link
Collaborator

fulmicoton commented Apr 5, 2024

@trinity-1686a I don't have more comments apart from the need for unit test. integration tests are not a replacement for UT.
Cool! This is quite an important feature for 0.9! I look forward to it being shipped

@PSeitz can you have a look at the CR? If you see optimization hanging fruit, and they can be shipped separately, I think we can revisit that later, but I want you to be aware of the addition of this PR.

@trinity-1686a trinity-1686a requested a review from fulmicoton April 7, 2024 14:20
record: Option<IndexRecordOption>,
) -> anyhow::Result<Self> {
let text_index_options_opt = Self::from_parts_text(true, tokenizer, record, false)?;
let text_index_options = text_index_options_opt.expect("concatenate field must be indexed");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be triggered by user input?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Self::from_parts_text the line above will always return Ok(Some(_)) if it's first parameter is true

for sub_field in &options.concatenate_fields {
for matched_field in
mapping_node
.find_field_mapping_leaf(sub_field)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens for json field?

If I have a json field called attributes and I concat
attributes.color, is this an error, or is not an error but working in strange ways?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the spec is acceptable, can we have a unit test for this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right now, it's an error at index creation time. In the future, I hope it may become possible to implement. I've added a commented-out test

Copy link
Collaborator

@fulmicoton fulmicoton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see minor comments

@trinity-1686a trinity-1686a enabled auto-merge (squash) April 12, 2024 15:15
@fmassot fmassot dismissed fulmicoton’s stale review April 12, 2024 15:16

Ok if fmassot approves.

@trinity-1686a trinity-1686a merged commit 144074d into main Apr 12, 2024
4 checks passed
@trinity-1686a trinity-1686a deleted the trinity/field-concat branch April 12, 2024 15:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Field concatenation
4 participants