-
Notifications
You must be signed in to change notification settings - Fork 25k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add ability to index prefixes on text fields #28290
Add ability to index prefixes on text fields #28290
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know I was the one who suggested it, but I'm still wondering whether we should go with the double dot approach or just ${field}._prefix
like you did in the first PR. The fact that you now return the prefix mapper in iterator()
should make mapping updates fail if a user tries to add a multi-field with prefix
as a name since there would be two fields with the same name. Maybe this is a better approach? cc @rjernst @jimczi
public Iterator<Mapper> iterator() { | ||
if (prefixFieldMapper == null) | ||
return super.iterator(); | ||
return Iterators.concat(multiFields.iterator(), Collections.singleton(prefixFieldMapper).iterator()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe replace multiFields.iterator()
with super.iterator()
to be more future-proof in case the super impl is updated in the future?
} else if (propName.equals("index_prefix")) { | ||
Map<?, ?> indexPrefix = (Map<?, ?>) propNode; | ||
int minChars = XContentMapValues.nodeIntegerValue(indexPrefix.remove("min_chars"), 0); | ||
int maxChars = XContentMapValues.nodeIntegerValue(indexPrefix.remove("max_chars"), 10); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this got me wondering whether we should call it min_gram and max_gram like the option on the (edge) ngram tokenizer/filter
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've gone back and forth on this a bit. It would be more consistent, but I think it doesn't make as much sense outside of the context of the ngram tokenizer - if you're reading { "index_prefix" : { "min_gram" : 0 } }
it's not obvious what a 'gram' actually is?
@@ -113,6 +125,12 @@ public Builder fielddataFrequencyFilter(double minFreq, double maxFreq, int minS | |||
return builder; | |||
} | |||
|
|||
public Builder indexPrefixes(int minChars, int maxChars) { | |||
this.prefixFieldType = new PrefixFieldType(name() + "..prefix", minChars, maxChars); | |||
fieldType().setPrefixFieldType(this.prefixFieldType); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we validate that minChars <= maxChars
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and also that minChars >= 1
?
@@ -161,18 +181,116 @@ public TextFieldMapper build(BuilderContext context) { | |||
builder.fielddataFrequencyFilter(minFrequency, maxFrequency, minSegmentSize); | |||
DocumentMapperParser.checkNoRemainingFields(propName, frequencyFilter, parserContext.indexVersionCreated()); | |||
iterator.remove(); | |||
} else if (propName.equals("index_prefix")) { | |||
Map<?, ?> indexPrefix = (Map<?, ?>) propNode; | |||
int minChars = XContentMapValues.nodeIntegerValue(indexPrefix.remove("min_chars"), 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should it be 1
rather than 0
?
Identical field names are detected at the Json-parser level rather than at the mappings level, unfortunately, so adding a multi-field called |
We are supposed to have validation at the mappings level too for the reason you mentioned, which leverages the |
|
This is true, but merges are the only way to modify mappings, so it should cover all cases. For instance at index creation time, we merge the provided mappings into an empty instance, dynamic introduction of new fields creates mapping updates that are merged before we index documents, etc. |
OK, if I create the DocumentMapper via MapperService.merge() then I can get an IllegalArgumentException if there's a 'prefix' subfield. My only concern here is that it's not entirely obvious to a user why there are multiple definitions of the prefix field - after all, they've only defined it once. |
Yes, I agree we should validate names of multi-fields explicitly, but I still like to have this safety as a fallback. |
Maybe the field should be called |
👍 it would make it clearer where this field comes from if users get an error message about it. Maybe still prefix with an underscore like other fields that Elasticsearch manages by itself, ie. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left some additional comments
if (minChars > maxChars) | ||
throw new IllegalArgumentException("min_chars [" + minChars + "] must be less than max_chars [" + maxChars + "]"); | ||
if (minChars < 1) | ||
throw new IllegalArgumentException("min_chars [" + minChars + "] must be greater than zero"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure that we should let the user defines min and max or we should at least have a limit on maxChars ?
We could also have sane default values (2
to 5
should be good enough for 99% of the case ?).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't been able to find any particularly useful stats on prefix distributions. I'm guessing that it will be different for different languages, and depends a bit on whether the analysis chain decompounds things like Danish article prefixes. 2 to 5 sounds like a reasonable default though.
Maybe a max limit of something like 20? I've done this elsewhere, where we had base64 encoded images stuffed into a text field and the ngram filter spent about an hour trying to tokenize it.
For the minimum, I can see arguments for both 1 and 2 chars, so I think it's reasonable to allow configuration.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe a max limit of something like 20
+1
2 to 5 sounds like a reasonable default though.
+1 too, changing these values should be considered as advanced (expert usage) ?
final int maxChars; | ||
|
||
PrefixFieldType(String name, int minChars, int maxChars) { | ||
setTokenized(true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can deactivate norms, positions and frequencies ?
if (prefixFieldType == null || prefixFieldType.accept(value.length()) == false) { | ||
return super.prefixQuery(value, method, context); | ||
} | ||
return prefixFieldType.termQuery(value, context); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we use constant score as well if the rewrite method is null or constant_score ?
q: shor* | ||
df: text | ||
|
||
- match: {hits.total: 1} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe check the score too?
return "prefix"; | ||
} | ||
} | ||
|
||
public static final class TextFieldType extends StringFieldType { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you will need to implement checkfieldtype so that users get an error if they try to update the index_prefix
settings
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And add the modifier in TextFieldTypeTests.
return new TextFieldMapper( | ||
name, fieldType, defaultFieldType, positionIncrementGap, | ||
name, fieldType, defaultFieldType, positionIncrementGap, prefixMapper, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we should fail if the field is not indexed but prefixes are indexed?
@Override | ||
public Iterator<Mapper> iterator() { | ||
if (prefixFieldMapper == null) | ||
return super.iterator(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we usually use brackets even for single-line if/else statements
Writing the docs has made me query the configuration format - passing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left some nit-picks but it looks great to me overall. Something that I'm not a fan of is the fact that we accept true/false in addition to an object. It requires more testing and will need more work if we ever change the way that this option is exposed if we want to maintain bw compat. Also, it doesn't make it much shorter in my opinion since you can also do index_prefix: {}
to do the same thing as index_prefix: true
?
throw new IllegalArgumentException("min_chars [" + minChars + "] must be less than max_chars [" + maxChars + "]"); | ||
if (minChars < 1) | ||
throw new IllegalArgumentException("min_chars [" + minChars + "] must be greater than zero"); | ||
if (maxChars >= 20) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please add brackets
: new PrefixFieldMapper(prefixFieldType.setAnalyzer(fieldType.indexAnalyzer()), context.indexSettings()); | ||
if (prefixFieldType != null && fieldType().isSearchable() == false) { | ||
throw new IllegalArgumentException("Cannot set index_prefix on unindexed field [" + name() + "]"); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
move this if statement before the creation of the prefix field mapper?
} | ||
else if (this.prefixFieldMapper != null || mw.prefixFieldMapper != null) { | ||
throw new IllegalArgumentException("mapper [" + name() + "] has different index_prefix settings, current [" | ||
+ this.prefixFieldMapper + "], merged [" + mw.prefixFieldMapper + "]"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does prefixFieldMapper have a nice toString?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have added one, plus a line to the test
I removed the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left minor comments but the change looks good to me overall. Please make sure PR CI is green before merging.
@@ -161,18 +196,143 @@ public TextFieldMapper build(BuilderContext context) { | |||
builder.fielddataFrequencyFilter(minFrequency, maxFrequency, minSegmentSize); | |||
DocumentMapperParser.checkNoRemainingFields(propName, frequencyFilter, parserContext.indexVersionCreated()); | |||
iterator.remove(); | |||
} else if (propName.equals("index_prefix")) { | |||
Map<?, ?> indexPrefix = (Map<?, ?>) propNode; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe validate that propNode is indeed not null and a map
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The block above (for fielddataFrequencyFilter) does no checking here. I think if somebody passes in null
explicitly, then returning a NullPointerException is fair enough?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might agree if this was a library but there have been several requests that we produce error messages that are as meaningful as possible so I think it's better to check explicitly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just added a test for this, and it turns out it's already dealt with by TypeParsers.parseTextField()
which checks that none of the passed-in parameters are null, so I think we're OK.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perfect. Thanks for checking!
-------------------------------- | ||
// CONSOLE | ||
<1> `min_chars` must be greater than zero, defaults to 2 | ||
<2> `max_chars` must be greater than `min_chars` and less than 20, defaults to 5 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/greater than/greater than or equal to/
I think?
|
||
To configure the index prefix field, use the following syntax. Either or both | ||
of `min_chars` and `max_chars` may be excluded. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should re-explain what it does here and also make it clear that both min_chars and max_chars are inclusive?
This adds the ability to index term prefixes into a hidden subfield, enabling prefix queries to be run without multitermquery rewrites. The subfield reuses the analysis chain of its parent text field, appending an EdgeNGramTokenFilter. It can be configured with minimum and maximum ngram lengths. Query terms with lengths outside this min-max range fall back to using prefix queries against the parent text field. The mapping looks like this: "my_text_field" : { "type" : "text", "analyzer" : "english", "index_prefix" : { "min_chars" : 1, "max_chars" : 10 } } Relates to #27049
This adds the ability to index term prefixes into a hidden subfield, enabling prefix queries to be run without multitermquery rewrites. The subfield reuses the analysis chain of its parent text field, appending an EdgeNGramTokenFilter. It can be configured with minimum and maximum ngram lengths. Query terms with lengths outside this min-max range fall back to using prefix queries against the parent text field.
The mapping looks like this:
"my_text_field" : {
"type" : "text",
"analyzer" : "english",
"index_prefix" : { "min_chars" : 1, "max_chars" : 10 }
}
This implementation uses a dedicated FieldType and FieldMapper within TextFieldMapper
Supersedes #28222