Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add ability to index prefixes on text fields #28290

Merged
merged 26 commits into from
Jan 30, 2018
Merged

Add ability to index prefixes on text fields #28290

merged 26 commits into from
Jan 30, 2018

Conversation

romseygeek
Copy link
Contributor

This adds the ability to index term prefixes into a hidden subfield, enabling prefix queries to be run without multitermquery rewrites. The subfield reuses the analysis chain of its parent text field, appending an EdgeNGramTokenFilter. It can be configured with minimum and maximum ngram lengths. Query terms with lengths outside this min-max range fall back to using prefix queries against the parent text field.

The mapping looks like this:

"my_text_field" : {
"type" : "text",
"analyzer" : "english",
"index_prefix" : { "min_chars" : 1, "max_chars" : 10 }
}

This implementation uses a dedicated FieldType and FieldMapper within TextFieldMapper

Supersedes #28222

@romseygeek romseygeek requested review from jimczi and jpountz January 18, 2018 11:00
@romseygeek romseygeek added :Search/Search Search-related issues that do not fall into other categories v7.0.0 v6.3.0 >enhancement labels Jan 18, 2018
Copy link
Contributor

@jpountz jpountz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know I was the one who suggested it, but I'm still wondering whether we should go with the double dot approach or just ${field}._prefix like you did in the first PR. The fact that you now return the prefix mapper in iterator() should make mapping updates fail if a user tries to add a multi-field with prefix as a name since there would be two fields with the same name. Maybe this is a better approach? cc @rjernst @jimczi

public Iterator<Mapper> iterator() {
if (prefixFieldMapper == null)
return super.iterator();
return Iterators.concat(multiFields.iterator(), Collections.singleton(prefixFieldMapper).iterator());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe replace multiFields.iterator() with super.iterator() to be more future-proof in case the super impl is updated in the future?

} else if (propName.equals("index_prefix")) {
Map<?, ?> indexPrefix = (Map<?, ?>) propNode;
int minChars = XContentMapValues.nodeIntegerValue(indexPrefix.remove("min_chars"), 0);
int maxChars = XContentMapValues.nodeIntegerValue(indexPrefix.remove("max_chars"), 10);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this got me wondering whether we should call it min_gram and max_gram like the option on the (edge) ngram tokenizer/filter

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've gone back and forth on this a bit. It would be more consistent, but I think it doesn't make as much sense outside of the context of the ngram tokenizer - if you're reading { "index_prefix" : { "min_gram" : 0 } } it's not obvious what a 'gram' actually is?

@@ -113,6 +125,12 @@ public Builder fielddataFrequencyFilter(double minFreq, double maxFreq, int minS
return builder;
}

public Builder indexPrefixes(int minChars, int maxChars) {
this.prefixFieldType = new PrefixFieldType(name() + "..prefix", minChars, maxChars);
fieldType().setPrefixFieldType(this.prefixFieldType);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we validate that minChars <= maxChars?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and also that minChars >= 1?

@@ -161,18 +181,116 @@ public TextFieldMapper build(BuilderContext context) {
builder.fielddataFrequencyFilter(minFrequency, maxFrequency, minSegmentSize);
DocumentMapperParser.checkNoRemainingFields(propName, frequencyFilter, parserContext.indexVersionCreated());
iterator.remove();
} else if (propName.equals("index_prefix")) {
Map<?, ?> indexPrefix = (Map<?, ?>) propNode;
int minChars = XContentMapValues.nodeIntegerValue(indexPrefix.remove("min_chars"), 0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should it be 1 rather than 0?

@romseygeek
Copy link
Contributor Author

Identical field names are detected at the Json-parser level rather than at the mappings level, unfortunately, so adding a multi-field called prefix silently replaces the internal prefix field.

@jpountz
Copy link
Contributor

jpountz commented Jan 18, 2018

We are supposed to have validation at the mappings level too for the reason you mentioned, which leverages the Mapper.iterator method and makes sure that two fields never declare the same field name, see MapperService.checkFieldUniqueness.

@romseygeek
Copy link
Contributor Author

MapperService.checkFieldUniqueness is only called during a merge, I think? So it's too late for checking that we don't have multiple field definitions within a single DocumentMapper.

@jpountz
Copy link
Contributor

jpountz commented Jan 18, 2018

MapperService.checkFieldUniqueness is only called during a merge

This is true, but merges are the only way to modify mappings, so it should cover all cases. For instance at index creation time, we merge the provided mappings into an empty instance, dynamic introduction of new fields creates mapping updates that are merged before we index documents, etc.

@romseygeek
Copy link
Contributor Author

OK, if I create the DocumentMapper via MapperService.merge() then I can get an IllegalArgumentException if there's a 'prefix' subfield.

My only concern here is that it's not entirely obvious to a user why there are multiple definitions of the prefix field - after all, they've only defined it once.

@jpountz
Copy link
Contributor

jpountz commented Jan 18, 2018

Yes, I agree we should validate names of multi-fields explicitly, but I still like to have this safety as a fallback.

@romseygeek
Copy link
Contributor Author

Maybe the field should be called index_prefix, to match with the DSL field name?

@jpountz
Copy link
Contributor

jpountz commented Jan 18, 2018

👍 it would make it clearer where this field comes from if users get an error message about it. Maybe still prefix with an underscore like other fields that Elasticsearch manages by itself, ie. _index_prefix?

Copy link
Contributor

@jimczi jimczi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left some additional comments

if (minChars > maxChars)
throw new IllegalArgumentException("min_chars [" + minChars + "] must be less than max_chars [" + maxChars + "]");
if (minChars < 1)
throw new IllegalArgumentException("min_chars [" + minChars + "] must be greater than zero");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure that we should let the user defines min and max or we should at least have a limit on maxChars ?
We could also have sane default values (2 to 5 should be good enough for 99% of the case ?).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't been able to find any particularly useful stats on prefix distributions. I'm guessing that it will be different for different languages, and depends a bit on whether the analysis chain decompounds things like Danish article prefixes. 2 to 5 sounds like a reasonable default though.

Maybe a max limit of something like 20? I've done this elsewhere, where we had base64 encoded images stuffed into a text field and the ngram filter spent about an hour trying to tokenize it.

For the minimum, I can see arguments for both 1 and 2 chars, so I think it's reasonable to allow configuration.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe a max limit of something like 20

+1

2 to 5 sounds like a reasonable default though.

+1 too, changing these values should be considered as advanced (expert usage) ?

final int maxChars;

PrefixFieldType(String name, int minChars, int maxChars) {
setTokenized(true);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can deactivate norms, positions and frequencies ?

if (prefixFieldType == null || prefixFieldType.accept(value.length()) == false) {
return super.prefixQuery(value, method, context);
}
return prefixFieldType.termQuery(value, context);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we use constant score as well if the rewrite method is null or constant_score ?

q: shor*
df: text

- match: {hits.total: 1}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe check the score too?

return "prefix";
}
}

public static final class TextFieldType extends StringFieldType {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you will need to implement checkfieldtype so that users get an error if they try to update the index_prefix settings

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And add the modifier in TextFieldTypeTests.

return new TextFieldMapper(
name, fieldType, defaultFieldType, positionIncrementGap,
name, fieldType, defaultFieldType, positionIncrementGap, prefixMapper,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should fail if the field is not indexed but prefixes are indexed?

@Override
public Iterator<Mapper> iterator() {
if (prefixFieldMapper == null)
return super.iterator();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we usually use brackets even for single-line if/else statements

@romseygeek
Copy link
Contributor Author

Writing the docs has made me query the configuration format - passing {} as the default seems a bit odd. Maybe we should accept true or false as standard, and then the object as an expert setting?

Copy link
Contributor

@jpountz jpountz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left some nit-picks but it looks great to me overall. Something that I'm not a fan of is the fact that we accept true/false in addition to an object. It requires more testing and will need more work if we ever change the way that this option is exposed if we want to maintain bw compat. Also, it doesn't make it much shorter in my opinion since you can also do index_prefix: {} to do the same thing as index_prefix: true?

throw new IllegalArgumentException("min_chars [" + minChars + "] must be less than max_chars [" + maxChars + "]");
if (minChars < 1)
throw new IllegalArgumentException("min_chars [" + minChars + "] must be greater than zero");
if (maxChars >= 20)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please add brackets

: new PrefixFieldMapper(prefixFieldType.setAnalyzer(fieldType.indexAnalyzer()), context.indexSettings());
if (prefixFieldType != null && fieldType().isSearchable() == false) {
throw new IllegalArgumentException("Cannot set index_prefix on unindexed field [" + name() + "]");
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move this if statement before the creation of the prefix field mapper?

}
else if (this.prefixFieldMapper != null || mw.prefixFieldMapper != null) {
throw new IllegalArgumentException("mapper [" + name() + "] has different index_prefix settings, current ["
+ this.prefixFieldMapper + "], merged [" + mw.prefixFieldMapper + "]");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does prefixFieldMapper have a nice toString?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have added one, plus a line to the test

@romseygeek
Copy link
Contributor Author

I removed the true/false option, I agree it was just adding extra complication.

Copy link
Contributor

@jpountz jpountz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left minor comments but the change looks good to me overall. Please make sure PR CI is green before merging.

@@ -161,18 +196,143 @@ public TextFieldMapper build(BuilderContext context) {
builder.fielddataFrequencyFilter(minFrequency, maxFrequency, minSegmentSize);
DocumentMapperParser.checkNoRemainingFields(propName, frequencyFilter, parserContext.indexVersionCreated());
iterator.remove();
} else if (propName.equals("index_prefix")) {
Map<?, ?> indexPrefix = (Map<?, ?>) propNode;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe validate that propNode is indeed not null and a map

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The block above (for fielddataFrequencyFilter) does no checking here. I think if somebody passes in null explicitly, then returning a NullPointerException is fair enough?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might agree if this was a library but there have been several requests that we produce error messages that are as meaningful as possible so I think it's better to check explicitly?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just added a test for this, and it turns out it's already dealt with by TypeParsers.parseTextField() which checks that none of the passed-in parameters are null, so I think we're OK.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perfect. Thanks for checking!

--------------------------------
// CONSOLE
<1> `min_chars` must be greater than zero, defaults to 2
<2> `max_chars` must be greater than `min_chars` and less than 20, defaults to 5
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/greater than/greater than or equal to/ I think?


To configure the index prefix field, use the following syntax. Either or both
of `min_chars` and `max_chars` may be excluded.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should re-explain what it does here and also make it clear that both min_chars and max_chars are inclusive?

@romseygeek romseygeek merged commit 424ecb3 into elastic:master Jan 30, 2018
@romseygeek romseygeek deleted the topic/27049-prefix-fieldtype branch January 30, 2018 08:27
romseygeek added a commit that referenced this pull request Jan 30, 2018
This adds the ability to index term prefixes into a hidden subfield, enabling prefix queries to be run without multitermquery rewrites. The subfield reuses the analysis chain of its parent text field, appending an EdgeNGramTokenFilter. It can be configured with minimum and maximum ngram lengths. Query terms with lengths outside this min-max range fall back to using prefix queries against the parent text field.

The mapping looks like this:

"my_text_field" : {
"type" : "text",
"analyzer" : "english",
"index_prefix" : { "min_chars" : 1, "max_chars" : 10 }
}

Relates to #27049
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>enhancement :Search/Search Search-related issues that do not fall into other categories v6.3.0 v7.0.0-beta1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants