Add ability to index prefixes on text fields #28290

romseygeek · 2018-01-18T11:00:38Z

This adds the ability to index term prefixes into a hidden subfield, enabling prefix queries to be run without MultiTermQuery rewrites. The subfield reuses the analysis chain of its parent text field, appending an EdgeNGramTokenFilter. It can be configured with minimum and maximum ngram lengths. Query terms with lengths outside this min-max range fall back to using prefix queries against the parent text field.

The mapping looks like this:

"my_text_field" : {
    "type" : "text",
    "analyzer" : "english",
    "index_prefix" : { "min_chars" : 1, "max_chars" : 10 }
}

This implementation uses a dedicated FieldType and FieldMapper within TextFieldMapper.
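To make the index-side behaviour concrete, here is a minimal sketch of what an edge n-gram step produces for a single analyzed term. It is not the PR's code: `EdgeNGramSketch` and `prefixes` are hypothetical names, the bounds are assumed to be inclusive, and lengths are counted in Java `char`s for simplicity.

```java
import java.util.ArrayList;
import java.util.List;

final class EdgeNGramSketch {
    // Hypothetical helper: returns the leading substrings of `term` whose
    // lengths lie in [minChars, maxChars], both bounds assumed inclusive.
    static List<String> prefixes(String term, int minChars, int maxChars) {
        List<String> out = new ArrayList<>();
        // A term shorter than maxChars only yields prefixes up to its own length.
        int upper = Math.min(maxChars, term.length());
        for (int len = minChars; len <= upper; len++) {
            out.add(term.substring(0, len));
        }
        return out;
    }

    public static void main(String[] args) {
        // With the mapping above (min_chars 1, max_chars 10), the term "short"
        // would contribute five prefix tokens to the hidden subfield.
        System.out.println(prefixes("short", 1, 10));
    }
}
```

Each of these prefixes becomes an ordinary indexed term, which is why a prefix query within the configured range can be answered by a plain term lookup.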

Supersedes #28222

jpountz

I know I was the one who suggested it, but I'm still wondering whether we should go with the double dot approach or just ${field}._prefix like you did in the first PR. The fact that you now return the prefix mapper in iterator() should make mapping updates fail if a user tries to add a multi-field with prefix as a name since there would be two fields with the same name. Maybe this is a better approach? cc @rjernst @jimczi

jpountz · 2018-01-18T11:15:46Z

server/src/main/java/org/elasticsearch/index/mapper/TextFieldMapper.java

+    public Iterator<Mapper> iterator() {
+        if (prefixFieldMapper == null)
+            return super.iterator();
+        return Iterators.concat(multiFields.iterator(), Collections.singleton(prefixFieldMapper).iterator());


maybe replace multiFields.iterator() with super.iterator() to be more future-proof in case the super impl is updated in the future?
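A rough sketch of the suggestion, using plain strings in place of `Mapper` instances: the subclass delegates to `super.iterator()` and appends the hidden prefix field, so any change to what the parent exposes is picked up automatically. `IteratorSketch`, `BaseMapper` and `TextMapper` are hypothetical stand-ins, and the sketch copies into a list rather than using the `Iterators.concat` helper from the diff.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

final class IteratorSketch {
    // Stand-in for the parent mapper: iterates over user-defined multi-fields.
    static class BaseMapper implements Iterable<String> {
        final List<String> multiFields;

        BaseMapper(List<String> multiFields) {
            this.multiFields = multiFields;
        }

        @Override
        public Iterator<String> iterator() {
            return multiFields.iterator();
        }
    }

    // Stand-in for the text mapper: adds the hidden prefix field when configured.
    static class TextMapper extends BaseMapper {
        final String prefixField; // null when index_prefix is not configured

        TextMapper(List<String> multiFields, String prefixField) {
            super(multiFields);
            this.prefixField = prefixField;
        }

        @Override
        public Iterator<String> iterator() {
            if (prefixField == null) {
                return super.iterator();
            }
            // Build on super.iterator() instead of reading multiFields directly.
            List<String> all = new ArrayList<>();
            super.iterator().forEachRemaining(all::add);
            all.add(prefixField);
            return all.iterator();
        }
    }
}
```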

jpountz · 2018-01-18T11:17:33Z

server/src/main/java/org/elasticsearch/index/mapper/TextFieldMapper.java

+                } else if (propName.equals("index_prefix")) {
+                    Map<?, ?> indexPrefix = (Map<?, ?>) propNode;
+                    int minChars = XContentMapValues.nodeIntegerValue(indexPrefix.remove("min_chars"), 0);
+                    int maxChars = XContentMapValues.nodeIntegerValue(indexPrefix.remove("max_chars"), 10);


this got me wondering whether we should call it min_gram and max_gram like the option on the (edge) ngram tokenizer/filter

I've gone back and forth on this a bit. It would be more consistent, but I think it doesn't make as much sense outside of the context of the ngram tokenizer - if you're reading { "index_prefix" : { "min_gram" : 0 } } it's not obvious what a 'gram' actually is?

jpountz · 2018-01-18T11:18:53Z

server/src/main/java/org/elasticsearch/index/mapper/TextFieldMapper.java

@@ -113,6 +125,12 @@ public Builder fielddataFrequencyFilter(double minFreq, double maxFreq, int minS
            return builder;
        }

+        public Builder indexPrefixes(int minChars, int maxChars) {
+            this.prefixFieldType = new PrefixFieldType(name() + "..prefix", minChars, maxChars);
+            fieldType().setPrefixFieldType(this.prefixFieldType);


should we validate that minChars <= maxChars?

and also that minChars >= 1?

jpountz · 2018-01-18T11:19:20Z

server/src/main/java/org/elasticsearch/index/mapper/TextFieldMapper.java

@@ -161,18 +181,116 @@ public TextFieldMapper build(BuilderContext context) {
                    builder.fielddataFrequencyFilter(minFrequency, maxFrequency, minSegmentSize);
                    DocumentMapperParser.checkNoRemainingFields(propName, frequencyFilter, parserContext.indexVersionCreated());
                    iterator.remove();
+                } else if (propName.equals("index_prefix")) {
+                    Map<?, ?> indexPrefix = (Map<?, ?>) propNode;
+                    int minChars = XContentMapValues.nodeIntegerValue(indexPrefix.remove("min_chars"), 0);


should it be 1 rather than 0?

romseygeek · 2018-01-18T11:28:08Z

Identical field names are detected at the JSON parser level rather than at the mappings level, unfortunately, so adding a multi-field called prefix silently replaces the internal prefix field.

jpountz · 2018-01-18T13:23:27Z

We are supposed to have validation at the mappings level too for the reason you mentioned, which leverages the Mapper.iterator method and makes sure that two fields never declare the same field name, see MapperService.checkFieldUniqueness.

romseygeek · 2018-01-18T13:29:09Z

MapperService.checkFieldUniqueness is only called during a merge, I think? So it's too late for checking that we don't have multiple field definitions within a single DocumentMapper.

jpountz · 2018-01-18T13:41:16Z

MapperService.checkFieldUniqueness is only called during a merge

This is true, but merges are the only way to modify mappings, so it should cover all cases. For instance at index creation time, we merge the provided mappings into an empty instance, dynamic introduction of new fields creates mapping updates that are merged before we index documents, etc.
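The kind of merge-time check being described can be sketched as follows. This is only an illustration of the idea, not the body of `MapperService.checkFieldUniqueness`: the class name, method name and error message are hypothetical, and field names are plain strings as they would be collected by walking each mapper's iterator.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class FieldUniquenessSketch {
    // Hypothetical check: rejects a merged mapping in which two mappers
    // report the same full field name.
    static void checkUnique(List<String> fullFieldNames) {
        Set<String> seen = new HashSet<>();
        for (String name : fullFieldNames) {
            // Set.add returns false when the name was already present.
            if (seen.add(name) == false) {
                throw new IllegalArgumentException("Field [" + name + "] is defined twice");
            }
        }
    }
}
```

Because the hidden prefix mapper is returned from `iterator()`, a user-defined multi-field with the same name would show up as a second entry in such a list and be rejected.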

romseygeek · 2018-01-18T13:57:30Z

OK, if I create the DocumentMapper via MapperService.merge() then I can get an IllegalArgumentException if there's a 'prefix' subfield.

My only concern here is that it's not entirely obvious to a user why there are multiple definitions of the prefix field - after all, they've only defined it once.

jpountz · 2018-01-18T14:15:43Z

Yes, I agree we should validate names of multi-fields explicitly, but I still like to have this safety as a fallback.

romseygeek · 2018-01-18T14:18:11Z

Maybe the field should be called index_prefix, to match with the DSL field name?

jpountz · 2018-01-18T14:20:00Z

👍 it would make it clearer where this field comes from if users get an error message about it. Maybe still prefix with an underscore like other fields that Elasticsearch manages by itself, ie. _index_prefix?

jimczi

I left some additional comments

jimczi · 2018-01-18T17:40:05Z

server/src/main/java/org/elasticsearch/index/mapper/TextFieldMapper.java

+            if (minChars > maxChars)
+                throw new IllegalArgumentException("min_chars [" + minChars + "] must be less than max_chars [" + maxChars + "]");
+            if (minChars < 1)
+                throw new IllegalArgumentException("min_chars [" + minChars + "] must be greater than zero");


Not sure we should let the user define min and max at all. If we do, should we at least have a limit on maxChars?
We could also have sane default values (2 to 5 should be good enough for 99% of cases?).

I haven't been able to find any particularly useful stats on prefix distributions. I'm guessing that it will be different for different languages, and depends a bit on whether the analysis chain decompounds things like Danish article prefixes. 2 to 5 sounds like a reasonable default though.

Maybe a max limit of something like 20? I've done this elsewhere, where we had base64 encoded images stuffed into a text field and the ngram filter spent about an hour trying to tokenize it.

For the minimum, I can see arguments for both 1 and 2 chars, so I think it's reasonable to allow configuration.

Maybe a max limit of something like 20

+1

2 to 5 sounds like a reasonable default though.

+1 too. Changing these values should be considered advanced (expert) usage?
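Putting the points from this thread together, the bounds handling could look roughly like the sketch below. The defaults (2 and 5), the lower bound of 1 and the `maxChars >= 20` rejection follow the discussion and later diffs; the class name, the `resolve` method, the use of `null` for "not specified" and the wording of the third message are assumptions made for illustration.

```java
final class PrefixBoundsSketch {
    static final int DEFAULT_MIN_CHARS = 2;
    static final int DEFAULT_MAX_CHARS = 5;

    // Hypothetical helper: applies defaults for missing values, then validates.
    // Returns {minChars, maxChars}.
    static int[] resolve(Integer minChars, Integer maxChars) {
        int min = minChars == null ? DEFAULT_MIN_CHARS : minChars;
        int max = maxChars == null ? DEFAULT_MAX_CHARS : maxChars;
        if (min > max) {
            throw new IllegalArgumentException(
                "min_chars [" + min + "] must be less than max_chars [" + max + "]");
        }
        if (min < 1) {
            throw new IllegalArgumentException("min_chars [" + min + "] must be greater than zero");
        }
        if (max >= 20) {
            // Message text is an assumption; only the condition appears in the diff.
            throw new IllegalArgumentException("max_chars [" + max + "] must be less than 20");
        }
        return new int[] { min, max };
    }
}
```

Note that the first check only rejects `min > max`, so equal values are accepted even though the message says "less than".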

jimczi · 2018-01-18T17:41:44Z

server/src/main/java/org/elasticsearch/index/mapper/TextFieldMapper.java

+        final int maxChars;
+
+        PrefixFieldType(String name, int minChars, int maxChars) {
+            setTokenized(true);


Can we deactivate norms, positions and frequencies?

jimczi · 2018-01-18T17:42:55Z

server/src/main/java/org/elasticsearch/index/mapper/TextFieldMapper.java

+            if (prefixFieldType == null || prefixFieldType.accept(value.length()) == false) {
+                return super.prefixQuery(value, method, context);
+            }
+            return prefixFieldType.termQuery(value, context);


Should we use a constant score as well if the rewrite method is null or constant_score?
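The routing decision in the hunk above can be illustrated without Lucene. In this sketch `accept` is assumed to be an inclusive range check, the subfield name `_index_prefix` is taken from a later commit title in this PR, and the returned strings are only human-readable descriptions of which query would be built; `PrefixRoutingSketch` and `route` are hypothetical names.

```java
final class PrefixRoutingSketch {
    // Assumed semantics: a query prefix is served by the subfield only when
    // its length lies within [minChars, maxChars], both inclusive.
    static boolean accept(int length, int minChars, int maxChars) {
        return length >= minChars && length <= maxChars;
    }

    // Describes which query would be used for a given prefix.
    static String route(String field, String prefix, int minChars, int maxChars) {
        if (accept(prefix.length(), minChars, maxChars)) {
            // In range: a single term lookup against the hidden subfield.
            return "term " + field + "._index_prefix:" + prefix;
        }
        // Out of range: fall back to a regular prefix query on the parent field.
        return "prefix " + field + ":" + prefix + "*";
    }
}
```

With bounds of 2 and 5, `route("text", "shor", 2, 5)` describes a term query on the subfield, while a one-character prefix falls back to the parent field.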

jpountz · 2018-01-18T22:07:48Z

rest-api-spec/src/main/resources/rest-api-spec/test/search/190_index_prefix_search.yml

+        q: shor*
+        df: text
+
+  - match: {hits.total: 1}


maybe check the score too?

jpountz · 2018-01-18T22:10:55Z

server/src/main/java/org/elasticsearch/index/mapper/TextFieldMapper.java

+            return "prefix";
+        }
+    }
+
    public static final class TextFieldType extends StringFieldType {


you will need to implement checkfieldtype so that users get an error if they try to update the index_prefix settings

And add the modifier in TextFieldTypeTests.

jpountz · 2018-01-18T22:14:20Z

server/src/main/java/org/elasticsearch/index/mapper/TextFieldMapper.java

            return new TextFieldMapper(
-                    name, fieldType, defaultFieldType, positionIncrementGap,
+                    name, fieldType, defaultFieldType, positionIncrementGap, prefixMapper,


Maybe we should fail if the field is not indexed but prefixes are indexed?

jpountz · 2018-01-18T22:17:06Z

server/src/main/java/org/elasticsearch/index/mapper/TextFieldMapper.java

+    @Override
+    public Iterator<Mapper> iterator() {
+        if (prefixFieldMapper == null)
+            return super.iterator();


we usually use brackets even for single-line if/else statements

romseygeek · 2018-01-19T14:04:48Z

Writing the docs has made me query the configuration format - passing {} as the default seems a bit odd. Maybe we should accept true or false as standard, and then the object as an expert setting?

jpountz

I left some nit-picks but it looks great to me overall. Something that I'm not a fan of is the fact that we accept true/false in addition to an object. It requires more testing and will need more work if we ever change the way that this option is exposed if we want to maintain bw compat. Also, it doesn't make it much shorter in my opinion since you can also do index_prefix: {} to do the same thing as index_prefix: true?

jpountz · 2018-01-19T17:10:11Z

server/src/main/java/org/elasticsearch/index/mapper/TextFieldMapper.java

+                throw new IllegalArgumentException("min_chars [" + minChars + "] must be less than max_chars [" + maxChars + "]");
+            if (minChars < 1)
+                throw new IllegalArgumentException("min_chars [" + minChars + "] must be greater than zero");
+            if (maxChars >= 20)


please add brackets

jpountz · 2018-01-19T17:14:11Z

server/src/main/java/org/elasticsearch/index/mapper/TextFieldMapper.java

+                : new PrefixFieldMapper(prefixFieldType.setAnalyzer(fieldType.indexAnalyzer()), context.indexSettings());
+            if (prefixFieldType != null && fieldType().isSearchable() == false) {
+                throw new IllegalArgumentException("Cannot set index_prefix on unindexed field [" + name() + "]");
+            }


move this if statement before the creation of the prefix field mapper?

jpountz · 2018-01-19T17:16:55Z

server/src/main/java/org/elasticsearch/index/mapper/TextFieldMapper.java

+        }
+        else if (this.prefixFieldMapper != null || mw.prefixFieldMapper != null) {
+            throw new IllegalArgumentException("mapper [" + name() + "] has different index_prefix settings, current ["
+                + this.prefixFieldMapper + "], merged [" + mw.prefixFieldMapper + "]");


does prefixFieldMapper have a nice toString?

Have added one, plus a line to the test

romseygeek · 2018-01-23T13:04:31Z

I removed the true/false option, I agree it was just adding extra complication.

jpountz

I left minor comments but the change looks good to me overall. Please make sure PR CI is green before merging.

jpountz · 2018-01-23T17:13:49Z

server/src/main/java/org/elasticsearch/index/mapper/TextFieldMapper.java

@@ -161,18 +196,143 @@ public TextFieldMapper build(BuilderContext context) {
                    builder.fielddataFrequencyFilter(minFrequency, maxFrequency, minSegmentSize);
                    DocumentMapperParser.checkNoRemainingFields(propName, frequencyFilter, parserContext.indexVersionCreated());
                    iterator.remove();
+                } else if (propName.equals("index_prefix")) {
+                    Map<?, ?> indexPrefix = (Map<?, ?>) propNode;


maybe validate that propNode is indeed not null and a map

The block above (for fielddataFrequencyFilter) does no checking here. I think if somebody passes in null explicitly, then throwing a NullPointerException is fair enough?

I might agree if this was a library but there have been several requests that we produce error messages that are as meaningful as possible so I think it's better to check explicitly?

Just added a test for this, and it turns out it's already dealt with by TypeParsers.parseTextField() which checks that none of the passed-in parameters are null, so I think we're OK.

Perfect. Thanks for checking!

jpountz · 2018-01-23T17:17:59Z

docs/reference/mapping/types/text.asciidoc

+--------------------------------
+// CONSOLE
+<1> `min_chars` must be greater than zero, defaults to 2
+<2> `max_chars` must be greater than `min_chars` and less than 20, defaults to 5


s/greater than/greater than or equal to/ I think?

jpountz · 2018-01-23T17:19:04Z

docs/reference/mapping/types/text.asciidoc

+
+To configure the index prefix field, use the following syntax.  Either or both
+of `min_chars` and `max_chars` may be excluded.
+


I think we should re-explain what it does here and also make it clear that both min_chars and max_chars are inclusive?

Relates to #27049

romseygeek added 8 commits January 15, 2018 13:47

Add index_prefix option to text fields

794e00b

Move PrefixWrappedAnalyzer into private class

592f501

checkstyle

0bf3370

Merge remote-tracking branch 'origin/master' into topic/27049-prefix-…

36d0a9f

…index-field

Use double-dot fieldname to prevent mapping clashes

f7f5b84

WIP

23ea676

Merge remote-tracking branch 'origin/master' into topic/27049-prefix-…

055e946

…fieldtype

Use prefix fieldtype and mapper

e30ef05

romseygeek requested review from jimczi and jpountz January 18, 2018 11:00

romseygeek added :Search/Search Search-related issues that do not fall into other categories v7.0.0 v6.3.0 >enhancement labels Jan 18, 2018

jpountz reviewed Jan 18, 2018

romseygeek added 2 commits January 18, 2018 11:45

Address feedback

ed1780e

Fix checkstyle

c1d3bc9

Revert to single dot subfield, rely on MappingService to catch duplic…

8c5a4eb

…ate fields

romseygeek added 2 commits January 18, 2018 14:55

Change fieldname to _index_prefix, add rest-spec test

68d989b

Fix tests

dee5d5d

jimczi reviewed Jan 18, 2018

jpountz reviewed Jan 18, 2018

romseygeek added 2 commits January 19, 2018 13:43

More error checking, field merges, test scores

829140a

checkstyle

1e4f3a1

Add doc, allow 'true' or 'false' as a simple config option

06387a9

jpountz reviewed Jan 19, 2018

romseygeek added 2 commits January 22, 2018 20:21

Remove optional true/false setting

dbf397b

review comments

54a7532

jpountz approved these changes Jan 23, 2018

romseygeek added 2 commits January 29, 2018 09:50

Merge branch 'master' into topic/27049-prefix-fieldtype

80dc369

doc update

1cfec0a

romseygeek mentioned this pull request Jan 29, 2018

Add index_prefix option to text fields #28222

Closed

romseygeek added 6 commits January 29, 2018 10:27

Add test for null check; fix test compilation

a4393b1

Merge remote-tracking branch 'origin/master' into topic/27049-prefix-…

9c7fe6c

…fieldtype

Fix BWC test

6a88ab6

Correct BWC skip message

0e6812a

Merge remote-tracking branch 'origin/master' into topic/27049-prefix-…

d1c6e1c

…fieldtype

BWC

12c2462

romseygeek merged commit 424ecb3 into elastic:master Jan 30, 2018

romseygeek deleted the topic/27049-prefix-fieldtype branch January 30, 2018 08:27

jpountz mentioned this pull request Jun 4, 2018

Make it easier to optimize search with better analysis #27049

Closed

jimczi mentioned this pull request Jul 10, 2018

Optimize phrase_prefix match query #31921

Closed

colings86 added v7.0.0-beta1 and removed v7.0.0 labels Feb 7, 2019

