Allow metadata fields in the _source #61590
Conversation
Pinging @elastic/es-search (:Search/Mapping)
I think that `DocumentParser` will require some additional changes -- it looks like `FieldMapper#parse` will never be called on a metadata field, even if `isAllowedInSource` is enabled and the value is present in the `_source`. It'd be great to add a test using a dummy `MetadataFieldMapper` to check that everything is wired up correctly.
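For context, a dummy metadata mapper is registered through `MapperPlugin#getMetadataMappers`. A minimal sketch, assuming a hypothetical `MockMetadataMapper` class with a nested `TypeParser` defined alongside it (the PR's actual test code may differ):

```java
import java.util.Collections;
import java.util.Map;

import org.elasticsearch.index.mapper.MetadataFieldMapper;
import org.elasticsearch.plugins.MapperPlugin;
import org.elasticsearch.plugins.Plugin;

public class MockMetadataMapperPlugin extends Plugin implements MapperPlugin {
    @Override
    public Map<String, MetadataFieldMapper.TypeParser> getMetadataMappers() {
        // "_mock_metadata" and MockMetadataMapper are illustrative stand-ins
        return Collections.singletonMap("_mock_metadata", new MockMetadataMapper.TypeParser());
    }
}
```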
Added test `testDocumentContainsAllowedMetadataField()`.
Fixed broken tests
@elasticmachine run elasticsearch-ci/packaging-sample-windows
server/src/test/java/org/elasticsearch/index/mapper/MockMetadataMapperPlugin.java
```java
/**
 * @return Whether a metadata field can be included in the document _source.
 */
default boolean isAllowedInSource() {
```
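If the default returns `false`, a metadata field that should be settable from the source (such as the planned `_doc_count` mapper) would presumably opt in with a one-line override, e.g.:

```java
@Override
public boolean isAllowedInSource() {
    return true; // this metadata field may appear in the document _source
}
```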
I'm wondering if we really need this new flag `isAllowedInSource`. Maybe we could avoid adding a new flag and instead do the following:

- For metadata fields that are not allowed in `_source`, make sure that `MetadataFieldMapper#parse` throws a descriptive error.
- Many metadata fields currently have logic in `parse` that's unrelated to `_source` parsing. We could make sure to move it into the special methods `preParse` or `postParse`.
@jtibshirani I agree with you that delegating this to `MetadataFieldMapper#parse()` would be a simpler solution. However I see two caveats:

- The approach with the `isAllowedInSource()` method detects that a metadata field is being parsed while parsing the field name. This is early in the parsing process. An exception is thrown and the field value parsing is skipped.
- As you mention, I have seen classes (such as `IgnoredFieldMapper` and `VersionFieldMapper`) that override the `parse()` method to do nothing on one hand, while calling `super.parse()` from the `preParse()`/`postParse()` methods (see the sketch below). This approach is too complicated imho, but I am not sure that refactoring this is very simple. WDYT?
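For reference, the override pattern described in the second caveat looks roughly like this (a sketch of the shape, not the exact source of either mapper):

```java
@Override
public void preParse(ParseContext context) throws IOException {
    // the real indexing work runs through the shared FieldMapper plumbing
    super.parse(context);
}

@Override
public void parse(ParseContext context) throws IOException {
    // deliberately a no-op: a value for this field inside _source is ignored
}
```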
> This is early in the parsing process. An exception is thrown and the field value parsing is skipped.

I don't think we need to optimize performance in this case because it's an error condition (and should be a rare one too).

> This approach is too complicated imho, but I am not sure that refactoring this is very simple.

I agree that the logic is complex / hard to read! I haven't looked deeply into the refactoring myself -- maybe you could try it out and we can rediscuss the approach if you see a roadblock?
```java
throw new MapperParsingException("Field [" + currentFieldName + "] is a metadata field and cannot be added inside"
    + " a document. Use the index API request parameters.");

if (context.mapperService().isFieldAllowedInSource(context.path().pathAsText(currentFieldName))) {
    // If token is a metadata field and is allowed in source, parse its value
```
What's the reasoning for rejecting null and non-leaf values here?
@jimczi and I had a discussion about the possible use cases we want to cover with this feature in the near future and agreed that for the sake of simplicity we should only accept non-null values (no objects or arrays).
I don't have any strong opinions and I am happy to revisit this.
I think null should be accepted, but non-leaf values seemed more challenging. Now that I rethink it, it shouldn't be an issue as long as we consume the value from the root (object or not).
So +1 to handling objects, arrays and null values and letting the metadata field mapper deal with them.
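Consuming the value from the root might look like the following sketch, assuming an `XContentParser` positioned on the value token (the real handling would live in each metadata mapper's `parse()`):

```java
XContentParser parser = context.parser();
XContentParser.Token token = parser.currentToken();
if (token == XContentParser.Token.START_OBJECT || token == XContentParser.Token.START_ARRAY) {
    // consume the entire subtree so document parsing can continue afterwards
    parser.skipChildren();
} else if (token == XContentParser.Token.VALUE_NULL) {
    // explicit null: nothing to index, and the token is already consumed
}
```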
Ok, let me work on this again
Delegated this functionality to `MetadataFieldMapper.parse()`.

@jtibshirani I pushed changes that address your comments. Please let me know what you think. :)
```java
}

@Override
protected void parseCreateField(ParseContext context) throws IOException {
```
Instead of overriding `parse` here, could we override `parseCreateField`?

```java
@Override
protected void parseCreateField(ParseContext context) throws IOException {
    throw new MapperParsingException("Field [" + name() + "] is a metadata field and cannot be added inside"
        + " a document. Use the index API request parameters.");
}
```

Then we could do the following:

- For meta fields that cannot be specified in `_source`, move all the relevant logic out of `parseCreateField` and into `preParse` or `postParse` as appropriate.
- Remove the new `doParse` method, as it's no longer needed.
Thanks for pointing this out. I had a hard time thinking of a simple and clean way to refactor this part.

Initially, I tried the approach you suggested, but `FieldMapper#parse()` does all the exception handling by encapsulating all exceptions thrown by `parseCreateField()`, offering a preview of the parsed value. (See `server/src/main/java/org/elasticsearch/index/mapper/FieldMapper.java`, line 249 at 9a127ad: `} catch (Exception e) {`.)

This means that the

```java
throw new MapperParsingException("Field [" + name() + "] is a metadata field and cannot be added inside"
    + " a document. Use the index API request parameters.");
```

will be encapsulated in the following exception:

```java
throw new MapperParsingException("failed to parse field [{}] of type [{}] in document with id '{}'. " +
    "Preview of field's value: '{}'", e, fieldType().name(), fieldType().typeName(),
    context.sourceToParse().id(), valuePreview);
```

On the other hand, the methods `preParse()` and `postParse()` delegate parsing to `parse()` for the same reason (to handle parsing exceptions). If we were to move all parsing into those two methods, we would have to replicate the exception handling there as well.

Since all `FieldMapper#parse()` does is delegate parsing to `parseCreateField()` and handle the exceptions, I decided to move this functionality into `MetadataFieldMapper#doParse()` and call this method from `preParse()`/`postParse()`.
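A rough sketch of that split, using the names from this thread (illustrative, not the final diff):

```java
// In MetadataFieldMapper: the delegation FieldMapper#parse() performs,
// minus its user-facing exception wrapping.
protected void doParse(ParseContext context) throws IOException {
    parseCreateField(context);
}

// Each metadata mapper then calls doParse() from preParse() or postParse(),
// whichever matches when its value becomes available, e.g.:
@Override
public void preParse(ParseContext context) throws IOException {
    doParse(context);
}
```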
To me the approach you tried / I suggested is the cleanest way to go. My reasoning...

> This means that the ... will be encapsulated in the following exception:

This seems like an okay compromise to me: all the exception information is there, so the user can determine the cause. It's a little confusing that we wrap the message, but this should be a rare situation.

> On the other hand, methods preParse() and postParse() delegate parsing to parse() for the same reason (to handle parsing exceptions).

I think that exception handling code is only really helpful when dealing with incorrect user-supplied values. Since `preParse` and `postParse` work with internal data such as `context.sourceToParse().id()`, a failure there would indicate a serious logic error. So I don't think we'd need the same general exception handling strategy where we create a nice user-facing message.
This is looking good to me; I have one last comment. Thanks for all the iterations on this!
```java
private static Mapper getMapper(final ParseContext context, ObjectMapper objectMapper, String fieldName, String[] subfields) {
    String fieldPath = context.path().pathAsText(fieldName);
    if (context.mapperService().isMetadataField(fieldPath)) {
        for (MetadataFieldMapper metadataFieldMapper : context.docMapper().mapping().getMetadataMappers()) {
```
I just noticed we're doing a linear scan through all the metadata mappers. Maybe we should make sure to store them as a map from name -> mapper to avoid this overhead?
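That suggestion might look like the following sketch (assuming the mapper exposes its field name via `name()`):

```java
// Built once per mapping, next to the existing array of metadata mappers.
Map<String, MetadataFieldMapper> metadataMappersByName = new HashMap<>();
for (MetadataFieldMapper metadataMapper : mapping.getMetadataMappers()) {
    metadataMappersByName.put(metadataMapper.name(), metadataMapper);
}

// getMapper() can then replace the per-field linear scan with a map lookup:
MetadataFieldMapper metadataFieldMapper = metadataMappersByName.get(fieldPath);
```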
You are right, I had this point in mind for improvement. I just saw that a linear scan also happens at `server/src/main/java/org/elasticsearch/index/mapper/DocumentParser.java`, line 101 at b8b3a9d:

```java
for (MetadataFieldMapper metadataMapper : metadataFieldsMappers) {
```

I understand that `internalParseDocument()` (which drives `preParse()`/`postParse()`) is called once per document, while the linear scan in `getMapper()` is executed once per metadata field. I will fix this asap!
Fixed this last bit. Can you please have one last look?
Thank you for your patience and excellent guidance.
The change makes sense to me.
It feels redundant that we store an array of metadata mappers, a map from class -> mapper, and a map from name -> mapper. This could be simplified, but we don't necessarily need to do it in this PR.
Instead of linear scan
Backports #61590 to 7.x.
So far we don't allow metadata fields in the document `_source`. However, in the case of the `_doc_count` field mapper (#58339) we want to be able to set its value in the document `_source`.

This PR adds a method to the metadata field parsers that exposes if the field can be included in the document source or not. This way each metadata field can configure if it can be included in the document `_source`.