Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure that fields copied using copy_to are not present in synthetic source #112625

Merged
merged 8 commits into from
Sep 12, 2024

Conversation

lkts
Copy link
Contributor

@lkts lkts commented Sep 6, 2024

This is a follow-up change to #112294 that fixes and edge case not covered there.

@lkts lkts force-pushed the copy_to_handling_ignore_source branch 2 times, most recently from fb4e59a to d07d1e1 Compare September 10, 2024 21:09
@lkts lkts changed the title Copy to handling ignore source Ensure that fields copied using copy_to are not present in synthetic source Sep 10, 2024
@lkts lkts force-pushed the copy_to_handling_ignore_source branch from d07d1e1 to d5435ad Compare September 10, 2024 21:10
private static final FieldMapper NO_OP_FIELDMAPPER = new FieldMapper(
"no-op",
new MappedFieldType("no-op", false, false, false, TextSearchInfo.NONE, Collections.emptyMap()) {
private static FieldMapper noopFieldMapper(String path) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is needed to correctly implement fullPath() that is now used in context.isCopyToDestinationField check.

@@ -207,7 +213,7 @@ protected DocumentParserContext(
parent,
dynamic,
new HashSet<>(),
new HashSet<>(),
new HashSet<>(mappingLookup.fieldTypesLookup().getCopyToDestinationFields()),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is one existing user of this DocumentParser#postProcessDynamicArrayMapping but i think the intention of that code is the same as ours. It needs to have logic based on the fact that field is used as a copy_to destination. Previously that logic kicked in only if such a field was processed in this document already. With this, it will kick in the edge case of copy_to destination having actual values in the document. That seems to align with the intention of the code.
cc @kderusso

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is fine to only kick in if copy_to has values. Thank you for the ping.

@lkts lkts requested review from kkrik-es and martijnvg September 10, 2024 22:01
@lkts lkts marked this pull request as ready for review September 10, 2024 22:02
@elasticsearchmachine elasticsearchmachine added the needs:triage Requires assignment of a team area label label Sep 10, 2024
@lkts lkts added >non-issue :StorageEngine/Mapping The storage related side of mappings labels Sep 10, 2024
@elasticsearchmachine elasticsearchmachine added Team:StorageEngine and removed needs:triage Requires assignment of a team area label labels Sep 10, 2024
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-storage-engine (Team:StorageEngine)

&& context.sourceKeepModeFromIndexSettings() == Mapper.SourceKeepMode.ARRAYS;
boolean dynamicRuntimeContext = context.dynamic() == ObjectMapper.Dynamic.RUNTIME;
if (objectRequiresStoringSource || fieldWithFallbackSyntheticSource || dynamicRuntimeContext || fieldWithStoredArraySource) {
boolean copyToFieldHasValuesInDocument = context.isWithinCopyTo() == false
Copy link
Contributor

@kkrik-es kkrik-es Sep 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: String fullPath = context.path().pathAsText(arrayFieldName); and use below.

@@ -103,19 +117,22 @@ static void decodeAndWrite(XContentBuilder b, BytesRef r) throws IOException {
* @throws IOException
*/
static void writeMerged(XContentBuilder b, String fieldName, List<BytesRef> encodedParts) throws IOException {
if (encodedParts.isEmpty()) {
var partsWithData = encodedParts.stream().filter(XContentDataHelper::isDataPresent).toList();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

iirc stream ops are fairly slow, consider replacing with regular list processing.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can probably be optimistic and first check if there's an empty element, then create the copy on a second pass if one is found.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I rewrote this but i don't have a feel if this is any faster. It skips allocating stream object i guess.

if (context.dynamic() == ObjectMapper.Dynamic.RUNTIME && context.canAddIgnoredField()) {
try {
context.addIgnoredField(
IgnoredSourceFieldMapper.NameValue.fromContext(context, path, XContentDataHelper.encodeToken(context.parser()))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Freakin path, again. That should fix issues with runtime?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is specific to this change. It may also help with #111916 eventually.

k: ["5", "6"]
copy: ["7", "8"]
- match: { hits.hits.3.fields.copy: ["5", "6", "7", "8"] }

Copy link
Contributor

@kkrik-es kkrik-es Sep 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add a test where copy_to is applied within a context where the source has been stored already, e.g. in an array that has been stored.

@kkrik-es
Copy link
Contributor

Nice, do we need to add coverage in randomized testing too?

if (encodedParts.isEmpty()) {
var partsWithData = encodedParts.stream().filter(XContentDataHelper::isDataPresent).toList();

if (partsWithData.isEmpty()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

replace partsWithData with encodedParts and move the early exit above?

@@ -158,6 +175,10 @@ static void writeMerged(XContentBuilder b, String fieldName, List<BytesRef> enco
b.endArray();
}

public static boolean isDataPresent(BytesRef encoded) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change this logic to check for empty and rename to isEmpty?

Copy link
Contributor Author

@lkts lkts Sep 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is that better? :)

Copy link
Contributor

@salvatore-campagna salvatore-campagna Sep 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All collections work with the concept of "being empty" and have an isEmpty and I think also that usually writing logic like "if it is empty then exit or return" is easier to understand...no data, nothing to do. Of course it is not better from a functional perspective but I see it more "coherent" with the rest.

For instance encodedParts and partsWithData above use isEmpty.

I see that you use isDataPresent in the stream filter which works on the opposite of isEmpty...so that is probably the reason why you did it like this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, filter is the reason.

@@ -1214,3 +1224,231 @@ fallback synthetic_source for text field:
hits.hits.0._source:
text: [ "world", "hello", "world" ]

---
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also have a reindex test now that we have copy_to working in synthetic source?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll add it in a separate PR.

- match: { hits.hits.1.fields.copy: ["5", "6", "7", "8"] }

---
synthetic_source with copy_to field from dynamic template having values in source:
Copy link
Contributor

@salvatore-campagna salvatore-campagna Sep 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we check that using copy_to results in error when used with object-like fields? Note: it is not just object and nested but also things like range or geo-like. I think this is important for BWC...we need to keep the error-behaviour.

@lkts
Copy link
Contributor Author

lkts commented Sep 11, 2024

@elasticmachine update branch

@lkts lkts merged commit 44c9271 into elastic:main Sep 12, 2024
15 checks passed
@lkts lkts deleted the copy_to_handling_ignore_source branch September 12, 2024 19:18
lkts added a commit to lkts/elasticsearch that referenced this pull request Sep 12, 2024
@elasticsearchmachine
Copy link
Collaborator

💚 Backport successful

Status Branch Result
8.x

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants