Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Geo Point parse error fix #40447

Conversation

henningandersen
Copy link
Contributor

When geo point parsing threw a parse exception, it did not consume
remaining tokens from the parser. This in turn meant that
indexing documents with malformed geo points into mappings with
ignore_malformed=true would fail in some cases, since DocumentParser
expects geo_point parsing to end on the END_OBJECT token.

Related to #17617

When geo point parsing threw a parse exception, it did not consume
remaining tokens from the parser. This in turn meant that
indexing documents with malformed geo points into mappings with
ignore_malformed=true would fail in some cases, since DocumentParser
expects geo_point parsing to end on the END_OBJECT token.

Related to elastic#17617
@henningandersen henningandersen added >bug :Analytics/Geo Indexing, search aggregations of geo points and shapes v8.0.0 labels Mar 26, 2019
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-analytics-geo

Copy link
Contributor

@imotov imotov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We had the same issues on geoshape side and used a slightly different approach in geoshape parsing. We now have a special wrapper for XContentParser that allows skipping children in case of a parse error regardless on where the error occurred. It allows us to have one catch all section instead of thinking about every single path that things can go wrong. It also guarantees that we will never consume more tokens than we are supposed to or leave some hanging around. Would you be interested in trying this approach instead?

When geo point parsing threw a parse exception, it did not consume
remaining tokens from the parser. This in turn meant that
indexing documents with malformed geo points into mappings with
ignore_malformed=true would fail in some cases, since DocumentParser
expects geo_point parsing to end on the END_OBJECT token.

Related to elastic#17617
@henningandersen
Copy link
Contributor Author

Thanks for the tip on the sub parser, @imotov. I adjusted to use that approach in e4791a9.

I changed the sub parser to also support arrays (since they have the same fundamental issue) and chose to also support non-object/non-arrays (this makes it much simpler to use at least for the geo point case).

Forgot to fix XContentParserTests, will look into that tomorrow morning.

Improved XContentSubParser to allow any token, which is useful for
wrapping in cases where both object and values are allowed.

Related to elastic#17617
@henningandersen
Copy link
Contributor Author

@imotov this should now be ready for another review. A few notable changes:

  1. sub parser: currentToken is now the same value as last call to nextToken (including null) and fixed all methods to mimick end of doc behaviour.
  2. as mentioned above: allowing sub parser on both arrays and values (ended up allowing any token) is desirable at least in the geo points case.

Thinking more about this, it could be desirable to let the DocumentParser create the sub parser, ensuring that individual mappers do not have to consider this. I chose not to pursue this for now (larger effort).

@henningandersen henningandersen requested a review from imotov March 27, 2019 09:57
@imotov
Copy link
Contributor

imotov commented Mar 27, 2019

allowing sub parser on both arrays and values (ended up allowing any token) is desirable at least in the geo points case.

Could you explain the motivation behind supporting values a bit more? The original intent of the subparser was to create a parsing context that the parser cannot escape or underconsume in case of an error. In case of the object the context is the entire object until the corresponding closing bracket. I can clearly see how this logic can be extended to arrays, where the context is all elements until the closing ]. However, what is the context in case of the value? It seems like it is just this value and the parser wrapper seems to be a bit overkill here since you already have the value and you cannot really underconsume or overconsume it accidently.

@henningandersen
Copy link
Contributor Author

henningandersen commented Mar 27, 2019

There are two primary motivations for also supporting value:

  1. It makes it possible to wrap any "object", regardless of string, number or complex object. This in turn means that any parsing functionality can easily be wrapped in a sub parser. This is used in GeoUtils here: https://github.com/elastic/elasticsearch/pull/40447/files#diff-ac59342091d31c8a6b2b5a349712bac6R434 since parsing also supports string here: https://github.com/elastic/elasticsearch/pull/40447/files#diff-ac59342091d31c8a6b2b5a349712bac6R526 . The alternative would be to selectively construct the sub parser only for object and array, which makes the codd less clean.
  2. The subparser really does two things:
  • In case of errors close will find matching END token. (the original motivation).
  • It disallows parsing beyond the current object. This part is also desirable for values, to ensure that parsing a value does not call nextToken (exactly like we expect an object/array to stop on END).

@imotov
Copy link
Contributor

imotov commented Mar 27, 2019

I feel like considering the amount of complexity that supporting a single value adds, it might be easier to just wrap after we encountered an object or an array instead of blindly wrapping every time on the outside. Basically just wrap here and here. This case we can get away with all the complexity of handling values.

@henningandersen
Copy link
Contributor Author

I feel like considering the amount of complexity that supporting a single value adds,

I am not sure I follow you on this, the change to support values was really a one liner (initialize level to 1 or 0 depending on current token).

The rest of the checks were added to support that whatever comes out of nextToken() will also come out of currentToken(). Before this, you could forward the subparser beyond the END, get null back from nextToken() and then call currentToken() or any of the other accessor methods to get the value. This is an existing issue before this PR, but now that we start using the sub parser more it made sense to me to ensure this property holds also for the sub parser.

I do not mind following your suggestion, but I wanted to clarify above first to ensure this was included in your evaluation before I do the changes.

@imotov
Copy link
Contributor

imotov commented Mar 27, 2019

Before this, you could forward the subparser beyond the END

I agree that we should fix currentToken and currentName, but I think all other accessors method become only the real issue if you allow stopping on values not on the end of object or array. Otherwise, you are calling them on the "}" or "]", they return some nonsense, but I don't think we have to make them consistent with nonsense that is returned when you go beyond the end with normal parser.

In any case I would prefer not to wrap the values, since it gives too much runtime overhead for too little value in my opinion.

Reverted to a minimalistic change.

Related to elastic#17617
@henningandersen
Copy link
Contributor Author

@imotov, reverted my previous changes and made a minimalistic change to fix the issue reported. Please have another look.

Copy link
Contributor

@imotov imotov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks for doing this! Really appreciate it.

@henningandersen henningandersen merged commit bd9e9b3 into elastic:master Mar 28, 2019
jasontedor added a commit to rjernst/elasticsearch that referenced this pull request Mar 28, 2019
* master: (25 commits)
  [DOCS] Correct keystore commands for Email and Jira actions in Watcher (elastic#40417)
  [DOCS] Document common settings for snapshot repository plugins (elastic#40475)
  Remove with(out)-system-key tests (elastic#40547)
  Geo Point parse error fix (elastic#40447)
  Handle null retention leases in WaitForNoFollowersStep (elastic#40477)
  [DOCS] Adds anchors for ruby client (elastic#39867)
  Mute DataFrameAuditorIT#testAuditorWritesAudits
  Disable integTest when Docker is not available (elastic#40585)
  Add randomScore function in script_score query (elastic#40186)
  Test fixtures krb5 (elastic#40297)
  Correct ILM metadata minimum compatibility version (elastic#40569)
  SQL: Centralize SQL test dependencies version handling (elastic#40551)
  Mute testTracerLog
  Mute testHttpInput
  Include functions' aliases in the list of functions (elastic#40584)
  Optimise rejection of out-of-range `long` values (elastic#40325)
  Add docs for cluster.remote.*.proxy setting (elastic#40281)
  Migrate systemd packaging tests from bats to java (elastic#39954)
  Move top-level pipeline aggs out of QuerySearchResult (elastic#40319)
  Use openjdk 12 in packer cache script (elastic#40498)
  ...
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Mar 29, 2019
* elastic/master: (77 commits)
  Update ingest jdocs that a null return value will drop the current document. (elastic#40359)
  Remove -Xlint exclusions in the ingest-common module. (elastic#40505)
  Update docs for the DFR similarity (elastic#40579)
  Fix merging of search_as_you_type field mapper (elastic#40593)
  Support roles with application privileges against wildcard applications (elastic#40398)
  Remove obsolete security settings (elastic#40496)
  Remove Gradle deprecation warnings (elastic#40449)
  Run the build integ test in parallel (elastic#39788)
  Fix 3rd pary S3 tests (elastic#40588)
  lower bwc skip for search as you type (elastic#40599)
  Muting XContentParserTests#testSubParserArray
  Fixing typo in test error message (elastic#40611)
  Update max dims for vectors to 1024. (elastic#40597)
  Add start and stop time to cat recovery API (elastic#40378)
  [DOCS] Correct keystore commands for Email and Jira actions in Watcher (elastic#40417)
  [DOCS] Document common settings for snapshot repository plugins (elastic#40475)
  Remove with(out)-system-key tests (elastic#40547)
  Geo Point parse error fix (elastic#40447)
  Handle null retention leases in WaitForNoFollowersStep (elastic#40477)
  [DOCS] Adds anchors for ruby client (elastic#39867)
  ...
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Mar 29, 2019
* elastic/master: (129 commits)
  Update ingest jdocs that a null return value will drop the current document. (elastic#40359)
  Remove -Xlint exclusions in the ingest-common module. (elastic#40505)
  Update docs for the DFR similarity (elastic#40579)
  Fix merging of search_as_you_type field mapper (elastic#40593)
  Support roles with application privileges against wildcard applications (elastic#40398)
  Remove obsolete security settings (elastic#40496)
  Remove Gradle deprecation warnings (elastic#40449)
  Run the build integ test in parallel (elastic#39788)
  Fix 3rd pary S3 tests (elastic#40588)
  lower bwc skip for search as you type (elastic#40599)
  Muting XContentParserTests#testSubParserArray
  Fixing typo in test error message (elastic#40611)
  Update max dims for vectors to 1024. (elastic#40597)
  Add start and stop time to cat recovery API (elastic#40378)
  [DOCS] Correct keystore commands for Email and Jira actions in Watcher (elastic#40417)
  [DOCS] Document common settings for snapshot repository plugins (elastic#40475)
  Remove with(out)-system-key tests (elastic#40547)
  Geo Point parse error fix (elastic#40447)
  Handle null retention leases in WaitForNoFollowersStep (elastic#40477)
  [DOCS] Adds anchors for ruby client (elastic#39867)
  ...
@ywelsch
Copy link
Contributor

ywelsch commented Mar 29, 2019

Talked to @imotov and we think this should be backported up to 6.7.

henningandersen added a commit that referenced this pull request Mar 29, 2019
When geo point parsing threw a parse exception, it did not consume
remaining tokens from the parser. This in turn meant that
indexing documents with malformed geo points into mappings with
ignore_malformed=true would fail in some cases, since DocumentParser
expects geo_point parsing to end on the END_OBJECT token.

Related to #17617
henningandersen added a commit that referenced this pull request Mar 29, 2019
When geo point parsing threw a parse exception, it did not consume
remaining tokens from the parser. This in turn meant that
indexing documents with malformed geo points into mappings with
ignore_malformed=true would fail in some cases, since DocumentParser
expects geo_point parsing to end on the END_OBJECT token.

Related to #17617
@henningandersen
Copy link
Contributor Author

henningandersen commented Mar 29, 2019

Backported to 7.x, 7.0 and 6.7.

jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Apr 1, 2019
* elastic/7.0:
  [TEST] Mute WebhookHttpsIntegrationTests.testHttps
  [DOCS] Add 'time value' links to several monitor settings (elastic#40633) (elastic#40687)
  Do not perform cleanup if Manifest write fails with dirty exception (elastic#40519)
  Remove mention of soft deletes from getting started (elastic#40668)
  Fix bug in detecting use of bundled JDK on macOS
  Reindex conflicts clarification (docs) (elastic#40442)
  SQL: [Tests] Enable integration tests for fixed issues (elastic#40664)
  Add information about the default sort mode (elastic#40657)
  SQL: [Docs] Fix example for CURDATE
  SQL: [Docs] Fix doc errors regarding CURRENT_DATE. (elastic#40649)
  Clarify using time_zone and date math in range query (elastic#40655)
  Add notice for bundled jdk (elastic#40576)
  disable kerberos test until kerberos fixture is working again
  [DOCS] Use "source" instead of "inline" in ML docs (elastic#40635)
  Unmute and fix testSubParserArray (elastic#40626)
  Geo Point parse error fix (elastic#40447)
  Increase suite timeout to 30 minutes for docs tests (elastic#40521)
  Fix repository-hdfs when no docker and unnecesary fixture
  Avoid building hdfs-fixure use an image that works instead
henningandersen added a commit that referenced this pull request Apr 8, 2019
When geo point parsing threw a parse exception, it did not consume
remaining tokens from the parser. This in turn meant that
indexing documents with malformed geo points into mappings with
ignore_malformed=true would fail in some cases, since DocumentParser
expects geo_point parsing to end on the END_OBJECT token.

Related to #17617
henningandersen added a commit that referenced this pull request Apr 8, 2019
Fixed compilation on 6.7
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Analytics/Geo Indexing, search aggregations of geo points and shapes >bug v6.7.1 v7.0.0-rc2 v7.2.0 v8.0.0-alpha1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants