
LUCENE-10654: Add new ShapeDocValuesField for LatLonShape and XYShape #1017

Merged 17 commits on Aug 9, 2022

Conversation

@nknize (Member) commented Jul 12, 2022

Adds new doc value field to support LatLonShape and XYShape doc values. The
implementation is inspired by ComponentTree. A binary tree of tessellated
components (point, line, or triangle) is created. This tree is then DFS
serialized to a variable compressed DataOutput buffer to keep the doc value
format as compact as possible.

DocValue queries are performed on the serialized tree using a similar component
relation logic as found in SpatialQuery for BKD indexed shapes. To make this
possible some of the relation logic is refactored to make it accessible to the
doc value query counterpart.
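
To make the serialization idea concrete, here is a minimal, hypothetical sketch (not the PR's actual classes): each node's encoded bounding box is translated out of the signed int range so Lucene's variable-length writeVLong stays compact, and children follow their parent in the byte stream in depth-first order.

import java.io.IOException;
import org.apache.lucene.store.ByteBuffersDataOutput;

// Hypothetical, simplified tree node; the real structure lives inside the PR's ShapeDocValues.
final class TreeNode {
  int minX, maxX, minY, maxY; // encoded bounding box of this component
  TreeNode left, right;
}

final class TreeWriter {
  // DFS-serialize the tree: translating by Integer.MIN_VALUE makes every value non-negative so
  // writeVLong uses few bytes for small boxes. A real encoding would also record child offsets
  // or flags so the tree can be deserialized and navigated.
  static void writeNode(TreeNode node, ByteBuffersDataOutput out) throws IOException {
    if (node == null) {
      return;
    }
    out.writeVLong((long) node.minX - Integer.MIN_VALUE);
    out.writeVLong((long) node.maxX - Integer.MIN_VALUE);
    out.writeVLong((long) node.minY - Integer.MIN_VALUE);
    out.writeVLong((long) node.maxY - Integer.MIN_VALUE);
    writeNode(node.left, out);  // children follow their parent, depth-first
    writeNode(node.right, out);
  }
}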

Current limitations (to be addressed in follow up PR)

  1. Only Polygon Doc Values are tested
  2. CONTAINS relation not yet supported
  3. Only BoundingBox queries are supported (General Geometry Queries will be added in a follow on enhancement PR)

Highlights

  1. Standalone companion field independent of LatLonShape and XYShape
  2. Uses same test infrastructure as queries to ensure relation behavior matches

@nknize (Member, Author) commented Jul 15, 2022

Hey @iverase; here's the PR related to the ShapeDocValuesField for the 9.3 release.

@iverase (Contributor) commented Jul 18, 2022

Hey @nknize!

I went through the PR and my feeling is that it is still half-cooked, and I worry about adding a new encoding into core that has not been properly worked through. If we later need to change something, we will have a hard time doing it once it is released. My suggestion is to either add it to the sandbox, where there are no back-compat requirements, or wait for another release so you have more time to work on it.

More focused feedback:

  • Early versions of XYShape were using encoded rectangles to perform bounding box queries. That proved to be wrong (LUCENE-8973: XYRectangle2D should work on float space, lucene-solr#865) because the encoding is not linear, so the spatial relationships between geometries are not preserved in the encoded space (see the sketch after this list). This PR seems to go back to that approach for XYShape doc values, which is incorrect.

  • Lucene does not support any geo faceting, so for Lucene users the real value of the doc values is the ability to perform IndexOrDocValues queries to speed up their query workloads. Still, the current PR seems to be focusing on supporting faceting, e.g., by adding centroid values. IMHO, if we want to add those facets we should implement geo faceting in Lucene alongside, or we are not providing value to Lucene users and testing will be very hard.

  • While the Codec API makes it easy to make changes across minors, we don't have the same flexibility for things that are done on top of the codec API like this binary encoding. Maybe we should take some more time to review the encoding instead of rushing it into 9.3. For instance, we used the index creation major version of the index to change the way facets store the taxonomy array, but this only allows changes on new major versions.
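
As an illustration of the non-linearity concern in the first bullet (assuming current XYEncodingUtils behavior), the midpoint of two encoded values does not correspond to the midpoint of the original floats, which is why arithmetic done directly on encoded rectangles can go wrong:

import org.apache.lucene.geo.XYEncodingUtils;

public class EncodedSpaceDemo {
  public static void main(String[] args) {
    // The float -> sortable-int encoding preserves ordering but not distances, so arithmetic
    // done in encoded space (midpoints, interpolated box corners) can disagree with the same
    // arithmetic done in float space.
    float a = 0.001f, b = 1000f;
    int ea = XYEncodingUtils.encode(a);
    int eb = XYEncodingUtils.encode(b);
    float midFloat = (a + b) / 2;                                            // ~500.0005
    float midEncoded = XYEncodingUtils.decode((int) (((long) ea + eb) / 2)); // ~1.0, wildly different
    System.out.println(midFloat + " vs " + midEncoded);
  }
}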

@nknize (Member, Author) commented Jul 18, 2022

Early versions of XYShape where using encoded rectangles to perform bounding box queries. That proved to be wrong

That has to do with the query, not the field. It does not affect the doc value format and thus shouldn't block the PR. Also, the doc value queries use the same relate logic as the index queries, so they are consistent.

Lucene does not support any geo faceting

Correct. Solr implements faceting, and since Solr is built on Lucene we need a doc value format to support Solr faceting over the LatLonShape format.

still the current PR seems to be focusing on supporting faceting,

The PR has nothing to do with faceting since Lucene doesn't implement faceting. The PR is 100% focused on the doc value format and BoundingBox relations only.

Maybe we should take some more time to review the encoding instead of rushing it into 9.3.... My suggestion is to either add it to the sandbox where there are no back-compact requirements

The encoding is 100% a new field and marked as experimental. It is in core because, since the sandbox module is now in an explicit .sandbox package, refactoring this to sandbox would require changing core class modifiers and methods from package private to public so that sandbox classes can extend them. This breaks the API.

The key takeaway here is that the coordinate encoding is the same as the index format and won't change unless we decide to change XYEncodingUtils or GeoEncodingUtils, at which time we will have to support BWC for much more than the doc value format. Furthermore, the field is marked as experimental (the proper mechanism to communicate that formats might change); there are currently far more critical classes marked as experimental (e.g., Lucene80DocValuesFormat) that make this less risky for 9.3.
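
For reference, a tiny illustration of the shared coordinate quantization (not code from this PR): the doc values reuse the same 32-bit encoding that the BKD index uses, via GeoEncodingUtils.

import org.apache.lucene.geo.GeoEncodingUtils;

public class SharedEncodingDemo {
  public static void main(String[] args) {
    // Encode and decode a coordinate; the round trip is quantized to roughly 1e-7 degrees.
    int encLat = GeoEncodingUtils.encodeLatitude(48.858093);
    int encLon = GeoEncodingUtils.encodeLongitude(2.294694);
    System.out.println(GeoEncodingUtils.decodeLatitude(encLat));   // ~48.858093
    System.out.println(GeoEncodingUtils.decodeLongitude(encLon));  // ~2.294694
  }
}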

@jpountz (Contributor) commented Jul 19, 2022

I'm not familiar enough with the geo logic to comment on it but I left some comments on the Field and Query integration.

I support Ignacio's suggestion of not trying to fold it into 9.3; it's a big (and exciting!) change. On the bw compat aspect, it would still be nice to retain some ability to make changes in the future, not only in GeoEncodingUtils but also to change the order of the data we include in the representation or add more information to it. Codecs have a natural way to make changes to serialization in a backward-compatible way, but not things that we encode on top of codecs like the encoding of binary doc values. Thinking out loud, should we prepend a version to the binary representation to have more freedom to make changes in the future?
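
A minimal sketch of the version-prefix idea, assuming a hypothetical layout (the class and constant names here are illustrative, not the PR's API): one leading byte lets future releases change the binary layout while readers keep accepting, or clearly rejecting, older data.

import java.io.IOException;
import org.apache.lucene.store.ByteArrayDataInput;
import org.apache.lucene.store.ByteBuffersDataOutput;
import org.apache.lucene.util.BytesRef;

final class VersionedShapeDocValue {
  static final byte VERSION_CURRENT = 0;

  // write the version byte first, then the serialized tree payload
  static BytesRef encode(ByteBuffersDataOutput payload) throws IOException {
    ByteBuffersDataOutput out = new ByteBuffersDataOutput();
    out.writeByte(VERSION_CURRENT);
    out.copyBytes(payload.toDataInput(), payload.size());
    return new BytesRef(out.toArrayCopy());
  }

  // readers inspect the header before decoding and can dispatch per version
  static void checkVersion(BytesRef dv) throws IOException {
    ByteArrayDataInput in = new ByteArrayDataInput(dv.bytes, dv.offset, dv.length);
    byte version = in.readByte();
    if (version > VERSION_CURRENT) {
      throw new IllegalStateException("unsupported shape doc value version: " + version);
    }
    // dispatch to the decoder matching `version` here
  }
}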

@@ -148,6 +234,18 @@ public static Query newBoxQuery(
return new LatLonShapeBoundingBoxQuery(field, queryRelation, rectangle);
}

/** create a docvalue query to find all geo shapes that intersect a defined bounding box * */
public static Query newDocValuesBoxQuery(

Contributor:

Call it newSlowBoxQuery for consistency with e.g. NumericDocValuesField#newSlowRangeQuery?

Contributor:

Separately, I wonder if we could make things easier on users by making this method expert, and changing newBoxQuery to create a query that rewrites to the index query, doc values query, or an IndexOrDocValuesQuery depending on FieldInfos? (In the spirit of LUCENE-10162.)

@nknize (Member, Author):

+1 for newSlowBoxQuery

+1 for the separate idea but I was opting to go that route in a follow up incremental PR after the field settles.
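
A hypothetical sketch of the combined approach discussed above; newSlowBoxQuery is the name proposed in this thread and the wiring is an assumption, not necessarily the PR's final API:

import org.apache.lucene.document.LatLonShape;
import org.apache.lucene.document.ShapeField.QueryRelation;
import org.apache.lucene.search.IndexOrDocValuesQuery;
import org.apache.lucene.search.Query;

final class ShapeBoxQueries {
  // Combine the BKD index query with the doc values query so the query planner can pick the
  // cheaper execution strategy per segment (lead with the index, verify or fall back to DV).
  static Query boxQuery(String field, double minLat, double maxLat, double minLon, double maxLon) {
    Query indexQuery = LatLonShape.newBoxQuery(field, QueryRelation.INTERSECTS, minLat, maxLat, minLon, maxLon);
    Query dvQuery = LatLonShape.newSlowBoxQuery(field, QueryRelation.INTERSECTS, minLat, maxLat, minLon, maxLon);
    return new IndexOrDocValuesQuery(indexQuery, dvQuery);
  }
}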

// write root
TreeNode root = dfsSerialized.remove(0);
// write bounding box; convert to variable long by translating
output.writeVLong((long) root.minX - Integer.MIN_VALUE);

@iverase (Contributor) commented Jul 19, 2022:

It would be nice if the bounding box carried enough information that it could wrap around the dateline. For example, if a polygon crosses the dateline, it would be nice to describe a bounding box that crosses the dateline too.

@nknize (Member, Author):

Shapes cannot cross the dateline. They have to be split into two separate geometries at index time.

@iverase (Contributor) commented Jul 19, 2022:

I will try to phrase it better: I mean the case where you have a multipolygon with two polygons that sit on either side of the dateline, as described in the guidance for polygons crossing the dateline.

@nknize (Member, Author):

For multi geometries (and geometry collections), the crossing-dateline check (xdl) is handled by the syntactic sugar minX > maxX to avoid superfluous encoding. I couldn't think of a case where explicitly encoding the xdl check would be needed, but I could've certainly missed one.

@iverase (Contributor) commented Jul 21, 2022:

Let's see if I can explain myself. Say we have a multipolygon where each of the polygons lies on one side of the dateline:

MULTIPOLYGON(((-170 -10, -175 -10, -175 10, -170 10, -170 -10)), ((175 -10, 170 -10, 170 10, 175 10, 175 -10)))

The current bounding box, which uses the absolute min/max, would look like:

[image: bounding box computed from absolute min/max, spanning nearly the whole longitude range]

What I am hoping is that we can also get the bounding box that wraps around the dateline. In order to do that we need more information here; in particular, we need the maximum negative value and the minimum positive value of x so the bounding box can look like:

[image: bounding box wrapping around the dateline]

(Note that the middle line is the dateline; it is a rendering artifact and should not be there.)
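
A hypothetical sketch of the extra bookkeeping being asked for here (field names are illustrative, loosely modeled on the linked Extent proposal, not this PR's encoding): track the largest negative and smallest positive encoded x in addition to the absolute min/max, so a dateline-wrapping box can be derived when it is tighter.

final class Extent {
  int minX = Integer.MAX_VALUE, maxX = Integer.MIN_VALUE; // absolute extent
  int negRight = Integer.MIN_VALUE;                       // largest negative x seen
  int posLeft = Integer.MAX_VALUE;                        // smallest positive x seen

  void addX(int x) {
    minX = Math.min(minX, x);
    maxX = Math.max(maxX, x);
    if (x < 0) {
      negRight = Math.max(negRight, x);
    } else {
      posLeft = Math.min(posLeft, x);
    }
  }

  // true when a box running eastward from posLeft, across the dateline, to negRight is
  // narrower than the plain [minX, maxX] box
  boolean wrappedBoxIsTighter() {
    if (negRight == Integer.MIN_VALUE || posLeft == Integer.MAX_VALUE) {
      return false; // all x values lie on one side of the dateline
    }
    long wrappedWidth = ((long) Integer.MAX_VALUE - posLeft) + ((long) negRight - Integer.MIN_VALUE);
    return wrappedWidth < (long) maxX - minX;
  }
}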

@nknize nknize force-pushed the spatial/shapeDocValues branch from 0df82a6 to e915e74 Compare July 20, 2022 18:40
@nknize (Member, Author) left a comment:

I refactored the PR to derive from SpatialQuery so the doc values query results are consistent with the BKD index queries. I also derived the doc value testing framework to use the same randomized testing and the same validators used by the BKD query testing to explicitly ensure the query results match.

Point, Line, and Polygon tests are all complete and passing (verified locally with 1000 beast iterations).
I also added a version header to the doc value as a mechanism to ensure bwc while giving the freedom to change the binary layout. @lucene.experimental tags are also used to communicate the format may change.

The only thing missing at this point are the Multi* geometry tests which I planned to add in a follow up PR anyway.

As first versions go, I think this is quite ready.

@nknize nknize force-pushed the spatial/shapeDocValues branch from e915e74 to ec0b884 Compare July 20, 2022 18:52
@nknize (Member, Author) left a comment:

Note: I haven't forgotten about the separation of the DocValues from the DocValuesField. Getting ready to push an update that teases out the doc value formatting logic into a separate package private ShapeDocValues class, so that ShapeDocValuesField only handles the index time logic and the queries instantiate ShapeDocValues directly.

// set the highest dimensional type
root.highestType = highestType;

// compute centroid values for the root node so the centroid is consistent

Contributor:

I wonder if this can ever work for multipolygons. Consider the following one:

MULTIPOLYGON(((-80 -10, -40 -10, -40 10, -80 10, -80 -10)),((10 -1, 11 -1, 11 1, 10 1, 10 -1)))

Which visually looks like:

[image: a large polygon in the western hemisphere and a much smaller one near the prime meridian]

I would expect the centroid of this shape to be very close to the centroid of the big polygon, as it should outweigh the small one.

In this algorithm we don't have information about the original shape, so we give the same weight to all points, and the centroid probably lies somewhere between the two shapes, which is incorrect.

Could we add a test to check the behaviour?

@nknize (Member, Author) commented Jul 21, 2022:

The centroid calculation here uses geometric decomposition, which computes a weighted centroid of the polygon geometry(ies) by decomposing them into smaller individual polygons (the industry standard is to decompose complex geometries and collections into smaller triangles). It works for singular or multiple geometries, including geometry collections, by incrementally weighting each polygon by its individual signed area and dividing by the total sum area. The weights are applied individually in the TreeNode and globally computed here for the entire collection (where the highest dimension is used to omit lower dimension centroids).

FWIW, this is the same way qgis computes the centroid and I reference their foundation reference document in the source here. I can add an explicit validation test against the QGIS implementation for completeness but I'll need to port that calculation to the Test class.
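
For readers unfamiliar with the technique, a minimal, self-contained sketch of centroid-by-decomposition (operating on plain double coordinates; the triangle layout is purely illustrative, not this PR's encoded representation):

// Each triangle from the tessellation contributes its centroid weighted by its signed area;
// the shape centroid is the weighted sum divided by the total area.
static double[] centroid(double[][][] triangles) { // each triangle: 3 vertices of [x, y]
  double sumX = 0, sumY = 0, totalArea = 0;
  for (double[][] t : triangles) {
    double area = 0.5 * ((t[1][0] - t[0][0]) * (t[2][1] - t[0][1])
                       - (t[2][0] - t[0][0]) * (t[1][1] - t[0][1])); // signed triangle area
    sumX += (t[0][0] + t[1][0] + t[2][0]) / 3 * area;                // area-weighted triangle centroid
    sumY += (t[0][1] + t[1][1] + t[2][1]) / 3 * area;
    totalArea += area;
  }
  return new double[] {sumX / totalArea, sumY / totalArea};
}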

@iverase (Contributor) commented Jul 22, 2022:

Thanks @nknize for the references. I have no time to look into the algorithms in detail, so to keep myself sane I just computed the centroid using a ubiquitous library like JTS (which also refers to the documentation you pointed to above) for the polygon I gave as an example:

    // JTS classes used: org.locationtech.jts.io.WKTReader, org.locationtech.jts.algorithm.Centroid, org.locationtech.jts.geom.Coordinate
    String mp = "MULTIPOLYGON(((-80 -10, -40 -10, -40 10, -80 10, -80 -10)),((10 -1, 11 -1, 11 1, 10 1, 10 -1)))";
    WKTReader wktReader = new WKTReader();
    Coordinate c = Centroid.getCentroid(wktReader.read(mp));
    String wkt = "POINT(" + c.getOrdinate(0) + " " + c.getOrdinate(1) + ")";

And the result is:

POINT(-59.82418952618454 -0.0)

Which is where I was expecting the centroid to be located:

[image: the expected centroid plotted inside the larger western polygon]

Then I wrote a test using your implementation:

  public void testLatLonMultiPolygonCentroid() throws Exception {
    String mp = "MULTIPOLYGON(((-80 -10, -40 -10, -40 10, -80 10, -80 -10)),((10 -1, 11 -1, 11 1, 10 1, 10 -1)))";
    Polygon[] p = (Polygon[]) SimpleWKTShapeParser.parse(mp);
    List<ShapeField.DecodedTriangle> tess = getTessellation(p[0]);
    tess.addAll(getTessellation(p[1]));
    ShapeDocValuesField dvField = LatLonShape.createDocValueField(FIELD_NAME, tess);
    assertEquals(0d, GeoEncodingUtils.decodeLatitude(dvField.getCentroidY()), 1E-8);
    // POINT(-59.82418952618454 0.0)
    assertEquals(-59.82418952618454, GeoEncodingUtils.decodeLongitude(dvField.getCentroidX()), 1E-8);
    assertEquals(ShapeField.DecodedTriangle.TYPE.TRIANGLE, dvField.getHighestDimensionType());
  }

And the test fails:

Expected :-59.82418952618454
Actual   :4.51269194483757

So the centroid is way off the expected position, could you check if I am doing something wrong?

[image: the computed centroid plotted between the two polygons, far from the expected position]

@nknize (Member, Author):

No worries. I'll add in the explicit check to verify.

@nknize (Member, Author):

Finally had a moment to come back to this, and the test doesn't appear genuine. It's misusing the LatLonShape.createDocValueField API in a way that tells me I need to strengthen the javadocs in this iteration. As you can see, the Polygon method is public static ShapeDocValuesField createDocValueField(String fieldName, Polygon polygon), which does not take a multi polygon. This is because ShapeDocValues computes the centroid on a per polygon basis and does not yet support multi geometries (this is why the Multi tests are not yet wired up; that's a todo for a follow on PR).

The test you're proposing "solves" the multi doc value problem by munging the tessellation of the two geometries into a single doc value field that was designed for singular geometries. I get why you're doing this: Lucene does not support multi binary doc values. But honestly, I disagree with munging the tessellations as a hack around this Lucene limitation. Instead, I think we should improve Lucene by adding multi binary doc value support like we have for Numeric and SortedNumeric; it would be a nice symmetry. This may take some time, which is why we should strengthen the docs here and not hold up the PR for "perfection".

@nknize (Member, Author) left a comment:

Just pushed a commit to refactor the docvalue format and comparison logic from the field implementation to separate the Index vs Query time logic. Next up I'll have a look at the centroid and bounding box. I think we could add the visitor pattern in a follow up PR.

@iverase (Contributor) commented Jul 25, 2022

Hey Nick,

Why so much hurry with this change? I appreciate everything you are doing to signal the feature as experimental and to add the version to the doc value, but it would still be nice to have something production ready from the beginning, and this way of developing such an exciting and complex feature is making things harder.

I am very much interested in this change because Elasticsearch has its own doc value implementation. I would like to eventually move to Lucene doc values, so I want to make sure that the functionality the Elasticsearch implementation currently has can be preserved when/if it makes sense. In order to achieve that, we would like to donate our code or help with our experience in this feature, as we have been running it in production for 2+ years with success.

The Elasticsearch implementation is pretty similar to the one you are proposing; it has three parts: first we add all the information regarding the extent of the geometry, then we add centroid information, and finally we have what I call the "triangle tree", which is exactly what you described in this issue. Here are the differences and how we would like to help in adding them:

  • Our geometry extent contains more information than just the plain min/max values of the coordinates that you are proposing. In particular, we capture the minimum positive value and the maximum negative value for the x coordinate so we can wrap those bounding boxes around the dateline in the geo case. This would be my proposal for an Extent object that captures all that information: https://github.com/iverase/lucene/blob/TriangleTree/lucene/core/src/java/org/apache/lucene/geo/Extent.java

  • In the case of the centroid, the biggest difference is that we compute the centroid using the original geometry. I really like the algorithm you are proposing, but you are working in the encoded space, and I am wondering whether that would work for cartesian; remember that the encoding is not linear, so in that case the centroids might be incorrect. I would like to discuss the benefits of the way you are encoding those values with a bit more care too.

  • The triangle tree is the same idea; it is just the serialisation of an interval tree composed of tessellation elements. One thing I realised while developing this feature is that it is necessary to be able to visit the tree in different ways, and therefore adding a visitor pattern for the tree is a big win (a rough interface sketch follows this comment). Here is our current implementation, which can be used as a good starting point: https://github.com/iverase/lucene/blob/TriangleTree/lucene/core/src/java/org/apache/lucene/geo/TriangleTreeReader.java

Finally, I would like to propose that we initially focus on the data structure and, once we are happy with it, start integrating the functionality, e.g. support for queries and so on. I hope a more structured approach will make sure we get the right structure.

What do you think?
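
To illustrate the visitor idea from the third bullet above, here is a rough, hypothetical interface; the names are illustrative, not the Elasticsearch or Lucene API.

// Decouple traversal of the serialized triangle tree from the operation performed at each node,
// so bounding box relations, CONTAINS, centroid/extent extraction, etc. can reuse one traversal.
interface TriangleTreeVisitor {
  /** return true to descend into a sub-tree whose bounding box is [minX, maxX] x [minY, maxY] */
  boolean pushBox(int minX, int maxX, int minY, int maxY);

  /** visit a leaf component: a point, line segment, or triangle encoded as up to three vertices */
  void visitComponent(int[] xs, int[] ys, byte type);

  /** return false to stop the traversal early, e.g. once an INTERSECTS answer is decided */
  boolean keepGoing();
}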

@nknize (Member, Author) commented Jul 25, 2022

Thanks @iverase.

Why so much hurry with this change? ...it will be nice to have something production ready.

Just a few points here.

  1. I don't believe the proposed PR is a change. It's a new field that hasn't existed before, despite Elasticsearch carrying its own proprietary implementation for two years and only now proposing it as the preferred approach. I prefer a fresh, Lucene-focused implementation with input from the Elasticsearch perspective (e.g., geometry centroid and bounding box were added to help Elasticsearch geo_centroid and geo_boundingbox aggregations but are not really needed for the docvalue format).
  2. This PR is a case of progress over perfection. I do feel it's important to do the best we can on the first iteration, but there will always be needed improvements. We have mechanisms in place to release experimental features for the purpose of receiving feedback from production, and the PR is using those mechanisms as intended. There is no need to iterate toward what we think is "production ready" before releasing.
  3. Akin to 2, I'd prefer to avoid waiting another two-plus months for the next minor release to get this in the wild and start obtaining that feedback and iterating. Those iterations will improve with more feedback. If we do feel this is cutting it close for 9.3, then I'd prefer merging this PR to main and 9.4 and iterating on the code for 9.4.
  4. I am happy to hear Elasticsearch wants to contribute their implementation, but I think it's better to start with this foundation and iterate. We can merge any desirable properties from that Elasticsearch field in follow ups. I think that will strengthen the field as a whole and agree it's a great direction, but I do not believe it is a requirement before merging the current, functioning PR. Again, progress over perfection.

this way of developing this exciting and complex feature is making things harder.

Harder for who?

I would like to propose to initially focus on the data structure and once we are happy, we can start integrating the functionality, e.g support for queries and so on.

Query support is already in this PR, so I'm not sure what you mean by "start integrating the functionality". Regarding the desire to add a visitor access pattern: that's a nice-to-have but not a requirement for this PR. I wrote an initial rough implementation (because I also thought it would be nice to mimic the query visitor pattern), but it's an improvement that is easier to add in a follow-up since (as this PR shows) it's not a requirement for supporting the queries in the first iteration. I agree with the bounding box improvements (which, again, could come later) and will add the centroid fix, but that is unrelated to the relation and query logic in this PR, which already has parity with the BKD index queries and uses the same test scaffolding.

Here is our current implementation which can be used as a good starting point.

The current PR already has a starting point, so I'm not sure why the proposal is to scrap it and start from the Elasticsearch proprietary implementation (which could've been proposed two years ago if parity is the concern). I took a quick look at the proposed code and have some differences of opinion on the implementation:

  • I didn't see any explicit relation visitors; so it doesn't look like the prototype code includes bounding box relation logic or tests against any BKD index queries to ensure functional parity.
  • I didn't look at the details of the serialized format. That shouldn't matter so long as the API is the same and results are correct. This PR now includes a VERSION byte to provide a mechanism to change the format and ensure backwards compatibility. This reduces Elasticsearch risk.
  • The PR results here are matching the BKD index queries so I'm confident the PR query results are correct. I'll add (or someone can add) the visitor access pattern in a follow up. Again, it's not a requirement for Lucene (even if it is one for Elasticsearch).
  • This PR isolates all ShapeDocValues logic (e.g., readers and writers) as private logic in the single pkg-private abstract ShapeDocValues class and only exposes field and query instantiation through the public LatLonShape and XYShape access classes. I prefer this approach (which is consistent w/ the BKD field API) over the prototype, which conversely splits that into separate Reader/Writer classes with public abstractions. I think we should keep the abstraction and internal API surface area tidy and be thoughtful about what's exposed (e.g., only the pkg-private ShapeDocValues and ShapeDocValuesField, the Visitor member class and BaseQuery foundation classes for internal extensions, and the LatLonShape and XYShape static factories for public access).

I'll add the centroid fixes and bounding box additions to this PR shortly.

@nknize (Member, Author) left a comment:

I updated the centroid calculation to ensure it works in coordinate space instead of encoded space. I also updated the docs to be explicit about not supporting multi geometries. As I mentioned above, I think we should add multi value support to binary doc values and handle multi geometry support using that mechanism instead of hacking around the limitation w/ a random bag of tessellated triangles. I imagine that might be controversial, but I think it's the more correct way to go as it more accurately represents a multi collection of geometries, much like LatLonPoint doc values does.

I agree it's a little too late for 9.3, so I'd like to focus on getting this merged to main and backported to 9.x, and iterating w/ the visitor access pattern as a follow up while working on multi binary doc values support in parallel. I will separately open an issue for adding multi value binary doc values support.

@nknize nknize force-pushed the spatial/shapeDocValues branch 2 times, most recently from 674b230 to ad176b9 Compare July 27, 2022 00:30
@jpountz (Contributor) commented Jul 27, 2022

I chatted a bit with @iverase (we work together at Elastic) about this change. Like Ignacio wrote, we are interested in it since it maps to similar functionality that we have in Elasticsearch, and if Lucene gets the same functionality it would make more sense for Elasticsearch to move to it rather than keep a duplicate implementation. We suggested reusing the Elasticsearch implementation because it's been through many rounds of testing and performance benchmarks already, which we thought would help avoid duplicating work, and we'd be happy to make the small changes that would make it fit Lucene more naturally, such as changing the encoding from BE to LE. But we don't want to force it either, and if this suggestion is not getting traction, then let's go iteratively as you suggest.

@mikemccand (Member) commented:

I cannot really understand the exciting geo-speak about centroids and bounding boxes :) It all sounds really cool though.

But +1 to take a progress-not-perfection (PNP) approach here. This is exactly why we have the @lucene.experimental tag, to enable iterative innovation for fun new features like this.

Maybe we could implement both Elastic's battle-tested (2 years in production) approach, and this new approach, and then over time cross-fertilize between the two? We have multiple highlighters, query-parsers, suggesters, etc. -- it's healthy. Perhaps over time we boil it down to a single "best of both" solution, or perhaps not.

Basically we don't have to settle all of these differences right now at this PR. This PR is already adding an awesome feature to Lucene. We can commit this (once important feedback is addressed) and then PNP iterate.

@nknize (Member, Author) commented Jul 28, 2022

We can commit this (once important feedback is addressed) and then PNP iterate.

+1

I think this PR is ready and the remaining two items (multi geometry, and visitor pattern) can be added in follow on PRs.
For multi geometry I opened LUCENE-10666 to discuss adding multi binary docvalues. I think this is the "logical" way to go for multi geometry docvalue support but can also see a technical argument for munging multi geometries into a single binary similar to BinaryRangeDocValues. I think we should discuss this and add multi support in a follow on improvement PR.

Maybe we could implement both Elastic's battle-tested (2 years in production) approach, and this new approach, and then over time cross-fertilize between the two?

Unless iterating on this field is simply not preferred, I'm not seeing the need for two fields, since closing any gaps in follow on PRs (beyond what's already addressed above) would achieve the same result and avoid API confusion.

@nknize (Member, Author) commented Aug 8, 2022

This PR is fairly large and all issues have been addressed with passing tests, so I plan to merge it tomorrow and add the visitor pattern in a follow up PR. Multi geometries will be explored in a follow on depending on the progress of LUCENE-10666.

nknize added 5 commits August 9, 2022 10:20
Adds new doc value field to support LatLonShape and XYShape doc values. The
implementation is inspired by ComponentTree. A binary tree of tessellated
components (point, line, or triangle) is created. This tree is then DFS
serialized to a variable compressed DataOutput buffer to keep the doc value
format as compact as possible.

DocValue queries are performed on the serialized tree using a similar component
relation logic as found in SpatialQuery for BKD indexed shapes. To make this
possible some of the relation logic is refactored to make it accessible to the
doc value query counterpart.

Signed-off-by: Nicholas Walter Knize <nknize@apache.org>
…ng doc values

Signed-off-by: Nicholas Walter Knize <nknize@apache.org>
nknize added 11 commits August 9, 2022 10:20
Refactored docvalues queries to inherit from SpatialQuery to leverage
SpatialVisitor for ensuring docvalue queries are consistent with index queries.
Added randomized testing for XYPolygon, LatLonLine, XYLine, LatLonPoint, and
XYPoint docvalue queries.

Signed-off-by: Nicholas Walter Knize <nknize@apache.org>
@nknize nknize force-pushed the spatial/shapeDocValues branch from 45b9da3 to 1be8e3d Compare August 9, 2022 15:22
Signed-off-by: Nicholas Walter Knize <nknize@apache.org>
@nknize nknize merged commit d7fd48c into apache:main Aug 9, 2022
nknize added a commit to nknize/lucene that referenced this pull request Aug 9, 2022
…apache#1017)

Adds new doc value field to support LatLonShape and XYShape doc values. The
implementation is inspired by ComponentTree. A binary tree of tessellated
components (point, line, or triangle) is created. This tree is then DFS
serialized to a variable compressed DataOutput buffer to keep the doc value
format as compact as possible.

DocValue queries are performed on the serialized tree using a similar component
relation logic as found in SpatialQuery for BKD indexed shapes. To make this
possible some of the relation logic is refactored to make it accessible to the
doc value query counterpart.

Note this does not support the following:

* Multi Geometries or Collections - This will be investigated by exploring
  the addition of multi binary doc values.
* General Geometry Queries - This will be added in a follow on improvement.

Signed-off-by: Nicholas Walter Knize <nknize@apache.org>
nknize added a commit that referenced this pull request Aug 9, 2022
…#1017) (#1064)

@msokolov msokolov added this to the 9.4.0 milestone Sep 1, 2022