Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Store envelopes rather than S2 cells. #518

Merged
merged 5 commits into from
Dec 15, 2021
Merged

Store envelopes rather than S2 cells. #518

merged 5 commits into from
Dec 15, 2021

Conversation

olsen232
Copy link
Collaborator

@olsen232 olsen232 commented Dec 7, 2021

... when writing the index used for spatially-filtered cloning.
Also encodes envelopes using 20 bits per value.
Together, these changes make the index 25x smaller than it currently
is (or 3x smaller than the best tuned S2 index that I have seen).

Doesn't update the git filter extension, so cloning won't work yet.
This is therefore part one of a two or three part change.

Also doesn't improve reprojections - there are still theoretical (and probably one day practical issues) with precision loss due to reprojecting envelopes, rather than geometries. This will also be addressed in a future commit.

Related links:

#456

Checklist:

  • Have you reviewed your own change?
  • Have you included test(s)?
  • Have you updated the changelog?

@olsen232 olsen232 requested review from craigds and rcoup December 7, 2021 03:02
Copy link
Member

@rcoup rcoup left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should add some CLI helpers for encoding/decoding envelopes for debugging/diagnostics:

# encoded index value to envelope
$ kart spatial-tree index --decode-envelope=<envelope-hex>
174.0,-36.9,174.1,-36.8

# envelope to encoded index value
$ kart spatial-tree index --encode-envelope=174.0,-36.9,174.1,-36.8
c0ffee5a4a0ff2348a17e4c1b5febe8035c0ffee

# blob oid to envelope
$ kart spatial-tree index --feature-envelope=<blob-oid>
Native envelope: 1000,100,2000,200
Dataset CRS:

EPSG:2193 transform: unknown id, Inverse of New Zealand Transverse Mercator 2000 + NZGD2000 to WGS 84 (1), 1 m
→ segmentized at 1%=100m, buffered 0.5%=50m: 950,50,2050,250
→ 174.0,-36.9,174.1,-36.8

EPSG:27200 transform: unknown id, Inverse of New Zealand Map Grid + NZGD49 to WGS 84 (3)
→ segmentized at 1%=100m, buffered 0.5%=50m: 950,50,2050,250
→ 178.0,-36,178.1,-35.9

Final envelope: 174.0,-36.9,178.1,-35.9
Encoded: c0ffee5a4a0ff2348a17e4c1b5febe8035c0ffee

kart/spatial_tree.py Outdated Show resolved Hide resolved
kart/spatial_tree.py Outdated Show resolved Hide resolved
kart/spatial_tree.py Outdated Show resolved Hide resolved
kart/spatial_tree.py Outdated Show resolved Hide resolved
kart/spatial_tree.py Outdated Show resolved Hide resolved
kart/spatial_tree.py Outdated Show resolved Hide resolved
kart/spatial_tree.py Outdated Show resolved Hide resolved
kart/spatial_tree.py Outdated Show resolved Hide resolved
kart/spatial_tree.py Outdated Show resolved Hide resolved
kart/spatial_tree.py Outdated Show resolved Hide resolved
@olsen232 olsen232 requested a review from rcoup December 8, 2021 04:09
@olsen232
Copy link
Collaborator Author

olsen232 commented Dec 8, 2021

I mostly addressed the comments but what I haven't yet done, and will do tomorrow, is check what the input envelopes we are getting from GPKG / OGR are, how they handle meridian issues, and convert them appropriately into the W,S,E,N envelopes that I'm using now. Right now it just assumes W = min-x and E = max-x, which seems to work as long as there are no meridian-crossing geometries.

@rcoup
Copy link
Member

rcoup commented Dec 8, 2021

We should add a couple of antimeridian datasets to our unit test data. @craigds got anything small+handy?

@rcoup
Copy link
Member

rcoup commented Dec 8, 2021

semi-OT: OGR has an OGRGeometry.Segmentize() method

@olsen232
Copy link
Collaborator Author

olsen232 commented Dec 9, 2021

Added some code to handle anti-meridian issues. Still haven't handled long lines that need segmenting

Copy link
Member

@rcoup rcoup left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Getting there 😃

kart/spatial_tree.py Outdated Show resolved Hide resolved

def get_envelope_for_indexing(geom, transforms):

def anticlockwise_ring_from_minmax_envelope(minmax_envelope):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will produce an invalid geometry for a single point, since LinearRings can't have duplicate points except for the first/last.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oddly it seems to work anyway, but it might be a bit brittle - for instance the shoelace calculation will return 0 and I'm just lucky that my code is apparently okay with that - so I've now special cased points.

tests/test_spatial_tree.py Outdated Show resolved Hide resolved
return normalised * (max_value - min_value) + min_value


def get_envelope_for_indexing(geom, transforms, feature_oid):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one needs tests... and I think you'll find we don't always get useful envelopes from GPKG geometries because the spec defines them as minx,maxx... these two would have the same envelope:

POLYGON((170 10, -170 10, -170 11, 170 11, 170 10)) → (-170, 10, 170, 11)
POLYGON((-170 10, 0 10, 170 10, 170 11, 0 11, -170 11, -170 10)) → (-170, 10, 170, 11)

So if it's defined in a geographic CRS and it's not a point we'll need to access the underlying geometry object:


More complications: if a ring is wound a different direction... we can't obviously differentiate between

  • POLYGON((170 10, -170 10, -170 11, 170 11, 170 10)) 20° wide box spanning the antimeridian (counterclockwise)
  • POLYGON((170 10, -170 10, -170 11, 170 11, 170 10)) 160° wide box spanning the prime meridian (clockwise)

And lines & multi-geometries don't have a winding direction at all...

  • LINESTRING(170 10, -170 10) — 20° "long" line or 160°?
  • MULTIPOINT((170 10, -170 10)) — 20° spacing or 160°?
  • MULTILINESTRING((170 10, 171 10), (-170 10, -171 10)) — 18° spacing or 160°?
  • MULTIPOLYGON(((170 10, 171 10, 171 11, 170 11, 170 10)), (-170 10, -170 11, -171 11, -171 10, -170 10))) — 18° spacing or 160°?

(I presume by now you have a piece of paper in front of you with lots of boxes and lines and arrows drawn on it? 👍 Or a globe)

Unfortunately ESRI & OGC differ on the winding direction of exterior rings, so both are common.

We can make useful assumptions though that generally hold up, like points are likely to be joined by the shortest distance (ie. +170°→-170° is more likely to go East than West). And obviously if points go <-180° or >+180° that helps too.

Worst case we have to define it as -180°→180° (or null) so the feature always gets included.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went looking for actual GPKGs that cross the antimeridian and I found that they had values >180. I assumed that this was the proper way of encoding geometries in GPKG, since it is also the one the only one that makes sense with min-x and max-x envelopes that GPKG requires:

  • if you have a polygon that has 4 different x values: [170, 175, 185, 190] then the min and max x values are 170 and 190, which is useful
  • for the same geometry, if you store those x values as [170, 175, -175, -170] then the min and max x values are -170 and 170, which are useless - those values are not at the outside of the shape, and they don't tell us if the shape crosses the antimeridian or not, so there's nothing at all we can use them for.

However, you are right - just because it wouldn't make sense to encode a single geometry in GPKG which is split into two halves, one near 180 and the other near -180, because that breaks the min and max envelopes, doesn't mean that nobody is doing this.

I propose the following:

  1. I construct an anticlockwise envelope around the min and max values of the geometry in its native CRS.
    (This can mean I sometimes get a useless envelope for geometries that are split by the antimeridian - eg spanning from -179 to 179 - and sometimes I will get a useful envelope eg 179 to 181, depending on how the geometry is encoded).
  2. I transform the envelope into EPSG:4326
  3. I shift parts of the envelope by 360 degrees until the envelope has anticlockwise winding order.
    (So far the algorithm is unchanged)
  4. I test if the resulting envelope spans more than 180 degrees of longitude. If it does, I abandon the attempt to find a meaningful envelope and return None, so this geometry will match all spatial filters.

Why this should work:
The code I have is working for input geometries that don't cross the antimeridian, and that are contiguous across the antimeridian (eg spanning 179 to 181) - and the fact that these geometries are split in two halves post-transform I can handle fine - but my code won't work for input geometries that are split into two halves, since then (min-x, max-x) will be something like (-179, 179), which looks the same to me as a geometry that covers the whole globe except for a tiny gap near the antimeridian. However, since its width is 358 degrees, I just give up, so the end result will be fine.
The only case where this wouldn't work - where I won't detect that the shape has been split and simply give up - is if there is a single geometry that crosses the antimeridian with a single line segment which is > 180 degrees of longitude wide, but that doesn't seem like a real world geometry.

Why I want to do it this way:
This way keeps the efficiency of the current method - the rare case which is too hard to index using the current method, we just don't index. Inspecting underlying geometries to try and recalculate their true envelopes when they have been split across the antimeridian could require a) a deeper knowledge of the various CRSs than I currently have, and a lot more code for handling odd geometries and b) algorithms that are O(n^2) where n = num_points, running on the individual geometries with all of their complexity.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking a little bit more about this: the one case I said I can't handle correctly above is actually not a real-case since it looks identical from our point of view to a smaller geometry with shorter line segments that doesn't cross the antimeridian. The only way we could distinguish between the two would be to check the winding order of the raw geometry, but since we don't know what the winding order is supposed to be, we just have to assume the smaller geometry with smaller line segments is the right interpretation. So, I've convinced myself that this approach is adequate...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We talked about this the other day but I've only just read it written down. What you've written is a good approach I think. WRT supporting large sparse-vertexed boxes that cover more than a hemisphere: as you've mentioned, there's no way to handle those without knowing the intended winding order in advance. Luckily, no one ever does that. But if they do, tough :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assumed that this was the proper way of encoding geometries in GPKG, since it is also the one the only one that makes sense with min-x and max-x envelopes that GPKG requires:

It's kind of up to the input geometry whether it goes past >180° or <-180° longitude... but geometries that use "normalised" coordinates certainly aren't wrong/invalid. There's no requirement in GPKG/anywhere else for extending beyond those limits. Since geometries are projected vertex-by-vertex, a projected CRS geometry transformed into 4326 will always have vertexes in the range (-180° <= x <= 180°).

Copy link
Member

@rcoup rcoup Dec 13, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems good to me.

The only case where this wouldn't work - where I won't detect that the shape has been split and simply give up - is if there is a single geometry that crosses the antimeridian with a single line segment which is > 180 degrees of longitude wide, but that doesn't seem like a real world geometry.

This is basically what MS SQL Server does, fwiw, and rejects any geometries with vertexes more than a hemisphere apart — your points need to be <=180° apart. I think what you have is a good approach.

Lots of tests to make sure that the behaviour works as documented :-)

kart/spatial_tree.py Outdated Show resolved Hide resolved
kart/spatial_tree.py Outdated Show resolved Hide resolved
kart/spatial_tree.py Outdated Show resolved Hide resolved
kart/spatial_tree.py Show resolved Hide resolved
@craigds
Copy link
Member

craigds commented Dec 12, 2021

should add a couple of antimeridian datasets to our unit test data.

lots of stuff from LINZ Hydro; e.g. [a small crop of] Depth contour polyline (Hydro, 1:1.5mil and smaller) ; which stretches from roughly +145°E to +215°E

@olsen232 olsen232 requested a review from rcoup December 13, 2021 20:46
@rcoup
Copy link
Member

rcoup commented Dec 13, 2021

lots of stuff from LINZ Hydro; which stretches from roughly +145°E to +215°E

We can get this in a couple of different CRS? eg: 3994 & 3832 are projected CRS that span the antimeridian

CHANGELOG.md Outdated Show resolved Hide resolved
Also encodes envelopes using 20 bits per value.
Together, these changes make the index 25x smaller than it currently
is (or 3x smaller than the best tuned S2 index that I have seen).

Doesn't update the git filter extension, so cloning won't work yet.
This is therefore part 1 of a 2 or 3 part change.
Still doesn't handle possible long-line issues - assumes straight lines
are still straight lines even when reprojected, so assumes the corners
of bounding boxes are sufficient for reprojecting a bounding box.
Removes libs2 and s2_py from the vendor build.
Also updates spatial-filter test data to have an envelope index instead of S2 index.
@olsen232 olsen232 merged commit 2f9b4a3 into master Dec 15, 2021
@olsen232 olsen232 deleted the write-envelopes branch December 15, 2021 02:20
@rcoup
Copy link
Member

rcoup commented Dec 16, 2021

@olsen232 did we end up with some proper anti-meridian data here? eg: the Hydro datasets in EPSG:3994.

AFAICS the only test-data changes were swapping s2 for envelopes in the existing dataset?

@olsen232
Copy link
Collaborator Author

Yeah I kinda pushed forward with this and started doing some indexing... I've gone back and added tests using the dataset Craig linked too now, using two different CRS: 3994 and 3832

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants