
Adds a new command spatial-tree index #475

Merged: olsen232 merged 2 commits into master from spatial-tree, Sep 15, 2021
Conversation

olsen232 (Collaborator):

Command only works if pywraps2 is manually installed into the kart venv.
Although tests exist, they are currently skipped for this reason.
(Avoiding modifying the kart build while it is red - hopefully fixed sometime soon).

When the command is run, it creates a spatial index using S2 cells, but this isn't yet used for anything.
Unlike the proof-of-concept, it can be used to index a set of commits and their ancestors, minus another set of commits and their ancestors (i.e. the ones that we already indexed last time).

Changelog is not yet updated - will wait until the command works and does something useful.

#456

Command only works if pywraps2 is manually installed into the kart venv.
Although tests exist, they are currently skipped for this reason.
(Not trying to modify the kart build while it is red)
@olsen232 olsen232 requested review from craigds and rcoup September 14, 2021 03:03
craigds (Member) left a comment:

Looking good!


s2_coverer = s2.S2RegionCoverer()
s2_coverer.set_max_cells(S2_MAX_CELLS_INDEX)
s2_coverer.set_max_level(S2_MAX_LEVEL)
Member:

this corresponds to 47K-100K ㎡, which is reasonably high precision on a geographic scale, but means the index will be not-very-useful for very small scale data.

do we need to set a max level at all?

olsen232 (Collaborator, author):

Not sure - @rcoup tuned this number, he can explain why it's this. I imagine there's a tradeoff but I don't know what the tradeoff is.

Member:

The precision should eventually

  1. Be user modifiable/set
  2. Be more precise for “smaller” datasets automatically on a sliding scale

So long as these are possible later

Member:

  1. the client-side repo→working-copy precision is accurate
  2. we don't need high precision for server-side filtering, we need high speed and minimal storage space. Including a pile of bonus features won't make much meaningful difference to clone-time, but going overboard on precision has a sizeable impact on storage costs and potentially slows down every fetch.
  3. if your data is small it'll be much much faster to do a full clone than going anywhere near a filter. Full clones take advantage of bitmaps, existing packfiles, etc which provide a big speed boost.

We can mess with the settings later once it's in use.
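For intuition on the numbers in this thread: the S2 sphere has 6 top-level faces and each cell splits into 4 children per level, so average cell area per level can be estimated from first principles. This is a back-of-envelope sketch, not from the PR and not using pywraps2; the Earth-surface constant is approximate:

```python
# Back-of-envelope only: average S2 cell area at a given level, assuming
# 6 top-level faces, each cell quadrupling in count per level.
EARTH_SURFACE_M2 = 510.1e12  # ~510.1 million km^2, approximate

def avg_s2_cell_area_m2(level):
    return EARTH_SURFACE_M2 / (6 * 4 ** level)

for level in (10, 15, 20):
    print(level, round(avg_s2_cell_area_m2(level)))
```

If S2_MAX_LEVEL were 15 (an assumption; the PR doesn't show its value), this gives an average of roughly 79,000 m², consistent with the 47K-100K ㎡ figure quoted above. Real cells vary in size around the average.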

repo.path,
"rev-list",
"--objects",
"--filter=object:type=blob",
Member:

this was added in git 2.32.0, so we need to ensure we're always getting a git at least that new. Looks like we're not:

GIT_VERSION = 2.29.2

olsen232 (Collaborator, author):

Good point - will add it to the tracking bug, but won't fix build issues for now.
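A version guard like the one implied here is easy to sketch; `parse_git_version` is a hypothetical helper, not kart code:

```python
import re

MIN_GIT = (2, 32, 0)  # `--filter=object:type=blob` needs git >= 2.32.0

def parse_git_version(version_output):
    """Pull (major, minor, patch) out of `git version 2.29.2`-style output."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", version_output)
    return tuple(int(part) for part in match.groups())

# Tuple comparison gives the right ordering for dotted versions.
ok = parse_git_version("git version 2.29.2") >= MIN_GIT
```

In a real build this would run `git version` and fail fast with a clear message when the bundled git is too old.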

Comment on lines +136 to +142
if src_crs.IsGeographic():
transform = None
desc = f"IDENTITY({src_crs.GetAuthorityCode(None)})"
else:
target_crs = src_crs.CloneGeogCS()
transform = osr.CoordinateTransformation(src_crs, target_crs)
desc = f"{src_crs.GetAuthorityCode(None)} -> {target_crs.GetAuthorityCode(None)}"
Member:

this is incorrect I think - we don't want to project to the nearest geographic CRS, we want to project everything to a single CRS. Otherwise S2 cells won't be comparable between different source CRSes.

Different geographic CRSes have different meridians, axis order, possibly units (e.g. radians) etc. We should pick one and always transform to that. Normal choice being EPSG:4326

olsen232 (Collaborator, author):

Hmm I think you are right. However again I must defer to @rcoup who wrote the prototype - I imagine he had his reasons for doing it this way. If not, I'll change it.

The "Geocentric vs Geodetic coordinates" section in https://s2geometry.io/about/overview might be relevant, but I'm not even sure about that.

rcoup (Member):

Logic was avoiding needless & expensive datum transforms. But Craig's right, axis-order/units/meridians can be in different places so that's not a great plan.

WGS84 seems sensible to me.

olsen232 (Collaborator, author):

Okay. Will switch this to convert to WGS 84. Is there a way we can tell it not to bother doing the expensive and needless thing?

Member:

OSRSpatialReference.IsSame(other)?

"""
Indexes all features added by the supplied commits and all of their ancestors.
To stop recursing at a particular ancestor or ancestors (eg to stop at a commit
that has already been indexed) prefix that commit with a caret: ^COMMIT.
Member:

so if I want to run this in a post-receive hook, and someone has pushed ref myref, changing it from abc123 to def456, I would run this like

kart spatial-tree index def456 ^abc123

That's not super intuitive to me (and I don't think ^{commit} is an established convention?)

Would it make more sense to take a range argument?

kart spatial-tree index abc123..def456

olsen232 (Collaborator, author) commented on Sep 15, 2021:

So actually these arguments are just forwarded verbatim to git rev-list. That means:

  • ^X is actually an established convention (although I hadn't heard of it)
  • X..Y and X...Y actually already work. I just didn't document them since they're syntactic sugar for some X, Y, ^Z way of writing the same specification.

The X, Y, ^Z way of specifying commits is more expressive since if you have new branch heads N1, N2, N3, N4, and old branch heads O1, O2 that you already indexed last time, you can just index: N1 N2 N3 N4 ^O1 ^O2 without worrying which old branch head(s) are ancestors of which new branch heads (it might be true that all are ancestors of all).
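olsen232's point that `X..Y` is sugar for `^X Y` can be checked directly against git. A self-contained demo that builds a throwaway repo (nothing here is kart code; requires git on PATH):

```python
import os
import subprocess
import tempfile

def git(*args, cwd):
    """Run a git command and return its stdout."""
    return subprocess.run(("git", *args), cwd=cwd, check=True,
                          capture_output=True, text=True).stdout

repo = tempfile.mkdtemp()
git("init", "-q", cwd=repo)
git("config", "user.email", "test@example.com", cwd=repo)
git("config", "user.name", "test", cwd=repo)
for i in (1, 2, 3):
    with open(os.path.join(repo, "f.txt"), "w") as f:
        f.write(str(i))
    git("add", "f.txt", cwd=repo)
    git("commit", "-q", "-m", f"commit {i}", cwd=repo)

oids = git("rev-list", "HEAD", cwd=repo).split()  # newest first
newest, _, oldest = oids
dots = set(git("rev-list", f"{oldest}..{newest}", cwd=repo).split())
caret = set(git("rev-list", newest, f"^{oldest}", cwd=repo).split())
```

Both invocations list the same commits, which is why forwarding either form verbatim to `git rev-list` works.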

Member:

Maybe change the docstring to avoid using that and just say you can pass it anything git rev-list accepts

@pytest.mark.skip(reason=SKIP_REASON)
def test_index_points_idempotent(data_archive, cli_runner):
# Indexing the commits one at a time (and backwards) and then indexing --all should
# also give the same result, even though everything will have been indexed twice.
Member:

Reckon we can avoid this? it'd be helpful to be able to just kart spatial-tree index {refname} without having to specify revisions, and not have it fall off a performance cliff when everything's already been done.

In a post-receive hook, we do know what's been pushed (e.g. abc123..def456), but being able to simply run kart spatial-index def456 means we need no bootstrapping the first time the post-receive hook is deployed. Whereas if the hook only indexes abc123..def456, then we also need to run an initial index of abc123.

I guess it'd just be a matter of storing root tree hashes (or commit hashes) in a separate table in the db, and then using that to skip doing them if they already exist
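A minimal sketch of that "skip what's already done" idea; the table name, schema, and `index_commits` helper are all hypothetical, not kart's actual design:

```python
import sqlite3

def index_commits(db, commit_oids, do_index):
    """Index only commits not already recorded; returns how many were done."""
    db.execute("CREATE TABLE IF NOT EXISTS indexed_commits (oid TEXT PRIMARY KEY)")
    done = 0
    for oid in commit_oids:
        seen = db.execute(
            "SELECT 1 FROM indexed_commits WHERE oid = ?", (oid,)
        ).fetchone()
        if seen:
            continue  # already indexed: skip, no performance cliff
        do_index(oid)
        db.execute("INSERT INTO indexed_commits (oid) VALUES (?)", (oid,))
        done += 1
    db.commit()
    return done

db = sqlite3.connect(":memory:")
calls = []
first = index_commits(db, ["abc123", "def456"], calls.append)
second = index_commits(db, ["abc123", "def456"], calls.append)  # no-op
```

Run twice over the same oids, the second call indexes nothing, which is the idempotency being asked for.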

olsen232 (Collaborator, author):

Yeah this could be good. Won't do it in this PR though - it would be a separate first step in order to find the commit-spec that we want to use to run git rev-list <commit-spec>. So all this current code would remain pretty much unchanged and I can later add the other code that makes it less dumb.
Disinclined to do it as the next step - going to try and get the proof-of-concept working again but using some of the new pieces.

Member:

ok

Member:

@craigds what's the use case for running spatial-tree index {ref}?

olsen232 (Collaborator, author):

I also don't know exactly what Craig intends by spatial-tree index {ref}, but I agree with him it would be cool if you could just call spatial-tree index --all and it would return immediately if everything was already up to date, which I think is mostly what he's getting at.

Member:

That, as well as building for this not being 100% reliable all the time. If it dies once and fails to index some revisions, we want it to automatically catch up during the next push. Which won't happen if we have to explicitly provide a revision range.

Member:

it's simplistic to think of it as a revision range... we get a list of refs with (Old, New) commit oids (and Old or New can be null)... then we turn that into ^old1 ^old2 ^old3 new2 new3 new4 which spits out all the commit IDs in the new but not the old refs.

we want it to automatically catch up during the next push

We want it to catch up during the next maintenance run or the next push, whichever comes first.

So, we save root tree IDs?
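The refs-to-rev-list translation described above can be sketched in a few lines (hypothetical helper, not kart code; argument order doesn't matter to git rev-list):

```python
NULL_OID = "0" * 40  # git's all-zeros oid for a created/deleted ref

def rev_list_args(ref_updates):
    """Turn post-receive (old, new) oid pairs into rev-list arguments."""
    args = []
    for old, new in ref_updates:
        if new != NULL_OID:
            args.append(new)        # include commits reachable from the new tip...
        if old != NULL_OID:
            args.append(f"^{old}")  # ...but exclude what was already there
    return args
```

Feeding the result to `git rev-list` yields exactly the commits in the new refs but not the old ones, even when the old/new heads share ancestry in arbitrary ways.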

olsen232 (Collaborator, author):

I guess after we finish indexing, we store all of the branch head commit oids in a table somewhere - but in a minimal form, eg, no need to store a commit OID if we are also storing its descendant. Every time an index operation succeeds we replace this minimal list with the new minimal list.

Member:

equivalent of git for-each-ref refs/heads/ refs/tags/?

@olsen232 olsen232 merged commit 06a470d into master Sep 15, 2021
@olsen232 olsen232 deleted the spatial-tree branch September 15, 2021 04:46

Comment on lines +253 to +255
# Using sqlite directly here instead of sqlalchemy is about 10x faster.
# Possibly due to huge number of unbatched queries.
# TODO - investigate further.
Member:

Yeah, this needs to perform well... can we TODO it now?

Using SQLAlchemy shouldn't be magnitudes slower.

olsen232 (Collaborator, author):

It's using sqlite directly right now instead of sqlalchemy, so it's already fast. But I can take some time and try to figure out what's making sqlalchemy slow (might be relevant to working copies also).
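As an illustration of the batching suspicion (not a diagnosis of the actual slowdown), here is the per-row pattern next to `executemany` with stdlib sqlite3; the table and columns are made up:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE blobs (cell_token TEXT, blob_id TEXT)")

rows = [(f"cell{i}", f"blob{i}") for i in range(1000)]

# Unbatched: one statement round-trip per row (the pattern that tends to be slow
# when an ORM issues thousands of individual INSERTs).
for row in rows[:500]:
    db.execute("INSERT INTO blobs VALUES (?, ?)", row)

# Batched: one executemany call handles the rest.
db.executemany("INSERT INTO blobs VALUES (?, ?)", rows[500:])
db.commit()

count = db.execute("SELECT COUNT(*) FROM blobs").fetchone()[0]
```

Both halves insert the same data; the difference at scale is per-statement overhead, which is one plausible source of the 10x gap mentioned.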

NO_GEOMETRY_COLUMN = object()


def get_geometry(repo, feature_oid):
Member:

Presumably we have this functionality somewhere else? Can we re-use it?

olsen232 (Collaborator, author):

This one works a bit differently. Mostly we load the repository at a particular revision; then every feature we read has an attached legend ID, and we can combine that legend with the current schema to see which column is the geometry column. In this case, however, I'm running a magic git command that iterates through all features added in a certain set of commits. It's not straightforward, when using this command, to see exactly which commit a feature was added at, so it's not obvious how to look up the attached legend or the current schema. But it is possible to find the geometry field without doing those things, so that's what this code does.


Comment on lines +264 to +266
transforms = crs_helper.transforms_for_dataset(ds_path)
if not transforms:
continue
Member:

can't this be pulled out of the loop?

olsen232 (Collaborator, author):

No - different datasets have different transforms. This indexes features from all datasets in the same loop, possibly in commit order. E.g. this loop indexes features from commit X, then commit Y, then commit Z, and these commits could each involve a few different datasets, and therefore different transforms - there's no loop that only deals with one dataset and therefore one set of transforms. However, transforms are cached, so it doesn't take long to find them again.
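The caching mentioned here can be as simple as memoising the per-dataset lookup. A sketch with a stand-in for the real CRS work; this `transforms_for_dataset` is hypothetical, not kart's implementation:

```python
import functools

lookups = []  # records how often the expensive path actually runs

@functools.lru_cache(maxsize=None)
def transforms_for_dataset(ds_path):
    # Stand-in for the expensive CRS/transform construction.
    lookups.append(ds_path)
    return (f"transform-for-{ds_path}",)

# Features from several commits, interleaving datasets, as in the loop above:
for commit in ("X", "Y", "Z"):
    for ds_path in ("roads", "parcels"):
        transforms = transforms_for_dataset(ds_path)
```

Each dataset's transforms are computed once; every later commit touching the same dataset hits the cache, which is why re-finding them inside the loop stays cheap.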

# Possibly due to huge number of unbatched queries.
# TODO - investigate further.
db = sqlite.connect(f"file:{db_path}", uri=True)
with db:
Member:

presume this deals with concurrency ok? Very possible multiple of these will be running/queued at the same time for a repository. Think two pushes in quick succession.

olsen232 (Collaborator, author):

It's using WAL which I think should be okay - I can try it out
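For reference, enabling WAL plus a busy timeout with stdlib sqlite3 looks like this (a sketch; the db path is made up). WAL allows one writer alongside concurrent readers; a second writer waits for the lock rather than corrupting the database:

```python
import os
import sqlite3
import tempfile

# WAL needs a real file (it isn't supported for :memory: databases).
path = os.path.join(tempfile.mkdtemp(), "s2_index.db")  # hypothetical path
db = sqlite3.connect(f"file:{path}", uri=True)

mode = db.execute("PRAGMA journal_mode=WAL").fetchone()[0]
db.execute("PRAGMA busy_timeout=5000")  # wait up to 5s for a lock instead of erroring
```

Without a busy_timeout, a second process that hits a held write lock raises "database is locked" immediately, which matters for the two-pushes-in-quick-succession case.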
