
Adds a new command spatial-tree index #475

Merged: olsen232 merged 2 commits into master from spatial-tree, Sep 15, 2021
Conversation

olsen232 (Collaborator):

Command only works if pywraps2 is manually installed into the kart venv.
Although tests exist, they are currently skipped for this reason.
(Avoiding modifying the kart build while it is red - hopefully fixed sometime soon).

When the command is run, it creates a spatial index using S2 cells, but this isn't yet used for anything.
Unlike the proof-of-concept, it can be used to index a set of commits and their ancestors, minus another set of commits and their ancestors (i.e. the ones that we already indexed last time).

Changelog is not yet updated - will wait until the command works and does something useful.

#456

Command only works if pywraps2 is manually installed into the kart venv.
Although tests exist, they are currently skipped for this reason.
(Not trying to modify the kart build while it is red)
@olsen232 olsen232 requested review from craigds and rcoup September 14, 2021 03:03
craigds (Member) left a comment:

Looking good!


s2_coverer = s2.S2RegionCoverer()
s2_coverer.set_max_cells(S2_MAX_CELLS_INDEX)
s2_coverer.set_max_level(S2_MAX_LEVEL)
Member:

this corresponds to 47K-100K ㎡, which is reasonably high precision on a geographic scale, but means the index will be not-very-useful for very small scale data.

do we need to set a max level at all?

olsen232 (Collaborator, author):

Not sure - @rcoup tuned this number, he can explain why it's this. I imagine there's a tradeoff but I don't know what the tradeoff is.

Member:

The precision should eventually

  1. Be user modifiable/set
  2. Be more precise for “smaller” datasets automatically on a sliding scale

So long as these are possible later

Member:

  1. the client-side repo→working-copy precision is accurate
  2. we don't need high precision for server-side filtering, we need high speed and minimal storage space. Including a pile of bonus features won't make much meaningful difference to clone-time, but going overboard on precision has a sizeable impact on storage costs and potentially slows down every fetch.
  3. if your data is small it'll be much much faster to do a full clone than going anywhere near a filter. Full clones take advantage of bitmaps, existing packfiles, etc which provide a big speed boost.

We can mess with the settings later once it's in use.
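For intuition on the numbers in this thread: the S2 sphere has 6 top-level faces and each cell splits into 4 children per level, so average cell area per level can be estimated from first principles. This is a back-of-envelope sketch, not from the PR and not using pywraps2; the Earth-surface constant is approximate:

```python
# Back-of-envelope only: average S2 cell area at a given level, assuming
# 6 top-level faces, each cell quadrupling in count per level.
EARTH_SURFACE_M2 = 510.1e12  # ~510.1 million km^2, approximate

def avg_s2_cell_area_m2(level):
    return EARTH_SURFACE_M2 / (6 * 4 ** level)

for level in (10, 15, 20):
    print(level, round(avg_s2_cell_area_m2(level)))
```

If S2_MAX_LEVEL were 15 (an assumption; the PR doesn't show its value), this gives an average of roughly 79,000 m², consistent with the 47K-100K ㎡ figure quoted above. Real cells vary in size around the average.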

repo.path,
"rev-list",
"--objects",
"--filter=object:type=blob",
Member:

this was added in git 2.32.0, so we need to ensure we're always getting a git at least that new. Looks like we're not:

GIT_VERSION = 2.29.2

olsen232 (Collaborator, author):

Good point - will add it to the tracking bug, but won't fix build issues for now.
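A version guard like the one implied here is easy to sketch; `parse_git_version` is a hypothetical helper, not kart code:

```python
import re

MIN_GIT = (2, 32, 0)  # `--filter=object:type=blob` needs git >= 2.32.0

def parse_git_version(version_output):
    """Pull (major, minor, patch) out of `git version 2.29.2`-style output."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", version_output)
    return tuple(int(part) for part in match.groups())

# Tuple comparison gives the right ordering for dotted versions.
ok = parse_git_version("git version 2.29.2") >= MIN_GIT
```

In a real build this would run `git version` and fail fast with a clear message when the bundled git is too old.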

Comment on lines +136 to +142
if src_crs.IsGeographic():
transform = None
desc = f"IDENTITY({src_crs.GetAuthorityCode(None)})"
else:
target_crs = src_crs.CloneGeogCS()
transform = osr.CoordinateTransformation(src_crs, target_crs)
desc = f"{src_crs.GetAuthorityCode(None)} -> {target_crs.GetAuthorityCode(None)}"
Member:

this is incorrect I think - we don't want to project to the nearest geographic CRS, we want to project everything to a single CRS. Otherwise S2 cells won't be comparable between different source CRSes.

Different geographic CRSes have different meridians, axis order, possibly units (e.g. radians) etc. We should pick one and always transform to that. Normal choice being EPSG:4326

olsen232 (Collaborator, author):

Hmm I think you are right. However again I must defer to @rcoup who wrote the prototype - I imagine he had his reasons for doing it this way. If not, I'll change it.

The "Geocentric vs Geodetic coordinates" section in https://s2geometry.io/about/overview might be relevant, but I'm not even sure about that.

rcoup (Member):

Logic was avoiding needless & expensive datum transforms. But Craig's right, axis-order/units/meridians can be in different places so that's not a great plan.

WGS84 seems sensible to me.

olsen232 (Collaborator, author):

Okay. Will switch this to convert to WGS 84. Is there a way we can tell it not to bother doing the expensive and needless thing?

Member:

OSRSpatialReference.IsSame(other)?

"""
Indexes all features added by the supplied commits and all of their ancestors.
To stop recursing at a particular ancestor or ancestors (eg to stop at a commit
that has already been indexed) prefix that commit with a caret: ^COMMIT.
Member:

so if I want to run this in a post-receive hook, and someone has pushed ref myref, changing it from abc123 to def456, I would run this like

kart spatial-tree index def456 ^abc123

That's not super intuitive to me (and I don't think ^{commit} is an established convention?)

Would it make more sense to take a range argument?

kart spatial-tree index abc123..def456

olsen232 (Collaborator, author) commented on Sep 15, 2021:

So actually these arguments are just forwarded verbatim to git rev-list. That means:

  • ^X is actually an established convention (although I hadn't heard of it)
  • X..Y and X...Y actually already work. I just didn't document them since they're syntactic sugar for some X, Y, ^Z way of writing the same specification.

The X, Y, ^Z way of specifying commits is more expressive since if you have new branch heads N1, N2, N3, N4, and old branch heads O1, O2 that you already indexed last time, you can just index: N1 N2 N3 N4 ^O1 ^O2 without worrying which old branch head(s) are ancestors of which new branch heads (it might be true that all are ancestors of all).
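olsen232's point that `X..Y` is sugar for `^X Y` can be checked directly against git. A self-contained demo that builds a throwaway repo (nothing here is kart code; requires git on PATH):

```python
import os
import subprocess
import tempfile

def git(*args, cwd):
    """Run a git command and return its stdout."""
    return subprocess.run(("git", *args), cwd=cwd, check=True,
                          capture_output=True, text=True).stdout

repo = tempfile.mkdtemp()
git("init", "-q", cwd=repo)
git("config", "user.email", "test@example.com", cwd=repo)
git("config", "user.name", "test", cwd=repo)
for i in (1, 2, 3):
    with open(os.path.join(repo, "f.txt"), "w") as f:
        f.write(str(i))
    git("add", "f.txt", cwd=repo)
    git("commit", "-q", "-m", f"commit {i}", cwd=repo)

oids = git("rev-list", "HEAD", cwd=repo).split()  # newest first
newest, _, oldest = oids
dots = set(git("rev-list", f"{oldest}..{newest}", cwd=repo).split())
caret = set(git("rev-list", newest, f"^{oldest}", cwd=repo).split())
```

Both invocations list the same commits, which is why forwarding either form verbatim to `git rev-list` works.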

Member:

Maybe change the docstring to avoid using that and just say you can pass it anything git rev-list accepts

@pytest.mark.skip(reason=SKIP_REASON)
def test_index_points_idempotent(data_archive, cli_runner):
# Indexing the commits one at a time (and backwards) and then indexing --all should
# also give the same result, even though everything will have been indexed twice.
Member:

Reckon we can avoid this? it'd be helpful to be able to just kart spatial-tree index {refname} without having to specify revisions, and not have it fall off a performance cliff when everything's already been done.

In a post-receive hook, we do know what's been pushed (e.g. abc123..def456), but being able to simply run kart spatial-index def456 means we need no bootstrapping the first time the post-receive hook is deployed. Whereas if the hook only indexes abc123..def456, then we also need to run an initial index of abc123.

I guess it'd just be a matter of storing root tree hashes (or commit hashes) in a separate table in the db, and then using that to skip doing them if they already exist
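A minimal sketch of that "skip what's already done" idea; the table name, schema, and `index_commits` helper are all hypothetical, not kart's actual design:

```python
import sqlite3

def index_commits(db, commit_oids, do_index):
    """Index only commits not already recorded; returns how many were done."""
    db.execute("CREATE TABLE IF NOT EXISTS indexed_commits (oid TEXT PRIMARY KEY)")
    done = 0
    for oid in commit_oids:
        seen = db.execute(
            "SELECT 1 FROM indexed_commits WHERE oid = ?", (oid,)
        ).fetchone()
        if seen:
            continue  # already indexed: skip, no performance cliff
        do_index(oid)
        db.execute("INSERT INTO indexed_commits (oid) VALUES (?)", (oid,))
        done += 1
    db.commit()
    return done

db = sqlite3.connect(":memory:")
calls = []
first = index_commits(db, ["abc123", "def456"], calls.append)
second = index_commits(db, ["abc123", "def456"], calls.append)  # no-op
```

Run twice over the same oids, the second call indexes nothing, which is the idempotency being asked for.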

olsen232 (Collaborator, author):

Yeah this could be good. Won't do it in this PR though - it would be a separate first step in order to find the commit-spec that we want to use to run git rev-list <commit-spec>. So all this current code would remain pretty much unchanged and I can later add the other code that makes it less dumb.
Disinclined to do it as the next step - going to try and get the proof-of-concept working again but using some of the new pieces.

Member:

ok

Member:

@craigds what's the use case for running spatial-tree index {ref}?

olsen232 (Collaborator, author):

I also don't know exactly what Craig intends by spatial-tree index {ref}, but I agree with him it would be cool if you could just call spatial-tree index --all and it would return immediately if everything was already up to date, which I think is mostly what he's getting at.

Member:

That, as well as building for this not being 100% reliable all the time. If it dies once and fails to index some revisions, we want it to automatically catch up during the next push. Which won't happen if we have to explicitly provide a revision range.

Member:

it's simplistic to think of it as a revision range... we get a list of refs with (Old, New) commit oids (and Old or New can be null)... then we turn that into ^old1 ^old2 ^old3 new2 new3 new4 which spits out all the commit IDs in the new but not the old refs.

we want it to automatically catch up during the next push

We want it to catch up during the next maintenance run or the next push, whichever comes first.

So, we save root tree IDs?
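The refs-to-rev-list translation described above can be sketched in a few lines (hypothetical helper, not kart code; argument order doesn't matter to git rev-list):

```python
NULL_OID = "0" * 40  # git's all-zeros oid for a created/deleted ref

def rev_list_args(ref_updates):
    """Turn post-receive (old, new) oid pairs into rev-list arguments."""
    args = []
    for old, new in ref_updates:
        if new != NULL_OID:
            args.append(new)        # include commits reachable from the new tip...
        if old != NULL_OID:
            args.append(f"^{old}")  # ...but exclude what was already there
    return args
```

Feeding the result to `git rev-list` yields exactly the commits in the new refs but not the old ones, even when the old/new heads share ancestry in arbitrary ways.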

olsen232 (Collaborator, author):

I guess after we finish indexing, we store all of the branch head commit oids in a table somewhere - but in a minimal form, eg, no need to store a commit OID if we are also storing its descendant. Every time an index operation succeeds we replace this minimal list with the new minimal list.

Member:

equivalent of git for-each-ref refs/heads/ refs/tags/?

@olsen232 olsen232 merged commit 06a470d into master Sep 15, 2021
@olsen232 olsen232 deleted the spatial-tree branch September 15, 2021 04:46

Comment on lines +253 to +255
# Using sqlite directly here instead of sqlalchemy is about 10x faster.
# Possibly due to huge number of unbatched queries.
# TODO - investigate further.
Member:

Yeah, this needs to perform well... can we TODO it now?

Using SQLAlchemy shouldn't be magnitudes slower.

olsen232 (Collaborator, author):

It's using sqlite directly right now instead of sqlalchemy, so it's already fast. But I can take some time and try to figure out what's making sqlalchemy slow (might be relevant to working copies also).
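As an illustration of the batching suspicion (not a diagnosis of the actual slowdown), here is the per-row pattern next to `executemany` with stdlib sqlite3; the table and columns are made up:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE blobs (cell_token TEXT, blob_id TEXT)")

rows = [(f"cell{i}", f"blob{i}") for i in range(1000)]

# Unbatched: one statement round-trip per row (the pattern that tends to be slow
# when an ORM issues thousands of individual INSERTs).
for row in rows[:500]:
    db.execute("INSERT INTO blobs VALUES (?, ?)", row)

# Batched: one executemany call handles the rest.
db.executemany("INSERT INTO blobs VALUES (?, ?)", rows[500:])
db.commit()

count = db.execute("SELECT COUNT(*) FROM blobs").fetchone()[0]
```

Both halves insert the same data; the difference at scale is per-statement overhead, which is one plausible source of the 10x gap mentioned.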

NO_GEOMETRY_COLUMN = object()


def get_geometry(repo, feature_oid):
Member:

Presumably we have this functionality somewhere else? Can we re-use it?

olsen232 (Collaborator, author):

This one works a bit differently. Mostly we load the repository at a particular revision; then every feature we read has an attached legend ID, and we can combine that legend with the current schema to see which column is the geometry column. In this case, however, I'm running a magic git command that iterates through all features added in a certain set of commits. It's not straightforward, when using this command, to see exactly which commit a feature was added at, so it's not obvious how to look up the attached legend or the current schema. But it is possible to find the geometry field without doing those things, so that's what this code does.


Comment on lines +264 to +266
transforms = crs_helper.transforms_for_dataset(ds_path)
if not transforms:
continue
Member:

can't this be pulled out of the loop?

olsen232 (Collaborator, author):

No - different datasets have different transforms. This indexes features from all datasets in the same loop, possibly in commit order. E.g. this loop indexes features from commit X, then commit Y, then commit Z, and these commits could each involve a few different datasets, and therefore different transforms - there's no loop that only deals with one dataset and therefore one set of transforms. However, transforms are cached, so it doesn't take long to find them again.
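The caching mentioned here can be as simple as memoising the per-dataset lookup. A sketch with a stand-in for the real CRS work; this `transforms_for_dataset` is hypothetical, not kart's implementation:

```python
import functools

lookups = []  # records how often the expensive path actually runs

@functools.lru_cache(maxsize=None)
def transforms_for_dataset(ds_path):
    # Stand-in for the expensive CRS/transform construction.
    lookups.append(ds_path)
    return (f"transform-for-{ds_path}",)

# Features from several commits, interleaving datasets, as in the loop above:
for commit in ("X", "Y", "Z"):
    for ds_path in ("roads", "parcels"):
        transforms = transforms_for_dataset(ds_path)
```

Each dataset's transforms are computed once; every later commit touching the same dataset hits the cache, which is why re-finding them inside the loop stays cheap.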

# Possibly due to huge number of unbatched queries.
# TODO - investigate further.
db = sqlite.connect(f"file:{db_path}", uri=True)
with db:
Member:

presume this deals with concurrency ok? Very possible multiple of these will be running/queued at the same time for a repository. Think two pushes in quick succession.

olsen232 (Collaborator, author):

It's using WAL which I think should be okay - I can try it out
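For reference, enabling WAL plus a busy timeout with stdlib sqlite3 looks like this (a sketch; the db path is made up). WAL allows one writer alongside concurrent readers; a second writer waits for the lock rather than corrupting the database:

```python
import os
import sqlite3
import tempfile

# WAL needs a real file (it isn't supported for :memory: databases).
path = os.path.join(tempfile.mkdtemp(), "s2_index.db")  # hypothetical path
db = sqlite3.connect(f"file:{path}", uri=True)

mode = db.execute("PRAGMA journal_mode=WAL").fetchone()[0]
db.execute("PRAGMA busy_timeout=5000")  # wait up to 5s for a lock instead of erroring
```

Without a busy_timeout, a second process that hits a held write lock raises "database is locked" immediately, which matters for the two-pushes-in-quick-succession case.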
