Adds a new command spatial-tree index
#475
Conversation
Command only works if pywraps2 is manually installed into the kart venv. Although tests exist, they are currently skipped for this reason. (Not trying to modify the kart build while it is red.)
Looking good!
s2_coverer = s2.S2RegionCoverer()
s2_coverer.set_max_cells(S2_MAX_CELLS_INDEX)
s2_coverer.set_max_level(S2_MAX_LEVEL)
this corresponds to 47K-100K ㎡, which is reasonably high precision on a geographic scale, but means the index will be not-very-useful for very small scale data.
do we need to set a max level at all?
Not sure - @rcoup tuned this number, he can explain why it's this. I imagine there's a tradeoff but I don't know what the tradeoff is.
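For illustration, a minimal sketch of what these coverer settings do - assuming pywraps2 imports as shown, with hypothetical constant values (level 15 matches the 47K-100K m² figure above):

```python
import pywraps2 as s2  # assumption: pywraps2 installed and importable under this name

S2_MAX_CELLS_INDEX = 8  # hypothetical values - the real ones are set in the diff above
S2_MAX_LEVEL = 15       # level 15 cells cover roughly 47K-100K m² on the ground

s2_coverer = s2.S2RegionCoverer()
s2_coverer.set_max_cells(S2_MAX_CELLS_INDEX)  # caps index entries per feature
s2_coverer.set_max_level(S2_MAX_LEVEL)        # caps precision: lower level = bigger cells

# Cover a small bounding box and inspect the resulting cells.
rect = s2.S2LatLngRect(
    s2.S2LatLng.FromDegrees(-41.30, 174.75),
    s2.S2LatLng.FromDegrees(-41.28, 174.78),
)
for cell_id in s2_coverer.GetCovering(rect):
    print(cell_id.ToToken(), "level", cell_id.level())
```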
The precision should eventually:
- Be user-modifiable/settable
- Be more precise for “smaller” datasets automatically, on a sliding scale

So long as these are possible later, this is fine for now.
- the client-side repo→working-copy precision is accurate
- we don't need high precision for server-side filtering, we need high speed and minimal storage space. Including a pile of bonus features won't make much meaningful difference to clone-time, but going overboard on precision has a sizeable impact on storage costs and potentially slows down every fetch.
- if your data is small it'll be much much faster to do a full clone than going anywhere near a filter. Full clones take advantage of bitmaps, existing packfiles, etc which provide a big speed boost.
We can mess with the settings later once it's in use.
repo.path,
"rev-list",
"--objects",
"--filter=object:type=blob",
this was added in git 2.32.0 so we need to ensure we're always getting a git newer than that. Looks like we're not (Line 1 in bc39296):
GIT_VERSION = 2.29.2
Good point - will add it to the tracking bug, but won't fix build issues for now.
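For reference, a hedged sketch of the call being discussed (helper name hypothetical; the `--filter=object:type=blob` option requires git >= 2.32.0):

```python
import subprocess

def rev_list_blobs(repo_path, *commit_spec):
    """Yield (blob_oid, path) for every blob reachable from the given commit-spec."""
    cmd = [
        "git", "-C", repo_path,
        "rev-list", "--objects", "--filter=object:type=blob",
        *commit_spec,
    ]
    output = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    for line in output.splitlines():
        oid, _, path = line.partition(" ")
        if path:  # commit OIDs are still listed, but with no path - skip them
            yield oid, path
```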
if src_crs.IsGeographic():
    transform = None
    desc = f"IDENTITY({src_crs.GetAuthorityCode(None)})"
else:
    target_crs = src_crs.CloneGeogCS()
    transform = osr.CoordinateTransformation(src_crs, target_crs)
    desc = f"{src_crs.GetAuthorityCode(None)} -> {target_crs.GetAuthorityCode(None)}"
this is incorrect I think - we don't want to project to the nearest geographic CRS, we want to project everything to a single CRS. Otherwise S2 cells won't be comparable between different source CRSes.
Different geographic CRSes have different meridians, axis order, possibly units (e.g. radians), etc. We should pick one and always transform to that - the normal choice being EPSG:4326.
Hmm I think you are right. However again I must defer to @rcoup who wrote the prototype - I imagine he had his reasons for doing it this way. If not, I'll change it.
The "Geocentric vs Geodetic coordinates" section in https://s2geometry.io/about/overview might be relevant, but I'm not even sure about that.
Logic was avoiding needless & expensive datum transforms. But Craig's right, axis-order/units/meridians can be in different places so that's not a great plan.
WGS84 seems sensible to me.
Okay. Will switch this to convert to WGS 84. Is there a way we can tell it not to bother doing the expensive and needless thing?
`OSRSpatialReference.IsSame(other)`?
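Putting the thread's suggestions together, a rough sketch of what the fixed logic might look like - a sketch only, and the axis-mapping call assumes GDAL 3+:

```python
from osgeo import osr

def transform_to_wgs84(src_crs):
    """Always target EPSG:4326; skip the transform when the source is already the same."""
    target_crs = osr.SpatialReference()
    target_crs.ImportFromEPSG(4326)
    # GDAL 3+ defaults EPSG:4326 to lat/lon axis order; force lon/lat for geometry coords.
    target_crs.SetAxisMappingStrategy(osr.OAMS_TRADITIONAL_GIS_ORDER)
    if src_crs.IsSame(target_crs):
        return None, f"IDENTITY({src_crs.GetAuthorityCode(None)})"
    transform = osr.CoordinateTransformation(src_crs, target_crs)
    return transform, f"{src_crs.GetAuthorityCode(None)} -> EPSG:4326"
```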
""" | ||
Indexes all features added by the supplied commits and all of their ancestors. | ||
To stop recursing at a particular ancestor or ancestors (eg to stop at a commit | ||
that has already been indexed) prefix that commit with a caret: ^COMMIT. |
so if I want to run this in a post-receive hook, and someone has pushed ref `myref`, changing it from `abc123` to `def456`, I would run this like `kart spatial-tree index def456 ^abc123`?
That's not super intuitive to me (and I don't think `^{commit}` is an established convention?)
Would it make more sense to take a range argument? `kart spatial-tree index abc123..def456`
So actually these arguments are just forwarded verbatim to `git rev-list`. That means:
- `^X` is actually an established convention (although I hadn't heard of it)
- `X..Y` and `X...Y` actually already work. I just didn't document them since they're syntactic sugar for some X, Y, ^Z way of writing the same specification.

The X, Y, ^Z way of specifying commits is more expressive: if you have new branch heads N1, N2, N3, N4, and old branch heads O1, O2 that you already indexed last time, you can just index `N1 N2 N3 N4 ^O1 ^O2` without worrying which old branch head(s) are ancestors of which new branch heads (it might be true that all are ancestors of all).
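To spell out the equivalence, with illustrative OIDs only:

```python
# The range form is sugar for the explicit form - both select commits
# reachable from def456 but not from abc123:
spec_range = ["abc123..def456"]
spec_explicit = ["def456", "^abc123"]

# The explicit form generalizes to many heads without any ancestry bookkeeping:
spec_multi = ["N1", "N2", "N3", "N4", "^O1", "^O2"]
```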
Maybe change the docstring to avoid using that and just say you can pass it anything `git rev-list` accepts.
@pytest.mark.skip(reason=SKIP_REASON)
def test_index_points_idempotent(data_archive, cli_runner):
    # Indexing the commits one at a time (and backwards) and then indexing --all should
    # also give the same result, even though everything will have been indexed twice.
Reckon we can avoid this? It'd be helpful to be able to just `kart spatial-tree index {refname}` without having to specify revisions, and not have it fall off a performance cliff when everything's already been done.
In a post-receive hook, we do know what's been pushed (e.g. `abc123..def456`), but being able to simply run `kart spatial-index def456` means we need no bootstrapping the first time the post-receive hook is deployed. Whereas if the hook only indexes `abc123..def456` then we also need to run an initial index of `abc123`.
I guess it'd just be a matter of storing root tree hashes (or commit hashes) in a separate table in the db, and then using that to skip doing them if they already exist.
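A minimal sketch of that bookkeeping idea, with a hypothetical table and column name:

```python
import sqlite3

def filter_already_indexed(db, commits):
    """Return only the commits that aren't recorded as indexed (hypothetical schema)."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS indexed_commits (commit_id TEXT PRIMARY KEY)"
    )
    done = {row[0] for row in db.execute("SELECT commit_id FROM indexed_commits")}
    return [c for c in commits if c not in done]
```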
Yeah this could be good. Won't do it in this PR though - it would be a separate first step in order to find the commit-spec that we want to use to run `git rev-list <commit-spec>`. So all this current code would remain pretty much unchanged and I can later add the other code that makes it less dumb.
Disinclined to do it as the next step - going to try and get the proof-of-concept working again but using some of the new pieces.
ok
@craigds what's the use case for running `spatial-tree index {ref}`?
I also don't know exactly what Craig intends by `spatial-tree index {ref}`, but I agree with him it would be cool if you could just call `spatial-tree index --all` and have it return immediately if everything was already up to date, which I think is mostly what he's getting at.
That, as well as building for this not being 100% reliable all the time. If it dies once and fails to index some revisions, we want it to automatically catch up during the next push. Which won't happen if we have to explicitly provide a revision range.
it's simplistic to think of it as a revision range... we get a list of refs with (Old, New) commit oids (and Old or New can be null)... then we turn that into `^old1 ^old2 ^old3 new2 new3 new4` which spits out all the commit IDs in the new but not the old refs.

> we want it to automatically catch up during the next push

We want it to catch up during the next maintenance run or the next push, whichever comes first.
So, we save root tree IDs?
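A sketch of turning post-receive input into that spec - helper name hypothetical; a post-receive hook reads "<old> <new> <refname>" lines on stdin:

```python
NULL_OID = "0" * 40  # git uses an all-zero OID for created/deleted refs

def post_receive_commit_spec(stdin_lines):
    """Build a rev-list spec of new heads minus previously-reachable old heads."""
    spec = []
    for line in stdin_lines:
        old, new, _refname = line.split()
        if new != NULL_OID:
            spec.append(new)        # index everything reachable from the new head...
        if old != NULL_OID:
            spec.append(f"^{old}")  # ...except what was already reachable before
    return spec
```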
I guess after we finish indexing, we store all of the branch head commit oids in a table somewhere - but in a minimal form, e.g. no need to store a commit OID if we are also storing its descendant. Every time an index operation succeeds we replace this minimal list with the new minimal list.
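One way to compute that minimal form is `git merge-base --independent`, which drops any listed commit that is an ancestor of another - sketched here with a hypothetical helper:

```python
import subprocess

def minimal_heads(repo_path, commit_oids):
    """Reduce a list of commits to the minimal subset with the same ancestors."""
    result = subprocess.run(
        ["git", "-C", repo_path, "merge-base", "--independent", *commit_oids],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.split()
```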
equivalent of `git for-each-ref refs/heads/ refs/tags/`?
# Using sqlite directly here instead of sqlalchemy is about 10x faster.
# Possibly due to huge number of unbatched queries.
# TODO - investigate further.
Yeah, this needs to perform well... can we TODO it now?
Using SQLAlchemy shouldn't be orders of magnitude slower.
It's using sqlite directly right now instead of sqlalchemy, so it's fast. But I can take some time and try to figure out what's making sqlalchemy slow (might be relevant to working copies also).
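For what it's worth, the usual fix for a huge number of unbatched queries is batching with executemany inside a single transaction - a sketch with a hypothetical table:

```python
import sqlite3

def write_index_rows(db_path, rows):
    """Insert (blob_id, cell_token) pairs in one batched transaction."""
    db = sqlite3.connect(f"file:{db_path}", uri=True)
    with db:  # one transaction for the whole batch
        db.executemany(
            "INSERT OR REPLACE INTO feature_cells (blob_id, cell_token) VALUES (?, ?)",
            rows,
        )
```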
NO_GEOMETRY_COLUMN = object()


def get_geometry(repo, feature_oid):
Presumably we have this functionality somewhere else? Can we re-use it?
This one works a bit differently - mostly we load the repository at a particular revision, and then every feature we try to read has an attached legend ID, which we can combine with the current schema to see which column is the geometry column. However, in this case, I'm running a magic git command that iterates through all features added in a certain set of commits. It's not straightforward when using this command to see exactly which commit a feature was added at, so it's not obvious how to look up the attached legend or the current schema. But it is possible to find the geometry field without doing those things, so that's what this code does.
transforms = crs_helper.transforms_for_dataset(ds_path)
if not transforms:
    continue
can't this be pulled out of the loop?
No - different datasets have different transforms. This indexes features from all datasets in the same loop, possibly in commit order. E.g. this loop indexes features from commit X, then commit Y, then commit Z, and these commits could each involve a few different datasets, and therefore different transforms - there's no loop that only deals with one dataset (and therefore one set of transforms). However, transforms are cached, so it doesn't take long to find them again (see the sketch below).
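A sketch of the caching shape being described - class and method names follow the diff, the cache detail is assumed:

```python
class CRSHelper:
    def __init__(self, repo):
        self.repo = repo
        self._transform_cache = {}

    def transforms_for_dataset(self, ds_path):
        # Built once per dataset, then served from the cache for every
        # subsequent feature in that dataset, whichever commit it comes from.
        if ds_path not in self._transform_cache:
            self._transform_cache[ds_path] = self._build_transforms(ds_path)
        return self._transform_cache[ds_path]

    def _build_transforms(self, ds_path):
        # Hypothetical stand-in for the expensive CRS lookup + transform build.
        raise NotImplementedError
```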
# Possibly due to huge number of unbatched queries.
# TODO - investigate further.
db = sqlite.connect(f"file:{db_path}", uri=True)
with db:
presume this deals with concurrency ok? Very possible multiple of these will be running/queued at the same time for a repository. Think two pushes in quick succession.
It's using WAL, which I think should be okay - I can try it out.
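On the concurrency question, a sketch of the relevant sqlite settings - WAL lets readers proceed during a write, but writers still serialize, so a busy timeout helps when two pushes index at once:

```python
import sqlite3

db_path = "s2_index.db"  # hypothetical path, standing in for db_path in the diff above
db = sqlite3.connect(f"file:{db_path}", uri=True)
db.execute("PRAGMA journal_mode=WAL")    # readers no longer block on the writer
db.execute("PRAGMA busy_timeout=30000")  # wait up to 30s for a competing writer
```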
Command only works if pywraps2 is manually installed into the kart venv.
Although tests exist, they are currently skipped for this reason.
(Avoiding modifying the kart build while it is red - hopefully fixed sometime soon).
When the command is run, it creates a spatial index using S2 cells, but this isn't yet used for anything.
Unlike the proof-of-concept, it can be used to index a set of commits and their ancestors, minus another set of commits and their ancestors (i.e. the ones that we already indexed last time).
Changelog is not yet updated - will wait until the command works and does something useful.
#456