Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adds an experimental byod-point-cloud-import command #906

Merged
merged 6 commits into from
Sep 20, 2023
Merged

Conversation

olsen232
Copy link
Collaborator

@olsen232 olsen232 commented Sep 7, 2023

Imports the metadata from some PC tiles hosted on S3 without downloading them, except, right now it does download them (using boto3), since
a) PDAL can't extract metadata from S3 without downloading the whole tile anyway and
b) boto3 is now configured to fetch from S3 during testing, but PDAL is not, and in the long run, we might not need it to anyway (see TODO below).

Still TODO: use boto3 or similar to download certain ranges of a PC file, which we can then extract metadata from, without downloading all the points. (Or fix PDAL so that it can do this).

Some things are still TBD - eg, the name of this command - but nevertheless, getting this underway will help highlight which
obstacles and decisions still remain to be solved, and this code will not be thrown away.

Related links:

#905

Checklist:

  • Have you reviewed your own change?
  • Have you included test(s)?
  • Have you updated the changelog?

Imports the metadata from some PC tiles hosted on S3 without
downloading them, except, right now it does download them, since
a) PDAL can't extract metadata from S3 without downloading
the whole tile anyway and
b) boto3 is now configured to fetch from S3 during testing, but
PDAL is not, and in the long run, we might not need it to anyway
(see TODO below).

Still TODO: use boto3 or similar to download certain ranges of a PC
file, which we can then extract metadata from, without downloading
all the points. (Or fix PDAL so that it can do this).
@olsen232 olsen232 requested review from craigds and rcoup September 7, 2023 02:29
kart/s3_util.py Outdated Show resolved Hide resolved

class ByodPointCloudImporter(ByodTileImporter, PointCloudImporter):
def extract_tile_metadata(self, tile_location):
oid_and_size = get_hash_and_size_of_s3_object(tile_location)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a more efficient way of doing this?

https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3/client/list_objects_v2.html supports a OptionalObjectAttributes thing, which might let us grab filesizes and checksums in the list call, to avoid doing GetObject requests for each object.

If we can get PDAL to do the PC tile metadata bit, and we can move the checksum/filesize into the list call, then we don't need to do any per-object calls ourselves, which should make things quite a bit faster for big layers

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not done - we don't necessarily do a list call if the user didn't supply us with a wildcard operator to expand. We could do a list call regardless, using whatever is the common prefix of all the supplied tiles... or multiple list calls if there are multiple distinct prefixes... but, this sounds like a whole other thing, it doesn't need to be a part of this PR

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

turns out this thing doesn't actually work anyway because RestoreStatus is the only thing you can put in OptionalObjectAttributes - you can't ask for checksums as part of that. Dumb but can't do much about it 👍

tests/conftest.py Show resolved Hide resolved
@olsen232 olsen232 requested a review from craigds September 19, 2023 21:07
Subdirectories (or the S3 equivalent - S3 is not exactly a directory hierarchy) are not matched -
that is, s3://bucket/path/*.txt matches s3://bucket/path/example.txt but not s3://bucket/path/subpath/example.txt
"""
# TODO: handle any kind of failure, sanity check to make sure we don't match a million objects.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

meh; a million objects is actually fine at this level, not really our job to try to guess if that's the user's intent 🤷

response = get_s3_client().head_object(
Bucket=bucket, Key=key, ChecksumMode="ENABLED"
)
# TODO - handle failure (eg missing SHA256 checksum), which is extremely likely.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this seems important


class ByodPointCloudImporter(ByodTileImporter, PointCloudImporter):
def extract_tile_metadata(self, tile_location):
oid_and_size = get_hash_and_size_of_s3_object(tile_location)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

turns out this thing doesn't actually work anyway because RestoreStatus is the only thing you can put in OptionalObjectAttributes - you can't ask for checksums as part of that. Dumb but can't do much about it 👍

@olsen232 olsen232 merged commit bfd3644 into master Sep 20, 2023
32 checks passed
@olsen232 olsen232 deleted the byoda-import branch September 20, 2023 02:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants