-
Notifications
You must be signed in to change notification settings - Fork 41
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adds an experimental byod-point-cloud-import command #906
Conversation
Imports the metadata from some PC tiles hosted on S3 without downloading them, except, right now it does download them, since a) PDAL can't extract metadata from S3 without downloading the whole tile anyway and b) boto3 is now configured to fetch from S3 during testing, but PDAL is not, and in the long run, we might not need it to anyway (see TODO below). Still TODO: use boto3 or similar to download certain ranges of a PC file, which we can then extract metadata from, without downloading all the points. (Or fix PDAL so that it can do this).
|
||
class ByodPointCloudImporter(ByodTileImporter, PointCloudImporter): | ||
def extract_tile_metadata(self, tile_location): | ||
oid_and_size = get_hash_and_size_of_s3_object(tile_location) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there a more efficient way of doing this?
https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3/client/list_objects_v2.html supports a OptionalObjectAttributes
thing, which might let us grab filesizes and checksums in the list call, to avoid doing GetObject requests for each object.
If we can get PDAL to do the PC tile metadata bit, and we can move the checksum/filesize into the list call, then we don't need to do any per-object calls ourselves, which should make things quite a bit faster for big layers
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not done - we don't necessarily do a list call if the user didn't supply us with a wildcard operator to expand. We could do a list call regardless, using whatever is the common prefix of all the supplied tiles... or multiple list calls if there are multiple distinct prefixes... but, this sounds like a whole other thing, it doesn't need to be a part of this PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
turns out this thing doesn't actually work anyway because RestoreStatus
is the only thing you can put in OptionalObjectAttributes
- you can't ask for checksums as part of that. Dumb but can't do much about it 👍
c23808e
to
7145291
Compare
Subdirectories (or the S3 equivalent - S3 is not exactly a directory hierarchy) are not matched - | ||
that is, s3://bucket/path/*.txt matches s3://bucket/path/example.txt but not s3://bucket/path/subpath/example.txt | ||
""" | ||
# TODO: handle any kind of failure, sanity check to make sure we don't match a million objects. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
meh; a million objects is actually fine at this level, not really our job to try to guess if that's the user's intent 🤷
response = get_s3_client().head_object( | ||
Bucket=bucket, Key=key, ChecksumMode="ENABLED" | ||
) | ||
# TODO - handle failure (eg missing SHA256 checksum), which is extremely likely. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this seems important
|
||
class ByodPointCloudImporter(ByodTileImporter, PointCloudImporter): | ||
def extract_tile_metadata(self, tile_location): | ||
oid_and_size = get_hash_and_size_of_s3_object(tile_location) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
turns out this thing doesn't actually work anyway because RestoreStatus
is the only thing you can put in OptionalObjectAttributes
- you can't ask for checksums as part of that. Dumb but can't do much about it 👍
Imports the metadata from some PC tiles hosted on S3 without downloading them, except, right now it does download them (using boto3), since
a) PDAL can't extract metadata from S3 without downloading the whole tile anyway and
b) boto3 is now configured to fetch from S3 during testing, but PDAL is not, and in the long run, we might not need it to anyway (see TODO below).
Still TODO: use boto3 or similar to download certain ranges of a PC file, which we can then extract metadata from, without downloading all the points. (Or fix PDAL so that it can do this).
Some things are still TBD - eg, the name of this command - but nevertheless, getting this underway will help highlight which
obstacles and decisions still remain to be solved, and this code will not be thrown away.
Related links:
#905
Checklist: