Using Rasterio's new linux wheels #942
Comments
I made edits to the post a few minutes ago. A reader pointed out to me in an email that Rasterio shouldn't be making so many requests for a tile. Indeed: I'd misused Rasterio's
I need similar functionality but with Google Cloud. Is it possible?
@MelnykAndriy search the repo for "google cloud": https://github.com/mapbox/rasterio/search?q=google+cloud&type=Issues&utf8=%E2%9C%93.
Is there a possibility of using "/vsizip/" as well as S3 to query metadata from a large zip compressed geotiff on S3?
I have tried to use this with Sentinel-2 images from the Using
The following works, but seems to be less efficient, as it makes many more requests than in the TIF file example above.
>>> from rasterio.windows import Window
>>> src.read(window=Window(1024, 1024, 512, 512))
... (lots of output)
array([[[ 817, 779, 940, ..., 781, 669, 720],
[ 811, 797, 966, ..., 930, 695, 707],
[ 859, 894, 971, ..., 1161, 927, 806],
...,
[ 759, 772, 763, ..., 886, 844, 728],
[ 751, 747, 725, ..., 847, 825, 745],
[ 723, 678, 683, ..., 1022, 938, 806]]], dtype=uint16)
The following fails:
>>> src.read(window=Window(1000, 1000, 512, 512))
DEBUG:rasterio._io:Output nodata value read from file: None
DEBUG:rasterio._io:Output nodata values: [None]
DEBUG:rasterio._io:Jump straight to _read()
DEBUG:rasterio._io:Window: Window(col_off=1000, row_off=1000, width=512, height=512)
DEBUG:rasterio._io:IO window xoff=1000.0 yoff=1000.0 width=512.0 height=512.0
Traceback (most recent call last):
File "<console>", line 1, in <module>
File "rasterio/_io.pyx", line 330, in rasterio._io.DatasetReaderBase.read
File "rasterio/_io.pyx", line 591, in rasterio._io.DatasetReaderBase._read
OSError: Read or write failed
So I guess I have two questions: why does the partial query make many more requests than in the TIF example, and why am I getting the above error?
My GDAL version:
>>> import rasterio
>>> rasterio.__gdal_version__
'2.2.2'
@yellowcap I have also noticed poor performance with the same JP2 files. I think they're not optimized for remote access with GDAL like the Landsat PDS GeoTIFFs are. Can you make a new ticket for the
Thanks for the info @sgillies regarding performance. Is there any chance that Sentinel-2 data access can be optimized in the future through a software update, without changes to the files? Or is the problem in the files themselves, so that it cannot be worked around? I opened a separate ticket for the error, as requested.
- rasterio on debian/ubuntu requires this curl config rasterio/rasterio#942
We're making binary wheels for Linux that include all the C libraries Rasterio needs for all of the pre-1.0 releases. This is a post about how to use them.
These wheels are not intended for production use on the internet, but should be perfectly adequate for integration testing of Python software that requires Rasterio. They might even be useful for developing prototype services.
The GDAL library included in these wheels is only lightly provisioned with format drivers. The JPEG2000 driver based on Jasper is the only non-default driver. There are no proprietary drivers.
My example: installing Rasterio wheels on Ubuntu 14.04 and performing the little extra configuration needed to access AWS Public Datasets like Landsat on AWS.
Installation
I'm going to use a container based on the Ubuntu 14.04 image in Docker Hub as a host. It has Python 3 installed, but pip, the program we're going to use to install Rasterio, is not installed. Rather than install the python3-pip apt package (possibly requiring apt-get update) and drag in a mess of other dependencies, let's get pip via wget.
Rasterio has a host of extra Python dependencies, so it's always a good idea to install Rasterio applications in a dedicated environment. Create and activate one with virtualenv.
$ pip install virtualenv
$ virtualenv -p python3 venv
$ source venv/bin/activate
Now install Rasterio into the environment using pip, also requesting the optional "s3" set of extra dependencies (boto3 and more).
(venv)$ pip install --pre "rasterio[s3]>=1.0a4"
This fetches the rasterio-1.0a4-cp34-cp34m-manylinux1_x86_64.whl file from the Python Package Index and extracts it into the environment's site-packages directory. A peek into site-packages reveals the included C libraries.
Yes, the libs are big. The wheels are heavy. I'm working on it, I promise.
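The wheel's filename says which interpreter and platform it targets. As a rough sketch, assuming the standard wheel naming scheme (name, version, optional build tag, Python tag, ABI tag, platform tag), the pieces can be pulled apart as follows. `parse_wheel_filename` is a hypothetical helper written for this illustration, not a pip API.

```python
def parse_wheel_filename(filename):
    """Split a wheel filename into its dash-separated tags."""
    suffix = ".whl"
    if not filename.endswith(suffix):
        raise ValueError("not a wheel filename: {!r}".format(filename))
    parts = filename[: -len(suffix)].split("-")
    if len(parts) == 5:
        name, version, python_tag, abi_tag, platform_tag = parts
        build = None
    elif len(parts) == 6:
        # The optional build tag sits between version and Python tag.
        name, version, build, python_tag, abi_tag, platform_tag = parts
    else:
        raise ValueError("unexpected wheel filename: {!r}".format(filename))
    return {
        "name": name,
        "version": version,
        "build": build,
        "python": python_tag,
        "abi": abi_tag,
        "platform": platform_tag,
    }


info = parse_wheel_filename("rasterio-1.0a4-cp34-cp34m-manylinux1_x86_64.whl")
print(info["python"], info["abi"], info["platform"])
```

Here `cp34` indicates CPython 3.4 and `manylinux1_x86_64` indicates a 64-bit Linux build, which is why pip selects this file in the Ubuntu 14.04 container.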
Start a Python interpreter and import rasterio as a last check.
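A last check along those lines might look like the sketch below. `report_versions` is a hypothetical helper, not part of Rasterio; it returns None instead of raising when the package is missing, and it reads `__version__` and `__gdal_version__`, the attribute through which Rasterio reports the version of its bundled GDAL.

```python
import importlib
import importlib.util


def report_versions(module_name="rasterio"):
    """Import a module and report its version attributes, if present."""
    if importlib.util.find_spec(module_name) is None:
        # The package is not installed in this environment.
        return None
    module = importlib.import_module(module_name)
    return {
        "version": getattr(module, "__version__", None),
        "gdal_version": getattr(module, "__gdal_version__", None),
    }


print(report_versions())
```

A result of None means the wheel did not install into the active environment.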
Configuration
Rasterio includes a program named "rio" and its "info" sub-command provides many of the same features as the venerable "gdalinfo" program. Before you can use it to query datasets on S3, you need to do a little extra system configuration.
First, set language and locale environment variables so rio will run properly with Python 3.
Next, specify where to find the SSL certs on your host. Rasterio's libcurl, which is built on CentOS, expects /etc/pki/tls/certs/ca-bundle.crt. Ubuntu's are in a different location.
(venv)$ export CURL_CA_BUNDLE=/etc/ssl/certs/ca-certificates.crt
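The same setting can be applied from Python before Rasterio is imported. The sketch below is a convenience built on assumptions, not something Rasterio provides: `find_ca_bundle` is a hypothetical helper, and its candidate list holds only the two paths named above.

```python
import os

# The two locations mentioned above: CentOS first, then Ubuntu.
CANDIDATES = (
    "/etc/pki/tls/certs/ca-bundle.crt",
    "/etc/ssl/certs/ca-certificates.crt",
)


def find_ca_bundle(candidates=CANDIDATES):
    """Return the first candidate path that exists, or None."""
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None


bundle = find_ca_bundle()
if bundle is not None:
    # Keep any value the user has already exported.
    os.environ.setdefault("CURL_CA_BUNDLE", bundle)
print(os.environ.get("CURL_CA_BUNDLE"))
```

Other distributions may keep their certificates elsewhere, in which case the list would need extending.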
Finally, set up AWS credentials. Rasterio uses boto3 to deal with credentials and these can be configured following the directions in the AWS CLI guide.
Running rio-info
Give an s3-prefixed object identifier, the same kind you would use with the AWS CLI, to rio info with an --indent 2 option to get pretty-printed JSON.
Efficient metadata queries
Access to S3 GeoTIFF metadata is very efficient. Thanks to GDAL's support for HTTP range requests, Rasterio only needs to download 0.03% of the dataset's bytes in order to query its metadata. Turn up the verbosity of rio-info and ask for extra curl logging to see the individual HTTP requests.
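To make the range-request idea concrete, here is a small sketch using only Python's standard library. It builds, but does not send, a request for the first 16384 bytes of an object, the size of the initial request mentioned later in this post. The URL is a placeholder and `range_header` is a hypothetical helper; GDAL issues its own requests through libcurl, not through this code.

```python
from urllib.request import Request


def range_header(start, length):
    """Format an HTTP Range header value for `length` bytes from `start`."""
    if start < 0 or length <= 0:
        raise ValueError("start must be >= 0 and length must be > 0")
    # Byte ranges are inclusive on both ends.
    return "bytes={}-{}".format(start, start + length - 1)


# Placeholder URL, for illustration only.
url = "https://example.com/path/to/scene.tif"
request = Request(url, headers={"Range": range_header(0, 16384)})
print(request.get_header("Range"))
```

A server that honors the header answers with only the requested bytes, which is how a small read at the start of the file can be enough for the metadata.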
The HTTP/1.1 400 Bad Request is in response to probing of the object's folder that GDAL does by default. In a future version of GDAL the probing can be disabled.
Efficient partial data queries
Because the Landsat GeoTIFFs are tiled, subsets of them can be queried for a fraction of the cost of downloading the entire dataset. I'm going to use Rasterio's dataset inspector, rio-insp, to demonstrate. Knowing that the GeoTIFF is tiled and that the tiles are 512 x 512 pixels, I'm going to request a subset corresponding to a single tile in the middle of the raster.
Here are the request details printed to stderr:
And here is the abbreviated representation of the 512 x 512 array in the Python console:
Only about 0.7% of the dataset's bytes have to be read in order to get that subset. If I ask for the tile in the upper left corner, which happens to be all zeros and has been compressed to nearly nothing, there's no additional HTTP request: all the data for that tile was already picked up in the initial 16 KB request and cached by GDAL.
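The tile arithmetic behind that request can be sketched in plain Python. This is a minimal illustration, assuming 512 x 512 tiles as in the Landsat GeoTIFFs above; `tile_window` and `tiles_touched` are hypothetical helper names, not part of Rasterio. The tuples use the same (col_off, row_off, width, height) order as `rasterio.windows.Window`.

```python
def tile_window(tile_col, tile_row, tile_size=512):
    """Return (col_off, row_off, width, height) for one whole tile."""
    return (tile_col * tile_size, tile_row * tile_size, tile_size, tile_size)


def tiles_touched(col_off, row_off, width, height, tile_size=512):
    """List the (tile_col, tile_row) indexes a pixel window overlaps."""
    first_col = col_off // tile_size
    last_col = (col_off + width - 1) // tile_size
    first_row = row_off // tile_size
    last_row = (row_off + height - 1) // tile_size
    return [
        (col, row)
        for row in range(first_row, last_row + 1)
        for col in range(first_col, last_col + 1)
    ]


# A window aligned to the tile grid needs exactly one tile.
aligned = tile_window(2, 2)
print(aligned, len(tiles_touched(*aligned)))

# A window offset from the grid straddles several tiles,
# so more bytes have to be fetched for the same number of pixels.
print(len(tiles_touched(1000, 1000, 512, 512)))
```

With Rasterio installed, `Window(*tile_window(2, 2))` could be passed to `src.read(window=...)`. Keeping windows aligned to the tile grid is what keeps the number of bytes read small.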
That's it for examples in this post. There's more spacewalking to be done with other datasets and other formats. I'll leave that up to you.
See also
The manylinux project is the one that we're closely following to learn how to build these wheels.
The wheel building infrastructure is here: https://github.com/sgillies/frs-wheel-builds.
Feedback is very welcome
Are these useful to you? Can they be more useful with a modest amount of effort? Please let us know.
Thanks for reading!