From 77eca765dd4fbf9212e9dbec684c007bcf9fce7f Mon Sep 17 00:00:00 2001 From: Michael Dorman Date: Thu, 3 Oct 2024 18:50:36 +0300 Subject: [PATCH] improve ch07 --- 07-read-write.qmd | 74 +++++++++++++++++++++++------------------------ geocompr.bib | 2 +- 2 files changed, 38 insertions(+), 38 deletions(-) diff --git a/07-read-write.qmd b/07-read-write.qmd index 7df56e78..535e04a5 100644 --- a/07-read-write.qmd +++ b/07-read-write.qmd @@ -51,7 +51,7 @@ Data output is also vital, enabling others to use valuable new or improved datas Taken together, these processes of input/output can be referred to as data I/O. Geographic data I/O is often done with few lines of code at the beginning and end of projects. -It is often overlooked as a simple one step process. +It is often overlooked as a simple one-step process. However, mistakes made at the outset of projects (e.g., using an out-of-date or in some way faulty dataset) can lead to large problems later down the line, so it is worth putting considerable time into identifying which datasets are available, where they can be found and how to retrieve them. These topics are covered in @sec-retrieving-open-data, which describes several geoportals, which collectively contain many terabytes of data, and how to use them. To further ease data access, a number of packages for downloading geographic data have been developed, as demonstrated in @sec-geographic-data-packages. @@ -157,11 +157,11 @@ The resulting layer `counties` is shown in @fig-ne-counties. counties.plot(); ``` -Note that @fig-ne-counties x-axis spans the entire range of longitues, between `-180` and `180`, since the Aleutian Islands county (which is small and difficult to see on the map) crosses the International Date Line. +Note that @fig-ne-counties x-axis spans the entire range of longitudes, between `-180` and `180`, since the Aleutian Islands county (which is small and difficult to see on the map) crosses the International Date Line. 
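The stretched x-axis is a generic antimeridian artifact that can be reproduced without any geographic data: two vertices sitting just on either side of the 180° meridian yield a naive longitude range spanning almost the whole globe. A standalone sketch (the coordinates are illustrative, not taken from the counties layer):

```python
# Two points roughly 2 degrees apart, but on opposite sides of the antimeridian
lons = [179.0, -179.0]

# A naive bounding range treats them as almost 358 degrees apart,
# which is why the plot's x-axis stretches across the whole globe
span = max(lons) - min(lons)
print(min(lons), max(lons), span)  # -179.0 179.0 358.0
```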
-Other layers can from NaturalEarth be accessed the same way.
-You need to specify the `resolution`, `category`, and `name` of the requested dataset in Natural Earth Data, then run the `cartopy.io.shapereader.natural_earth`, which downloads the file(s) and returns the path, and read the file into the Python environment, e.g., using `gpd.read_file`.
+Other layers from Natural Earth can be accessed the same way.
+You need to specify the `resolution`, `category`, and `name` of the requested dataset in Natural Earth Data, then run the `cartopy.io.shapereader.natural_earth` function, which downloads the file(s) and returns the path, and read the file into the Python environment, e.g., using `gpd.read_file`.
-This is an alternative approach to "directly" downloading files as shown earlier (@sec-retrieving-open-data).
+This is an alternative approach to 'directly' downloading files as shown earlier (@sec-retrieving-open-data).

The second example uses the **osmnx** package [@osmnx] to find parks from the OpenStreetMap (OSM) database.
As illustrated in the code chunk below, OpenStreetMap data can be obtained using the `ox.features.features_from_place` function.
@@ -179,7 +179,7 @@ parks = ox.features.features_from_place(
```

The result is a `GeoDataFrame` with the parks in Leeds.
-Now, we can plots the geometries with the `name` property in the tooltips using `explore` (@fig-ox-features).
+Now, we can plot the geometries with the `name` property in the tooltips using `explore` (@fig-ox-features).

::: {.content-visible when-format="html"}
```{python}
@@ -233,7 +233,7 @@ result = gpd.tools.geocode('54 Frith St, London W1D 4SJ, UK', timeout=10)
result
```

-Importantly, (1) we can pass a `list` of multiple addresses instead of just one, resulting in a `GeoDataFrame` with corresponding multiple rows, and (2) "No Results" responses are represented by `POINT EMPTY` geometries, as shown in the following example.
+Importantly, (1) we can pass a `list` of multiple addresses instead of just one, resulting in a `GeoDataFrame` with corresponding multiple rows, and (2) 'No Results' responses are represented by `POINT EMPTY` geometries, as shown in the following example.

```{python}
result = gpd.tools.geocode(
@@ -273,7 +273,7 @@ Geographic datasets are usually stored as files or in spatial databases.
-File formats usually can either store vector or raster data, while spatial databases such as PostGIS can store both.
+File formats can usually store either vector or raster data, while spatial databases such as PostGIS can store both.
The large variety of file formats may seem bewildering, but there has been much consolidation and standardization since the beginnings of GIS software in the 1960s when the first widely distributed program SYMAP for spatial analysis was created at Harvard University [@coppock_history_1991].
-GDAL (which originally was pronounced as "goo-dal", with the double "o" making a reference to object-orientation), the Geospatial Data Abstraction Library, has resolved many issues associated with incompatibility between geographic file formats since its release in 2000.
+GDAL (which originally was pronounced as 'goo-dal', with the double 'o' making a reference to object-orientation), the Geospatial Data Abstraction Library, has resolved many issues associated with incompatibility between geographic file formats since its release in 2000.
GDAL provides a unified and high-performance interface for reading and writing of many raster and vector data formats.
Many open and proprietary GIS programs, including GRASS, ArcGIS and QGIS, use GDAL behind their GUIs for doing the legwork of ingesting and spitting out geographic data in appropriate formats.
Most Python packages for working with spatial data, including **geopandas** and **rasterio** used in this book, also rely on GDAL for importing and exporting spatial data files.
@@ -316,7 +316,7 @@ The GeoTIFF format seems to be the most prominent raster data format.
It allows spatial information, such as the CRS definition and the transformation matrix (see @sec-using-rasterio), to be embedded within a TIFF file.
-Similar to ESRI Shapefile, this format was firstly developed in the 1990s, but as an open format.
+Similar to ESRI Shapefile, this format was first developed in the 1990s, but as an open format.
Additionally, GeoTIFF is still being expanded and improved.
-One of the most significant recent addition to the GeoTIFF format is its variant called COG (Cloud Optimized GeoTIFF).
+One of the most significant recent additions to the GeoTIFF format is its variant called COG (Cloud Optimized GeoTIFF).
Raster objects saved as COGs can be hosted on HTTP servers, so other people can read only parts of the file without downloading the whole file (@sec-input-raster).

-There is also a plethora of other spatial data formats that we do not explain in detail or mention in @tbl-file-formats due to the book limits.
+There is also a plethora of other spatial data formats that we do not explain in detail or mention in @tbl-file-formats due to space limits.
@@ -337,7 +337,7 @@ The latter is the most straightforward approach, suitable when RAM is not a limi
For large vector layers and rasters, partial reading may be required.
For vector layers, we will demonstrate how to read subsets of vector layers, filtered by attributes or by location (@sec-input-vector).
For rasters, we already showed earlier in the book how the user can choose which specific bands to read (@sec-using-rasterio), or read resampled data to a lower resolution (@sec-raster-agg-disagg).
-In this section, we also show how to read specific rectangular extents ("windows") from a raster file (@sec-input-raster).
+In this section, we also show how to read specific rectangular extents ('windows') from a raster file (@sec-input-raster).

### Vector data {#sec-input-vector}

@@ -405,13 +405,13 @@ info['dtypes']
The second mechanism uses the `mask` argument to filter data based on intersection with an existing geometry.
This argument expects a geometry (`GeoDataFrame`, `GeoSeries`, or `shapely` geometry) representing the area where we want to extract the data.
-Let's try it using a small example---we want to read polygons from our file that intersect with the buffer of 50,000 $m$ of Tanzania's borders.
+Let's try it using a small example---we want to read polygons from our file that intersect with a buffer of 50,000 $m$ around Tanzania's borders.
-To do it, we need to transform the geometry to a projected CRS (such as `EPSG:32736`), prepare our "filter" by creating the buffer (@sec-buffers), and transform back to the original CRS to be used as a mask (@fig-read-shp-query (a)).
+To do it, we need to transform the geometry to a projected CRS (such as `EPSG:32736`), prepare our 'filter' by creating the buffer (@sec-buffers), and transform back to the original CRS to be used as a mask (@fig-read-shp-query (a)). ```{python} tanzania_buf = tanzania.to_crs(32736).buffer(50000).to_crs(4326) ``` -Now, we can pass the "filter" geometry `tanzania_buf` to the `mask` argument of `gpd.read_file`. +Now, we can pass the 'filter' geometry `tanzania_buf` to the `mask` argument of `gpd.read_file`. ```{python} tanzania_neigh = gpd.read_file('data/world.gpkg', mask=tanzania_buf) @@ -550,7 +550,7 @@ ymax=70 ``` Using the extent coordinates along with the raster transformation matrix, we create a window object, using the `rasterio.windows.from_bounds` function. -This function basically "translates" the extent from coordinates, to row/column ranges. +This function basically 'translates' the extent from coordinates, to row/column ranges. ```{python} w = rasterio.windows.from_bounds( @@ -636,7 +636,7 @@ Note, that if you try to write to the same data source again, the function will world.to_file('output/world.gpkg') ``` -Instead of overwriting the file, we could add new rows to the file with `mode='a'` ("append" mode, as opposed to the default `mode='w'` for the "write" mode). +Instead of overwriting the file, we could add new rows to the file with `mode='a'` ('append' mode, as opposed to the default `mode='w'` for the 'write' mode). Appending is supported by several spatial formats, including GeoPackage. ```{python} @@ -644,7 +644,7 @@ world.to_file('output/w_many_features.gpkg') world.to_file('output/w_many_features.gpkg', mode='a') ``` -Now, `w_many_features.gpkg` contains a polygonal layer named `world` with two "copies" of each country (that is 177×2=354 features, whereas the `world` layer has 177 features). 
+Now, `w_many_features.gpkg` contains a polygonal layer named `world` with two 'copies' of each country (that is 177×2=354 features, whereas the `world` layer has 177 features). ```{python} #| warning: false @@ -658,8 +658,8 @@ world.to_file('output/w_many_layers.gpkg') world.to_file('output/w_many_layers.gpkg', layer='world2') ``` -In this case, `w_many_layers.gpkg` has two "layers": `w_many_layers` (same as the file name, when `layer` is unspecified) and `world2`. -Incidentally, the contents of the two layers is identical, but this does not have to be. +In this case, `w_many_layers.gpkg` has two 'layers': `w_many_layers` (same as the file name, when `layer` is unspecified) and `world2`. +Incidentally, the contents of the two layers are identical, but this does not have to be. Each layer from such a file can be imported separately using the `layer` argument of `gpd.read_file`. ```{python} @@ -680,7 +680,7 @@ As opposed to reading mode (`'r'`, the default) mode, the `rasterio.open` functi - `height`---Number of rows - `width`---Number of columns - `count`---Number of bands -- `nodata`---The value which represents "No Data", if any +- `nodata`---The value which represents 'No Data', if any - `dtype`---The raster data type, one of **numpy** types supported by the `driver` (e.g., `np.int64`) (see @tbl-numpy-data-types) - `crs`---The CRS, e.g., using an EPSG code (such as `4326`) - `transform`---The transform matrix @@ -698,7 +698,7 @@ Some functions, such as `rasterio.warp.reproject` used for resampling and reproj Most of the properties are either straightforward to choose, based on our aims, (e.g., `driver`, `crs`, `compress`, `nodata`), or directly derived from the array with the raster values itself (e.g., `height`, `width`, `count`, `dtype`). The most complicated property is the `transform`, which specifies the raster origin and resolution. 
-The `transform` is typically either obtained from an existing raster (serving as a "template"), created from scratch based on manually specified origin and resolution values (e.g., using `rasterio.transform.from_origin`), or calculated automatically (e.g., using `rasterio.warp.calculate_default_transform`), as shown in previous chapters. +The `transform` is typically either obtained from an existing raster (serving as a 'template'), created from scratch based on manually specified origin and resolution values (e.g., using `rasterio.transform.from_origin`), or calculated automatically (e.g., using `rasterio.warp.calculate_default_transform`), as shown in previous chapters. Earlier in the book, we have already demonstrated five common scenarios of writing rasters, covering the above-mentioned considerations: @@ -769,12 +769,12 @@ dst.close() These expressions, taken together, create a new file `output/r.tif`, which is a $2 \times 2$ raster, having a 2 decimal degree resolution, with the top-left corner placed over London. -To make the picture of raster export complete, there are three important concepts we have not covered yet: array and raster data types, writing multiband rasters, and handling "No Data" values. +To make the picture of raster export complete, there are three important concepts we have not covered yet: array and raster data types, writing multiband rasters, and handling 'No Data' values. Arrays (i.e., `ndarray` objects defined in package **numpy**) are used to store raster values when reading them from file, using `.read` (@sec-using-rasterio). All values in an array are of the same type, whereas the **numpy** package supports numerous numeric data types of various precision (and, accordingly, memory footprint). Raster formats, such as GeoTIFF, support (a subset of) exactly the same data types as **numpy**, which means that reading a raster file uses as little RAM as possible. 
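The memory point above is easy to verify with plain **numpy** (a standalone sketch, not part of the chapter's code): the same raster shape occupies very different amounts of RAM depending on the chosen data type.

```python
import numpy as np

# The same 1000x1000 raster shape under three dtypes;
# .nbytes shows why picking the smallest sufficient dtype saves RAM
shape = (1000, 1000)
for dtype in (np.uint8, np.int32, np.float64):
    a = np.zeros(shape, dtype=dtype)
    print(dtype.__name__, a.nbytes)  # 1000000, 4000000, 8000000 bytes
```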
-The most useful types for raster data, and thir support in GeoTIFF are summarized in @tbl-numpy-data-types. +The most useful types for raster data, and their support in GeoTIFF are summarized in @tbl-numpy-data-types. | Data type | Description | GeoTIFF | |-----------|----------------------------------------------------------------------|:--------:| @@ -834,7 +834,7 @@ dst_kwds.update(count=3) dst_kwds ``` -Finally, we can create a file connection using the updated metadata, write the values of the three bands, and close the connection (note that we are switching to the "keyword argument" syntax of Python function calls here; see note in @sec-raster-agg-disagg). +Finally, we can create a file connection using the updated metadata, write the values of the three bands, and close the connection (note that we are switching to the 'keyword argument' syntax of Python function calls here; see note in @sec-raster-agg-disagg). ```{python} dst = rasterio.open('output/r3.tif', 'w', **dst_kwds) @@ -846,16 +846,16 @@ dst.close() As a result, a three-band raster named `r3.tif` is created. -Rasters often contain "No Data" values, representing missing data, for example, unreliable measurements due to clouds or pixels outside of the photographed extent. -In a **numpy** `ndarray` object, "No Data" values may be represented by the special `np.nan` value. +Rasters often contain 'No Data' values, representing missing data, for example, unreliable measurements due to clouds or pixels outside of the photographed extent. +In a **numpy** `ndarray` object, 'No Data' values may be represented by the special `np.nan` value. However, due to computer memory limitations, only arrays of type `float` can contain `np.nan`, while arrays of type `int` cannot. -For `int` rasters containing "No Data", we typically mark missing data with a specific value beyond the valid range (e.g., `-9999`). 
-The missing data "flag" definition is stored in the file (set through the `nodata` property of the file connection, see above). -When reading an `int` raster with "No Data" back into Python, we need to be aware of the flag, if any. +For `int` rasters containing 'No Data', we typically mark missing data with a specific value beyond the valid range (e.g., `-9999`). +The missing data 'flag' definition is stored in the file (set through the `nodata` property of the file connection, see above). +When reading an `int` raster with 'No Data' back into Python, we need to be aware of the flag, if any. Let's demonstrate it through examples. We will start with the simpler case, rasters of type `float`. -Since `float` arrays may contain the "native" value `np.nan`, representing "No Data" is straightforward. +Since `float` arrays may contain the 'native' value `np.nan`, representing 'No Data' is straightforward. For example, suppose that we have a `float` array of size $2 \times 2$ containing one `np.nan` value. ```{python} @@ -867,7 +867,7 @@ r r.dtype ``` -When writing this type of array to a raster file, we do not need to specify any particular `nodata` "flag" value. +When writing this type of array to a raster file, we do not need to specify any particular `nodata` 'flag' value. ```{python} dst = rasterio.open( @@ -896,7 +896,7 @@ Reading from the raster back into the Python session reproduces the same exact a rasterio.open('output/r_nodata_float.tif').read() ``` -Now, conversely, suppose that we have an `int` array with missing data, where the "missing" value must inevitably be marked using a specific `int` "flag" value, such as `-9999` (remember that we can't store `np.nan` in an `int` array!). +Now, conversely, suppose that we have an `int` array with missing data, where the 'missing' value must inevitably be marked using a specific `int` 'flag' value, such as `-9999` (remember that we can't store `np.nan` in an `int` array!). 
```{python} r = np.array([1,2,-9999,4]).reshape(2,2).astype(np.int32) @@ -907,7 +907,7 @@ r r.dtype ``` -When writing the array to file, we must specify `nodata=-9999` to keep track of our "No Data" flag. +When writing the array to file, we must specify `nodata=-9999` to keep track of our 'No Data' flag. ```{python} dst = rasterio.open( @@ -940,14 +940,14 @@ r = src.read() r ``` -The Python user must therefore be mindful of "No Data" `int` rasters, for example to avoid interpreting the value `-9999` literally. -For instance, if we "forget" about the `nodata` flag, the literal calculation of the `.mean` would incorrectly include the value `-9999`. +The Python user must therefore be mindful of 'No Data' `int` rasters, for example to avoid interpreting the value `-9999` literally. +For instance, if we 'forget' about the `nodata` flag, the literal calculation of the `.mean` would incorrectly include the value `-9999`. ```{python} r.mean() ``` -There are two basic ways to deal with the situation: either converting the raster to `float`, or using a "No Data" mask. +There are two basic ways to deal with the situation: either converting the raster to `float`, or using a 'No Data' mask. The first approach, simple and particularly relevant for small rasters where memory constraints are irrelevant, is to go from `int` to `float`, to gain the ability of the natural `np.nan` representation. Here is how we can do this with `r_nodata_int.tif`. We detect the missing data flag, convert the raster to `float`, then assign `np.nan` into the cells that are supposed to be missing. @@ -959,14 +959,14 @@ r[mask] = np.nan r ``` -From there on, we deal with `np.nan` the usual way, such as using `np.nanmean` to calculate the mean excluding "No Data". +From there on, we deal with `np.nan` the usual way, such as using `np.nanmean` to calculate the mean excluding 'No Data'. 
```{python} np.nanmean(r) ``` -The second approach is to read the values into a so-called *"masked" array*, using the argument `masked=True` of the `.read` method. -A masked array can be thought of as an extended `ndarray`, with two components: `.data` (the values) and `.mask` (a corresponding boolean array marking "No Data" values). +The second approach is to read the values into a so-called *'masked' array*, using the argument `masked=True` of the `.read` method. +A masked array can be thought of as an extended `ndarray`, with two components: `.data` (the values) and `.mask` (a corresponding boolean array marking 'No Data' values). ```{python} r = src.read(masked=True) @@ -974,7 +974,7 @@ r ``` Complete treatment of masked arrays is beyond the scope of this book. -However, the basic idea is that many **numpy** operations "honor" the mask, so that the user does not have to keep track of the way that "No Data" values are marked, similarly to the natural `np.nan` representation and regardless of the data type. +However, the basic idea is that many **numpy** operations 'honor' the mask, so that the user does not have to keep track of the way that 'No Data' values are marked, similarly to the natural `np.nan` representation and regardless of the data type. For example, the `.mean` of a masked array ignores the value `-9999`, because it is masked, taking into account just the valid values `1`, `2`, and `4`. ```{python} @@ -984,7 +984,7 @@ r.mean() Switching to `float` and assigning `np.nan` is the simpler approach, since that way we can keep working with the familiar `ndarray` data structure for all raster types, whether `int` or `float`. Nevertheless, learning how to work with masked arrays can be beneficial when we have good reasons to keep our raster data in `int` arrays (for example, due to RAM limits) and still perform operations that take missing values into account. 
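A masked array does not have to come from a file connection; it can also be constructed directly in **numpy**. A minimal standalone sketch, reusing the chapter's illustrative values and the `-9999` flag:

```python
import numpy as np

# Build a masked array directly from an int array with a -9999 "No Data" flag;
# np.ma.masked_equal masks every cell equal to the given value
values = np.array([[1, 2], [-9999, 4]], dtype=np.int32)
masked = np.ma.masked_equal(values, -9999)

print(masked.mean())   # mean of the valid cells 1, 2, and 4 only
print(masked.count())  # 3 valid (unmasked) cells
```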
-Finally, keep in mind that, confusingly, `float` rasters may represent "No Data" using a specific "flag" (such as `-9999.0`), instead, or in addition to (!), the native `np.nan` representation.
+Finally, keep in mind that, confusingly, `float` rasters may represent 'No Data' using a specific 'flag' (such as `-9999.0`), instead of, or in addition to (!), the native `np.nan` representation.
In such cases, the same considerations shown for `int` apply to `float` rasters as well.

diff --git a/geocompr.bib b/geocompr.bib
index 9fb3d366..180c6a1e 100644
--- a/geocompr.bib
+++ b/geocompr.bib
@@ -2602,7 +2602,7 @@ @ARTICLE{scipy
@manual{cartopy,
author = {{Met Office}},
title = {Cartopy: a cartographic python library with a Matplotlib interface},
-year = {2010 - 2015},
+year = {2010-2015},
address = {Exeter, Devon },
url = {https://scitools.org.uk/cartopy}
}
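One last sketch related to the chapter's closing caveat about `float` rasters that carry a 'flag' (such as `-9999.0`) on top of, or instead of, the native `np.nan`: such rasters can be normalized to a single representation in one step. The values and the flag below are assumed for illustration (in practice the flag would come from the file's `nodata` property):

```python
import numpy as np

# A float array mixing a -9999.0 "No Data" flag with a native np.nan
r = np.array([[1.5, -9999.0], [np.nan, 4.5]])

# Normalize: turn flagged cells into np.nan so only one representation remains
r[r == -9999.0] = np.nan

print(np.nanmean(r))  # 3.0, the mean of the two valid cells
```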