Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: add support to read, write, list and remove /vsimem/ files #457

Merged
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
37 commits
Select commit Hold shift + click to select a range
7e327b9
ENH: add support to use /vsimem/ input files
theroggy Aug 3, 2024
0c1458b
ENH: make it possible to use /vsimem files
theroggy Aug 9, 2024
8f7733d
Update CHANGES.md
theroggy Aug 9, 2024
949e30f
temporarily add pyproj as dependency in CI
theroggy Aug 9, 2024
72c5248
Try to fix directory creation issue for gdal <3.8
theroggy Aug 9, 2024
4786a2b
Update _io.pyx
theroggy Aug 10, 2024
c7c7a2c
Update _io.pyx
theroggy Aug 10, 2024
4d69b9a
Support vsimem files in a directory for gdal<3.8
theroggy Aug 10, 2024
82b08d4
Textual improvements
theroggy Aug 10, 2024
33f55a6
improve directory name for vsimem fixture
theroggy Aug 10, 2024
8d281cf
Skip arrow write test for gdal < 3.8
theroggy Aug 10, 2024
d9ad3d7
fix vsimem listlayers test
theroggy Aug 10, 2024
b728332
rename is_tmp_vsimem to use_tmp_vsimem
theroggy Aug 10, 2024
3621218
Rename is_tmp_vsi to use_tmp_vsimem
theroggy Aug 10, 2024
1b15004
Rename input params in _ogr.pxd to path
theroggy Aug 10, 2024
504cc3d
Merge remote-tracking branch 'upstream/main' into ENH-make-it-possibl…
theroggy Aug 20, 2024
2c5f522
Merge main
theroggy Aug 21, 2024
149a0ad
sort dependencies (to trigger tests)
theroggy Aug 21, 2024
9870fdf
Update lint.yml
theroggy Aug 21, 2024
37cd30e
Merge remote-tracking branch 'upstream/main' into ENH-make-it-possibl…
theroggy Sep 10, 2024
e4ab48d
Apply feedback: small changes
theroggy Sep 10, 2024
68a595a
Apply feedback: small change
theroggy Sep 10, 2024
cfdfd7c
Fix test
theroggy Sep 10, 2024
a997cef
Let get_ogr_vsimem_write_path return if a tmp vsimem file is used
theroggy Sep 10, 2024
d610f00
Apply feedback, textual changes
theroggy Sep 15, 2024
63a8cb6
Apply feedback: textual change
theroggy Sep 15, 2024
2c11d77
Parametrize test_vsimem_rmtree_error
theroggy Sep 15, 2024
07b500a
Add doc to test_get_vsi_path_or_buffer_obj_to_string
theroggy Sep 15, 2024
8806612
Move vsimem_rmtree_toplevel to util + tests to new test_util.py
theroggy Sep 15, 2024
c3a01e5
Fix test_vsimem_rmtree_error
theroggy Sep 15, 2024
a12f78d
Merge remote-tracking branch 'upstream/main' into ENH-make-it-possibl…
theroggy Sep 16, 2024
774a6e8
Remove 0.9.1 in changelog
theroggy Sep 16, 2024
bf2b455
Feedback: textual changes + merge some vsimem tests
theroggy Sep 16, 2024
bc8edd2
Rollback changes to test_vsimem_rmtree_error
theroggy Sep 16, 2024
6436be1
Update test_core.py
theroggy Sep 16, 2024
bd51c70
Use requires_arrow_write_api to skip write arrow test
theroggy Sep 16, 2024
76cf286
Remove unreachable code regarding use_tmp_vsimem
theroggy Sep 16, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 4 additions & 0 deletions CHANGES.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,6 +2,10 @@

## 0.10.0 (yyyy-mm-dd)

### Improvements

- Add support to read, write, list, and remove `/vsimem/` files (#457)

### Bug fixes

- Silence warning from `write_dataframe` with `GeoSeries.notna()` (#435).
Expand Down
2 changes: 1 addition & 1 deletion docs/source/api.rst
Original file line number Diff line number Diff line change
Expand Up @@ -5,7 +5,7 @@ Core
----

.. automodule:: pyogrio
:members: list_drivers, detect_write_driver, list_layers, read_bounds, read_info, set_gdal_config_options, get_gdal_config_option, __gdal_version__, __gdal_version_string__
:members: list_drivers, detect_write_driver, list_layers, read_bounds, read_info, set_gdal_config_options, get_gdal_config_option, vsi_listtree, vsi_rmtree, vsi_unlink, __gdal_version__, __gdal_version_string__

GeoPandas integration
---------------------
Expand Down
4 changes: 2 additions & 2 deletions environment-dev.yml
Original file line number Diff line number Diff line change
Expand Up @@ -3,13 +3,13 @@ channels:
- conda-forge
dependencies:
# Required
- numpy
- libgdal-core
- numpy
- shapely>=2
# Optional
- geopandas-base
- pyproj
- pyarrow
- pyproj
# Specific for dev
- cython
- pre-commit
Expand Down
10 changes: 8 additions & 2 deletions pyogrio/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -21,6 +21,9 @@
read_bounds,
read_info,
set_gdal_config_options,
vsi_listtree,
vsi_rmtree,
vsi_unlink,
Comment on lines +24 to +26
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we explicitly want to expose those functions publicly (top-level) to our users as well? (are there specific use cases we are thinking of?)
Because if this is mostly for the tests, we can also import from pyogrio.core or pyogrio.util

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I struggled with this as well. Some of them may be useful in order to leverage GDAL's VSI interface. That said, if you are using a different package to create VSI resources (e.g., the gdal package used directly), maybe you should use that other package to handle any related cleanup.

If we do want to keep this public, want to put these in the top-level interface? Maybe we should move the VSI related functions to vsi.py and expose these there, so that they are available, but compartmentalized for advanced usage a bit better?

Copy link
Member Author

@theroggy theroggy Sep 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is practical/logical when it is possible to create files in e.g. /vsimem/ that the functions to remove them again are available as well without having to install another package...

Putting them in a seperate interface is possible, but I don't really see an obvious advantage. pyogrio doesn't have that many functions, and I was quite happy that the hierarchy was removed in shapely following the "Flat is better than nested" philosophy :-).

)
from pyogrio.geopandas import read_dataframe, write_dataframe
from pyogrio.raw import open_arrow, read_arrow, write_arrow
Expand All @@ -37,10 +40,13 @@
"set_gdal_config_options",
"get_gdal_config_option",
"get_gdal_data_path",
"read_arrow",
"open_arrow",
"write_arrow",
"read_arrow",
"read_dataframe",
"vsi_listtree",
"vsi_rmtree",
"vsi_unlink",
"write_arrow",
"write_dataframe",
"__gdal_version__",
"__gdal_version_string__",
Expand Down
91 changes: 50 additions & 41 deletions pyogrio/_io.pyx
Original file line number Diff line number Diff line change
Expand Up @@ -12,6 +12,7 @@ import math
import os
import sys
import warnings
from pathlib import Path

from libc.stdint cimport uint8_t, uintptr_t
from libc.stdlib cimport malloc, free
Expand Down Expand Up @@ -1184,7 +1185,7 @@ def ogr_read(
):

cdef int err = 0
cdef bint is_vsimem = isinstance(path_or_buffer, bytes)
cdef bint use_tmp_vsimem = isinstance(path_or_buffer, bytes)
cdef const char *path_c = NULL
cdef char **dataset_options = NULL
cdef const char *where_c = NULL
Expand Down Expand Up @@ -1224,7 +1225,7 @@ def ogr_read(
raise ValueError("'max_features' must be >= 0")

try:
path = read_buffer_to_vsimem(path_or_buffer) if is_vsimem else path_or_buffer
path = read_buffer_to_vsimem(path_or_buffer) if use_tmp_vsimem else path_or_buffer

if encoding:
# for shapefiles, SHAPE_ENCODING must be set before opening the file
Expand Down Expand Up @@ -1362,8 +1363,8 @@ def ogr_read(
CPLFree(<void*>prev_shape_encoding)
prev_shape_encoding = NULL

if is_vsimem:
delete_vsimem_file(path)
if use_tmp_vsimem:
vsimem_rmtree_toplevel(path)

return (
meta,
Expand Down Expand Up @@ -1424,7 +1425,7 @@ def ogr_open_arrow(
):

cdef int err = 0
cdef bint is_vsimem = isinstance(path_or_buffer, bytes)
cdef bint use_tmp_vsimem = isinstance(path_or_buffer, bytes)
cdef const char *path_c = NULL
cdef char **dataset_options = NULL
cdef const char *where_c = NULL
Expand Down Expand Up @@ -1480,7 +1481,7 @@ def ogr_open_arrow(

reader = None
try:
path = read_buffer_to_vsimem(path_or_buffer) if is_vsimem else path_or_buffer
path = read_buffer_to_vsimem(path_or_buffer) if use_tmp_vsimem else path_or_buffer

if encoding:
override_shape_encoding = True
Expand Down Expand Up @@ -1679,8 +1680,8 @@ def ogr_open_arrow(
CPLFree(<void*>prev_shape_encoding)
prev_shape_encoding = NULL

if is_vsimem:
delete_vsimem_file(path)
if use_tmp_vsimem:
vsimem_rmtree_toplevel(path)


def ogr_read_bounds(
Expand All @@ -1697,7 +1698,7 @@ def ogr_read_bounds(
object mask=None):

cdef int err = 0
cdef bint is_vsimem = isinstance(path_or_buffer, bytes)
cdef bint use_tmp_vsimem = isinstance(path_or_buffer, bytes)
cdef const char *path_c = NULL
cdef const char *where_c = NULL
cdef OGRDataSourceH ogr_dataset = NULL
Expand All @@ -1715,7 +1716,7 @@ def ogr_read_bounds(
raise ValueError("'max_features' must be >= 0")

try:
path = read_buffer_to_vsimem(path_or_buffer) if is_vsimem else path_or_buffer
path = read_buffer_to_vsimem(path_or_buffer) if use_tmp_vsimem else path_or_buffer
ogr_dataset = ogr_open(path.encode('UTF-8'), 0, NULL)

if layer is None:
Expand Down Expand Up @@ -1744,8 +1745,8 @@ def ogr_read_bounds(
GDALClose(ogr_dataset)
ogr_dataset = NULL

if is_vsimem:
delete_vsimem_file(path)
if use_tmp_vsimem:
vsimem_rmtree_toplevel(path)

return bounds

Expand All @@ -1758,7 +1759,7 @@ def ogr_read_info(
int force_feature_count=False,
int force_total_bounds=False):

cdef bint is_vsimem = isinstance(path_or_buffer, bytes)
cdef bint use_tmp_vsimem = isinstance(path_or_buffer, bytes)
cdef const char *path_c = NULL
cdef char **dataset_options = NULL
cdef OGRDataSourceH ogr_dataset = NULL
Expand All @@ -1767,7 +1768,7 @@ def ogr_read_info(
cdef bint override_shape_encoding = False

try:
path = read_buffer_to_vsimem(path_or_buffer) if is_vsimem else path_or_buffer
path = read_buffer_to_vsimem(path_or_buffer) if use_tmp_vsimem else path_or_buffer

if encoding:
override_shape_encoding = True
Expand Down Expand Up @@ -1826,19 +1827,19 @@ def ogr_read_info(
if prev_shape_encoding != NULL:
CPLFree(<void*>prev_shape_encoding)

if is_vsimem:
delete_vsimem_file(path)
if use_tmp_vsimem:
vsimem_rmtree_toplevel(path)

return meta


def ogr_list_layers(object path_or_buffer):
cdef bint is_vsimem = isinstance(path_or_buffer, bytes)
cdef bint use_tmp_vsimem = isinstance(path_or_buffer, bytes)
cdef const char *path_c = NULL
cdef OGRDataSourceH ogr_dataset = NULL

try:
path = read_buffer_to_vsimem(path_or_buffer) if is_vsimem else path_or_buffer
path = read_buffer_to_vsimem(path_or_buffer) if use_tmp_vsimem else path_or_buffer
ogr_dataset = ogr_open(path.encode('UTF-8'), 0, NULL)
layers = get_layer_names(ogr_dataset)

Expand All @@ -1847,8 +1848,8 @@ def ogr_list_layers(object path_or_buffer):
GDALClose(ogr_dataset)
ogr_dataset = NULL

if is_vsimem:
delete_vsimem_file(path)
if use_tmp_vsimem:
vsimem_rmtree_toplevel(path)

return layers

Expand Down Expand Up @@ -1931,6 +1932,16 @@ cdef void * ogr_create(const char* path_c, const char* driver_c, char** options)
except CPLE_BaseError as exc:
raise DataSourceError(str(exc))

# For /vsimem/ files, with GDAL >= 3.8 parent directories are created automatically.
IF CTE_GDAL_VERSION < (3, 8, 0):
path = path_c.decode("UTF-8")
if "/vsimem/" in path:
parent = str(Path(path).parent.as_posix())
if not parent.endswith("/vsimem"):
retcode = VSIMkdirRecursive(parent.encode("UTF-8"), 0666)
if retcode != 0:
raise OSError(f"Could not create parent directory '{parent}'")

# Create the dataset
try:
ogr_dataset = exc_wrap_pointer(GDALCreate(ogr_driver, path_c, 0, 0, 0, GDT_Unknown, options))
Expand Down Expand Up @@ -2014,7 +2025,7 @@ cdef infer_field_types(list dtypes):

cdef create_ogr_dataset_layer(
str path,
bint is_vsi,
bint use_tmp_vsimem,
str layer,
str driver,
str crs,
Expand Down Expand Up @@ -2048,6 +2059,8 @@ cdef create_ogr_dataset_layer(
encoding : str
Only used if `driver` is "ESRI Shapefile". If not None, it overrules the default
shapefile encoding, which is "UTF-8" in pyogrio.
use_tmp_vsimem : bool
Whether the file path is meant to save a temporary memory file to.

Returns
-------
Expand Down Expand Up @@ -2075,8 +2088,8 @@ cdef create_ogr_dataset_layer(
driver_b = driver.encode('UTF-8')
driver_c = driver_b

# in-memory dataset is always created from scratch
path_exists = os.path.exists(path) if not is_vsi else False
# temporary in-memory dataset is always created from scratch
path_exists = os.path.exists(path) if not use_tmp_vsimem else False

if not layer:
layer = os.path.splitext(os.path.split(path)[1])[0]
Expand Down Expand Up @@ -2112,10 +2125,7 @@ cdef create_ogr_dataset_layer(
raise exc

# otherwise create from scratch
if is_vsi:
VSIUnlink(path_c)
else:
os.unlink(path)
os.unlink(path)

ogr_dataset = NULL

Expand Down Expand Up @@ -2250,7 +2260,7 @@ def ogr_write(
cdef int num_records = -1
cdef int num_field_data = len(field_data) if field_data is not None else 0
cdef int num_fields = len(fields) if fields is not None else 0
cdef bint is_vsi = False
cdef bint use_tmp_vsimem = False

if num_fields != num_field_data:
raise ValueError("field_data array needs to be same length as fields array")
Expand Down Expand Up @@ -2291,12 +2301,11 @@ def ogr_write(

try:
# Setup in-memory handler if needed
path = get_ogr_vsimem_write_path(path_or_fp, driver)
is_vsi = path.startswith('/vsimem/')
path, use_tmp_vsimem = get_ogr_vsimem_write_path(path_or_fp, driver)

# Setup dataset and layer
layer_created = create_ogr_dataset_layer(
path, is_vsi, layer, driver, crs, geometry_type, encoding,
path, use_tmp_vsimem, layer, driver, crs, geometry_type, encoding,
dataset_kwargs, layer_kwargs, append,
dataset_metadata, layer_metadata,
&ogr_dataset, &ogr_layer,
Expand Down Expand Up @@ -2501,7 +2510,7 @@ def ogr_write(
raise DataSourceError(f"Failed to write features to dataset {path}; {exc}")

# copy in-memory file back to path_or_fp object
if is_vsi:
if use_tmp_vsimem:
read_vsimem_to_buffer(path, path_or_fp)

finally:
Expand All @@ -2523,8 +2532,8 @@ def ogr_write(
if ogr_dataset != NULL:
ogr_close(ogr_dataset)

if is_vsi:
delete_vsimem_file(path)
if use_tmp_vsimem:
vsimem_rmtree_toplevel(path)


def ogr_write_arrow(
Expand All @@ -2548,7 +2557,7 @@ def ogr_write_arrow(
cdef OGRDataSourceH ogr_dataset = NULL
cdef OGRLayerH ogr_layer = NULL
cdef char **options = NULL
cdef bint is_vsi = False
cdef bint use_tmp_vsimem = False
cdef ArrowArrayStream* stream = NULL
cdef ArrowSchema schema
cdef ArrowArray array
Expand All @@ -2557,11 +2566,11 @@ def ogr_write_arrow(
array.release = NULL

try:
path = get_ogr_vsimem_write_path(path_or_fp, driver)
is_vsi = path.startswith('/vsimem/')
# Setup in-memory handler if needed
path, use_tmp_vsimem = get_ogr_vsimem_write_path(path_or_fp, driver)

layer_created = create_ogr_dataset_layer(
path, is_vsi, layer, driver, crs, geometry_type, encoding,
path, use_tmp_vsimem, layer, driver, crs, geometry_type, encoding,
dataset_kwargs, layer_kwargs, append,
dataset_metadata, layer_metadata,
&ogr_dataset, &ogr_layer,
Expand Down Expand Up @@ -2622,7 +2631,7 @@ def ogr_write_arrow(
raise DataSourceError(f"Failed to write features to dataset {path}; {exc}")

# copy in-memory file back to path_or_fp object
if is_vsi:
if use_tmp_vsimem:
read_vsimem_to_buffer(path, path_or_fp)

finally:
Expand All @@ -2642,8 +2651,8 @@ def ogr_write_arrow(
if ogr_dataset != NULL:
ogr_close(ogr_dataset)

if is_vsi:
delete_vsimem_file(path)
if use_tmp_vsimem:
vsimem_rmtree_toplevel(path)


cdef get_arrow_extension_metadata(const ArrowSchema* schema):
Expand Down
10 changes: 9 additions & 1 deletion pyogrio/_ogr.pxd
Original file line number Diff line number Diff line change
Expand Up @@ -36,6 +36,10 @@ cdef extern from "cpl_error.h" nogil:
void CPLPopErrorHandler()


cdef extern from "cpl_port.h":
ctypedef char **CSLConstList


cdef extern from "cpl_string.h":
char** CSLAddNameValue(char **list, const char *name, const char *value)
char** CSLSetNameValue(char **list, const char *name, const char *value)
Expand All @@ -53,6 +57,9 @@ cdef extern from "cpl_vsi.h" nogil:
long st_mode
int st_mtime

int VSIStatL(const char *path, VSIStatBufL *psStatBuf)
int VSI_ISDIR(int mode)
char** VSIReadDirRecursive(const char *path)
int VSIFCloseL(VSILFILE *fp)
int VSIFFlushL(VSILFILE *fp)
int VSIUnlink(const char *path)
Expand All @@ -61,7 +68,8 @@ cdef extern from "cpl_vsi.h" nogil:
unsigned char *VSIGetMemFileBuffer(const char *path, vsi_l_offset *data_len, int take_ownership)

int VSIMkdir(const char *path, long mode)
int VSIRmdirRecursive(const char *pszDirname)
int VSIMkdirRecursive(const char *path, long mode)
int VSIRmdirRecursive(const char *path)


cdef extern from "ogr_core.h":
Expand Down
4 changes: 2 additions & 2 deletions pyogrio/_vsi.pxd
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
cdef str get_ogr_vsimem_write_path(object path_or_fp, str driver)
cdef tuple get_ogr_vsimem_write_path(object path_or_fp, str driver)
cdef str read_buffer_to_vsimem(bytes bytes_buffer)
cdef read_vsimem_to_buffer(str path, object out_buffer)
cdef delete_vsimem_file(str path)
cpdef vsimem_rmtree_toplevel(str path)
Loading