BUG: pyogrio doesn't like io.BytesIO? #3260

Open
3 tasks done
bretttully opened this issue Apr 22, 2024 · 10 comments

@bretttully
Contributor

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of geopandas.
  • (optional) I have confirmed this bug exists on the main branch of geopandas.

Note: Please read this guide detailing how to provide the necessary information for us to reproduce your bug.

Code Sample, a copy-pastable example

import io

from shapely import Polygon

import geopandas as gpd
from geopandas.testing import assert_geodataframe_equal

gpd.options.io_engine = "pyogrio"

data = gpd.GeoDataFrame(
    [
        {"foo": 1, "bar": "a", "geometry": Polygon([(0, 0), (0, 1), (1, 1)])},
        {"foo": 2, "bar": "b", "geometry": Polygon([(0, 0), (0, 2), (2, 2)])},
        {"foo": 3, "bar": "c", "geometry": Polygon([(0, 0), (0, 3), (3, 3)])},
    ],
    geometry="geometry",
    crs="EPSG:4326",
)


with io.BytesIO() as stream:
    data.to_file(stream, layer="geometry", driver="GPKG")
    bytestr = stream.getvalue()


with io.BytesIO(bytestr) as stream:
    data1 = gpd.read_file(stream, driver="GPKG")


assert_geodataframe_equal(data, data1)

Problem description

The script fails with the following error:

/Users/brett.tully/micromamba/envs/geopandas_dev/lib/python3.12/site-packages/pyogrio/raw.py:530: RuntimeWarning: The filename extension should be 'gpkg' instead of '' to conform to the GPKG specification.
  ogr_write(
Traceback (most recent call last):
  File "/Users/brett.tully/Development/datascience/geopandas/tmp.py", line 29, in <module>
    data1 = gpd.read_file(stream, driver="GPKG")
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/brett.tully/Development/datascience/geopandas/geopandas/io/file.py", line 280, in _read_file
    return _read_file_pyogrio(
           ^^^^^^^^^^^^^^^^^^^
  File "/Users/brett.tully/Development/datascience/geopandas/geopandas/io/file.py", line 533, in _read_file_pyogrio
    return pyogrio.read_dataframe(path_or_bytes, bbox=bbox, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/brett.tully/micromamba/envs/geopandas_dev/lib/python3.12/site-packages/pyogrio/geopandas.py", line 239, in read_dataframe
    result = read_func(
             ^^^^^^^^^^
  File "/Users/brett.tully/micromamba/envs/geopandas_dev/lib/python3.12/site-packages/pyogrio/raw.py", line 194, in read
    result = ogr_read(
             ^^^^^^^^^
  File "pyogrio/_io.pyx", line 1124, in pyogrio._io.ogr_read
  File "pyogrio/_io.pyx", line 167, in pyogrio._io.ogr_open
pyogrio.errors.DataSourceError: '/vsimem/0bb337302a3e416585c0713adbddd5cf' not recognized as a supported file format. It might help to specify the correct driver explicitly by prefixing the file path with '<DRIVER>:', e.g. 'CSV:path'.

Expected Output

Output of geopandas.show_versions()

SYSTEM INFO
-----------
python     : 3.12.1 | packaged by conda-forge | (main, Dec 23 2023, 08:01:35) [Clang 16.0.6 ]
executable : /Users/brett.tully/micromamba/envs/geopandas_dev/bin/python
machine    : macOS-14.3.1-arm64-arm-64bit

GEOS, GDAL, PROJ INFO
---------------------
GEOS       : 3.12.1
GEOS lib   : None
GDAL       : 3.8.3
GDAL data dir: /Users/brett.tully/micromamba/envs/geopandas_dev/share/gdal/
PROJ       : 9.3.1
PROJ data dir: /Users/brett.tully/micromamba/envs/geopandas_dev/share/proj

PYTHON DEPENDENCIES
-------------------
geopandas  : 0.14.0+118.ga4a14c5
numpy      : 1.26.3
pandas     : 2.2.0
pyproj     : 3.6.1
shapely    : 2.0.2
pyogrio    : 0.7.2
geoalchemy2: None
geopy      : None
matplotlib : 3.8.2
mapclassify: None
fiona      : 1.9.5
psycopg    : None
psycopg2   : None
pyarrow    : 14.0.2
@bretttully
Contributor Author

bretttully commented Apr 22, 2024

It actually looks like the error might be on the save, not on the read. The following script fails on its final assertion:

import io
import os
from pathlib import Path

import pyogrio
import pyogrio.raw
from shapely import Polygon

import geopandas as gpd
from geopandas.testing import assert_geodataframe_equal

os.environ["PYOGRIO_USE_ARROW"] = "1"
gpd.options.io_engine = "pyogrio"
gpd.show_versions()

data = gpd.GeoDataFrame(
    [
        {"foo": 1, "bar": "a", "geometry": Polygon([(0, 0), (0, 1), (1, 1)])},
        {"foo": 2, "bar": "b", "geometry": Polygon([(0, 0), (0, 2), (2, 2)])},
        {"foo": 3, "bar": "c", "geometry": Polygon([(0, 0), (0, 3), (3, 3)])},
    ],
    geometry="geometry",
    crs="EPSG:4326",
)

outpath = Path("tmp.gpkg")
if outpath.exists():
    outpath.unlink()
data.to_file(outpath, layer="geometry", driver="GPKG")
assert outpath.exists()
bytestr_from_file = outpath.read_bytes()

with io.BytesIO() as stream:
    data.to_file(stream, layer="geometry", driver="GPKG")
    bytestr = stream.getvalue()
assert bytestr == bytestr_from_file, f"{len(bytestr)=} != {len(bytestr_from_file)=}"

AssertionError: len(bytestr)=0 != len(bytestr_from_file)=98304

@m-richards
Member

m-richards commented Apr 22, 2024

Thanks @bretttully for the report. This is currently the case: pyogrio cannot write to a BytesIO object, see geopandas/pyogrio#249 (and the discussion in #2875). We should note this as a difference between fiona and pyogrio that could break existing code in 1.0.

@bretttully
Contributor Author

Oh, thanks @m-richards -- that would be a fairly large regression for us... We could work around it by writing to a temp file and then reading the bytes back in, but that wouldn't be great.
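
The temp-file workaround mentioned above could look roughly like the sketch below. The helper name `write_to_bytes` and its `write_func` parameter are hypothetical, not part of geopandas or pyogrio; the helper only hands a real filesystem path to a writer callable and reads the resulting file back into memory.

```python
import tempfile
from pathlib import Path
from typing import Callable


def write_to_bytes(write_func: Callable[[Path], object], suffix: str = ".gpkg") -> bytes:
    """Call write_func with a temporary file path and return the file's bytes.

    Hypothetical helper: write_func is any callable that writes a file to the
    path it receives, for example
    lambda p: data.to_file(p, layer="geometry", driver="GPKG").
    """
    # Use a temporary directory rather than an open temporary file, so the
    # writer creates the file itself and no other handle keeps it open.
    with tempfile.TemporaryDirectory() as tmpdir:
        path = Path(tmpdir) / f"output{suffix}"
        write_func(path)
        # Read everything into memory before the directory is cleaned up.
        return path.read_bytes()
```

With the GeoDataFrame from the report, the call would be `write_to_bytes(lambda p: data.to_file(p, layer="geometry", driver="GPKG"))`. Giving the file a `.gpkg` suffix should also avoid the filename-extension warning shown in the original traceback, at the cost of a round trip through disk.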

@martinfleis
Member

Thanks @bretttully, this is good feedback to have! I suppose you're not the only one using BytesIO for intermediate files.

@jorisvandenbossche @brendan-ward @theroggy how feasible is it to get this into pyogrio 0.8 before geopandas 1.0 lands?

@brendan-ward
Member

I've been looking into this based on how it is implemented in Fiona / rasterio and working toward a potential PR. Not sure about the timing because there are some complexities here to work out (GPKG append / add layers to memory stream). Will continue the discussion on the pyogrio side.

@martinfleis
Member

@bretttully can you post the output of geopandas.show_versions() from an environment where this actually works when using Fiona?

@bretttully
Contributor Author

SYSTEM INFO
-----------
python     : 3.11.9 | packaged by conda-forge | (main, Apr 19 2024, 18:36:13) [GCC 12.3.0]
executable : /opt/conda/bin/python
machine    : Linux-4.14.336-257.562.amzn2.x86_64-x86_64-with-glibc2.35

GEOS, GDAL, PROJ INFO
---------------------
GEOS       : 3.12.1
GEOS lib   : None
GDAL       : 3.8.4
GDAL data dir: /opt/conda/share/gdal
PROJ       : 9.3.1
PROJ data dir: /opt/conda/share/proj

PYTHON DEPENDENCIES
-------------------
geopandas  : 0.14.3
numpy      : 1.26.4
pandas     : 2.2.2
pyproj     : 3.6.1
shapely    : 2.0.4
fiona      : 1.9.6
geoalchemy2: None
geopy      : 2.4.1
matplotlib : 3.8.4
mapclassify: 2.6.1
pygeos     : None
pyogrio    : 0.7.2
psycopg2   : 2.9.9 (dt dec pq3 ext lo64)
pyarrow    : 13.0.0
rtree      : 1.2.0

Code:

import io
from pathlib import Path

import geopandas as gpd
from geopandas.testing import assert_geodataframe_equal
from shapely import Polygon

gpd.show_versions()

data = gpd.GeoDataFrame(
    [
        {"foo": 1, "bar": "a", "geometry": Polygon([(0, 0), (0, 1), (1, 1)])},
        {"foo": 2, "bar": "b", "geometry": Polygon([(0, 0), (0, 2), (2, 2)])},
        {"foo": 3, "bar": "c", "geometry": Polygon([(0, 0), (0, 3), (3, 3)])},
    ],
    geometry="geometry",
    crs="EPSG:4326",
)

outpath = Path("tmp.gpkg")
if outpath.exists():
    outpath.unlink()
data.to_file(outpath, layer="geometry", driver="GPKG")
assert outpath.exists()
bytestr_from_file = outpath.read_bytes()

with io.BytesIO() as stream:
    data.to_file(stream, layer="geometry", driver="GPKG")
    bytestr = stream.getvalue()
assert len(bytestr) == len(bytestr_from_file), f"{len(bytestr)=} != {len(bytestr_from_file)=}"


with io.BytesIO(bytestr) as stream:
    data2 = gpd.read_file(stream, driver="GPKG")
assert_geodataframe_equal(data, data2)

@bretttully
Contributor Author

Note the change from assert bytestr == bytestr_from_file to assert len(bytestr) == len(bytestr_from_file) -- I forgot that SQLite puts the timestamp in the file.
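
To illustrate why the byte-for-byte comparison was replaced by a length comparison, here is a minimal sketch using only the standard library. It is not a real GeoPackage: the table and column names are made up for illustration, and the timestamps are fixed strings standing in for the write time stored in the file.

```python
import sqlite3
import tempfile
from pathlib import Path


def make_db(path: Path, timestamp: str) -> bytes:
    """Create a tiny SQLite file holding one timestamp and return its bytes."""
    conn = sqlite3.connect(path)
    try:
        # Illustrative schema only; a real GPKG has its own metadata tables.
        conn.execute("CREATE TABLE contents (name TEXT, last_change TEXT)")
        conn.execute("INSERT INTO contents VALUES (?, ?)", ("geometry", timestamp))
        conn.commit()
    finally:
        conn.close()
    return path.read_bytes()


with tempfile.TemporaryDirectory() as tmpdir:
    # Same schema and same row, written one second apart.
    first = make_db(Path(tmpdir) / "a.db", "2024-04-22T10:00:00Z")
    second = make_db(Path(tmpdir) / "b.db", "2024-04-22T10:00:01Z")

same_length = len(first) == len(second)
same_bytes = first == second
print(f"{same_length=} {same_bytes=}")
```

The two files have the same size but different contents, which is the situation the relaxed assertion accommodates. A stricter check would read both files back and compare the resulting GeoDataFrames, as the script above already does with assert_geodataframe_equal.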

@jorisvandenbossche
Member

Thanks @bretttully, I can indeed reproduce that: it works with fiona (both with released geopandas and with geopandas main), and as we know it does not yet work with pyogrio (geopandas/pyogrio#249).

From a quick test, current fiona does not allow appending (mode="a") when writing to a file-like object.

Fiona allows you to write for a multi-file driver like Shapefile, but then reading the resulting bytes doesn't work (at least not easily by just passing a stream):

In [13]: with io.BytesIO() as stream:
    ...:     data.to_file(stream, driver="ESRI Shapefile", engine="fiona")
    ...:     bytestr = stream.getvalue()
    ...: 

In [14]: with io.BytesIO(bytestr) as stream:
    ...:     data2 = gpd.read_file(stream, engine="fiona")
    ...: 
---------------------------------------------------------------------------
CPLE_OpenFailedError                      Traceback (most recent call last)
...

File fiona/ogrext.pyx:143, in fiona.ogrext.gdal_open_vector()

DriverError: '/vsimem/9d4fe4810f7c446898a9875a739fbebf' not recognized as a supported file format.

@brendan-ward
Member

This is now implemented in pyogrio 0.8.0; wheels are on PyPI / conda-forge.
(Note: appending to an existing in-memory GPKG and writing multiple layers are not yet supported.)
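
For code that has to run against both older and newer pyogrio releases, one option is to gate the BytesIO path on the installed version. The sketch below is an assumption-laden illustration: `supports_in_memory_write` is a hypothetical helper, it takes 0.8.0 as the first supporting release based on the comment above, and it ignores pre-release suffixes such as `rc1`.

```python
def supports_in_memory_write(version: str) -> bool:
    """Return True if a pyogrio version string is at least 0.8.0.

    Hypothetical helper; only the leading digits of the first three
    dot-separated components are compared.
    """
    parts = []
    for piece in version.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break  # stop at suffixes such as "rc1" or "+dev"
            digits += ch
        parts.append(int(digits) if digits else 0)
    # Pad short versions such as "1" or "0.8" to three components.
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts) >= (0, 8, 0)
```

A caller would pass the installed version string (assumed here to be available as `pyogrio.__version__`) and fall back to writing through a temporary file when the function returns False.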
