Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Question: recommended way to load columns containing SRID info? #356

Open
codeananda opened this issue Jan 31, 2024 · 3 comments
Open

Question: recommended way to load columns containing SRID info? #356

codeananda opened this issue Jan 31, 2024 · 3 comments

Comments

@codeananda
Copy link

I've got some csv files where the geometry column is named 'geom' and looks like this

SRID=27700;POLYGON()
SRID=27700;POLYGON((10 10, 20 20, 30 10, 10 10))
SRID=27700;MULTIPOLYGON()

To load it, I manually check and modify the columns but wonder if there is a way for pyogrio to handle this itself?

Some things I've tried

  • Looking at the GDAL CSV Driver docs but that wasn't too helpful.
  • Using GEOM_POSSIBLE_NAMES=['geom'] in pyogrio.read_dataframe - but think it got confused with SRID info.

My working code

import pandas as pd
import geopandas as gpd
from shapely import wkt
from io import StringIO

# What I actually do
shlaa_gdf = gpd.read_file(
    "SHLAA.csv",
    engine='pyogrio',
    encoding='utf-8',
)

# Toy example
csv_content = """fid,geom
1,"SRID=27700;POLYGON()"
2,"SRID=27700;POLYGON((10 10, 20 20, 30 10, 10 10))"
3,"SRID=27700;MULTIPOLYGON()" """

shlaa_gdf = pd.read_csv(StringIO(csv_content))
shlaa_gdf = gpd.GeoDataFrame(shlaa_gdf)

def load_to_geom(x):
    if x == 'POLYGON()':
        return wkt.loads('POLYGON EMPTY')
    elif x == 'MULTIPOLYGON()':
        return wkt.loads('MULTIPOLYGON EMPTY')
    else:
        return wkt.loads(x)

if 'geom' in shlaa_gdf.columns:
    split_df = shlaa_gdf['geom'].str.split(';', expand=True)
    shlaa_gdf['srid'] = split_df[0]
    shlaa_gdf['wkt_geom'] = split_df[1]
    
    shlaa_gdf['geometry'] = shlaa_gdf['wkt_geom'].apply(load_to_geom)
    shlaa_gdf = shlaa_gdf.set_geometry('geometry')
@martinfleis
Copy link
Member

This is EWKT, or Extended Well-Known Text representation of geometries. With the exception of empty geometries, which are invalid here. I can't seem to find a reference to it in GDAL docs so I suppose that it supports only standard WKT, which is everything after ;.

We could potentially try to detect it, parse it and create geometry column, assuming all SRID's are the same, though it is not supported at the moment. Though you'd need to ensure that POLYGON() is not present as that is incorrect WKT.

My recommendation for now is to use what you have, maybe trying to use vectorized shapely.from_wkt rather than wkt.loads but that is a minor difference.

@theroggy
Copy link
Member

theroggy commented Jan 31, 2024

For information, as mentioned by @martinfleis , if you would have to process larger datasets it would probably be more efficiënt to use only vectorized functions.

E.g. like this:

import pandas as pd
import geopandas as gpd
import shapely
from io import StringIO

# Toy example
csv_content = """fid,geom
1,"SRID=27700;POLYGON()"
2,"SRID=27700;POLYGON((10 10, 20 20, 30 10, 10 10))"
3,"SRID=27700;MULTIPOLYGON()"
"""

shlaa_df = pd.read_csv(StringIO(csv_content))

shlaa_df[["srid", "wkt"]] = shlaa_df["geom"].str.split(";", expand=True)
shlaa_df.loc[shlaa_df["wkt"] == "POLYGON()", "wkt"] = "POLYGON EMPTY"
shlaa_df.loc[shlaa_df["wkt"] == "MULTIPOLYGON()", "wkt"] = "MULTIPOLYGON EMPTY"
crs = shlaa_df.iloc[0]["srid"].split("=")[1]

shlaa_gdf = gpd.GeoDataFrame(geometry=shapely.from_wkt(shlaa_df["wkt"]), crs=crs)

print(shlaa_gdf)
print(shlaa_gdf.crs)

@codeananda
Copy link
Author

codeananda commented Feb 1, 2024

Amazing thank you!

@theroggy could you write 'python' after the triple backticks? It's so much nicer to read.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants