Can magellan handle large shapefiles (1M+ polygons)? #127
Comments
@dphayter can you share the shape file dataset? Also, how many nodes are you using? What is your cluster configuration like?
Unfortunately, due to licensing I can't share the shape file :-( Are there any large open-source shape files of a similar size that we could use instead? Thanks
@dphayter On average, how many edges do the polygons in each file have? And how big is each file in terms of bytes? I'll see if I can find a similar open-source shape file.
Approx. 50% of polygons have 4 edges and 50% have 6-8 edges.
@dphayter Thanks! I am looking into this issue now. Basically, right now we read an entire shape file into memory. I am trying to see if there is a sensible way to split a shape file so it can be streamed in. Will update the thread in a day or so with a conclusion.
Any progress/thoughts on how to split the shapefile? I've seen that tools like QGIS have a Python utility 'ShapefileSplitter' and the R package ShapePattern has a shpsplitter function, but it would be nice if sqlcontext.read.format("magellan").load could split the shape file itself.
@dphayter Hi, the shapefile format does have a size limitation, but it applies to each component file rather than to the shapefile as a whole: the .dbf cannot exceed 2GB and the .shp cannot exceed 2GB, and these are the only component files likely to approach that limit. So the data is easily small enough to fit in a cluster. A nice workaround for loading the shapefile into Spark is to use JTS. (Snapshot from a project I am developing.)
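For reference, this is roughly what that workaround could look like — a minimal sketch, not the commenter's actual code, assuming GeoTools is on the classpath (GeoTools reads shapefile records and exposes the geometries as JTS objects); the file path is a placeholder:

import java.io.File
import org.geotools.data.shapefile.ShapefileDataStore

// Read every feature on the driver and hand the geometries to Spark as WKT strings.
// This only helps when the .shp comfortably fits in driver memory.
val store = new ShapefileDataStore(new File("/path/to/rooftops.shp").toURI.toURL)  // placeholder path
val features = store.getFeatureSource.getFeatures.features()
val wkt = scala.collection.mutable.ArrayBuffer[String]()
while (features.hasNext) {
  wkt += features.next().getDefaultGeometry.toString  // JTS geometries print as WKT
}
features.close()
store.dispose()

val geometries = sqlcontext.createDataFrame(wkt.map(Tuple1(_))).toDF("wkt")
geometries.show(5)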
@dphayter @Charmatzis I actually had a branch where I did this, but I seem to have accidentally deleted it.
@Charmatzis Also, the issue is not so much that a shape file cannot exceed 2GB, but that the 2GB shape file is being read by a single core... so it's not really using much parallelism. We can fix it by using the .shx index file to figure out how to split the shape files so they can be read in parallel.
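For what it's worth, the .shx part is straightforward to read directly — a rough sketch (the 100-byte header and fixed 8-byte big-endian records come from the shapefile spec; the path is a placeholder and the grouping into splits is left out):

import java.io.{DataInputStream, File, FileInputStream}

// The .shx index is a 100-byte header followed by one fixed 8-byte record per shape:
// the shape's offset and content length in the .shp, big-endian, counted in 16-bit words.
val shx = new File("/path/to/rooftops.shx")  // placeholder path
val numRecords = ((shx.length - 100) / 8).toInt
val in = new DataInputStream(new FileInputStream(shx))
in.skipBytes(100)
val ranges = (0 until numRecords).map { _ =>
  val offsetBytes = in.readInt().toLong * 2  // words -> bytes
  val lengthBytes = in.readInt().toLong * 2
  (offsetBytes, lengthBytes)
}
in.close()
// Grouping these (offset, length) pairs into roughly equal byte ranges of the .shp
// would give one independent split per Spark task.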
Thanks both.
Good news, #146 worked for me!
val rooftops = sqlcontext.read.format("magellan").load("../shape/")
Note: my .shp (file size 334,134,120 bytes / FID count of 1.697 million) also had a corresponding .shx.
I'll do more detailed join testing shortly.
Many thanks
David
Resolved by #146 |
Thanks for a great geospatial library :-)
I've been trying to load in some large reference shapefiles (1M+ polygons per file) but with no success.
The schema is read in ok, but no data is returned with rooftops.show()
I've tried increasing the Spark memory allocation, but with no joy. Any pointers to where the issue may be / methods I should use to debug? Is there any way to only read n polygons per file?
val rooftops = sqlcontext.read.format("magellan").load("../shape/")
rooftops.printSchema
rooftops.show()
Many thanks
David