Can magellan handle large shapefiles (1M+ polygons)? #127

Closed
dphayter opened this issue Jul 27, 2017 · 13 comments

@dphayter

Thanks for a great geospatial library :-)

I've been trying to load some large reference shapefiles (1M+ polygons per file), but with no success.
The schema is read in OK, but no data is returned by rooftops.show().
I've tried increasing the Spark memory allocation, but with no joy. Any pointers to where the issue may be, or to methods I should debug? Is there any way to read only n polygons per file?

```scala
val rooftops = sqlcontext.read.format("magellan").load("../shape/")
rooftops.printSchema
rooftops.show()
```

Many thanks
David

@harsha2010
Owner

harsha2010 commented Jul 27, 2017

@dphayter Can you share the shapefile dataset? Also, how many nodes are you using? What is your cluster configuration like?

@dphayter
Author

dphayter commented Aug 3, 2017

Unfortunately, due to licensing I can't share the shapefile :-(
I'm running on a 20-node YARN cluster (total available memory: 2.6 TB) via Zeppelin.

Are there any large open-source shapefiles (of similar size) that we could use instead?

Thanks

@harsha2010
Owner

@dphayter On average, how many edges do the polygons in each file have? And how big is each file in bytes? I'll see if I can find a similar open-source shapefile.

@dphayter
Author

dphayter commented Aug 3, 2017

Approximately 50% of the polygons have 4 edges and 50% have 6-8 edges.
At present I'm just trying to read in one shapefile. The file is 334,134,120 bytes and has a FID count of 1.697 million.

@harsha2010
Owner

@dphayter Thanks! I am looking into this issue now. Right now we read an entire shapefile into memory; I am trying to see if there is a sensible way to split a shapefile so it can be streamed in. Will update the thread in a day or so with a conclusion.
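
For a sense of what streaming would look like, here is a minimal sketch that pulls features one at a time through the GeoTools iterator instead of buffering the whole file; the helper name `streamFeatures` is hypothetical and this is not Magellan's reader:

```scala
import java.io.File

import org.geotools.data.shapefile.ShapefileDataStore
import org.opengis.feature.simple.SimpleFeature

// Hypothetical sketch, not Magellan's internals: stream features one at a
// time via the GeoTools iterator instead of loading the whole .shp at once.
def streamFeatures(path: String)(f: SimpleFeature => Unit): Unit = {
  val ds = new ShapefileDataStore(new File(path).toURI.toURL)
  val it = ds.getFeatureSource.getFeatures.features
  try {
    while (it.hasNext) f(it.next()) // only one record in memory at a time
  } finally {
    it.close()
    ds.dispose()
  }
}
```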

@dphayter
Author

Any progress/thoughts on how to split the shapefile? I've seen that tools like QGIS have a Python utility, 'ShapefileSplitter', and the R package ShapePattern has an shpsplitter function, but it would be nice to be able to split the shapefile on sqlcontext.read.format("magellan").load.
Thanks

@Charmatzis
Contributor

Charmatzis commented Aug 14, 2017

@dphayter Hi, the shapefile as a spatial data type has limitations: no component file can exceed 2GB.
http://support.esri.com/en/technical-article/000010813

> There is a limitation on shapefile component file size, not on the shapefile size. All component files of a shapefile are limited to 2GB each.
>
> Accordingly, the .dbf cannot exceed 2GB, and the .shp cannot exceed 2GB. These are the only component files that are likely to approach the 2GB limit.

So a shapefile is small enough that it will always fit in a cluster.

A nice workaround is to load the shapefile into Spark using JTS (via GeoTools).

(Snapshot from a project that I am developing; the imports added here assume the GeoTools and GeoTrellis shapefile modules.)

```scala
import java.io.File
import java.net.URL

import scala.collection.mutable

import com.vividsolutions.jts.{geom => jts}
import geotrellis.shapefile.ShapeFileReader.SimpleFeatureWrapper
import geotrellis.vector._
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.geotools.data.shapefile.ShapefileDataStore
import org.geotools.data.simple.SimpleFeatureIterator
import org.opengis.feature.simple.SimpleFeature

def readSimpleFeatures(path: String): List[SimpleFeature] = {
  // Extract the features as GeoTools 'SimpleFeatures'
  val url = s"file://${new File(path).getAbsolutePath}"
  val ds = new ShapefileDataStore(new URL(url))
  val ftItr: SimpleFeatureIterator = ds.getFeatureSource.getFeatures.features

  try {
    val simpleFeatures = mutable.ListBuffer[SimpleFeature]()
    while (ftItr.hasNext) simpleFeatures += ftItr.next()
    simpleFeatures.toList
  } finally {
    ftItr.close()
    ds.dispose()
  }
}

def readPolygonFeatures(path: String): Seq[PolygonFeature[Map[String, Object]]] =
  readSimpleFeatures(path)
    .flatMap { ft => ft.geom[jts.Polygon].map(PolygonFeature(_, ft.attributeMap)) }

def shpUrlsToRdd(urlArray: Array[String], partitionSize: Int = 10)(implicit sc: SparkContext): RDD[Polygon] = {
  val urlRdd: RDD[String] = sc.parallelize(urlArray, urlArray.size * partitionSize)
  urlRdd.mapPartitions { urls =>
    urls.flatMap { url =>
      readPolygonFeatures(url).map(_.geom)
    }
  }
}
```
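
For reference, a hypothetical invocation of the sketch above (the path and app name are made up):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical usage of shpUrlsToRdd above; the path is made up.
implicit val sc: SparkContext =
  new SparkContext(new SparkConf().setAppName("shp-load").setMaster("local[*]"))
val polygons = shpUrlsToRdd(Array("/data/shape/rooftops.shp"))
println(s"loaded ${polygons.count()} polygons")
```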

@harsha2010
Owner

@dphayter @Charmatzis I actually had a branch where I did this, but I seem to have accidentally deleted it.
Talking to GitHub support to see if I can recover it. In any case, this shouldn't take too long to fix, and I will have a PR for this shortly.
@dphayter It would be great to have you try it out, since I couldn't find really large shapefiles.

@harsha2010
Owner

@Charmatzis Also, the issue is not so much that a shapefile cannot exceed 2GB, but that a 2GB shapefile is being read by a single core, so it's not really using much parallelism. We can fix this by using the .shx index file to figure out how to split the shapefile so it can be read in parallel.
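
For context, a .shx index is a 100-byte header followed by fixed-size 8-byte records, each holding the big-endian offset and content length of one .shp record, measured in 16-bit words; that is what makes byte-range splits computable. A minimal, illustrative sketch of reading it (not Magellan's actual implementation):

```scala
import java.io.{DataInputStream, FileInputStream}

// Illustrative sketch, not Magellan's code: read the .shx index to get the
// byte offset and content length of every record in the companion .shp file.
def readShxOffsets(path: String): Seq[(Long, Int)] = {
  val in = new DataInputStream(new FileInputStream(path))
  try {
    in.skipBytes(100) // fixed-size shapefile header
    val offsets = scala.collection.mutable.ArrayBuffer[(Long, Int)]()
    while (in.available() >= 8) {
      val offset = in.readInt().toLong * 2 // stored in 16-bit words
      val length = in.readInt() * 2        // content length, also in words
      offsets += ((offset, length))
    }
    offsets.toSeq
  } finally in.close()
}
```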

@dphayter
Author

Thanks, both.
@harsha2010 Let me know when you have a PR and I'll test it out.

@harsha2010
Owner

@dphayter Thanks to the fine people at GitHub support, I was able to recover the branch!
#146
Can you test this out and let me know if it solves your problem?

@dphayter
Author

Good news, #146 worked for me!

```scala
val rooftops = sqlcontext.read.format("magellan").load("../shape/")
rooftops.printSchema
rooftops.show(1000)
```

Note: my .shp (file size 334,134,120 bytes / FID count of 1.697 million) also had a corresponding .shx.

I'll do more detailed join testing shortly.

Many thanks David

@harsha2010
Owner

Resolved by #146
