Can magellan handle large shapefiles (1M+ polygons)? #127

Closed
dphayter opened this issue Jul 27, 2017 · 13 comments

@dphayter

Thanks for a great geospatial library :-)

I've been trying to load some large reference shapefiles (1M+ polygons per file), but with no success.
The schema is read in OK, but no data is returned by rooftops.show().
I've tried increasing the Spark memory allocation, but with no joy. Any pointers to where the issue may be, or to methods I should debug? Is there any way to read only n polygons per file?

```scala
val rooftops = sqlcontext.read.format("magellan").load("../shape/")
rooftops.printSchema
rooftops.show()
```

Many thanks
David

@harsha2010
Owner

harsha2010 commented Jul 27, 2017

@dphayter Can you share the shapefile dataset? Also, how many nodes are you using? What is your cluster configuration like?

@dphayter
Author

dphayter commented Aug 3, 2017

Unfortunately, due to licensing I can't share the shapefile :-(
I'm running on a 20-node YARN cluster (total available memory: 2.6 TB) via Zeppelin.

Are there any large open-source shapefiles (of similar size) that we could use instead?

Thanks

@harsha2010
Owner

@dphayter On average, how many edges do the polygons in each file have? And how big is each file in bytes? I'll see if I can find a similar open-source shapefile.

@dphayter
Author

dphayter commented Aug 3, 2017

Approximately 50% of the polygons have 4 edges and 50% have 6-8 edges.
At present I'm just trying to read in one shapefile. The file is 334,134,120 bytes and has a FID count of 1.697 million.

@harsha2010
Owner

@dphayter Thanks! I am looking into this issue now. Right now we read an entire shapefile into memory; I am trying to see if there is a sensible way to split a shapefile so it can be streamed in. Will update the thread in a day or so with a conclusion.
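
For a sense of what streaming would look like, here is a minimal sketch that pulls features one at a time through the GeoTools iterator instead of buffering the whole file; the helper name `streamFeatures` is hypothetical and this is not Magellan's reader:

```scala
import java.io.File

import org.geotools.data.shapefile.ShapefileDataStore
import org.opengis.feature.simple.SimpleFeature

// Hypothetical sketch, not Magellan's internals: stream features one at a
// time via the GeoTools iterator instead of loading the whole .shp at once.
def streamFeatures(path: String)(f: SimpleFeature => Unit): Unit = {
  val ds = new ShapefileDataStore(new File(path).toURI.toURL)
  val it = ds.getFeatureSource.getFeatures.features
  try {
    while (it.hasNext) f(it.next()) // only one record in memory at a time
  } finally {
    it.close()
    ds.dispose()
  }
}
```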

@dphayter
Author

Any progress/thoughts on how to split the shapefile? I've seen that tools like QGIS have a Python utility, 'ShapefileSplitter', and the R package ShapePattern has an shpsplitter function, but it would be nice to be able to split the shapefile on sqlcontext.read.format("magellan").load.
Thanks

@Charmatzis
Contributor

Charmatzis commented Aug 14, 2017

@dphayter Hi, the shapefile as a spatial data type has limitations: no component file can exceed 2GB.
http://support.esri.com/en/technical-article/000010813

> There is a limitation on shapefile component file size, not on the shapefile size. All component files of a shapefile are limited to 2GB each.
>
> Accordingly, the .dbf cannot exceed 2GB, and the .shp cannot exceed 2GB. These are the only component files that are likely to approach the 2GB limit.

So a shapefile is small enough that it will always fit in a cluster.

A nice workaround is to load the shapefile into Spark using JTS (via GeoTools).

(Snapshot from a project that I am developing; the imports added here assume the GeoTools and GeoTrellis shapefile modules.)

```scala
import java.io.File
import java.net.URL

import scala.collection.mutable

import com.vividsolutions.jts.{geom => jts}
import geotrellis.shapefile.ShapeFileReader.SimpleFeatureWrapper
import geotrellis.vector._
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.geotools.data.shapefile.ShapefileDataStore
import org.geotools.data.simple.SimpleFeatureIterator
import org.opengis.feature.simple.SimpleFeature

def readSimpleFeatures(path: String): List[SimpleFeature] = {
  // Extract the features as GeoTools 'SimpleFeatures'
  val url = s"file://${new File(path).getAbsolutePath}"
  val ds = new ShapefileDataStore(new URL(url))
  val ftItr: SimpleFeatureIterator = ds.getFeatureSource.getFeatures.features

  try {
    val simpleFeatures = mutable.ListBuffer[SimpleFeature]()
    while (ftItr.hasNext) simpleFeatures += ftItr.next()
    simpleFeatures.toList
  } finally {
    ftItr.close()
    ds.dispose()
  }
}

def readPolygonFeatures(path: String): Seq[PolygonFeature[Map[String, Object]]] =
  readSimpleFeatures(path)
    .flatMap { ft => ft.geom[jts.Polygon].map(PolygonFeature(_, ft.attributeMap)) }

def shpUrlsToRdd(urlArray: Array[String], partitionSize: Int = 10)(implicit sc: SparkContext): RDD[Polygon] = {
  val urlRdd: RDD[String] = sc.parallelize(urlArray, urlArray.size * partitionSize)
  urlRdd.mapPartitions { urls =>
    urls.flatMap { url =>
      readPolygonFeatures(url).map(_.geom)
    }
  }
}
```
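
For reference, a hypothetical invocation of the sketch above (the path and app name are made up):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical usage of shpUrlsToRdd above; the path is made up.
implicit val sc: SparkContext =
  new SparkContext(new SparkConf().setAppName("shp-load").setMaster("local[*]"))
val polygons = shpUrlsToRdd(Array("/data/shape/rooftops.shp"))
println(s"loaded ${polygons.count()} polygons")
```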

@harsha2010
Owner

@dphayter @Charmatzis I actually had a branch where I did this, but I seem to have accidentally deleted it.
Talking to GitHub support to see if I can recover it. In any case, this shouldn't take too long to fix, and I will have a PR for this shortly.
@dphayter It would be great to have you try it out, since I couldn't find really large shapefiles.

@harsha2010
Owner

@Charmatzis Also, the issue is not so much that a shapefile cannot exceed 2GB, but that a 2GB shapefile is being read by a single core, so it's not really using much parallelism. We can fix this by using the .shx index file to figure out how to split the shapefile so it can be read in parallel.
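
For context, a .shx index is a 100-byte header followed by fixed-size 8-byte records, each holding the big-endian offset and content length of one .shp record, measured in 16-bit words; that is what makes byte-range splits computable. A minimal, illustrative sketch of reading it (not Magellan's actual implementation):

```scala
import java.io.{DataInputStream, FileInputStream}

// Illustrative sketch, not Magellan's code: read the .shx index to get the
// byte offset and content length of every record in the companion .shp file.
def readShxOffsets(path: String): Seq[(Long, Int)] = {
  val in = new DataInputStream(new FileInputStream(path))
  try {
    in.skipBytes(100) // fixed-size shapefile header
    val offsets = scala.collection.mutable.ArrayBuffer[(Long, Int)]()
    while (in.available() >= 8) {
      val offset = in.readInt().toLong * 2 // stored in 16-bit words
      val length = in.readInt() * 2        // content length, also in words
      offsets += ((offset, length))
    }
    offsets.toSeq
  } finally in.close()
}
```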

@dphayter
Author

Thanks, both.
@harsha2010 Let me know when you have a PR and I'll test it out.

@harsha2010
Owner

@dphayter Thanks to the fine people at GitHub support, I was able to recover the branch!
#146
Can you test this out and let me know if it solves your problem?

@dphayter
Author

Good news, #146 worked for me!

```scala
val rooftops = sqlcontext.read.format("magellan").load("../shape/")
rooftops.printSchema
rooftops.show(1000)
```

Note: my .shp (file size 334,134,120 bytes / FID count of 1.697 million) also had a corresponding .shx.

I'll do more detailed join testing shortly.

Many thanks David

@harsha2010
Owner

Resolved by #146
