Use Shx Index to split large shapefiles #146

Merged: 1 commit, Aug 14, 2017

@halfabrane (Contributor) commented Aug 14, 2017

This PR allows the Shapefile Reader to split large shapefiles.
The Shapefile Reader looks for a .shx index file for each .shp file. If the index file exists, we use it to split the shapefile roughly at block boundaries, as follows:

  • To control the split size, set the Hadoop Configuration parameters "mapreduce.input.fileinputformat.split.minsize" (defaults to 1) and "mapreduce.input.fileinputformat.split.maxsize" (defaults to Long.MaxValue).
  • The split size is computed as:
    splitSize = Math.max(minSplitSize, Math.min(maxSplitSize, blockSize))
  • By default, splitSize = blockSize.
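
To make the rule above concrete, here is a minimal standalone sketch of the split-size formula in plain Java. It is not the code from this PR: the class name `SplitSizeSketch`, the method name `computeSplitSize`, and the 128 MiB block size are assumptions chosen for illustration. Only the formula and the two defaults come from the description.

```java
// Hypothetical illustration of the split-size rule; not the PR's implementation.
final class SplitSizeSketch {
    // Defaults quoted in the PR description for the two Hadoop parameters.
    static final long DEFAULT_MIN_SPLIT_SIZE = 1L;
    static final long DEFAULT_MAX_SPLIT_SIZE = Long.MAX_VALUE;

    // splitSize = max(minSplitSize, min(maxSplitSize, blockSize))
    static long computeSplitSize(long minSplitSize, long maxSplitSize, long blockSize) {
        return Math.max(minSplitSize, Math.min(maxSplitSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024; // assumed block size of 128 MiB

        // With the defaults, the split size equals the block size.
        System.out.println(computeSplitSize(
                DEFAULT_MIN_SPLIT_SIZE, DEFAULT_MAX_SPLIT_SIZE, blockSize));

        // Lowering the maximum below the block size yields smaller splits.
        System.out.println(computeSplitSize(
                DEFAULT_MIN_SPLIT_SIZE, 64L * 1024 * 1024, blockSize));

        // Raising the minimum above the block size yields larger splits.
        System.out.println(computeSplitSize(
                256L * 1024 * 1024, DEFAULT_MAX_SPLIT_SIZE, blockSize));
    }
}
```

Because the minimum is applied last, it wins whenever it exceeds the maximum. In a real job the two values would be set on the Hadoop Configuration under the parameter names given above, rather than passed as arguments.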

@codecov-io

Codecov Report

Merging #146 into master will increase coverage by 0.46%.
The diff coverage is 93.54%.


@@            Coverage Diff             @@
##           master     #146      +/-   ##
==========================================
+ Coverage   85.56%   86.03%   +0.46%     
==========================================
  Files          48       49       +1     
  Lines        1372     1446      +74     
  Branches       96      100       +4     
==========================================
+ Hits         1174     1244      +70     
- Misses        198      202       +4

| Impacted Files | Coverage Δ |
|---|---|
| src/main/scala/magellan/io/ShapeWritable.scala | 55.55% <ø> (-4.45%) ⬇️ |
| src/main/scala/magellan/ShapefileRelation.scala | 95.83% <100%> (+0.59%) ⬆️ |
| ...ain/scala/magellan/mapreduce/ShapefileReader.scala | 96.15% <88.88%> (-0.91%) ⬇️ |
| ...in/scala/magellan/mapreduce/ShapeInputFormat.scala | 90.62% <90.32%> (-9.38%) ⬇️ |
| ...main/scala/magellan/mapreduce/ShxInputFormat.scala | 96% <96%> (ø) |
| src/main/scala/magellan/Polygon.scala | 80.64% <0%> (+0.8%) ⬆️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update d13db84...c8f97a7.

@harsha2010 harsha2010 merged commit e7f91ce into harsha2010:master Aug 14, 2017
@halfabrane halfabrane deleted the SPLIT-LARGE-SHAPEFILES branch September 12, 2017 14:54