Add CanadianBuildingFootprints dataset #69

Merged
merged 8 commits into from
Aug 4, 2021

Conversation

adamjstewart
Collaborator

Closes #3

@adamjstewart adamjstewart added the datasets Geospatial or benchmark datasets label Aug 3, 2021
@adamjstewart adamjstewart requested a review from calebrob6 August 3, 2021 17:09
@adamjstewart adamjstewart marked this pull request as ready for review August 3, 2021 21:07
@calebrob6
Member

calebrob6 commented Aug 3, 2021

Added a notebook showing that the constructor takes 3 minutes and the __getitem__ method takes 1 minute for a moderately sized query. This may be unavoidable with GeoJSON files, but as is, the dataset will not be directly usable for training models.

@adamjstewart, can you try converting the inputs to shapefile or geopackage and see if that helps?

@adamjstewart
Collaborator Author

Let me see if I can get conda to play nicely so I can actually run a notebook in AML.

@calebrob6
Member

If you want to iterate quickly you can just use the same query:

bounds = BoundingBox(-79.69096183776855,-79.68220710754395,43.78839898848133,43.79482711775757,0,1)

which is in Ontario.geojson

@adamjstewart
Collaborator Author

Good news! Things seem to be a lot faster with ESRI Shapefile.

GeoJSON

$ ipython
In [1]: from torchgeo.datasets import BoundingBox, CanadianBuildingFootprints as CBF
In [2]: bounds = BoundingBox(-79.69096183776855,-79.68220710754395,43.78839898848133,43.79482711775757,0,1)
In [3]: %time ds = CBF('/mnt/blobfuse/adam-scratch/cbf')
CPU times: user 3min 27s, sys: 1.68 s, total: 3min 29s
Wall time: 3min 38s
In [4]: %time ds[bounds]
CPU times: user 1min 7s, sys: 508 ms, total: 1min 7s
Wall time: 1min 9s

ESRI Shapefile

$ ipython
In [1]: from torchgeo.datasets import BoundingBox, CanadianBuildingFootprints as CBF
In [2]: bounds = BoundingBox(-79.69096183776855,-79.68220710754395,43.78839898848133,43.79482711775757,0,1)
In [3]: %time ds = CBF('/mnt/blobfuse/adam-scratch/cbf-shp')
CPU times: user 203 ms, sys: 68.9 ms, total: 272 ms
Wall time: 9.47 s
In [4]: %time ds[bounds]
CPU times: user 1.28 s, sys: 456 ms, total: 1.74 s
Wall time: 1.74 s

This raises the question of how we want to support arbitrary file formats. Right now, things are hard-coded to search for *.geojson files. We could make this search for any file, but directories tend to contain many file types (*.zip, *.shp, *.dbf, etc.). Alternatively, we could search for all files, try to open each one with fiona/rasterio, and skip any that raise an exception.
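
A minimal sketch of that "open everything and catch exceptions" idea, for vector files only. The names `fiona_can_open` and `find_readable` are hypothetical and not part of torchgeo; the opener is passed in as a parameter so the search logic can be tried without fiona installed. Note that sidecar files such as *.dbf may also open successfully, so this check alone may not be enough to pick one file per dataset.

```python
from pathlib import Path
from typing import Callable, Iterator


def fiona_can_open(path: Path) -> bool:
    """Return True if fiona can open the file as a vector dataset."""
    import fiona  # imported lazily so the search below works without fiona

    try:
        with fiona.open(str(path)):
            return True
    except Exception:  # deliberately broad: the proposal is to catch anything
        return False


def find_readable(
    root: str, can_open: Callable[[Path], bool] = fiona_can_open
) -> Iterator[Path]:
    """Yield every file under root that the opener accepts, in sorted order."""
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and can_open(path):
            yield path
```

Something like `list(find_readable('/path/to/data'))` would then replace the hard-coded *.geojson glob. The cost is one open attempt per file in the directory tree.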

We should also do some benchmarking for different file types and add that to the paper.

@calebrob6
Member

I think shapefiles (or GeoPackages) are fast because they have internal spatial indices. I don't think we should try to support GeoJSON files, as they will cause the same problems.
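
To illustrate why a spatial index would matter for a bounding-box query, here is a toy comparison in pure Python. It is only a sketch: `GridIndex` and `linear_query` are hypothetical names, and a uniform grid is used as a stand-in for brevity. Real formats use other structures (GeoPackage, for instance, uses an R-tree). Without an index, every feature must be tested against the query; with one, only features in the overlapping cells are tested.

```python
from collections import defaultdict
from math import floor
from typing import Dict, Iterator, List, Tuple

Box = Tuple[float, float, float, float]  # (minx, miny, maxx, maxy)


def intersects(a: Box, b: Box) -> bool:
    """Return True if two axis-aligned boxes overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def linear_query(boxes: List[Box], query: Box) -> List[int]:
    """No index: test every feature against the query window."""
    return [i for i, box in enumerate(boxes) if intersects(box, query)]


class GridIndex:
    """Toy uniform-grid index over feature bounding boxes."""

    def __init__(self, boxes: List[Box], cell_size: float) -> None:
        self.boxes = boxes
        self.cell_size = cell_size
        self.cells: Dict[Tuple[int, int], List[int]] = defaultdict(list)
        for i, box in enumerate(boxes):
            for cell in self._cells(box):
                self.cells[cell].append(i)

    def _cells(self, box: Box) -> Iterator[Tuple[int, int]]:
        """Yield the grid cells that a box overlaps."""
        x0, y0 = floor(box[0] / self.cell_size), floor(box[1] / self.cell_size)
        x1, y1 = floor(box[2] / self.cell_size), floor(box[3] / self.cell_size)
        for cx in range(x0, x1 + 1):
            for cy in range(y0, y1 + 1):
                yield (cx, cy)

    def query(self, query: Box) -> List[int]:
        """Test only the features registered in cells the query overlaps."""
        candidates = {
            i for cell in self._cells(query) for i in self.cells.get(cell, ())
        }
        return sorted(i for i in candidates if intersects(self.boxes[i], query))
```

Both functions return the same feature indices; the difference is how many features each one has to look at per query.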

@adamjstewart
Collaborator Author

Ideally, I would like to allow users to download their own data and transform it in any way. This includes:

  • Changing file format
  • Changing projection/transform
  • Changing resolution (for rasters)

We can certainly recommend certain file formats like COGs or Shapefiles for speed, or even issue warnings when users are using file formats with slow random access.
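
A rough sketch of what such a warning could look like, using only the standard library. The function name `warn_if_slow` and the extension list are assumptions made for illustration, not an existing torchgeo API, and a file extension is only a crude proxy for how a format actually behaves.

```python
import warnings
from pathlib import Path

# Hypothetical list of extensions assumed to have slow random access
SLOW_RANDOM_ACCESS = {".geojson", ".json"}


def warn_if_slow(path: str) -> bool:
    """Warn when a file's extension suggests slow random access.

    Returns True if a warning was issued, False otherwise.
    """
    suffix = Path(path).suffix.lower()
    if suffix in SLOW_RANDOM_ACCESS:
        warnings.warn(
            f"{path}: '{suffix}' files may be slow to query; "
            "consider converting to a format with faster random access.",
            UserWarning,
            stacklevel=2,
        )
        return True
    return False
```

A dataset constructor could call this once per discovered file, so users still get to keep their data in whatever format they like.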

For now, let's merge as is. When I do a RasterDataset/VectorDataset refactor, we can allow them to work with any file extension.

@calebrob6 calebrob6 merged commit 8b193d9 into main Aug 4, 2021
@calebrob6 calebrob6 deleted the datasets/cbf branch August 4, 2021 17:10
@adamjstewart adamjstewart added this to the 0.1.0 milestone Nov 20, 2021
Successfully merging this pull request may close these issues.

Microsoft Canadian Building Footprints Dataset
2 participants