Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SSL4EO-S12: add new dataset/datamodule #1151

Merged
merged 31 commits into from
Apr 15, 2023
Merged

SSL4EO-S12: add new dataset/datamodule #1151

merged 31 commits into from
Apr 15, 2023

Conversation

adamjstewart
Copy link
Collaborator

@adamjstewart adamjstewart commented Feb 28, 2023

Adds the SSL4EO-S12 dataset.

Some things the dataset doesn't currently support because I don't need them, but could potentially be added someday:

  • Automatic downloading: same as SEN12MS and So2Sat, we haven't yet figured out automatic downloads from the TUM servers
  • RGB version, 100-patch MSI subset, 50K RGB subset: alternate versions of the dataset on OneDrive and Google Drive
  • Combining s1, s2c, or s2a: we don't yet have an easy model for supporting this in trainers and transforms
  • Spectral band subsets: adds complexity to the dataset
  • GeoDataset: each file has geographic metadata, we could build a GeoDataset

Plotting seems broken at the moment, will upload plots once that's working better.

@adamjstewart adamjstewart added this to the 0.5.0 milestone Feb 28, 2023
@github-actions github-actions bot added datasets Geospatial or benchmark datasets documentation Improvements or additions to documentation testing Continuous integration testing labels Feb 28, 2023

if path.endswith("B1.tif") or path.endswith("B9.tif"):
profile["width"] = profile["height"] = SIZE // 6
elif (
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's probably a better way to do this...

torchgeo/datasets/ssl4eo.py Outdated Show resolved Hide resolved
Comment on lines 139 to 141
directory = os.path.join(self.root, self.split, f"{index // self.times:07}")
subdirs = sorted(os.listdir(directory))
directory = os.path.join(directory, subdirs[index % self.times])
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is somewhat hard-coded, but is much faster than storing a list of every directory like we do in SeCo.

for band in self.metadata[self.split]["bands"]:
filename = os.path.join(directory, f"{band}.tif")
with rasterio.open(filename) as f:
image = f.read(out_shape=(1, self.size, self.size)).astype(np.float32)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Each Sentinel band is in its original resolution (10–60 m) and needs to be resampled to the higher resolution.

a matplotlib Figure with the rendered sample
"""
if self.split == "s1":
# See Sentinel1.plot
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wish we could call the plot methods of Sentinel1 and Sentinel2 directly so we didn't have to copy the same logic into every dataset. I bet we could figure out a way to do this...


co_polarization = torch.clamp(co_polarization / 0.3, min=0, max=1)
cross_polarization = torch.clamp(cross_polarization / 0.05, min=0, max=1)
ratio = torch.clamp(ratio / 25, min=0, max=1)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@RitwikGupta this plotting trick isn't working, all I see is a black image with red specs. I know you're not the biggest fan of RGB plotting, but this dataset presumably only has one type of image processing so it should be possible to reliably plot. Not sure how to tell if the image is power/decibel/amplitude or gamma/sigma and what that actually means for plotting. Any advice? You can download the 100 patch subset if you want to test the plotting method.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@adamjstewart The Sentinel-1 imagery they pulled from GEE is in the decibel scale. Normally, Sentinel-1 data from Copernicus is in the power scale. The ratio you want to calculate should be in the power scale. This code will work to do visualize this specific data:

cross_pol = np.exp(cross_pol / 10)
co_pol = np.exp(co_pol / 10)
ratio = cross_pol / co_pol

fig, ax = plt.subplots(1, 4, figsize=(20, 10))
ax[0].imshow(cross_pol, cmap='bone')
ax[1].imshow(co_pol, cmap='bone')
ax[2].imshow(ratio, cmap='bone')
ax[3].imshow(np.concatenate([cross_pol, co_pol, ratio], axis=2))

image

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's the relevant GEE documentation that says they convert to decibel when pulling S1 data in GEE: https://developers.google.com/earth-engine/datasets/catalog/COPERNICUS_S1_GRD
image

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. I thought it was supposed to be [co_pol, cross_pol, ratio], not [cross_pol, co_pol, ratio]?
  2. Am I not supposed to do normalization (co_pol / 0.3, cross_pol / 0.05, ratio / 25)?
  3. Even with divide by 10, no normalization, swap co_pol/cross_pol, the images don't look great:
    Figure_1

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which image are you using? I can test with the same one.

  • I think you can keep [co_pol, cross_pol, ratio].
  • Are you doing np.exp(x / 10) or just x /10?
  • You should use the log scale values in that post for viz purposes only. It's not required, but it may help. Do min-max norm to those ranges, not just division.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yes, I was using x / 10 instead of np.exp(x / 10). However, it still looks pretty bad. I also tried 10 ** (x / 10). I'm just plotting the first 20 images of the 100 patch subset if you want to try to tinker with it.

In the post, it's unclear to me how one is supposed to plot db data. If I immediately convert from db to power, would I use the linear values or the db values? Is linear different from power? What does it mean for values to range from -15 to 5 (for red)? Does that mean to convert values from [-15, 5] to [0, 1]? Do I perform np.exp(x / 10) before or after all this normalization? Do I use co / cross or co - cross?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do you want to use power values to plot? In my paper, I simply plot the dB values (99% percentile and scale each one to [0,1])

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to try to make an RGB plot (see #821 (comment) for an example) because it's prettier and in theory provides more useful information. Unfortunately it's also much harder to plot. We could plot 2 grayscale plots in the worst case scenario.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

then can also plot dB for each channel? e.g. [VV,VH,VV/VH]?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that's what I was doing originally and it didn't work either. I just copied the plotting code from our Sentinel1 dataset, which worked fine for data downloaded from ASF DAAC, but apparently not for data downloaded from GEE.

@adamjstewart adamjstewart marked this pull request as ready for review March 10, 2023 02:54
@adamjstewart
Copy link
Collaborator Author

S1 plotting is definitely still broken, but I would be okay with merging this as is and fixing S1 plotting in a follow-up PR since I need this dataset soon.

I'm planning on adding a datamodule for this, but will wait until #1168 is merged so I can properly test it.

isaaccorley
isaaccorley previously approved these changes Mar 10, 2023
@github-actions github-actions bot added the datamodules PyTorch Lightning datamodules label Mar 17, 2023
@calebrob6
Copy link
Member

I don't have the S1 downloaded so can't check it
image

@adamjstewart
Copy link
Collaborator Author

I don't have the S1 downloaded so can't check it

You can download the 100-patch subset for testing.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@calebrob6 this directly contradicts the change made in #1027. Should we revert that PR too? Given that:

  1. All of our datasets use a different scale factor, ignoring which satellite their images come from
  2. All images are normalized before they are returned from the dataset or datamodule, changing the appropriate scale factor

I'm starting to think we should use one of the default enhancements from QGIS:

  1. Min / max
  2. 2% / 98%
  3. ± 2 std dev

There could even be an option to select which of these you want to use. We can decide and implement this later, but I want to make sure we're not contradicting ourselves and won't revert #1027 multiple times until we come up with a uniform policy. This also relates to #496.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dividing by 10k obviously didn't work though

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then why did we merge #1027?

@calebrob6
Copy link
Member

image

s1 plotting is broken

@calebrob6
Copy link
Member

calebrob6 commented Mar 29, 2023 via email

@adamjstewart
Copy link
Collaborator Author

Yep. We could just plot 2 grayscale images instead of one RGB image for now and fix this in the future when someone (me) has a better understanding of SAR imagery. I'd prefer not to merge this with broken plotting, but we could open an issue and promise to fix it before 0.5.0 is released.

@RitwikGupta
Copy link
Contributor

Sorry all. I was out on vacation and then sick for a bit. I will take a look at this again. GEE imagery is in the decibel scale instead of the power scale most Sentinel-1 imagery is released in. This is standard for analysis and visualization.

@adamjstewart
Copy link
Collaborator Author

Any updates @RitwikGupta?

We might want to just merge this PR and fix S1 plotting at a later date.

@calebrob6
Copy link
Member

Yep, I would vote to make the plot show something and merge

@adamjstewart
Copy link
Collaborator Author

Alright, see how this looks:

Figure_1
Figure_2
Figure_3
Figure_4

@calebrob6 calebrob6 merged commit 8e9a894 into main Apr 15, 2023
@calebrob6 calebrob6 deleted the datasets/ssl4eo branch April 15, 2023 20:51
yichiac pushed a commit to yichiac/torchgeo that referenced this pull request Apr 29, 2023
* SSL4EO-S12: add new dataset

* Style fixes

* 100% coverage

* fix mypy

* black fixes

* mypy fix

* Convert from db to power

* Don't cast to numpy

* Remove comments referring to SeCo

* SSL4EO: add extraction time

* Add RandomSeasonContrast

* Fix axes indexing

* Add datamodule

* fix tests

* mypy fixes

* fix missing import

* Fix tests

* isort fix

* Typo fix

* s2c: add B10

* Update test channels

* S2 plotting was broken

* Fix plotting

* Black fix

* Rename conf files

* Remove file introduced by bad merge

* Fix pixel size of bands

* black fix

* Better S1 plotting

---------

Co-authored-by: Caleb Robinson <calebrob6@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
datamodules PyTorch Lightning datamodules datasets Geospatial or benchmark datasets documentation Improvements or additions to documentation testing Continuous integration testing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants