GeoDataset: non-deterministic behavior #1899

DimitrisMantas · 2024-02-22T01:24:50Z

Description

The random_bbox_assignment dataset splitter (as well as all other spatial index-based splitters I suppose) is not deterministic between different program executions.

That is, splitting a dataset multiple times during the same runtime (e.g., by calling GeoDataModule.setup() multiple times in a row, always results in the same splits, which is expected. However, these splits are different when restarting the current kernel/program/script.

Steps to reproduce

Go to the random random_bbox_assignment function definition and add the following line under hits = list(dataset.index.intersection(dataset.index.bounds, objects=True)): x = [hit.object for hit in hits].
Insert a breakpoint in the following line (i.e., where hits is permuted).
Initialize the L7IrishDataModule.
Call setup("fit") on the data module and while debugging.
Catch the breakpoint.
Save the value of x somewhere.
Terminate the program.
Do Steps 3-6 again.
Notice that the value of x has changed.

Version

0.5.1

The text was updated successfully, but these errors were encountered:

DimitrisMantas · 2024-02-22T01:36:35Z

I believe that at least for the case of RasterDataset, where each entry in its spatial index is essentially one file, one can sort x lexicographically and use that instead of hits to proceed.

However, I am pretty sure that this whole thing is due to the files property of GeoDataset, specifically this part:

# Using set to remove any duplicates if directories are overlapping
files: set[str] = set()
for path in paths:
    if os.path.isdir(path):
        pathname = os.path.join(path, "**", self.filename_glob)
        files |= set(glob.iglob(pathname, recursive=True))
    elif os.path.isfile(path) or path_is_vsi(path):
        files.add(path)
    else:
        warnings.warn(
            f"Could not find any relevant files for provided path '{path}'. "
            f"Path was ignored.",
            UserWarning,
        )
return files

I see that self.files is iterated over to populate the index of the original dataset (i.e., the one initialized by the corresponding data module), at least in RasterDataset.__init__(), but sets are unordered. A very similar bug was fixed by #1839.

adamjstewart · 2024-02-25T19:07:49Z

Thanks for catching this, this is a HUGE bug! From what I can tell, the implications of this bug are as follows:

GeoDataset iteration order is non-deterministic across processes, so experiments cannot be reproduced, even with a random seed
Therefore, all of our splitting functions will result in a different split across processes
Therefore, when torchgeo fit and torchgeo test are run, there may be image patches that appear in both the train and test split (only affects L7Irish and L8Biome)

This bug was introduced in #1442 and #1597, and is included in releases 0.5.0 and 0.5.1. Sorry I didn't catch this during review @adriantre. We'll fix this in the next release.

I just finished presenting TorchGeo at a reproducibility workshop, so this is a bit embarrassing for me personally... 😅

I was able to reproduce implication 1 as follows. First, apply the following patch:

diff --git a/torchgeo/datasets/l7irish.py b/torchgeo/datasets/l7irish.py
index fe9e6b4a..fd9eeb89 100644
--- a/torchgeo/datasets/l7irish.py
+++ b/torchgeo/datasets/l7irish.py
@@ -183,6 +183,7 @@ class L7Irish(RasterDataset):
         """
         hits = self.index.intersection(tuple(query), objects=True)
         filepaths = cast(list[str], [hit.object for hit in hits])
+        print(filepaths)
 
         if not filepaths:
             raise IndexError(

Then, run the following code:

from torch.utils.data import DataLoader
from lightning.pytorch import seed_everything
from torchgeo.datasets import L7Irish, stack_samples
from torchgeo.samplers import RandomGeoSampler

seed_everything(0)

dataset = L7Irish(paths="data/l7irish", download=True)
sampler = RandomGeoSampler(dataset, size=64, length=10)

dataloader = DataLoader(dataset, sampler=sampler, collate_fn=stack_samples)

for batch in dataloader:
    pass

Every time you run this program, you'll notice that the order changes. Implications 2 and 3 follow by definition and can be easily reproduced as you described.

I believe your fix of sorting the output of GeoDataset.files is correct. Would you like to submit a PR to fix this so you can get credit for discovering and fixing the bug? It should look like:

diff --git a/torchgeo/datasets/geo.py b/torchgeo/datasets/geo.py
index 1e2382db..c044afb7 100644
--- a/torchgeo/datasets/geo.py
+++ b/torchgeo/datasets/geo.py
@@ -287,7 +287,7 @@ class GeoDataset(Dataset[dict[str, Any]], abc.ABC):
         self._res = new_res
 
     @property
-    def files(self) -> set[str]:
+    def files(self) -> list[str]:
         """A list of all files in the dataset.
 
         Returns:
@@ -316,7 +316,7 @@ class GeoDataset(Dataset[dict[str, Any]], abc.ABC):
                     UserWarning,
                 )
 
-        return files
+        return sorted(files)

I would love to figure out a good way to test these things. We could test that GeoDataset.files always has the same order, but that doesn't necessarily guarantee that the rest of TorchGeo is deterministic. Maybe some kind of integration test that tests a trained model byte-for-byte? Even then, reproducibility in PyTorch isn't guaranteed across PyTorch versions or hardware changes.

DimitrisMantas · 2024-02-25T19:27:43Z

Hi, thanks for taking the time to investigate the issue; I’m glad to help out!

I can take on the PR, but a word of warning, I’m really new to production-level CI/CD so this may take a bit until I understand the contribution guidelines.

Regarding testing, I think that proving that each part of TorchGeo is individually deterministic should be enough to assume the same for the whole thing. So, maybe just confirming that files is always in the same order is enough for our purposes?

adamjstewart · 2024-02-25T19:35:10Z

I can take on the PR, but a word of warning, I’m really new to production-level CI/CD so this may take a bit until I understand the contribution guidelines.

No worries, we have documentation for this: https://torchgeo.readthedocs.io/en/stable/user/contributing.html. Unit tests will go in tests/datasets/test_geo.py, and I'm happy to help write those if you get stuck. Everyone starts somewhere 😃

Regarding testing, I think that proving that each part of TorchGeo is individually deterministic should be enough to assume the same for the whole thing. So, maybe just confirming that files is always in the same order is enough for our purposes?

Kind of. This ensures that self.files is deterministic, but doesn't necessarily ensure that there are no other sources of non-determinism. There could easily be other places where we start using a set in the future that will lead to new bugs.

DimitrisMantas · 2024-02-25T19:39:01Z

Thanks for the offer! Will attempt to push something today because this actually affects experimentation my thesis work and I'd like to get it out of the way ASAP.

I see why you're scared of sets now haha!

adamjstewart added the datasets Geospatial or benchmark datasets label Feb 25, 2024

adamjstewart added this to the 0.5.2 milestone Feb 25, 2024

adamjstewart changed the title ~~random_bbox_assignment non-deterministic behavior~~ GeoDataset: non-deterministic behavior Feb 25, 2024

adamjstewart changed the title ~~GeoDataset: non-deterministic behavior~~ GeoDataset/DataModule: non-deterministic behavior Feb 25, 2024

adamjstewart changed the title ~~GeoDataset/DataModule: non-deterministic behavior~~ GeoDataset: non-deterministic behavior Feb 25, 2024

adamjstewart mentioned this issue Feb 25, 2024

Add Sentinel2_CDL Datamodule #1889

Merged

DimitrisMantas mentioned this issue Feb 25, 2024

Implement deterministic GeoDataset #1908

Merged

adamjstewart closed this as completed in #1908 Feb 26, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GeoDataset: non-deterministic behavior #1899

GeoDataset: non-deterministic behavior #1899

DimitrisMantas commented Feb 22, 2024

DimitrisMantas commented Feb 22, 2024 •

edited

Loading

adamjstewart commented Feb 25, 2024

DimitrisMantas commented Feb 25, 2024

adamjstewart commented Feb 25, 2024

DimitrisMantas commented Feb 25, 2024

GeoDataset: non-deterministic behavior #1899

GeoDataset: non-deterministic behavior #1899

Comments

DimitrisMantas commented Feb 22, 2024

Description

Steps to reproduce

Version

DimitrisMantas commented Feb 22, 2024 • edited Loading

adamjstewart commented Feb 25, 2024

DimitrisMantas commented Feb 25, 2024

adamjstewart commented Feb 25, 2024

DimitrisMantas commented Feb 25, 2024

DimitrisMantas commented Feb 22, 2024 •

edited

Loading