Fixes VectorDataset rounding bug causing sample mask size mismatch #675

TCherici · 2022-07-14T11:50:33Z

I've had to adjust the test to also account for the extra pixel generated, which seems right to me (1.9-1.1 = 0.8 which would be 8 pixels with a res=0.1).

Fixes #674

ghost · 2022-07-14T11:50:44Z

All CLA requirements met.

adamjstewart

Can confirm that this fixes the issue, thanks for the PR!

adamjstewart · 2022-07-14T19:00:11Z

torchgeo/datasets/geo.py

@@ -611,12 +611,12 @@ def __getitem__(self, query: BoundingBox) -> Dict[str, Any]:
        )
        if shapes:
            masks = rasterio.features.rasterize(
-                shapes, out_shape=(int(height), int(width)), transform=transform
+                shapes, out_shape=(int(round(height)), int(round(width))), transform=transform


This line is too long now. Can you run black torchgeo/datasets/geo.py to fix this?

There is no need to cast it to int again if you're not specifying the number of decimals to round to.

adamjstewart · 2022-07-14T19:11:18Z

There are several other places where we aren't rounding before casting to an int. I'm not worried about the str -> int conversions, but some of these float -> int conversions could be problematic. Not sure if integer division includes rounding or not. I wonder if any of these should also be fixed:

$ grep -PR '\bint\((?!round)'
tests/models/test_changestar.py:        assert y[0].shape == (3, 1, int(32 * sf), int(32 * sf))
tests/samplers/test_single.py:        rows = int((100 - sampler.size[0]) // sampler.stride[0]) + 1
tests/samplers/test_single.py:        cols = int((100 - sampler.size[1]) // sampler.stride[1]) + 1
tests/data/forestdamage/data.py:        ET.SubElement(bbox, "xmin").text = str(0 + int(SIZE / 4))
tests/data/forestdamage/data.py:        ET.SubElement(bbox, "ymin").text = str(0 + int(SIZE / 4))
tests/data/forestdamage/data.py:        ET.SubElement(bbox, "xmax").text = str(SIZE - int(SIZE / 4))
tests/data/forestdamage/data.py:        ET.SubElement(bbox, "ymax").text = str(SIZE - int(SIZE / 4))
torchgeo/datasets/forestdamage.py:            int(bndbox.find("xmin").text),  # type: ignore[union-attr, arg-type]
torchgeo/datasets/forestdamage.py:            int(bndbox.find("ymin").text),  # type: ignore[union-attr, arg-type]
torchgeo/datasets/forestdamage.py:            int(bndbox.find("xmax").text),  # type: ignore[union-attr, arg-type]
torchgeo/datasets/forestdamage.py:            int(bndbox.find("ymax").text),  # type: ignore[union-attr, arg-type]
torchgeo/datasets/so2sat.py:            self.size = int(f["label"].shape[0])
torchgeo/datasets/spacenet.py:            [int(math.ceil(s / bin_size_mph)) for s in speed_arr_bin]
torchgeo/datasets/spacenet.py:                            int(feature["properties"]["inferred_speed_mph"]) - 1
torchgeo/datasets/gbif.py:        mint = datetime(int(year), 1, 1)
torchgeo/datasets/gbif.py:        maxt = datetime(int(year) + 1, 1, 1)
torchgeo/datasets/gbif.py:        mint = datetime(int(year), int(month), 1)
torchgeo/datasets/gbif.py:            maxt = datetime(int(year) + 1, 1, 1)
torchgeo/datasets/gbif.py:            maxt = datetime(int(year), int(month) + 1, 1)
torchgeo/datasets/gbif.py:        mint = datetime(int(year), int(month), int(day))
torchgeo/datasets/cowc.py:        target = int(self.targets[index])
torchgeo/datasets/idtrees.py:                [self.idx2class[int(i)] for i in sample["label"]]
torchgeo/datasets/idtrees.py:                [self.idx2class[int(i)] for i in sample["prediction_label"]]
torchgeo/datasets/openbuildings.py:                shapes, out_shape=(int(height), int(width)), transform=transform
torchgeo/datasets/openbuildings.py:            masks = torch.zeros(size=(1, int(height), int(width)))
torchgeo/datasets/cv4a_kenya_crop_type.py:                train_field_ids.append(int(row[0]))
torchgeo/datasets/cv4a_kenya_crop_type.py:                    test_field_ids.append(int(row[1]))
torchgeo/datasets/cyclone.py:        features["relative_time"] = int(features["relative_time"])
torchgeo/datasets/cyclone.py:        features["ocean"] = int(features["ocean"])
torchgeo/datasets/cyclone.py:        features["label"] = int(features["wind_speed"])
torchgeo/datamodules/etci2021.py:        size_train = int(0.8 * size_train_val)
torchgeo/datamodules/chesapeake.py:        self.original_patch_size = int(patch_size * 2.0)
torchgeo/datamodules/sen12ms.py:            scene_id = int(parts[3])
torchgeo/datamodules/utils.py:        val_length = int(len(dataset) * val_pct)
torchgeo/datamodules/utils.py:        val_length = int(len(dataset) * val_pct)
torchgeo/datamodules/utils.py:        test_length = int(len(dataset) * test_pct)
torchgeo/models/farseg.py:            num_upsample = int(math.log2(int(in_feat_os))) - int(
torchgeo/models/farseg.py:                math.log2(int(out_feature_output_stride))
torchgeo/samplers/single.py:            rows = int((bounds.maxy - bounds.miny - self.size[0]) // self.stride[0]) + 1
torchgeo/samplers/single.py:            cols = int((bounds.maxx - bounds.minx - self.size[1]) // self.stride[1]) + 1
torchgeo/samplers/single.py:            rows = int((bounds.maxy - bounds.miny - self.size[0]) // self.stride[0]) + 1
torchgeo/samplers/single.py:            cols = int((bounds.maxx - bounds.minx - self.size[1]) // self.stride[1]) + 1

ashnair1 · 2022-07-16T15:48:54Z

torchgeo/datasets/geo.py

            )
        else:
            # If no features are found in this query, return an empty mask
            # with the default fill value and dtype used by rasterize
-            masks = np.zeros((int(height), int(width)), dtype=np.uint8)
+            masks = np.zeros((int(round(height)), int(round(width))), dtype=np.uint8)


Same comment as above. Casting to int not required.

ashnair1 · 2022-07-16T16:46:43Z

There are several other places where we aren't rounding before casting to an int. I'm not worried about the str -> int conversions, but some of these float -> int conversions could be problematic. Not sure if integer division includes rounding or not. I wonder if any of these should also be fixed:

I had encountered a related issue with dataset_split in torchgeo/datamodules/utils.py. Direct int casting basically strips out the decimal part so it's a problem if len(dataset) * val_pct < 1. I had encountered this issue while testing coverage for a datamodule. The test dataset in question had len(dataset) = 3 and val_pct = 0.3.

Fix using round:

    if test_pct is None:
-        val_length = int(len(dataset) * val_pct)
+        val_length = round(len(dataset) * val_pct)
        train_length = len(dataset) - val_length
        return random_split(dataset, [train_length, val_length])
    else:
-       val_length = int(len(dataset) * val_pct)
-       test_length = int(len(dataset) * test_pct)
+       val_length = round(len(dataset) * val_pct)
+       test_length = round(len(dataset) * test_pct)

TCherici · 2022-07-18T12:22:51Z

I had encountered a related issue with dataset_split in torchgeo/datamodules/utils.py. Direct int casting basically strips out the decimal part so it's a problem if len(dataset) * val_pct < 0. I had encountered this issue while testing coverage for a datamodule. The test dataset in question had len(dataset) = 3 and val_pct = 0.3.

I don't understand how len(dataset) * val_pct could be less than 0 for the values you gave, and I'm not sure if this is related to the bugfix in this PR. @ashnair1 If you think it is I will add the fix you gave as a commit to this PR.

ashnair1 · 2022-07-18T13:34:37Z

I had encountered a related issue with dataset_split in torchgeo/datamodules/utils.py. Direct int casting basically strips out the decimal part so it's a problem if len(dataset) * val_pct < 0. I had encountered this issue while testing coverage for a datamodule. The test dataset in question had len(dataset) = 3 and val_pct = 0.3.

I don't understand how len(dataset) * val_pct could be less than 0 for the values you gave, and I'm not sure if this is related to the bugfix in this PR. @ashnair1 If you think it is I will add the fix you gave as a commit to this PR.

My bad. Meant less than 1. The comment is unrelated to this PR. I was just replying to @adamjstewart's comment on other places where round vs int errors can occur.

adamjstewart

We can fix other instances of this bug in a separate PR.

) * Fixes VectorDataset rounding bug causing sample mask size mismatch * removes unnecessary casting to int

…icrosoft#675) * Fixes VectorDataset rounding bug causing sample mask size mismatch * removes unnecessary casting to int

Fixes VectorDataset rounding bug causing sample mask size mismatch

d13d9b3

github-actions bot added datasets Geospatial or benchmark datasets testing Continuous integration testing labels Jul 14, 2022

adamjstewart added this to the 0.3.1 milestone Jul 14, 2022

adamjstewart previously approved these changes Jul 14, 2022

View reviewed changes

ashnair1 reviewed Jul 16, 2022

View reviewed changes

removes unnecessary casting to int

e16b731

TCherici dismissed adamjstewart’s stale review via e16b731 July 18, 2022 12:12

adamjstewart approved these changes Jul 18, 2022

View reviewed changes

adamjstewart merged commit 67b8b30 into microsoft:main Jul 18, 2022

adamjstewart mentioned this pull request Jul 18, 2022

Various rounding bugs #679

Closed

TCherici deleted the bugfix/vector_dataset_sample_rounding branch July 19, 2022 09:15

adamjstewart mentioned this pull request Aug 22, 2022

Fix float rounding issues #736

Merged

adamjstewart pushed a commit that referenced this pull request Sep 3, 2022

Fixes VectorDataset rounding bug causing sample mask size mismatch (#675

1044078

) * Fixes VectorDataset rounding bug causing sample mask size mismatch * removes unnecessary casting to int

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fixes VectorDataset rounding bug causing sample mask size mismatch #675

Fixes VectorDataset rounding bug causing sample mask size mismatch #675

TCherici commented Jul 14, 2022 •

edited by adamjstewart

Loading

ghost commented Jul 14, 2022 •

edited by ghost

Loading

adamjstewart left a comment

adamjstewart Jul 14, 2022

ashnair1 Jul 16, 2022 •

edited by adamjstewart

Loading

adamjstewart commented Jul 14, 2022

ashnair1 Jul 16, 2022 •

edited

Loading

ashnair1 commented Jul 16, 2022 •

edited

Loading

TCherici commented Jul 18, 2022 •

edited

Loading

ashnair1 commented Jul 18, 2022 •

edited

Loading

adamjstewart left a comment

Fixes VectorDataset rounding bug causing sample mask size mismatch #675

Fixes VectorDataset rounding bug causing sample mask size mismatch #675

Conversation

TCherici commented Jul 14, 2022 • edited by adamjstewart Loading

ghost commented Jul 14, 2022 • edited by ghost Loading

adamjstewart left a comment

Choose a reason for hiding this comment

adamjstewart Jul 14, 2022

Choose a reason for hiding this comment

ashnair1 Jul 16, 2022 • edited by adamjstewart Loading

Choose a reason for hiding this comment

adamjstewart commented Jul 14, 2022

ashnair1 Jul 16, 2022 • edited Loading

Choose a reason for hiding this comment

ashnair1 commented Jul 16, 2022 • edited Loading

TCherici commented Jul 18, 2022 • edited Loading

ashnair1 commented Jul 18, 2022 • edited Loading

adamjstewart left a comment

Choose a reason for hiding this comment

TCherici commented Jul 14, 2022 •

edited by adamjstewart

Loading

ghost commented Jul 14, 2022 •

edited by ghost

Loading

ashnair1 Jul 16, 2022 •

edited by adamjstewart

Loading

ashnair1 Jul 16, 2022 •

edited

Loading

ashnair1 commented Jul 16, 2022 •

edited

Loading

TCherici commented Jul 18, 2022 •

edited

Loading

ashnair1 commented Jul 18, 2022 •

edited

Loading