Images_to_samples: Create binary labels from multi-class label data #167

Closed
remtav opened this issue Nov 10, 2020 · 5 comments
Labels
P1 High priority

Comments

@remtav
Collaborator

remtav commented Nov 10, 2020

Problem

GDL cannot currently create samples with single-class labels when the label data provided is multi-class, which prevents training class-specific models from such data.

Solution

Two approaches seem feasible for implementing this missing feature. First, let's look at the necessary developments, common to both approaches, that would let the user control this feature seamlessly:

  • a parameter "hide_classes" would be added to the "global" section of the config yaml
  • the validate_num_classes function would be adapted to take the ignored classes into account, since "num_classes" would most likely be smaller than the number of classes found in the geopackage.
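
As a rough illustration of the second point, the check could look something like the sketch below. The signature is hypothetical and does not reflect GDL's actual validate_num_classes; it also assumes that the background value is 0 and is not counted in "num_classes".

```python
def validate_num_classes(num_classes, classes_in_labels, hide_classes=()):
    """Compare num_classes with the classes that remain once hidden ones are dropped.

    Hypothetical signature for illustration only.
    Assumption: 0 is the background value and is not counted in num_classes.
    """
    # Classes still visible to the model after removing hidden ones and background
    visible = sorted(set(classes_in_labels) - set(hide_classes) - {0})
    if len(visible) != num_classes:
        raise ValueError(
            f"num_classes is {num_classes}, but {len(visible)} class(es) remain "
            f"after hiding {sorted(set(hide_classes))}: {visible}"
        )
    return visible
```

With four classes in the geopackage and three of them hidden, `validate_num_classes(1, [1, 2, 3, 4], hide_classes=[1, 2, 3])` would pass, whereas the same call without "hide_classes" would raise.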

1. Online approach

Description

Dynamically zero out (i.e. set as "background") irrelevant class values during training without modifying hdf5 files.

Advantages

  • Single class vs multi-class trainings can be performed with a single set of hdf5s, making it less prone to errors if user wants to directly compare performances of single-class vs multi-class models.

Disadvantages

  • For certain classes, like road or vegetation extraction, the online approach combined with the way GDL uses pre-rasterized samples saved as HDF5s during training may limit the ability of models to extract features of one class under features of another class. For example, a waterbody covered by vegetation would not be extracted if the HDF5s initially discarded the presence of this waterbody in the process of rasterizing the geopackages;
  • Though zeroing out irrelevant class values is possible, GDL is not currently developed to dynamically (i.e. during training) exclude samples that have no pixels of the relevant class in their label. A quick implementation of the online approach would therefore be inadequate if the user wishes to exclude samples in which the class of interest is not represented (i.e. containing no pixels of that class in the label). Update: a Random Weighted Sampler is about to be added to GDL and could help balance out an unbalanced dataset using the online approach;
  • May add some overhead during training

Implementation

The get_item method of the data loader would be in charge of zeroing out irrelevant values and returning a binary label.
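
A minimal sketch of that idea, assuming labels are NumPy arrays of integer class values and that samples are available as (image, label) pairs. The class name and the "keep_classes" argument are made up for illustration and are not part of GDL:

```python
import numpy as np


class BinaryLabelDataset:
    """Wraps (image, label) pairs and binarizes the label on access.

    Hypothetical wrapper: the stored multi-class labels are never modified.
    """

    def __init__(self, samples, keep_classes):
        self.samples = samples                  # sequence of (image, label) pairs
        self.keep_classes = list(keep_classes)  # class values of interest

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, index):
        image, label = self.samples[index]
        label = np.asarray(label)
        # 1 where the pixel belongs to a class of interest, 0 (background) elsewhere
        binary = np.isin(label, self.keep_classes).astype(label.dtype)
        return image, binary
```

Because the conversion happens at read time, the same set of stored samples can serve both a multi-class training and any number of single-class trainings.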

2. Offline approach

Description

In images_to_samples.py, once the geopackage is read and rasterized, create hdf5 samples with binary values (i.e. "class of interest" and "non class of interest"). All irrelevant class values are zeroed out to match the value of the "non class of interest" class. Training then continues as usual.

Advantages

  • User can resort to already implemented features (i.e. "class proportion" parameter) to filter out samples for which the label contains no values of "class of interest"

Disadvantages

  • This approach calls for a separate set of hdf5 files. This will require more disk space and cause some duplication if a single-class vs multi-class comparison is desired for the same dataset.

Implementation

After filtering out undesired samples (i.e. those that do not meet the class_prop threshold and the minimum annotated percent), irrelevant class values are zeroed out before the final samples are written to hdf5, in the add_to_dataset function.
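
The order of operations (filter first, binarize second) could be sketched as follows. This is not the actual add_to_dataset code: the function name and the "min_class_prop" argument are hypothetical, and the threshold is assumed to be a fraction of the sample's pixels.

```python
import numpy as np


def binarize_if_kept(label, target_class, min_class_prop=0.0):
    """Return a binary label, or None if the sample should be discarded.

    Hypothetical helper; min_class_prop is assumed to be a fraction in [0, 1].
    """
    label = np.asarray(label)
    # Filtering uses the original multi-class label
    prop = np.count_nonzero(label == target_class) / label.size
    if prop < min_class_prop:
        return None
    # Only then are irrelevant class values zeroed out
    return (label == target_class).astype(label.dtype)
```

A sample that passes the threshold is returned with 1 for the class of interest and 0 everywhere else, ready to be written to a separate set of hdf5 files.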

Dev effort estimate

The effort needed to implement either approach ranges from 1 to 2 workdays. This development seems fairly simple at first glance.

Useful resources and articles

Buildings extraction

Road extraction

@mpelchat04 mpelchat04 added the P1 High priority label Nov 11, 2020
@mpelchat04 mpelchat04 added this to the 1.3.0 milestone Nov 11, 2020
@CharlesAuthier
Collaborator

Personally I prefer the first one.

@ymoisan
Contributor

ymoisan commented Nov 19, 2020

I would call the approaches dynamic/static rather than online/offline. In an ideal world, that is, when we can fetch training samples through the web, there could be an opportunity for us not to bother anymore with duplicating our data into HDF5 files. In that respect, I would prefer the dynamic approach (#1).

@remtav
Collaborator Author

remtav commented Jan 25, 2022

This feature was broken with PR #208.

This feature should apply to any subset of classes, as was done with the parameter target_ids. In previous implementations, the offline/static approach was chosen. The disadvantages of this approach can be dealt with by sampling to plain .tifs and .geojson files rather than HDF5s.

@ymoisan ymoisan modified the milestones: 1.3.0, Training enhancements Feb 15, 2022
@remtav
Collaborator Author

remtav commented May 9, 2022

A solution is proposed in my 215-solaris-tiling branch. The vector ground truth is first tiled to vector chips (as geojson) alongside the imagery. The desired attribute values for a particular attribute field are then burned to raster. The advantage of this approach is that the geojson chips contain all the initial attribute information and only need to be created once. The burning of specific values can be requested multiple times and stored to different copies as raster chips.
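
To illustrate only the selection step that would precede burning, here is a small sketch that assumes a vector chip is loaded as a GeoJSON FeatureCollection dict. The function name is hypothetical and this is not the code from the branch; the rasterization itself is left out.

```python
def select_features(chip, field, values):
    """Return a copy of a GeoJSON FeatureCollection keeping only the wanted values.

    Hypothetical helper: the input chip is left untouched, so it can be
    reused for other attribute values later.
    """
    wanted = set(values)
    kept = [
        feature
        for feature in chip["features"]
        # "properties" may be null in GeoJSON, hence the fallback to {}
        if (feature.get("properties") or {}).get(field) in wanted
    ]
    return {**chip, "features": kept}
```

Since the original chip keeps every attribute, the same chip can be filtered with different values and each result burned to its own raster chip.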

@CharlesAuthier
Collaborator

This issue will be solved with the merge of the branch 222-stac-item-input and the change made at line 471 in train_segmentation.py. This change makes the reading of the gt binary if only one class is specified at attribute_values in the dataset yaml.
