Images_to_samples: Create binary labels from multi-class label data #167

remtav · 2020-11-10T21:27:47Z

Problem

GDL cannot currently create samples with single-class labels when the label data provided is multi-class, which is needed to train class-specific models.

Solution

Two approaches seem feasible for implementing this missing feature. First, let's look at the developments common to both approaches that would let the user control this feature seamlessly:

a parameter "hide_classes" would be added to the "global" section of the config yaml
the validate_num_classes function would be adapted to take the ignored classes into account, since "num_classes" would most likely be smaller than the number of classes found in the geopackage.
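
A minimal sketch of how the adapted check could look, assuming "hide_classes" is a list of class values to ignore. The function signature and the config layout below are hypothetical and are not GDL's actual validate_num_classes.

```python
def validate_num_classes(found_classes, num_classes, hide_classes=()):
    """Check that num_classes matches the classes left after hiding.

    found_classes: class values found in the geopackage (background excluded).
    num_classes: value declared in the "global" section of the config.
    hide_classes: class values to treat as background (hypothetical parameter).
    """
    kept = sorted(set(found_classes) - set(hide_classes))
    if len(kept) != num_classes:
        raise ValueError(
            f"num_classes={num_classes}, but {len(kept)} classes remain "
            f"after hiding {sorted(hide_classes)}: {kept}"
        )
    return kept


# Hypothetical config values: four classes in the geopackage, one of interest.
config = {"global": {"num_classes": 1, "hide_classes": [1, 2, 3]}}
kept = validate_num_classes(
    found_classes=[1, 2, 3, 4],
    num_classes=config["global"]["num_classes"],
    hide_classes=config["global"]["hide_classes"],
)
```

With these values, only class 4 is kept and the check passes. Leaving "hide_classes" empty while keeping num_classes at 1 would raise an error.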

1. Online approach

Description

Dynamically zero out (i.e. set as "background") irrelevant class values during training without modifying hdf5 files.

Advantages

Single class vs multi-class trainings can be performed with a single set of hdf5s, making it less prone to errors if user wants to directly compare performances of single-class vs multi-class models.

Disadvantages

For certain classes, like road or vegetation extraction, the online approach combined with the way GDL uses pre-rasterized samples saved as HDF5s during training may limit the ability of models to extract features of one class lying under features of another class. For example, a waterbody covered by vegetation would not be extracted if the HDF5s discarded the presence of this waterbody when the geopackages were rasterized;
Though zeroing out irrelevant class values is possible, GDL is not currently designed to dynamically (i.e. during training) exclude samples that have no pixels of the relevant class in their label. A quick implementation of the online approach would therefore be inadequate if the user wishes to exclude samples in which the class of interest is not represented (i.e. containing no pixels of that class in the label). Update: a Random Weighted Sampler is about to be added to GDL and could help balance out an unbalanced dataset using the online approach;
May add some overhead during training.
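
To illustrate how weighted sampling could offset samples that lack the class of interest, here is a rough sketch. It is not the Random Weighted Sampler mentioned above; the weighting scheme, the function name and the "empty_weight" parameter are assumptions made for the example.

```python
import numpy as np


def sample_weights(labels, class_of_interest, empty_weight=0.1):
    """Return normalized sampling weights, one per label array.

    Samples containing at least one pixel of class_of_interest get weight 1,
    all others get empty_weight (0 would exclude them entirely).
    """
    has_class = np.array([(lbl == class_of_interest).any() for lbl in labels])
    weights = np.where(has_class, 1.0, empty_weight)
    return weights / weights.sum()


# Three toy labels; only the first and third contain class 4.
labels = [
    np.array([[0, 4], [4, 4]]),
    np.array([[1, 2], [0, 3]]),
    np.array([[0, 0], [0, 4]]),
]
weights = sample_weights(labels, class_of_interest=4)

# Draw sample indices with replacement according to the weights.
rng = np.random.default_rng(seed=0)
drawn = rng.choice(len(labels), size=8, replace=True, p=weights)
```

The weights only need to be computed once per dataset and class of interest, so the cost during training is limited to the weighted draw itself.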

Implementation

The get_item method of the data loader would be in charge of zeroing out irrelevant values and returning a binary label.
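
A minimal sketch of this idea, assuming labels are integer arrays and a single class of interest. The class name and constructor arguments are hypothetical; GDL's actual dataset class reads from HDF5 files and is not reproduced here.

```python
import numpy as np


class BinaryLabelDataset:
    """Wraps (image, label) pairs and binarizes labels on access."""

    def __init__(self, samples, class_of_interest):
        self.samples = samples                  # list of (image, label) arrays
        self.class_of_interest = class_of_interest

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, index):
        image, label = self.samples[index]
        # 1 where the pixel belongs to the class of interest, 0 (background)
        # everywhere else. The stored multi-class label is left untouched.
        binary = (label == self.class_of_interest).astype(np.uint8)
        return image, binary


# Toy sample: a 3-band 2x2 image with a multi-class label.
image = np.zeros((3, 2, 2), dtype=np.float32)
label = np.array([[0, 1], [4, 2]], dtype=np.uint8)
dataset = BinaryLabelDataset([(image, label)], class_of_interest=4)
_, binary_label = dataset[0]
```

Because the comparison creates a new array, the same stored samples can serve both single-class and multi-class trainings.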

2. Offline approach

Description

In images_to_samples.py, once the geopackage is read and rasterized, create hdf5 samples with binary values (i.e. "class of interest" and "non class of interest"). All irrelevant class values are zeroed out to match the value of the "non class of interest" class. Training then continues as usual.

Advantages

User can resort to already implemented features (i.e. "class proportion" parameter) to filter out samples for which the label contains no values of "class of interest"

Disadvantages

This approach calls for a separate set of hdf5 files. This will require more disk space and cause some duplication if a single-class vs multi-class comparison is desired for the same dataset.

Implementation

After filtering out undesired samples (i.e. those that do not meet the class_prop threshold and the minimum annotated percent), irrelevant class values are zeroed out in the add_to_dataset function, before the final samples are written to hdf5.
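
The order of operations could be sketched as follows. This is an illustration only: the helper name, the "min_class_prop" argument (expressed as a fraction) and the in-memory list standing in for the hdf5 dataset are assumptions, not the actual add_to_dataset code.

```python
import numpy as np


def to_binary_if_kept(label, class_of_interest, min_class_prop):
    """Return a binary label, or None if the sample should be discarded.

    The class proportion is checked on the multi-class label first, then
    irrelevant class values are zeroed out.
    """
    prop = float((label == class_of_interest).mean())
    if prop < min_class_prop:
        return None
    return (label == class_of_interest).astype(np.uint8)


labels = [
    np.array([[4, 4], [0, 1]]),   # 50% class of interest: kept
    np.array([[1, 2], [0, 3]]),   # 0% class of interest: discarded
]

written = []  # stands in for the hdf5 dataset in this sketch
for label in labels:
    binary = to_binary_if_kept(label, class_of_interest=4, min_class_prop=0.25)
    if binary is not None:
        written.append(binary)
```

Only the first label is written, with class 4 mapped to 1 and every other value mapped to 0.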

Dev effort estimate

The effort needed to implement either approach ranges from 1 to 2 workdays. This development seems fairly simple at first glance.

Useful resources and articles

Buildings extraction

Road extraction

Ding & Bruzzone, 2020, "DiResNet: Direction-aware Residual Network for Road Extraction in VHR Remote Sensing Images"

CharlesAuthier · 2020-11-12T18:14:43Z

Personally I prefer the first one.

ymoisan · 2020-11-19T22:04:12Z

I would call the approaches dynamic/static rather than online/offline. In an ideal world, that is, when we can fetch training samples through the web, there could be an opportunity for us not to bother anymore with duplicating our data into HDF5 files. In that respect, I would prefer the dynamic approach (#1).

remtav · 2022-01-25T14:40:22Z

This feature was broken with PR #208.

This feature should apply to any subset of classes, as was done with the parameter target_ids. In previous implementations, the offline/static approach was chosen. The disadvantages of this approach can be dealt with by sampling to plain .tifs and .geojson files rather than HDF5s.

remtav · 2022-05-09T16:53:26Z

A solution is proposed in my 215-solaris-tiling branch. The vector ground truth is first tiled to vector chips (as geojson) alongside the imagery. The desired attribute values for a particular attribute field are then burned to raster. The advantage of this approach is that the geojson chips contain all the initial attribute information and only need to be created once. The burning of specific values can be requested multiple times and stored to different copies as raster chips.

CharlesAuthier · 2022-09-13T13:00:48Z

This issue will be solved with the merge of the branch 222-stac-item-input and the change made at line 471 in train_segmentation.py. This change makes the ground truth (gt) be read as binary if only one class is specified at attribute_values in the dataset yaml.

mpelchat04 added the P1 High priority label Nov 11, 2020

mpelchat04 added this to the 1.3.0 milestone Nov 11, 2020

remtav mentioned this issue Jan 25, 2022

Reimplement choosing subset of classes from ground truth based on attribute field/values #244

Closed

ymoisan modified the milestones: 1.3.0, Training enhancements Feb 15, 2022

remtav mentioned this issue May 9, 2022

Reading single-band rasters: unified approach with multiband reader #223

Closed

mpelchat04 closed this as completed Mar 10, 2023
