Test sequence 0001TP_2. Left to right: input image sequence, true masks, and predicted masks overlaid onto the images
This repo consists of my implementation of the U-Net model, plus two additional variants that use a pre-trained ResNet50V2 or MobileNetV2, for semantic segmentation on the Cambridge-driving Labeled Video Database (CamVid). I implemented the model in tensorflow==2.11. Please use ./train_model.ipynb to train new models and ./evaluate_model.ipynb to evaluate their performance.
I treated this project as an opportunity to learn the image segmentation task, including how to prepare the data, what some of the popular models are, and how to evaluate their performance. For this reason I chose a simple algorithm (i.e. U-Net) and a dataset small enough that I could run everything on my own hardware. My final models are still far from state-of-the-art performance, but I'm still impressed by how well they work given their simple implementations. I also chose not to include the trained models in this repo (even though I kept the ./models folder) because of their sizes (between 300~600 MB) and how easily they can be trained on basic hardware or on Colab.
Some of the useful papers and links I referred to include:
- Overview of image segmentation: https://arxiv.org/abs/1704.06857, https://arxiv.org/abs/2001.05566, https://www.jeremyjordan.me/semantic-segmentation/, https://medium.com/swlh/image-segmentation-using-deep-learning-a-survey-e37e0f0a1489
- The original U-Net paper: https://arxiv.org/abs/1505.04597
- The SegNet paper, which includes a benchmark on the CamVid dataset: https://arxiv.org/abs/1505.07293
- Papers describing the CamVid dataset: https://doi.org/10.1016/j.patrec.2008.04.005, https://www.robots.ox.ac.uk/~lubor/bmvc09.pdf
- Benchmarks: https://paperswithcode.com/sota/semantic-segmentation-on-camvid
The goal of semantic segmentation is to assign a label to every pixel in an image. It differs from object detection, which tries to find a bounding box for each detected object, and from instance segmentation, which tries to assign both the semantic class and the specific object instance to each pixel. The class labels can be either binary ("does this pixel belong to the class or not") or multiclass ("which class does this pixel belong to"). For this project I focused on multiclass semantic segmentation.
One of the simplest models for semantic segmentation is the U-Net. The network consists of a symmetric, fully convolutional encoder-decoder with skip connections between each encoder-decoder stage. It was originally created for segmenting biomedical images but has been applied to many other areas (probably because of how easily it can be implemented). The model is trained to reproduce the given segmentation masks from the input images.
Some common metrics for evaluating a model's performance include:
- Pixel-wise prediction accuracy per class ("how many pixels belonging to a specific class are predicted correctly"), as well as the average across all classes (excluding "Void").
- Global pixel-wise prediction accuracy ("how many pixels, regardless of class and excluding 'Void', are predicted correctly").
- Intersection-over-union (IOU), also known as the Jaccard score, per class ("how well does the prediction overlap with the ground truth"), and the average across all classes (mIOU).
Please refer to the review papers for the definition of each metric. The functions for evaluating these metrics are stored in ./utils.py.
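As a rough illustration of how these metrics work, a per-class pixel accuracy and IOU computation might look like the sketch below. This is a simplified stand-in rather than the exact code in ./utils.py, and the function and variable names are made up for the example:

```python
import numpy as np

VOID = 255  # 'Void' pixels are excluded from all metrics

def per_class_metrics(y_true, y_pred, num_classes=11):
    """Per-class pixel accuracy and IOU over flattened integer masks."""
    y_true, y_pred = y_true.ravel(), y_pred.ravel()
    valid = y_true != VOID                       # drop 'Void' pixels
    y_true, y_pred = y_true[valid], y_pred[valid]
    acc, iou = {}, {}
    for c in range(num_classes):
        gt, pr = y_true == c, y_pred == c
        inter, union = np.sum(gt & pr), np.sum(gt | pr)
        acc[c] = inter / max(np.sum(gt), 1)      # correct pixels / true pixels of class c
        iou[c] = inter / max(union, 1)           # intersection over union
    global_acc = np.mean(y_true == y_pred)       # global pixel accuracy ('Void' excluded)
    return acc, iou, global_acc
```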
11 semantic categories and data splits
CamVid is a database for understanding road or driving scenes. It consists of 701 images and hand-annotated masks captured from 5 driving video sequences. The original dataset contains 32 semantic classes, although an 11-category classification (which combines several similar classes) is more often used in the literature. The 701 image-mask pairs are split into 367 for training, 101 for validation, and 233 for testing.
The data I used (which are included in the ./data folder) come from this repo, credited to lih627.
My implementation of the U-Net neural network is available in ./build_model.py. The model consists of 4 encoder and 4 decoder stages with a latent stage in between. Each stage consists of a convolution block plus a MaxPooling2D or Conv2DTranspose layer for down-/upsampling. For the convolution block I used 2x(Conv2D-BatchNorm-GeLU). In my tests this performed better than variants that swap the GeLU for a ReLU activation, drop the BatchNorm layers, or replace them with Dropout.
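For reference, the convolution block and the encoder/decoder stages described above could be written roughly as follows. This is a minimal sketch assuming tf.keras; the actual code in ./build_model.py may differ in details such as padding, initializers, or filter counts:

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv_block(x, filters):
    """2x (Conv2D -> BatchNorm -> GeLU), as used in each encoder/decoder stage."""
    for _ in range(2):
        x = layers.Conv2D(filters, 3, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.Activation("gelu")(x)
    return x

def encoder_stage(x, filters):
    """Conv block followed by 2x2 max pooling; returns the skip tensor and the downsampled tensor."""
    skip = conv_block(x, filters)
    down = layers.MaxPooling2D(2)(skip)
    return skip, down

def decoder_stage(x, skip, filters):
    """Upsample with Conv2DTranspose, concatenate the skip connection, then apply a conv block."""
    x = layers.Conv2DTranspose(filters, 2, strides=2, padding="same")(x)
    x = layers.Concatenate()([x, skip])
    return conv_block(x, filters)
```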
I resized the images & masks to 224x224. The images have the shape (224, 224, 3), 3 for each of the RGB channels. The masks have the shape (224, 224, 1), the last dimension being the integer category label ranging from 0~10 and 255 (the 'Void' class, which is ignored in loss calculation). I used the SparseCategoricalCrossentropy loss and Adam optimizer. One could also use CategoricalCrossentropy loss but that would require one-hot encode the masks and be very memory-intensive.
Besides the vanilla U-Net I also implemented 2 modified models that use the feature extractors from pre-trained ResNet50V2 and MobileNetV2 as encoders. The decoder networks in these 2 models are identical to the vanilla U-Net's.
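A rough sketch of how a pre-trained backbone can stand in for the encoder is shown below. The specific MobileNetV2 layer names follow the common tf.keras segmentation pattern and are an assumption about, not a copy of, the code in ./build_model.py:

```python
import tensorflow as tf

base = tf.keras.applications.MobileNetV2(input_shape=(224, 224, 3), include_top=False)

# Layers whose outputs serve as skip connections (coarsest last)
skip_names = [
    "block_1_expand_relu",   # 112x112
    "block_3_expand_relu",   # 56x56
    "block_6_expand_relu",   # 28x28
    "block_13_expand_relu",  # 14x14
    "block_16_project",      # 7x7 bottleneck
]
encoder = tf.keras.Model(
    inputs=base.input,
    outputs=[base.get_layer(name).output for name in skip_names],
)
encoder.trainable = False  # keep the ImageNet weights frozen (or fine-tune later)
# The decoder then upsamples from the bottleneck, concatenating each skip tensor
# exactly as in the vanilla U-Net decoder.
```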
Image, true mask, predicted mask using the vanilla U-Net
Model | U-Net | w.ResNet50V2 | w.MobileNetV2 | *SegNet (3.5k training set) |
---|---|---|---|---|
Sky | 96.1 | 95.7 | 94.6 | 96.1 |
Building | 85.3 | 84.0 | 90.0 | 89.6 |
Pole | 22.0 | 28.4 | 17.4 | 32.1 |
Road | 96.7 | 96.5 | 94.3 | 96.4 |
Pavement | 80.5 | 85.9 | 89.2 | 62.2 |
Tree | 76.7 | 75.1 | 79.0 | 83.4 |
SignSymbol | 47.3 | 50.3 | 29.9 | 52.7 |
Fence | 23.3 | 31.5 | 27.5 | 53.45 |
Car | 81.6 | 74.1 | 72.5 | 87.7 |
Pedestrian | 60.6 | 41.1 | 37.4 | 62.2 |
Bicyclist | 32.0 | 29.4 | 11.9 | 36.5 |
Class avg. | 58.5 | 57.7 | 53.6 | 71.20 |
Global avg. | 86.4 | 86.0 | 87.0 | 90.40 |
mIOU | 0.52 | 0.50 | 0.50 | 0.60 |
Pixel accuracy by class, class average, global average, and mIOU of the 3 trained U-Net models, plus SegNet (as reported in the paper). *The SegNet model was trained on an extended 3.5k-image training set, so it isn't exactly a fair comparison.
Pixel accuracy vs IOU by class
Pixel accuracy vs percentage in test set data
The results shown above definitely leave a lot to be desired. While the models do decently well on classes that have more pixels in the training data, such as Sky, Road, Building, and to some extent Car and Tree, they perform poorly on smaller classes such as Pole, Fence, and Bicyclist. If this were going into a self-driving system, I'd be worried about a model that cannot identify a road sign, let alone a cyclist or a pedestrian. As a cyclist myself I'm well aware of how dangerous cars and trucks can be when their drivers aren't aware of the cyclists near their vehicles.
I included the SegNet benchmarks because I couldn't find one for U-Net. SegNet is a very similar model, except it upsamples using the max-pooling indices from the encoder instead of trainable upsampling layers. I'm quite surprised that my models actually do better in 3 of the 11 categories, especially since the SegNet model was trained on a significantly larger dataset.
I'm also surprised that the two models built on pre-trained encoders didn't perform much better than the vanilla U-Net. My hypothesis is that, because the training dataset is so small, the model does not need a more sophisticated encoder network (or perhaps we can call it "transferable knowledge"?) to capture the patterns in this dataset. With a bigger dataset these two models might improve more than the vanilla model does. To be fair, it could also be because the decoder networks are basically the same (except at the concatenate layers) for all three models.
Finally, I've read arguments for and against the different metrics: whether pixel accuracy, IOU, or something else is better for evaluating a segmentation model. In my tests there is a decent correlation between class pixel accuracy and class IOU, with SignSymbol and Pedestrian as the anomalies. In any case, a good model looks good under either of these two metrics.
Here are some ideas for improving the model that I've thought about, read about but haven't tried, or tried but couldn't get to work:
- Use data augmentation. This was emphasized in the original U-Net paper, as the authors had a very limited dataset for training their model. I thought this was going to be an easy thing to do, but it turned out to be quite difficult using the augmentation functions in tf.keras (see the sketch after this list).
- Add more training data, such as the Cityscapes dataset, which is significantly bigger than CamVid. In my experience adding more data almost always helps improve the performance of a model.
- Add a Conditional Random Field module (such as a CRF-RNN layer) after the neural network. This was used in the first version of DeepLab. I found this implementation of CRF-RNN for the original keras, but I couldn't get it to work with tensorflow 2. The author does have an alternative implementation for pytorch, which has much better documentation than the keras version.
- (Obviously) try other algorithms.
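On the augmentation point above: one workaround (a sketch of the general idea, not something this repo currently does) is to stack the image and mask along the channel axis so the same random spatial transform is applied to both, then split them again. The crop size below is an arbitrary choice for illustration:

```python
import tensorflow as tf

def augment(image, mask):
    """Apply the same random flip/crop to image and mask by transforming them as one tensor."""
    mask = tf.cast(mask, image.dtype)
    pair = tf.concat([image, mask], axis=-1)             # shape (H, W, 3 + 1)
    pair = tf.image.random_flip_left_right(pair)         # identical flip for image and mask
    pair = tf.image.random_crop(pair, size=(200, 200, 4))
    image = tf.image.resize(pair[..., :3], (224, 224))   # bilinear for the image
    mask = tf.image.resize(pair[..., 3:], (224, 224), method="nearest")  # keep labels intact
    return image, tf.cast(mask, tf.int32)
```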