Kuzushiji

This is the 13th place solution to the Kuzushiji Recognition Kaggle competition.

Thanks to Kaggle and the organizers for creating such an interesting competition. Congrats to everyone who finished.

Data preprocessing

I resized all images to 512x512. Images fed to my character classification network were also converted to greyscale (see Models below).
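
A minimal sketch of this step, assuming PIL and a flat directory of JPEG images (the paths and resampling filter are illustrative):

```python
from pathlib import Path
from PIL import Image

def preprocess(src_dir: str, dst_dir: str, greyscale: bool = False) -> None:
    """Resize every image to 512x512, optionally converting to greyscale."""
    out = Path(dst_dir)
    out.mkdir(parents=True, exist_ok=True)
    for path in Path(src_dir).glob("*.jpg"):
        image = Image.open(path)
        if greyscale:
            image = image.convert("L")  # single-channel greyscale
        image = image.resize((512, 512), Image.BILINEAR)
        image.save(out / path.name)
```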

Models

My solution was based on Faster R-CNN, with several small modifications.

  1. Instead of choosing anchor sizes arbitrarily, I selected them by running k-means clustering on the widths and heights of the ground truth bounding boxes and used the cluster centers as my anchor box sizes (see the sketch after this list).
  2. I used ROI Align pooling instead of standard ROI pooling, which I empirically found yielded slightly better results.
  3. No layers were shared between the network that proposes regions of interest and the network that classifies them. This made experimentation easier, since the networks could be modified independently of one another, and it also simplified the training procedure. However, the inference efficiency of my solution could likely be improved by sharing all of the residual blocks.
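
A minimal sketch of the anchor selection from item 1, assuming scikit-learn and ground truth boxes given as (width, height) pairs; the number of clusters is illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

def anchor_sizes_from_boxes(box_whs: np.ndarray, num_anchors: int = 9) -> np.ndarray:
    """Cluster (width, height) pairs; the cluster centers become anchor sizes."""
    kmeans = KMeans(n_clusters=num_anchors, n_init=10, random_state=0).fit(box_whs)
    return kmeans.cluster_centers_

# Example: widths/heights of ground truth boxes, shape (N, 2).
boxes = np.array([[32, 48], [30, 50], [64, 96], [60, 100], [16, 24],
                  [18, 22], [40, 60], [45, 55], [70, 110], [15, 20]])
print(anchor_sizes_from_boxes(boxes, num_anchors=3))
```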

My region proposal network and my character classification network both used wide ResNet-34 backbones. My region proposal network used twice as many convolutional filters as the ResNet-34 configuration described in the original ResNet paper, and my character classification network used roughly three times as many. I experimented with deeper ResNet-50- and ResNet-101-based networks, but found that wider, shallower networks worked better.
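
A hedged sketch of the widening idea, in plain PyTorch rather than the actual competition code: the ResNet-34 stage layout (3, 4, 6, 3 basic blocks) is kept, but every stage's channel count is scaled by a width multiplier (2.0 for the region proposal backbone, roughly 3.0 for the classifier backbone):

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """Standard two-conv residual block, as in ResNet-34."""
    def __init__(self, cin: int, cout: int, stride: int = 1):
        super().__init__()
        self.conv1 = nn.Conv2d(cin, cout, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(cout)
        self.conv2 = nn.Conv2d(cout, cout, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(cout)
        self.shortcut = nn.Identity()
        if stride != 1 or cin != cout:
            self.shortcut = nn.Sequential(
                nn.Conv2d(cin, cout, 1, stride=stride, bias=False),
                nn.BatchNorm2d(cout))

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(out + self.shortcut(x))

def make_stage(cin: int, cout: int, depth: int, stride: int) -> nn.Sequential:
    blocks = [BasicBlock(cin, cout, stride)]
    blocks += [BasicBlock(cout, cout) for _ in range(depth - 1)]
    return nn.Sequential(*blocks)

def wide_resnet34_backbone(width_factor: float = 2.0, in_channels: int = 3) -> nn.Sequential:
    """ResNet-34 stage layout with every stage's width scaled by width_factor."""
    widths = [int(64 * width_factor * m) for m in (1, 2, 4, 8)]
    return nn.Sequential(
        nn.Conv2d(in_channels, widths[0], 7, stride=2, padding=3, bias=False),
        nn.BatchNorm2d(widths[0]),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(3, stride=2, padding=1),
        make_stage(widths[0], widths[0], 3, stride=1),
        make_stage(widths[0], widths[1], 4, stride=2),
        make_stage(widths[1], widths[2], 6, stride=2),
        make_stage(widths[2], widths[3], 3, stride=2))
```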

To combat overfitting, I used dropout quite extensively: 2D dropout within all of my residual blocks and ordinary dropout before the final 1x1 conv layers.
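
A small sketch of this dropout placement; the rates and channel counts are illustrative:

```python
import torch
import torch.nn as nn

class ResidualBlockWithDropout(nn.Module):
    """Residual block with 2D dropout, which zeroes entire feature maps."""
    def __init__(self, channels: int, drop_rate: float = 0.1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Dropout2d(drop_rate),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels))

    def forward(self, x):
        return torch.relu(x + self.body(x))

# Ordinary dropout before a final 1x1 conv that produces per-class scores.
num_classes = 4000  # placeholder; set to the size of the character vocabulary
head = nn.Sequential(nn.Dropout(0.5), nn.Conv2d(512, num_classes, kernel_size=1))
```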

My final region proposal network uses color images and my final classification network uses greyscale images. I initially used greyscale images for both the region proposal network and the classification network because I didn't think the additional channels would be useful and wanted to conserve space. Near the end of the competition, I tried using color images for both networks. This slightly improved the region proposal network's output but worsened the classification network's accuracy due to overfitting.

Failed attempt at using a language model for post-processing the detection results

I was very interested in using a language model to correct the incorrect detections produced by my image processing code. I spent over a month trying different approaches but was unable to get it working well.

To get the characters into approximately the order in which a human would read them, I used DBSCAN to group the characters into columns, sorted the columns by the mean horizontal coordinates of their characters, and then sorted the characters within each column by their vertical coordinate. This worked well for most images.
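
A minimal sketch of this reading-order heuristic, assuming scikit-learn's DBSCAN and character centers as (x, y) points; eps is illustrative and depends on image scale, and the right-to-left column order reflects how kuzushiji text is conventionally read:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def reading_order(centers: np.ndarray, eps: float = 30.0) -> list[int]:
    """Return character indices in approximate column-wise reading order."""
    # Cluster on the horizontal coordinate only, so each cluster is a column.
    labels = DBSCAN(eps=eps, min_samples=1).fit_predict(centers[:, :1])
    columns = {}
    for index, label in enumerate(labels):
        columns.setdefault(label, []).append(index)
    # Sort columns by mean horizontal coordinate (right to left here).
    ordered_columns = sorted(columns.values(),
                             key=lambda idxs: -centers[idxs, 0].mean())
    order = []
    for column in ordered_columns:
        # Within a column, read top to bottom (ascending y).
        order.extend(sorted(column, key=lambda i: centers[i, 1]))
    return order
```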

To train my correction networks, I used the ground truth labels to generate a large number of synthetic submission files containing known errors. These errors were fairly realistic: they were randomly added based on statistics describing how my Faster R-CNN-based model performed on a cross-validation dataset.
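
A hedged sketch of this kind of error injection; the error types and rates below are illustrative stand-ins for the cross-validation statistics mentioned above:

```python
import random

def corrupt_sequence(chars: list[str], vocab: list[str],
                     sub_rate: float = 0.05, drop_rate: float = 0.02,
                     insert_rate: float = 0.02) -> list[str]:
    """Randomly substitute, drop, and insert characters to mimic detector mistakes."""
    corrupted = []
    for ch in chars:
        roll = random.random()
        if roll < drop_rate:
            continue                                # missed detection
        if roll < drop_rate + sub_rate:
            corrupted.append(random.choice(vocab))  # misclassification
        else:
            corrupted.append(ch)
        if random.random() < insert_rate:
            corrupted.append(random.choice(vocab))  # false positive
    return corrupted
```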

To perform the character label corrections, I initially tried network architectures designed for language translation, such as encoder-decoder LSTMs with an attention mechanism and transformer networks. I favored these architectures over passing characters into a single recurrent network because, in theory, they should elegantly handle cases in which the number of ground truth characters differs from the number of detected characters. Unfortunately, these networks introduced slightly more errors than they corrected. I eventually abandoned the translation-based architectures and switched to a simple bi-directional GRU network. It improved my score by around 0.01, but only on submission files generated by weak label detection models. Running it on my best submission file lowers my score by ~0.003.
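
A minimal sketch of a bi-directional GRU sequence labeller of the kind described above, which re-predicts each detected character's label from its neighbors; the sizes are illustrative:

```python
import torch
import torch.nn as nn

class CorrectionGRU(nn.Module):
    def __init__(self, vocab_size: int, embed_dim: int = 128, hidden_dim: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden_dim, num_layers=2,
                          batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden_dim, vocab_size)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, sequence_length) of detected character IDs.
        features, _ = self.gru(self.embed(token_ids))
        return self.head(features)  # per-position corrected label scores
```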

I think the reason I had so much difficulty getting a language model to work well is that my Faster R-CNN-based models do a better job of considering a character's context than I initially expected. To get the correction network working well, I likely would have needed to pull in an outside text corpus.

Potential improvements

There are several potential avenues for improvement that I did not implement or test.

  1. Data augmentation
  2. Pseudolabeling
  3. Using an ensemble of multiple classification networks
