Forked from https://github.com/Lyken17/Efficient-PyTorch/ and simplified. Fixed quite a few warnings and made it easier to use via command line. Tested on both Windows and Linux systems using Python 3.8.
Trained on the Cats versus Dogs dataset avaliable on Kaggle. Results compare the torch.ImageFolder and our lmdb implementation. These are the results using a local SSD:
Timings for lmdb
Avg data time: 0.011866736168764075
Avg batch time: 0.10090051865091129
Total data time: 2.325880289077759
Total batch time: 19.776501655578613
Timings for imagefolder:
Avg data time: 0.017892257291443493
Avg batch time: 0.1053010200967594
Total data time: 3.506882429122925
Total batch time: 20.638999938964844
These are the results using a network file system (NFS) drive:
Timings for lmdb
Avg data time: 0.040608997247657
Avg batch time: 0.06778134983413074
Total data time: 7.9593634605407715
Total batch time: 13.285144567489624
Timings for imagefolder:
Avg data time: 0.056209570291090985
Avg batch time: 0.08088788086054277
Total data time: 11.017075777053833
Total batch time: 15.854024648666382
The format of converted LMDB is defined as follow.
key | value |
---|---|
img-id1 | (jpeg_raw1, label1) |
img-id2 | (jpeg_raw2, label2) |
img-id3 | (jpeg_raw3, label3) |
... | ... |
img-idn | (jpeg_rawn, labeln) |
__keys__ |
[img-id1, img-id2, ... img-idn] |
__len__ |
n |
As for details of reading/writing, please refer to code.
The folder2lmdb script can convert a default image-label structure to an LMDB file (see above). For example, to run it on Linux, given the Dogs vs Cats dataset is in /data and it has a subfolder called "train":
python folder2lmdb.py -f ~/pytorch-lmdb/data/cats_vs_dogs -s "train"
The usage of ImageFolderLMDB
is identical to torchvision.datasets
.
import ImageFolderLMDB
from torch.utils.data import DataLoader
dst = ImageFolderLMDB(path, transform, target_transform)
loader = DataLoader(dst, batch_size=64)
The main script includes the ImageFolderLMDB class. It can be run from command line and takes an ImageFolder path and a LMDB database path, runs training on the Dogs vs Cats dataset and outputs execution times of the two file storage strategies. For example, to run it on Linux, given the Dogs vs Cats dataset is in /data and the already created LMDB file is too:
python main.py -f ~/pytorch-lmdb/data/cats_vs_dogs/train -l ~/pytorch-lmdb/data/cats_vs_dogs/train.lmdb