
Creating LevelDB in Python #745

Closed
Zackory opened this issue Jul 20, 2014 · 13 comments


Zackory commented Jul 20, 2014

I am currently creating several million artificial data images in Python, all of which I would like stored in LevelDB to be fed through Caffe.

At the moment I'm saving all of the images directly to file, and then using 'create_leveldb.sh' to create the LevelDB directories. This creates a problem, as I have to save a couple million images to the HDD.

What I am trying to do is have Python save the artificial images directly into LevelDB, without writing the images to file first. Currently, my code is trying to emulate what happens in 'ReadImageToDatum' from io.cpp.

The LevelDB created from my code matches the size (number of leveldb files) of the LevelDB created from convert_imageset.bin; however, when I train caffe on my leveldb directory, both Test#1 and Test#2 get worse over time.

What I am suspecting is that I have missed something when converting the image into a string format, but I may have missed something completely different.

db = plyvel.DB('train_leveldb/', create_if_missing=True, error_if_exists=True, write_buffer_size=268435456)
wb = db.write_batch()

...

for file in imageSet:
    image = cv.LoadImageM(file, cv.CV_LOAD_IMAGE_COLOR)

    # Load image into datum object
    datum = caffe.proto.caffe_pb2.Datum()
    datum.height = image.rows
    datum.width = image.cols
    datum.channels = 3
    datum.label = label
    datum.data = image.tostring()

    wb.put('%08d_%s' % (count, file), datum.SerializeToString())

    count = count + 1
    if count % 1000 == 0:
        # Write batch of images to database
        wb.write()
        del wb
        wb = db.write_batch()
        print 'Processed %i images.' % count

if count % 1000 != 0:
    # Write last batch of images
    wb.write()
    print 'Processed a total of %i images.' % count
else:
    print 'Processed a total of %i images.' % count
longjon (Contributor) commented Jul 21, 2014

Keep in mind that data needs to be one byte per channel, in row major order, with axes (channel, y, x). I haven't checked this, but I'm guessing that the axis order represented here is (y, x, channel). Also note that OpenCV likes BGR ordered channels (but if you're training from scratch, that shouldn't make a difference).
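
As a minimal sketch (not from the original reply) of what that layout means for an OpenCV-loaded image; the file path is only a placeholder:

import cv2
import numpy as np

img = cv2.imread('example.png', cv2.IMREAD_COLOR)  # shape (y, x, channel), dtype uint8, BGR order
chw = img.transpose((2, 0, 1))                     # reorder axes to (channel, y, x)
data = np.ascontiguousarray(chw).tostring()        # row-major bytes, one byte per channel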

Zackory (Author) commented Jul 22, 2014

I tried using the code below to flip the axes to (channel, y, x), but I am still getting the same problem when training. Possibly my reshaping of the axes is incorrect, though?

image = ndimage.imread(file, mode='RGB')

# Reshape image to (channel, y, x)
s = image.shape
image = image.reshape((s[2], s[0], s[1]))

I've also tried the following, without success.

image = ndimage.imread(file, mode='RGB')
s = image.shape
temp = np.zeros((s[2], s[0], s[1]), dtype=np.uint8)
temp[0, ..., ...] = image[..., ..., 0]
temp[1, ..., ...] = image[..., ..., 1]
temp[2, ..., ...] = image[..., ..., 2]
image = temp

longjon (Contributor) commented Jul 22, 2014

You want transpose, not reshape. The second block looks like it ought to work, although you mean :, not ....
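
For example (a hedged sketch, not part of the original comment), on a dummy array:

import numpy as np

image = np.zeros((101, 101, 3), dtype=np.uint8)  # axes (y, x, channel)
chw = image.transpose((2, 0, 1))                 # correct: reorders the axes to (channel, y, x)
# image.reshape((3, 101, 101))                   # wrong: keeps the buffer order and scrambles pixels
chw2 = np.zeros((3, 101, 101), dtype=np.uint8)
chw2[0, :, :] = image[:, :, 0]                   # ':' selects a whole axis; repeating '...' raises an IndexError in NumPy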

You might want to read _Net_preprocess in python/caffe/pycaffe.py for example code of this sort (although keep in mind that the preprocessing here is not exactly the same as that performed by DataLayer).

You can debug by loading both working and nonworking nets in Python and comparing the data blobs.

Zackory (Author) commented Jul 22, 2014

Nice suggestion, longjon! The _Net_preprocess seems to be a step in the right direction. I was really hoping transpose would have fixed the problem.

I've added the code below, still with the same result: training keeps getting worse. I've also tried loading both working and nonworking networks in Python, however both nets have empty data blobs (all zeros).

image = image[:, :, (2, 1, 0)]
image = image.transpose((2, 0, 1))

Cleaning up a bit, here is my new code, though it still has the same problem as before.

db = plyvel.DB('train_leveldb/', create_if_missing=True, error_if_exists=True, write_buffer_size=268435456)
wb = db.write_batch()

...

for file in imageSet:
    image = caffe.io.load_image(file)

    # Reshape image
    image = image[:, :, (2, 1, 0)]
    image = image.transpose((2, 0, 1))
    image = image.astype(np.uint8, copy=False)

    # Load image into datum object
    datum = caffe.io.array_to_datum(image, label)

    wb.put('%08d_%s' % (count, file), datum.SerializeToString())

    count = count + 1
    if count % 1000 == 0:
        # Write batch of images to database
        wb.write()
        del wb
        wb = db.write_batch()
        print 'Processed %i images.' % count

if count % 1000 != 0:
    # Write last batch of images
    wb.write()
    print 'Processed a total of %i images.' % count
else:
    print 'Processed a total of %i images.' % count

longjon (Contributor) commented Jul 23, 2014

When checking in Python, did you call forward to populate the blobs?

You can also check that the same data appear in both working and nonworking LevelDBs directly.

Also check that your labels are correct, and that you are shuffling them if you need to in order to balance classes in each batch.
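
A rough sketch (not from the original comment) of inspecting a LevelDB directly with plyvel; the path is a placeholder:

import plyvel
import numpy as np
from caffe.proto import caffe_pb2

db = plyvel.DB('train_leveldb/')           # placeholder path
key, value = next(db.iterator())           # first record in key order
datum = caffe_pb2.Datum()
datum.ParseFromString(value)
pixels = np.fromstring(datum.data, dtype=np.uint8)
pixels = pixels.reshape(datum.channels, datum.height, datum.width)
print key, datum.label, pixels.shape, pixels.min(), pixels.max()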


hnoel commented Aug 21, 2014

Hi there, how fast does your code run? I am also working with a big database and generating artificial data to augment it. Even with the batch mode, the process is dramatically slow, especially for big images (3x101x101 in my case), and it limits the pre-processing I can evaluate. I have tried to parallelize the process with one batch per core on the same database, but that leads to segmentation faults with both plyvel and py-leveldb (even though leveldb should support multi-threaded access to one DB according to the docs...). I have therefore split my testing database into segments, but the issue remains for the training case. Do you use a special implementation in Python to avoid this? By the way, must the batch size be set to 1,000, or is that just an arbitrary choice?
Thanks.
Henri

shelhamer (Member) commented:

If preprocessing speed and input size are important, you should prepare your data in C++ in a similar scheme to our convert_imageset tool as defined in convert_imageset.cpp.



hnoel commented Aug 21, 2014

Thanks @shelhamer. I will try to modify some of the io cpp methods; it does not seem complicated. When I succeed I will open a PR for some classic artificial data augmentation (rotation, flip, scaling) that people are looking for in other posts. By the way, why must the batch length be less than 1,000 items? Another question: is the key string important, or can we put any string of 256 bits as long as they are all different? Finally, have you ever tried multi-process access to a leveldb database in order to increase the speed?
Thanks, Henri

shelhamer (Member) commented:

Sounds good.

leveldb is limited to single-process access -- look at lmdb for multi-processing.
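
A rough sketch of the same write loop against LMDB with the py-lmdb binding; the path, map_size, and the 'records' iterable are placeholders:

import lmdb

env = lmdb.open('train_lmdb/', map_size=1024 ** 4)        # map_size caps how large the DB may grow
with env.begin(write=True) as txn:
    for count, (name, datum) in enumerate(records):       # 'records' is assumed to yield (name, Datum) pairs
        txn.put('%08d_%s' % (count, name), datum.SerializeToString())

LMDB allows concurrent reader processes, while writes still go through one write transaction at a time.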


Zackory (Author) commented Sep 6, 2014

Hey all, my apologies for the late reply. I did solve my problem; the reason training was getting worse over time came down to two things.

  1. The image must be in BGR format rather than RGB and must be transposed to (channels, y, x)

    image = image[:, :, (2, 1, 0)]
    image = image.transpose((2, 0, 1))

  2. The data must be shuffled before being stored in leveldb format. Storing a batch of 1000 images all with the same class doesn't work.
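
A minimal sketch of that shuffling step (assuming a 'labels' list parallel to imageSet, which does not appear in the snippets above):

import random

pairs = list(zip(imageSet, labels))   # 'labels' is an assumed list parallel to imageSet
random.shuffle(pairs)                 # mix classes so no write batch is single-class
imageSet, labels = zip(*pairs)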

A few answers to questions asked above:

_Why are all the images written to leveldb as batches of 1000 images? Is this significant?_

My code tries to duplicate what happens in convert_imageset.cpp found below (specifically at line 183).
https://github.com/BVLC/caffe/blob/master/tools/convert_imageset.cpp

From what I tested, I didn't find the batch size to be significant in CNN training. A batch size of 100 or 10,000 shouldn't act much differently than a batch size of 1,000. The only caveat is that each batch should be representative of all image classes (randomly shuffled data).

_Is the key important when saving an image into leveldb?_

Yes. I found the format of the key to be extremely important.
For example the following works:

wb.put('%08d_%s' % (count, filename), datum.SerializeToString())

The line below, however, doesn't work when caffe tries to load the data back out of leveldb for training.

wb.put('%06d_%s' % (count, filename), datum.SerializeToString())

More precisely, the key must be in the format '00000000_*', where the first 8 characters of the string must be digits, followed by an underscore. The * can be anything else you want (a unique image identifier, for example). I chose to give each image a descriptive 'file name'. A full key for me might look like:
'00000068_keyboard_rotated_90.jpg'

mohomran (Contributor) commented Sep 6, 2014

  • If you're training from scratch and not finetuning an existing network, the order of the channels shouldn't matter.
  • A batch size of N (in the context of putting together a leveldb folder) also shouldn't matter for training. It just means that you collect N data points in a buffer before writing them to disk, which can be faster than one write per datum.
  • What does matter is that when your data is sorted by key, it ends up in the sequence you want for training. Also, each data point needs to have a unique key. If you had fewer than a million training images, '%06d_%s' or '%06d%s' would work fine (see the example below).
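
To illustrate the key-ordering point (a sketch, not from the original comment): LevelDB's default comparator sorts keys in lexicographic byte order, so zero-padded counters keep the numeric order while unpadded ones do not.

keys_unpadded = ['%d_img' % i for i in (1, 2, 10)]
keys_padded = ['%08d_img' % i for i in (1, 2, 10)]

print sorted(keys_unpadded)   # ['10_img', '1_img', '2_img']: 10 sorts before 2
print sorted(keys_padded)     # ['00000001_img', '00000002_img', '00000010_img']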

scturtle commented:

The Python module leveldb performs badly, but plyvel works well, even with the same logic. For example:

db = leveldb.LevelDB('mnist-%s-leveldb/' % name, create_if_missing=True, error_if_exists=True)
for im, lb in zip(imgs, lbls):
    ......
    db.Put('%08d' % count, datum.SerializeToString())

and

db = plyvel.DB('mnist-%s-leveldb/' % name, create_if_missing=True, error_if_exists=True)
for im, lb in zip(imgs, lbls):
    ......
    db.put('%08d' % count, datum.SerializeToString())

Has anyone encountered this problem?


serge-m commented Jan 19, 2015

One interesting thing about using plyvel: when I create the DB without using a write batch, the compute_image_mean tool produces a wrong result.
If I put all the writes in one batch, it works fine.

I use the code listed below. It just copies data from one DB to another.

import caffe
import plyvel

db_in = plyvel.DB('input_leveldb/')                         # source DB (path is just an example)
db = plyvel.DB('output_leveldb/', create_if_missing=True)   # destination DB (path is just an example)

wb = db.write_batch()  # without batch binary mean is wrong
datum = caffe.io.caffe_pb2.Datum()
for k, v in db_in:
    wb.put(k, v)
wb.write()  # without batch binary mean is wrong

Does anyone know why this happens?
