Datum db #1568

sguada · 2014-12-12T21:46:36Z

This PR is intended to replace #1238 and fix some of the problems introduced by it. There is now only one internal cursor, which avoids errors like MDB_MAX_READERS and higher memory consumption.

It introduces a new DatumDBParameter to be used by the Data_Layers, e.g.:

layers {
  name: "data"
  type: DATA
  top: "data"
  top: "label"
  data_param {
    batch_size: 256
  }
  datum_db_param {
    source: "examples/imagenet/ilsvrc12_train_lmdb"
    backend: LMDB
  }
  transform_param {
    crop_size: 227
    mean_file: "data/ilsvrc12/imagenet_mean.binaryproto"
    mirror: true
  }
  include: { phase: TRAIN }
}

For lower memory consumption use LEVELDB instead of LMDB.

All the code related to dataset needs to be removed, but before doing that I want to get feedback about this PR.

sguada · 2014-12-12T21:49:29Z

@shelhamer @longjon let me know what you think.
Should I make it more like an Iterator?

sguada · 2014-12-29T19:35:55Z

Coordinating with #1629

Each database can only be opened once, so for each source there should be only one open instance. Defining a DatumDB and a DatumDBFactory makes that constraint easier to enforce.
Each DatumDB can have multiple Generators, each with its own cursor.
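
A minimal sketch of the one-open-instance-per-source constraint described above, assuming C++11. The names FakeDB and FakeDBFactory are hypothetical stand-ins, not the PR's actual DatumDB or DatumDBFactory code. The factory keeps only weak references, so a database is released once its last user goes away.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <mutex>
#include <string>

// Hypothetical stand-in for an opened database handle.
struct FakeDB {
  explicit FakeDB(const std::string& source) : source(source) {}
  std::string source;
};

// Hypothetical factory: hands out at most one live FakeDB per source path.
class FakeDBFactory {
 public:
  static std::shared_ptr<FakeDB> Get(const std::string& source) {
    static std::mutex mutex;
    static std::map<std::string, std::weak_ptr<FakeDB> > open_dbs;
    std::lock_guard<std::mutex> lock(mutex);
    // Reuse the existing instance if some caller still holds it.
    std::shared_ptr<FakeDB> db = open_dbs[source].lock();
    if (!db) {
      db = std::make_shared<FakeDB>(source);
      open_dbs[source] = db;
    }
    return db;
  }
};
```

Two calls with the same source return the same object; a different source gets its own instance.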

bhack · 2014-12-29T20:04:35Z

Can you use tbb queue? See also #1535 request.

sguada · 2014-12-29T20:09:06Z

@bhack yes, the idea is very similar, but instead of adding another dependency I implemented it here:
https://gist.github.com/sguada/1e1d474a25f4ddcc7ba8

sguada · 2014-12-29T20:15:28Z

@bhack TBB looks good, but I'm not sure the overhead of adding another dependency would be worth it; it could be considered later.

bhack · 2014-12-29T20:30:51Z

You could also evaluate boost::lockfree. We already have a Boost dependency.

sguada · 2014-12-29T20:36:55Z

@bhack thanks, but in this case I prefer a blocking queue, since I'm planning to have a thread filling the queue. With a lock-free queue one needs to check whether the element could be pushed or popped, and then busy-wait.
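
To illustrate the difference, here is a minimal sketch of a bounded blocking queue, assuming C++11 standard-library primitives. It is not the implementation in the gist linked above, and the class name and methods are hypothetical. Push and Pop sleep on condition variables instead of polling, which is the property the comment prefers over a lock-free queue.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>

// Hypothetical bounded queue: Push blocks while full, Pop blocks while empty.
template <typename T>
class BlockingQueue {
 public:
  explicit BlockingQueue(std::size_t capacity) : capacity_(capacity) {}

  void Push(const T& value) {
    std::unique_lock<std::mutex> lock(mutex_);
    // Sleep until a consumer makes room; no busy-waiting.
    not_full_.wait(lock, [this] { return queue_.size() < capacity_; });
    queue_.push(value);
    not_empty_.notify_one();
  }

  T Pop() {
    std::unique_lock<std::mutex> lock(mutex_);
    // Sleep until a producer adds an element.
    not_empty_.wait(lock, [this] { return !queue_.empty(); });
    T value = queue_.front();
    queue_.pop();
    not_full_.notify_one();
    return value;
  }

 private:
  const std::size_t capacity_;
  std::queue<T> queue_;
  std::mutex mutex_;
  std::condition_variable not_full_;
  std::condition_variable not_empty_;
};
```

A prefetch thread would call Push in a loop while the data layer calls Pop; elements come out in insertion order.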

sguada · 2015-01-05T21:48:53Z

@cypof I have redone the DatumDB; now Datum_DB_factory keeps track of the opened DBs and allows only one per source.
Each DatumDB can have multiple DatumDB::Generators, each with its own cursor (position).
I think this should simplify having multiple instances reading from the same source.

@shelhamer I think this should fix all the problems related to leveldb and lmdb.

shelhamer · 2015-01-05T23:34:05Z

Ok, thanks Sergio. I'll take a look tomorrow. Fixing up the data pipeline is key!

sguada · 2015-01-06T00:02:08Z

@shelhamer once it is reviewed I will remove the previous dataset code and files.

longjon · 2015-01-06T00:41:14Z

I'll also plan to take a pass by tomorrow.

longjon · 2015-01-07T01:01:42Z

include/caffe/datum_DB.hpp

+    }
+  }
+
+  class Generator {


What is Generator? My guess is it's supposed to be an iterator-like abstraction (a "Generator" nested class in a "DB" class sounds like it ought to produce DBs, but that doesn't seem to be what's going on here). But, it just passes through everything to its datumdb_, holding no internal state, and it doesn't seem to be subclassed, so, what's its raison d'être?

Yeah, the Generator is an iterator-like abstraction, but simpler; it could be moved out of DatumDB and called DatumGenerator for clarity.
I explored different options, and an iterator-like class needs to have state, which in the end is pretty much a DatumDB. So instead of creating new state, a Generator just receives a copy of the DatumDB, which holds the needed state.

That feels confusing and underhanded; why not refer instead of copying?

Does this support multiple simultaneous cursors? If so, the interface should reflect that by implementing the cursor methods (Next, Reset, &c) in the cursor or iterator subclass (or with a cursor or iterator as arguments, which amounts to the same thing). If not, there's no need for a nested class.

The generator is a simplified iterator, more like a stream of Datum, but without the use of it.begin() and it.end(), which I find annoying. I was thinking of calling it a cursor, which seems to make more sense.
Currently it does support multiple cursors. The reason for using a nested class was to simplify the interaction between the cursor and the DB.
The reason for copying instead of referring was to create new cursor-related variables: each cursor needs to have its own cursor variables, but I didn't want to have to define a different cursor for each DatumDB.

longjon · 2015-01-07T01:02:18Z

include/caffe/datum_DB.hpp

+  virtual bool Current(string* key, Datum* datum) = 0;
+
+  DatumDBParameter param_;
+  shared_ptr<bool> is_opened_;


I'm baffled by this... why shared_ptr<bool>?

It is shared because I want to open the DatumDB only once per source and close it when the last Generator or reference is destroyed.
So all the DatumDB copies passed to Generators share the same opened state.
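
For comparison, a minimal sketch of one common way to get "close when the last reference is destroyed" in C++11: a shared_ptr to the handle itself with a custom deleter. This is an illustrative alternative, not what the PR does with shared_ptr<bool>, and Handle and OpenShared are hypothetical names.

```cpp
#include <cassert>
#include <memory>

// Hypothetical handle; it counts closes so the behaviour can be observed.
struct Handle {
  explicit Handle(int* close_count) : close_count(close_count) {}
  int* close_count;
};

// Hypothetical helper: the returned handle is closed when the last copy dies.
inline std::shared_ptr<Handle> OpenShared(int* close_count) {
  return std::shared_ptr<Handle>(new Handle(close_count), [](Handle* h) {
    ++*h->close_count;  // stands in for closing the real database
    delete h;
  });
}
```

Every cursor that copies the shared_ptr keeps the database open; the deleter runs exactly once, after the last copy is gone.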

shelhamer · 2015-01-07T02:16:01Z

With respect to naming, how about input_param or source_param instead of datum_db? I'm thinking ahead to once we've deprecated datum and it might be safer to pick a general name.

longjon · 2015-01-07T03:05:07Z

Agree we should think about our path away from Datum.

I took a coarse pass, mainly at the interface; it seems serviceable but I'd like to see the Generator thing cleaned up with something more conceptually transparent, then I could take a finer pass (the texture seems alright though).

sguada · 2015-01-07T06:50:24Z

@longjon thanks for the pass over the interface and organization. I think there are three options to simplify the interface:

1. Define Cursor as a nested class of DatumDB and pass it a shallow copy of the DatumDB.
2. Define Cursor as a nested class of DatumDB and pass it a reference to the DatumDB, but this would require creating some specific variables depending on the type of DatumDB.
3. Define Cursor as a friend class of DatumDB and pass it a reference to the DatumDB, forcing each type of DatumDB to define its own type of Cursor.

Which one do you think will be clearer and easier to maintain?

@shelhamer I'm totally OK with changing the name to something more general, such as source_param.

I was thinking that a BlobDB could be built by combining a DatumDB and a DataTransformation, but other kinds of BlobDB could be defined. That is the reasoning behind using Datum::Generator: Datum::Generator + DatumTransformation = Blob::Generator.

bhack · 2015-01-07T10:03:04Z

@sguada What is this? Is this a kind of ORM?

cypof · 2015-01-07T18:03:04Z

Is there a way we could completely replace the Generator by queues, as part of merging with #1629? It would let multiple solvers pick from the same db, and simplify prefetching by simply filling the queue. Also it would allow running the db on its own thread.

longjon · 2015-01-07T21:08:45Z

@cypof maybe, although I'd worry about

1. combining too much functionality in the same abstraction, as a cursor seems like a sensible thing on its own, and shouldn't be hard to wrap in a queue to obtain blocking/threading functionality; and
2. I'd rather see this merged sooner than delayed by being drawn into the future,

so I think it might be better to stick to the original vision of a thin wrapper around data sources, and update or wrap the cursor functionality later.

@bhack no, it's meant to be a thin wrapper around data sources to avoid the former grossness of hand-coded switches to interface with different DBs.

@sguada here's what makes sense to me:

- (abstract) base classes define the interfaces for DB and Cursor; it's probably simplest to do both at top level; DB has a method for obtaining cursors, but no friend relationship is needed
- each implementation defines its own kind of DB and Cursor; it might very well make sense to have specific Cursors be nested classes of their specific DBs
- Cursors hold references to their owning DBs if needed (is there a reason that, e.g., for LMDB, you can't just wrap LMDB's own cursor without needing a ref to the DB?)

Indeed, I think each DB should have its own type of cursor, not that this should result in more DB-specific code, it's just that the current implementations of the cursor-like methods (Next &c) should go in a Cursor class rather than a DB class. Think of std::vector and its iterators; iterators don't hold copies of vectors, but rather refer to them as needed, and the logic of traversing the vector is the business of the iterator rather than the vector (although note that iterators are not a class per se, unlike here).
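
The structure suggested above could look roughly like the following sketch, assuming C++11. All names (Cursor, DB, VectorDB) and method signatures are hypothetical, and a std::vector stands in for LMDB or LevelDB. The traversal logic lives in the cursor, which refers to the data rather than copying the DB.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical top-level interfaces for cursors and databases.
class Cursor {
 public:
  virtual ~Cursor() {}
  virtual void SeekToFirst() = 0;
  virtual void Next() = 0;
  virtual bool Valid() const = 0;
  virtual std::string Key() const = 0;
  virtual std::string Value() const = 0;
};

class DB {
 public:
  virtual ~DB() {}
  virtual std::unique_ptr<Cursor> NewCursor() const = 0;
};

// Toy in-memory backend standing in for a real key-value store.
class VectorDB : public DB {
 public:
  typedef std::vector<std::pair<std::string, std::string> > Records;
  explicit VectorDB(const Records& records) : records_(records) {}

  std::unique_ptr<Cursor> NewCursor() const override {
    return std::unique_ptr<Cursor>(new VectorCursor(&records_));
  }

 private:
  // Backend-specific cursor nested in its DB; it holds a pointer, not a copy.
  class VectorCursor : public Cursor {
   public:
    explicit VectorCursor(const Records* records)
        : records_(records), pos_(0) {}
    void SeekToFirst() override { pos_ = 0; }
    void Next() override { ++pos_; }
    bool Valid() const override { return pos_ < records_->size(); }
    std::string Key() const override { return (*records_)[pos_].first; }
    std::string Value() const override { return (*records_)[pos_].second; }

   private:
    const Records* records_;
    std::size_t pos_;
  };

  Records records_;
};
```

Two cursors obtained from the same VectorDB advance independently, which mirrors the multiple-cursors-per-DB requirement discussed in this thread. The cursors must not outlive the DB they refer to.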

Also, how is this supposed to interact with existing data layers? It seems there is an images DB, which I guess is supposed to replace the ImageDataLayer? What about the other data layers; of that functionality, what is meant to be covered here, and what not?

sguada · 2015-01-08T02:29:56Z

@cypof At some point I thought about including the queue and a thread in the DatumDB, but I discarded the idea since it would introduce a lot of unrelated things into this PR. Every new DatumDB should be able to provide that.
What about this other idea instead: we could create a BlobDB or BlobGenerator (or a better name) that takes a DatumDB::Cursor, a Datum_Transformation (to convert a Datum into a Blob), and optionally a queue and some processing threads. The internal threads would get a Datum from the DatumDB, pass it to the Datum_Transformation, and put the result in the queue. The Data_Layers could then read blobs from that queue. If wanted, a BlobDB could have more than one DatumDB (with different sources).
But this would need to wait for a future PR.
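
A single-threaded sketch of the composition idea (source plus transformation yields a new generator), assuming C++11. The Compose helper is hypothetical, plain int and std::string values stand in for Datum and Blob, and the queue and worker threads are left out.

```cpp
#include <cassert>
#include <functional>
#include <string>

// Hypothetical helper: combine a source of In values with an In -> Out
// transformation to obtain a generator of Out values.
template <typename In, typename Out>
std::function<Out()> Compose(std::function<In()> source,
                             std::function<Out(const In&)> transform) {
  return [source, transform]() { return transform(source()); };
}
```

Each call to the composed generator pulls one item from the source and returns its transformed form, so a consumer never sees the untransformed type.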

@longjon I now prefer the idea of defining a DatumDB::DB and a DatumDB::Cursor or DatumDB::Iterator. I think I will model it on leveldb, which has the same separation between DB and Iterator. What do you think about creating a new namespace datumdb to group all its classes?

sguada · 2015-01-08T07:51:03Z

@longjon I have refactored the code and interface to reflect what we discussed. Please let me know your opinion before I change the rest.

longjon · 2015-01-09T22:09:25Z

include/caffe/datum_DB.hpp

+class DatumDBCursor {
+ public:
+  explicit DatumDBCursor(const DatumDBParameter& param)
+    : param_(param) {}


I don't feel too strongly about this, but I think it could be argued that param passing should be left up to subclasses, and subclasses should be explicit about what information needs to be given to cursors. E.g., it seems that LMDB and LevelDB only need the single boolean param_.loop(), but doing it this way obscures that fact and makes one wonder if params might affect cursors in arbitrary ways.
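
A minimal sketch of that suggestion, assuming C++11: the cursor constructor takes only the loop flag it actually uses instead of a whole parameter message. IndexCursor is a hypothetical class that walks indices 0..size-1 and is not part of the PR.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical cursor whose only configuration is an explicit loop flag.
class IndexCursor {
 public:
  IndexCursor(std::size_t size, bool loop)
      : size_(size), loop_(loop), pos_(0) {}

  bool Valid() const { return pos_ < size_; }
  std::size_t Position() const { return pos_; }

  void Next() {
    if (!Valid()) return;
    ++pos_;
    // Wrap around only when looping was requested.
    if (pos_ == size_ && loop_) pos_ = 0;
  }

 private:
  std::size_t size_;
  bool loop_;
  std::size_t pos_;
};
```

With the flag in the signature, a reader can see at a glance that looping is the only setting that affects cursor behaviour.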

longjon · 2015-01-09T22:20:17Z

@sguada okay, I've taken a coarse pass just over the interface and {Level,LM}DB implementations. The new interface seems way more sensible to me... now it's much easier to tell at a glance what basic usage should look like. Two comments as noted. And re: namespaces, yes, I think it's a fine idea to put all this stuff in a namespace; why not just caffe::db?

bhack · 2015-01-09T22:31:14Z

You can also take some ideas from uberDB.

cypof · 2015-01-21T23:34:10Z

@sguada I extracted all the data_layer from the parallel work into a new PR #1773. It should make it easier to merge this one. There is a lock to make sure that only one dataset is created by data file name, which might be redundant with the check you have here.

shelhamer · 2015-01-21T23:39:25Z

@cypof actually #1748 has replaced this PR for simplicity -- sorry for this miscommunication. The highest priority was to simplify the data interface so that's what #1748 does thanks to @longjon.

Closing, although other features of this PR might follow in the future.

sguada added in progress bug enhancement labels Dec 12, 2014

sguada mentioned this pull request Dec 13, 2014

DataLayer to provide data from LIBSVM format file #1571

Closed

This was referenced Dec 23, 2014

Abstraction and unification of data input #1522

Closed

RDMA, data pipeline #1629

Merged

sguada added 7 commits January 2, 2015 11:17

Create datum_DB with leveldb, lmdb and imagesdb backends

9204443

Added Test for datum_DB leveldb, lmdb and imagesdb

29602a5

Adapted convert_imageset and compute_image_mean to DatumDB. Added ext…

7f38e4c

…ract_keys tool Adapted convert_imageset to DatumDB. Added extract_keys tool Adapted data_layer to DatumDB

Make Lint happy and improve small things, i.e. mdb_mapsize = 1 for re…

f432100

…ading

Added Generator to DatumDB

37aa409

Added datum_DB_factory

f06b22e

Fix test_datum_DB and test_data_layer

7e004c5

Make lint happy

sguada force-pushed the datum_DB branch from 5a6472d to 7e004c5 Compare January 5, 2015 21:43

sguada removed the in progress label Jan 5, 2015

sguada mentioned this pull request Jan 5, 2015

Added extract_keys with labels tool #1387

Closed

sguada added the ready for review label Jan 5, 2015

sguada assigned shelhamer Jan 5, 2015

longjon reviewed Jan 7, 2015

shelhamer mentioned this pull request Jan 7, 2015

Reshape single input batches for inputs of varying dimension #1313

Merged

3 tasks

Refactor to separate DatumDB and DatumDBCursor

399c662

longjon reviewed Jan 9, 2015

longjon mentioned this pull request Jan 19, 2015

Simple database wrappers #1748

Merged

shelhamer closed this Jan 21, 2015

sguada mentioned this pull request Aug 12, 2015

Multi-GPU Data Parallelism (with Parallel Data Layers) #2903

Merged

9 tasks

bhack mentioned this pull request Nov 7, 2015

deadlock @ multiGPU caffe #3279

Closed
