Datum db #1568

Closed
wants to merge 8 commits into from

sguada
Contributor

@sguada sguada commented Dec 12, 2014

This PR is intended to replace #1238 and fix some of the problems introduced by it. Now there is only one internal cursor, which avoids errors like MDB_MAX_READERS and higher memory consumption.

Introduces a new DatumDBParameter to be used by the Data_Layers, e.g.:

layers {
  name: "data"
  type: DATA
  top: "data"
  top: "label"
  data_param {
    batch_size: 256
  }
  datum_db_param {
    source: "examples/imagenet/ilsvrc12_train_lmdb"
    backend: LMDB
  }
  transform_param {
    crop_size: 227
    mean_file: "data/ilsvrc12/imagenet_mean.binaryproto"
    mirror: true
  }
  include: { phase: TRAIN }
}

For lower memory consumption use LEVELDB instead of LMDB.

All the code related to dataset needs to be removed, but before doing that I want to get feedback on this PR.

@sguada
Contributor Author

sguada commented Dec 12, 2014

@shelhamer @longjon let me know what you think.
Should I make it more like an Iterator?

@sguada
Contributor Author

sguada commented Dec 29, 2014

Coordinating with #1629

  • Each database can only be opened once, so there should be only one open handle per source. Defining a DatumDB and a DatumDBFactory lets us impose that constraint more easily.
  • Each DatumDB can have multiple Generators, each with its own cursor.
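
To make the two points above concrete, here is a minimal sketch of a factory that hands out one shared handle per source, with generators that each keep their own position. ToyDB and ToyDBFactory are hypothetical names invented for this illustration, and the in-memory vector only stands in for a real LMDB or LevelDB handle; this is not the code from the PR.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical in-memory stand-in for an LMDB/LevelDB handle.
class ToyDB {
 public:
  explicit ToyDB(std::vector<std::string> records)
      : records_(std::move(records)) {}

  // Each generator carries its own position, so several can read one DB.
  class Generator {
   public:
    explicit Generator(std::shared_ptr<const ToyDB> db)
        : db_(std::move(db)), pos_(0) {}
    bool Valid() const { return pos_ < db_->records_.size(); }
    const std::string& Current() const { return db_->records_[pos_]; }
    void Next() { ++pos_; }

   private:
    std::shared_ptr<const ToyDB> db_;  // shared handle, not a copy of the data
    std::size_t pos_;                  // private cursor position
  };

 private:
  std::vector<std::string> records_;
};

// Hypothetical factory: at most one live ToyDB per source string.
class ToyDBFactory {
 public:
  std::shared_ptr<ToyDB> Open(const std::string& source) {
    std::shared_ptr<ToyDB> db = open_[source].lock();
    if (!db) {
      // "Open" the source once; two fake records are enough for the demo.
      db = std::make_shared<ToyDB>(
          std::vector<std::string>{source + "/0", source + "/1"});
      open_[source] = db;
    }
    return db;
  }

 private:
  // weak_ptr so the factory does not keep a DB open after its last user.
  std::map<std::string, std::weak_ptr<ToyDB> > open_;
};
```

Calling Open twice with the same source returns the same handle, while two Generators built from it advance independently.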

@bhack
Contributor

bhack commented Dec 29, 2014

Can you use a TBB queue? See also the request in #1535.

@sguada
Contributor Author

sguada commented Dec 29, 2014

@bhack yes, the idea is very similar, but instead of adding another dependency I implemented it here:
https://gist.github.com/sguada/1e1d474a25f4ddcc7ba8

@sguada
Contributor Author

sguada commented Dec 29, 2014

@bhack TBB looks good, but I'm not sure the overhead of adding another dependency would be worth it; it could be considered later.

@bhack
Contributor

bhack commented Dec 29, 2014

You can also evaluate boost::lockfree. We already have a Boost dependency.

@sguada
Contributor Author

sguada commented Dec 29, 2014

@bhack thanks, but in this case I prefer a blocking queue, since I'm planning to have a thread filling the queue. With a lock-free queue one needs to check whether the element could be pushed or popped, and then busy-wait.
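
For readers unfamiliar with the distinction, a bounded blocking queue can be sketched as follows. This is an illustration written with the C++11 standard library, not the contents of the linked gist; BlockingQueue and SumThroughQueue are names chosen here. Push and Pop sleep on a condition variable instead of spinning, which is the behavior preferred above for a filler thread.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <thread>

// Minimal bounded blocking queue (hypothetical, for illustration only).
template <typename T>
class BlockingQueue {
 public:
  explicit BlockingQueue(std::size_t capacity) : capacity_(capacity) {}

  // Blocks while the queue is full.
  void Push(const T& value) {
    std::unique_lock<std::mutex> lock(mutex_);
    not_full_.wait(lock, [this] { return queue_.size() < capacity_; });
    queue_.push(value);
    not_empty_.notify_one();
  }

  // Blocks while the queue is empty.
  T Pop() {
    std::unique_lock<std::mutex> lock(mutex_);
    not_empty_.wait(lock, [this] { return !queue_.empty(); });
    T value = queue_.front();
    queue_.pop();
    not_full_.notify_one();
    return value;
  }

 private:
  const std::size_t capacity_;
  std::queue<T> queue_;
  std::mutex mutex_;
  std::condition_variable not_full_;
  std::condition_variable not_empty_;
};

// A producer thread fills the queue with 0..n-1 while the caller drains it.
int SumThroughQueue(int n) {
  BlockingQueue<int> queue(2);  // small capacity forces the producer to wait
  std::thread producer([&queue, n] {
    for (int i = 0; i < n; ++i) queue.Push(i);
  });
  int sum = 0;
  for (int i = 0; i < n; ++i) sum += queue.Pop();
  producer.join();
  return sum;
}
```

With a lock-free queue, the same producer would have to retry a failed push in a loop; here the waiting is delegated to the condition variables.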

@sguada
Contributor Author

sguada commented Jan 5, 2015

@cypof I have redone the DatumDB; now the Datum_DB_factory keeps track of the opened DBs and allows only one per source.
Each DatumDB can have multiple DatumDB::Generators, each with its own cursor (position).
I think this should simplify having multiple instances reading from the same source.

@shelhamer I think this should fix all the problems related to leveldb and lmdb.

@shelhamer
Member

Ok, thanks Sergio. I'll take a look tomorrow. Fixing up the data pipeline is key!

@sguada
Contributor Author

sguada commented Jan 6, 2015

@shelhamer once it is reviewed I will remove the previous dataset code and files.

@longjon
Contributor

longjon commented Jan 6, 2015

I'll also plan to take a pass by tomorrow.

}
}

class Generator {
Contributor

What is Generator? My guess is it's supposed to be an iterator-like abstraction (a "Generator" nested class in a "DB" class sounds like it ought to produce DBs, but that doesn't seem to be what's going on here). But, it just passes through everything to its datumdb_, holding no internal state, and it doesn't seem to be subclassed, so, what's its raison d'être?

Contributor Author

Yeah, the Generator is an iterator-like abstraction but simpler; it could be moved out of DatumDB and called DatumGenerator for clarity.
I explored different options, and an iterator-like class needs to have state, which in the end is pretty much a DatumDB. So, instead of creating a new state, a Generator just receives a copy of the DatumDB, which holds the needed state.

Contributor

That feels confusing and underhanded; why not refer instead of copying?

Does this support multiple simultaneous cursors? If so, the interface should reflect that by implementing the cursor methods (Next, Reset, &c) in the cursor or iterator subclass (or with a cursor or iterator as arguments, which amounts to the same thing). If not, there's no need for a nested class.

Contributor Author

The generator is a simplified iterator, more like a stream of Datum, but without the use of it.begin() and it.end(), which I find annoying. I was thinking of calling it a cursor, which seems to make more sense.
Currently it does support multiple cursors. The reason to use a nested class was to simplify the interaction between the cursor and the DB.
The reason for copying instead of referring was to create new cursor-related variables: each Generator needs to have its own cursor, but I didn't want to have to define a different cursor class for each DatumDB.

virtual bool Current(string* key, Datum* datum) = 0;

DatumDBParameter param_;
shared_ptr<bool> is_opened_;
Contributor

I'm baffled by this... why shared_ptr<bool>?

Contributor Author

The reason it is shared is that I want to open the DatumDB just once per source and close it when the last Generator or reference is destroyed.
So all the DatumDB copies passed to Generators share the same opened state.
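
The "close when the last reference goes away" behavior described here can be sketched with a shared handle whose destructor does the closing. Note that this swaps in a different technique from the shared_ptr<bool> flag under review: the shared_ptr owns the handle itself. Handle, SharedStateDB and CloseCountAfterCopies are hypothetical names, and a counter stands in for actually closing a database.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical handle: the destructor plays the role of closing the DB
// and records that it ran by bumping a caller-owned counter.
struct Handle {
  Handle(const std::string& src, int* counter)
      : source(src), close_count(counter) {}
  ~Handle() { ++*close_count; }
  std::string source;
  int* close_count;
};

// Copies of SharedStateDB share one Handle through the shared_ptr, so the
// handle is released exactly once, when the last copy is destroyed.
class SharedStateDB {
 public:
  SharedStateDB(const std::string& source, int* close_count)
      : handle_(std::make_shared<Handle>(source, close_count)) {}
  long Users() const { return handle_.use_count(); }

 private:
  std::shared_ptr<Handle> handle_;
};

// Returns how many times the handle was closed, or -1 if it was closed
// while a copy was still alive.
int CloseCountAfterCopies() {
  int close_count = 0;
  {
    SharedStateDB db("train", &close_count);
    {
      SharedStateDB copy = db;  // e.g. the copy handed to a Generator
      if (copy.Users() != 2) return -1;
    }
    if (close_count != 0) return -1;  // db still holds the handle open
  }
  return close_count;
}
```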

@shelhamer
Member

With respect to naming, how about input_param or source_param instead of datum_db? I'm thinking ahead to once we've deprecated datum and it might be safer to pick a general name.

@longjon
Contributor

longjon commented Jan 7, 2015

Agree we should think about our path away from Datum.

I took a coarse pass, mainly at the interface; it seems serviceable but I'd like to see the Generator thing cleaned up with something more conceptually transparent, then I could take a finer pass (the texture seems alright though).

@sguada
Contributor Author

sguada commented Jan 7, 2015

@longjon thanks for the pass over the interface and organization. I think there are three options to simplify the interface:

  1. Define Cursor as a nested class of DatumDB and pass it a shallow copy of the DatumDB.
  2. Define Cursor as a nested class of DatumDB and pass it a reference to the DatumDB, but this will require creating some specific variables depending on the type of DatumDB.
  3. Define Cursor as a friend class of DatumDB and pass it a reference to the DatumDB, forcing each type of DatumDB to define its own type of Cursor.

Which one do you think will be clearer and easier to maintain?

@shelhamer I'm totally OK with changing the name to something more general, such as source_param.

I was thinking that a BlobDB could be built by combining a DatumDB and a DataTransformation, but other kinds of BlobDB could be defined. That was the reasoning behind using Datum::Generator: Datum::Generator + DatumTransformation = Blob::Generator.

@bhack
Contributor

bhack commented Jan 7, 2015

@sguada What is this? Is this a kind of ORM?

@cypof
Member

cypof commented Jan 7, 2015

Is there a way we could completely replace the Generator by queues, as part of merging with #1629? It would let multiple solvers pick from the same db, and simplify prefetching by simply filling the queue. Also it would allow running the db on its own thread.

@longjon
Contributor

longjon commented Jan 7, 2015

@cypof maybe, although I'd worry about

  • combining too much functionality in the same abstraction, as a cursor seems like a sensible thing on its own, and shouldn't be hard to wrap in a queue to obtain blocking/threading functionality; and
  • I'd rather see this merged sooner than delayed by being drawn into the future,

so I think it might be better to stick to the original vision of a thin wrapper around data sources, and update or wrap the cursor functionality later.

@bhack no, it's meant to be a thin wrapper around data sources to avoid the former grossness of hand-coded switches to interface with different DBs.

@sguada here's what makes sense to me:

  • (abstract) base classes define the interfaces for DB and Cursor; it's probably simplest to do both at top level; DB has a method for obtaining cursors, but no friend relationship is needed
  • each implementation defines its own kind of DB and Cursor; it might very well make sense to have specific Cursors be nested classes of their specific DBs
  • Cursors hold references to their owning DBs if needed (is there a reason that, e.g., for LMDB, you can't just wrap LMDB's own cursor without needing a ref to the DB?)

Indeed, I think each DB should have its own type of cursor, not that this should result in more DB-specific code, it's just that the current implementations of the cursor-like methods (Next &c) should go in a Cursor class rather than a DB class. Think of std::vector and its iterators; iterators don't hold copies of vectors, but rather refer to them as needed, and the logic of traversing the vector is the business of the iterator rather than the vector (although note that iterators are not a class per se, unlike here).
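
A rough sketch of the layout proposed in the bullets above: top-level abstract DB and Cursor interfaces, a concrete DB with its own nested cursor, and traversal logic living in the cursor, which refers to the DB's data rather than copying it. VectorDB, VectorCursor and JoinKeys are hypothetical names, and the method set is a guess for illustration, not the interface that was eventually merged.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Abstract cursor: all traversal methods live here, not in the DB.
class Cursor {
 public:
  virtual ~Cursor() {}
  virtual void SeekToFirst() = 0;
  virtual void Next() = 0;
  virtual bool Valid() const = 0;
  virtual std::string Key() const = 0;
  virtual std::string Value() const = 0;
};

// Abstract DB: its only traversal-related job is handing out cursors.
class DB {
 public:
  virtual ~DB() {}
  virtual std::unique_ptr<Cursor> NewCursor() const = 0;
};

// Hypothetical in-memory backend standing in for LMDB or LevelDB.
class VectorDB : public DB {
 public:
  typedef std::vector<std::pair<std::string, std::string> > Records;
  explicit VectorDB(Records records) : records_(std::move(records)) {}

  // The backend-specific cursor is nested in its backend.
  class VectorCursor : public Cursor {
   public:
    explicit VectorCursor(const Records& records)
        : records_(records), pos_(0) {}
    void SeekToFirst() override { pos_ = 0; }
    void Next() override { ++pos_; }
    bool Valid() const override { return pos_ < records_.size(); }
    std::string Key() const override { return records_[pos_].first; }
    std::string Value() const override { return records_[pos_].second; }

   private:
    const Records& records_;  // refers to the DB's data, never copies it
    std::size_t pos_;
  };

  std::unique_ptr<Cursor> NewCursor() const override {
    return std::unique_ptr<Cursor>(new VectorCursor(records_));
  }

 private:
  Records records_;
};

// Backend-agnostic client code: concatenates all keys in order.
std::string JoinKeys(const DB& db) {
  std::string keys;
  for (std::unique_ptr<Cursor> c = db.NewCursor(); c->Valid(); c->Next()) {
    keys += c->Key();
  }
  return keys;
}
```

As with std::vector iterators, a cursor in this sketch must not outlive the DB it was obtained from.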

Also, how is this supposed to interact with existing data layers? It seems there is an images DB, which I guess is supposed to replace the ImageDataLayer? What about the other data layers; of that functionality, what is meant to be covered here, and what not?

@sguada
Contributor Author

sguada commented Jan 8, 2015

@cypof At some point I thought about including the queue and a thread in the DatumDB, but then I discarded the idea since it would introduce a lot of unrelated things into this. Every new DatumDB would need to be able to provide that.
What about this other idea instead: we could create a BlobDB or BlobGenerator (or a better name) that takes a DatumDB::Cursor, a Datum_Transformation (to convert a Datum into a Blob), and optionally a queue and some processing threads. The idea is that the internal threads get a Datum from the DatumDB, pass it to the Datum_Transformation, and put the result in the queue. Then the Data_Layers could read blobs from that queue. If wanted, a BlobDB could have more than one DatumDB (with different sources).
But this would need to wait for a future PR.
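
The composition described above (cursor plus transformation yields a blob source) could look roughly like the sketch below. It deliberately leaves out the optional queue and processing threads, and every type in it (ToyDatum, ToyBlob, ToyDatumCursor, ToyBlobGenerator, BytesToFloats) is a hypothetical stand-in rather than a Caffe class.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins for Datum and Blob.
struct ToyDatum { std::string encoded; };
struct ToyBlob { std::vector<float> data; };

// Hypothetical cursor over an in-memory list of datums.
class ToyDatumCursor {
 public:
  explicit ToyDatumCursor(std::vector<ToyDatum> data)
      : data_(std::move(data)), pos_(0) {}
  bool Valid() const { return pos_ < data_.size(); }
  const ToyDatum& Current() const { return data_[pos_]; }
  void Next() { ++pos_; }

 private:
  std::vector<ToyDatum> data_;
  std::size_t pos_;
};

// Cursor + transformation = blob generator.
class ToyBlobGenerator {
 public:
  typedef std::function<ToyBlob(const ToyDatum&)> Transform;

  ToyBlobGenerator(ToyDatumCursor cursor, Transform transform)
      : cursor_(std::move(cursor)), transform_(std::move(transform)) {}

  // Writes the next transformed blob; returns false once the cursor is done.
  bool Next(ToyBlob* blob) {
    if (!cursor_.Valid()) return false;
    *blob = transform_(cursor_.Current());
    cursor_.Next();
    return true;
  }

 private:
  ToyDatumCursor cursor_;
  Transform transform_;
};

// Example transformation: scale each byte to the range [0, 1].
ToyBlob BytesToFloats(const ToyDatum& datum) {
  ToyBlob blob;
  for (unsigned char c : datum.encoded) blob.data.push_back(c / 255.0f);
  return blob;
}
```

In a threaded variant, a worker would call Next in a loop and push each blob into a queue that the data layer reads from.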

@longjon Now I like the idea of defining a DatumDB::DB and a DatumDB::Cursor or DatumDB::Iterator more. I think I will model it in a similar fashion to leveldb, which has the same separation between DB and Iterator. What do you think about creating a new namespace datumdb to group all its classes?

@sguada
Contributor Author

sguada commented Jan 8, 2015

@longjon I have refactored the code and interface to reflect what we discussed. Please let me know your opinion before I change the rest.

class DatumDBCursor {
 public:
  explicit DatumDBCursor(const DatumDBParameter& param)
    : param_(param) {}
Contributor

I don't feel too strongly about this, but I think it could be argued that param passing should be left up to subclasses, and subclasses should be explicit about what information needs to be given to cursors. E.g., it seems that LMDB and LevelDB only need the single boolean param_.loop(), but doing it this way obscures that fact and makes one wonder if params might affect cursors in arbitrary ways.
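
To illustrate the suggestion, a cursor that takes only the one flag it needs might look like this sketch. ToyCursor and CountReads are hypothetical, and the wrap-around meaning given to loop here is an assumption about what param_.loop() controls.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical cursor whose constructor names exactly what it depends on:
// the records to walk and a single loop flag, not a whole parameter message.
class ToyCursor {
 public:
  ToyCursor(const std::vector<int>& records, bool loop)
      : records_(records), loop_(loop), pos_(0) {}

  bool Valid() const { return pos_ < records_.size(); }
  int Current() const { return records_[pos_]; }

  // Advances; with loop set, wraps back to the first record at the end.
  void Next() {
    ++pos_;
    if (loop_ && pos_ >= records_.size()) pos_ = 0;
  }

 private:
  const std::vector<int>& records_;
  bool loop_;
  std::size_t pos_;
};

// Counts how many records can be read, giving up after max_reads.
int CountReads(const std::vector<int>& records, bool loop, int max_reads) {
  ToyCursor cursor(records, loop);
  int reads = 0;
  while (cursor.Valid() && reads < max_reads) {
    ++reads;
    cursor.Next();
  }
  return reads;
}
```

A reader of this constructor can see at a glance that nothing else in the parameters affects traversal.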

@longjon
Contributor

longjon commented Jan 9, 2015

@sguada okay, I've taken a coarse pass just over the interface and {Level,LM}DB implementations. The new interface seems way more sensible to me... now it's much easier to tell at a glance what basic usage should look like. Two comments as noted. And re: namespaces, yes, I think it's a fine idea to put all this stuff in a namespace; why not just caffe::db?

@bhack
Contributor

bhack commented Jan 9, 2015

You can also take some ideas from uberDB.

@cypof
Member

cypof commented Jan 21, 2015

@sguada I extracted all the data_layer changes from the parallel work into a new PR, #1773. It should make it easier to merge this one. There is a lock to make sure that only one dataset is created per data file name, which might be redundant with the check you have here.

@shelhamer
Member

@cypof actually #1748 has replaced this PR for simplicity -- sorry for this miscommunication. The highest priority was to simplify the data interface so that's what #1748 does thanks to @longjon.

Closing, although other features of this PR might follow in the future.

5 participants