-
Notifications
You must be signed in to change notification settings - Fork 18.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Datum db #1568
Datum db #1568
Conversation
@shelhamer @longjon let me know what do you think? |
Coordinating with #1629
|
@bhack yes the idea is very similar, but instead of adding another dependency I implemented here |
@bhack the TBB looks good, but I not sure if the overhead of adding another dependency will be worthy, but could be considered later. |
You can evaluate also boost::lockfree. We already have boost dependency. |
@bhack thanks, but in this case I prefer blocking queue, since I'm planning to have thread filling the queue. With a lockfree queue one need to check if the element could be pushed or poped, and then do active wait. |
…ract_keys tool Adapted convert_imageset to DatumDB. Added extract_keys tool Adapted data_layer to DatumDB
Make lint happy
@cypof I have redo the DatumDB, now it Datum_DB_factory keeps track of the opened DBs and allows only one per source. @shelhamer I thinks this should fix all the problems related to leveldb and lmdb. |
Ok, thanks Sergio. I'll take a look tomorrow. Fixing up the data pipeline
|
@shelhamer once it is reviewed I will remove the previous dataset code and files. |
I'll also plan to take a pass by tomorrow. |
} | ||
} | ||
|
||
class Generator { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is Generator
? My guess is it's supposed to be an iterator-like abstraction (a "Generator" nested class in a "DB" class sounds like it ought to produce DBs, but that doesn't seem to be what's going on here). But, it just passes through everything to its datumdb_
, holding no internal state, and it doesn't seem to be subclassed, so, what's its raison d'être?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah the Generator
is iterator-like abstraction but simpler, it could be out DatumDB
and being called DatumGenerator
for clarity.
I explored different options, and the iterator-like need to have an state, which at the end is pretty much a DatumDB
so, instead of creating a new state, a Generator
just receives a copy of DatumDB
which holds the needed state.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That feels confusing and underhanded; why not refer instead of copying?
Does this support multiple simultaneous cursors? If so, the interface should reflect that by implementing the cursor methods (Next
, Reset
, &c) in the cursor or iterator subclass (or with a cursor or iterator as arguments, which amounts to the same thing). If not, there's no need for a nested class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The generator is a simplified iterator more like a stream of Datum, but without use of it.begin() and it.end() which I find annoying. I was thinking in calling cursor, which seems it will make more sense.
Currently it does support multiple cursors. The reason to use a nested class was to simplify the iteration between the cursor and the DB.
The reason for copying instead of refer was to create new cursor related vars, each cursor need to have its own cursor, but I didn't want to have to define a different cursor for each datumDB.
virtual bool Current(string* key, Datum* datum) = 0; | ||
|
||
DatumDBParameter param_; | ||
shared_ptr<bool> is_opened_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm baffled by this... why shared_ptr<bool>
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason is shared is because I just want to open the DatumDB
once per source and close it when the last Generator
or reference is destroyed.
So all the DatumDB
passed to Generators
will share the same opened state.
With respect to naming, how about |
Agree we should think about our path away from I took a coarse pass, mainly at the interface; it seems serviceable but I'd like to see the |
@longjon thanks for the pass over the interface and organization. I think there are three options to simplify the interface:
Which one do you think will be clearer and easier to maintain? @shelhamer I'm totally ok with changing the name to something more general as source_param. I was thinking that on BlobDB could be built by combining a DatumDB and a DataTransformation, but other kinds of BlobDB could be defined. That reasoning behind of using Datum::Generator, Datum::Generator + DatumTransformation = Blob::Generator |
@sguada What is this? Is this a kind of ORM? |
Is there a way we could completely replace the Generator by queues, as part of merging with #1629? It would let multiple solvers pick from the same db, and simplify prefetching by simply filling the queue. Also it would allow running the db on it's own thread. |
@cypof maybe, although I'd worry about
so I think it might be better to stick to the original vision of a thin wrapper around data sources, and update or wrap the cursor functionality later. @bhack no, it's meant to be a thin wrapper around data sources to avoid the former grossness of hand-coded switches to interface with different DBs. @sguada here's what makes sense to me:
Indeed, I think each DB should have its own type of cursor, not that this should result in more DB-specific code, it's just that the current implementations of the cursor-like methods ( Also, how is this supposed to interact with existing data layers? It seems there is an images DB, which I guess is supposed to replace the |
@cypof At some point I thought about including the queue and a thread into the DatumDB but then I discarded since it will introduce a lot of unrelated things into this. Every new DatumDB should be able to provide that. @longjon Now I like more the idea of defining a DatumDB::DB and DatumDB::Cursor or DatumDB::Iterator. I think I will model it in a similar fashion as leveldb, which have the same separation between DB and Iterator. What do you think about creating a new namespace |
@longjon I have refactor the code and interface to reflect what we discussed. Please let me know your opinion before changing the rest. |
class DatumDBCursor { | ||
public: | ||
explicit DatumDBCursor(const DatumDBParameter& param) | ||
: param_(param) {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't feel too strongly about this, but I think it could be argued that param passing should be left up to subclasses, and subclasses should be explicit about what information needs to be given to cursors. E.g., it seems that LMDB and LevelDB only need the single boolean param_.loop()
, but doing it this way obscures that fact and makes one wonder if params might affect cursors in arbitrary ways.
@sguada okay, I've taken a coarse pass just over the interface and {Level,LM}DB implementations. The new interface seems way more sensible to me... now it's much easier to tell at a glance what basic usage should look like. Two comments as noted. And re: namespaces, yes, I think it's a fine idea to put all this stuff in a namespace; why not just |
You can also take some idea from uberDB |
This PR is intended to replace #1238 and fix some of the problems introduced by it. Now there is only one internal cursor, this avoid errors like MDB_MAX_READERS and higher memory consumption.
Introduces a new
DatumDBParameter
to be used by the Data_Layers i.e.For lower memory consumption use
LEVELDB
instead ofLMDB
.All the code related to
dataset
needs to be removed, but before doing that I want to get feedback about this PR.