Abstraction and unification of data input #1522
Conversation
* `DataSource`
* `DataMappingLayer`
* `DataSequenceLayer`
@netheril96 One first thought: since the difference between `DataSequenceLayer` and `DataMappingLayer` is that `DataSequenceLayer` doesn't take any bottom while `DataMappingLayer` takes one, let's just have one layer which, if it is not given a bottom, behaves sequentially, and if it is given a bottom, uses it to retrieve the blobs. Let's define num, channels, height and width as 1 by default. Or maybe better, let's just define a `repeated` field. Also take a look at `DummyDataLayer` for how to generate data randomly using fillers. I think this should become a branch in caffe until it is well developed and tested, so that different people can contribute. @shelhamer @longjon @jeffdonahue
The problem of uniting the two layers is that …
I don't understand what you mean by "instead of passing it to the constructor". Perhaps you are referring to the problem of "too many levels of indirection"? The example definition file does have a lot of indentation, more if the …
Channels, width, height are already 1 by default.
Do I need to do something special or is this completely up to the maintainers?
The relevant bits have now been removed. The specific implementation of other data sources shall be the focus of additional pull requests.
@shelhamer If we sum up #1414, #523 and #149 we can prepare a birthday party for multilabel. I really hope that @netheril96 is still available to finish this, but with the holidays and probably a month of waiting for feedback, there is the risk that a contributor gets busy and cannot put more work into further feedback.
@netheril96 Now I don't think it is such a good idea to merge the DataLayers with this PR. Take a look at #1568 for how to organize the code and build layers from params.
@sguada No one is gonna review or merge it anyway. |
@netheril96 What do you mean? |
@bhack Contributing to caffe is like striking at the wind. |
@netheril96 Please help us propose an improvement of the process at #1623.
@netheril96 Why have you closed this?
@bhack Well, I thought it would no longer garner any attention. |
We have been using indirection since the beginning of the PR. This was proposed as a generalization, so we are still interested in it. @shelhamer @longjon @sguada Ping.
This is a generalization of `IndirectionLayer` in #1414, as discussed with @sguada.

Motivation
There are two problems with the current way of data input into the network:

* `Datum`. It unnecessarily couples "data" (which are usually images) with a single integer "label". The inflexibility of `Datum` makes it difficult and unnatural to model multi-label classification, attribute classification, face verification and many other cases where "data" and "label" do not have a one-to-one correspondence.
* `DataLayer`s. There are `DataLayer`, `ImageDataLayer`, `HDF5DataLayer`, `DummyDataLayer` and many other ones. Each data layer has to handle both the task of reading the data and that of propagating it through the network. They also repeatedly handle the error-prone task of multithreaded prefetching despite sharing a common base class.

Core idea
The first idea is to unify data and label under `Blob`. All "data" and "label" are just regular blobs, where the latter is merely a special case (a 1x1x1 integer). The new layers will simply output blobs; whether those should be interpreted as data, label or something else is not these layers' concern.

The second idea is to separate data access from the act of propagating the data.
Data access is abstracted by the `DataSource` class. A data source is modeled as a mapping from an integer to a floating point array (`int32 -> float[]`). The implementation may be backed by files, or be a pseudo device that outputs constants (analogous to `/dev/zero`) or random numbers (analogous to `/dev/random`).
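As a rough illustration, the mapping could be expressed as an abstract class like the following; all names and signatures here are assumptions for illustration, not code from the actual patch:

```cpp
// Hypothetical sketch of the DataSource abstraction; class and method
// names are assumptions, not taken from the actual patch.
#include <vector>

namespace caffe {

class DataSource {
 public:
  virtual ~DataSource() {}
  // A source is a mapping int32 -> float[]: copy the record stored at
  // `index` into `out`. Lookup by arbitrary index is what lets a
  // mapping layer filter, reorder and shuffle.
  virtual void Retrieve(int index, std::vector<float>* out) = 0;
  // Number of records, so a sequential reader knows when to cycle back.
  virtual int size() const = 0;
};

// A pseudo device analogous to /dev/zero: every index yields the same
// constant-filled array.
class ConstantDataSource : public DataSource {
 public:
  ConstantDataSource(float value, int length)
      : value_(value), length_(length) {}
  virtual void Retrieve(int /*index*/, std::vector<float>* out) {
    out->assign(length_, value_);
  }
  // Conceptually infinite; report one record since all indices coincide.
  virtual int size() const { return 1; }
 private:
  float value_;
  int length_;
};

}  // namespace caffe
```

The point of the abstraction is that the sequence and mapping layers can be written once against this interface, regardless of whether the bytes come from a file or a pseudo device.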
The propagation is handled by two layers, `DataSequenceLayer` and `DataMappingLayer` (the naming is not fixed yet).

`DataSequenceLayer` outputs data in sequence: it reads from the lowest index to the highest and then cycles back. Because it knows in advance what data to retrieve, it may prefetch in a different thread. The threading code only needs to be implemented once for all sources.

`DataMappingLayer` outputs data at the indices specified by the bottom blob. This allows filtering, reordering, shuffling and slicing of the data source.

Example definition
Here is a definition file for `DataMappingLayer`; `DataSequenceLayer` can be similarly declared, except that it does not have a bottom input but a `batch_size` parameter.
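The sketch below stands in for that file. The type string "DataMapping" and the `data_source_param` message (with its `type`, `source`, `channels`, `height` and `width` fields) are assumptions for illustration, not the actual proto schema of this patch:

```
layer {
  name: "images"
  type: "DataMapping"        # hypothetical type string
  bottom: "indices"          # integer indices into the data source
  top: "images"
  data_source_param {        # hypothetical parameter message
    type: CSV                # one of the two sources in this PR: CONSTANT or CSV
    source: "train_images.csv"
    channels: 3
    height: 32
    width: 32
  }
}
```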
Example network
Original case (image + label)
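A hypothetical reconstruction of the classic (image, label) pairing under the same assumed schema: two sequence layers read from two sources whose rows correspond index by index, and the label is just a 1x1x1 blob. The `data_sequence_param` message is likewise an assumption:

```
layer {
  name: "data"
  type: "DataSequence"
  top: "data"
  data_source_param { type: CSV source: "train_images.csv" channels: 3 height: 32 width: 32 }
  data_sequence_param { batch_size: 64 }   # hypothetical: replaces the bottom input
}
layer {
  name: "label"
  type: "DataSequence"
  top: "label"
  # A label is just a 1x1x1 blob read from its own source.
  data_source_param { type: CSV source: "train_labels.csv" }
  data_sequence_param { batch_size: 64 }
}
```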
Multiple label
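Under the same assumptions, extra labels are simply extra sources; nothing couples a datum to a single integer anymore. For example, a vector of 40 attributes per image could be read alongside the image and the class label:

```
layer {
  name: "attributes"
  type: "DataSequence"
  top: "attributes"
  # e.g. 40 binary attributes per image, one CSV row per index
  data_source_param { type: CSV source: "train_attributes.csv" channels: 40 }
  data_sequence_param { batch_size: 64 }
}
```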
Verification
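A hedged sketch of a verification setup under the same assumed schema: two index blobs drive two `DataMappingLayer`s over the same image source, while a separate source supplies the same/different label. This assumes `indices_a` and `indices_b` are themselves produced in sequence from paired index lists, so that row i of those lists and of `pair_labels.csv` describe the same pair:

```
layer {
  name: "image_a"
  type: "DataMapping"
  bottom: "indices_a"        # produced elsewhere in the net
  top: "image_a"
  data_source_param { type: CSV source: "faces.csv" channels: 3 height: 64 width: 64 }
}
layer {
  name: "image_b"
  type: "DataMapping"
  bottom: "indices_b"        # same source, independent indices
  top: "image_b"
  data_source_param { type: CSV source: "faces.csv" channels: 3 height: 64 width: 64 }
}
layer {
  name: "same"
  type: "DataSequence"
  top: "same"                # 1x1x1 blob: 1 if the pair is the same identity
  data_source_param { type: CSV source: "pair_labels.csv" }
  data_sequence_param { batch_size: 64 }
}
```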
Shuffling
`ImageDataLayer` has a special option to constantly shuffle its input. With the new architecture, the same effect can be achieved on any data source by piping random indices into `DataMappingLayer`, as sketched below. The result is not the same, though, because this allows duplication.
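One hedged way to wire this with the pieces described above: a sequence layer reading from the random pseudo-device supplies indices for the mapping layer. The `RANDOM` source type is an assumption extrapolated from the `/dev/random` analogy; it is not part of this minimal PR, which only ships constant and CSV sources:

```
layer {
  name: "shuffle"
  type: "DataSequence"
  top: "indices"
  # assumed to yield integer indices in [0, num_records)
  data_source_param { type: RANDOM }
  data_sequence_param { batch_size: 64 }
}
layer {
  name: "data"
  type: "DataMapping"
  bottom: "indices"          # random indices => shuffled, with possible duplicates
  top: "data"
  data_source_param { type: CSV source: "train_images.csv" channels: 3 height: 32 width: 32 }
}
```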
Selection
If there are many data inputs, splitting them into training and testing sets while maintaining synchronization is tedious and error-prone. Instead, we can group them in the same data source. A single layer that outputs indices can then select different parts of them in different scenarios, as sketched below. Synchronization is ensured by the fact that each `DataMappingLayer` receives the same indices.
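A hedged sketch of this pattern: a single index-producing layer fans out to every mapped source, so example i always arrives with label i no matter how the index set is chosen (here, a hypothetical CSV listing the training subset):

```
layer {
  name: "selector"
  type: "DataSequence"
  top: "indices"
  # hypothetical: an index source that enumerates the training subset
  data_source_param { type: CSV source: "train_indices.csv" }
  data_sequence_param { batch_size: 64 }
}
layer {
  name: "data"
  type: "DataMapping"
  bottom: "indices"
  top: "data"
  data_source_param { type: CSV source: "all_images.csv" channels: 3 height: 32 width: 32 }
}
layer {
  name: "label"
  type: "DataMapping"
  bottom: "indices"          # same indices => synchronized with "data"
  top: "label"
  data_source_param { type: CSV source: "all_labels.csv" }
}
```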
To do
This PR is a minimal functional implementation of the above idea. It only has two data sources: one "pseudo device" that outputs constants, and another backed by CSV files. It also lacks prefetching, which could be added later.
The following data sources may be implemented if and after this PR is accepted: …