
Abstraction and unification of data input #1522

Closed

netheril96 wants to merge 7 commits

Conversation

netheril96
Contributor

This is a generalization of the IndirectionLayer in #1414, as discussed with @sguada.

Motivation

There are two problems with the current way data is fed into the network:

  1. The rigidity imposed by Datum. It unnecessarily couples "data" (usually images) with a single integer "label". This inflexibility makes it difficult and unnatural to model multi-label classification, attribute classification, face verification, and many other cases where "data" and "label" do not have a one-to-one correspondence.
  2. The code duplication and multiple responsibilities of the various DataLayers. There are DataLayer, ImageDataLayer, HDF5DataLayer, DummyDataLayer and many others. Each data layer has to handle both reading the data and propagating it through the network. They also repeatedly implement the error-prone task of multithreaded prefetching, despite sharing a common base class.

Core idea

The first idea is to unify data and label under Blob. All "data" and "labels" are just regular blobs; a label is simply a special case (a 1x1x1 integer). The new layers will just output blobs. Whether those blobs should be interpreted as data, labels, or something else is not the layers' concern.

The second idea is to separate data access from the act of propagating the data through the network.

Data access is abstracted by a DataSource class. A data source is modeled as a mapping from an integer index to a floating-point array (int32 -> float[]); a rough sketch of such an interface follows the list below. The implementation may be

  • backed by files, such as CSV, JSON, LMDB, HDF5, or images
  • dynamically generated, such as a counter, a constant stream of zeros (analogous to /dev/zero), or random numbers (analogous to /dev/random)
  • wrappers around other data sources, such as caching and transformation.
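
A minimal sketch of what such an interface might look like (the class and method names here are illustrative, not necessarily the exact API in this PR):

#include <algorithm>

// Hypothetical data source interface: a mapping from an integer index
// to an array of floats.
class DataSource {
 public:
  virtual ~DataSource() {}
  // Number of retrievable records; valid indices are 0 .. size() - 1.
  virtual int size() const = 0;
  // Copy the record at `index` into `out`, writing at most `max_count`
  // values; returns the number of values actually written.
  virtual int retrieve(int index, float* out, int max_count) = 0;
};

// A "pseudo device" source that always yields a constant value,
// analogous to /dev/zero.
class ConstantDataSource : public DataSource {
 public:
  ConstantDataSource(float value, int count) : value_(value), count_(count) {}
  int size() const { return 1; }  // every index maps to the same record
  int retrieve(int index, float* out, int max_count) {
    const int n = std::min(count_, max_count);
    for (int i = 0; i < n; ++i) out[i] = value_;
    return n;
  }
 private:
  float value_;
  int count_;
};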

The propagation is handled by two layers, DataSequenceLayer and DataMappingLayer (the naming is not final yet).

  • DataSequenceLayer outputs data in sequence. It reads from the lowest index to the highest and then cycles back. Because it knows in advance which data to retrieve, it can prefetch in a separate thread; the threading code only needs to be implemented once for all sources.
  • DataMappingLayer outputs data at the indices specified by its bottom blob. This allows filtering, reordering, shuffling, and slicing of the data source. A simplified sketch of its forward pass follows this list.
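
The fragment below is a rough illustration (not the exact code in this PR) of how DataMappingLayer's forward pass could consume the bottom index blob, assuming the DataSource interface sketched above, a data_source_ member, and a fixed per-record size:

// Simplified sketch of the mapping layer's forward pass: for each index
// in the bottom blob, fetch the corresponding record from the data source.
template <typename Dtype>
void DataMappingLayer<Dtype>::Forward_cpu(
    const vector<Blob<Dtype>*>& bottom, const vector<Blob<Dtype>*>& top) {
  const Dtype* indices = bottom[0]->cpu_data();
  Dtype* out = top[0]->mutable_cpu_data();
  const int record_size = top[0]->count() / top[0]->num();  // channels*height*width
  std::vector<float> buffer(record_size);
  for (int i = 0; i < bottom[0]->count(); ++i) {
    const int index = static_cast<int>(indices[i]);
    data_source_->retrieve(index, &buffer[0], record_size);
    for (int j = 0; j < record_size; ++j) {
      out[i * record_size + j] = static_cast<Dtype>(buffer[j]);
    }
  }
}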

Example definition

Here is a definition file for DataMappingLayer:

layer {
  type: DATA_MAPPING
  bottom: "index"
  top: "data"
  name: "mapping"
  data_mapping_param {
    channels: 3
    height: 12
    width: 12
    data_source_param {
      type: CSV
      filename: "mapping.csv"
    }
  }
}

DataSequenceLayer can be declared similarly, except that it has no bottom input and instead takes a batch_size parameter.
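
A sketch of what that could look like, assuming a data_sequence_param message that mirrors data_mapping_param (the exact field names are not fixed by this PR):

layer {
  type: DATA_SEQUENCE
  top: "data"
  name: "sequence"
  data_sequence_param {
    batch_size: 64
    channels: 3
    height: 12
    width: 12
    data_source_param {
      type: CSV
      filename: "sequence.csv"
    }
  }
}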

Example network

Network diagrams (attached as images) illustrate the following cases:

  • Original case (image + label)
  • Multiple label
  • Verification

Shuffling

ImageDataLayer has a special option to constantly shuffle its input. With the new architecture, the same effect can be achieved on any data source by piping random indices into a DataMappingLayer. The result is not exactly the same, though, because random indices allow duplicates.
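
For illustration, assuming the random-integer data source from the to-do list below, shuffling could be wired up roughly like this (the layer names and the RANDOM_INTEGER type are hypothetical):

layer {
  type: DATA_SEQUENCE
  top: "index"
  name: "random_index"
  data_sequence_param {
    batch_size: 64
    data_source_param {
      type: RANDOM_INTEGER  # hypothetical source of random indices
    }
  }
}
layer {
  type: DATA_MAPPING
  bottom: "index"
  top: "data"
  name: "shuffled_data"
  data_mapping_param {
    channels: 3
    height: 12
    width: 12
    data_source_param {
      type: CSV
      filename: "data.csv"
    }
  }
}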

Selection

If there are many data inputs, splitting them into training and test sets while keeping them synchronized is tedious and error-prone. Instead, we can group them in the same data source. A single layer that outputs indices can then select different parts of the data in different scenarios, and synchronization is guaranteed because every DataMappingLayer receives the same index.
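
Concretely, a single index blob can drive several DataMappingLayers so that data and label stay aligned (the filenames and the layer producing "index" are illustrative, not part of this PR):

layer {
  type: DATA_MAPPING
  bottom: "index"
  top: "data"
  name: "data_mapping"
  data_mapping_param {
    channels: 3
    height: 12
    width: 12
    data_source_param {
      type: CSV
      filename: "images.csv"
    }
  }
}
layer {
  type: DATA_MAPPING
  bottom: "index"
  top: "label"
  name: "label_mapping"
  data_mapping_param {
    data_source_param {
      type: CSV
      filename: "labels.csv"
    }
  }
}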

To do

This PR is a minimal functional implementation of the above idea. It only has two data sources: a "pseudo device" that outputs a constant, and one backed by CSV files. It also lacks prefetching, which could be added later.

The following data sources may be implemented later, if this PR is accepted:

  • LMDB
  • LevelDB
  • HDF5
  • protobuf
  • JSON
  • Gray/color images
  • Random integers
  • Normally distributed floats
  • Range slice
  • Window

@sguada
Contributor

sguada commented Dec 3, 2014

@netheril96 one first thought: since the only difference between DataSequence and DataMapping is that DataSequence doesn't take a bottom while DataMapping takes one, let's just have one layer which behaves sequentially if it is not given a bottom, and uses the bottom to retrieve the blobs if it is given one.

Use data_source_param within layer_param to create the needed data_source, instead of passing it to the constructor.

Let's define num, channels, height and width as 1 by default. Or maybe better, let's just define a repeated dim field to specify the dimensions of the data.
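
Something like this, just as a sketch:

data_mapping_param {
  dim: 3    # channels
  dim: 12   # height
  dim: 12   # width
  data_source_param {
    type: CSV
    filename: "mapping.csv"
  }
}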

Also take a look at DummyDataLayer on how to generate data randomly using Fillers.

I think this should become a branch in caffe until it is well developed and tested; that way different people can contribute. @shelhamer @longjon @jeffdonahue

@netheril96
Contributor Author

@sguada

let's just have one layer which behaves sequentially if it is not given a bottom, and uses the bottom to retrieve the blobs if it is given one.

The problem with uniting DataSequenceLayer and DataMappingLayer is that they share almost no common code. If they are merged, Forward_cpu will become a giant if statement, and the code will get even more convoluted when multithreaded prefetching is added: since DataMappingLayer cannot prefetch, yet another set of if statements would be needed to decide whether to launch another thread, lock the mutex, join the thread, and so on.

Use data_source_param within layer_param to create the needed data_source, instead of passing it to the constructor.

I don't understand what you mean by "instead of passing it to the constructor". Perhaps you are referring to the problem of "too many levels of indirection"? The example definition file does have a lot of indentation, and even more if the data source has additional parameters. I may adopt this in a later commit.

Let's define num, channels, height and width as 1 by default. Or maybe better, let's just define a repeated dim field to specify the dimensions of the data.

channels, width and height already default to 1. num is specified by batch_size or inferred from the bottom blobs. Why is repeated dim a better choice? It doesn't carry semantic meaning.

I think this should become a branch in caffe until it is well developed and tested; that way different people can contribute.

Do I need to do something special or is this completely up to the maintainers?

@netheril96
Contributor Author

Also take a look at DummyDataLayer on how to generate data randomly using Fillers.

The relevant bits have now been removed. The specific implementations of other data sources will be the focus of additional pull requests.

@bhack
Contributor

bhack commented Dec 21, 2014

@shelhamer If we sum up #1414, #523 and #149, we can prepare a birthday party for multilabel. I really hope that @netheril96 is still available to finish this, but with the holidays and probably a month of waiting for feedback, there is a risk that the contributor gets busy and cannot put more work into further feedback.

@sguada
Contributor

sguada commented Dec 23, 2014

@netheril96 Now I don't think it is such a good idea to merge the DataLayers with this PR. Take a look at #1568 for how to organize the code and build layers from params.

@netheril96
Contributor Author

@sguada No one is gonna review or merge it anyway.

@bhack
Contributor

bhack commented Dec 23, 2014

@netheril96 What do you mean?

@netheril96
Contributor Author

@bhack Contributing to caffe is like striking at the wind.

@bhack
Contributor

bhack commented Dec 23, 2014

@netheril96 Please help us to propose an improvement of the process at #1623

@netheril96 netheril96 closed this Jan 9, 2015
@bhack
Contributor

bhack commented Jan 9, 2015

@netheril96 Why have you closed this?

@netheril96 netheril96 reopened this Jan 9, 2015
@netheril96
Contributor Author

@bhack Well, I thought it would no longer garner any attention.

@bhack
Contributor

bhack commented Jan 9, 2015

We have been using indirection since the beginning of this PR. This was proposed as a generalization, so we are still interested in it. @shelhamer @longjon @sguada Ping.

@bhack bhack mentioned this pull request Mar 5, 2015
@shelhamer shelhamer added the JD label Mar 10, 2015
@netheril96 netheril96 closed this Jun 17, 2015