
Abstraction and unification of data input #1522

Closed

netheril96 wants to merge 7 commits

Conversation

netheril96
Contributor

This is a generalization of the IndirectionLayer in #1414, as discussed with @sguada.

Motivation

There are two problems with the current way data is fed into the network:

  1. The rigidity imposed by Datum. It unnecessarily couples "data" (usually images) with a single integer "label". This inflexibility makes it difficult and unnatural to model multi-label classification, attribute classification, face verification, and many other cases where "data" and "label" do not have a one-to-one correspondence.
  2. The code duplication and multiple responsibilities of the various DataLayers. There are DataLayer, ImageDataLayer, HDF5DataLayer, DummyDataLayer and many others. Each data layer has to handle both reading the data and propagating it through the network. They also repeatedly implement the error-prone task of multithreaded prefetching, despite sharing a common base class.

Core idea

The first idea is to unify data and label under Blob. All "data" and "labels" are just regular blobs; a label is simply a special case (a 1x1x1 integer). The new layers will just output blobs. Whether those blobs should be interpreted as data, labels, or something else is not the layers' concern.

The second idea is to separate data access from the act of propagating the data through the network.

Data access is abstracted by a DataSource class. A data source is modeled as a mapping from an integer index to a floating-point array (int32 -> float[]); a rough sketch of such an interface follows the list below. The implementation may be

  • backed by files, such as CSV, JSON, LMDB, HDF5, or images
  • dynamically generated, such as a counter, a constant stream of zeros (analogous to /dev/zero), or random numbers (analogous to /dev/random)
  • wrappers around other data sources, such as caching and transformation.
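
A minimal sketch of what such an interface might look like (the class and method names here are illustrative, not necessarily the exact API in this PR):

#include <algorithm>

// Hypothetical data source interface: a mapping from an integer index
// to an array of floats.
class DataSource {
 public:
  virtual ~DataSource() {}
  // Number of retrievable records; valid indices are 0 .. size() - 1.
  virtual int size() const = 0;
  // Copy the record at `index` into `out`, writing at most `max_count`
  // values; returns the number of values actually written.
  virtual int retrieve(int index, float* out, int max_count) = 0;
};

// A "pseudo device" source that always yields a constant value,
// analogous to /dev/zero.
class ConstantDataSource : public DataSource {
 public:
  ConstantDataSource(float value, int count) : value_(value), count_(count) {}
  int size() const { return 1; }  // every index maps to the same record
  int retrieve(int index, float* out, int max_count) {
    const int n = std::min(count_, max_count);
    for (int i = 0; i < n; ++i) out[i] = value_;
    return n;
  }
 private:
  float value_;
  int count_;
};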

The propagation is handled by two layers, DataSequenceLayer and DataMappingLayer (the naming is not final yet).

  • DataSequenceLayer outputs data in sequence. It reads from the lowest index to the highest and then cycles back. Because it knows in advance which data to retrieve, it can prefetch in a separate thread; the threading code only needs to be implemented once for all sources.
  • DataMappingLayer outputs data at the indices specified by its bottom blob. This allows filtering, reordering, shuffling, and slicing of the data source. A simplified sketch of its forward pass follows this list.
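
The fragment below is a rough illustration (not the exact code in this PR) of how DataMappingLayer's forward pass could consume the bottom index blob, assuming the DataSource interface sketched above, a data_source_ member, and a fixed per-record size:

// Simplified sketch of the mapping layer's forward pass: for each index
// in the bottom blob, fetch the corresponding record from the data source.
template <typename Dtype>
void DataMappingLayer<Dtype>::Forward_cpu(
    const vector<Blob<Dtype>*>& bottom, const vector<Blob<Dtype>*>& top) {
  const Dtype* indices = bottom[0]->cpu_data();
  Dtype* out = top[0]->mutable_cpu_data();
  const int record_size = top[0]->count() / top[0]->num();  // channels*height*width
  std::vector<float> buffer(record_size);
  for (int i = 0; i < bottom[0]->count(); ++i) {
    const int index = static_cast<int>(indices[i]);
    data_source_->retrieve(index, &buffer[0], record_size);
    for (int j = 0; j < record_size; ++j) {
      out[i * record_size + j] = static_cast<Dtype>(buffer[j]);
    }
  }
}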

Example definition

Here is a definition file for DataMappingLayer:

layer {
  type: DATA_MAPPING
  bottom: "index"
  top: "data"
  name: "mapping"
  data_mapping_param {
    channels: 3
    height: 12
    width: 12
    data_source_param {
      type: CSV
      filename: "mapping.csv"
    }
  }
}

DataSequenceLayer can be declared similarly, except that it has no bottom input and instead takes a batch_size parameter.
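
A sketch of what that could look like, assuming a data_sequence_param message that mirrors data_mapping_param (the exact field names are not fixed by this PR):

layer {
  type: DATA_SEQUENCE
  top: "data"
  name: "sequence"
  data_sequence_param {
    batch_size: 64
    channels: 3
    height: 12
    width: 12
    data_source_param {
      type: CSV
      filename: "sequence.csv"
    }
  }
}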

Example network

Network diagrams (attached as images) illustrate the following cases:

  • Original case (image + label)
  • Multiple label
  • Verification

Shuffling

ImageDataLayer has a special option to constantly shuffle its input. With the new architecture, the same effect can be achieved on any data source by piping random indices into a DataMappingLayer. The result is not exactly the same, though, because random indices allow duplicates.
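
For illustration, assuming the random-integer data source from the to-do list below, shuffling could be wired up roughly like this (the layer names and the RANDOM_INTEGER type are hypothetical):

layer {
  type: DATA_SEQUENCE
  top: "index"
  name: "random_index"
  data_sequence_param {
    batch_size: 64
    data_source_param {
      type: RANDOM_INTEGER  # hypothetical source of random indices
    }
  }
}
layer {
  type: DATA_MAPPING
  bottom: "index"
  top: "data"
  name: "shuffled_data"
  data_mapping_param {
    channels: 3
    height: 12
    width: 12
    data_source_param {
      type: CSV
      filename: "data.csv"
    }
  }
}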

Selection

If there are many data inputs, splitting them into training and test sets while keeping them synchronized is tedious and error-prone. Instead, we can group them in the same data source. A single layer that outputs indices can then select different parts of the data in different scenarios, and synchronization is guaranteed because every DataMappingLayer receives the same index.
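
Concretely, a single index blob can drive several DataMappingLayers so that data and label stay aligned (the filenames and the layer producing "index" are illustrative, not part of this PR):

layer {
  type: DATA_MAPPING
  bottom: "index"
  top: "data"
  name: "data_mapping"
  data_mapping_param {
    channels: 3
    height: 12
    width: 12
    data_source_param {
      type: CSV
      filename: "images.csv"
    }
  }
}
layer {
  type: DATA_MAPPING
  bottom: "index"
  top: "label"
  name: "label_mapping"
  data_mapping_param {
    data_source_param {
      type: CSV
      filename: "labels.csv"
    }
  }
}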

To do

This PR is a minimal functional implementation of the above idea. It only has two data sources: a "pseudo device" that outputs a constant, and one backed by CSV files. It also lacks prefetching, which could be added later.

The following data sources may be implemented later, if this PR is accepted:

  • LMDB
  • LevelDB
  • HDF5
  • protobuf
  • JSON
  • Gray/color images
  • Random integers
  • Normally distributed floats
  • Range slice
  • Window

@sguada
Contributor

sguada commented Dec 3, 2014

@netheril96 one first thought: since the only difference between DataSequence and DataMapping is that DataSequence doesn't take a bottom while DataMapping takes one, let's just have one layer which behaves sequentially if it is not given a bottom, and uses the bottom to retrieve the blobs if it is given one.

Use data_source_param within layer_param to create the needed data_source, instead of passing it to the constructor.

Let's define num, channels, height and width as 1 by default. Or maybe better, let's just define a repeated dim field to specify the dimensions of the data.
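
Something like this, just as a sketch:

data_mapping_param {
  dim: 3    # channels
  dim: 12   # height
  dim: 12   # width
  data_source_param {
    type: CSV
    filename: "mapping.csv"
  }
}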

Also take a look at DummyDataLayer on how to generate data randomly using Fillers.

I think this should become a branch in caffe until it is well developed and tested; that way different people can contribute. @shelhamer @longjon @jeffdonahue

@netheril96
Contributor Author

@sguada

let's just have one layer which behaves sequentially if it is not given a bottom, and uses the bottom to retrieve the blobs if it is given one.

The problem with uniting DataSequenceLayer and DataMappingLayer is that they share almost no common code. If they are merged, Forward_cpu will become a giant if statement, and the code will get even more convoluted when multithreaded prefetching is added: since DataMappingLayer cannot prefetch, yet another set of if statements would be needed to decide whether to launch another thread, lock the mutex, join the thread, and so on.

Use data_source_param within layer_param to create the needed data_source, instead of passing it to the constructor.

I don't understand what you mean by "instead of passing it to the constructor". Perhaps you are referring to the problem of "too many levels of indirection"? The example definition file does have a lot of indentation, and even more if the data source has additional parameters. I may adopt this in a later commit.

Let's define num, channels, height and width as 1 by default. Or maybe better, let's just define a repeated dim field to specify the dimensions of the data.

channels, width and height already default to 1. num is specified by batch_size or inferred from the bottom blobs. Why is repeated dim a better choice? It doesn't carry semantic meaning.

I think this should become a branch in caffe until it is well developed and tested; that way different people can contribute.

Do I need to do something special or is this completely up to the maintainers?

@netheril96
Contributor Author

Also take a look at DummyDataLayer on how to generate data randomly using Fillers.

The relevant bits have now been removed. The specific implementations of other data sources will be the focus of additional pull requests.

@bhack
Contributor

bhack commented Dec 21, 2014

@shelhamer If we sum up #1414, #523 and #149, we can prepare a birthday party for multilabel. I really hope that @netheril96 is still available to finish this, but with the holidays and probably a month of waiting for feedback, there is a risk that the contributor gets busy and cannot put more work into further feedback.

@sguada
Contributor

sguada commented Dec 23, 2014

@netheril96 Now I don't think it is such a good idea to merge the DataLayers with this PR. Take a look at #1568 for how to organize the code and build layers from params.

@netheril96
Contributor Author

@sguada No one is gonna review or merge it anyway.

@bhack
Contributor

bhack commented Dec 23, 2014

@netheril96 What do you mean?

@netheril96
Contributor Author

@bhack Contributing to caffe is like striking at the wind.

@bhack
Contributor

bhack commented Dec 23, 2014

@netheril96 Please help us to propose an improvement of the process at #1623

@netheril96 netheril96 closed this Jan 9, 2015
@bhack
Contributor

bhack commented Jan 9, 2015

@netheril96 Why have you closed this?

@netheril96 netheril96 reopened this Jan 9, 2015
@netheril96
Contributor Author

@bhack Well, I thought it would no longer garner any attention.

@bhack
Contributor

bhack commented Jan 9, 2015

We have been using indirection since the beginning of this PR. This was proposed as a generalization, so we are still interested in it. @shelhamer @longjon @sguada Ping.

@bhack bhack mentioned this pull request Mar 5, 2015
@shelhamer shelhamer added the JD label Mar 10, 2015
@netheril96 netheril96 closed this Jun 17, 2015