Fundamental Data Reading in C++ #8009

Merged on Feb 7, 2018 (14 commits)

Collaborator

@JiayiFeng JiayiFeng commented Jan 31, 2018

Until now, Paddle Fluid's data feeding has depended entirely on Python code. To remove the dependency on the Python environment and achieve the goal of "wrapping the whole training process in a while loop op", a C++ data feeding mechanism is required.

This PR implements a basic C++ data feeding process: data reading (simulated here by a random data generator), shuffling, and batching. It is intended as the foundation for further development of #7646.

The following concepts are introduced in this PR:

ReaderBase(code)

ReaderBase is the abstract base class of all Readers.

FileReader(code) and DecoratedReader(code)

These two classes derive from ReaderBase and serve as the base classes of the concrete readers. In other words, the design has two kinds of readers: file readers and decorated readers.

A file reader reads from a file of a specific format and yields only one instance of data at a time, e.g. a RecordIO reader or a jpg reader.

A decorated reader takes another reader (either a file reader or a decorated reader) as its 'underlying reader'. It gets data from the underlying reader, processes it (for example, by shuffling or batching), and then yields the processed data. The output of a decorated reader can be a single instance or a batch. ShuffleReader(code) and BatchReader(code) are both decorated readers.

All readers share exactly the same interface, defined in ReaderBase, so a reader can be decorated more than once: we can shuffle a reader's outputs and then batch the shuffled outputs. The consistent interface also allows related ops to use readers without knowing their concrete types.
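The layering described above can be sketched as follows. This is a minimal illustration, not the code in this PR: `Tensor` is a plain `std::vector<int>` standing in for LoDTensor, `CountingReader` and `MakePipeline` are hypothetical names, and ownership via `std::unique_ptr` and the fixed shuffle seed are assumptions made to keep the sketch self-contained.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <memory>
#include <random>
#include <utility>
#include <vector>

using Tensor = std::vector<int>;       // stand-in for LoDTensor (assumption)
using Instance = std::vector<Tensor>;  // one Tensor per field

class ReaderBase {  // interface shared by every reader
 public:
  virtual ~ReaderBase() = default;
  virtual void ReadNext(Instance* out) = 0;
  virtual bool HasNext() const = 0;
};

class FileReader : public ReaderBase {};
class DecoratedReader : public ReaderBase {
 public:
  explicit DecoratedReader(std::unique_ptr<ReaderBase> reader)
      : reader_(std::move(reader)) {}
 protected:
  std::unique_ptr<ReaderBase> reader_;  // the underlying reader
};

// Hypothetical file reader: instance i has two fields, {i} and {10 * i}.
class CountingReader : public FileReader {
 public:
  explicit CountingReader(int total) : total_(total) {}
  void ReadNext(Instance* out) override {
    *out = {Tensor{next_}, Tensor{10 * next_}};
    ++next_;
  }
  bool HasNext() const override { return next_ < total_; }
 private:
  int next_ = 0;
  int total_;
};

// Buffers up to buffer_size instances, shuffles them, yields one at a time.
class ShuffleReader : public DecoratedReader {
 public:
  ShuffleReader(std::unique_ptr<ReaderBase> reader, std::size_t buffer_size)
      : DecoratedReader(std::move(reader)), buffer_size_(buffer_size) {}
  void ReadNext(Instance* out) override {
    if (pos_ == buffer_.size()) Refill();
    assert(pos_ < buffer_.size());  // callers must check HasNext() first
    *out = std::move(buffer_[pos_++]);
  }
  bool HasNext() const override {
    return pos_ < buffer_.size() || reader_->HasNext();
  }
 private:
  void Refill() {
    buffer_.clear();
    pos_ = 0;
    while (buffer_.size() < buffer_size_ && reader_->HasNext()) {
      buffer_.emplace_back();
      reader_->ReadNext(&buffer_.back());
    }
    std::shuffle(buffer_.begin(), buffer_.end(), rng_);
  }
  std::vector<Instance> buffer_;
  std::size_t pos_ = 0;
  std::size_t buffer_size_;
  std::mt19937 rng_{0u};  // fixed seed keeps the sketch reproducible
};

// Concatenates up to batch_size instances field by field.
class BatchReader : public DecoratedReader {
 public:
  BatchReader(std::unique_ptr<ReaderBase> reader, std::size_t batch_size)
      : DecoratedReader(std::move(reader)), batch_size_(batch_size) {}
  void ReadNext(Instance* out) override {
    out->clear();
    Instance ins;
    for (std::size_t i = 0; i < batch_size_ && reader_->HasNext(); ++i) {
      reader_->ReadNext(&ins);
      if (i == 0) out->resize(ins.size());
      assert(ins.size() == out->size());  // all instances share the same fields
      for (std::size_t f = 0; f < ins.size(); ++f)
        (*out)[f].insert((*out)[f].end(), ins[f].begin(), ins[f].end());
    }
  }
  bool HasNext() const override { return reader_->HasNext(); }
 private:
  std::size_t batch_size_;
};

// Decorate twice: file reader -> shuffle -> batch.
std::unique_ptr<ReaderBase> MakePipeline(int total, std::size_t buffer_size,
                                         std::size_t batch_size) {
  std::unique_ptr<ReaderBase> r(new CountingReader(total));
  r = std::unique_ptr<ReaderBase>(new ShuffleReader(std::move(r), buffer_size));
  return std::unique_ptr<ReaderBase>(new BatchReader(std::move(r), batch_size));
}
```

With `MakePipeline(5, 4, 2)`, a caller that loops on `HasNext()` and `ReadNext()` receives three batches holding 2, 2, and 1 instances. The caller only ever sees a `ReaderBase`, which is the point of the shared interface.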

ReaderHolder (code)

Different readers have different class types. This raises a problem: how can we store them in Variables and fetch them back in a uniform way? For example, if a Variable holds a BatchReader, we cannot get it with the following code:

var->Get<ReaderBase>("batch_reader");

Instead, we have to write:

var->Get<BatchReader>("batch_reader");

This means that every time we get a reader from a Variable, we must know the reader's exact type, which is nearly impossible in practice.

To solve this problem, we introduce ReaderHolder as a wrapper. It acts as an empty decorator of ReaderBase that erases the reader's type. With ReaderHolder, we can fetch any type of reader with var->Get<ReaderHolder>("...") and treat the obtained object as a reader.

To create and invoke readers, some new ops are introduced:
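A rough sketch of the type-erasure idea follows. The `ReaderHolder` members mirror the ones visible in this PR's review snippets (`Get`, `ReadNext`, `HasNext`, plus a `unique_ptr`-style `Reset`); everything else is assumed for illustration: `ToyVariable` is a hypothetical container that only returns its content for the exact stored type, `TryGet` and `CountRemaining` are made-up helpers, and `Tensor` stands in for LoDTensor.

```cpp
#include <cassert>
#include <memory>
#include <typeindex>
#include <typeinfo>
#include <vector>

using Tensor = std::vector<int>;  // stand-in for LoDTensor (assumption)

class ReaderBase {
 public:
  virtual ~ReaderBase() = default;
  virtual void ReadNext(std::vector<Tensor>* out) = 0;
  virtual bool HasNext() const = 0;
};

// Hypothetical concrete reader that yields `total` one-field instances.
class CountingReader : public ReaderBase {
 public:
  explicit CountingReader(int total = 0) : total_(total) {}
  void ReadNext(std::vector<Tensor>* out) override {
    out->assign(1, Tensor{next_++});
  }
  bool HasNext() const override { return next_ < total_; }
 private:
  int next_ = 0;
  int total_;
};

// One concrete type that forwards every call to whatever reader it owns.
class ReaderHolder {
 public:
  void Reset(ReaderBase* reader) { reader_.reset(reader); }
  ReaderBase* Get() const { return reader_.get(); }
  void ReadNext(std::vector<Tensor>* out) {
    assert(reader_ != nullptr);
    reader_->ReadNext(out);
  }
  bool HasNext() const { return reader_ != nullptr && reader_->HasNext(); }
 private:
  std::unique_ptr<ReaderBase> reader_;
};

// Toy container (not Paddle's Variable): content is only reachable when the
// requested type is exactly the stored type, never through a base class.
class ToyVariable {
 public:
  template <typename T>
  T* GetMutable() {
    if (type_ != std::type_index(typeid(T))) {
      holder_ = std::make_shared<T>();
      type_ = std::type_index(typeid(T));
    }
    return static_cast<T*>(holder_.get());
  }
  template <typename T>
  T* TryGet() const {
    if (type_ != std::type_index(typeid(T))) return nullptr;
    return static_cast<T*>(holder_.get());
  }
 private:
  std::shared_ptr<void> holder_;
  std::type_index type_{typeid(void)};
};

// Works for any reader type, because it only needs ReaderHolder.
int CountRemaining(ReaderHolder* reader) {
  std::vector<Tensor> ins;
  int count = 0;
  while (reader->HasNext()) {
    reader->ReadNext(&ins);
    ++count;
  }
  return count;
}
```

A variable that stores a `CountingReader` directly cannot be queried as `ReaderBase`, while a variable that stores a `ReaderHolder` can be read without naming the concrete reader type at all.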

CreateReaderOp

Each reader has its own creating op. A file reader's creating op has no input and yields the created file reader as its output. A decorated reader's creating op takes the underlying reader as its input and yields a new decorated reader.

ReadOp(code)

A reader is only a Variable; it cannot trigger the reading process by itself, so we add the ReadOp to execute it. A ReadOp takes a reader Variable as its input. Each time it runs, it invokes the reader's ReadNext() function and gets a new batch of data (or a single instance of data, if a file reader is used directly). A reader outputs its data as a std::vector<LoDTensor>, so the ReadOp also needs to split the vector and move the LoDTensors to their respective output Variables.
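The split-and-move step can be sketched like this. It is an illustration under stated assumptions rather than the op's actual kernel: `RunReadOnce` and `PairReader` are hypothetical names, output Variables are modeled as plain `Tensor*` slots, `Tensor` stands in for LoDTensor, and a size mismatch is reported with a standard exception.

```cpp
#include <cstddef>
#include <stdexcept>
#include <utility>
#include <vector>

using Tensor = std::vector<int>;  // stand-in for LoDTensor (assumption)

class ReaderBase {
 public:
  virtual ~ReaderBase() = default;
  virtual void ReadNext(std::vector<Tensor>* out) = 0;
  virtual bool HasNext() const = 0;
};

// Hypothetical reader whose instances have two fields: "image" and "label".
class PairReader : public ReaderBase {
 public:
  void ReadNext(std::vector<Tensor>* out) override {
    *out = {Tensor{next_, next_ + 1}, Tensor{next_ % 2}};
    ++next_;
  }
  bool HasNext() const override { return next_ < 2; }
 private:
  int next_ = 0;
};

// One run of a ReadOp-like step: read once, then hand each Tensor in the
// returned vector to the output slot at the same position.
void RunReadOnce(ReaderBase* reader, const std::vector<Tensor*>& outputs) {
  std::vector<Tensor> ins;
  reader->ReadNext(&ins);
  if (ins.size() != outputs.size()) {
    throw std::runtime_error("reader fields and output slots differ in count");
  }
  for (std::size_t i = 0; i < ins.size(); ++i) {
    *outputs[i] = std::move(ins[i]);  // move, so the data is not copied
  }
}
```

Calling `RunReadOnce(&reader, {&image, &label})` fills `image` and `label` from one instance; passing a different number of output slots than the reader has fields throws.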

@JiayiFeng JiayiFeng changed the title File Reader in C++ File Readers in C++ Jan 31, 2018
@JiayiFeng JiayiFeng changed the title File Readers in C++ [WIP] File Readers in C++ Jan 31, 2018
@JiayiFeng JiayiFeng changed the title [WIP] File Readers in C++ Fundamental Data Fedding in C++ Feb 6, 2018
@JiayiFeng JiayiFeng changed the title Fundamental Data Fedding in C++ Fundamental Data Reading in C++ Feb 6, 2018
// file readers

template <typename T>
class RandomReader : public FileReader {
Collaborator

@reyoung reyoung Feb 7, 2018

This reader is used by unittests.

Maybe we should use `RandomDataGenerator` instead of `RandomReader`, because

1. It does not read anything.
2. It is confusing between `ShuffleReader` and `RandomReader`.

Collaborator Author

fixed.

ReaderBase* Get() const { return reader_.get(); }

void ReadNext(std::vector<LoDTensor>* out) { reader_->ReadNext(out); }
bool HasNext() const { return reader_->HasNext(); }
Collaborator

For readers, maybe we should add a Reset() method to read from the beginning?

Collaborator Author

@JiayiFeng JiayiFeng Feb 7, 2018

Yes. Further discussion is needed about what exactly we should do when a reader finishes one pass of reading.

Also, Reset is already used by ReaderHolder to keep its API consistent with unique_ptr, so I will add a ReInit method to readers instead.

namespace paddle {
namespace operators {

std::vector<framework::DDim> RestoreShapes(const std::vector<int>& shape_concat,
Collaborator

this method should be static.

Collaborator Author

Fixed.

break;
}
}
std::random_shuffle(buffer_.begin(), buffer_.end());
Collaborator

This implementation could be very slow. However, we can optimize it later.

Collaborator Author

Thanks, I didn't know that before. I will add a TODO here.
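Separately from the speed concern, `std::random_shuffle` is deprecated in C++14 and removed in C++17, so a later revision would likely need `std::shuffle` with an explicit random engine. A minimal sketch, assuming the caller owns the engine; `ShuffleBuffer` is a hypothetical helper name, not something in this PR:

```cpp
#include <algorithm>
#include <cassert>
#include <random>
#include <vector>

// Shuffles `buffer` in place using a caller-supplied engine, so the seed
// (and therefore reproducibility) is controlled by the caller.
template <typename T>
void ShuffleBuffer(std::vector<T>* buffer, std::mt19937* rng) {
  assert(buffer != nullptr && rng != nullptr);
  std::shuffle(buffer->begin(), buffer->end(), *rng);
}
```

Passing the engine in, rather than constructing one per call, avoids reseeding on every refill of the buffer.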

batch_shape[0] += ins_shape[0];
}

LoDTensor out_tensor;
Collaborator

Maybe we can invoke MergeTensorOp here.

Collaborator Author

The MergeTensorOp can only merge two LoDTensors (the true branch output and the false branch output). However, in BatchReader we need to merge far more than two LoDTensors.

Collaborator

reyoung commented Feb 7, 2018

@JiayiFeng Maybe we can move the PR description to a design doc?

@JiayiFeng
Collaborator Author

Sure. I will move it to a new design doc in a later PR.

Collaborator

@reyoung reyoung left a comment

Excellent

@JiayiFeng JiayiFeng merged commit 812cf15 into PaddlePaddle:develop Feb 7, 2018
@JiayiFeng JiayiFeng deleted the dev_reader branch February 7, 2018 09:45