Lod Tensor design doc #3746

Merged (2 commits) on Sep 3, 2017

@wangkuiyi wangkuiyi commented Aug 29, 2017

This design doc PR is going to replace #3454

@wangkuiyi wangkuiyi requested review from Superjomn, backyes, jacquesqiao, qingqing01, JiayiFeng and lcy-seso and removed request for backyes August 29, 2017 18:45

```
3
3 1 2
Contributor

Could we store offset information here instead, i.e. 0, 3, 4, 6? Then the 3 at the top would not need to be stored either.
Paddle originally stores offsets, which is convenient for sequence-related layers such as maxlayer and seqlastinlayer.

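Contributor

@lcy-seso lcy-seso Aug 30, 2017

The top-level 3 should always equal the size of the tensor's first dimension (that is, the batch size), shouldn't it? So there should be no need to store it separately?

Actually that may not hold either (it depends on how batch size is defined). The top-level "3" is always a scalar and always equals the number of elements in the next level (or the number of elements minus 1 if start positions are stored), but it can indeed be derived automatically.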

Contributor

If the top-level 3 is stored, then for multi-level sequences only the topmost one is useful; no other level uses it. So I think there is no need to store it, and then every level is handled in exactly the same way.

Contributor

@lcy-seso lcy-seso Aug 30, 2017

Right. This 3 has no practical use; its apparent benefit is that the conceptual structure looks more uniform.

Contributor

We discussed this, and lengths still seem better. First, slicing does not require changing offsets; second, it is easier to read; third, access is mostly sequential, with few random lookups.

Contributor

"First, slicing does not require changing offsets; second, it is easier to read; third, access is mostly sequential, with few random lookups."

  1. Slicing is only used in the recurrentOp framework, whereas every other sequence-related Op would have to recompute the offsets each time, which is quite unnecessary. @lcy-seso
  2. Offsets are also easy to understand, no harder than lengths.
  3. The third point applies equally to lengths and offsets.

Collaborator Author

What does "every other sequence-related Op" refer to?

Contributor

It refers to ops that need sequence information, including the following 20-plus ops @wangkuiyi
[image]

3
3 1 2
3 2 4 1 2 3
||| || |||| | || |||
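Contributor

@luotao1 luotao1 Aug 30, 2017

What @Superjom, @qingqing01, @hedaoyuan and I discussed earlier:
For an N-level sequence, Lod.size() = N; Lod[0] stores the sentence offsets, Lod[1] stores the paragraph offsets, and so on (similar to how Paddle stores them today):

0,3,5,9,10,12,15
0,    9,10,     15

There are two reasons for storing it this way:

  • Fine granularity comes first: whether configuring a single-level or a multi-level RNN, sequence-related layers work at sentence granularity in the vast majority of cases, so taking the first element of the vector yields all sentence-level information.
  • Offsets are used for the layout. If 3 1 2 were used to record paragraph information, it would have to be combined with 3 2 4 1 2 3 to obtain each paragraph's length. Two inconveniences follow:
    • For sequence-related layers that only process paragraph information, e.g. using maxLayer to get the max of each paragraph: with 0, 9, 10, 15, a single pass suffices, finding the max element within [0,9], [9,10] and [10,15] respectively.
    • For GPU kernels, only pointer-typed data can be passed to a CUDA function; a vector cannot. With offsets, one argument is enough, and no conversion to offsets is needed before each call. @hedaoyuan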

Collaborator Author

That seems reasonable. Should we write two programs (with calling examples) and compare them, so that we know which way is better?

Contributor

@luotao1 luotao1 Sep 3, 2017

Take the forward pass of sequenceLastInstanceLayer as an example. This layer takes the last element of each sequence or each paragraph. The current Paddle source code:

  // The sequenceLastInstanceLayer kernel does not need to know whether
  // startPositions_ refers to paragraphs or sequences, because the core
  // computation is the same.
  // startPositions_ =
  //    type_ ? input.subSequenceStartPositions : input.sequenceStartPositions;
  auto starts = startPositions_->getData(false);
  MatrixPtr inputValue = getInputValue(0);
  MatrixPtr outputValue = getOutputValue();
  instanceIds_.clear();  // instanceIds_ records the selected indices so that backward can use them.
  for (size_t seqId = 0; seqId < newBatchSize_; ++seqId) {
    int insId = reversed_ ? starts[seqId] : starts[seqId + 1] - 1;
    instanceIds_.push_back(insId);
    outputValue->subMatrix(seqId, 1, tmpDest_)
          ->assign(*(inputValue->subMatrix(insId, 1, tmpSrc_)));
  }

If the information were stored as lengths instead:

  // lengthPositions_ =
  //    type_ ? input.subLengthPositions : input.lengthPositions;
  auto length = lengthPositions_->getData(false);
  MatrixPtr inputValue = getInputValue(0);
  MatrixPtr outputValue = getOutputValue();
  instanceIds_.clear();
  int offset = 0;  // extra line
  for (size_t seqId = 0; seqId < newBatchSize_; ++seqId) {
    // offset is the start of the current sequence; one extra addition is needed.
    int insId = reversed_ ? offset : offset + length[seqId] - 1;
    instanceIds_.push_back(insId);
    outputValue->subMatrix(seqId, 1, tmpDest_)
          ->assign(*(inputValue->subMatrix(insId, 1, tmpSrc_)));
    offset += length[seqId];  // extra line
  }

Because outputValue is all contiguous memory, fetching an element requires an extra step of converting length to offset.

Contributor

@wangkuiyi @Superjom For the two comparison examples above, I assumed that the length-based structure looks like this:

9           1  5
3   2  4    1  2  3

If it is instead stored as:

3           1  2
3   2  4    1  2  3

then computing subSequenceLength requires first computing lengthPositions from 3,1,2 and 3,2,4,1,2,3 together. That would take not just the three extra lines above, but at least ten.

Contributor

That seems right.

It should be stored in the first form above.

Contributor

@qingqing01 qingqing01 left a comment

For Slice, one usually also needs to specify the [start, end] positions. I think we could look at how slicing is actually used in SubSequenceLayer and SliceProjection, or at the slice ops in other frameworks.


## Challenge of Variable-length Inputs

People usually represent a mini-batch by a Tensor. For example, a mini-batch of 32 images, each of size 32x32, is a 10x32x32 Tensor. So a transformation, T, of all images can be a matrix multiplication of the 32x32xO-dimensional tensor T and the 10x32x32 Tensor.
Contributor

32 images -> 10 images

typedef std::vector<std::vector<int> > LoD;
```

- The LoD index can is not necessary when there are only two levels and all elements of the second level have length 1.
Contributor

can is not -> is not


## Slicing of LoD Tensor

Consider that we have a network with three levels of RNN: the top level one handles articles, the second level one handles sentences, and the basic level one handles words. This network requires that mini-batches represented by 4 level LoD Tensor, for example,
Contributor

  1. Usually, an RNN only handles sequences (although a non-sequence can be regarded as a sequence with only one element, hardly anyone uses an RNN for non-sequences). I think there are two levels of RNN: the top-level one handles articles (each a sequence of sentences), the second-level one handles sentences (each a sequence of words), and the basic-level elements are word embedding vectors.

  2. I do not quite understand why a 4-level LoD tensor, rather than a 3-level one, is needed here to represent a nested sequence.

Contributor

it seems that 3 level is enough.

Collaborator Author

@wangkuiyi wangkuiyi Sep 2, 2017

It should be 3. Changing.


```c++
typedef std::vector<std::vector<int> > LoD;
```
Contributor

@lcy-seso lcy-seso Aug 30, 2017

Does the definition of LoD tensor (line 57) mean that the current design only supports slicing three-dimensional data, or in other words, slicing at two levels?

Contributor

each level is stored in a vector<int>

so a vector<vector<int>> is enough to store any levels.
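
To make the reply above concrete, here is a minimal sketch of a nested vector holding an arbitrary number of levels. It assumes the length-based layout from the excerpt under review (3 / 3 1 2 / 3 2 4 1 2 3); the helper names `IsConsistentLengthLoD` and `NumElements` are hypothetical and do not come from the PR.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Same shape as the typedef under review: one vector<int> per level.
typedef std::vector<std::vector<int> > LoD;

// Hypothetical helper (not from the design doc): checks that a length-based
// LoD is self-consistent, i.e. the lengths on each level sum to the number
// of entries on the next finer level.
bool IsConsistentLengthLoD(const LoD& lod) {
  for (std::size_t level = 0; level + 1 < lod.size(); ++level) {
    const int total =
        std::accumulate(lod[level].begin(), lod[level].end(), 0);
    if (total != static_cast<int>(lod[level + 1].size())) return false;
  }
  return true;
}

// Hypothetical helper: number of rows the underlying tensor must have,
// which is the sum of the lengths on the finest level.
int NumElements(const LoD& lod) {
  if (lod.empty()) return 0;
  return std::accumulate(lod.back().begin(), lod.back().end(), 0);
}
```

Nothing in either helper depends on the number of levels, which is the point of the reply: adding a level only means pushing one more `vector<int>` onto the outer vector.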

```

- The LoD index can is not necessary when there are only two levels and all elements of the second level have length 1.

Contributor

an extra "can" in line 60.

In summary, as long as the essential elements (words or images) have the same size, we can represent mini-batches by a LoD Tensor:

- The underlying tensor has size LxD1xD2x..., where D1xD2... is the size of the essential elements, and
- the first dimension size L has an additional property -- a LoD index as a nested vector:
Contributor

Here is my understanding of LoD tensor; please tell me whether I am right:

  • One input batch (a tensor) has only one LoD tensor (if needed).
  • It can be regarded as a kind of splitting information attached to the first dimension of the input tensor.
  • It is not that each dimension has a LoD tensor (if needed).

Contributor

yes, some split information + a tensor = LoD tensor

3
3 1 2
口口口 口 口口
```
Contributor

@lcy-seso lcy-seso Aug 30, 2017

  • Maybe it is better to store the start positions as Paddle currently does, which is more convenient for slicing the original input; otherwise, all the layers that process sequences may have to compute the sequence start positions in a batch.
  • On the other hand, with sequence start positions in hand (they are offsets of a sequence in a batch), it is very easy to get a sequence's length.
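
The asymmetry between the two representations can be sketched as follows, assuming start positions are stored as n + 1 offsets beginning at 0 (for example 0, 3, 4, 6 for lengths 3, 1, 2). The function names `LengthsFromStarts` and `StartsFromLengths` are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: recovers per-sequence lengths from start positions.
// `starts` is assumed to hold n + 1 offsets for n sequences, beginning at 0.
// Each length only needs two neighbouring entries.
std::vector<int> LengthsFromStarts(const std::vector<int>& starts) {
  std::vector<int> lengths;
  for (std::size_t i = 0; i + 1 < starts.size(); ++i) {
    lengths.push_back(starts[i + 1] - starts[i]);
  }
  return lengths;
}

// Hypothetical helper for the opposite direction: each start position
// depends on a running sum over all earlier sequences.
std::vector<int> StartsFromLengths(const std::vector<int>& lengths) {
  std::vector<int> starts(1, 0);
  for (std::size_t i = 0; i < lengths.size(); ++i) {
    starts.push_back(starts.back() + lengths[i]);
  }
  return starts;
}
```

With offsets, the length of sequence i is a local lookup; with lengths, the start of sequence i requires summing everything before it.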

@qingqing01
Contributor

qingqing01 commented Aug 30, 2017

My understanding is that the storage format should still follow @Superjom's earlier document. For example:

Single level

3 (batch size: 3 samples)
2  1 3 (2, 1, and 3 words respectively)
|| | ||| (word level)

Stored as typedef vector<vector<int> > LoD:

std::vector<std::vector<int>>  LoD = {{0,2,3,6}} 
  • 1 = LoD.size: the number of levels
  • 3 = LoD[0].size - 1: the batch size

Two levels

3 (batch size: 3 samples)
2       1    3 (sentence level: 2, 1, and 3 sentences respectively)
2  3    2    3  2  1 (word level)
|| |||  ||  ||| || | 

Stored as typedef vector<vector<int> > LoD:

// LoD[0]: sentence information
// LoD[1]: word information
std::vector<std::vector<int>>  LoD = {{0,5,7,13}, {0,2,5,7,10,12,13}} 
or

// LoD[0]: word information
// LoD[1]: sentence information
std::vector<std::vector<int>>  LoD = {{0,2,5,7,10,12,13},{0,5,7,13}} 
  • 2 = LoD.size: the number of levels
  • 3 = LoD[0].size - 1: the batch size, for the first layout above
    • or 3 = LoD[LoD.size - 1].size - 1: the batch size, for the second layout
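
As a rough illustration of the two-level example above, the sketch below derives the first (coarse-first) offset layout from the nested counts. It is only a sketch under that assumption; `BuildTwoLevelOffsetLoD` and `BatchSize` are hypothetical names, not functions from Paddle.

```cpp
#include <cstddef>
#include <vector>

typedef std::vector<std::vector<int> > LoD;

// Hypothetical helper: builds the coarse-first offset layout
// {sample offsets, sentence offsets}, both measured in words.
LoD BuildTwoLevelOffsetLoD(const std::vector<int>& sentences_per_sample,
                           const std::vector<int>& words_per_sentence) {
  // Fine level: running sum of the number of words in each sentence.
  std::vector<int> word_offsets(1, 0);
  for (std::size_t i = 0; i < words_per_sentence.size(); ++i) {
    word_offsets.push_back(word_offsets.back() + words_per_sentence[i]);
  }
  // Coarse level: each sample ends where its last sentence ends.
  // at() throws if the two inputs are inconsistent.
  std::vector<int> sample_offsets(1, 0);
  std::size_t sentence = 0;
  for (std::size_t i = 0; i < sentences_per_sample.size(); ++i) {
    sentence += static_cast<std::size_t>(sentences_per_sample[i]);
    sample_offsets.push_back(word_offsets.at(sentence));
  }
  LoD lod;
  lod.push_back(sample_offsets);
  lod.push_back(word_offsets);
  return lod;
}

// Batch size for the coarse-first layout: the number of entries on
// level 0, not counting the leading 0.
std::size_t BatchSize(const LoD& lod) {
  return (lod.empty() || lod[0].empty()) ? 0 : lod[0].size() - 1;
}
```

For the fine-first layout the same batch size would be read from the last level instead of the first, as the final bullet above notes.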

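@lcy-seso
Contributor

@qingqing01 I also support this format; otherwise every layer that operates on sequences would potentially have to recompute, on its own, the in-memory offset of the data it needs, and that computation is unnecessary.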

@lcy-seso
Contributor

lcy-seso commented Aug 30, 2017

@qingqing01 I have a question about the format you posted above. When batch size = 1, for data like this:

std::vector<std::vector<int>>  LoD = {{0,2,3,6}} 

how do we tell whether this is a single-level sequence, or a two-level sequence whose first-level nested sequence contains only one sequence?

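@luotao1
Contributor

luotao1 commented Aug 30, 2017

"How do we tell whether this is a single-level sequence, or a two-level sequence whose first-level nested sequence contains only one sequence?"

That is easy: just look at LoD.size(). For a single-level sequence, LoD.size() = 1.
For two levels, std::vector<std::vector<int>> LoD = {{0,2,3,6}, {0,6}}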
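
The distinction can be sketched in a few lines, assuming the fine-first offset layout used in the reply above (word-level offsets first, coarser levels after). `IsNested` and `WrapInOuterSequence` are hypothetical helper names.

```cpp
#include <vector>

typedef std::vector<std::vector<int> > LoD;

// Hypothetical helper: a LoD with more than one level describes a
// nested sequence; the number of levels alone settles the question.
bool IsNested(const LoD& lod) { return lod.size() > 1; }

// Hypothetical helper: wraps everything described by `lod` into one
// outer sequence by appending a coarser level {0, total}, where total
// is the last offset of the current coarsest level.
LoD WrapInOuterSequence(const LoD& lod) {
  LoD wrapped = lod;
  if (lod.empty() || lod.back().empty()) return wrapped;
  std::vector<int> outer;
  outer.push_back(0);
  outer.push_back(lod.back().back());
  wrapped.push_back(outer);
  return wrapped;
}
```

Applied to {{0,2,3,6}}, the wrapper yields {{0,2,3,6}, {0,6}}: the same three sequences, now explicitly grouped under a single outer sequence.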

@lcy-seso
Contributor

Got it~ I overlooked LoD.size.

```c++
typedef vector<vector<int> > LoD;

struct LoDTensor {
Member

Is LodTensor a composition of Lod and Tensor* or a derived class from Tensor?

Contributor

yes

@wangkuiyi
Collaborator Author

I'd like to switch to storing start offsets instead of lengths in LoD, as explained in #3746 (comment). However, it seems that those 0's in LoD are not necessary? @qingqing01

// LoD[0]: sentence information
// LoD[1]: word information
std::vector<std::vector<int>>  LoD = {{0,5,7,13}, {0,2,5,7,10,12,13}} 
or

// LoD[0]: word information
// LoD[1]: sentence information
std::vector<std::vector<int>>  LoD = {{0,2,5,7,10,12,13},{0,5,7,13}} 

@Superjomn
Contributor

Superjomn commented Sep 3, 2017

I read the code in Luotao's comment; it seems that both approaches need about the same number of lines of code.

The length approach does poorly at random element access, while the offset approach does poorly at slicing and is less concise.

We do need a concise data structure to make our new concept, LoD, easier to understand, so I prefer the length approach with some performance improvements.

Currently, LODTensor will only be used by RNNOp, which accesses the LoD Tensor's elements at a given level sequentially, so we can add an iterator to the length approach to make it faster.

@luotao1 @wangkuiyi

Contributor

@Superjomn Superjomn left a comment

LGTM

@wangkuiyi
Collaborator Author

I agree with @Superjom that when multiple designs have similar implementation complexity, we should prefer the design that is easiest to understand.

@wangkuiyi wangkuiyi merged commit 5e78359 into PaddlePaddle:develop Sep 3, 2017
@wangkuiyi
Collaborator Author

wangkuiyi commented Sep 3, 2017

I forgot to follow up on some comments before merging this PR. I created #3837 to remind myself to do it.

@luotao1
Contributor

luotao1 commented Sep 4, 2017

@Superjom LODTensor will be used in more than 20 ops #3746 (comment). In these ops, using offsets is more convenient than using lengths. And in the future, we will add more sequence-related ops.

Although using lengths is more convenient in RecurrentOp, given the ratio of 20+ ops to 1 RecurrentOp, we should choose offsets.

@luotao1
Contributor

luotao1 commented Sep 6, 2017

@wangkuiyi @Superjom For CPU computation, storing lengths or offsets makes little difference; the former only adds O(1) extra computation. For GPU computation, however, offsets must be stored. Take the forward GPU kernel of maxlayer as an example; the code is in hl_cuda_sequence.cu

__global__ void KeMaxSequenceForward(real* input,
                                     const int* sequence,
                                     real* output,
                                     int* index,
                                     int numSequences,
                                     int dim) {
  int dimIdx = threadIdx.x;
  int sequenceId = blockIdx.x; // a thread block, scheduled in arbitrary order
  if (sequenceId >= numSequences) return;
  int start = sequence[sequenceId]; // start position of the sequence handled by this block
  int end = sequence[sequenceId + 1]; // end position of the sequence handled by this block

  for (int i = dimIdx; i < dim; i += blockDim.x) {
    real tmp = -HL_FLOAT_MAX;
    int tmpId = -1;
    for (int insId = start; insId < end; insId++) {
      if (tmp < input[insId * dim + i]) {
        tmp = input[insId * dim + i];
        tmpId = insId;
      }
    }
    output[sequenceId * dim + i] = tmp;
    index[sequenceId * dim + i] = tmpId;
  }
}

The code above shows that:

  • If offsets are stored, each thread block can directly read the start and end positions of its sequence.
  • If lengths are stored, each thread block needs an extra O(n^2) of work to obtain the start and end positions of its sequence. Alternatively, the lengths can be converted to offsets before the GPU kernel runs, which is also O(n^2), where n is the number of sequences.


6 participants