Lod Tensor design doc #3746
Conversation
```
3
3 1 2
```
Can we store offset information here instead, i.e. 0, 3, 4, 6? That way the 3 at the top would not need to be stored either.
The original Paddle stores offset information, which makes the operations of sequence-related layers such as maxLayer and seqLastInLayer more convenient.
The top-level 3 should always equal the size of the tensor's first dimension (i.e. the batch size), so there should be no need to store it explicitly?
Actually that is not quite the case (it depends on how batch size is defined): the top-level "3" is always a scalar equal to the number of elements in the next level (or the number of elements minus 1 if start positions are stored), but it can indeed be derived automatically.
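To make the two candidate encodings concrete, here is a minimal sketch using the numbers from this thread; the variable names are illustrative only:

```c++
#include <vector>

int main() {
  // Length form: the per-sequence lengths (plus the top-level 3 under discussion).
  std::vector<int> lengths = {3, 1, 2};     // three sequences of 3, 1, and 2 elements
  // Offset form: start positions into the underlying tensor, i.e. 0, 3, 4, 6.
  std::vector<int> offsets = {0, 3, 4, 6};  // sequence i spans [offsets[i], offsets[i+1])

  // The top-level "3" can be derived from either form, so it need not be stored.
  int numFromLengths = static_cast<int>(lengths.size());      // 3
  int numFromOffsets = static_cast<int>(offsets.size()) - 1;  // 3
  (void)numFromLengths;
  (void)numFromOffsets;
  return 0;
}
```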
If we store the top-level 3, then for multi-level sequences only the topmost 3 is ever useful; the other levels never use it. So I don't think it is necessary to store it, and this way every level is handled in exactly the same way.
Right, this 3 has no practical use; its apparent benefit is only that the conceptual structure looks more uniform.
We discussed this and still feel lengths are better: first, slicing does not require recomputing offsets; second, lengths are easier to read; third, access is mostly sequential, with little need for random lookups.
> first, slicing does not require recomputing offsets; second, lengths are easier to read; third, access is mostly sequential, with little need for random lookups
- Slicing is only needed inside the RecurrentOp framework, while all the other sequence-related Ops would have to recompute the offsets every time, which is quite unnecessary. @lcy-seso
- Offsets are also easy to read; they are no harder to understand than lengths.
- The third point applies equally to lengths and offsets.
"其他所有sequence相关的Op" 指的是什么?
I mean the ops that need sequence information, including the following 20+ ops @wangkuiyi :
```
3
3 1 2
3 2 4 1 2 3
||| || |||| | || |||
```
What we discussed earlier with @Superjom @qingqing01 @hedaoyuan was:
For an N-level sequence, Lod.size() = N; Lod[0] stores the sentence offsets, Lod[1] stores the paragraph offsets, and so on (similar to how Paddle stores this today):

0,3,5,9,10,12,15
0, 9,10, 15

There are two reasons for storing it this way:

- Put the finer granularity first: for both single-level and multi-level RNN configurations, sequence-related layers deal with sentence granularity in the vast majority of cases, so taking the first element of the vector is enough to get all the sentence-level information.
- Use offsets. If paragraph information is recorded as 3 1 2, it has to be combined with 3 2 4 1 2 3 to recover each paragraph's length. That is inconvenient in two ways:
  - For sequence-related layers that only process paragraph information, e.g. using maxLayer to get the max of each paragraph, with 0, 9, 10, 15 a single pass suffices: just find the max element within [0,9], [9,10], and [10,15] respectively (sketched after this list).
  - For GPU kernels, only pointer-typed data can be passed to a CUDA function, not a vector. If offsets are passed, a single argument is enough, with no need to convert to offsets before every call. @hedaoyuan
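A minimal sketch of that single-pass traversal over the offsets 0, 9, 10, 15; the names are illustrative, not Paddle's API:

```c++
#include <algorithm>
#include <vector>

// Per-paragraph max with the offset encoding: one pass, no extra bookkeeping.
std::vector<float> segmentMax(const std::vector<float>& values,
                              const std::vector<int>& offsets) {
  std::vector<float> result;
  for (size_t i = 0; i + 1 < offsets.size(); ++i) {
    result.push_back(*std::max_element(values.begin() + offsets[i],
                                       values.begin() + offsets[i + 1]));
  }
  return result;
}
```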
That sounds reasonable. Shouldn't we write two programs (plus calling examples) and compare them, so we can see which way is actually better?
Take the forward pass of sequenceLastInstanceLayer as an example: this layer takes the last element of each sequence or each paragraph. Paddle's current source code:

```c++
// The kernel of sequenceLastInstanceLayer does not need to know whether
// startPositions_ holds paragraph or sequence positions, because the core
// computation is the same.
// startPositions_ =
//     type_ ? input.subSequenceStartPositions : input.sequenceStartPositions;
auto starts = startPositions_->getData(false);
MatrixPtr inputValue = getInputValue(0);
MatrixPtr outputValue = getOutputValue();
instanceIds_.clear();  // instanceIds_ records the selected indices so they can be reused in backward.
for (size_t seqId = 0; seqId < newBatchSize_; ++seqId) {
  int insId = reversed_ ? starts[seqId] : starts[seqId + 1] - 1;
  instanceIds_.push_back(insId);
  outputValue->subMatrix(seqId, 1, tmpDest_)
      ->assign(*(inputValue->subMatrix(insId, 1, tmpSrc_)));
}
```
If we instead store lengths:

```c++
// lengthPositions_ =
//     type_ ? input.subLengthPositions : input.lengthPositions;
auto length = lengthPositions_->getData(false);
MatrixPtr inputValue = getInputValue(0);
MatrixPtr outputValue = getOutputValue();
instanceIds_.clear();
int offset = 0;  // extra line
for (size_t seqId = 0; seqId < newBatchSize_; ++seqId) {
  int insId = reversed_ ? offset : offset + length[seqId] - 1;  // needs one extra addition
  instanceIds_.push_back(insId);
  outputValue->subMatrix(seqId, 1, tmpDest_)
      ->assign(*(inputValue->subMatrix(insId, 1, tmpSrc_)));
  offset += length[seqId];  // extra line
}
```

Because outputValue is contiguous memory, fetching the elements takes an extra step of converting lengths into offsets.
@wangkuiyi @Superjom In the two comparison examples above, I assumed the length-based storage looks like this:

9 1 5
3 2 4 1 2 3

If it is instead stored as:

3 1 2
3 2 4 1 2 3

then computing subSequenceLength requires first combining 3,1,2 with 3,2,4,1,2,3 to compute the lengthPositions. That is no longer just the three extra lines shown above, but at least ten more.
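A sketch of that extra preprocessing, assuming the nested-lengths layout above (3 1 2 over 3 2 4 1 2 3); the function and variable names are illustrative only:

```c++
#include <vector>

// Given paragraph lengths (counted in sub-sequences) and sub-sequence lengths
// (counted in elements), compute each paragraph's start offset in the tensor.
std::vector<int> paragraphOffsets(const std::vector<int>& paraLens,   // e.g. {3, 1, 2}
                                  const std::vector<int>& seqLens) {  // e.g. {3, 2, 4, 1, 2, 3}
  std::vector<int> offsets = {0};
  size_t seqIdx = 0;
  for (int nSeqs : paraLens) {
    int total = 0;
    for (int j = 0; j < nSeqs; ++j) total += seqLens[seqIdx++];
    offsets.push_back(offsets.back() + total);
  }
  return offsets;  // {0, 9, 10, 15} for the example values
}
```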
It seems so.
We should store it in the first form above.
For Slice, you usually also need to specify [start, end] positions. I think we could look at how SubSequenceLayer and SliceProjection actually use slicing, or at the slice ops of other frameworks.
## Challenge of Variable-length Inputs

People usually represent a mini-batch by a Tensor. For example, a mini-batch of 32 images, each of size 32x32, is a 10x32x32 Tensor. So a transformation, T, of all images can be a matrix multiplication of the 32x32xO-dimensional tensor T and the 10x32x32 Tensor.
32 images -> 10 images,
```c++
typedef std::vector<std::vector<int> > LoD;
```
- The LoD index can is not necessary when there are only two levels and all elements of the second level have length 1. |
can is not -> is not
## Slicing of LoD Tensor

Consider that we have a network with three levels of RNN: the top level one handles articles, the second level one handles sentences, and the basic level one handles words. This network requires that mini-batches be represented by a 4 level LoD Tensor, for example,
- Usually, RNN only handles sequences (though a non-sequence can be regarded as a sequence with only one element, hardly anyone uses RNN to handle non-sequences). I think there are two levels of RNN: the top level one handles articles (each a sequence of sentences), the second level one handles sentences (each a sequence of words), and the basic level ones are word embedding vectors.
- I do not quite understand why it is a 4 level LoD tensor rather than a 3 level LoD tensor that represents a nested sequence here?
It seems that 3 levels are enough.
It should be 3. Changing.
```c++
typedef std::vector<std::vector<int> > LoD;
```
Does the definition (line 57) of LoD tensor mean that the current design only supports slicing three-dimensional data? Or, put differently, slicing at two levels?
Each level is stored in a `vector<int>`, so a `vector<vector<int>>` is enough to store any number of levels.
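For example, here is a sketch of a two-level index stored in this type, reusing the offsets from the earlier example in this thread (the ordering of the levels is itself one of the points under discussion):

```c++
#include <vector>

typedef std::vector<std::vector<int> > LoD;

// Two index levels; a deeper nesting would simply add more inner vectors,
// and lod.size() gives the number of levels.
LoD lod = {{0, 9, 10, 15},             // paragraph-level offsets
           {0, 3, 5, 9, 10, 12, 15}};  // sentence-level offsets
```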
- The LoD index can is not necessary when there are only two levels and all elements of the second level have length 1.
an extra "can" in line 60.
In summary, as long as the essential elements (words or images) have the same size, we can represent mini-batches by a LoD Tensor:

- The underlying tensor has size LxD1xD2x..., where D1xD2... is the size of the essential elements, and
- the first dimension size L has an additional property -- a LoD index as a nested vector:
Here is how I understand LoD tensor; please help me figure out whether I am right:
- One input batch (a tensor) has only one LoD tensor (if needed).
- It can be regarded as a kind of splitting information attached to the first dimension of the input tensor.
- It is not that each dimension has a LoD tensor (if needed).
yes, some split information + a tensor = LoD tensor
```
3
3 1 2
口口口 口 口口
```
- Maybe it is better to store the start positions as Paddle currently does, which is more convenient for slicing the original input; otherwise, all the layers that process sequences may have to compute the sequence start positions within a batch.
- On the other hand, with the sequence start positions in hand (they are offsets of the sequences within a batch), it is very easy to get a sequence's length.
My understanding is that the storage format should still follow the format in @Superjom's earlier document. With `typedef std::vector<std::vector<int>> LoD`, for example:

Single level:
`std::vector<std::vector<int>> LoD = {{0,2,3,6}}`

Two levels:
`std::vector<std::vector<int>> LoD = {{0,5,7,13}, {0,2,5,7,10,12,13}}` (LoD[0]: sentence information; LoD[1]: word information)

or

`std::vector<std::vector<int>> LoD = {{0,2,5,7,10,12,13}, {0,5,7,13}}` (LoD[0]: word information; LoD[1]: sentence information)
@qingqing01 I support this form as well; otherwise every layer with sequence-related operations would potentially have to recompute the memory offset of the piece of data it needs to fetch, and that computation is unnecessary.
@qingqing01 I have a question about the format you posted above: when batch size = 1, given a piece of data like `std::vector<std::vector<int>> LoD = {{0,2,3,6}}`, how do we tell whether it is a single-level sequence, or a two-level nested sequence whose first level contains only one sequence?
That is simple: just check LoD.size(). For a single-level sequence, LoD.size() = 1.
I see now, I had overlooked LoD.size.
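A tiny sketch of that check (the helper name is hypothetical, not part of the design):

```c++
#include <vector>

typedef std::vector<std::vector<int> > LoD;

// The nesting depth is just the number of stored index levels, so
// {{0, 2, 3, 6}} with lod.size() == 1 is a single-level sequence.
bool isNestedSequence(const LoD& lod) { return lod.size() > 1; }
```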
```c++
typedef vector<vector<int> > LoD;

struct LoDTensor {
```
Is LoDTensor a composition of LoD and Tensor*, or a class derived from Tensor?
yes
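The answer above, together with the earlier remark that "some split information + a tensor = LoD tensor", suggests a composition. A rough sketch of what that could look like; the field names are assumptions, not the design doc's actual definition:

```c++
#include <vector>

class Tensor;  // the plain tensor type; details omitted in this sketch

typedef std::vector<std::vector<int> > LoD;

// Composition rather than inheritance: the index plus a pointer to a plain Tensor.
struct LoDTensor {
  LoD lod;
  Tensor* tensor;
};
```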
I'd like to change to saving start offsets instead of lengths in LoD, as explained in #3746 (comment). However, it seems that those 0's in LoD are not necessary? @qingqing01
I read the code in Luotao's comment; it seems that both approaches need about the same number of lines of code. The length approach does poorly when randomly accessing elements, while the offset approach does poorly in slicing and is less concise. We do need a concise data structure to make our new LoD concept easier to understand, so I prefer the length approach with some performance improvements. Currently, LoDTensor will only be used by RNNOp, which accesses a LoD Tensor's elements at a given level sequentially, so we can add an iterator to the length approach and make it faster.
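A minimal sketch of such an iterator over the length encoding, keeping a running offset so sequential access stays O(1) per step; the class name and interface are assumptions, not the actual Paddle API:

```c++
#include <cstddef>
#include <vector>

// Walks one level of a length-encoded LoD sequentially, exposing the
// [Begin, End) row range of each sequence without recomputing prefix sums.
class SeqLengthIterator {
 public:
  explicit SeqLengthIterator(const std::vector<int>& lengths)
      : lengths_(lengths) {}
  bool Done() const { return idx_ >= lengths_.size(); }
  int Begin() const { return offset_; }
  int End() const { return offset_ + lengths_[idx_]; }
  void Next() { offset_ += lengths_[idx_]; ++idx_; }

 private:
  const std::vector<int>& lengths_;
  std::size_t idx_ = 0;
  int offset_ = 0;
};
```

With such an iterator, sequential access never recomputes prefix sums, which addresses the main performance concern raised against the length approach above.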
LGTM
I agree with @Superjom that when multiple designs have similar implementation complexity, we'd prefer the design that is the easiest to understand.
I forgot to follow up on some comments before merging this PR. I created #3837 to remind myself to do it.
@Superjom LoDTensor will be used in more than 20 ops (#3746 (comment)). In these ops, using offsets is more convenient than using lengths, and in the future we will add more sequence-related ops. Although using lengths is more convenient in RecurrentOp, considering the ratio of 20+ ops to 1 RecurrentOp, we should choose offsets.
@wangkuiyi @Superjom For CPU computation, storing lengths or offsets makes little difference; the former only adds an O(1) computation per step. For GPU computation, however, offsets must be stored. Take the forward GPU kernel of maxLayer as an example; the code is in hl_cuda_sequence.cu
From that code we can see:
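A sketch of the core point, assuming a CUDA kernel can only receive raw device pointers (the names below are illustrative, not the code in hl_cuda_sequence.cu): with the offset encoding the index can be handed to the kernel as a single pointer, while a length encoding needs a host-side conversion before every launch.

```c++
#include <numeric>
#include <vector>

// With the offset encoding, the index is already in the form a kernel needs,
// e.g. kernel<<<grid, block>>>(input, offsets.data(), numSeqs, output).
// With the length encoding, a host-side prefix sum is required first:
std::vector<int> lengthsToOffsets(const std::vector<int>& lengths) {
  std::vector<int> offsets(lengths.size() + 1, 0);
  std::partial_sum(lengths.begin(), lengths.end(), offsets.begin() + 1);
  return offsets;  // this buffer would then be copied to the device
}
```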
This design doc PR is going to replace #3454