
Scheduled sampling #29

Merged 16 commits on Jun 19, 2017
Conversation

@wwhu (Contributor) commented May 8, 2017

Resolves #11
Note: this model may encounter a "Floating point exception" after training on several mini-batches.
Scheduled sampling needs the multiplex_layer API (PaddlePaddle/Paddle#1753), which has not been implemented in the current Paddle version. I implemented this layer in my repository (https://github.com/wwhu/Paddle/blob/ss-dev/python/paddle/trainer_config_helpers/layers.py) and will post a PR to the official Paddle repository after writing a unit test for it.
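For readers unfamiliar with the layer, here is a minimal NumPy sketch of the multiplex semantics this model relies on, assuming the layer selects among its data inputs row by row according to an index input (all names below are illustrative, not the Paddle API):

```python
import numpy as np

def multiplex(index, true_word, generated_word):
    # Row-wise selection: for sample i, pick true_word[i] when index[i] == 0
    # and generated_word[i] when index[i] == 1.
    candidates = np.stack([true_word, generated_word])  # shape (2, batch, dim)
    return candidates[index, np.arange(len(index))]

index = np.array([0, 1, 0])           # 0 -> true token, 1 -> generated token
true_word = np.ones((3, 4))
generated_word = np.zeros((3, 4))
print(multiplex(index, true_word, generated_word))  # rows 0 and 2 are ones
```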

@wwhu changed the title from "Ss dev" to "Scheduled sampling" on May 8, 2017
```
schduled_type: is the type of the decay. It supports constant, linear,
exponential, and inverse_sigmoid right now.
a: parameter of the decay (MUST BE DOUBLE)
b: parameter of the decay (MUST BE DOUBLE)
```
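As a sketch only, the four decay schedules might be computed as below in plain Python; the inverse-sigmoid formula follows the README's `epsilon_i=k/(k+exp(i/k))`, while the exact roles of `a` and `b` in this PR are assumptions, not confirmed by this excerpt:

```python
import math

# epsilon is the probability of feeding the TRUE token; i is the number of
# training samples processed so far. The roles of a and b are assumed.
schedule_computers = {
    "constant": lambda a, b, i: a,                      # fixed rate a
    "linear": lambda a, b, i: max(a, 1.0 - i / b),      # decay to floor a over b samples
    "exponential": lambda a, b, i: a ** (i / b),        # requires 0 < a < 1
    "inverse_sigmoid": lambda a, b, i: b / (b + math.exp(i / b)),  # b plays k's role
}

print(schedule_computers["linear"](0.1, 500000, 250000))  # 0.5
```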
Contributor:

Move the comments on lines 12-15 below the init function on line 18, since these three parameters first appear in init.

```
Get the schedule sampling rate. Usually not needed to be called by the users
'''

def getScheduleRate(self):
```
Contributor:

Move the comment on line 33 below line 36, and likewise elsewhere. Otherwise, if someone later inserts a function in between, it becomes unclear which function the comment belongs to.

@lcy-seso (Collaborator) left a comment:

Some small modifications.

```python
if __name__ == "__main__":
    schedule_generator = RandomScheduleGenerator("linear", 0.1, 500000)
    true_token_flag = schedule_generator.processBatch(5)
    pdb.set_trace()
```
Collaborator:

Please delete the debug-related code.

```
@@ -0,0 +1,56 @@
import numpy as np
import math
import pdb
```
Collaborator:

Please remove the debug module.

```python
    decoder_state=decoder_mem)

gru_out_memory = paddle.layer.memory(
    name='gru_out', size=target_dict_dim)  # , boot_with_const_id=0)
```
Collaborator:

Please remove the useless comment.

@lcy-seso lcy-seso self-assigned this May 9, 2017
@wwhu (author) commented May 10, 2017

I have revised the code. Please review it. Thanks.
@lcy-seso @luotao1

@lcy-seso (Collaborator) left a comment:

Scheduled sampling should not be used in generation; the multiplex layer should only be created during training.

```python
src_embedding = paddle.layer.embedding(
    input=src_word_id,
    size=word_vector_dim,
    param_attr=paddle.attr.ParamAttr(name='_source_language_embedding'))
```
Collaborator:

Because the parameter name _source_language_embedding is not used explicitly anywhere, it can be removed to avoid hard-coding.

```python
    return data_reader


def seqToseq_net(source_dict_dim, target_dict_dim, is_generating=False):
```
Collaborator:

Please comment on the parameters, as is done in random_schedule_generator.py.

```python
    input=backward_first)

def gru_decoder_with_attention_train(enc_vec, enc_proj, true_word,
                                     true_token_flag):
```
Collaborator:

Please comment on the parameters, as is done in random_schedule_generator.py.


```python
    return out

def gru_decoder_with_attention_test(enc_vec, enc_proj, current_word):
```
Collaborator:

Please comment on the parameters, as is done in random_schedule_generator.py.

```python
    param_attr=paddle.attr.ParamAttr(name='_target_language_embedding'))

current_word = paddle.layer.multiplex(
    input=[true_token_flag, true_word, generated_word_emb])
```
Collaborator:

This layer should not be created during generation, because in generation the generated word is always used.

wwhu (author):

The multiplex layer is in the function gru_decoder_with_attention_train, which is only called during training.

Collaborator:

Sorry, I see.

```python
    size=target_dict_dim,
    embedding_name='_target_language_embedding',
    embedding_size=word_vector_dim)
group_inputs.append(trg_embedding)
```
Collaborator:

In generation, the target embedding is unknown, so this configuration is not reasonable.

wwhu (author):

trg_embedding has the type GeneratedInputV2. It shares the target-language embedding matrix with the one used during training; it does not take ground-truth target words as inputs.

@wwhu (author) commented May 15, 2017

I have revised the comments and added the documentation in README.md.
The experimental results are not included in the document, since I haven't tuned the model's hyperparameters yet.
The model runs slowly on my CPU machine, so it will take some time to validate the performance of scheduled sampling.
@lcy-seso

@lcy-seso (Collaborator) left a comment:

Needs modifications.

```python
    param_attr=paddle.attr.ParamAttr(name='_target_language_embedding'))

current_word = paddle.layer.multiplex(
    input=[true_token_flag, true_word, generated_word_emb])
```
Collaborator:

Sorry, I see.

"""
The decoder step for training.
:param enc_vec: the encoder vector for attention
:type enc_vec: Layer
Collaborator:

Layer --> LayerOutput

```
:param enc_vec: the encoder vector for attention
:type enc_vec: Layer
:param enc_proj: the encoder projection for attention
:type enc_proj: Layer
```
Collaborator:

Layer --> LayerOutput

```
:param enc_proj: the encoder projection for attention
:type enc_proj: Layer
:param true_word: the ground-truth target word
:type true_word: Layer
```
Collaborator:

Layer --> LayerOutput

```
:param true_token_flag: the flag of using the ground-truth target word
:type true_token_flag: Layer
:return: the softmax output layer
:rtype: Layer
```
Collaborator:

Layer --> LayerOutput

- Inverse sigmoid decay: `epsilon_i=k/(k+exp(i/k))`, where `k>1` and `k` likewise controls the magnitude of the decay.
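To make the decay shape concrete, a small illustrative computation (k=10 here is an arbitrary example value, not a recommended setting):

```python
import math

def inverse_sigmoid_decay(i, k):
    # epsilon_i = k / (k + exp(i / k)), with k > 1 controlling the decay
    return k / (k + math.exp(i / k))

for i in (0, 10, 20, 40, 80):
    print(i, round(inverse_sigmoid_decay(i, k=10.0), 3))
# epsilon decays from ~0.909 toward 0 as i grows
```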

## Model Implementation
Since Scheduled Sampling is an improvement on the Sequence to Sequence model, its overall implementation framework is quite similar to that of Sequence to Sequence. To keep this article focused, only the parts related to Scheduled Sampling are introduced here; see `scheduled_sampling.py` for the complete code.
Collaborator:

Everything related to scheduled sampling needs to be explained, including:

  1. how the sampling probability decays
  2. how the multiplex layer is used

Also, for this family of functions that produce the sampling probability, what are the principles for setting their hyperparameters?

Here the data reader is wrapped: the `true_token_flag` sampled from `RandomScheduleGenerator` is added as another data input, controlling which elements are used for decoding.

```python
schedule_generator = RandomScheduleGenerator("linear", 0.75, 1000000)
```
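For illustration, the wrapping described above might look roughly like the following sketch; the tuple layout yielded by the base reader, and the leading 0 that would force the first decoding step to take the true start token, are assumptions, not confirmed by this excerpt:

```python
def gen_schedule_data(reader):
    # Wrap a reader so each instance also carries per-step flags sampled from
    # the schedule generator: 0 -> feed the true word, 1 -> feed the
    # previously generated word (matching the multiplex input order).
    def data_reader():
        for src_ids, trg_ids, trg_ids_next in reader():
            yield src_ids, trg_ids, trg_ids_next, \
                [0] + schedule_generator.processBatch(len(trg_ids) - 1)
    return data_reader
```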
Collaborator:

How were the two values 0.75 and 1000000 chosen? Please explain this in the README; otherwise it is hard for users to figure out where these settings come from.

wwhu (author):

As mentioned above, users need to tune the hyperparameters. I will replace these two values after tuning and note that they are the tuned results.

```python
indexes = (numbers >= rate).astype('int32').tolist()
self.data_processed_ += batch_size
return indexes
```
Collaborator:

Pasting a block of code like this is no different from reading the source directly. Please explain how it is used and how the initial parameters should be set.

```python
self.data_processed_ += batch_size
return indexes
```
The `__init__` method defines several different decay probabilities, and the `processBatch` method samples according to this probability, finally determining whether the true element or the generated element is used during decoding.
Collaborator:

Pasting a block of code and appending a sentence like this does not explain anything effectively; it is no different from reading the code directly, and a reader finishes the sentence full of questions.

  1. "The __init__ method defines several..." --> defines several what? How does one choose among them? How are the parameters set? Please tie this in effectively with the introduction above; the reference is unclear.

  2. "The processBatch method samples according to this probability" --> does "this probability" refer to what __init__ defines in the previous sentence? __init__ accepts hyperparameters; how does the sampling probability change?

  3. "finally determining whether the true element or the generated element is used during decoding" --> determined how?
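For reference, the decision the reviewer asks about reduces to comparing uniform random draws against the current true-token rate; a minimal standalone sketch consistent with the excerpt above (the function name is illustrative):

```python
import numpy as np

def sample_true_token_flags(rate, batch_size):
    # Each flag is 0 (use the true word) with probability `rate`, and 1
    # (use the previously generated word) otherwise, matching the index
    # order of the multiplex inputs.
    numbers = np.random.rand(batch_size)
    return (numbers >= rate).astype('int32').tolist()

print(sample_true_token_flags(0.75, 5))  # e.g. [0, 1, 0, 0, 0]
```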

The `__init__` method defines several different decay probabilities, and the `processBatch` method samples according to this probability, finally determining whether the true element or the generated element is used during decoding.


Here the data reader is wrapped: the `true_token_flag` sampled from `RandomScheduleGenerator` is added as another data input, controlling which elements are used for decoding.
Collaborator:

  1. "Here the data reader is wrapped" --> please expand this by a couple of sentences. Why is the reader wrapped here? Don't make the reader work it out for themselves...
  2. "controlling which elements are used for decoding" --> no "decoding" process is involved here; generating the entire sequence is what is usually called decoding.

@lcy-seso (Collaborator) commented:
Please submit the PR for the v2 interface of the multiplex layer; otherwise this example will not run after being merged.

@luotao1 (Contributor) left a comment:

The README needs major revisions:

  1. Shouldn't random_schedule_generator.py be placed under the Paddle repo, so that other users can use it too? @lcy-seso
  2. The algorithm introduction should be easier to follow; the figures from the paper could be redone in Chinese and the algorithm described alongside them.
  3. The model implementation section currently pastes large blocks of code; all of those can be removed, keeping only the key textual descriptions.

```
@@ -1 +1,164 @@
TBD
# Scheduled Sampling
```
Contributor:

I suggest changing the title to Chinese; the same goes for all other occurrences of "Scheduled Sampling".

wwhu (author):

There doesn't seem to be a standard Chinese translation for this yet.

Collaborator:

This example doesn't need a standard Chinese translation; English is fine. I haven't come across a widely accepted Chinese translation yet either.


## Overview
The training objective of a sequence generation task is to maximize the probability of the target sequence given the source input. During training, the model takes the true elements of the target sequence as the input to each decoding step and maximizes the probability of the next element. During generation, the element decoded at the previous step is used as the current input to generate the next element. Evidently, the probability distributions of the decoder's input data in the training and generation phases are not the same. If wrong elements are generated early in the sequence, the later input states are affected, and this error keeps accumulating as generation proceeds.
Scheduled Sampling is a method for resolving this mismatch between the input data distributions at training and generation time. Early in training, the method mainly uses true elements as the decoder input, quickly guiding the model from its randomly initialized state to a reasonable state. As training proceeds, it gradually uses more and more generated elements as the decoder input, resolving the data distribution mismatch.
Contributor:

Add a blank line between lines 4 and 5; otherwise everything runs together.

# Scheduled Sampling

## Overview
The training objective of a sequence generation task is to maximize the probability of the target sequence given the source input. During training, the model takes the true elements of the target sequence as the input to each decoding step and maximizes the probability of the next element. During generation, the element decoded at the previous step is used as the current input to generate the next element. Evidently, the probability distributions of the decoder's input data in the training and generation phases are not the same. If wrong elements are generated early in the sequence, the later input states are affected, and this error keeps accumulating as generation proceeds.
Contributor:

  1. If the training objective is stated, the generation objective should be stated too; alternatively, neither needs to be stated. I suggest removing both here and only discussing how the data distributions differ between training and generation.
  2. Is "if wrong elements are generated early in the sequence, the later input states are affected, and this error keeps accumulating as generation proceeds" the reason for introducing Scheduled Sampling? If not, it can be removed.


## Overview
The training objective of a sequence generation task is to maximize the probability of the target sequence given the source input. During training, the model takes the true elements of the target sequence as the input to each decoding step and maximizes the probability of the next element. During generation, the element decoded at the previous step is used as the current input to generate the next element. Evidently, the probability distributions of the decoder's input data in the training and generation phases are not the same. If wrong elements are generated early in the sequence, the later input states are affected, and this error keeps accumulating as generation proceeds.
Scheduled Sampling is a method for resolving this mismatch between the input data distributions at training and generation time. Early in training, the method mainly uses true elements as the decoder input, quickly guiding the model from its randomly initialized state to a reasonable state. As training proceeds, it gradually uses more and more generated elements as the decoder input, resolving the data distribution mismatch.
Contributor:

  1. "Early in training, the method mainly uses true elements as the decoder input": the true elements should be the true elements of the target sequence.
  2. 以将 --> 可以将
  3. "As training proceeds, the method ..." (mind the sentence breaks throughout the text)

Scheduled Sampling is a method for resolving this mismatch between the input data distributions at training and generation time. Early in training, the method mainly uses true elements as the decoder input, quickly guiding the model from its randomly initialized state to a reasonable state. As training proceeds, it gradually uses more and more generated elements as the decoder input, resolving the data distribution mismatch.

## Algorithm Introduction
Scheduled Sampling is mainly applied to the training of Sequence to Sequence models; it is not needed in the generation phase.
Contributor:

  1. "It is mainly applied in the training phase of sequence-to-sequence models; the generation phase does not need it."
  2. Change "Sequence to Sequence" to "序列到序列" (sequence-to-sequence); same elsewhere.

The training objective of a sequence generation task is to maximize the probability of the target sequence given the source input. During training, the model takes the true elements of the target sequence as the input to each decoding step and maximizes the probability of the next element. During generation, the element decoded at the previous step is used as the current input to generate the next element. Evidently, the probability distributions of the decoder's input data in the training and generation phases are not the same. If wrong elements are generated early in the sequence, the later input states are affected, and this error keeps accumulating as generation proceeds.
Scheduled Sampling is a method for resolving this mismatch between the input data distributions at training and generation time. Early in training, the method mainly uses true elements as the decoder input, quickly guiding the model from its randomly initialized state to a reasonable state. As training proceeds, it gradually uses more and more generated elements as the decoder input, resolving the data distribution mismatch.

## Algorithm Introduction
Contributor:

The algorithm introduction really needs a figure; with the current style of description, novice users will be quite confused.

@lcy-seso (Collaborator) commented:
If you are not going to finish this PR, please tell me; I will do it myself.

@wwhu (author) commented Jun 15, 2017

I will finish it ASAP. Sorry for the delay. @lcy-seso

@lcy-seso (Collaborator) commented Jun 15, 2017

@wwhu No worries. I think the work is almost finished; after some small modifications we can merge it first and then keep refining it. Thanks for your work.

@wwhu (author) commented Jun 15, 2017

@lcy-seso @luotao1 I have revised the README and fixed a few other small issues according to the comments above.

@lcy-seso (Collaborator) left a comment:

A small bug caused by recent PaddlePaddle updates needs to be fixed.

```
@@ -1 +1,164 @@
TBD
# Scheduled Sampling
```
Collaborator:

This example doesn't need a standard Chinese translation; English is fine. I haven't come across a widely accepted Chinese translation yet either.


```python
    return cost
else:
    trg_embedding = paddle.layer.GeneratedInputV2(
```
Collaborator:

A small issue here: Paddle was recently upgraded, and the V2 suffix on GeneratedInputV2 and StaticInputV2 is no longer needed. Please replace them all with GeneratedInput and StaticInput; otherwise an error will be raised.

wwhu (author):

Fixed.

@lcy-seso (Collaborator) left a comment:

Almost LGTM; I will further refactor and validate this demo.

@lcy-seso lcy-seso merged commit 82e8848 into PaddlePaddle:develop Jun 19, 2017
@wwhu wwhu deleted the ss-dev branch June 19, 2017 08:38
HongyuLi2018 pushed a commit that referenced this pull request Apr 25, 2019
Successfully merging this pull request may close these issues: example configuration for scheduled_sampling.