Add design/api.md #1088
Conversation
```python
for mb, pass_end in rd.read():
    gm.feed(mb)
    ud.update(gm)
```
I think we need to add the following here:

```python
gm.forward()
gm.backward()
gm.update()
```
Oh, I forgot to make this clear: `gm.feed` calls both forward and backward. I have added a note about that.

What does `gm.update` mean, though? Does it update the model parameters? I thought `ud.update` was responsible for that.
Indeed, it should be `ud.update(gm)` rather than `gm.update`; my earlier comment was wrong.
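To make the agreed semantics explicit, here is a minimal sketch of the training loop; `paddle.updater.new()` is an assumed constructor, and the other names follow this design doc rather than the current Paddle implementation:

```python
# Hypothetical sketch of the proposed API.
rd = paddle.data.amazon.product_review.new()
gm = paddle.gradient_machine.new(model)   # model is assumed to be built elsewhere
ud = paddle.updater.new()                 # assumed constructor name

for mb, pass_end in rd.read():
    gm.feed(mb)     # runs forward AND backward; activations/gradients are kept in gm
    ud.update(gm)   # the updater, not gm, applies the parameter update
```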
```python
input_dis = paddle.layer.input(...)
hidden_dis = paddle.layer.fc(input_dis, ...)
output_dis = paddle.layer.softmax(hidden_dis, ...)
```
For GAN, gm_dis and gm_gen update different subsets of the parameters. We need a mechanism to specify this. In the current GAN example, it is achieved by setting is_static according to is_discriminator_training. That is possible there because the configs of gm_gen and gm_dis are actually generated differently.
I understand this requirement. In this design it is handled as follows (though the text may not make it clear):

- Each "part" is described as a "network".
- Each "network" is referred to by its output layer; in the example there are `output_dis` and `output_gen`. This can be implemented by having every layer record all of its input layers, so that given an output layer we can trace every layer in the network.
- The update information of each "network" is maintained in a gradient machine. The example has two networks but three gradient machines: gm_dis corresponds to output_dis, gm_gen corresponds to output_gen, and gm_comp corresponds to the combination of output_gen and output_dis.
- Each network is updated through the updater. The updater's input is a gradient machine, because the layer outputs (activations) and gradients computed by each forward/backward call are stored in the gradient machine, and the gradient machine also knows which network it corresponds to.

In the example below, `ud.update(gm_dis)` uses gm_dis to update its corresponding output_dis network; since no part to update is explicitly specified, the whole network corresponding to gm_dis is updated. In contrast, `ud.update(gm_comp, output_gen)` uses gm_comp to update the `output_gen` sub-network of its corresponding network, where `output_gen` is the explicitly specified part to be updated.
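To make this concrete, here is a rough sketch of one GAN iteration under this design; the constructors and the way `fake` is obtained are assumptions extrapolated from the doc, not a fixed API:

```python
# Hypothetical sketch: three gradient machines over two networks.
gm_dis  = paddle.gradient_machine.new(output_dis)                 # discriminator alone
gm_gen  = paddle.gradient_machine.new(output_gen)                 # generator alone
gm_comp = paddle.gradient_machine.new([output_gen, output_dis])   # gen composed with dis
ud = paddle.updater.new()                                         # assumed constructor name

for mb, pass_end in rd.read():
    fake = ...  # output of the generator network (how it is obtained is elided here)

    # Discriminator step: feed fake and real data, then update the whole dis network.
    gm_dis.feed([fake, fake_label])
    gm_dis.feed([mb, real_label])
    ud.update(gm_dis)              # no sub-network given, so all of gm_dis is updated

    # Generator step: feed through the composed network, update only output_gen.
    gm_comp.feed([mb, real_label])
    ud.update(gm_comp, output_gen)
```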
### Model

For deep learning, a model includes two parts: the topology and parameters. Currently, the concept *model* in Paddle contains only the topology, and parameters are in another concept *gradient machine*. This differs from the intuition and makes it difficult to save/load models. In this design, we should keep both topology and parameters in a *model*.
Here is how TensorFlow does it, just for reference:

- TensorFlow has a concept of a graph (topology). Loading a graph: https://godoc.org/github.com/tensorflow/tensorflow/tensorflow/go#Graph.Import
- Saving and loading weights are Operations on the graph. Save and load in Go: https://gist.github.com/helinwang/7782c6b2815c334c77653fc0e52b6069
I don't quite understand this point: why should the graph (topology) and the weights be separated? Under the notion that "a model is the graph plus the parameters (weights)", it seems more natural to keep both in a single class Model, doesn't it?
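A minimal sketch of the "topology plus parameters in one class" idea; the method names here are illustrative assumptions, not an agreed interface:

```python
# Hypothetical sketch, not the actual Paddle API.
class Model(object):
    def __init__(self, topology, parameters=None):
        self.topology = topology      # layer graph, traceable from the output layer
        self.parameters = parameters  # weights, possibly randomly initialized

    def save(self, path):
        # Serialize the topology and the parameters together into one file.
        ...

    @staticmethod
    def load(path):
        # Restore both parts and return a Model instance.
        ...
```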
```python
rd = paddle.data.amazon.product_review.new()
mt = paddle.metric.new_acc()

for mb, pass_end in rd.read():
```
Maybe we need a way to specify the minibatch size, either in `reader.new(int batch_size)` or `reader.read(int batch_size)`?
Indeed. The minibatch size could be specified via the read function, or when the reader is created, e.g. `rd = paddle.data.amazon.product_review.new(minibatch_size=100)`.
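Putting the two possibilities side by side (both signatures are hypothetical at this point):

```python
# Option A (hypothetical): fix the minibatch size when the reader is created.
rd = paddle.data.amazon.product_review.new(minibatch_size=100)
for mb, pass_end in rd.read():
    gm.feed(mb)

# Option B (hypothetical): pass the minibatch size to each read call.
rd = paddle.data.amazon.product_review.new()
for mb, pass_end in rd.read(minibatch_size=100):
    gm.feed(mb)
```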
```python
fake_label = paddle.input.new(False, ...)
real_label = paddle.input.new(True, ...)
gm_dis.feed([fake, fake_label])
gm_dis.feed([mb, real_label])
```
Maybe this question is naive: will the gradients computed by the second `feed` override the gradients from the first `feed`?
The design assumes the following: the layer outputs produced by the second `feed` call overwrite those produced by the previous `feed` call, and the gradients produced by the second call likewise overwrite those produced by the previous call.
```python
gm_dis.feed([mb, real_label])
ud.update(gm_dis)

gm_comp.feed([mb, real_label])
```
Since the updater only needs to update the gradients of `output_gen`'s predecessors, one possible optimization is to let the gradient machine know this information, so it does not need to save any unrelated activations and gradients. E.g.,

```python
gm_comp.feed([mb, real_label], output_gen)
ud.update(gm_comp, output_gen)
```

Maybe I don't have enough context, but at first glance I feel it might be easier for the gradient machine to be stateless (the current design saves gradients inside the gradient machine). E.g.,

```go
// pseudo code
type gradientMachine struct{ /* ... */ }
type gradient struct{ /* ... */ }
type updater struct{ /* ... */ }

var gm_comp gradientMachine
var ud updater
var g gradient = gm_comp.feed([mb, real_label], output_gen)
ud.update(g)
```
This may be my fault for not explaining it clearly, hence the misunderstanding. The gradient machine `gm_comp` corresponds to the combination of the gen and dis networks. The input of that combination is the input of gen, so the input can only be mb.

Also, the gradient machine is intentionally designed to be stateful, because in the GAN example we have two "networks", dis and gen, but we need to record three sets of activations + gradients: dis, gen, and comp.
```python
ud.update(gm_comp, output_gen)  # updates only the model whose output layer is output_gen.
```

A key point here is that we use the output layer to indicate a model. I think that we can achieve this as long as each layer knows about its predecessors, so that we can trace from the output layer upward to the input layers. Please be aware that we didn't compose two models in the above example code; instead, we only created a gradient machine which covers both `model_gen` and `model_dis`.
Just for reference: if I understand correctly, TensorFlow does it by stating which sub-graph needs updating:

```python
output_gen_min = tf.train.AdamOptimizer(1e-2).minimize(output_gen)
output_gen_min.run(feed_dict={x: input, y_: label})
```

Another use case is that we only want to update some weights inside a subgraph. E.g., we want to update the fc layer weights of `output_gen` but not the weights of `hidden_gen` (a predecessor of `output_gen`). (I think people call it fine-tuning.) TensorFlow allows explicitly stating the weights that need updating:

```python
fine_tune_step = tf.train.AdamOptimizer(1e-2).minimize(cross_entropy, var_list=[weights_of_output_gen])
fine_tune_step.run(feed_dict={x: input, y_: label})
```
Understood. This design has a similar notion: `ud.update(gm_dis)` uses gm_dis to update its corresponding output_dis network; since no part to update is explicitly specified, the whole network corresponding to gm_dis is updated. In contrast, `ud.update(gm_comp, output_gen)` uses gm_comp to update the `output_gen` sub-network of its corresponding network, where `output_gen` is the explicitly specified part to be updated.
1. *updater*, which encapsulates the updating algorithm.

It seems that *cost function* is a property of *gradient machine*?
The cost function should be a property of network topology.
The cost function is independent of the model and depends only on the training method, so it should not be part of the network.

For example, a generative model can be trained with the naive "output equals input" criterion, or with the GAN criterion: the costs differ, but the network is the same. Similarly, a sequence-to-sequence model can use minimum error as the cost, or CTC as the cost.
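Under that view, the cost would be supplied when constructing the gradient machine rather than baked into the network. A hedged sketch (the `cost=` argument and `paddle.cost.mse` are assumptions for illustration, not part of the current proposal):

```python
# Hypothetical sketch: the same generative network trained under two different costs.
output_gen = paddle.layer.fc(hidden_gen, ...)   # topology only, no cost attached

# Plain reconstruction criterion ("output equals input"):
gm_recon = paddle.gradient_machine.new(output_gen, cost=paddle.cost.mse())

# GAN criterion: the cost comes from composing with the discriminator instead.
gm_gan = paddle.gradient_machine.new([output_gen, output_dis])
```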
1. *Data*

Models are trained using data sets. We hope to provide a set of utility data sets encapsulated in Python packages like `paddle.data.amazon.product_review` and `paddle.data.wikipedia.articles`. A reasonable idea might be that in each of these packages, we provide a `new` function that returns a reader object or a Python iterator. And the *reader* has a read method `read`, which, once called, returns a minibatch and a flag indicating if it reaches the end of a data set. For online learning, the flag would always be False. Therefore, a training loop might look like:
Returning a standard Python iterator is a good idea. There is no need to provide a read method. The end-of-data-set flag is not necessary either, because a standard Python iterator throws a StopIteration exception.
The API design should be language-agnostic. It should use features available in every language rather than rely specifically on Python syntax.
A Python iterator is just a calling convention for a class. It is essentially an object that implements the `next` method. Calling `next` returns one batch of data; when the iterator is exhausted, it throws a StopIteration exception.

This is similar to the reader concept: the `next` function corresponds to the `read` function. A Python iterator is as natural as the List interface in Java. If what we are wrapping is the *Python* API, then, without adding to the user's cognitive burden, the interface should follow *Python* conventions as much as possible.
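Both views can coexist: a language-neutral reader exposes `read`, and the Python binding wraps it as an iterator. A small sketch, assuming a hypothetical low-level `read` that returns `(minibatch, pass_end)`:

```python
class IterReader(object):
    """Hypothetical wrapper that exposes a read()-style reader as a Python iterator."""

    def __init__(self, raw_reader):
        self.raw_reader = raw_reader   # language-neutral reader: read() -> (minibatch, pass_end)
        self.exhausted = False

    def __iter__(self):
        return self

    def __next__(self):
        if self.exhausted:
            raise StopIteration
        mb, pass_end = self.raw_reader.read()
        if pass_end:
            self.exhausted = True      # yield the last minibatch, stop on the next call
        return mb

    next = __next__                    # Python 2 compatibility

# Usage: a plain for-loop, no end-of-data flag needed.
# for mb in IterReader(rd):
#     gm.feed(mb)
```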
### A Simple Network

```python
gm = paddle.gradient_machine.new(model) # gm uses default cost function.
```
There is one thing we didn't discuss yet: how does the user describe a neural network?

The most convenient way to implement this is to use a Python function for a network. For example:

```python
@network(input={'x': dense_vector(123), 'label': integer_value(10)})
def sample_network(x, label):
    hidden = fc_layer(input=x, size=100)
    prediction = fc_layer(input=hidden, size=label.size, act=SoftmaxActivation())
    return classification_cost(input=prediction, label=label)

# Create the model.
model = sample_network()
model.randomParams()
model.save/load() ...
```

Another way to define the neural network topology is to pass the final value to a model creator. For example:

```python
x = data_layer(type=dense_vector(123))
hidden = fc_layer(input=x, size=100)
...
loss = classification_cost(input=prediction, label=label)
model = paddle.ModelCreator(loss)
...
```

It is hard to implement the latter in Paddle.
This was not emphasized in the main text, but I did stress it in my replies to Xu and Helin:

- There is no *model* concept anymore; it is equivalent to *network*. So there should be neither a model nor a model-creator concept.
- Each network is referred to by its output layer. See Add design/api.md #1088 (comment).

Also, I think the API design should preferably not be described in Python, since it is easy to end up depending on Python-specific syntax, such as a decorator like `@network`. The API should be describable just as clearly in "C with objects", so that it can support various client languages.
Agreed; the decorator causes users a great deal of confusion.

The current points of disagreement include:

1. Does the neural network model include the loss function? One side holds that the model does not include the loss function; the other side holds that the model should include it.

2. How should a Paddle model configuration be defined?

```python
def network(pixel):
    hidden = fc_layer(input=pixel, size=200)
    pred = fc_layer(input=hidden, size=10, act=SoftmaxActivation())
    return pred

model = create_model(network, input={'pixel': dense_vector(784)})
```

versus

```python
pixel = data_layer(name='pixel', type=dense_vector(784))
hidden = fc_layer(input=pixel, size=200)
pred = fc_layer(input=hidden, size=10, act=SoftmaxActivation())  # pred stores the whole network topology
model = create_model(pred)
```

The problem is that the second style is very hard to implement, because it would basically require rewriting Paddle's current network parsing.

3. Does the network representation provide enough flexibility to describe problems like GAN? The disagreement is whether a network like GAN needs flexibility when *describing* the network or when *training* it. The basic GAN issue is choosing which parameters to update at update time. That is really the updater's `update` function, which would need a predicate specified at update time; it seems unrelated to the network description.
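If selecting the parameters to update really is the updater's job, the `update` call could take such a predicate; the following is only a sketch of that idea, with the `predicate` keyword and the `layer.name` attribute assumed for illustration:

```python
# Hypothetical: restrict the update to the layers chosen by a predicate.
def only_generator(layer):
    # Assumes each layer exposes a name; update only layers that belong to the generator.
    return layer.name.startswith('gen_')

ud.update(gm_comp, predicate=only_generator)
```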