
Add design/api.md #1088

Closed · wants to merge 2 commits into from
Conversation

wangkuiyi (Collaborator)

No description provided.


for mb, pass_end in rd.read():
    gm.feed(mb)
    ud.update(gm)
@jacquesqiao (Member) commented on Jan 6, 2017:

I think we need to add the following here:

gm.forward()
gm.backward()
gm.update()

@wangkuiyi (Collaborator, Author) commented on Jan 6, 2017:

Oh, I forgot to make this clear: gm.feed calls forward and backward. I have added a note about it.

What does gm.update mean? Does it update the model parameters? I thought ud.update was supposed to do that.

Reply (Member):

You are right, it should be ud.update(gm) rather than gm.update. My mistake.

@wangkuiyi requested a review from emailweixu, January 6, 2017 19:38
input_dis = paddle.layer.input(...)
hidden_dis = paddle.layer.fc(input_dis, ...)
output_dis = paddle.layer.softmax(hidden_dis, ...)

Comment (Collaborator):

For GAN, gm_dis and gm_gen update different subsets of parameters. We need a mechanism to specify this. In the current GAN example, this is achieved by setting is_static according to is_discriminator_training; that is possible there because the configs of gm_gen and gm_dis are actually generated separately.

@wangkuiyi (Collaborator, Author) commented on Jan 8, 2017:

I understand the requirement. In this design it is handled as follows (though the text may not have made it clear):

  1. Each "part" is described as a "network".

  2. Each "network" is referred to by its output layer.

     For example, the text has output_dis and output_gen. This can be implemented by recording, in every layer, all of its input layers; given an output layer, we can then trace every layer of the network.

  3. The update information of each "network" is kept in a gradient machine.

     In the example there are two networks but three gradient machines: gm_dis corresponds to output_dis, gm_gen to output_gen, and gm_comp to the composition of output_gen and output_dis.

  4. Updating a network is done through the updater.

     The updater takes a gradient machine as input, because the layer outputs (activations) and gradients computed by each forward/backward call are stored in the gradient machine, and through the gradient machine we can also find the information of its corresponding network.

     In the example below,

     ud.update(gm_dis)
     

     uses gm_dis to update its corresponding output_dis network; since no part to update is explicitly specified, the whole network corresponding to gm_dis is updated, whereas

     ud.update(gm_comp, output_gen)
     

     uses gm_comp to update the output_gen sub-network of its corresponding network; here output_gen is the explicitly specified part to be updated.
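Putting these four points together, a rough end-to-end sketch; the gradient_machine constructors and the `fake` accessor below are assumptions, while the feed/update calls are the ones from the example:

```python
# Sketch only; constructors are assumed, based on the rule that a network is
# referred to by its output layer.
gm_dis  = paddle.gradient_machine.new(output_dis)               # the dis network
gm_comp = paddle.gradient_machine.new(output_dis, output_gen)   # assumed form of the gen+dis composition

for mb, pass_end in rd.read():
    fake = ...                         # generator output for mb; the accessor is not specified in this thread
    gm_dis.feed([fake, fake_label])    # feed runs forward and backward
    gm_dis.feed([mb, real_label])
    ud.update(gm_dis)                  # no sub-network given, so all of dis is updated

    gm_comp.feed([mb, real_label])
    ud.update(gm_comp, output_gen)     # update only the output_gen sub-network
```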


### Model

For deep learning, a model includes two parts: the topology and parameters. Currently, the concept *model* in Paddle contains only the topology, and parameters are in another concept *gradient machine*. This differs from the intuition and makes it difficult to save/load models. In this design, we should keep both topology and parameters in a *model*.
@helinwang (Contributor) commented on Jan 6, 2017:

Here is how TensorFlow does it, just for reference:
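For instance, in TF 1.x the graph definition and the variable values can be exported separately (an illustrative sketch only, not necessarily what the original reference showed):

```python
import tensorflow as tf

w = tf.Variable(tf.zeros([784, 10]), name="w")   # part of the graph definition
saver = tf.train.Saver()                          # handles parameter values only

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    tf.train.write_graph(sess.graph_def, "/tmp", "graph.pbtxt")  # topology goes to one file
    saver.save(sess, "/tmp/model.ckpt")                          # parameters go to a checkpoint
```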

@wangkuiyi (Collaborator, Author) commented on Jan 8, 2017:

I don't quite understand this point: why should the graph (topology) and the weights be kept separate? Under the notion that a model is the graph plus the parameters (weights), it seems more natural to keep both in a single Model class, doesn't it?

rd = paddle.data.amazon.product_review.new()
mt = paddle.metric.new_acc()

for mb, pass_end in rd.read():
Comment (Contributor):

Maybe we need a way to specify the minibatch size, either in reader.new(int batch_size) or reader.read(int batch_size)?

Reply (Collaborator, Author):

Indeed. The minibatch size could be specified via the read function, or when the reader is created, e.g.

rd = paddle.data.amazon.product_review.new(minibatch_size=100)

fake_label = paddle.input.new(False, ...)
real_label = paddle.input.new(True, ...)
gm_dis.feed([fake, fake_label])
gm_dis.feed([mb, real_label])
Comment (Contributor):

Maybe this question is naive: will the gradients computed by the second feed override the gradients from the first feed?

Reply (Collaborator, Author):

The design assumes the following: the layer outputs produced by the second feed call overwrite those produced by the previous feed call, and likewise the gradients produced by the second call overwrite those from the previous one.
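Annotated against the snippet above (assuming the two feed calls run back to back as shown):

```python
gm_dis.feed([fake, fake_label])   # activations and gradients from this call are overwritten...
gm_dis.feed([mb, real_label])     # ...by this call; only this call's results remain in gm_dis afterwards
```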

gm_dis.feed([mb, real_label])
ud.update(gm_dis)

gm_comp.feed([mb, real_label])
Comment (Contributor):

Since the updater only needs to update the gradients of output_gen's predecessors, one possible optimization here is to let the gradient machine know this, so it does not need to save any unrelated activations or gradients. E.g.,

gm_comp.feed([mb, real_label], output_gen)
ud.update(gm_comp, output_gen)

Maybe I don't have enough context, but at first glance I feel it might be easier for the gradient machine to be stateless (the current design saves gradients inside the gradient machine). E.g.,

// pseudo code
type gradientMachine
type gradient
type updater

var gm_comp gradientMachine
var ud updater
var g gradient = gm_comp.feed([mb, real_label], output_gen)
ud.update(g)

@wangkuiyi (Collaborator, Author) commented on Jan 8, 2017:

This may be my failure to explain clearly, hence the confusion. The gradient machine gm_comp corresponds to the composition of the gen and dis networks. The input of that composition is the input of gen, so its input can only be mb.

Besides, the gradient machine is intentionally designed to be stateful: in the GAN example we have two "networks", dis and gen, but we need to keep three sets of activations and gradients: dis, gen, and comp.

ud.update(gm_comp, output_gen) # updates only the model whose output layer is output_gen.
```

A key point here is that we use the output layer to indicate a model. I think we can achieve this as long as each layer knows its predecessors, so that we can trace from the output layer upward to the input layers. Please be aware that we didn't compose two models in the above example code; instead, we only created a gradient machine that covers both `model_gen` and `model_dis`.
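A minimal sketch of the tracing idea (class and field names here are illustrative, not part of the proposed API): each layer records its input layers, so the whole network is reachable from its output layer.

```python
class Layer:
    def __init__(self, inputs=()):
        self.inputs = list(inputs)   # predecessor layers

def collect_layers(output):
    """Return every layer reachable upward from `output`."""
    seen, stack = set(), [output]
    while stack:
        layer = stack.pop()
        if layer not in seen:
            seen.add(layer)
            stack.extend(layer.inputs)
    return seen
```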
@helinwang (Contributor) commented on Jan 6, 2017:

Just for reference: if I understand correctly, TensorFlow does it by stating which sub-graph needs updating:

output_gen_min = tf.train.AdamOptimizer(1e-2).minimize(output_gen)
output_gen_min.run(feed_dict={x: input, y_: label})

Another use case is when we only want to update some weights inside a subgraph, e.g. update the fc layer weights of output_gen but not the weights of hidden_gen (a predecessor of output_gen). (I think people call it fine-tuning.)

TensorFlow allows explicitly stating the weights that need to be updated:

fine_tune_step = tf.train.AdamOptimizer(1e-2).minimize(cross_entropy, var_list=[weights_of_output_gen])
fine_tune_step.run(feed_dict={x: input, y_: label})

Reply (Collaborator, Author):

Understood. This design has a similar mechanism:

ud.update(gm_dis)

uses gm_dis to update its corresponding output_dis network; since no part to update is explicitly specified, the whole network corresponding to gm_dis is updated, whereas

ud.update(gm_comp, output_gen)

uses gm_comp to update the output_gen sub-network of its corresponding network; here output_gen is the explicitly specified part to be updated.


1. *updater*, which encapsulates the updating algorithm.

It seems that *cost function* is a property of *gradient machine*?
Comment (Collaborator):

The cost function should be a property of network topology.

@wangkuiyi (Collaborator, Author) commented on Jan 9, 2017:

The cost function is independent of the model; it is only related to the training method, so it should not be part of the network.

For example, a generative model can be trained with the naive "output equals input" criterion or with the GAN criterion: the costs differ but the network is the same. Similarly, a sequence-to-sequence model can use minimum error as the cost, or CTC as the cost.
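For illustration, the same topology could then be paired with different costs when the gradient machine is created (the `cost=` argument and the `paddle.cost.*` names below are assumptions, not part of the design text):

```python
# Hypothetical: one topology, two training criteria.
gm_plain = paddle.gradient_machine.new(output_gen, cost=paddle.cost.mse())            # "output equals input"
gm_gan   = paddle.gradient_machine.new(output_gen, cost=paddle.cost.cross_entropy())  # GAN-style criterion
```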


1. *Data*

Models are trained using data sets. We hope to provide a set of utility data sets encapsulated in Python packages like `paddle.data.amazon.product_review` and `paddle.data.wikipedia.articles`. A reasonable idea might be that each of these packages provides a `new` function that returns a reader object or a Python iterator, and the *reader* has a `read` method which, once called, returns a minibatch and a flag indicating whether it has reached the end of the data set. For online learning, the flag would always be False. Therefore, a training loop might look like:
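For instance, using the reader and gradient machine from the snippets above (a sketch; `gm` and `ud` are the gradient machine and updater created earlier):

```python
rd = paddle.data.amazon.product_review.new()
mt = paddle.metric.new_acc()

for mb, pass_end in rd.read():
    gm.feed(mb)      # runs forward and backward
    ud.update(gm)
```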
Comment (Collaborator):

Returning a standard Python iterator is a good idea. There is no need to provide a read method, and the end-of-data-set flag is not necessary either, because a standard Python iterator will throw a StopIteration exception.

Reply (Collaborator, Author):

The API design should be language-independent. It should rely only on features that all languages have, rather than depending on Python-specific syntax.

@reyoung (Collaborator) commented on Jan 9, 2017:

A Python iterator is just a calling convention for a class. It is essentially an object that implements a next method; calling next returns a batch of data, and when the iterator is exhausted it throws a StopIteration exception. This is much the same as the reader concept, with the next function playing the role of the read function. Python's iterator is as natural as the List interface in Java.

If what we are wrapping is a "Python" API, then as long as it does not increase the users' mental burden, the interface should follow "Python" conventions as much as possible.
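For concreteness, a minimal sketch (hypothetical class name) showing that a reader's read method and the Python iterator protocol are the same calling convention:

```python
class MinibatchReader:
    def __init__(self, minibatches):
        self._minibatches = iter(minibatches)

    def read(self):
        # reader-style: return the next minibatch, raise when exhausted
        return next(self._minibatches)

    def __iter__(self):
        return self

    def __next__(self):
        # iterator-style: same behaviour; Python raises StopIteration for us
        return self.read()
```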

### A Simple Network

```python
gm = paddle.gradient_machine.new(model) # gm uses default cost function.
Comment (Collaborator):

There is one thing we haven't discussed yet: how does the user describe a neural network?

The most convenient way to implement this is to use a Python function for a network. For example:

@network(input={'x': dense_vector(123), 'label': integer_value(10)})
def sample_network(x, label):
    hidden = fc_layer(input=x, size=100)
    prediction = fc_layer(input=hidden, size=label.size, act=SoftmaxActivation())
    return classification_cost(input=prediction, label=label)

# Create the model.
model = sample_network()
model.randomParams()

model.save/load() ...

Another way to define the network topology is to pass the final value to a model creator. For example:

x = data_layer(type=dense_vector(123))
hidden = fc_layer(input=x, size=100)
...
loss = classification_cost(input=prediction, label=label)

model = paddle.ModelCreator(loss)
...

It is hard to implement the latter in Paddle.

@wangkuiyi (Collaborator, Author) commented on Jan 9, 2017:

I did not emphasize this in the main text, but I did in my replies to Xu and Helin above:

  1. There is no longer a separate model concept; a model is just a network, so there should be neither a model nor a model-creator concept.
  2. Each network is referred to by its output layer. See Add design/api.md #1088 (comment).

Also, I think it is better not to describe the API design in Python, because that easily introduces dependence on Python-specific syntax, such as the @network decorator. The API should remain clear even when described in "C with objects", so that it can support various client languages.

Comment (Contributor):

Agreed. The decorator causes users a great deal of confusion.

@reyoung (Collaborator) commented on Jan 9, 2017:

The current points of disagreement include:

Does the neural network model include the loss function?

One side holds that the model does not include the loss function, because:

  • The loss function is not used at prediction time. For a classification problem, for example, one may train with cross-entropy or with another loss such as Huber loss, but prediction does not need to know which loss was used.
  • At training time one can simply write model.fit(output, label, cost="cross-entropy").

The other side holds that the loss function should be part of the neural network model, because:

How should a Paddle model configuration be defined?

  1. As a function:
def network(pixel):
  hidden = fc_layer(input=pixel, size=200)
  pred = fc_layer(input=hidden, size=10, act=SoftmaxActivation())
  return pred

model = create_model(network, input={'pixel': dense_vector(784)})
  2. As the final variable of the network:
pixel = data_layer(name='pixel', type=dense_vector(784))
hidden = fc_layer(input=pixel, size=200)
pred = fc_layer(input=hidden, size=10, act=SoftmaxActivation())  # pred stores the whole network topology

model = create_model(pred)

The problem is that the second form (the variable form) is hard to implement, because it would essentially require rewriting Paddle's current network parsing.

Does the network representation provide enough flexibility to describe problems such as GAN?

The disagreement here is whether a network like GAN needs flexibility at network-description time or at training time. The essential problem in GAN is choosing which parameters to update at update time. That really belongs to the updater's update function, which would need to accept a predicate specifying what to update; it seems unrelated to the network description.
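A hypothetical sketch of that last point (all names below are assumptions, not part of the design): the updater's update call could accept a predicate deciding which parameters to update.

```python
# Hypothetical only: ud.update takes a predicate that decides, per parameter,
# whether it should be updated in this call.
def only_generator_params(param_name):
    return param_name.startswith("gen_")   # assumed naming convention

ud.update(gm_comp, should_update=only_generator_params)
```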

This was referenced Jan 22, 2017
@wangkuiyi closed this Feb 13, 2017
@wangkuiyi deleted the api.md branch February 13, 2017 04:57
wangxicoding pushed a commit to wangxicoding/Paddle that referenced this pull request Dec 9, 2021