CTR demo #57

Merged
merged 36 commits into from
Jun 1, 2017

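Superjomn
Contributor

The data-processing part ran too long, so I moved it into a separate markdown file.

The generate script will be added later.

The docs were written in org mode and then converted to .md, so there are .org files; these can be tucked away in a separate directory.

@lcy-seso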

Collaborator

@lcy-seso lcy-seso left a comment

Some initial comments on formatting.

ctr/README.org Outdated
@@ -0,0 +1,178 @@
#+title: CTR prediction with the Wide & Deep neural model
Collaborator

The markdown will be converted to HTML later, so please delete the org file here.

ctr/README.md Outdated

<a id="org8f6a6fa"></a>

# Citations
Collaborator

  1. Use a second-level heading, `## 参考文献` (References); each article currently keeps only one first-level heading.
  2. List the references as a plain numbered list, without square brackets. Where a reference is cited, use a marker like [1].
  3. Please include links for the papers as well.

ctr/README.md Outdated

<a id="orgab346e7"></a>

# Data and task abstraction
Collaborator

Each article has only one first-level heading; change this one to a second-level heading.

ctr/README.md Outdated


<a id="orgc299c2a"></a>

Collaborator

First-level heading: `# 点击率预估` (Click-through rate prediction); the subsections that follow use second-level, third-level, and deeper headings.

ctr/README.md Outdated
            act=paddle.activation.Relu(),
            name='dnn-fc-%d' % no)
        _input_layer = fc
    return _input_layer
Collaborator

Remove the extra blank lines at lines 172 to 173.

ctr/README.md Outdated

```

<a id="orgb4020a9"></a>
Collaborator

Please remove these markers from the markdown for now; they will be rendered uniformly in the HTML later.

ctr/README.md Outdated

params = paddle.parameters.create(classification_cost)

optimizer = paddle.optimizer.Momentum(momentum=0)
Collaborator

Won't this cause problems without a call to paddle.init()?

ctr/dataset.md Outdated
- `C14-C21` &#x2013; anonymized categorical variables


<a id="orgeaf74d5"></a>
Collaborator

Please remove the HTML markers for now.

ctr/dataset.org Outdated
return res
#+END_SRC


Collaborator

Make the references a separate section with a second-level heading.

@lcy-seso
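Collaborator

lcy-seso commented May 26, 2017

  1. The repository currently stores only markdown, consistent with the other projects; please delete the org files first.
  2. The markdown will later be converted automatically to HTML, Jupyter notebooks, and other formats, so use native markdown syntax wherever possible.
  3. Run pre-commit to format the files; otherwise the Travis CI check will not pass.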

Collaborator

@lcy-seso lcy-seso left a comment

Some suggested changes to the documentation.

ctr/README.md Outdated

The figure below shows the structure of LR and a \(3x2\) NN model:

![img](./images/lr-vs-dnn.jpg)
Collaborator

  1. The figure is not centered.
  2. The figure caption is missing.
  3. Use "_" instead of "-" in image file names, consistent with the other examples in the repo: "lr-vs-dnn.jpg" --> "lr_vs_dnn.jpg"
  4. To stay consistent with the other examples, use the following markup:


Figure 1. ×

ctr/README.md Outdated

![img](./images/lr-vs-dnn.jpg)

The part of LR marked with blue arrows maps directly onto the corresponding structure in the NN; LR and NN clearly have something in common (for example, the weighted sum),
Collaborator

Wouldn't DNN be better than NN here?

ctr/README.md Outdated

### LR vs DNN

The figure below shows the structure of LR and a \(3x2\) NN model:
Collaborator

Wouldn't DNN be more appropriate than NN? The term NN has not appeared earlier in the text.

ctr/README.md Outdated
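The part of LR marked with blue arrows maps directly onto the corresponding structure in the NN; LR and NN clearly have something in common (for example, the weighted sum),
but for the same input dimension the former may have far lower model complexity than the latter (in a sense, the more complex a model is, the more potential it has to learn complex information).

For LR to match the learning capacity of an NN, the input dimension must be increased, that is, the number of features must grow,
Collaborator

NN --> DNN. The text above introduces DNN but never mentions NN, which will confuse readers.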

ctr/README.md Outdated
We can use `click` as the learning target; the concrete task can be set up in several ways:

1. Learn click directly, treating 0 and 1 as a binary classification
2. Learning to rank, specifically pairwise rank (label 1>0) or list rank
Collaborator

list --> listwise

ctr/dataset.md Outdated

### Categorical features

A categorical feature takes one of a finite set of values; in the model, we usually use an embedding table to map each value to a vector of continuous values.
Collaborator

Embedding table --> Embedding

    def __repr__(self):
        return '<CategoryFeatureGenerator %d>' % len(self.dic)
```

Collaborator

Could you add a sentence or two describing how users should use this class to process data?

Contributor Author

done
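
As a rough illustration of the kind of usage note requested here, the sketch below shows how a class like `CategoryFeatureGenerator` might be used. Only the class name, the `dic` attribute, and `__repr__` appear in the excerpt above; the `register`, `gen`, and `size` methods and the reserved `'unk'` slot are assumptions made for illustration, not the PR's actual implementation.

```python
class CategoryFeatureGenerator(object):
    """Maps each distinct categorical value to an integer id (hypothetical sketch)."""

    def __init__(self):
        # Assumption: id 0 is reserved for values never seen during registration.
        self.dic = {'unk': 0}

    def register(self, key):
        """Record a value seen while scanning the dataset."""
        if key not in self.dic:
            self.dic[key] = len(self.dic)

    def size(self):
        """Number of distinct ids, i.e. the dimension of this feature."""
        return len(self.dic)

    def gen(self, key):
        """Return the id for a value, falling back to the unknown id."""
        return self.dic.get(key, self.dic['unk'])

    def __repr__(self):
        return '<CategoryFeatureGenerator %d>' % len(self.dic)


# Pass 1: scan the data and register every value of the field.
device_type = CategoryFeatureGenerator()
for value in ['1', '0', '1', '4']:
    device_type.register(value)

# Pass 2: turn raw values into ids; '5' was never registered, so it maps to 0.
ids = [device_type.gen(v) for v in ['0', '4', '5']]
```

The two-pass shape (register everything first, then generate) is the point of the sketch; the ids it produces could then index an embedding.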


    def size(self):
        return self.max_dim
```
Collaborator

Could you add a sentence or two describing how users should use this class to process data?

Contributor Author

done


    def size(self):
        return self.max_dim
```
Collaborator

Could you add a sentence or two describing how users should use this class to process data?

Contributor Author

done

ctr/dataset.md Outdated

## Feeding the data into PaddlePaddle

Both the Deep and Wide parts take input in the `sparse_binary_vector` format[1]; the relevant features must be concatenated before input, and the model ultimately accepts only 3 inputs,
Collaborator

Please change the reference marker to: [1]
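
To make the concatenation step in the excerpt concrete, here is a minimal sketch. It assumes each feature has already been converted to a list of active indices within its own dimension, and that a sparse binary input is fed as the list of positions whose value is 1. Under those assumptions, concatenating features amounts to shifting each feature's indices by the total size of the features before it. The helper name `concat_sparse_features` is hypothetical and does not come from the PR.

```python
def concat_sparse_features(features):
    """Merge per-feature index lists into one index list.

    features: list of (indices, dim) pairs, where indices are the active
    positions of one feature and dim is that feature's dimension.
    Returns (merged_indices, total_dim).
    """
    merged = []
    offset = 0
    for indices, dim in features:
        for i in indices:
            if not 0 <= i < dim:
                raise ValueError('index %d out of range for dim %d' % (i, dim))
            merged.append(offset + i)
        offset += dim
    return merged, offset


# Three features with dimensions 4, 24 and 100 (made-up numbers).
record = [([2], 4), ([13], 24), ([7, 42], 100)]
indices, total_dim = concat_sparse_features(record)
```

Here `total_dim` is the dimension the input layer would need to declare for the merged vector.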

Collaborator

@lcy-seso lcy-seso left a comment

A few minor comments on the documentation.

ctr/README.md Outdated

### Model overview

The Wide & Deep Learning Model[3] can be used as a relatively mature model framework,
Collaborator

The citation format is still not quite right. For example, for the 3 here:

\[[3](#参考文献)\]

ctr/README.md Outdated

We use the first approach directly and treat this as a classification task.

We use the dataset from Kaggle's `Click-through rate prediction` task[\[2\]](https://www.kaggle.com/c/avazu-ctr-prediction/data) to demonstrate the model.
Collaborator

The citation format is still not quite right. For example, for the [2] here:

\[[2](#参考文献)\]

ctr/README.md Outdated

## Background

CTR (Click-Through Rate)[\[1\]](https://en.wikipedia.org/wiki/Click-through_rate) denotes the probability that a user clicks a particular link,
Collaborator

The citation format is still not quite right. For example, for the [1] here:

\[[1](#参考文献)\]

ctr/README.md Outdated

<p align="center">
<img src="images/lr_vs_dnn.jpg" width="620" hspace='10'/> <br/>
Figure 1. LR 和DNN模型结构对比 (structure comparison of the LR and DNN models)
Collaborator

Keep the style consistent: add a space both before and after DNN.

ctr/README.md Outdated

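## Data and task abstraction

我们可以将 `click` 作为学习目标,具体任务可以有以下几种方案: (We can use `click` as the learning target; the concrete task can be set up in several ways:)
Collaborator

具体任务可以有以下几种方案: --> 具体的,任务可以有以下几种方案: (that is, reword "the concrete task can be set up in several ways" as "specifically, the task can be set up in several ways")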

    feeding=field_index,
    event_handler=event_handler,
    num_passes=100)
```
Collaborator

  1. Add a section: `## 运行训练和测试` (Running training and testing)
  2. Give a brief, step-by-step description explaining how a user who has cloned this repo should run the scripts in this example, covering for example:
    • Which script to run first to download the data or prepare the environment.
    • Which script launches the training job, and whether any parameters need to be changed.
    • Which script is responsible for reading the data, and which script to modify if users want to feed their own data.

Contributor Author

done

ctr/dataset.md Outdated
2. newid = id % N
3. Use newid as a categorical feature

Although the method above has some probability of collisions, it can handle any number of ID features while retaining reasonable effectiveness[2].
Collaborator

The reference marker still has a problem:

\[[2](#参考文献)\]
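
A minimal sketch of the modulo scheme quoted in the excerpt above (`newid = id % N`). The excerpt begins at step 2, so how the raw ID becomes an integer is not shown; the `zlib.crc32` call below is an assumption chosen only because it is deterministic across runs. The class name `IDFeatureGenerator` and its methods are likewise hypothetical, loosely modeled on the `size()` / `max_dim` excerpt elsewhere in this review.

```python
import zlib


class IDFeatureGenerator(object):
    """Hashes arbitrary ID values into max_dim buckets (hypothetical sketch)."""

    def __init__(self, max_dim):
        self.max_dim = max_dim

    def gen(self, key):
        # Assumed preprocessing: map the raw ID to a non-negative integer.
        # Any hash that is stable across runs would do here.
        raw_id = zlib.crc32(str(key).encode('utf-8')) & 0xffffffff
        # newid = id % N, so newid always falls in [0, N).
        return raw_id % self.max_dim

    def size(self):
        return self.max_dim


# Bucket count and ID value are made up for illustration.
site_id = IDFeatureGenerator(max_dim=10000)
newid = site_id.gen('85f751fd')
```

Two different IDs can land in the same bucket, which is the collision trade-off the excerpt mentions; a larger `max_dim` lowers that probability at the cost of a larger feature dimension.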

ctr/dataset.md Outdated

`CategoryFeatureGenerator` must first scan the dataset to collect the set of items for the category before it can start generating features.

Our experimental dataset[\[3\]](https://www.kaggle.com/c/avazu-ctr-prediction/data) has already been shuffled, so we can scan a certain number of leading records to approximate the full set of category items (equivalent to random sampling),
Collaborator

The marker for reference 3 has a problem.

\[[3](#参考文献)\]
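
The prefix-scanning idea in the excerpt above can be sketched as follows, assuming the data is a CSV file with a header row. The function name `scan_categories`, the column names, and the tiny in-memory sample are all made up for illustration.

```python
import csv
import io
import itertools


def scan_categories(rows, field, max_records):
    """Collect the distinct values of `field` from the first max_records rows."""
    seen = set()
    for row in itertools.islice(rows, max_records):
        seen.add(row[field])
    return seen


# A tiny in-memory stand-in for the shuffled training file.
sample = io.StringIO(
    'id,click,device_type\n'
    '1,0,1\n'
    '2,1,0\n'
    '3,0,1\n'
    '4,0,4\n')
reader = csv.DictReader(sample)
categories = scan_categories(reader, 'device_type', max_records=3)
```

Because only the first three rows are scanned, the value `'4'` is missed. This is the approximation the excerpt accepts, and it is why a generator built this way needs some fallback for values it has never seen.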

Collaborator

@lcy-seso lcy-seso left a comment

Please add a script that downloads the data automatically.

## Running training and testing
Training the model involves the following steps:

1. Download the training data; you can use the data from the CTR competition on Kaggle\[[2](#参考文献)\]
Collaborator

  1. Following sequence_tagging_for_ner under models, add a data folder to this example, and add a data-fetching script to it: https://github.com/PaddlePaddle/models/blob/develop/sequence_tagging_for_ner/data/download.sh
  2. Add a main function to train.py that sets default arguments for the four functions mentioned in the usage.
  3. End result: the user first runs the data download script, then runs train.py, and the training job starts directly.

@lcy-seso
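Collaborator

lcy-seso commented Jun 1, 2017

The Kaggle dataset cannot be downloaded directly by a script. Please revise the README to add a step-by-step walkthrough of how to feed the raw data to the train.py script and launch the training job:

  • What kind of file the raw data is once downloaded.
  • What processing is required (for example, decompression).
  • Add a default main function to train.py so that it can be run directly.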
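
One possible shape for the default entry point requested above, sketched with `argparse`. The flag names, the default path, and the placeholder `train` function are hypothetical; only `num_passes=100` is taken from the training excerpt quoted earlier in this review.

```python
import argparse


def parse_args(argv=None):
    """Parse command-line options, falling back to defaults (all hypothetical)."""
    parser = argparse.ArgumentParser(description='Train the CTR demo model.')
    parser.add_argument('--train_data', default='./data/train.txt',
                        help='path to the unpacked training file')
    parser.add_argument('--batch_size', type=int, default=1000)
    parser.add_argument('--num_passes', type=int, default=100)
    return parser.parse_args(argv)


def train(train_data, batch_size, num_passes):
    """Placeholder standing in for the real training routine in train.py."""
    return 'training on %s (batch_size=%d, num_passes=%d)' % (
        train_data, batch_size, num_passes)


def main(argv=None):
    args = parse_args(argv)
    return train(args.train_data, args.batch_size, args.num_passes)


# With an empty argument list every option takes its default value.
summary = main([])
```

In a real train.py, `main()` would be called under an `if __name__ == '__main__':` guard so that running the script with no arguments starts training with the defaults.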

@reyoung
Collaborator

reyoung commented Jun 1, 2017

It seems a virtualenv issue caused this unit test to fail.

The related issue is here.

Collaborator

@lcy-seso lcy-seso left a comment

LGTM

@lcy-seso lcy-seso merged commit 1af0222 into PaddlePaddle:develop Jun 1, 2017