Commit 7aae0a5

committed
first
0 parents  commit 7aae0a5

3 files changed

+315
-0
lines changed

README.md

Lines changed: 25 additions & 0 deletions
@@ -0,0 +1,25 @@
1+
## News
2+
This repository currently contains a 3-month sample of the raw data. Our 1-year URL data is now available: 2,869,657 candidate pairs. Please visit our [paraphrase website](https://lanwuwei.github.io/language-net/) to download the dataset.
3+
4+
## Paraphrase-dataset
5+
This repository contains the code and data used in the following paper. Please cite it if you use them in your research:
6+
7+
@inproceedings{lan2017continuously,
8+
author = {Lan, Wuwei and Qiu, Siyu and He, Hua and Xu, Wei},
9+
title = {A Continuously Growing Dataset of Sentential Paraphrases},
10+
booktitle = {Proceedings of The 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
11+
year = {2017},
12+
publisher = {Association for Computational Linguistics},
13+
pages = {1235--1245},
14+
location = {Copenhagen, Denmark},
15+
url = {http://aclweb.org/anthology/D17-1127}
16+
}
17+
18+
## A few notes
19+
1. Put your own Twitter keys into config.py and modify line 59 in main.py before running the code.
20+
2. The training and testing files are subsets of the raw data with human annotations. Both files have the same format; each line contains: sentence1 \tab sentence2 \tab (n,6) \tab url
21+
3. Each sentence pair is annotated by 6 Amazon Mechanical Turk workers, where 1 represents paraphrase and 0 represents non-paraphrase. In total, n out of 6 workers judge the pair to be a paraphrase. If n<=2, we treat the pair as a non-paraphrase; if n>=4, we treat it as a paraphrase; if n==3, we discard it.
22+
4. After discarding pairs with n==3, we get 42200 pairs for training and 9324 for testing.
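
The file format in note 2 and the thresholds in note 3 could be applied roughly as follows when reading a training or testing file. This is a minimal sketch, not the repository's own code: the function name `parse_pair` and the regular expression are assumptions made for illustration, and it assumes the fields are separated by literal tab characters and that the third field looks like `(n, 6)`.

```python
import re
from typing import Optional, Tuple

# Assumed shape of the annotation field, e.g. "(4, 6)" or "(4,6)".
LABEL_RE = re.compile(r"\(\s*(\d+)\s*,\s*(\d+)\s*\)")


def parse_pair(line: str) -> Optional[Tuple[str, str, bool]]:
    """Parse one line into (sentence1, sentence2, is_paraphrase).

    Returns None when n == 3, because those pairs are discarded.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        raise ValueError("expected at least 3 tab-separated fields")

    sentence1, sentence2, label = fields[0], fields[1], fields[2]
    match = LABEL_RE.fullmatch(label.strip())
    if match is None:
        raise ValueError(f"unrecognized label field: {label!r}")

    n = int(match.group(1))  # number of workers who voted "paraphrase"
    if n == 3:
        return None          # ambiguous vote: discard the pair
    return sentence1, sentence2, n >= 4  # n<=2 -> False, n>=4 -> True
```

A caller would typically iterate over the file, skip the `None` results, and collect the remaining tuples as labeled examples.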
23+
24+
## License
25+
It is released for non-commercial use under the CC BY-NC-SA 3.0 license. Use of the data must abide by the Twitter Terms of Service and Developer Policy.

index.html

Lines changed: 121 additions & 0 deletions
@@ -0,0 +1,121 @@
1+
2+
<!DOCTYPE html>
3+
<html>
4+
<head>
5+
<title>Homepage for Language Net</title>
6+
<link rel="stylesheet" type="text/css" href="project.css">
7+
8+
<script>
9+
function showHide(name) {
10+
// Toggle the element with the given id between shown and hidden.
11+
if (document.getElementById(name).style.display == 'block') {
12+
document.getElementById(name).style.display='none';
13+
} else {
14+
document.getElementById(name).style.display='block';
15+
}
16+
}
17+
</script>
18+
</head>
19+
20+
<body>
21+
22+
<br>
23+
<center>
24+
<h1 style="color:dodgerblue">Language-Net: The Large Scale Paraphrase Dataset</h1>
25+
</center>
26+
<br>
27+
28+
<h3 style="color: brown">The Corpus</h3>
29+
30+
<ul>
31+
<li>Language-Net is a collection of sentence-level paraphrases from Twitter, built by linking tweets through shared
32+
URLs. This corpus is the largest to date, with 51,524 human-annotated sentence pairs: 42,200 for training and 9,324 for testing. It can grow by 30,000
33+
new sentential paraphrases per month with ∼70% precision. 1-year data is now available: 2,869,657 candidate pairs! <br><br>
34+
The following paper introduces the corpus in detail:<br>
35+
<a class="publink" href="http://www.aclweb.org/anthology/D/D17/D17-1126.pdf">A Continuously Growing Dataset of Sentential Paraphrases</a>
36+
<br/><b><a href="https://lanwuwei.github.io/">Wuwei Lan</a></b>, Siyu Qiu, Hua He and Wei Xu. <cite>EMNLP 2017</cite>.
37+
<br/><a class="button" href="http://www.aclweb.org/anthology/D/D17/D17-1126.pdf">pdf</a> <a class="button" href="http://www.aclweb.org/anthology/D/D17/D17-1126.bib">BibTeX</a> <a class="button" href="https://lanwuwei.github.io/Wuwei_OSU_2017_v2.pdf">slides</a> <a class="button" href="https://lanwuwei.github.io/url-data-poster.pdf">poster</a>
38+
</li>
39+
</ul>
40+
41+
42+
<!-----Examples----->
43+
<a name="Examples"></a>
44+
<h3 style="color:brown">Example Pairs</h3>
45+
<ul>
46+
<table class="newstuff" style="border-collapse: separate;
47+
border-spacing: 0 1em;">
48+
<tr><th>Sentence 1</th> <th>Label</th> <th>Sentence 2</th></tr>
49+
<tr>
50+
<td style="padding:0 15px 0 15px;">Samsung halts production of its Galaxy Note 7 as battery problems linger.</td>
51+
<td style="padding:0 15px 0 15px;">True</td>
52+
<td style="padding:0 15px 0 15px;">#Samsung temporarily suspended production of its Galaxy #Note7 devices following reports</td>
53+
</tr>
54+
<tr>
55+
<td style="padding:0 15px 0 15px;">CO2 levels mark ‘new era’ in the world’s changing climate.</td>
56+
<td style="padding:0 15px 0 15px;">True</td>
57+
<td style="padding:0 15px 0 15px;">CO2 levels haven’t been this high for 3 to 5 million years.</td>
58+
</tr>
59+
<tr>
60+
<td style="padding:0 15px 0 15px;">The 7 biggest changes Obamacare made , and those that may disappear.</td>
61+
<td style="padding:0 15px 0 15px;">False</td>
62+
<td style="padding:0 15px 0 15px;">What a repeal of Obamacare would look like , in plain English.</td>
63+
</tr>
64+
<tr>
65+
<td style="padding:0 15px 0 15px;">Fraugster , a startup that uses AI to detect payment fraud , raises $5M.</td>
66+
<td style="padding:0 15px 0 15px;">False</td>
67+
<td style="padding:0 15px 0 15px;">AI is on the rise and in this case being applied to something worthwhile payment fraud.</td>
68+
</tr>
69+
</table>
70+
</ul>
71+
72+
<!----Published Results----->
73+
<a name="Baseline Results"></a>
74+
<h3 style="color:brown">Baseline Results</h3>
75+
<ul>
76+
<table class="newstuff" style="border-collapse: separate;
77+
border-spacing: 0 1em;">
78+
<tr><th>Publication</th> <th>Model</th> <th>F1</th></tr>
79+
<tr>
80+
<td style="padding:0 20px 0 20px;"><a href="https://www.aclweb.org/anthology/P/P09/P09-1053.pdf">Das et al.'09 </a></td>
81+
<td style="padding:0 20px 0 20px;">Logistic Regression: n-gram overlap features</td>
82+
<td style="padding:0 20px 0 20px;">0.683</td>
83+
</tr>
84+
<tr>
85+
<td style="padding:0 20px 0 20px;"><a href="https://cocoxu.github.io/publications/tacl2014-extracting-paraphrases-from-twitter.pdf">Xu et al.'14 </a></td>
86+
<td style="padding:0 20px 0 20px;">LEX-WMF: logistic regression + weighted matrix factorization</td>
87+
<td style="padding:0 20px 0 20px;">0.693</td>
88+
</tr>
89+
<tr>
90+
<td style="padding:0 20px 0 20px;"><a href="http://www.aclweb.org/anthology/N16-1108">He et al.'16 </a></td>
91+
<td style="padding:0 20px 0 20px;">PWIM: pairwise word interaction model</td>
92+
<td style="padding:0 20px 0 20px;">0.749</td>
93+
</tr>
94+
<tr>
95+
<td style="padding:0 20px 0 20px;"><a href="https://cocoxu.github.io/publications/Wuwei_NAACL_2018.pdf">Lan et al.'18 </a></td>
96+
<td style="padding:0 20px 0 20px;">Subword-PWIM: subword embedding based PWIM with multi-task LM</td>
97+
<td style="padding:0 20px 0 20px;">0.768</td>
98+
</tr>
99+
</table>
100+
</ul>
101+
102+
<!----Download----->
103+
<a name="Download"></a>
104+
<h3 style="color:brown">Download</h3>
105+
<ul>
106+
Please fill out this <a href="https://frozen-ridge-97042.herokuapp.com/">form</a> to request access to the TwitterPPDB corpus and the 1-year candidate pairs. The data is released for non-commercial use under the CC BY-NC-SA 3.0
107+
license. Use of the data must abide by the Twitter Terms of Service and Developer Policy. For any comments or questions, please email <a href="mailto:lan.105@osu.edu">Wuwei Lan</a>.
108+
</ul>
109+
110+
<!----Related Resource----->
111+
<a name="Related Resource"></a>
112+
<h3 style="color:brown">Related Resource</h3>
113+
<ul>
114+
<a href="https://github.com/cocoxu/SemEval-PIT2015">PIT-2015</a>: sentence-level paraphrases from Twitter, based on tweets about the same trending topic.
115+
Please see this <a href="https://github.com/cocoxu/SemEval-PIT2015">website</a> for more information.
116+
</ul>
117+
118+
</body>
119+
120+
</html>
121+

project.css

Lines changed: 169 additions & 0 deletions
@@ -0,0 +1,169 @@
1+
/* BODY AND WRAPPER */
2+
body {
3+
/* background-color: #838F98; */
4+
border-style: hidden;
5+
font-family: Helvetica, sans-serif;
6+
font-weight: 350;
7+
font-size: 16px;
8+
max-width: 950px;
9+
padding: .15in;
10+
word-spacing: normal;
11+
margin: auto;
12+
margin-top: 5px
13+
}
14+
15+
#wrapper {
16+
background-color: white;
17+
/* border-style: solid;
18+
border-width: 1px;
19+
border-radius:5px;
20+
box-shadow: 3px 3px 3px #888888; */
21+
font-family: sans-serif;
22+
max-width: 800px;
23+
padding: .15in;
24+
word-spacing: normal;
25+
margin: auto;
26+
margin-top: 5px
27+
}
28+
29+
/* LINK STYLES */
30+
a {
31+
text-decoration: none;
32+
}
33+
34+
a:link {
35+
color: #BF5700;
36+
}
37+
38+
a:visited {
39+
color: #6E2600;
40+
}
41+
42+
a:hover {
43+
color: #0000AF;
44+
}
45+
46+
a:active {
47+
color: #0000FF;
48+
}
49+
50+
a.email {
51+
display: block;
52+
margin: 10px 15px 10px 15px;
53+
font-size: 18px;
54+
font-weight: 700;
55+
color: #BF5700;
56+
}
57+
58+
a.email:hover {
59+
text-decoration: none;
60+
}
61+
62+
a.menu, a.menu:link, a.menu:visited {
63+
display: block;
64+
padding: 13px 15px 13px 15px;
65+
font-size: 110%;
66+
font-weight: 700;
67+
color: #333333;
68+
}
69+
70+
a.menu:hover, a.menu:active {
71+
color: #BF5700;
72+
}
73+
74+
a.publink {
75+
color: #000000;
76+
font-weight: bold;
77+
}
78+
79+
a.publink:hover {
80+
color: #BF5700;
81+
}
82+
83+
a.syslink {
84+
font-weight: bold;
85+
}
86+
87+
a.button, a.button:link, a.button:visited, a.button:hover, a.button:active {
88+
border-radius: 5px;
89+
background: #BF5700;
90+
color: #FFFFFF;
91+
padding: 2px 4px 2px 4px;
92+
font-size: 11px
93+
}
94+
95+
a.button:hover {
96+
background: #9E4600;
97+
}
98+
99+
/* HEADER */
100+
#name {
101+
float: left;
102+
vertical-align: middle;
103+
margin: 10px 5px;
104+
color: #333333;
105+
}
106+
107+
#header {
108+
border-top: 2px solid #333333;
109+
border-bottom: 2px solid #333333;
110+
overflow: hidden;
111+
margin: 0px 5px 10px 5px;
112+
display: inline-block;
113+
width: 100%;
114+
}
115+
116+
#header h1 {
117+
display: inline;
118+
padding: 10px;
119+
}
120+
121+
#header ul {
122+
margin: 0px;
123+
padding: 0px;
124+
list-style: none;
125+
line-height: normal;
126+
display: inline;
127+
float: right;
128+
}
129+
130+
#header li {
131+
display: inline-block;
132+
}
133+
134+
#picimg{
135+
border-radius:5px;
136+
box-shadow: 3px 3px 3px #888888;
137+
}
138+
139+
/* NORMAL STUFF */
140+
h1 {
141+
padding-right: 50px;
142+
font-size: xx-large;
143+
font-weight: lighter;
144+
}
145+
146+
h2 {
147+
font-family: sans-serif;
148+
font-size: x-large;
149+
font-weight: lighter;
150+
}
151+
152+
li:before {
153+
content: " ";
154+
font-size: large;
155+
color: #9E4600;
156+
}
157+
158+
li {
159+
list-style: none;
160+
margin: 25px 0px;
161+
line-height: 23px;
162+
}
163+
164+
p {
165+
margin: 5px;
166+
padding: 10px;
167+
line-height: 20px;
168+
}
169+
