Adding Sentence Order Prediction #1061
Changes from 165 commits
Please fix the comment on this from the earlier PR.
This seems like a contradiction: if it's unpreprocessed, why do we have a data preparation script?
Also, like we talked about, it should be clear exactly what dataset you get when you run these scripts. Make it clear somewhere (maybe in the scripts readme) whether you're getting a specific dump, or just whatever is most recent.
I've updated it.
Question: Where do we specify the specific dump that we're downloading? Does the NVIDIA code always download the latest dump? If so, that's a problem for reproducibility.
Seems like it downloads from here: https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2. @phu-pmh do you know if that's static or if it's overwritten with the latest dump?
Seems like it does keep changing. Phu is working on a modified script that reads a dump from a particular date.
Yes, it seems the NVIDIA code always downloads the latest version. The dump version we used is https://dumps.wikimedia.org/enwiki/20200301/. We could edit their code to download the same version as ours, but there seem to be copyright issues. According to @pyeres, "NVIDIA files are Apache 2.0 licensed, and (depending on whether we’re redistributing or modifying+redistributing) we’d need to add some notices".
What's your take on it? We can either specify our dump version or edit the script.
@sleepinyourhat
If the script is easy to edit, one option would be to give instructions here for how to edit it to specify a version.
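As a rough illustration of what "specifying a version" could look like, here is a minimal sketch that builds the download URL for either the rolling `latest` dump or a dated one. The function name `enwiki_dump_url` is hypothetical and not part of the NVIDIA script or jiant. The dated filename pattern is an assumption made by analogy with the `latest` URL quoted earlier in this thread.

```python
import re

# Base URL taken from the link quoted in this thread.
WIKIMEDIA_BASE = "https://dumps.wikimedia.org/enwiki"


def enwiki_dump_url(dump_date: str = "latest") -> str:
    """Return the pages-articles URL for a dump date (YYYYMMDD) or "latest".

    Hypothetical helper: the dated filename pattern is assumed to mirror
    the "latest" one (enwiki-<date>-pages-articles.xml.bz2).
    """
    if dump_date != "latest" and not re.fullmatch(r"\d{8}", dump_date):
        raise ValueError(f"expected 'latest' or YYYYMMDD, got {dump_date!r}")
    return f"{WIKIMEDIA_BASE}/{dump_date}/enwiki-{dump_date}-pages-articles.xml.bz2"


# Pinning the date keeps the download reproducible; "latest" does not.
pinned_url = enwiki_dump_url("20200301")
rolling_url = enwiki_dump_url("latest")
```

Instructions in the README could then amount to "replace the hard-coded `latest` URL with the pinned one", assuming the script exposes the URL in one place.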
I'll update this in the MLM branch's preprocessing script, `jiant/scripts/mlm`.
If you're going to give this level of detail, spell out how this works. How can you sample sentences of 2 tokens if there are few/no such sentences in the corpus, for example? Do you ever truncate?
We truncate based on `target_seq_length`, which is defined in `get_target_seq_length()`.
I'm asking for this to be clarified in the comment—I still don't understand the code comment.
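To make the question concrete, here is a sketch of how length sampling and truncation are commonly combined in BERT-style pretraining data code. It is not the code in this PR: `sketch_target_seq_length`, `truncate_pair`, and the `short_seq_prob` parameter are all hypothetical names, and the actual behavior of `get_target_seq_length()` may differ. Under these assumptions, a target of 2 tokens does not require 2-token sentences in the corpus, because longer segments are cut down to fit.

```python
import random


def sketch_target_seq_length(max_seq_length, short_seq_prob=0.1, rng=random):
    """Hypothetical stand-in: usually the max length, occasionally a shorter one."""
    if rng.random() < short_seq_prob:
        # Shorter targets are drawn uniformly from [2, max_seq_length].
        return rng.randint(2, max_seq_length)
    return max_seq_length


def truncate_pair(tokens_a, tokens_b, target_seq_length):
    """Drop tokens from the end of the longer segment until the pair fits."""
    a, b = list(tokens_a), list(tokens_b)  # copy so the inputs are not mutated
    while len(a) + len(b) > target_seq_length:
        longer = a if len(a) >= len(b) else b
        longer.pop()
    return a, b


seg_a, seg_b = truncate_pair(["a", "b", "c"], ["d", "e"], target_seq_length=4)
```

A code comment along these lines (what the target length is, how it is sampled, and which end is truncated) would likely answer the question above.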
What is the purpose of the `[:]` here?
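For context on the question, this is what `[:]` does on a Python list in general. The line under review is not shown in this thread, so the variable names below are purely illustrative. On the right-hand side `[:]` makes a shallow copy, and on the left-hand side it replaces the contents in place.

```python
original = [["a", "b"], ["c"]]

alias = original         # same list object, no copy
shallow = original[:]    # new outer list, same inner lists

shallow.append(["d"])    # does not affect `original`
shallow[0].append("x")   # inner lists are shared, so `original[0]` changes too

target = [1, 2, 3]
same_object = target
target[:] = [9]          # slice assignment mutates the existing list in place
```

If neither the copy nor the in-place update is needed at the reviewed line, the `[:]` can probably be dropped.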
Is this line incomplete?