This repository has been archived by the owner on Oct 31, 2023. It is now read-only.
Hi, very nice repo! I just want to confirm the formatting of the data before it is fed into preprocess.py.
Are the inputs delimited by newlines?
Or can we assume that each input, before it reaches preprocess.txt, is itself an article? To make sure we're talking about the same thing, here is the format I am assuming (see the sketch below).
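A minimal sketch of the input layout I have been producing, assuming one article per line; the file name `corpus.txt` and the sample articles are made up for illustration, and this may well not match what preprocess.py expects:

```python
# Hypothetical input format: one article per newline-delimited line, so
# that each line can be treated as a standalone document.
articles = ["First article text ...", "Second article text ..."]
with open("corpus.txt", "w", encoding="utf-8") as f:
    for article in articles:
        # Collapse internal newlines so each article occupies exactly one line.
        f.write(" ".join(article.split()) + "\n")
```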
I have been trying to replicate the results of this work in other languages, but to no avail. I understand that, because of the random span sampling, the input format should theoretically not make a huge difference, unless of course each paragraph is significantly shorter than the sampled span length.
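For reference, here is roughly how I understand the span sampling to work; this is my own assumed sketch, not necessarily the repo's actual implementation, and `span_len=64` is an invented parameter:

```python
import random

# Assumed behavior: draw a contiguous span of span_len tokens from a
# document. If the document is shorter than span_len, every "span"
# degenerates to the whole document, which is why very short paragraphs
# in my other-language corpora might behave differently.
def sample_span(tokens, span_len=64):
    if len(tokens) <= span_len:
        return tokens  # short paragraph: the span is the whole thing
    start = random.randrange(len(tokens) - span_len + 1)
    return tokens[start:start + span_len]

example = "a short paragraph".split()
print(sample_span(example))  # returns the whole paragraph unchanged
```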
Even though the English Wikipedia corpus is considered a "canonical" dataset, I would appreciate it if you could share the size of the dataset used to replicate your previous result, just to see what could be tweaked. Thanks!