EnWik9 is XML dataset, not raw text #1272

Open
pbelevich opened this issue Apr 2, 2021 · 3 comments

pbelevich commented Apr 2, 2021

According to http://mattmahoney.net/dc/textdata:

The data is UTF-8 encoded XML consisting primarily of English text. enwik9 contains 243,426 article titles,

In [1]: from torchtext.datasets import EnWik9

In [2]: enwik9 = EnWik9()

In [3]: len(enwik9)
Out[3]: 1

In [4]: from torchtext.legacy.datasets import EnWik9

In [5]: enwik9 = EnWik9()

In [6]: len(enwik9)
Out[6]: 133220996
pbelevich changed the title from "EnWik9 is XML dataset, not a raw text" to "EnWik9 is XML dataset, not raw text" on Apr 2, 2021
@zhangguanheng66
Contributor

The new dataset is now an iterator.
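
Since an iterator has no length or random access, slicing such as `enwik9[:30]` has to become lazy consumption. A minimal sketch of that pattern follows; `fake_lines` and `head` are hypothetical names, and the generator is only a stand-in for a line-yielding dataset, not torchtext's API:

```python
from itertools import islice


def fake_lines():
    """Hypothetical stand-in for a dataset that yields raw lines lazily."""
    yield '<mediawiki version="0.3" xml:lang="en">\n'
    yield '  <siteinfo>\n'
    yield '    <sitename>Wikipedia</sitename>\n'
    yield '  </siteinfo>\n'


def head(iterator, n):
    """Return the first n items of an iterator, leaving the rest unconsumed."""
    return list(islice(iterator, n))


lines = fake_lines()
first_two = head(lines, 2)   # consumes only two items
remaining = list(lines)      # the iterator resumes where head() stopped
```

Note that the iterator is stateful: once items are consumed they are gone, so a second pass needs a fresh iterator.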

parmeet closed this as completed on Apr 2, 2021
@pbelevich
Author

That's super cool, but the new dataset doesn't preprocess XML:

In [1]: from torchtext.legacy.datasets import EnWik9

In [2]: enwik9 = EnWik9()

In [3]: enwik9[:30]
Out[3]:
['redirect',
 'applied',
 'ethics',
 'anarchism',
 'originated',
 'as',
 'a',
 'term',
 'of',
 'abuse',
 'first',
 'used',
 'against',
 'early',
 'working',
 'class',
 'radicals',
 'including',
 'the',
 'diggers',
 'of',
 'the',
 'english',
 'revolution',
 'and',
 'the',
 'sans',
 'culottes',
 'of',
 'the']

In [4]: from torchtext.datasets import EnWik9

In [5]: enwik9 = EnWik9()

In [6]: [next(enwik9[0]) for x in range(30)]
Out[6]:
['<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.3/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.3/ http://www.mediawiki.org/xml/export-0.3.xsd" version="0.3" xml:lang="en">\n',
 '  <siteinfo>\n',
 '    <sitename>Wikipedia</sitename>\n',
 '    <base>http://en.wikipedia.org/wiki/Main_Page</base>\n',
 '    <generator>MediaWiki 1.6alpha</generator>\n',
 '    <case>first-letter</case>\n',
 '      <namespaces>\n',
 '      <namespace key="-2">Media</namespace>\n',
 '      <namespace key="-1">Special</namespace>\n',
 '      <namespace key="0" />\n',
 '      <namespace key="1">Talk</namespace>\n',
 '      <namespace key="2">User</namespace>\n',
 '      <namespace key="3">User talk</namespace>\n',
 '      <namespace key="4">Wikipedia</namespace>\n',
 '      <namespace key="5">Wikipedia talk</namespace>\n',
 '      <namespace key="6">Image</namespace>\n',
 '      <namespace key="7">Image talk</namespace>\n',
 '      <namespace key="8">MediaWiki</namespace>\n',
 '      <namespace key="9">MediaWiki talk</namespace>\n',
 '      <namespace key="10">Template</namespace>\n',
 '      <namespace key="11">Template talk</namespace>\n',
 '      <namespace key="12">Help</namespace>\n',
 '      <namespace key="13">Help talk</namespace>\n',
 '      <namespace key="14">Category</namespace>\n',
 '      <namespace key="15">Category talk</namespace>\n',
 '      <namespace key="100">Portal</namespace>\n',
 '      <namespace key="101">Portal talk</namespace>\n',
 '    </namespaces>\n',
 '  </siteinfo>\n',
 '  <page>\n']

Is this expected?

@parmeet
Contributor

parmeet commented May 2, 2021

@pbelevich we have added functionality to process the XML. Please refer to the corresponding PR and docs for example usage.
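
For readers who only want the general idea, here is a rough sketch of what filtering a line iterator over this XML could look like. It is an illustration under stated assumptions, not the implementation from the PR: `filter_xml_lines` and `TAG_RE` are hypothetical names, and the rule of keeping only text inside `<text>` elements is an assumption made for the example.

```python
import re
from html import unescape

# Matches any XML tag, e.g. <page>, </text>, <text xml:space="preserve">.
TAG_RE = re.compile(r"<[^>]*>")


def filter_xml_lines(lines):
    """Yield cleaned article text found inside <text>...</text> elements.

    Hypothetical helper for illustration only: lines outside <text> are
    skipped, tags are removed, and XML entities such as &amp; are unescaped.
    """
    in_text = False
    for line in lines:
        if "<text" in line:
            in_text = True
        if not in_text:
            continue
        # An element ends at </text>, or immediately if it is self-closing.
        closing = "</text>" in line or line.rstrip().endswith("/>")
        cleaned = unescape(TAG_RE.sub("", line)).strip()
        if cleaned:
            yield cleaned
        if closing:
            in_text = False


sample = [
    "  <page>\n",
    "    <title>Anarchism</title>\n",
    '    <text xml:space="preserve">Anarchism originated as a term of abuse\n',
    "first used against early working class radicals.</text>\n",
    "  </page>\n",
]
cleaned_lines = list(filter_xml_lines(sample))
```

Because the helper is a generator, it composes with an iterator-style dataset and never needs the whole file in memory.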
