
Processing XML for enwik9 data #1292

Merged · 19 commits · May 2, 2021
6 changes: 6 additions & 0 deletions docs/source/data_functional.rst
@@ -41,3 +41,9 @@ torchtext.data.functional
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autofunction:: numericalize_tokens_from_iterator


:hidden:`filter_wikipedia_xml_from_iterator`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autofunction:: filter_wikipedia_xml_from_iterator
14 changes: 11 additions & 3 deletions docs/source/datasets.rst
@@ -27,9 +27,6 @@ The following datasets are available:
Text Classification
^^^^^^^^^^^^^^^^^^^

TextClassificationDataset
~~~~~~~~~~~~~~~~~~~~~~~~~

AG_NEWS
~~~~~~~

@@ -126,6 +123,7 @@ CoNLL2000Chunking

.. autofunction:: CoNLL2000Chunking


Question Answer
^^^^^^^^^^^^^^^

@@ -139,3 +137,13 @@ SQuAD 2.0
~~~~~~~~~

.. autofunction:: SQuAD2


Unsupervised Learning
^^^^^^^^^^^^^^^^^^^^^

EnWik9
~~~~~~

.. autofunction:: EnWik9

62 changes: 62 additions & 0 deletions torchtext/data/functional.py
@@ -180,3 +180,65 @@ def numericalize_tokens_from_iterator(vocab, iterator, removed_tokens=None):
else:
yield iter(map(lambda x: vocab[x],
filter(lambda x: x not in removed_tokens, tokens)))


_patterns = [(r'<.*>', ''),
(r'&amp;', '&'),
(r'&lt;', '<'),
(r'&gt;', '>'),
(r'<ref[^<]*<\/ref>', ''),
(r'<[^>]*>', ''),
(r'\[http:[^] ]*', '['),
(r'\|thumb', ''),
(r'\|left', ''),
(r'\|right', ''),
(r'\|\d+px', ''),
(r'\[\[image:[^\[\]]*\|', ''),
(r'\[\[category:([^|\]]*)[^]]*\]\]', r'[[\1]]'),
(r'\[\[[a-z\-]*:[^\]]*\]\]', ''),
(r'\[\[[^\|\]]*\|', '[['),
(r'\{\{[^\}]*\}\}', ''),
(r'\{[^\}]*\}', ''),
(r'\[', ''),
(r'\]', ''),
(r'&[^;]*;', ' '),
(r'A', 'a'), (r'B', 'b'), (r'C', 'c'),
(r'D', 'd'), (r'E', 'e'), (r'F', 'f'),
(r'G', 'g'), (r'H', 'h'), (r'I', 'i'),
(r'J', 'j'), (r'K', 'k'), (r'L', 'l'),
(r'M', 'm'), (r'N', 'n'), (r'O', 'o'),
(r'P', 'p'), (r'Q', 'q'), (r'R', 'r'),
(r'S', 's'), (r'T', 't'), (r'U', 'u'),
(r'V', 'v'), (r'W', 'w'), (r'X', 'x'),
(r'Y', 'y'), (r'Z', 'z'),
(r'0', ' zero '), (r'1', ' one '), (r'2', ' two '),
(r'3', ' three '), (r'4', ' four '), (r'5', ' five '),
(r'6', ' six '), (r'7', ' seven '), (r'8', ' eight '),
(r'9', ' nine '),
(r'[^a-z\n]+', ' '),
(r'\n ', ''),
(r'\s+', ' '),
(r'\n\s*\n', r'\n')
]
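Each entry above is a (regex, replacement) pair. As a hedged sketch of how such pairs could be applied in order (the `custom_replace` helper that actually applies them is not shown in this diff, so the application logic here is an assumption):

```python
import re

def apply_patterns(patterns, lines):
    # Compile each (regex, replacement) pair once, then apply all of
    # them, in order, to every line of the input iterable.
    compiled = [(re.compile(p), r) for p, r in patterns]
    for line in lines:
        for regex, repl in compiled:
            line = regex.sub(repl, line)
        yield line

# Two illustrative pairs from the list above.
demo_patterns = [(r'&amp;', '&'), (r'[^a-z\n]+', ' ')]
print(list(apply_patterns(demo_patterns, ['fish &amp; chips'])))
```

Note that order matters: entity unescaping (e.g. `&amp;`) must run before the catch-all `[^a-z\n]+` pattern collapses the remaining punctuation to spaces.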


def filter_wikipedia_xml_from_iterator(raw_text_iterator):
@cpuhrsch (Contributor), Apr 29, 2021:

I'd suggest we remove _from_iterator; it actually works for iterables, so you can also feed lists of strings into this, a regular iterator, a dataset iterator, or even just a plain open file:

list(filter_wikipedia_xml(open('my_xml_file.txt')))

I'd suggest we verify those input types and include them in the documentation. It's a bit of a standard that our functionals consume iterables and return an iterator, but we haven't documented it in a single place like torchaudio does.
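The iterable-in/iterator-out contract the reviewer describes can be illustrated with a toy functional (the names here are illustrative, not torchtext API):

```python
def demo_functional(iterable):
    # Consumes any iterable lazily and returns a generator (an iterator),
    # the contract the reviewer describes for torchtext functionals.
    for item in iterable:
        yield item.upper()

# The same functional accepts a list, a generator, or an open file object.
print(list(demo_functional(['ab', 'cd'])))
print(list(demo_functional(x for x in ['ef'])))
```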

Contributor Author:

Sounds good! I like the idea. Let me modify the doc and name accordingly.

Let's also have a separate discussion on torchtext standards so that we can include them on the landing page (it should be useful for contributing guidelines as well).

r"""Filter wikipedia xml lines according to https://github.com/facebookresearch/fastText/blob/master/wikifil.pl

Args:
raw_text_iterator: Raw dataset iterator

Examples:
>>> from torchtext.data.functional import filter_wikipedia_xml_from_iterator
>>> from torchtext.datasets import EnWik9
>>> data_iter = EnWik9(split='train')
>>> filter_data_iter = filter_wikipedia_xml_from_iterator(data_iter)
"""

norm_transform = custom_replace(_patterns)
for line in raw_text_iterator:
if '#redirect' in line or '#REDIRECT' in line:
continue
line = list(norm_transform([line]))[0]
if line != ' ' and line != '':
@cpuhrsch (Contributor), Apr 29, 2021:

nit: You could also do line.strip(), which will cover any line comprised only of white-space:

line, = list(norm_transform([line]))
line = line.strip()
if line:
    yield line

assuming the result of norm_transform is always a single entry.

yield line.strip()
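A self-contained sketch mirroring the control flow above (drop redirect lines, normalize, skip lines that normalize to empty). The single `re.sub` stands in for the full `_patterns` list, and `filter_wiki_lines` is an illustrative name, not the torchtext API:

```python
import re

def filter_wiki_lines(lines):
    for line in lines:
        # Skip redirect pages entirely, as the function above does.
        if '#redirect' in line or '#REDIRECT' in line:
            continue
        # One tag-stripping substitution standing in for the full
        # _patterns normalization pipeline.
        line = re.sub(r'<[^>]*>', '', line).strip()
        if line:
            yield line

raw = ['<page>', '#REDIRECT [[Foo]]', '<text>hello world</text>']
print(list(filter_wiki_lines(raw)))
```

Because it is a generator, nothing is read or filtered until the caller starts iterating, which keeps memory flat even on a dump the size of enwik9.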