Processing XML for enwik9 data #1292
Merged
Commits
19 commits
6c3e100 preprocessing xml for enwik9 dataset (parmeet)
afd4d0e Merge branch 'master' of github.com:pytorch/text into enwik9data (parmeet)
8dbf48a Merge branch 'master' of github.com:pytorch/text into enwik9data (parmeet)
2a13569 fixing tests and moving common code-base to utils (parmeet)
3b84202 fixing unicode error (parmeet)
15c26ec figuring out issue regarding OSError: libmkl_intel_lp64.so.1 (parmeet)
0ca7b7a reverting (parmeet)
2318d5c Merge branch 'master' of github.com:pytorch/text into enwik9data (parmeet)
26dae2c creating pure iterator for wiki xml dumps (parmeet)
fda854e fix linter (parmeet)
99a94aa fix linter (parmeet)
306ffe0 creating functional for filtering wiki xml lines (parmeet)
47bb685 removing line (parmeet)
0e400d2 adding doc and example usage (parmeet)
236f32b remove redundancy in code (parmeet)
24d7988 linter issue (parmeet)
bf07a30 minor code changes (parmeet)
e010241 Merge branch 'master' of github.com:pytorch/text into enwik9data (parmeet)
84bdf9f modified functional name and doc (parmeet)
@@ -180,3 +180,65 @@ def numericalize_tokens_from_iterator(vocab, iterator, removed_tokens=None):
        else:
            yield iter(map(lambda x: vocab[x],
                           filter(lambda x: x not in removed_tokens, tokens)))


_patterns = [(r'<.*>', ''),
             (r'&amp;', '&'),
             (r'&lt;', '<'),
             (r'&gt;', '>'),
             (r'<ref[^<]*<\/ref>', ''),
             (r'<[^>]*>', ''),
             (r'\[http:[^] ]*', '['),
             (r'\|thumb', ''),
             (r'\|left', ''),
             (r'\|right', ''),
             (r'\|\d+px', ''),
             (r'\[\[image:[^\[\]]*\|', ''),
             (r'\[\[category:([^|\]]*)[^]]*\]\]', '[[$1]]'),
             (r'\[\[[a-z\-]*:[^\]]*\]\]', ''),
             (r'\[\[[^\|\]]*\|', '[['),
             (r'\{\{[^\}]*\}\}', ''),
             (r'\{[^\}]*\}', ''),
             (r'\[', ''),
             (r'\]', ''),
             (r'&[^;]*;', ' '),
             (r'A', 'a'), (r'B', 'b'), (r'C', 'c'),
             (r'D', 'd'), (r'E', 'e'), (r'F', 'f'),
             (r'G', 'g'), (r'H', 'h'), (r'I', 'i'),
             (r'J', 'j'), (r'K', 'k'), (r'L', 'l'),
             (r'M', 'm'), (r'N', 'n'), (r'O', 'o'),
             (r'P', 'p'), (r'Q', 'q'), (r'R', 'r'),
             (r'S', 's'), (r'T', 't'), (r'U', 'u'),
             (r'V', 'v'), (r'W', 'w'), (r'X', 'x'),
             (r'Y', 'y'), (r'Z', 'z'),
             (r'0', ' zero '), (r'1', ' one '), (r'2', ' two '),
             (r'3', ' three '), (r'4', ' four '), (r'5', ' five '),
             (r'6', ' six '), (r'7', ' seven '), (r'8', ' eight '),
             (r'9', ' nine '),
             (r'[^a-z\n]+', ' '),
             (r'\n ', ''),
             (r'\s+', ' '),
             (r'\n\s*\n', r'\n')
             ]


def filter_wikipedia_xml_from_iterator(raw_text_iterator):
    r"""Filter wikipedia xml lines according to https://github.com/facebookresearch/fastText/blob/master/wikifil.pl

    args:
        raw_text_iterator: Raw dataset iterator

    Examples:
        >>> from torchtext.data.functional import filter_wikipedia_xml_from_iterator
        >>> from torchtext.datasets import EnWik9
        >>> data_iter = EnWik9(split='train')
        >>> filter_data_iter = filter_wikipedia_xml_from_iterator(data_iter)
    """

    norm_transform = custom_replace(_patterns)
    for line in raw_text_iterator:
        if '#redirect' in line or '#REDIRECT' in line:
            continue
        line = list(norm_transform([line]))[0]
        if line != ' ' and line != '':
            yield line.strip()

nit: You could also do line.strip(), which will cover any line composed only of white-space, assuming the result of norm_transform is always a single entry.
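The nit above can be sketched in isolation. This is a minimal sketch, not the code in the PR: `apply_patterns` and `filter_lines` are hypothetical names, `apply_patterns` is only an assumed stand-in for what `custom_replace` does (apply each regex in order to every line), and `SAMPLE_PATTERNS` is a small subset of `_patterns` copied from the diff. The last lines also illustrate a point that may be worth checking when porting Perl substitutions like those in wikifil.pl: Python's `re.sub` writes a group backreference as `\1`, while `$1` in a replacement string is inserted literally.

```python
import re

# A few (pattern, replacement) pairs copied from _patterns in the diff;
# the real list is much longer.
SAMPLE_PATTERNS = [
    (r'<[^>]*>', ''),          # drop XML/HTML tags
    (r'\{\{[^\}]*\}\}', ''),   # drop {{templates}}
    (r'\[', ''),
    (r'\]', ''),
    (r'\s+', ' '),             # collapse runs of whitespace
]


def apply_patterns(lines, patterns):
    """Hypothetical stand-in for custom_replace: apply each regex in order."""
    compiled = [(re.compile(p), repl) for p, repl in patterns]
    for line in lines:
        for regex, repl in compiled:
            line = regex.sub(repl, line)
        yield line


def filter_lines(lines, patterns):
    """Skip redirects, normalize, then drop lines that are empty after strip()."""
    kept = (l for l in lines if '#redirect' not in l and '#REDIRECT' not in l)
    for line in apply_patterns(kept, patterns):
        line = line.strip()
        if line:  # covers '', ' ' and any other whitespace-only result
            yield line


sample = ["<page>", "  ", "The {{cite}} [[Cat]]!", "#REDIRECT [[X]]"]
result = list(filter_lines(sample, SAMPLE_PATTERNS))

# Backreference syntax: the pattern is copied from the diff, the inputs are made up.
CATEGORY = r'\[\[category:([^|\]]*)[^]]*\]\]'
perl_style = re.sub(CATEGORY, '[[$1]]', '[[category:Physics|x]]')
python_style = re.sub(CATEGORY, r'[[\1]]', '[[category:Physics|x]]')
```

Stripping before the emptiness check replaces the two explicit comparisons against `' '` and `''` with a single truthiness test, and it does not depend on how many spaces the normalization leaves behind.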
I'd suggest we remove _from_iterator; it actually works for iterables. So you can also feed lists of strings into this, or a regular iterator, or a dataset iterator, or even just a plain open file.
I'd suggest we verify those input types and include them in the documentation. It's a bit of a standard that our functionals consume iterables and return an iterator, but we haven't documented it in a single place like torchaudio does.
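The "consume iterables, return an iterator" convention described in this comment can be checked with a short sketch. `strip_nonblank` is a hypothetical functional written only for illustration, not part of torchtext; it assumes nothing about its input beyond being iterable over strings, so a list, a plain iterator, and a file-like object (here `io.StringIO` standing in for an open file) all work.

```python
import io
from collections.abc import Iterator


def strip_nonblank(lines):
    """Hypothetical functional: takes any iterable of strings, returns an iterator."""
    for line in lines:
        line = line.strip()
        if line:
            yield line


raw = ["first line\n", "\n", "second line\n"]

from_list = list(strip_nonblank(raw))                         # list of strings
from_iterator = list(strip_nonblank(iter(raw)))               # plain iterator
from_file = list(strip_nonblank(io.StringIO("".join(raw))))   # file-like object

# Because the function is a generator, the return value is always an iterator.
returns_iterator = isinstance(strip_nonblank(raw), Iterator)
```

Since the body only uses a `for` loop over its argument, the name of the function does not need to promise an iterator as input, which is the motivation given for dropping the `_from_iterator` suffix.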
Sounds good! I like the idea. Let me modify the doc and name accordingly.
Let's also have a separate discussion on torchtext standards so that we can include them on the landing page (should be useful for the contributing guidelines as well).