wikipedia: increase REXML entity expansion limit during XML parsing #199
After this, we might face the following exception.
$ cat example/wikipedia.rb
#!/usr/bin/env ruby
require "datasets"
wikipedia = Datasets::Wikipedia.new(language: :en, type: :articles)
wikipedia.each do |page|
  p [
    page.title,
    page.namespace,
    page.id,
    page.restrictions,
    page.redirect,
    page.revision
  ]
end
$ ruby example/wikipedia.rb
Failed to read bzcat input: Errno::EPIPE: Broken pipe
/home/kodama/.rbenv/versions/3.3.2/lib/ruby/gems/3.3.0/gems/rexml-3.3.4/lib/rexml/parsers/streamparser.rb:25:in `parse': Missing end tag for '/mediawiki/page/revision/text' (REXML::ParseException)
Line: -1
Position: -1
Last 80 unconsumed characters:
from /home/kodama/work/ruby/red-datasets/lib/datasets/wikipedia.rb:52:in `block (2 levels) in each'
from /home/kodama/work/ruby/red-datasets/lib/datasets/wikipedia.rb:91:in `with_increased_entity_expansion_text_limit'
from /home/kodama/work/ruby/red-datasets/lib/datasets/wikipedia.rb:51:in `block in each'
from /home/kodama/work/ruby/red-datasets/lib/datasets/dataset.rb:78:in `block (2 levels) in extract_bz2'
from /home/kodama/work/ruby/red-datasets/lib/datasets/dataset.rb:56:in `pipe'
from /home/kodama/work/ruby/red-datasets/lib/datasets/dataset.rb:56:in `block in extract_bz2'
from /home/kodama/work/ruby/red-datasets/lib/datasets/dataset.rb:55:in `pipe'
from /home/kodama/work/ruby/red-datasets/lib/datasets/dataset.rb:55:in `extract_bz2'
from /home/kodama/work/ruby/red-datasets/lib/datasets/wikipedia.rb:76:in `open_data'
from /home/kodama/work/ruby/red-datasets/lib/datasets/wikipedia.rb:48:in `each'
from example/wikipedia.rb:6:in `<main>'
It seems that your download failed.
Using `Datasets::Wikipedia#each` raised `entity expansion has grown too large (RuntimeError)`.

This error occurs because ruby/rexml#187 sets an entity expansion limit in REXML, and `Datasets::Wikipedia#each` exceeds that limit.

In Red Datasets, increasing the entity expansion limit is not a problem because we want to handle large datasets. Therefore, we temporarily increase the limit.

```ruby
require 'datasets'

wikipedia = Datasets::Wikipedia.new
wikipedia.each do |wiki|
  pp wiki
end
```

```console
$ cd red-datasets && bundle && bundle exec ruby wiki
/home/otegami/.rbenv/versions/3.3.3/lib/ruby/gems/3.3.0/gems/rexml-3.3.4/lib/rexml/parsers/baseparser.rb:560:in `block in unnormalize': entity expansion has grown too large (RuntimeError)
```
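For illustration, temporarily increasing the limit could look roughly like the sketch below. This is not the code in this PR: the helper name `with_entity_expansion_text_limit` and the limit value are hypothetical, and the sketch assumes that `REXML::Security.entity_expansion_text_limit` is the limit that needs raising. The `ensure` clause restores the previous value even when parsing raises.

```ruby
require "rexml/document"

# Hypothetical helper (not the one in this PR): run the block with a
# larger REXML entity expansion text limit, then restore the old value.
def with_entity_expansion_text_limit(limit)
  original = REXML::Security.entity_expansion_text_limit
  REXML::Security.entity_expansion_text_limit = limit
  begin
    yield
  ensure
    # Always restore the previous limit, even if the block raised.
    REXML::Security.entity_expansion_text_limit = original
  end
end

# Usage: the value 1_000_000 is an arbitrary example, not a recommendation.
with_entity_expansion_text_limit(1_000_000) do
  REXML::Document.new("<root>&amp;</root>").root.text
end
```

Because the limit is process-wide state, keeping the raised value scoped to a block limits the effect on other REXML users in the same process.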
I'm sorry for the delay in checking. The following error occurred, and we need to determine whether it is a data error on the Wikipedia side or a failed download.
$ bundle exec ruby example/wikipedia.rb
/home/otegami/.rbenv/versions/3.3.5/lib/ruby/3.3.0/json/common.rb:3: warning: ostruct was loaded from the standard library, but will no longer be part of the default gems starting from Ruby 3.5.0.
You can add ostruct to your Gemfile or gemspec to silence this warning.
enwiki-latest-pages-articles.xml.bz2 - 075.2% [ 17.28GB/ 22.97GB] 04:11:52 47KB/sB/ss
bzcat: Data integrity error when decompressing.
Input file = (stdin), output file = (stdout)
It is possible that the compressed file(s) have become corrupted.
You can use the -tvv option to test integrity of such files.
You can use the `bzip2recover' program to attempt to recover
data from undamaged sections of corrupted files.
Failed to read bzcat input: Errno::EPIPE: Broken pipe
/home/otegami/.rbenv/versions/3.3.5/lib/ruby/gems/3.3.0/gems/rexml-3.3.8/lib/rexml/parsers/baseparser.rb:528:in `rescue in pull_event': #<ArgumentError: invalid byte sequence in UTF-8> (REXML::ParseException)
/home/otegami/.rbenv/versions/3.3.5/lib/ruby/gems/3.3.0/gems/rexml-3.3.8/lib/rexml/source.rb:231:in `scan_until'
/home/otegami/.rbenv/versions/3.3.5/lib/ruby/gems/3.3.0/gems/rexml-3.3.8/lib/rexml/source.rb:231:in `read_until'
/home/otegami/.rbenv/versions/3.3.5/lib/ruby/gems/3.3.0/gems/rexml-3.3.8/lib/rexml/parsers/baseparser.rb:507:in `pull_event'
/home/otegami/.rbenv/versions/3.3.5/lib/ruby/gems/3.3.0/gems/rexml-3.3.8/lib/rexml/parsers/baseparser.rb:242:in `pull'
/home/otegami/.rbenv/versions/3.3.5/lib/ruby/gems/3.3.0/gems/rexml-3.3.8/lib/rexml/parsers/streamparser.rb:32:in `parse'
/home/otegami/work/ruby/red-datasets/lib/datasets/wikipedia.rb:52:in `block (2 levels) in each'
/home/otegami/work/ruby/red-datasets/lib/datasets/wikipedia.rb:91:in `with_increased_entity_expansion_text_limit'
/home/otegami/work/ruby/red-datasets/lib/datasets/wikipedia.rb:51:in `block in each'
/home/otegami/work/ruby/red-datasets/lib/datasets/dataset.rb:78:in `block (2 levels) in extract_bz2'
/home/otegami/work/ruby/red-datasets/lib/datasets/dataset.rb:56:in `pipe'
/home/otegami/work/ruby/red-datasets/lib/datasets/dataset.rb:56:in `block in extract_bz2'
/home/otegami/work/ruby/red-datasets/lib/datasets/dataset.rb:55:in `pipe'
/home/otegami/work/ruby/red-datasets/lib/datasets/dataset.rb:55:in `extract_bz2'
/home/otegami/work/ruby/red-datasets/lib/datasets/wikipedia.rb:76:in `open_data'
/home/otegami/work/ruby/red-datasets/lib/datasets/wikipedia.rb:48:in `each'
example/wikipedia.rb:5:in `<main>'
...
Exception parsing
Line: -1
Position: -1
Last 80 unconsumed characters:
{{short description|American rapper}} {{Use mdy dates|date=February 2021}} {{UDP|
from /home/otegami/.rbenv/versions/3.3.5/lib/ruby/gems/3.3.0/gems/rexml-3.3.8/lib/rexml/parsers/baseparser.rb:432:in `pull_event'
from /home/otegami/.rbenv/versions/3.3.5/lib/ruby/gems/3.3.0/gems/rexml-3.3.8/lib/rexml/parsers/baseparser.rb:242:in `pull'
from /home/otegami/.rbenv/versions/3.3.5/lib/ruby/gems/3.3.0/gems/rexml-3.3.8/lib/rexml/parsers/streamparser.rb:32:in `parse'
from /home/otegami/work/ruby/red-datasets/lib/datasets/wikipedia.rb:52:in `block (2 levels) in each'
from /home/otegami/work/ruby/red-datasets/lib/datasets/wikipedia.rb:91:in `with_increased_entity_expansion_text_limit'
from /home/otegami/work/ruby/red-datasets/lib/datasets/wikipedia.rb:51:in `block in each'
from /home/otegami/work/ruby/red-datasets/lib/datasets/dataset.rb:78:in `block (2 levels) in extract_bz2'
from /home/otegami/work/ruby/red-datasets/lib/datasets/dataset.rb:56:in `pipe'
from /home/otegami/work/ruby/red-datasets/lib/datasets/dataset.rb:56:in `block in extract_bz2'
from /home/otegami/work/ruby/red-datasets/lib/datasets/dataset.rb:55:in `pipe'
from /home/otegami/work/ruby/red-datasets/lib/datasets/dataset.rb:55:in `extract_bz2'
from /home/otegami/work/ruby/red-datasets/lib/datasets/wikipedia.rb:76:in `open_data'
from /home/otegami/work/ruby/red-datasets/lib/datasets/wikipedia.rb:48:in `each'
from example/wikipedia.rb:5:in `<main>'
/home/otegami/.rbenv/versions/3.3.5/lib/ruby/gems/3.3.0/gems/rexml-3.3.8/lib/rexml/source.rb:231:in `scan_until': invalid byte sequence in UTF-8 (ArgumentError)
until str = @scanner.scan_until(pattern)
^^^^^^^
from /home/otegami/.rbenv/versions/3.3.5/lib/ruby/gems/3.3.0/gems/rexml-3.3.8/lib/rexml/source.rb:231:in `read_until'
from /home/otegami/.rbenv/versions/3.3.5/lib/ruby/gems/3.3.0/gems/rexml-3.3.8/lib/rexml/parsers/baseparser.rb:507:in `pull_event'
from /home/otegami/.rbenv/versions/3.3.5/lib/ruby/gems/3.3.0/gems/rexml-3.3.8/lib/rexml/parsers/baseparser.rb:242:in `pull'
from /home/otegami/.rbenv/versions/3.3.5/lib/ruby/gems/3.3.0/gems/rexml-3.3.8/lib/rexml/parsers/streamparser.rb:32:in `parse'
from /home/otegami/work/ruby/red-datasets/lib/datasets/wikipedia.rb:52:in `block (2 levels) in each'
from /home/otegami/work/ruby/red-datasets/lib/datasets/wikipedia.rb:91:in `with_increased_entity_expansion_text_limit'
from /home/otegami/work/ruby/red-datasets/lib/datasets/wikipedia.rb:51:in `block in each'
from /home/otegami/work/ruby/red-datasets/lib/datasets/dataset.rb:78:in `block (2 levels) in extract_bz2'
from /home/otegami/work/ruby/red-datasets/lib/datasets/dataset.rb:56:in `pipe'
from /home/otegami/work/ruby/red-datasets/lib/datasets/dataset.rb:56:in `block in extract_bz2'
from /home/otegami/work/ruby/red-datasets/lib/datasets/dataset.rb:55:in `pipe'
from /home/otegami/work/ruby/red-datasets/lib/datasets/dataset.rb:55:in `extract_bz2'
from /home/otegami/work/ruby/red-datasets/lib/datasets/wikipedia.rb:76:in `open_data'
from /home/otegami/work/ruby/red-datasets/lib/datasets/wikipedia.rb:48:in `each'
from example/wikipedia.rb:5:in `<main>'
0c17f62 to 0e49e2b
Could you retry it to check whether the error is reproducible?
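Before retrying, one way to tell a corrupted download apart from a data problem is to test the cached archive with `bzip2 -t`, which the bzcat message above also suggests. A minimal sketch, assuming `bzip2` is on `PATH`; the helper name `bzip2_intact?` is hypothetical, and the archive path is left as a parameter because the cache location is not shown in this thread.

```ruby
require "open3"

# Hypothetical helper (not part of Red Datasets): returns true when
# `bzip2 -t` reports the archive as intact and false otherwise.
# Raises Errno::ENOENT if the bzip2 command is not installed.
def bzip2_intact?(path)
  _output, status = Open3.capture2e("bzip2", "-t", path.to_s)
  status.success?
end
```

If this returns false for the downloaded `enwiki-latest-pages-articles.xml.bz2`, deleting the cached file and downloading again is the likely next step.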