wikipedia: increase REXML entity expansion limit during XML parsing #199

Draft: wants to merge 1 commit into base: master


otegami
Member

@otegami otegami commented Aug 5, 2024

Calling `Datasets::Wikipedia#each` raises `entity expansion has grown too large (RuntimeError)`. This error occurs because ruby/rexml#187 sets an entity expansion limit in REXML, and `Datasets::Wikipedia#each` exceeds that limit.

Raising the limit is acceptable in Red Datasets because we want to handle large datasets. Therefore, this change temporarily increases the limit during XML parsing.

require 'datasets'

wikipedia = Datasets::Wikipedia.new
wikipedia.each do |wiki|
  pp wiki
end
$ cd red-datasets && bundle && bundle exec ruby wiki
/home/otegami/.rbenv/versions/3.3.3/lib/ruby/gems/3.3.0/gems/rexml-3.3.4/lib/rexml/parsers/baseparser.rb:560:in `block in unnormalize': entity expansion has grown too large (RuntimeError)
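
For illustration, a minimal sketch of what "temporarily increasing the limit" could look like. It assumes the limit that matters here is `REXML::Security.entity_expansion_text_limit`; the helper name `with_entity_expansion_text_limit` is hypothetical and this is not necessarily the code in this pull request. The parser should probably be created inside the block, since newer REXML versions may read the limit when the parser is constructed.

```ruby
require "rexml/document"

# Hypothetical helper (not the code from this pull request):
# raise REXML's entity expansion text limit only while the block runs,
# then restore the previous value even if the block raises.
def with_entity_expansion_text_limit(limit)
  original = REXML::Security.entity_expansion_text_limit
  REXML::Security.entity_expansion_text_limit = limit
  yield
ensure
  # original is nil only if reading the current limit itself failed.
  REXML::Security.entity_expansion_text_limit = original if original
end

# Usage: the value passed here is an arbitrary example, not a recommendation.
with_entity_expansion_text_limit(1_000_000) do
  REXML::Document.new("<root>text &amp; more text</root>")
end
```

Restoring the value in `ensure` keeps the relaxed limit scoped to the dataset parsing, so other REXML users in the same process keep the default protection.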

@otegami
Member Author

otegami commented Aug 8, 2024

The `entity expansion has grown too large (RuntimeError)` error might be resolved by using the stream parser.

After that, we might hit the following exception.
I will check whether the original data is the problem.

$ cat example/wikipedia.rb
#!/usr/bin/env ruby

require "datasets"

wikipedia = Datasets::Wikipedia.new(language: :en, type: :articles)
wikipedia.each do |page|
  p [
    page.title,
    page.namespace,
    page.id,
    page.restrictions,
    page.redirect,
    page.revision
  ]
end
$ ruby example/wikipedia.rb
Failed to read bzcat input: Errno::EPIPE: Broken pipe
/home/kodama/.rbenv/versions/3.3.2/lib/ruby/gems/3.3.0/gems/rexml-3.3.4/lib/rexml/parsers/streamparser.rb:25:in `parse': Missing end tag for '/mediawiki/page/revision/text' (REXML::ParseException)
Line: -1
Position: -1
Last 80 unconsumed characters:

	from /home/kodama/work/ruby/red-datasets/lib/datasets/wikipedia.rb:52:in `block (2 levels) in each'
	from /home/kodama/work/ruby/red-datasets/lib/datasets/wikipedia.rb:91:in `with_increased_entity_expansion_text_limit'
	from /home/kodama/work/ruby/red-datasets/lib/datasets/wikipedia.rb:51:in `block in each'
	from /home/kodama/work/ruby/red-datasets/lib/datasets/dataset.rb:78:in `block (2 levels) in extract_bz2'
	from /home/kodama/work/ruby/red-datasets/lib/datasets/dataset.rb:56:in `pipe'
	from /home/kodama/work/ruby/red-datasets/lib/datasets/dataset.rb:56:in `block in extract_bz2'
	from /home/kodama/work/ruby/red-datasets/lib/datasets/dataset.rb:55:in `pipe'
	from /home/kodama/work/ruby/red-datasets/lib/datasets/dataset.rb:55:in `extract_bz2'
	from /home/kodama/work/ruby/red-datasets/lib/datasets/wikipedia.rb:76:in `open_data'
	from /home/kodama/work/ruby/red-datasets/lib/datasets/wikipedia.rb:48:in `each'
	from example/wikipedia.rb:6:in `<main>'

@kou
Member

kou commented Aug 8, 2024

It seems that your download failed.
Could you retry after calling `wikipedia.clear_cache!` before `wikipedia.each`?

@otegami
Member Author

otegami commented Oct 6, 2024

I'm sorry for the delay in checking.

The following error occurred. We need to determine whether it is a data error on the Wikipedia side or a failed download.
However, my home internet connection is slow, so checking may take some time.
If you don't mind, I would appreciate your help in checking it.

bundle exec ruby example/wikipedia.rb
/home/otegami/.rbenv/versions/3.3.5/lib/ruby/3.3.0/json/common.rb:3: warning: ostruct was loaded from the standard library, but will no longer be part of the default gems starting from Ruby 3.5.0.
You can add ostruct to your Gemfile or gemspec to silence this warning.
enwiki-latest-pages-articles.xml.bz2 - 075.2% [ 17.28GB/ 22.97GB] 04:11:52  47KB/sB/ss
bzcat: Data integrity error when decompressing.
	Input file = (stdin), output file = (stdout)

It is possible that the compressed file(s) have become corrupted.
You can use the -tvv option to test integrity of such files.

You can use the `bzip2recover' program to attempt to recover
data from undamaged sections of corrupted files.

Failed to read bzcat input: Errno::EPIPE: Broken pipe
/home/otegami/.rbenv/versions/3.3.5/lib/ruby/gems/3.3.0/gems/rexml-3.3.8/lib/rexml/parsers/baseparser.rb:528:in `rescue in pull_event': #<ArgumentError: invalid byte sequence in UTF-8> (REXML::ParseException)
/home/otegami/.rbenv/versions/3.3.5/lib/ruby/gems/3.3.0/gems/rexml-3.3.8/lib/rexml/source.rb:231:in `scan_until'
/home/otegami/.rbenv/versions/3.3.5/lib/ruby/gems/3.3.0/gems/rexml-3.3.8/lib/rexml/source.rb:231:in `read_until'
/home/otegami/.rbenv/versions/3.3.5/lib/ruby/gems/3.3.0/gems/rexml-3.3.8/lib/rexml/parsers/baseparser.rb:507:in `pull_event'
/home/otegami/.rbenv/versions/3.3.5/lib/ruby/gems/3.3.0/gems/rexml-3.3.8/lib/rexml/parsers/baseparser.rb:242:in `pull'
/home/otegami/.rbenv/versions/3.3.5/lib/ruby/gems/3.3.0/gems/rexml-3.3.8/lib/rexml/parsers/streamparser.rb:32:in `parse'
/home/otegami/work/ruby/red-datasets/lib/datasets/wikipedia.rb:52:in `block (2 levels) in each'
/home/otegami/work/ruby/red-datasets/lib/datasets/wikipedia.rb:91:in `with_increased_entity_expansion_text_limit'
/home/otegami/work/ruby/red-datasets/lib/datasets/wikipedia.rb:51:in `block in each'
/home/otegami/work/ruby/red-datasets/lib/datasets/dataset.rb:78:in `block (2 levels) in extract_bz2'
/home/otegami/work/ruby/red-datasets/lib/datasets/dataset.rb:56:in `pipe'
/home/otegami/work/ruby/red-datasets/lib/datasets/dataset.rb:56:in `block in extract_bz2'
/home/otegami/work/ruby/red-datasets/lib/datasets/dataset.rb:55:in `pipe'
/home/otegami/work/ruby/red-datasets/lib/datasets/dataset.rb:55:in `extract_bz2'
/home/otegami/work/ruby/red-datasets/lib/datasets/wikipedia.rb:76:in `open_data'
/home/otegami/work/ruby/red-datasets/lib/datasets/wikipedia.rb:48:in `each'
example/wikipedia.rb:5:in `<main>'
...
Exception parsing
Line: -1
Position: -1
Last 80 unconsumed characters:
{{short description|American rapper}} {{Use mdy dates|date=February 2021}} {{UDP|
	from /home/otegami/.rbenv/versions/3.3.5/lib/ruby/gems/3.3.0/gems/rexml-3.3.8/lib/rexml/parsers/baseparser.rb:432:in `pull_event'
	from /home/otegami/.rbenv/versions/3.3.5/lib/ruby/gems/3.3.0/gems/rexml-3.3.8/lib/rexml/parsers/baseparser.rb:242:in `pull'
	from /home/otegami/.rbenv/versions/3.3.5/lib/ruby/gems/3.3.0/gems/rexml-3.3.8/lib/rexml/parsers/streamparser.rb:32:in `parse'
	from /home/otegami/work/ruby/red-datasets/lib/datasets/wikipedia.rb:52:in `block (2 levels) in each'
	from /home/otegami/work/ruby/red-datasets/lib/datasets/wikipedia.rb:91:in `with_increased_entity_expansion_text_limit'
	from /home/otegami/work/ruby/red-datasets/lib/datasets/wikipedia.rb:51:in `block in each'
	from /home/otegami/work/ruby/red-datasets/lib/datasets/dataset.rb:78:in `block (2 levels) in extract_bz2'
	from /home/otegami/work/ruby/red-datasets/lib/datasets/dataset.rb:56:in `pipe'
	from /home/otegami/work/ruby/red-datasets/lib/datasets/dataset.rb:56:in `block in extract_bz2'
	from /home/otegami/work/ruby/red-datasets/lib/datasets/dataset.rb:55:in `pipe'
	from /home/otegami/work/ruby/red-datasets/lib/datasets/dataset.rb:55:in `extract_bz2'
	from /home/otegami/work/ruby/red-datasets/lib/datasets/wikipedia.rb:76:in `open_data'
	from /home/otegami/work/ruby/red-datasets/lib/datasets/wikipedia.rb:48:in `each'
	from example/wikipedia.rb:5:in `<main>'
/home/otegami/.rbenv/versions/3.3.5/lib/ruby/gems/3.3.0/gems/rexml-3.3.8/lib/rexml/source.rb:231:in `scan_until': invalid byte sequence in UTF-8 (ArgumentError)

      until str = @scanner.scan_until(pattern)
                                      ^^^^^^^
	from /home/otegami/.rbenv/versions/3.3.5/lib/ruby/gems/3.3.0/gems/rexml-3.3.8/lib/rexml/source.rb:231:in `read_until'
	from /home/otegami/.rbenv/versions/3.3.5/lib/ruby/gems/3.3.0/gems/rexml-3.3.8/lib/rexml/parsers/baseparser.rb:507:in `pull_event'
	from /home/otegami/.rbenv/versions/3.3.5/lib/ruby/gems/3.3.0/gems/rexml-3.3.8/lib/rexml/parsers/baseparser.rb:242:in `pull'
	from /home/otegami/.rbenv/versions/3.3.5/lib/ruby/gems/3.3.0/gems/rexml-3.3.8/lib/rexml/parsers/streamparser.rb:32:in `parse'
	from /home/otegami/work/ruby/red-datasets/lib/datasets/wikipedia.rb:52:in `block (2 levels) in each'
	from /home/otegami/work/ruby/red-datasets/lib/datasets/wikipedia.rb:91:in `with_increased_entity_expansion_text_limit'
	from /home/otegami/work/ruby/red-datasets/lib/datasets/wikipedia.rb:51:in `block in each'
	from /home/otegami/work/ruby/red-datasets/lib/datasets/dataset.rb:78:in `block (2 levels) in extract_bz2'
	from /home/otegami/work/ruby/red-datasets/lib/datasets/dataset.rb:56:in `pipe'
	from /home/otegami/work/ruby/red-datasets/lib/datasets/dataset.rb:56:in `block in extract_bz2'
	from /home/otegami/work/ruby/red-datasets/lib/datasets/dataset.rb:55:in `pipe'
	from /home/otegami/work/ruby/red-datasets/lib/datasets/dataset.rb:55:in `extract_bz2'
	from /home/otegami/work/ruby/red-datasets/lib/datasets/wikipedia.rb:76:in `open_data'
	from /home/otegami/work/ruby/red-datasets/lib/datasets/wikipedia.rb:48:in `each'
	from example/wikipedia.rb:5:in `<main>'

@otegami otegami force-pushed the increase-rexml-entity-expansion-limit branch from 0c17f62 to 0e49e2b Compare October 6, 2024 01:03
@kou
Member

kou commented Oct 6, 2024

Could you retry it to check whether the error is reproducible?


2 participants