wikipedia-kyoto-japanese-english: increase REXML entity expansion limit during XML parsing

Calling `Datasets::WikipediaKyotoJapaneseEnglish#each` raised `entity expansion has grown too large (RuntimeError)`.
This error occurs because REXML enforces an entity expansion limit (set by ruby/rexml#187),
and parsing the XML in `Datasets::WikipediaKyotoJapaneseEnglish#each` exceeds that limit.

In Red Datasets, raising the entity expansion limit is not a problem because we want to handle large datasets.
Therefore, we temporarily increase the limit while parsing and restore the previous value afterwards.

How to reproduce:

```console
$ cd red-datasets && bundle
$ bundle exec ruby example/wikipedia-kyoto-japanese-english.rb
...
/home/otegami/.rbenv/versions/3.3.3/lib/ruby/gems/3.3.0/gems/rexml-3.3.4/lib/rexml/parsers/baseparser.rb:560:in `block in unnormalize': entity expansion has grown too large (RuntimeError)
...
```
otegami committed Aug 1, 2024
1 parent 4ebf6ff commit c28ef0e
Showing 1 changed file with 15 additions and 2 deletions.
17 changes: 15 additions & 2 deletions lib/datasets/wikipedia-kyoto-japanese-english.rb
@@ -88,8 +88,10 @@ def each(&block)
 when :article
   next unless base_name.end_with?(".xml")
   listener = ArticleListener.new(block)
-  parser = REXML::Parsers::StreamParser.new(entry.read, listener)
-  parser.parse
+  with_increased_entity_expansion_text_limit do
+    parser = REXML::Parsers::StreamParser.new(entry.read, listener)
+    parser.parse
+  end
 when :lexicon
   next unless base_name == "kyoto_lexicon.csv"
   is_header = true
@@ -106,6 +108,9 @@ def each(&block)
 end

 private
+
+ENTITY_EXPANSION_TEXT_LIMIT = 163_840
+
 def download_tar_gz
   base_name = "wiki_corpus_2.01.tar.gz"
   data_path = cache_dir_path + base_name
@@ -114,6 +119,14 @@ def download_tar_gz
   data_path
 end

+def with_increased_entity_expansion_text_limit
+  default_limit = REXML::Security.entity_expansion_text_limit
+  REXML::Security.entity_expansion_text_limit = ENTITY_EXPANSION_TEXT_LIMIT
+  yield
+ensure
+  REXML::Security.entity_expansion_text_limit = default_limit
+end
+
 class ArticleListener
   include REXML::StreamListener
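
The helper in this change follows a save, override, restore pattern: it remembers the current limit, raises it for the duration of the block, and puts the old value back in an `ensure` clause so that a failed parse cannot leave the higher limit in place. The sketch below illustrates that behavior in isolation. It does not use REXML; `FakeSecurity`, `with_text_limit`, and the starting value `10_240` are hypothetical stand-ins chosen for this example, not names or values taken from the commit.

```ruby
# Hypothetical stand-in for a library's process-wide setting.
# It plays the role that REXML::Security plays in the change.
module FakeSecurity
  @text_limit = 10_240 # illustrative starting value (an assumption)

  class << self
    attr_accessor :text_limit
  end
end

# Temporarily set the limit for the duration of the block and
# restore the previous value even if the block raises.
def with_text_limit(limit)
  previous = FakeSecurity.text_limit
  FakeSecurity.text_limit = limit
  yield
ensure
  FakeSecurity.text_limit = previous
end

# Inside the block the raised limit is visible.
inside = with_text_limit(163_840) { FakeSecurity.text_limit }

# If the block fails, the ensure clause still restores the old value.
begin
  with_text_limit(163_840) { raise "parse failed" }
rescue RuntimeError
  # The limit has already been restored by the time we get here.
end
after_error = FakeSecurity.text_limit
```

One design consequence worth keeping in mind: because the setting is process-wide rather than per-parser, any other code running while the block is active would presumably see the raised value as well.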
