wikipedia-kyoto-japanese-english: increase REXML entity expansion limit during XML parsing (red-data-tools#198)

Calling `Datasets::WikipediaKyotoJapaneseEnglish#each` raised
`entity expansion has grown too large (RuntimeError)`. The error occurs
because REXML enforces an entity expansion limit as of
ruby/rexml#187, and parsing this dataset with
`Datasets::WikipediaKyotoJapaneseEnglish#each` exceeds that limit.

Raising the entity expansion limit is acceptable in Red Datasets
because it needs to handle large datasets.
We therefore raise the limit only while parsing the XML and restore the
previous value afterwards.

## How to reproduce

```console
$ cd red-datasets && bundle
$ bundle exec ruby example/wikipedia-kyoto-japanese-english.rb
...
/home/otegami/.rbenv/versions/3.3.3/lib/ruby/gems/3.3.0/gems/rexml-3.3.4/lib/rexml/parsers/baseparser.rb:560:in `block in unnormalize': entity expansion has grown too large (RuntimeError)
...
```
otegami authored Aug 5, 2024
1 parent 4ebf6ff commit a76b917
Showing 1 changed file with 14 additions and 1 deletion.
lib/datasets/wikipedia-kyoto-japanese-english.rb (14 additions, 1 deletion)

@@ -89,7 +89,9 @@ def each(&block)
               next unless base_name.end_with?(".xml")
               listener = ArticleListener.new(block)
               parser = REXML::Parsers::StreamParser.new(entry.read, listener)
-              parser.parse
+              with_increased_entity_expansion_text_limit do
+                parser.parse
+              end
             when :lexicon
               next unless base_name == "kyoto_lexicon.csv"
               is_header = true
@@ -106,6 +108,9 @@ def each(&block)
     end

     private
+
+    ENTITY_EXPANSION_TEXT_LIMIT = 163_840
+
     def download_tar_gz
       base_name = "wiki_corpus_2.01.tar.gz"
       data_path = cache_dir_path + base_name
@@ -114,6 +119,14 @@ def download_tar_gz
       data_path
     end

+    def with_increased_entity_expansion_text_limit
+      default_limit = REXML::Security.entity_expansion_text_limit
+      REXML::Security.entity_expansion_text_limit = ENTITY_EXPANSION_TEXT_LIMIT
+      yield
+    ensure
+      REXML::Security.entity_expansion_text_limit = default_limit
+    end
+
     class ArticleListener
       include REXML::StreamListener

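The save-and-restore pattern used by `with_increased_entity_expansion_text_limit` can be checked in isolation. The following is a minimal sketch, not the code from this commit: the helper name `with_entity_expansion_text_limit` and its `limit` parameter are hypothetical and introduced only for illustration, while `REXML::Security.entity_expansion_text_limit` and its setter are the same calls the diff uses. It shows that the raised limit is visible inside the block and that the previous value is restored afterwards, including when the block raises.

```ruby
require "rexml/document"

# Hypothetical helper for illustration only. The commit's real method,
# with_increased_entity_expansion_text_limit, takes no argument and uses
# the ENTITY_EXPANSION_TEXT_LIMIT constant instead.
def with_entity_expansion_text_limit(limit)
  previous = REXML::Security.entity_expansion_text_limit
  REXML::Security.entity_expansion_text_limit = limit
  yield
ensure
  # Runs on normal return and when the block raises.
  REXML::Security.entity_expansion_text_limit = previous
end

before = REXML::Security.entity_expansion_text_limit

# The method returns the block's value, so this captures the limit
# that is in effect while the block runs.
inside = with_entity_expansion_text_limit(163_840) do
  REXML::Security.entity_expansion_text_limit
end
after = REXML::Security.entity_expansion_text_limit

begin
  with_entity_expansion_text_limit(163_840) { raise "parse failed" }
rescue RuntimeError
  # The error still propagates to the caller; only the limit is reset.
end
after_error = REXML::Security.entity_expansion_text_limit
```

Because the limit is a process-wide REXML setting rather than a per-parser option, any other code that parses XML while the block is running would presumably see the raised value as well, which is one reason to keep the block as narrow as the single `parser.parse` call.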