@naitoh naitoh commented Jul 12, 2024

Why?

The SAX2 parser expands user-defined entity references and character references in text, but it doesn't expand predefined entity references (such as &lt; and &gt;) before passing the text to the "characters" event.
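
As a point of reference for what "expanding" means here, the following is a minimal sketch that uses `REXML::Text.unnormalize` on a raw string. It is only an illustration of the expected result; it is an assumption that this helper is relevant to the fix, since the actual diff of this PR is not shown in this conversation. The variable names `raw` and `expanded` are made up for the example.

```ruby
require 'rexml/document'

# Raw text as it appears in the XML source, using the predefined
# entity references &lt; and &gt; (illustrative input, not from the PR diff).
raw = "&lt;B&gt; Text &lt;/B&gt;"

# REXML::Text.unnormalize replaces entity references with the
# characters they stand for.
expanded = REXML::Text.unnormalize(raw)

p expanded
```

The DOM, Pull, and Stream outputs shown under "Before" already contain text in this expanded form; only the SAX2 "characters" event does not.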

Change

  • text_unnormalized.rb
require 'rexml/document'
require 'rexml/parsers/sax2parser'
require 'rexml/parsers/pullparser'
require 'rexml/parsers/streamparser'

xml = <<EOS
<root>
  <A>&lt;P&gt;&#13; &lt;I&gt; &lt;B&gt; Text &lt;/B&gt;  &lt;/I&gt;</A>
</root>
EOS

class Listener
  def method_missing(name, *args)
    p [name, *args]
  end
end

puts "REXML(DOM)"
REXML::Document.new(xml).elements.each("/root/A") {|element| puts element.text}

puts ""
puts "REXML(Pull)"
parser = REXML::Parsers::PullParser.new(xml)
while parser.has_next?
    res = parser.pull
    p res
end

puts ""
puts "REXML(Stream)"
REXML::Parsers::StreamParser.new(xml, Listener.new).parse

puts ""
puts "REXML(SAX)"
parser = REXML::Parsers::SAX2Parser.new(xml)
parser.listen(Listener.new)
parser.parse

Before (master)

$ ruby text_unnormalized.rb
REXML(DOM)
 <I> <B> Text </B>  </I>

REXML(Pull)
start_element: ["root", {}]
text: ["\n  ", "\n  "]
start_element: ["A", {}]
text: ["&lt;P&gt;&#13; &lt;I&gt; &lt;B&gt; Text &lt;/B&gt;  &lt;/I&gt;", "<P>\r <I> <B> Text </B>  </I>"]
end_element: ["A"]
text: ["\n", "\n"]
end_element: ["root"]
end_document: []

REXML(Stream)
[:tag_start, "root", {}]
[:text, "\n  "]
[:tag_start, "A", {}]
[:text, "<P>\r <I> <B> Text </B>  </I>"]
[:tag_end, "A"]
[:text, "\n"]
[:tag_end, "root"]

REXML(SAX)
[:start_document]
[:start_element, nil, "root", "root", {}]
[:progress, 6]
[:characters, "\n  "]
[:progress, 9]
[:start_element, nil, "A", "A", {}]
[:progress, 12]
[:characters, "&lt;P&gt;\r &lt;I&gt; &lt;B&gt; Text &lt;/B&gt;  &lt;/I&gt;"] #<= This
[:progress, 74]
[:end_element, nil, "A", "A"]
[:progress, 78]
[:characters, "\n"]
[:progress, 79]
[:end_element, nil, "root", "root"]
[:progress, 86]
[:end_document]

After (This PR)

$ ruby text_unnormalized.rb
REXML(SAX)
[:start_document]
[:start_element, nil, "root", "root", {}]
[:progress, 6]
[:characters, "\n  "]
[:progress, 9]
[:start_element, nil, "A", "A", {}]
[:progress, 12]
[:characters, "<P>\r <I> <B> Text </B>  </I>"]
[:progress, 74]
[:end_element, nil, "A", "A"]
[:progress, 78]
[:characters, "\n"]
[:progress, 79]
[:end_element, nil, "root", "root"]
[:progress, 86]
[:end_document]
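
To check only the "characters" event without the rest of the listener output, a block-style listener can be used. This is a hedged sketch rather than part of the PR: the shortened `xml` input and the `chunks` array are made up for illustration, and the printed text depends on the installed REXML version (expanded with this PR, unexpanded on older versions).

```ruby
require 'rexml/document'
require 'rexml/parsers/sax2parser'

# Shortened, hypothetical input in the same style as text_unnormalized.rb.
xml = "<root><A>&lt;B&gt; Text &lt;/B&gt;</A></root>"

chunks = []
parser = REXML::Parsers::SAX2Parser.new(xml)

# Register a block that is called for every :characters event.
parser.listen(:characters) { |text| chunks << text }
parser.parse

# Shows the text exactly as the SAX2 parser delivered it.
p chunks
```

Comparing this output before and after upgrading shows whether the installed version includes the fix.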

@naitoh naitoh marked this pull request as ready for review July 13, 2024 00:06
@kou kou changed the title Fixed a problem with sax2 where text was not unnormalized Fix a bug that SAX2 parser doesn't expand the predefined entities for "characters" Jul 13, 2024
naitoh added 2 commits July 14, 2024 07:42
## Why?
The text passed to `:characters` is not unnormalized in SAX2: predefined entity references are left unexpanded.

## Change
- text_unnormalized.rb
```ruby
require 'rexml/document'
require 'rexml/parsers/sax2parser'
require 'rexml/parsers/pullparser'
require 'rexml/parsers/streamparser'

xml = <<EOS
<root>
  <A>&lt;P&gt;&#13; &lt;I&gt; &lt;B&gt; Text &lt;/B&gt;  &lt;/I&gt;</A>
</root>
EOS

class Listener
  def method_missing(name, *args)
    p [name, *args]
  end
end

puts "REXML(DOM)"
REXML::Document.new(xml).elements.each("/root/A") {|element| puts element.text}

puts ""
puts "REXML(Pull)"
parser = REXML::Parsers::PullParser.new(xml)
while parser.has_next?
    res = parser.pull
    p res
end

puts ""
puts "REXML(Stream)"
REXML::Parsers::StreamParser.new(xml, Listener.new).parse

puts ""
puts "REXML(SAX)"
parser = REXML::Parsers::SAX2Parser.new(xml)
parser.listen(Listener.new)
parser.parse
```

## Before (master)
```
$ ruby text_unnormalized.rb
REXML(DOM)
 <I> <B> Text </B>  </I>

REXML(Pull)
start_element: ["root", {}]
text: ["\n  ", "\n  "]
start_element: ["A", {}]
text: ["&lt;P&gt;&#13; &lt;I&gt; &lt;B&gt; Text &lt;/B&gt;  &lt;/I&gt;", "<P>\r <I> <B> Text </B>  </I>"]
end_element: ["A"]
text: ["\n", "\n"]
end_element: ["root"]
end_document: []

REXML(Stream)
[:tag_start, "root", {}]
[:text, "\n  "]
[:tag_start, "A", {}]
[:text, "<P>\r <I> <B> Text </B>  </I>"]
[:tag_end, "A"]
[:text, "\n"]
[:tag_end, "root"]

REXML(SAX)
[:start_document]
[:start_element, nil, "root", "root", {}]
[:progress, 6]
[:characters, "\n  "]
[:progress, 9]
[:start_element, nil, "A", "A", {}]
[:progress, 12]
[:characters, "&lt;P&gt;\r &lt;I&gt; &lt;B&gt; Text &lt;/B&gt;  &lt;/I&gt;"] #<= This
[:progress, 74]
[:end_element, nil, "A", "A"]
[:progress, 78]
[:characters, "\n"]
[:progress, 79]
[:end_element, nil, "root", "root"]
[:progress, 86]
[:end_document]
```

## After (This PR)
```
$ ruby text_unnormalized.rb
REXML(SAX)
[:start_document]
[:start_element, nil, "root", "root", {}]
[:progress, 6]
[:characters, "\n  "]
[:progress, 9]
[:start_element, nil, "A", "A", {}]
[:progress, 12]
[:characters, "<P>\r <I> <B> Text </B>  </I>"]
[:progress, 74]
[:end_element, nil, "A", "A"]
[:progress, 78]
[:characters, "\n"]
[:progress, 79]
[:end_element, nil, "root", "root"]
[:progress, 86]
[:end_document]
```
@naitoh naitoh force-pushed the fix_sax2_text_unnormalize branch from c89992f to 13d2c86 Compare July 13, 2024 22:55
@naitoh naitoh requested a review from kou July 13, 2024 23:03
@kou kou merged commit 4ebf21f into ruby:master Jul 14, 2024
kou commented Jul 14, 2024

Thanks.

@naitoh naitoh deleted the fix_sax2_text_unnormalize branch July 14, 2024 11:38
naitoh added a commit to naitoh/rexml that referenced this pull request Aug 20, 2024