Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sax parser with replace_entities = false still replaces entities #1284

Closed
kostya opened this issue May 10, 2015 · 6 comments
Closed

sax parser with replace_entities = false still replaces entities #1284

kostya opened this issue May 10, 2015 · 6 comments

Comments

@kostya
Copy link

kostya commented May 10, 2015

1.6.6.2

require 'bundler/setup'
require 'nokogiri'

class Doc < Nokogiri::XML::SAX::Document
  def characters(chars)
    p chars
  end
end

handler = Doc.new
parser = Nokogiri::HTML::SAX::Parser.new(handler)
parser.parse_memory("<body> &#61 - &#8211; &amp; &nbsp; </body>") do |ctx|
  ctx.replace_entities = false
end
" "
"="
" - "
"–"
" "
"&"
" "
" "
" "
@flavorjones
Copy link
Member

Hi,

Thank for reporting this.

Can you provide the output from nokogiri -v so that we know what your
environment looks like?

-m
On May 10, 2015 11:32 AM, "kostya" notifications@github.com wrote:

1.6.6.2

require 'bundler/setup'require 'nokogiri'
class Doc < Nokogiri::XML::SAX::Document
def characters(chars)
p chars
endend

handler = Doc.new
parser = Nokogiri::HTML::SAX::Parser.new(handler)
parser.parse_memory(" &#61 - – & ") do |ctx|
ctx.replace_entities = falseend

" "
"="
" - "
"–"
" "
"&"
" "


Reply to this email directly or view it on GitHub
#1284.

@kostya
Copy link
Author

kostya commented May 10, 2015

# Nokogiri (1.6.6.2)
    ---
    warnings: []
    nokogiri: 1.6.6.2
    ruby:
      version: 2.2.0
      platform: x86_64-darwin13
      description: ruby 2.2.0p0 (2014-12-25 revision 49005) [x86_64-darwin13]
      engine: ruby
    libxml:
      binding: extension
      source: packaged
      libxml2_path: "/Users/kostya/.rbenv/versions/2.2.0/lib/ruby/gems/2.2.0/gems/nokogiri-1.6.6.2/ports/x86_64-apple-darwin13.0.0/libxml2/2.9.2"
      libxslt_path: "/Users/kostya/.rbenv/versions/2.2.0/lib/ruby/gems/2.2.0/gems/nokogiri-1.6.6.2/ports/x86_64-apple-darwin13.0.0/libxslt/1.1.28"
      libxml2_patches:
      - 0001-Revert-Missing-initialization-for-the-catalog-module.patch
      - 0002-Fix-missing-entities-after-CVE-2014-3660-fix.patch
      libxslt_patches:
      - 0001-Adding-doc-update-related-to-1.1.28.patch
      - 0002-Fix-a-couple-of-places-where-f-printf-parameters-wer.patch
      - 0003-Initialize-pseudo-random-number-generator-with-curre.patch
      - 0004-EXSLT-function-str-replace-is-broken-as-is.patch
      - 0006-Fix-str-padding-to-work-with-UTF-8-strings.patch
      - 0007-Separate-function-for-predicate-matching-in-patterns.patch
      - 0008-Fix-direct-pattern-matching.patch
      - 0009-Fix-certain-patterns-with-predicates.patch
      - 0010-Fix-handling-of-UTF-8-strings-in-EXSLT-crypto-module.patch
      - 0013-Memory-leak-in-xsltCompileIdKeyPattern-error-path.patch
      - 0014-Fix-for-bug-436589.patch
      - 0015-Fix-mkdir-for-mingw.patch
      compiled: 2.9.2
      loaded: 2.9.2

@tisba
Copy link

tisba commented May 13, 2016

Is there any update on this issue?

I also think that this test

@parser = XML::SAX::Parser.new(Doc.new) do |ctx|
is not correct, since the block passed to Nokogiri::XML::SAX::Parser.new gets ignored. You have to pass it to #parse in order to get access to Nokogiri::XML::SAX::ParserContext. As @kostya reported it still has no effect though :-/

# Nokogiri (1.6.7.2)
    ---
    warnings: []
    nokogiri: 1.6.7.2
    ruby:
      version: 2.3.0
      platform: x86_64-linux-gnu
      description: ruby 2.3.0p0 (2015-12-25 revision 53290) [x86_64-linux-gnu]
      engine: ruby
    libxml:
      binding: extension
      source: packaged
      libxml2_path: "/var/fejo-dk/fejo-next/vendor/bundle/ruby/2.3.0/gems/nokogiri-1.6.7.2/ports/x86_64-pc-linux-gnu/libxml2/2.9.2"
      libxslt_path: "/var/fejo-dk/fejo-next/vendor/bundle/ruby/2.3.0/gems/nokogiri-1.6.7.2/ports/x86_64-pc-linux-gnu/libxslt/1.1.28"
      libxml2_patches:
      - 0001-Revert-Missing-initialization-for-the-catalog-module.patch
      - 0002-Fix-missing-entities-after-CVE-2014-3660-fix.patch
      - 0003-Stop-parsing-on-entities-boundaries-errors.patch
      - 0004-Cleanup-conditional-section-error-handling.patch
      - 0005-CVE-2015-1819-Enforce-the-reader-to-run-in-constant-.patch
      - 0006-Another-variation-of-overflow-in-Conditional-section.patch
      - 0007-Fix-an-error-in-previous-Conditional-section-patch.patch
      - 0008-CVE-2015-8035-Fix-XZ-compression-support-loop.patch
      - 0009-Updated-config.guess.patch
      - 0010-Fix-parsering-short-unclosed-comment-uninitialized-access.patch
      - 0011-Avoid-extra-processing-of-MarkupDecl-when-EOF.patch
      - 0012-Avoid-processing-entities-after-encoding-conversion-.patch
      - 0013-CVE-2015-7497-Avoid-an-heap-buffer-overflow-in-xmlDi.patch
      - 0014-CVE-2015-5312-Another-entity-expansion-issue.patch
      - 0015-Add-xmlHaltParser-to-stop-the-parser.patch
      - 0016-Detect-incoherency-on-GROW.patch
      - 0017-CVE-2015-7500-Fix-memory-access-error-due-to-incorre.patch
      - 0018-CVE-2015-8242-Buffer-overead-with-HTML-parser-in-pus.patch
      - 0019-Do-not-print-error-context-when-there-is-none.patch
      - 0020-xmlStopParser-reset-errNo.patch
      - 0021-Reuse-xmlHaltParser-where-it-makes-sense.patch
      libxslt_patches:
      - 0001-Adding-doc-update-related-to-1.1.28.patch
      - 0002-Fix-a-couple-of-places-where-f-printf-parameters-wer.patch
      - 0003-Initialize-pseudo-random-number-generator-with-curre.patch
      - 0004-EXSLT-function-str-replace-is-broken-as-is.patch
      - 0006-Fix-str-padding-to-work-with-UTF-8-strings.patch
      - 0007-Separate-function-for-predicate-matching-in-patterns.patch
      - 0008-Fix-direct-pattern-matching.patch
      - 0009-Fix-certain-patterns-with-predicates.patch
      - 0010-Fix-handling-of-UTF-8-strings-in-EXSLT-crypto-module.patch
      - 0013-Memory-leak-in-xsltCompileIdKeyPattern-error-path.patch
      - 0014-Fix-for-bug-436589.patch
      - 0015-Fix-mkdir-for-mingw.patch
      - 0016-Fix-for-type-confusion-in-preprocessing-attributes.patch
      - 0017-Updated-config.guess.patch
      compiled: 2.9.2
      loaded: 2.9.2

@rosenfeld
Copy link
Contributor

rosenfeld commented Jun 30, 2016

I do also experience this issue with Nokogiri 1.6.8. Here's how I reproduce in irb -r nokogiri:

class Parser < Nokogiri::XML::SAX::Document;def characters(v); p v; end;end
Nokogiri::HTML::SAX::Parser.new(Parser.new).parse("&#146;"){|ctx| ctx.replace_entities = false}
# I get "\u0092" rather than "&#146;"

This is causing some issues later in our application. I'd love to get this bug fixed. Right now the work-around I'm using is to .gsub /&(#\d{3};)/, '!!\1' and then gsub again after processed which fixes my specific case but it is a fragile solution.

@rosenfeld
Copy link
Contributor

rosenfeld commented Jun 30, 2016

If you want a failing test, just replace these lines:

https://github.com/sparklemotion/nokogiri/blob/master/test/html/sax/test_parser.rb#L108
https://github.com/sparklemotion/nokogiri/blob/master/test/html/sax/test_parser.rb#L126

So that replace_entities is set to false and expect 'daddy &amp; me' rather than 'daddy & me'.

Currently, setting replace_entities does absolutely nothing no matter what you set it to.

@flavorjones flavorjones added this to the v1.17.0 milestone Jun 30, 2024
flavorjones added a commit that referenced this issue Jul 1, 2024
On CRuby, this fixes the fact that the parser was registering errors
when encountering general (non-predefined) entities. Now these
entities are resolved properly and converted into `#characters`
callbacks. Fixes #1926.

On JRuby, the SAX parser now respects the `#replace_entities`
attribute, which was previously ignored AND defaulted incorrectly to
`true`. The default now matches CRuby -- `false` -- and the parser
behavior matches CRuby with respect to entities. Fixes #614.

This commit also includes some granular tests of how the sax parser
handles different entities under different circumstances, which should
be clarifying for user reports like #1284 and #1500 that expect
predefined entities and character references to be treated like parsed
entities (which they aren't).
@flavorjones
Copy link
Member

As part of my work in #1926, I investigated entity handling and documented its behavior in #3265. Here's a screenshot of the docs:

image

As you can see, character references and predefined entities will always result in a callback to #characters with the replacement text. This is just how the underlying parsers (libxml2 and xerces) behave.

Sorry it took so long to explain this, and I hope this is helpful.

@flavorjones flavorjones removed this from the v1.17.0 milestone Jul 1, 2024
flavorjones added a commit that referenced this issue Jul 2, 2024
On CRuby, this fixes the fact that the parser was registering errors
when encountering general (non-predefined) entities. Now these
entities are resolved properly and converted into `#characters`
callbacks. Fixes #1926.

On JRuby, the SAX parser now respects the `#replace_entities`
attribute, which was previously ignored AND defaulted incorrectly to
`true`. The default now matches CRuby -- `false` -- and the parser
behavior matches CRuby with respect to entities. Fixes #614.

This commit also includes some granular tests of how the sax parser
handles different entities under different circumstances, which should
be clarifying for user reports like #1284 and #1500 that expect
predefined entities and character references to be treated like parsed
entities (which they aren't).
flavorjones added a commit that referenced this issue Jul 2, 2024
**What problem is this PR intended to solve?**

#1926 described an issue wherein the SAX parser was not correctly
resolving and replacing internal entities, and was instead reporting an
error for each entity reference. This PR includes a fix for that
problem.

I've removed the unnecessary "SAX tuple" from the SAX implementation,
replacing it with the `_private` struct member that libxml2 makes
available. Then I set up the parser context structs so that we can use
libxml2's standard SAX callbacks where they're useful (which is how I
addressed the above issue).

This PR also introduces a new feature, a SAX handler callback
`Document#reference` which allows callers to get entity-specific name
and replacement text information (rather than relying on the
`Document#characters` callback). This can be used to solve the original
issue in #1926 with this code: searls/eiwa#11

The behavior of the SAX parser with respect to entities is complex
enough that I wrote up a short doc in the `XML::SAX::Document` docstring
with a table and explanation. I've also added warnings to remind users
that `#replace_entities` is not safe to set when parsing untrusted
documents.

In the Java implementation, I've fixed the `#replace_entities` option in
the SAX parser context and set it to the proper default (`false`),
fixing #614. I've also corrected the value of the URI argument to
`Document#start_element_namespace` which was a blank string when it
should have been `nil`.

I've added quite a bit of testing around the SAX parser's handling of
entities.

I added and clarified quite a bit of documentation around SAX parsing
generally. Exception messages have been clarified in a couple of places,
and made consistent between the C and Java implementations. This should
address questions asked in issues #1500 and #1284.

Finally, I cleaned up some of the C code that implements SAX parsing,
naming functions more explicitly (and moving towards some kind of
standard naming convention).

Closes #1926.
Closes #614.


**Have you included adequate test coverage?**

Yes!


**Does this change affect the behavior of either the C or the Java
implementations?**

Yes, but the implementations are much more consistent with each other
now.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants