REXML don't support UTF8 BOM on Windows 11 #231
Comments
I think REXML supports UTF-8 with BOM. Lines 193 to 194 in b70388c
Could you give us an XML file that can reproduce this problem? |
If I install the latest version of Puppet and replace REXML 3.3.6 with an older version (3.2.6 or 3.2.5), everything works. If I replace it with 3.2.7 or newer, Puppet reports errors. (The Windows worldwide UTF-8 support option is disabled!) |
I can't seem to reproduce this in my m1 Mac environment...
|
on Windows 11 or 10 any build!!! |
And don't use File.open; try File.read (which is what puppetlabs-chocolatey uses in its code). |
I tried it on Windows 11 and it worked fine.
|
ok, thanks! maybe puppet did something wrong. |
encoding = [Encoding::IBM437, Encoding::IBM865].sample
bom = "\xef\xbb\xbf".b
bom.force_encoding(encoding).encode('utf-8').force_encoding(encoding).encode('utf-8')
#=> ""
So the data passed to REXML has already been corrupted by being re-encoded several times with multiple encodings. I think this is not a bug in REXML. Since the xml has If your |
Thanks! Do the following commands work in your environment?
|
@andryua A corrupted BOM is considered invalid XML because it is an invalid string. The old REXML did not have this check, so it would have worked with a corrupted BOM. Please let us know the results below to confirm that REXML works correctly with an uncorrupted BOM in your environment (cp866).
If the above does not raise an error, there may be a problem on the Puppet agent side. |
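A quick way to tell an intact BOM from a corrupted one is to compare raw bytes before the string reaches REXML. The following is only a minimal sketch: the helper name `intact_utf8_bom?` is hypothetical (it is not part of REXML or Puppet), and the IBM437 transcoding merely simulates a read with the wrong external encoding.

```ruby
# Hypothetical helper (not part of REXML or Puppet): true when the raw bytes
# of +str+ begin with the UTF-8 BOM (EF BB BF), whatever its encoding tag says.
UTF8_BOM = "\xEF\xBB\xBF".b

def intact_utf8_bom?(str)
  str.b.start_with?(UTF8_BOM)
end

# A UTF-8 string that starts with an intact BOM.
good = ("\xEF\xBB\xBF".b + '<?xml version="1.0" encoding="utf-8"?><a/>'.b)
       .force_encoding(Encoding::UTF_8)

# Simulate a read that wrongly treats the bytes as IBM437 and transcodes
# them to UTF-8: the three BOM bytes become three unrelated characters.
bad = good.dup.force_encoding(Encoding::IBM437).encode(Encoding::UTF_8)

puts intact_utf8_bom?(good) # true
puts intact_utf8_bom?(bad)  # false
puts bad.valid_encoding?    # true: still valid UTF-8, but no longer a BOM
```

Note that the corrupted string is still valid UTF-8, so `valid_encoding?` alone does not detect the problem; only the byte comparison does.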
The output below is from Puppet version 8.10.0, but I used this command:
|
One more thing: REXML versions 3.2.6 and 3.2.5 work correctly, but 3.2.7 and newer do not. |
On the Puppet side, I found only this (the chocolatey provider, for example https://github.com/puppetlabs/puppetlabs-chocolatey/blob/main/lib/puppet/provider/chocolateyconfig/windows.rb):
and if I change File.read to File.open, everything works fine. |
So, what I have:
If I install a clean Puppet 8.10.0 (REXML 3.3.6 and Ruby 3.2.5) and change the system encoding on if in the chocolatey provider (puppetlabs-chocolatey) I change
|
Thanks.
I would like to know the error results with Puppet agent using REXML 3.2.7. |
The result is the same as in the first post.
|
We need REXML 3.2.7 error messages to resolve this issue. |
I found changes in baseparser.rb between versions 3.3.2 and 3.3.3 (v3.3.2...v3.3.3).
I reverted this block in the 3.3.3 version of the file to its 3.3.2 form (changing only this block), and everything works fine. |
3.3.6 |
OK I have not seen any problems in your environment (cp866) using REXML and https://github.com/user-attachments/files/18328878/chocolatey.txt.
Yes,
REXML 3.3.3 (and later) now processes only valid XML. If you say REXML is problematic, please reproduce it with code that uses only REXML and XML, without Puppet agent. If we cannot reproduce it in our environment, we cannot investigate it. |
Hi! I am using this code in Ruby
and got this result in cmd (same result in Cygwin bash and PowerShell)
So everything is OK. But I don't understand why Puppet reports an error on the same file... |
And please look at this: |
Yes, it is reproducible. And it is caused by File.read with the wrong encoding. If a file is read with the wrong encoding, encoding conversion may corrupt the file content. Here's an example with JSON. It applies not only to the BOM but to any other non-ASCII bytes.
# coding: utf-8
require 'json'
File.binwrite 'a.json', {key: "ࠁ\""}.to_json
corrupt_json = File.read('a.json', encoding:'sjis:utf-8')
p File.binread('a.json').bytes == corrupt_json.bytes #=> false
puts corrupt_json #=> {"key":"燿―""} # " is not escaped; it's invalid JSON
JSON.parse corrupt_json #=> JSON::ParserError
Even if your XML doesn't have a multibyte character except the BOM, You can think that the strict parsing of REXML found a potential bug in Puppet, or just rediscovered that Puppet doesn't support your encoding configuration. So I think there is no need to discuss the File.read encoding conversion issue here anymore. |
@andryua
Expected string
The BOM for UTF-8 is OK Case
Broken Case (with Encoding.default_external and Encoding.default_internal)
Encoding.default_external and Encoding.default_internal are causing File.read to pass corrupted strings to REXML.
|
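The effect of the default encodings can be reproduced without changing any process-wide setting. The sketch below assumes a made-up file name and XML body; it passes the `IBM866:UTF-8` (external:internal) pair directly to `File.read`, which is the same pair that `ruby -Ecp866:utf-8` installs as `Encoding.default_external` and `Encoding.default_internal`.

```ruby
require "tmpdir"

# Minimal XML with a UTF-8 BOM; file name and content are arbitrary for this sketch.
bom_xml = "\xEF\xBB\xBF".b + '<?xml version="1.0" encoding="utf-8"?><chocolatey/>'.b

# Format the first four bytes of a string as hex, as in the thread's output.
first_bytes = ->(s) { s.bytes.first(4).map { |b| b.to_s(16) }.join(" ") }

as_utf8, as_cp866 = Dir.mktmpdir do |dir|
  path = File.join(dir, "sample.xml")
  File.binwrite(path, bom_xml)

  [
    # Explicit UTF-8: the bytes are taken exactly as they are on disk.
    File.read(path, encoding: Encoding::UTF_8),
    # external:internal pair: bytes are interpreted as IBM866, then
    # transcoded to UTF-8, which rewrites the three BOM bytes.
    File.read(path, encoding: "IBM866:UTF-8")
  ]
end

puts first_bytes.call(as_utf8)  # ef bb bf 3c
puts first_bytes.call(as_cp866) # d1 8f e2 95
```

The second result matches the `d1 8f e2 95` prefix reported for the broken case: the file on disk is fine, and only the transcoding step damages the BOM.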
Thanks a lot! I sent them a pull request that changes File.read to File.open, but I am still waiting for it to be approved... |
@naitoh 👋 Chocolatey maintainer here 😄 I have been following this issue with interest since it was first reported in our repository. I am a little confused by what is going on here though, so I am hoping that someone might be able to shed some light on this.... Is there something wrong with the way that the chocolatey.config file is encoded? Or is it the case that something is wrong with rexml? Why are File.read and File.open not doing the same thing? Why is one reading the BOM in a different way from the other? I am glad that there is a fix, and Puppet can switch to using File.open, but if File.open was working in a previous version of rexml, why is it no longer working? Apologies if this is a naive question, as I know nothing about file encoding 😄 |
#231 (comment)
require "rexml/document"
puts "Encoding.default_external: [#{Encoding.default_external}]"
puts "Encoding.default_internal: [#{Encoding.default_internal}]"
puts ""
puts "== A-1. [File.read('chocolatey.txt', encoding: Encoding::UTF_8)] =="
xml_string = File.read('chocolatey.txt', encoding: Encoding::UTF_8)
puts "xml_string.class: #{xml_string.class}"
puts "xml_string: #{xml_string.bytes.map { |b| b.to_s(16) }[0..3].join(' ')}"
puts REXML::XPath.each(REXML::Document.new(xml_string), "/chocolatey/config/add[@key]").first
puts ""
puts "== A-2. [File.read('chocolatey.txt')] =="
xml_string = File.read('chocolatey.txt')
puts "xml_string.class: #{xml_string.class}"
puts "xml_string: #{xml_string.bytes.map { |b| b.to_s(16) }[0..3].join(' ')}"
puts REXML::XPath.each(REXML::Document.new(xml_string), "/chocolatey/config/add[@key]").first
$ ruby -Ecp866:utf-8 file_read.rb
Encoding.default_external: [IBM866]
Encoding.default_internal: [UTF-8]
== A-1. [File.read('chocolatey.txt', encoding: Encoding::UTF_8)] ==
xml_string.class: String
xml_string: ef bb bf 3c
<add description='Cache location if not TEMP folder. Replaces `$env:TEMP` value for choco.exe process. It is highly recommended this be set to make Chocolatey more deterministic in cleanup.' key='cacheLocation' value='C:\ProgramData\chocolatey\cache'/>
== A-2. [File.read('chocolatey.txt')] ==
xml_string.class: String
xml_string: d1 8f e2 95
/Users/naitoh/.rbenv/versions/3.4.1/lib/ruby/gems/3.4.0/gems/rexml-3.4.1/lib/rexml/parsers/baseparser.rb:524:in 'REXML::Parsers::BaseParser#pull_event': Malformed XML: Content at the start of the document (got 'я╗┐') (REXML::ParseException)
Line: 1
Position: 46
Last 80 unconsumed characters:
<?xml version="1.0" encoding="utf-8"?>
from /Users/naitoh/.rbenv/versions/3.4.1/lib/ruby/gems/3.4.0/gems/rexml-3.4.1/lib/rexml/parsers/baseparser.rb:247:in 'REXML::Parsers::BaseParser#pull'
from /Users/naitoh/.rbenv/versions/3.4.1/lib/ruby/gems/3.4.0/gems/rexml-3.4.1/lib/rexml/parsers/treeparser.rb:21:in 'REXML::Parsers::TreeParser#parse'
from /Users/naitoh/.rbenv/versions/3.4.1/lib/ruby/gems/3.4.0/gems/rexml-3.4.1/lib/rexml/document.rb:452:in 'REXML::Document#build'
from /Users/naitoh/.rbenv/versions/3.4.1/lib/ruby/gems/3.4.0/gems/rexml-3.4.1/lib/rexml/document.rb:103:in 'REXML::Document#initialize'
from file_read.rb:18:in 'Class#new'
from file_read.rb:18:in '<main>'
In A-2 above, a broken string (xml_string =
No.
require "rexml/document"
puts "Encoding.default_external: [#{Encoding.default_external}]"
puts "Encoding.default_internal: [#{Encoding.default_internal}]"
puts ""
puts "== B-1. [File.open('chocolatey.txt', encoding: Encoding::UTF_8) & read] =="
xml_file = File.open('chocolatey.txt', encoding: Encoding::UTF_8)
puts "xml_file.class: #{xml_file.class}"
xml_string = xml_file.read
puts "xml_string.class: #{xml_string.class}"
puts "xml_string: #{xml_string.bytes.map { |b| b.to_s(16) }[0..3].join(' ')}"
puts REXML::XPath.each(REXML::Document.new(xml_string), "/chocolatey/config/add[@key]").first
puts ""
puts "== B-2. [File.open('chocolatey.txt') & read] =="
xml_file = File.open('chocolatey.txt')
puts "xml_file.class: #{xml_file.class}"
xml_string = xml_file.read
puts "xml_string.class: #{xml_string.class}"
puts "xml_string: #{xml_string.bytes.map { |b| b.to_s(16) }[0..3].join(' ')}"
puts REXML::XPath.each(REXML::Document.new(xml_string), "/chocolatey/config/add[@key]").first
With the support of #184, an invalid string preceding an XML declaration is no longer treated as valid XML. A corrupted BOM is considered invalid XML because it is an invalid string. The old REXML did not have this check, so it would have worked with a corrupted BOM. Use |
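As a sketch of how a caller can keep the BOM intact regardless of the process defaults, the file can be read with an explicit encoding, as in case A-1. The file name and XML content below are made up for illustration, and the `File.binread` variant is an alternative added here as an assumption, not something the thread prescribes.

```ruby
require "rexml/document"
require "tmpdir"

# Made-up config content with a UTF-8 BOM, loosely shaped like chocolatey.config.
xml_bytes = "\xEF\xBB\xBF".b +
            '<?xml version="1.0" encoding="utf-8"?>'.b +
            '<chocolatey><config><add key="cacheLocation" value="x"/></config></chocolatey>'.b

key_explicit, key_binread = Dir.mktmpdir do |dir|
  path = File.join(dir, "sample.config")
  File.binwrite(path, xml_bytes)

  # Option 1: name the encoding explicitly, so the external default is not used.
  explicit = File.read(path, encoding: Encoding::UTF_8)
  # Option 2 (assumption for illustration): read raw bytes, then tag them as UTF-8.
  raw = File.binread(path).force_encoding(Encoding::UTF_8)

  # Both strings start with an intact BOM, so REXML parses them.
  [explicit, raw].map do |xml|
    doc = REXML::Document.new(xml)
    REXML::XPath.first(doc, "/chocolatey/config/add[@key]").attributes["key"]
  end
end

puts key_explicit
puts key_binread
```

Either option hands REXML the bytes that are actually on disk, which is the condition the stricter parser now checks.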
Really? It seems that |
Oh, sorry.
Yes
Yes. @kou Thanks. |
@naitoh thank you for the detailed explanation, I appreciate it! |
Hi!
After updating Puppet agent to 8.9.0 (which updated REXML from 3.2.5 to 3.3.6), the Puppet chocolatey provider can no longer correctly parse XML encoded as UTF-8 with a BOM. Everything works only if the beta worldwide UTF-8 support option is enabled in Windows 10/11. If this feature is disabled, Puppet reports the error again. (Tried on Cyrillic and English; same issue.)