REXML don't support UTF8 BOM on Windows 11 #231

Closed
andryua opened this issue Jan 9, 2025 · 35 comments

@andryua

andryua commented Jan 9, 2025

Hi!

After updating the Puppet agent to 8.9.0 (which updated REXML from 3.2.5 to 3.3.6), the Puppet chocolatey provider can no longer parse XML encoded as UTF-8 with a BOM. Everything works only if the beta worldwide UTF-8 support is enabled in Windows 10/11. If I disable this feature, Puppet reports the error again. (I tried Cyrillic and English; same issue.)

Debug: Gathering sources from 'C:\ProgramData\chocolatey\config\chocolatey.config'.
Error: Could not prefetch chocolateysource provider 'windows': Malformed XML: Content at the start of the document (got '')
Line: 1
Position: 47
Last 80 unconsumed characters:
<?xml version="1.0" encoding="utf-8"?>
Warning: /Stage[main]/Windows::Chocolatey/Chocolateysource[chocolatey]: Skipping because provider prefetch failed
Debug: /Stage[main]/Windows::Chocolatey/Chocolateysource[chocolatey]: Resource is being skipped, unscheduling all events
@andryua andryua changed the title REXML don't support UTF8 with BOM on Windows 11 REXML don't support UTF8 BOM on Windows 11 Jan 9, 2025
@naitoh
Contributor

naitoh commented Jan 9, 2025

I think REXML supports UTF-8 with BOM.

rexml/lib/rexml/source.rb

Lines 193 to 194 in b70388c

elsif @scanner.scan(/\xef\xbb\xbf/n)
detected_encoding = "UTF-8"

@andryua

Could you give us an XML file that can reproduce this problem?
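
For reference, a minimal sketch of the claim above: a string that starts with the UTF-8 BOM bytes `ef bb bf` should be accepted by `REXML::Document.new`. The sample XML here is a hypothetical stand-in, not the reporter's actual chocolatey.config.

```ruby
require "rexml/document"

# Hypothetical sample: UTF-8 BOM bytes followed by a tiny XML document.
bom = "\xEF\xBB\xBF".b
xml = bom + '<?xml version="1.0" encoding="utf-8"?><chocolatey><config/></chocolatey>'.b

# REXML detects the BOM and parses the rest as UTF-8.
doc = REXML::Document.new(xml)
root_name = doc.root.name
puts root_name
```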

@andryua
Author

andryua commented Jan 9, 2025

@andryua
Author

andryua commented Jan 9, 2025

If I install the latest version of Puppet and replace REXML 3.3.6 with an older version (3.2.6 or 3.2.5), everything works. If I replace it with 3.2.7 or newer, Puppet reports errors. (Worldwide UTF-8 support is disabled!)

@naitoh
Contributor

naitoh commented Jan 9, 2025

I can't seem to reproduce this in my m1 Mac environment...

$ ruby -v
ruby 3.3.4 (2024-07-09 revision be1089c8ec) [arm64-darwin22]
$ gem list rexml
*** LOCAL GEMS ***

rexml (3.3.6)
$ ruby -r "rexml/document" -e 'REXML::Document.new(File.open("./chocolatey.txt"))' 

@andryua
Author

andryua commented Jan 9, 2025

This happens on Windows 11 or 10, any build!

@andryua
Author

andryua commented Jan 10, 2025

And try File.read instead of File.open (that is what puppetlabs-chocolatey uses in its code).

@naitoh
Contributor

naitoh commented Jan 12, 2025

@andryua

I tried it on Windows 11 and it worked fine.
Do the following commands work in your environment?

>ruby -v
ruby 3.3.6 (2024-11-05 revision 75015d4c1f) [x64-mingw-ucrt]

>gem list rexml
rexml (3.3.6)

>ruby -r "rexml/document" -e 'puts REXML::XPath.each(REXML::Document.new(File.read(".\\Documents\\chocolatey.txt")),  "/chocolatey/config/add[@key]").inspect'
#<Enumerator: [<add key='cacheLocation' value='C:\ProgramData\chocolatey\cache' description='Cache location if not TEMP folder. Replaces `$env:TEMP` value for choco.exe process. It is highly recommended this be set to make Chocolatey more deterministic in cleanup.'/>, <add key='containsLegacyPackageInstalls' value='true'/>, <add key='commandExecutionTimeoutSeconds' value='2700' description='Default timeout for command execution. &apos;0&apos; for infinite.'/>, <add key='proxy' value='' description='Explicit proxy location.'/>, <add key='proxyUser' value='' description='Optional proxy user. Requires explicit proxy configured.'/>, <add key='proxyPassword' value='' description='Optional proxy password. Encrypted. Requires explicit proxy and proxyUser configured.'/>, <add key='webRequestTimeoutSeconds' value='30' description='Default timeout for web requests.'/>, <add key='proxyBypassList' value='' description='Optional proxy bypass list. Comma separated. Requires explicit proxy configured.'/>, <add key='proxyBypassOnLocal' value='true' description='Bypass proxy for local connections. Requires explicit proxy configured.'/>, <add key='upgradeAllExceptions' value='' description='A comma-separated list of package names that should not be upgraded when running `choco upgrade all&apos;. Defaults to empty.'/>, <add key='defaultTemplateName' value='' description='Default template name used when running &apos;choco new&apos; command.'/>, <add key='defaultPushSource' value='' description='Default source to push packages to when running &apos;choco push&apos; command.'/>]:each>

@andryua
Author

andryua commented Jan 13, 2025

OK, thanks! Maybe Puppet did something wrong.

@andryua
Author

andryua commented Jan 25, 2025

Hi!
I reopened this because if the read call includes encoding: 'bom|utf-8', everything works fine, but if we call read without the encoding parameter, we get BOM symbols before the XML file content and errors like in the first post. Please look at my screenshots (I attached screenshots showing the versions of Ruby and REXML and the code page of the system console in Windows 10 or 11). Thanks!

Image
Image
Image
P.S. Maybe the problem is cp866? I use the Ukrainian language in Windows.
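
A minimal sketch of the working variant described above, assuming a hypothetical sample file written to a temporary directory. Passing `encoding: "BOM|UTF-8"` asks Ruby to read the file as UTF-8 and drop a leading BOM, so the returned string starts directly with the XML declaration.

```ruby
require "tmpdir"

# Hypothetical sample content: UTF-8 BOM followed by an XML declaration.
content = "\xEF\xBB\xBF".b + '<?xml version="1.0" encoding="utf-8"?><chocolatey/>'.b

text = Dir.mktmpdir do |dir|
  path = File.join(dir, "sample.config")
  File.binwrite(path, content)

  # "BOM|UTF-8": treat the file as UTF-8 and strip a leading BOM if present.
  File.read(path, encoding: "BOM|UTF-8")
end

puts text.encoding
puts text.bytes.first(2).map { |b| b.to_s(16) }.join(" ")
```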

@tompng
Member

tompng commented Jan 25, 2025

Malformed XML: Content at the start of the document (got '')

'' can be created by this step.

encoding = [Encoding::IBM437, Encoding::IBM865].sample
bom = "\xef\xbb\xbf".b
bom.force_encoding(encoding).encode('utf-8').force_encoding(encoding).encode('utf-8')
#=> ""

So the data passed to REXML was already corrupted by being re-encoded several times with multiple encodings. I think this is not a bug in REXML.

Since the XML has encoding="utf-8", reading it as ASCII, binary, or ASCII-8BIT works fine, but reading it with any other wrong encoding corrupts the data. Corrupt data can't be repaired by re-encoding or any other process.

If your Encoding.default_external is not ASCII or ASCII-8BIT and does not match the file encoding, you need to specify the encoding when reading the file.
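
The same effect can be sketched for cp866, the code page mentioned by the reporter. This is only an illustration of the transcoding step, not code from Puppet or REXML: the three BOM bytes are misinterpreted as IBM866 characters and then converted to UTF-8, which yields eight unrelated bytes.

```ruby
# The three bytes of a UTF-8 BOM, as raw binary data.
bom = "\xEF\xBB\xBF".b

# Misinterpret the bytes as IBM866 (cp866) text and transcode them to UTF-8.
misread = bom.dup.force_encoding(Encoding::IBM866).encode(Encoding::UTF_8)

hex = misread.bytes.map { |b| b.to_s(16) }.join(" ")
puts hex  # prints "d1 8f e2 95 97 e2 94 90"
```

The result no longer starts with `ef bb bf`, so a parser has no way to recognize it as a BOM.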

@naitoh
Contributor

naitoh commented Jan 25, 2025

@tompng

So the data passed to REXML was already corrupted by being re-encoded several times with multiple encodings. I think this is not a bug in REXML.

Thanks!
I think so.

@andryua

Do the following commands work in your environment?

>ruby -r "rexml/document" -e 'puts REXML::XPath.each(REXML::Document.new(File.read(".\\Documents\\chocolatey.txt")),  "/chocolatey/config/add[@key]").inspect'

@andryua
Author

andryua commented Jan 27, 2025

Hi! I ran it on the working setup with the old versions of REXML and Ruby:

Image
Image
Image

I ran it after Puppet was upgraded, with the new (not the latest) REXML and Ruby:

Image

With REXML 3.2.5 (or 3.2.6) everything works fine; the error appears only if I upgrade to a newer version of REXML.

@naitoh
Contributor

naitoh commented Jan 27, 2025

@andryua
With the support of #184, an invalid string preceding an XML declaration is now treated as invalid XML.

A corrupted BOM is considered invalid XML because it is an invalid string.
So, this is probably why the latest REXML has an error.

The old REXML did not have this check, so it would have worked with a corrupted BOM.

Please let us know the results below to confirm that REXML is working correctly with an uncorrupted BOM in your environment(cp866).

>ruby -r "rexml/document" -e 'puts REXML::XPath.each(REXML::Document.new(File.read(".\\Documents\\chocolatey.txt")),  "/chocolatey/config/add[@key]").inspect'

If the above does not raise an error, there may be a problem on the Puppet agent side.
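
A minimal sketch of the check described above, using a hypothetical document whose XML declaration is preceded by the three characters a cp866 misread produces. Whether it parses depends on the installed REXML version, so the snippet only reports the outcome instead of assuming one.

```ruby
require "rexml/document"

# Hypothetical input: a corrupted BOM ("я╗┐") in front of the XML declaration.
corrupted = "я╗┐" + '<?xml version="1.0" encoding="utf-8"?><chocolatey/>'

outcome =
  begin
    REXML::Document.new(corrupted)
    :parsed
  rescue REXML::ParseException
    :rejected
  end

# Per this thread, older releases tolerate the leading text and
# 3.3.3 and later reject it.
puts "REXML #{REXML::VERSION}: #{outcome}"
```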

@andryua
Author

andryua commented Jan 27, 2025

>ruby -r "rexml/document" -e 'puts REXML::XPath.each(REXML::Document.new(File.read(".\\Documents\\chocolatey.txt")),  "/chocolatey/config/add[@key]").inspect'

The output on Puppet version 8.10.0 is:

Image

But I used this command:

>ruby -r "rexml/document" -e "puts REXML::XPath.each(REXML::Document.new(File.read('C:\\ProgramData\\chocolatey\\config\\chocolatey.config')),  '/chocolatey/config/add[@key]').inspect"

@andryua
Author

andryua commented Jan 27, 2025

One more thing:

REXML 3.2.5 and 3.2.6 work correctly, but 3.2.7 and newer do not.

@andryua
Author

andryua commented Jan 27, 2025

On the Puppet side I found only this (in the chocolatey provider, for example https://github.com/puppetlabs/puppetlabs-chocolatey/blob/main/lib/puppet/provider/chocolateyconfig/windows.rb):

 config = REXML::Document.new File.read(choco_config)

If I change File.read to File.open, everything works fine.
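
A rough sketch of that change, not the actual provider code: `load_config` is a hypothetical helper and the sample file is made up. Opening in binary mode (`"rb"`) is an added assumption here, chosen so that Ruby does not transcode the bytes before REXML sees them. The block form also closes the file once the document has been built.

```ruby
require "rexml/document"
require "tmpdir"

# Hypothetical helper: pass REXML an IO instead of an already-decoded String.
def load_config(path)
  File.open(path, "rb") { |io| REXML::Document.new(io) }
end

# Hypothetical sample file: UTF-8 BOM followed by a tiny XML document.
sample = "\xEF\xBB\xBF".b +
         '<?xml version="1.0" encoding="utf-8"?><chocolatey><config/></chocolatey>'.b

root_name = Dir.mktmpdir do |dir|
  path = File.join(dir, "chocolatey.config")
  File.binwrite(path, sample)
  load_config(path).root.name
end

puts root_name
```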

@andryua
Author

andryua commented Jan 27, 2025

So, what I have:

ruby 3.2.5, rexml 3.3.6 (puppet 8.10.0)
chocolatey.config - UTF-8 with BOM
cp866 - system encoding
Result: doesn't work!

ruby 3.2.4, rexml 3.2.5 (puppet 8.7.0)
chocolatey.config - UTF-8 with BOM
cp866 - system encoding
Result: all works fine

ruby 3.2.5, rexml 3.2.6 (rexml copied manually)
chocolatey.config - UTF-8 with BOM
cp866 - system encoding
Result: all works fine

If I copy in a REXML newer than 3.2.6: errors.

If I install a clean Puppet 8.10.0 (rexml 3.3.6 and ruby 3.2.5) and change the system encoding to 65001, it works fine; if I switch back to cp866, errors.

If, in the chocolatey provider (puppetlabs-chocolatey), I change

config = REXML::Document.new File.read(choco_config)

to

config = REXML::Document.new File.open(choco_config)

it works fine.

@naitoh
Contributor

naitoh commented Jan 27, 2025

@andryua

>ruby -r "rexml/document" -e 'puts REXML::XPath.each(REXML::Document.new(File.read(".\\Documents\\chocolatey.txt")),  "/chocolatey/config/add[@key]").inspect'

The output on Puppet version 8.10.0 is:

Image

But I used this command:

>ruby -r "rexml/document" -e "puts REXML::XPath.each(REXML::Document.new(File.read('C:\\ProgramData\\chocolatey\\config\\chocolatey.config')),  '/chocolatey/config/add[@key]').inspect"

Thanks.
I would like to know the REXML version of the above results.

One more thing:

REXML 3.2.5 and 3.2.6 work correctly, but 3.2.7 and newer do not.

I would like to know the error results with the Puppet agent using REXML 3.2.7.

@andryua
Author

andryua commented Jan 28, 2025

Result - the same as in first post

Error: Could not prefetch chocolateysource provider 'windows': Malformed XML: Content at the start of the document (got '')

@naitoh
Contributor

naitoh commented Jan 28, 2025

@andryua

Result - the same as in first post

Error: Could not prefetch chocolateysource provider 'windows': Malformed XML: Content at the start of the document (got '')

We need REXML 3.2.7 error messages to resolve this issue.

@andryua
Author

andryua commented Jan 28, 2025

I'm really sorry: not 3.2.7. Starting with 3.3.3, Puppet reports errors like in the screenshot below.

Image

@andryua
Author

andryua commented Jan 28, 2025

I found changes in baseparser.rb from 3.3.2 to 3.3.3 version (v3.3.2...v3.3.3).

3.3.2:
            if @tags.empty? and @have_root
              unless /\A\s*\z/.match?(text)
                raise ParseException.new("Malformed XML: Extra content at the end of the document (got '#{text}')", @source)
              end
3.3.3:
            if @tags.empty?
              unless /\A\s*\z/.match?(text)
                if @have_root
                  raise ParseException.new("Malformed XML: Extra content at the end of the document (got '#{text}')", @source)
                else
                  raise ParseException.new("Malformed XML: Content at the start of the document (got '#{text}')", @source)
                end
              end

I reverted only this block in the 3.3.3 version of the file to its 3.3.2 form, and everything works fine.

@naitoh
Contributor

naitoh commented Jan 28, 2025

@andryua

I'm really sorry: not 3.2.7. Starting with 3.3.3, Puppet reports errors like in the screenshot below.

OK
I see.

#231 (comment)

The output on Puppet version 8.10.0 is:
Image
But I used this command:

>ruby -r "rexml/document" -e "puts REXML::XPath.each(REXML::Document.new(File.read('C:\\ProgramData\\chocolatey\\config\\chocolatey.config')),  '/chocolatey/config/add[@key]').inspect"

Thanks. I would like to know the REXML version of the above results.

I would like to know the REXML version of the above results, please.

@andryua
Author

andryua commented Jan 29, 2025

I would like to know the REXML version of the above results, please.

3.3.6

@naitoh
Contributor

naitoh commented Jan 29, 2025

@andryua

I would like to know the REXML version of the above results, please.

3.3.6

OK
I see.

I have not seen any problems in your environment (cp866) using REXML and https://github.com/user-attachments/files/18328878/chocolatey.txt.
Therefore, I do not see any problem with REXML.

I found changes in baseparser.rb from 3.3.2 to 3.3.3 version (v3.3.2...v3.3.3).

Yes,
I have already commented on this change as follows

#231 (comment)

With the support of #184, an invalid string preceding an XML declaration is now treated as invalid XML.

A corrupted BOM is considered invalid XML because it is an invalid string. So, this is probably why the latest REXML has an error.

The old REXML did not have this check, so it would have worked with a corrupted BOM.

REXML 3.3.3 (and later) now processes only valid XML.

If you believe REXML is at fault, please provide reproduction code that uses only REXML and the XML file, without the Puppet agent.

If we can not reproduce it in our environment, we can not investigate it.

@andryua
Author

andryua commented Feb 4, 2025

Hi!

I am using this Ruby code:

require 'rexml/document'

def validate_xml(file_path)
  begin
    xml_content = File.read(file_path)

    doc = REXML::Document.new(xml_content)

    root = doc.root
    puts "Root element: #{root.name}"

    if root.elements['sources']
      puts "Element 'sources' was find!"
    else
      puts "Element 'sources' didn't find."
    end

    puts "XML file is valid."
  rescue REXML::ParseException => e
    puts "Error parsing of XML file: #{e.message}"
  rescue => e
    puts "Error: #{e.message}"
  end
end

validate_xml('c:\\ProgramData\\chocolatey\\config\\chocolatey.config')

and got this result in cmd (same result in Cygwin bash and PowerShell):

C:\Program Files\Puppet Labs\Puppet\puppet\bin>ruby.exe test.rb
Root element: chocolatey
Element 'sources' was find!
XML file is valid.

C:\Program Files\Puppet Labs\Puppet\puppet\bin>ruby.exe --version
ruby 3.2.5 (2024-07-26 revision 31d0f1a2e7) [x64-mingw32]

C:\Program Files\Puppet Labs\Puppet\puppet\bin>gem.bat which rexml
C:/Program Files/Puppet Labs/Puppet/puppet/lib/ruby/gems/3.2.0/gems/rexml-3.3.6/lib/rexml.rb

So everything is OK. But I don't understand why Puppet reports an error on the same file...

@andryua
Author

andryua commented Feb 5, 2025

And please look at this:

puppetlabs/puppet#9532 (comment)

@tompng
Member

tompng commented Feb 5, 2025

Yes, it is reproducible. And it is caused by File.read with wrong encoding.
See puppetlabs/puppet#9532 (comment)

If a file is read with the wrong encoding, encoding conversion may corrupt its content.
In general, a corrupt file can't be parsed correctly, regardless of whether the corruption was caused by encoding conversion or by something else.

Here's an example with JSON. This applies not only to the BOM but to any other non-ASCII bytes.

# coding: utf-8
require 'json'
File.binwrite 'a.json', {key: "ࠁ\""}.to_json
corrupt_json = File.read('a.json', encoding:'sjis:utf-8')
p File.binread('a.json').bytes == corrupt_json.bytes #=> false
puts corrupt_json #=> {"key":"燿―""} # " is not escaped, It's an invalid JSON
JSON.parse corrupt_json #=> JSON::ParserError

Even if your XML doesn't contain any multibyte character except the BOM, you can think of it this way: the strict parsing in REXML found a potential bug in Puppet, or just rediscovered that Puppet doesn't support your encoding configuration.

So I think there is no need to discuss the File.read encoding conversion issue here anymore.

@naitoh
Contributor

naitoh commented Feb 6, 2025

@andryua
I see.

$ ruby -Ecp866:utf-8 -rrexml -e "REXML::Document.new File.open('chocolatey.txt')"
#=> No Error
$ ruby -Ecp866:utf-8 -rrexml -e "REXML::Document.new File.read('chocolatey.txt')"
/Users/naitoh/.rbenv/versions/3.4.1/lib/ruby/gems/3.4.0/gems/rexml-3.4.1/lib/rexml/parsers/baseparser.rb:524:in 'REXML::Parsers::BaseParser#pull_event': Malformed XML: Content at the start of the document (got 'я╗┐') (REXML::ParseException)
Line: 1
Position: 46
Last 80 unconsumed characters:
<?xml version="1.0" encoding="utf-8"?>
	from /Users/naitoh/.rbenv/versions/3.4.1/lib/ruby/gems/3.4.0/gems/rexml-3.4.1/lib/rexml/parsers/baseparser.rb:247:in 'REXML::Parsers::BaseParser#pull'
	from /Users/naitoh/.rbenv/versions/3.4.1/lib/ruby/gems/3.4.0/gems/rexml-3.4.1/lib/rexml/parsers/treeparser.rb:21:in 'REXML::Parsers::TreeParser#parse'
	from /Users/naitoh/.rbenv/versions/3.4.1/lib/ruby/gems/3.4.0/gems/rexml-3.4.1/lib/rexml/document.rb:452:in 'REXML::Document#build'
	from /Users/naitoh/.rbenv/versions/3.4.1/lib/ruby/gems/3.4.0/gems/rexml-3.4.1/lib/rexml/document.rb:103:in 'REXML::Document#initialize'
	from -e:1:in 'Class#new'
	from -e:1:in '<main>'

Expected string

$ hexdump chocolatey.txt  |head -n 1
0000000 bbef 3cbf 783f 6c6d 7620 7265 6973 6e6f

#=> ef bb bf 3c 3f 78 6d 6c ..

The BOM for UTF-8 is ef bb bf.

OK Case

$ ruby -e "puts File.read('chocolatey.txt').bytes.map { |b| b.to_s(16) }[0..7]"
ef
bb
bf
3c
3f
78
6d
6c

Broken Case (with Encoding.default_external and Encoding.default_internal)

$ ruby -Ecp866:utf-8 -e "puts File.read('chocolatey.txt').bytes.map { |b| b.to_s(16) }[0..7]"
d1
8f
e2
95
97
e2
94
90

d1 8f e2 is not a UTF-8 BOM.

Encoding.default_external and Encoding.default_internal are causing File.read to pass corrupted strings to REXML.
Use File.open when calling REXML on the Puppet agent side.
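
The OK and broken cases above can also be sketched in a single script by setting the default encodings in code instead of passing `-Ecp866:utf-8`. The sample file is a hypothetical stand-in for chocolatey.txt. The defaults are restored afterwards so that the rest of the process is unaffected.

```ruby
require "tmpdir"

# Hypothetical sample content: UTF-8 BOM followed by an XML declaration.
content = "\xEF\xBB\xBF".b + '<?xml version="1.0" encoding="utf-8"?>'.b

saved_external = Encoding.default_external
saved_internal = Encoding.default_internal
implicit = explicit = nil

begin
  # Same effect as running ruby with -Ecp866:utf-8.
  Encoding.default_external = Encoding::IBM866
  Encoding.default_internal = Encoding::UTF_8

  Dir.mktmpdir do |dir|
    path = File.join(dir, "sample.xml")
    File.binwrite(path, content)

    # No encoding given: bytes are read as cp866 and transcoded to UTF-8.
    implicit = File.read(path).bytes.first(3)
    # Encoding given: bytes are kept as they are on disk.
    explicit = File.read(path, encoding: Encoding::UTF_8).bytes.first(3)
  end
ensure
  Encoding.default_external = saved_external
  Encoding.default_internal = saved_internal
end

puts implicit.map { |b| b.to_s(16) }.join(" ")  # d1 8f e2
puts explicit.map { |b| b.to_s(16) }.join(" ")  # ef bb bf
```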

File.read case #=> Error

$ ruby -Ecp866:utf-8 -r "rexml/document" -e 'puts REXML::XPath.each(REXML::Document.new(File.read("chocolatey.txt")),  "/chocolatey/config/add[@key]").inspect'
/Users/naitoh/.rbenv/versions/3.4.1/lib/ruby/gems/3.4.0/gems/rexml-3.4.1/lib/rexml/parsers/baseparser.rb:524:in 'REXML::Parsers::BaseParser#pull_event': Malformed XML: Content at the start of the document (got 'я╗┐') (REXML::ParseException)
Line: 1
Position: 46
Last 80 unconsumed characters:
<?xml version="1.0" encoding="utf-8"?>
	from /Users/naitoh/.rbenv/versions/3.4.1/lib/ruby/gems/3.4.0/gems/rexml-3.4.1/lib/rexml/parsers/baseparser.rb:247:in 'REXML::Parsers::BaseParser#pull'
	from /Users/naitoh/.rbenv/versions/3.4.1/lib/ruby/gems/3.4.0/gems/rexml-3.4.1/lib/rexml/parsers/treeparser.rb:21:in 'REXML::Parsers::TreeParser#parse'
	from /Users/naitoh/.rbenv/versions/3.4.1/lib/ruby/gems/3.4.0/gems/rexml-3.4.1/lib/rexml/document.rb:452:in 'REXML::Document#build'
	from /Users/naitoh/.rbenv/versions/3.4.1/lib/ruby/gems/3.4.0/gems/rexml-3.4.1/lib/rexml/document.rb:103:in 'REXML::Document#initialize'
	from -e:1:in 'Class#new'
	from -e:1:in '<main>'

File.open case #=> No Error

$ ruby -Ecp866:utf-8 -r "rexml/document" -e 'puts REXML::XPath.each(REXML::Document.new(File.open("chocolatey.txt")),  "/chocolatey/config/add[@key]").inspect'
#<Enumerator: [<add key='cacheLocation' value='C:\ProgramData\chocolatey\cache' description='Cache location if not TEMP folder. Replaces `$env:TEMP` value for choco.exe process. It is highly recommended this be set to make Chocolatey more deterministic in cleanup.'/>, <add key='containsLegacyPackageInstalls' value='true'/>, <add key='commandExecutionTimeoutSeconds' value='2700' description='Default timeout for command execution. &apos;0&apos; for infinite.'/>, <add key='proxy' value='' description='Explicit proxy location.'/>, <add key='proxyUser' value='' description='Optional proxy user. Requires explicit proxy configured.'/>, <add key='proxyPassword' value='' description='Optional proxy password. Encrypted. Requires explicit proxy and proxyUser configured.'/>, <add key='webRequestTimeoutSeconds' value='30' description='Default timeout for web requests.'/>, <add key='proxyBypassList' value='' description='Optional proxy bypass list. Comma separated. Requires explicit proxy configured.'/>, <add key='proxyBypassOnLocal' value='true' description='Bypass proxy for local connections. Requires explicit proxy configured.'/>, <add key='upgradeAllExceptions' value='' description='A comma-separated list of package names that should not be upgraded when running `choco upgrade all&apos;. Defaults to empty.'/>, <add key='defaultTemplateName' value='' description='Default template name used when running &apos;choco new&apos; command.'/>, <add key='defaultPushSource' value='' description='Default source to push packages to when running &apos;choco push&apos; command.'/>]:each>

@andryua
Author

andryua commented Feb 6, 2025

Thanks a lot! I sent them a pull request changing File.read to File.open, but I'm still waiting for it to be approved...
Thanks again!!!

@andryua andryua closed this as completed Feb 6, 2025
@gep13

gep13 commented Feb 6, 2025

@naitoh 👋 Chocolatey maintainer here 😄

I have been following along with interest this issue after it was first reported in our repository.

I am a little confused by what is going on here though, so I am hoping that someone might be able to shed some light on this....

Is there something wrong with the way that the chocolatey.config file is encoded?

Or is it the case that something is wrong with rexml?

Why are the File.read and the File.open not doing the same thing? Why is one reading the BOM in a different way to the other one?

I am glad that there is a fix, and Puppet can switch to using File.open, but if File.open was working in a previous version of rexml, why it is no longer working?

Apologies if this is a naive question, as I know nothing about file encoding 😄

@naitoh
Contributor

naitoh commented Feb 7, 2025

@gep13

Is there something wrong with the way that the chocolatey.config file is encoded?

Yes.

#231 (comment)
It seems that chocolatey.config is a valid XML but Puppet reads it with invalid encoding conversion.

  • file_read.rb
require "rexml/document"

puts "Encoding.default_external: [#{Encoding.default_external}]"
puts "Encoding.default_internal: [#{Encoding.default_internal}]"

puts ""
puts "== A-1. [File.read('chocolatey.txt', encoding: Encoding::UTF_8)] =="
xml_string = File.read('chocolatey.txt', encoding: Encoding::UTF_8)
puts "xml_string.class: #{xml_string.class}"
puts "xml_string: #{xml_string.bytes.map { |b| b.to_s(16) }[0..3].join(' ')}"
puts REXML::XPath.each(REXML::Document.new(xml_string),  "/chocolatey/config/add[@key]").first

puts ""
puts "== A-2. [File.read('chocolatey.txt')] =="
xml_string = File.read('chocolatey.txt')
puts "xml_string.class: #{xml_string.class}"
puts "xml_string: #{xml_string.bytes.map { |b| b.to_s(16) }[0..3].join(' ')}"
puts REXML::XPath.each(REXML::Document.new(xml_string),  "/chocolatey/config/add[@key]").first
$ ruby  -Ecp866:utf-8 file_read.rb 
Encoding.default_external: [IBM866]
Encoding.default_internal: [UTF-8]

== A-1. [File.read('chocolatey.txt', encoding: Encoding::UTF_8)] ==
xml_string.class: String
xml_string: ef bb bf 3c
<add description='Cache location if not TEMP folder. Replaces `$env:TEMP` value for choco.exe process. It is highly recommended this be set to make Chocolatey more deterministic in cleanup.' key='cacheLocation' value='C:\ProgramData\chocolatey\cache'/>

== A-2. [File.read('chocolatey.txt')] ==
xml_string.class: String
xml_string: d1 8f e2 95
/Users/naitoh/.rbenv/versions/3.4.1/lib/ruby/gems/3.4.0/gems/rexml-3.4.1/lib/rexml/parsers/baseparser.rb:524:in 'REXML::Parsers::BaseParser#pull_event': Malformed XML: Content at the start of the document (got 'я╗┐') (REXML::ParseException)
Line: 1
Position: 46
Last 80 unconsumed characters:
<?xml version="1.0" encoding="utf-8"?>
	from /Users/naitoh/.rbenv/versions/3.4.1/lib/ruby/gems/3.4.0/gems/rexml-3.4.1/lib/rexml/parsers/baseparser.rb:247:in 'REXML::Parsers::BaseParser#pull'
	from /Users/naitoh/.rbenv/versions/3.4.1/lib/ruby/gems/3.4.0/gems/rexml-3.4.1/lib/rexml/parsers/treeparser.rb:21:in 'REXML::Parsers::TreeParser#parse'
	from /Users/naitoh/.rbenv/versions/3.4.1/lib/ruby/gems/3.4.0/gems/rexml-3.4.1/lib/rexml/document.rb:452:in 'REXML::Document#build'
	from /Users/naitoh/.rbenv/versions/3.4.1/lib/ruby/gems/3.4.0/gems/rexml-3.4.1/lib/rexml/document.rb:103:in 'REXML::Document#initialize'
	from file_read.rb:18:in 'Class#new'
	from file_read.rb:18:in '<main>'

In A-2 above, a broken string (xml_string = d1 8f e2 95) is passed to REXML. (Expected string is ef bb bf 3c)

Or is it the case that something is wrong with rexml?

No.

Why are the File.read and the File.open not doing the same thing? Why is one reading the BOM in a different way to the other one?

File.read and file = File.open & file.read have the same behavior.
This is an encoding processing problem related to the read method.

  • file_open.rb
require "rexml/document"

puts "Encoding.default_external: [#{Encoding.default_external}]"
puts "Encoding.default_internal: [#{Encoding.default_internal}]"

puts ""
puts "== B-1. [File.open('chocolatey.txt', encoding: Encoding::UTF_8) & read] =="
xml_file = File.open('chocolatey.txt', encoding: Encoding::UTF_8)
puts "xml_file.class: #{xml_file.class}"
xml_string = xml_file.read
puts "xml_string.class: #{xml_string.class}"
puts "xml_string: #{xml_string.bytes.map { |b| b.to_s(16) }[0..3].join(' ')}"
puts REXML::XPath.each(REXML::Document.new(xml_string),  "/chocolatey/config/add[@key]").first


puts ""
puts "== B-2. [File.open('chocolatey.txt' & read] =="
xml_file = File.open('chocolatey.txt')
puts "xml_file.class: #{xml_file.class}"
xml_string = xml_file.read
puts "xml_string.class: #{xml_string.class}"
puts "xml_string: #{xml_string.bytes.map { |b| b.to_s(16) }[0..3].join(' ')}"
puts REXML::XPath.each(REXML::Document.new(xml_string),  "/chocolatey/config/add[@key]").first
$ ruby  -Ecp866:utf-8 file_open.rb
Encoding.default_external: [IBM866]
Encoding.default_internal: [UTF-8]

== B-1. [File.open('chocolatey.txt', encoding: Encoding::UTF_8) & read] ==
xml_file.class: File
xml_string.class: String
xml_string: ef bb bf 3c
<add description='Cache location if not TEMP folder. Replaces `$env:TEMP` value for choco.exe process. It is highly recommended this be set to make Chocolatey more deterministic in cleanup.' key='cacheLocation' value='C:\ProgramData\chocolatey\cache'/>

== B-2. [File.open('chocolatey.txt' & read] ==
xml_file.class: File
xml_string.class: String
xml_string: d1 8f e2 95
/Users/naitoh/.rbenv/versions/3.4.1/lib/ruby/gems/3.4.0/gems/rexml-3.4.1/lib/rexml/parsers/baseparser.rb:524:in 'REXML::Parsers::BaseParser#pull_event': Malformed XML: Content at the start of the document (got 'я╗┐') (REXML::ParseException)
Line: 1
Position: 46
Last 80 unconsumed characters:
<?xml version="1.0" encoding="utf-8"?>
	from /Users/naitoh/.rbenv/versions/3.4.1/lib/ruby/gems/3.4.0/gems/rexml-3.4.1/lib/rexml/parsers/baseparser.rb:247:in 'REXML::Parsers::BaseParser#pull'
	from /Users/naitoh/.rbenv/versions/3.4.1/lib/ruby/gems/3.4.0/gems/rexml-3.4.1/lib/rexml/parsers/treeparser.rb:21:in 'REXML::Parsers::TreeParser#parse'
	from /Users/naitoh/.rbenv/versions/3.4.1/lib/ruby/gems/3.4.0/gems/rexml-3.4.1/lib/rexml/document.rb:452:in 'REXML::Document#build'
	from /Users/naitoh/.rbenv/versions/3.4.1/lib/ruby/gems/3.4.0/gems/rexml-3.4.1/lib/rexml/document.rb:103:in 'REXML::Document#initialize'
	from file_open.rb:23:in 'Class#new'
	from file_open.rb:23:in '<main>'

I am glad that there is a fix, and Puppet can switch to using File.open, but if File.open was working in a previous version of rexml, why it is no longer working?

With the support of #184, an invalid string preceding an XML declaration is now treated as invalid XML.

A corrupted BOM is considered invalid XML because it is an invalid string.
So this is why the error occurs in the latest REXML.

The old REXML did not have this check, so it would have worked with a corrupted BOM.

Use File.open('chocolatey.txt') or File.read('chocolatey.txt', encoding: Encoding::UTF_8) when calling REXML, please.
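
A minimal sketch of the second option, assuming a hypothetical helper name and a made-up sample file. The file is read with an explicit UTF-8 encoding, so the BOM bytes reach REXML unchanged and REXML can recognize them.

```ruby
require "rexml/document"
require "tmpdir"

# Hypothetical helper: read the file as UTF-8 regardless of the defaults.
def read_xml_as_utf8(path)
  REXML::Document.new(File.read(path, encoding: Encoding::UTF_8))
end

# Hypothetical sample file: UTF-8 BOM followed by a tiny XML document.
sample = "\xEF\xBB\xBF".b +
         '<?xml version="1.0" encoding="utf-8"?><chocolatey><config/></chocolatey>'.b

utf8_root_name = Dir.mktmpdir do |dir|
  path = File.join(dir, "chocolatey.txt")
  File.binwrite(path, sample)
  read_xml_as_utf8(path).root.name
end

puts utf8_root_name
```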

@kou
Member

kou commented Feb 7, 2025

Is there something wrong with the way that the chocolatey.config file is encoded?

Yes.

Really? It seems that chocolatey.config is a valid XML but Puppet reads it with invalid encoding conversion.

@naitoh
Contributor

naitoh commented Feb 7, 2025

Is there something wrong with the way that the chocolatey.config file is encoded?

Yes.

Really? It seems that chocolatey.config is a valid XML but Puppet reads it with invalid encoding conversion.

Oh, sorry.
I said it wrong.

chocolatey.config is a valid XML.

Yes

but Puppet reads it with invalid encoding conversion.

Yes.

@kou Thanks.

@gep13

gep13 commented Feb 7, 2025

@naitoh thank you for the detailed explanation, I appreciate it!
