Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

encoding/xml: Decoder does not reject xml-ProcInst preceeded by other data #65691

Open
Tracked by #68293
Merovius opened this issue Feb 13, 2024 · 11 comments
Open
Tracked by #68293
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@Merovius
Copy link
Contributor

Go version

go version go1.22.0 linux/amd64

Output of go env in your module/workspace:

GO111MODULE=''
GOARCH='amd64'
GOBIN=''
GOCACHE='/home/mero/.cache/go-build'
GOENV='/home/mero/.config/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFLAGS=''
GOHOSTARCH='amd64'
GOHOSTOS='linux'
GOINSECURE=''
GOMODCACHE='/home/mero/pkg/mod'
GONOPROXY='*.edg.ag,*.vwd.com,bitbucket.org/vauwede/*,infrontfinance.dev/*,iqag.dev/*'
GONOSUMDB='*.edg.ag,*.vwd.com,bitbucket.org/vauwede/*,infrontfinance.dev/*,iqag.dev/*'
GOOS='linux'
GOPATH='/home/mero'
GOPRIVATE='*.edg.ag,*.vwd.com,bitbucket.org/vauwede/*,infrontfinance.dev/*,iqag.dev/*'
GOPROXY='https://proxy.golang.org,direct'
GOROOT='/home/mero/go'
GOSUMDB='sum.golang.org'
GOTMPDIR=''
GOTOOLCHAIN='local'
GOTOOLDIR='/home/mero/go/pkg/tool/linux_amd64'
GOVCS=''
GOVERSION='go1.22.0'
GCCGO='gccgo'
GOAMD64='v1'
AR='ar'
CC='gcc'
CXX='g++'
CGO_ENABLED='1'
GOMOD='/home/mero/src/r/priips-import/go.mod'
GOWORK=''
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
PKG_CONFIG='pkg-config'
GOGCCFLAGS='-fPIC -m64 -pthread -Wl,--no-gc-sections -fmessage-length=0 -ffile-prefix-map=/tmp/go-build194995846=/tmp/go-build -gno-record-gcc-switches'

What did you do?

Playground link:

package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

func main() {
	s := ` <?xml version="1.0"?><!DOCTYPE x><x></x>` // note the space
	d := xml.NewDecoder(strings.NewReader(s))
	for {
		t, err := d.Token()
		fmt.Printf("%#v, %v\n", t, err)
		if err == io.EOF {
			break
		}
	}

	fmt.Println()
	fmt.Println(xml.Unmarshal([]byte(s), new(X)))
}

type X struct{}

What did you see happen?

xml.CharData{0x20}, <nil>
xml.ProcInst{Target:"xml", Inst:[]uint8{0x76, 0x65, 0x72, 0x73, 0x69, 0x6f, 0x6e, 0x3d, 0x22, 0x31, 0x2e, 0x30, 0x22}}, <nil>
xml.Directive{0x44, 0x4f, 0x43, 0x54, 0x59, 0x50, 0x45, 0x20, 0x78}, <nil>
xml.StartElement{Name:xml.Name{Space:"", Local:"x"}, Attr:[]xml.Attr{}}, <nil>
xml.EndElement{Name:xml.Name{Space:"", Local:"x"}}, <nil>
<nil>, EOF

<nil>

What did you expect to see?

xml.CharData{0x20}, <nil>
<nil>, invalid XML: xml declaration is not the first token

invalid XML: xml declaration is not the first token

The specification tells us that a valid XML document must not have whitespace before the <?xml …?> instruction (if present):

document ::= prolog element Misc*
prolog ::= XMLDecl? Misc* (doctypedecl Misc*)?

However, both the Decoder and Unmarshal accept an input with leading whitespace.

This tripped us up, because we use the Decoder to convert an XML document into a token stream, manipulate that and then use an Encoder.EncodeToken to translate it back into XML. This broke when fed such an invalid input, because the Encoder checks this invariant (it returns an error), but the Decoder doesn't: https://go.dev/play/p/Gv89nolZNtM

@Merovius
Copy link
Contributor Author

Note: this also happens with Decoder.Strict == true: https://go.dev/play/p/bRhbEqXPdTr

@thanm thanm added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Feb 13, 2024
@thanm
Copy link
Contributor

thanm commented Feb 13, 2024

@rsc per owners

@thanm
Copy link
Contributor

thanm commented Feb 13, 2024

Thanks for the report. Would you be willing to send a CL?

@Merovius
Copy link
Contributor Author

Sure

@Merovius Merovius changed the title encoding/xml: Decoder does not reject xml-ProcInst preceeded by whitespace encoding/xml: Decoder does not reject xml-ProcInst preceeded by other data Feb 14, 2024
@Merovius
Copy link
Contributor Author

This doesn't just affect whitespace, but also other tokens, which is definitely not correct: https://go.dev/play/p/7cfsv_Ez5UX

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/564035 mentions this issue: encoding/xml: reject XML declaration after start of document

@nkitchen
Copy link

The fix for this issue depends on an implicit assumption that the input stream of a Decoder contains only 1 XML document. That expectation is not expressed by the documentation of the package. Should it be a valid use case to use a single Decoder to parse the concatenation of multiple documents?

@Merovius
Copy link
Contributor Author

@nkitchen I'm not sure that's the only way in which the Decoder assumes that. Casually skimming the code, it seems to at least also assume it with namespace-handling. And note that the Encoder also assumes that there is a single document. Personally, I think it's a fair assumption (the behavior of json.Decoder always stroke me as kind of strange).

Long-term we probably want an encoding/xml/v2, which, like the proposed encoding/json/v2 really separates the tokenization and validation/unmarshalling steps.

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/570175 mentions this issue: Revert "encoding/xml: reject XML declaration after start of document"

@rsc
Copy link
Contributor

rsc commented Mar 8, 2024

See my comments on CL 570175. At the least this change requires a proposal and a GODEBUG for compatibility.

@rsc rsc reopened this Mar 8, 2024
gopherbot pushed a commit that referenced this issue Mar 8, 2024
This reverts commit 8a0fbd7.

Reason for revert: Breaking real-world tests inside Google,
which means it probably breaks real-world tests outside Google.

One instance I have seen is a <!-- --> comment (often a copyright notice) before the procinst.

Another test checks that a canonicalizer can handle a test input that simply has procinsts mid-XML.

XML is full of contradictions, XML implementations more so. If we are going to start being picky, that probably needs to be controlled by a GODEBUG (and a proposal).

For #65691 (will reopen manually).

Change-Id: Ib52d0944b1478e71744a2a35b271fdf7e1c972ca
Reviewed-on: https://go-review.googlesource.com/c/go/+/570175
Reviewed-by: Than McIntosh <thanm@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Auto-Submit: Russ Cox <rsc@golang.org>
@dmitshur dmitshur added this to the Backlog milestone May 23, 2024
@DemiMarie
Copy link
Contributor

Long-term we probably want an encoding/xml/v2, which, like the proposed encoding/json/v2 really separates the tokenization and validation/unmarshalling steps.

I think that after this encoding/xml should be deprecated.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Projects
None yet
Development

No branches or pull requests

7 participants