cmd/go,os: build cache checksum errors in x/tools/cmd/callgraph.TestCallgraph on windows/arm64 #50706

Open
bcmills opened this issue Jan 20, 2022 · 16 comments
Labels
arch-arm64, NeedsInvestigation, OS-Windows, Tools
Milestone

Comments

@bcmills
Contributor

bcmills commented Jan 20, 2022

--- FAIL: TestCallgraph (11.99s)
    main_test.go:85: err: exit status 1: stderr: go build pkg: loading compiled Go files from cache: reading srcfiles list: cache entry not found: bad checksum
        
    main_test.go:100: got:
         <root> --> pkg.init
        pkg.main2 --> (pkg.D).f
        pkg.main --> pkg.main2
        pkg.main --> (pkg.C).f
        <root> --> pkg.main
        
    main_test.go:100: got:
         (*os.File).setDeadline --> (*os.File).checkValid
        (time.Time).IsZero --> (*time.Time).sec
        (time.Time).IsZero --> (*time.Time).nsec
        internal/poll.setDeadlineImpl --> (time.Time).IsZero
…
FAIL
FAIL	golang.org/x/tools/cmd/callgraph	12.869s

greplogs --dashboard -md -l -e 'reading srcfiles list: cache entry not found: bad checksum' --since=2021-01-01

2022-01-19T20:29:36-7c251d6-9de1ac6/windows-arm64-10
2021-11-02T15:54:27-058ed05-c3cb1ec/windows-arm64-10
2021-11-01T13:50:47-513e3fb-4a84298/windows-arm64-10
2021-10-14T17:38:39-e69ba9d-011fd00/windows-arm64-10
2021-09-14T02:53:17-384e5da-ee91bb8/windows-arm64-10

@gopherbot gopherbot added the Tools label Jan 20, 2022
@gopherbot gopherbot added this to the Unreleased milestone Jan 20, 2022
@bcmills
Contributor Author

bcmills commented Jan 20, 2022

The "loading compiled Go files from cache" error string is a hapax legomenon in the Go project; it definitely comes from here, in cmd/go/internal/work:
https://cs.opensource.google/go/go/+/master:src/cmd/go/internal/work/exec.go;l=739;drc=2580d0e08d5e9f979b943758d3c49877fb2324cb
The "reading srcfiles list" prefix comes from here:
https://cs.opensource.google/go/go/+/master:src/cmd/go/internal/work/exec.go;l=1006;drc=master

The error appears to indicate file corruption in the cmd/go build cache, but I don't have any theories as to how that corruption is occurring or why it seems to only affect this one test on this one builder, and the test understandably doesn't provide much detail on the sequence or timing of the go invocations it is running.

Given the failure mode, I think the bug is more likely in os, syscall, or cmd/go itself than in x/tools/cmd/callgraph. windows/arm64 is not a first-class port and lacks a longtest builder, so it may be that x/tools/cmd/callgraph is incidentally triggering an underlying bug in an interaction that is being skipped (or isn't covered at all) in the os and/or cmd/go tests.

We are also running a relatively old Windows 10 build (#48946, CC @golang/release, @zx2c4), so I can't rule out a bug in the underlying platform either.

@bcmills
Contributor Author

bcmills commented Jan 20, 2022

This is a release-blocker via #11811, but given that this is not a first-class port and appears to be a platform-specific bug affecting only one test, I plan to add a test skip for this specific builder in x/tools/cmd/callgraph.TestCallgraph and then move this issue to the Backlog without investigating further.

If we also observe this failure mode on the new windows-arm64-11 builder once that is up and running, and/or if we upgrade windows/arm64 to a first-class port, we can reprioritize an investigation.

@bcmills bcmills modified the milestones: Unreleased, Go1.18 Jan 20, 2022
@bcmills bcmills changed the title from "x/tools/cmd/callgraph: TestCallgraph failures with "cache entry not found: bad checksum" on windows-arm64-10" to "cmd/go,os: build cache checksum errors in x/tools/cmd/callgraph.TestCallgraph on windows-arm64-10" Jan 20, 2022
@gopherbot
Contributor

Change https://golang.org/cl/379734 mentions this issue: cmd/callgraph: skip TestCallgraph on the windows-arm64-10 builder

@heschi heschi added the NeedsFix label Jan 20, 2022
@rsc
Contributor

rsc commented Jan 21, 2022

The 'bad checksum' means we read a file that was named for a sha256 hash and the content did not match that sha256.

@rsc
Contributor

rsc commented Jan 21, 2022

The fact that this is only windows/arm64 and that we've seen absolutely no mentions of it on other systems or in other bug reports makes me feel okay with this not being a release-blocker. If there really is corruption, the content-addressed and checksum-checked nature of the cache means that the system is either failstop or works correctly. So far we are getting no reports of failstop other than this one.

@bcmills
Contributor Author

bcmills commented Jan 21, 2022

I agree, but the failure rate for TestCallgraph is high enough that I think we should at least add that skip in the interim to avoid masking other failures on the builders.

@rsc
Contributor

rsc commented Jan 23, 2022

t.Skips are always OK in my book.
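
For readers unfamiliar with builder-specific skips, the following is a rough sketch of the idea, not the code from the CL. It assumes the builders expose their name through the `GO_BUILDER_NAME` environment variable; `shouldSkip` and `flakyBuilder` are hypothetical names for this example.

```go
package main

import (
	"fmt"
	"os"
)

// flakyBuilder is the builder named in this thread.
const flakyBuilder = "windows-arm64-10"

// shouldSkip reports whether the test should be skipped on the named builder.
// It matches one builder only, not the whole GOOS/GOARCH pair.
func shouldSkip(builder string) bool {
	return builder == flakyBuilder
}

func main() {
	// Assumption: the builder name is exported as GO_BUILDER_NAME.
	b := os.Getenv("GO_BUILDER_NAME")
	fmt.Printf("builder=%q skip=%v\n", b, shouldSkip(b))

	// Inside a test, the same predicate would guard a skip, for example:
	//   if shouldSkip(os.Getenv("GO_BUILDER_NAME")) {
	//       t.Skip("skipping on this builder; see golang/go#50706")
	//   }
}
```

Keying the skip on the builder name rather than on the platform keeps the test running on any other windows/arm64 machines, which is what makes later failures there informative.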

gopherbot pushed a commit to golang/tools that referenced this issue Jan 24, 2022
We don't know whether this failure is due to a Go bug or a platform
bug, so we'll skip it on the one builder to reduce noise, but not the
GOOS/GOARCH as a whole. If we do not observe failures on other
windows/arm64 builders, we can perhaps chalk it up to a platform bug.
If we do observe failures on other builders, then we'll have more data
to investigate with.

For golang/go#50706

Change-Id: I52511dd4a5cff80953823d9cf901975ff4657457
Reviewed-on: https://go-review.googlesource.com/c/tools/+/379734
Trust: Bryan Mills <bcmills@google.com>
Reviewed-by: Daniel Martí <mvdan@mvdan.cc>
Trust: Daniel Martí <mvdan@mvdan.cc>
@bcmills bcmills modified the milestones: Go1.18, Backlog Jan 26, 2022
@gopherbot
Contributor

Change https://golang.org/cl/381314 mentions this issue: cmd/go: cache debugging

@rsc
Contributor

rsc commented Jan 27, 2022

I uploaded https://go-review.googlesource.com/c/go/+/381314 just to have around if we need to patch it in to investigate this further. No intent to submit it.

@bcmills
Contributor Author

bcmills commented Feb 24, 2022

> If we also observe this failure mode on the new windows-arm64-11 builder once that is up and running, and/or if we upgrade windows/arm64 to a first-class port, we can reprioritize an investigation.

Now observed on windows-arm64-11 as well:

greplogs --dashboard -md -l -e 'reading srcfiles list: cache entry not found: bad checksum' --since=2022-01-20

2022-02-17T17:36:57-cda4201-eaf0405/windows-arm64-11

@bcmills bcmills changed the title from "cmd/go,os: build cache checksum errors in x/tools/cmd/callgraph.TestCallgraph on windows-arm64-10" to "cmd/go,os: build cache checksum errors in x/tools/cmd/callgraph.TestCallgraph on windows/arm64" Feb 24, 2022
@bcmills bcmills added the NeedsInvestigation label Feb 24, 2022
@gopherbot gopherbot removed the NeedsFix label Feb 24, 2022
@bcmills
Contributor Author

bcmills commented Apr 1, 2022

A couple more. Whatever the cause, this is not fixed in Windows 11.

greplogs --dashboard -md -l -e 'reading srcfiles list: cache entry not found: bad checksum' --since=2022-02-24

2022-04-01T20:25:27-153e30b-32ff9b5/windows-arm64-11
2022-04-01T17:19:22-cda13e2-df89f2b/windows-arm64-11

@gopherbot
Contributor

Change https://go.dev/cl/397996 mentions this issue: cmd/callgraph: expand windows/arm64 skip to the whole platform

gopherbot pushed a commit to golang/tools that referenced this issue Apr 4, 2022
This test produces apparent file corruption on all of the
windows/arm64 builders. I suspect that this is a low-level bug (in
either the platform itself or the Go standard library on
windows/arm64).

Since windows/arm64 is not yet a first-class port, this test can be
skipped for now. However, if windows/arm64 becomes a first-class port
the underlying file-corruption bug should be investigated and fixed.

Updates golang/go#50706.

Change-Id: I0bc80cefee50895d40acc658286eb7ef8790493a
Reviewed-on: https://go-review.googlesource.com/c/tools/+/397996
Reviewed-by: Russ Cox <rsc@golang.org>
Trust: Bryan Mills <bcmills@google.com>
Run-TryBot: Bryan Mills <bcmills@google.com>
gopls-CI: kokoro <noreply+kokoro@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
@bcmills
Contributor Author

bcmills commented Jun 23, 2022

@qmuntal, this is one of the issues I think should block promoting windows/arm64 to a first-class port.

@qmuntal
Member

qmuntal commented Jun 27, 2022

@bcmills, what is the approximate occurrence rate of this issue?

I've been running TestCallgraph for a couple of days on a windows/arm64 VM with plenty of capacity and haven't triggered the issue. I'm using a Go and x/tools version that reproduced this issue in the past inside the official builder.

Will keep it running a couple more days, but I'm leaning towards a HW capacity issue related to #51019.
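
A reproduction loop of this kind could be structured roughly as below. This is only a sketch: `firstFailure` and the stand-in runner are hypothetical, and a real runner would invoke the test (for example through `os/exec`) and return its combined output.

```go
package main

import (
	"fmt"
	"strings"
)

// cacheErr is the error text reported in this issue.
const cacheErr = "cache entry not found: bad checksum"

// firstFailure calls run up to n times and returns the 1-based index of the
// first run whose output contains cacheErr, or 0 if none does.
func firstFailure(n int, run func() string) int {
	for i := 1; i <= n; i++ {
		if strings.Contains(run(), cacheErr) {
			return i
		}
	}
	return 0
}

func main() {
	// Stand-in runner that fails on its third call; it simulates output
	// instead of actually running the test.
	calls := 0
	fake := func() string {
		calls++
		if calls == 3 {
			return "reading srcfiles list: " + cacheErr
		}
		return "ok"
	}
	fmt.Println(firstFailure(10, fake))
}
```

Matching on the specific error text, rather than on any test failure, keeps unrelated flakes from being counted as reproductions.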

@bcmills
Contributor Author

bcmills commented Jun 27, 2022

@qmuntal, before we started skipping TestCallgraph on arm64 the failure rate was about one per ~month of development. It's a bit tricky to estimate the failure rate per run from that, but a rough back-of-the-envelope is 1 failure / month / (~30 commits/day * 30 days/month), or a failure rate of about 0.1% of commits.
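
The back-of-the-envelope estimate above can be checked with a few lines of Go. The inputs are the rough figures quoted in the comment (one failure per month, about 30 commits per day, 30 days per month); `failureRate` is a hypothetical helper for this example.

```go
package main

import "fmt"

// failureRate returns the estimated fraction of commits that hit the failure,
// given failures per month, commits per day, and days per month.
func failureRate(failuresPerMonth, commitsPerDay, daysPerMonth float64) float64 {
	return failuresPerMonth / (commitsPerDay * daysPerMonth)
}

func main() {
	r := failureRate(1, 30, 30) // 1 failure per 900 commits
	fmt.Printf("%.2f%% of commits\n", r*100)
}
```

At roughly one failure in 900 runs, a reproduction attempt plausibly needs on the order of a thousand runs before a miss becomes informative.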

It does seem plausible that this could be a defect (or a bad interaction with a platform bug) somewhere in the virtualization stack used to host the builder.

@qmuntal
Member

qmuntal commented Jan 29, 2024

I haven't been able to reproduce this failure locally yet. Let's unskip TestCallgraph and see if it still fails on the new Azure builders, which don't use QEMU.


5 participants