x/build: AIX builders fail with "unknown directive "Disconnected" " #68481

Closed
ayappanec opened this issue Jul 17, 2024 · 15 comments
Labels
Builders: x/build issues (builders, bots, dashboards)
NeedsInvestigation: Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
OS-AIX
Milestone

Comments

@ayappanec

ayappanec commented Jul 17, 2024

The AIX golang CI recently started failing with the error below.

--- FAIL: TestImportTestdata (0.05s)
    gcimporter_test.go:58: compile: /dev/null:1: unknown directive "Disconnected"
    gcimporter_test.go:59: go tool compile generics.go failed: exit status 1
--- FAIL: TestTypeNamingOrder (0.04s)
    gcimporter_test.go:58: compile: /dev/null:1: unknown directive "Disconnected"
    gcimporter_test.go:59: go tool compile g.go failed: exit status 1

https://build.golang.org/log/28dadf4b964f28d9137fb36e3f3181f03394faa7

It looks like there is some problem with the CI machine, because our internal CI is working fine.
Any idea what could be the problem? Any hints would be useful for fixing that CI machine.

@seankhliao seankhliao changed the title AIX golang CI fails with "unknown directive "Disconnected" " x/build: AIX builders fail with "unknown directive "Disconnected" " Jul 17, 2024
@gopherbot gopherbot added the Builders x/build issues (builders, bots, dashboards) label Jul 17, 2024
@gopherbot gopherbot added this to the Unreleased milestone Jul 17, 2024
@cherrymui
Member

cc @golang/aix

@cherrymui cherrymui added NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. OS-AIX labels Jul 17, 2024
@gabyhelp

Related Issues and Documentation

(Emoji vote if this was helpful or unhelpful; more detailed feedback welcome in this discussion.)

@cherrymui
Member

These tests also show flaky but high-frequency failures on the netbsd-arm-bsiegert builder, e.g. https://build.golang.org/log/7d88bbec36f25867219a11c16791e0c24ea622e7 . I'm not sure whether these are separate builder issues or whether they are related.

cc @bsiegert

@pmur
Contributor

pmur commented Jul 18, 2024

I was pinged by the AIX maintainers to take a look. Something on the VM is misbehaving: /dev/null is returning a single line of sshd logging. The VM claims an uptime of 1011 days. It's probably due for a reboot, so I've rebooted it.

@ayappanec I suspect this VM might be due for updates. Would you be able to investigate or verify it is up to date?

@bsiegert
Contributor

This may actually be caused by the Go test in some way. The other day, I noticed that /dev/null on the netbsd-arm builder had been overwritten with a file containing one line of text.

Could it be that something specifies /dev/null as its output file and the program is misbehaving somehow?

@ayappanec
Author

> I was pinged by the AIX maintainers to take a look. Something on the VM is misbehaving: /dev/null is returning a single line of sshd logging. The VM claims an uptime of 1011 days. It's probably due for a reboot, so I've rebooted it.
>
> @ayappanec I suspect this VM might be due for updates. Would you be able to investigate or verify it is up to date?

Yes, the machine is due for updates. I will check on this.

@cherrymui
Member

It wouldn't be surprising if some tests write to /dev/null, but that should usually be fine, just as it is common to run something like "some shell command > /dev/null", which simply discards the output. Unless /dev/null has somehow become a regular file on the machine? I don't know how that could happen, other than something running as root deleting /dev/null and recreating it as a regular file.

@pmur
Contributor

pmur commented Jul 19, 2024

It appeared something was recreating /dev/null as a regular file. When non-empty, it had the full contents of an sshd log message, without the syslog prefix. It seems unlikely to be the fault of the Go tooling or CI tooling.

I modified sshd_config and syslog to place sshd's logs into /var/log/messages last night. /dev/null still seems intact. If we start seeing builds pass again, I'll close this issue.

I think everything is running as root, so there are few guardrails. I don't have enough background with AIX to say what is clobbering the file, but sshd seems suspect.

@bsiegert
Contributor

I assume this happens if the software tries to be extra clever by creating the file under a different name and using rename(2) to put it into place.

@dmitshur dmitshur moved this to In Progress in Go Release Jul 22, 2024
@pmur
Contributor

pmur commented Jul 23, 2024

This is still happening. I wonder if a test is rewriting the file. I couldn't reproduce it when running the go dist tests on the latest 1.22 or 1.23rc releases.

@cherrymui
Member

@bsiegert this sounds plausible. The go command does move the output file from the temp WORK directory to the output, but it special-cases os.DevNull (https://cs.opensource.google/go/go/+/master:src/cmd/go/internal/work/build.go;l=499). It could be that some other tool is not that careful. I would guess that should be pretty reproducible, though. Maybe try running tests for the subrepos as well?

@pmur
Contributor

pmur commented Jul 23, 2024

I think I found the culprit after cycling through the x repos. Something in x/oscar (which coincidentally is failing to build on CI now) seems to be causing /dev/null to be deleted. I'll look more into this later today.

@dmitshur
Contributor

Thanks for finding that. Note that x/oscar is defined as PT.TOOL in the LUCI build configuration (see here), meaning it is a repository intended to be tested only on a few first-class platforms like Linux/Windows/macOS.

Hopefully this builder can be migrated to LUCI soon (issue #67299) since we're migrating away from the coordinator, and so the coordinator won't be maintained indefinitely. But in the short term it would be fine to reconfigure the coordinator not to test x/oscar on the GOOS=aix builder.

@pmur
Contributor

pmur commented Jul 23, 2024

And, the culprit is found. I've opened #68558.

@gopherbot
Contributor

Change https://go.dev/cl/600515 mentions this issue: internal/httprr: do not delete /dev/null

@github-project-automation github-project-automation bot moved this from In Progress to Done in Go Release Jul 24, 2024
tmc pushed a commit to tmc/clones that referenced this issue Oct 31, 2024
Rework TestErrors to use tmp files contained within its test
specific tmpdir.  This fixes the accidental deletion of /dev/null
on the AIX builder when run as a privileged user.

Fixes golang/go#68558
Fixes golang/go#68481

Change-Id: I31c4ca3ea7963b013516ce6d85bbc91c7483981e
Reviewed-on: https://go-review.googlesource.com/c/oscar/+/600515
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
Reviewed-by: Ian Lance Taylor <iant@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
7 participants