Errors running Ansible across current cluster #2531

Closed
50 of 55 tasks
rvagg opened this issue Jan 28, 2021 · 13 comments
@rvagg
Member

rvagg commented Jan 28, 2021

I decided to just run our Ansible worker create script across our cluster today in light of the sudo security flaw that was announced. In the process I've encountered a bunch of errors that we should probably look into. I'm recording them here as a list so we can tick them off. I'd appreciate help in dealing with these.

I didn't run any updates on:

  • Windows - shouldn't need it, I think they all do auto updates and regular restarts anyway
  • IBM platforms (including ibmi, aix, rhel-s390x, zos) - I'll let the @nodejs/build IBM folks deal with those
  • macOS - @AshCripps would you mind doing these? I'm a little afraid to just run the scripts against the existing infra.

Errors:

Failed 'not secret' (Jenkins secret not in secrets/inventory.yml)

  • test-digitalocean-debian9-x64-1
  • test-digitalocean-centos5-x86-1 Removed
  • test-softlayer-centos5-x64-2 Removed
  • test-softlayer-centos5-x64-1 Removed
  • release-digitalocean-centos5-x64-1 Removed
  • release-digitalocean-centos6-x86-1 Removed (ansible: add RHEL 8 x64 instances #2886)
  • release-softlayer-centos5-x86-1 Removed
  • test-digitalocean-ubuntu1404-x64-1
  • test-digitalocean-ubuntu1404-x86-1 Removed (ansible: add RHEL 8 x64 instances #2886)
  • test-digitalocean-ubuntu1604-x86-2
  • test-digitalocean-ubuntu1804-x64-1
  • test-rackspace-ubuntu1204-x64-1 (should remove)
  • test-requireio_rvagg-ubuntu1404-arm64_odroidxu-1 (should remove)
  • test-requireio_rvagg-ubuntu1404-arm64_odroidxu-2 (should remove)
  • test-requireio_rvagg-ubuntu1404-arm64_odroidxu3-1 (should remove)
  • release-scaleway-ubuntu1604-armv7l-1 (should remove)
  • release-scaleway-ubuntu1604-armv7l-2 (should remove)
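
The 'not secret' failures above come down to a host being in the inventory without a matching Jenkins secret. A minimal sketch of that check follows. It assumes a flat mapping of host name to secret; the real layout of secrets/inventory.yml is not shown in this issue, so `hosts_missing_secret` and the dictionary shape are hypothetical.

```python
def hosts_missing_secret(hosts, secrets):
    """Return the hosts that have no non-empty secret, preserving order.

    `secrets` is a hypothetical {host_name: secret} mapping; the real
    structure of secrets/inventory.yml may differ.
    """
    return [host for host in hosts if not secrets.get(host)]


if __name__ == "__main__":
    hosts = [
        "test-digitalocean-debian9-x64-1",
        "test-digitalocean-ubuntu1804-x64-1",
    ]
    # Placeholder value for illustration, not a real secret.
    secrets = {"test-digitalocean-ubuntu1804-x64-1": "placeholder"}
    print(hosts_missing_secret(hosts, secrets))
```

Running a check like this before the playbook would list every affected host up front instead of failing one host at a time.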

Unreachable

  • test-joyent-smartos18-x64-1 Removed
  • test-joyent-smartos18-x64-2 Removed
  • test-digitalocean-freebsd10-x64-1 (should remove?)
  • test-joyent-freebsd10-x64-1 Removed
  • test-joyent-freebsd10-x64-2 Removed
  • test-rackspace-freebsd10-x64-1 (should remove?)
  • test-digitalocean-ubuntu1204-x64-1 (should remove)
  • test-digitalocean-ubuntu1204-x64-2 (should remove)
  • test-scaleway-ubuntu1804-armv7l-2 (should remove)
  • test-scaleway-ubuntu1804-armv7l-3 (should remove)
  • test-mininodes-ubuntu1604-arm64_odroid_c2-1 (should remove)
  • test-mininodes-ubuntu1604-arm64_odroid_c2-2 (should remove)
  • test-mininodes-ubuntu1604-arm64_odroid_c2-3 (should remove)

Failed on 'baselayout : run ccache installer'

Failed update

Failed 'bootstrap : install libselinux-python bindings'

Failed 'set hostname'

(We should remove these entirely)

  • test-softlayer-ubuntu1404-x64-1 (remove?)
  • test-softlayer-ubuntu1404-x86-1 (remove?)

Failed 'baselayout : centos7 | install ius'

"Name or service not known", is IUS still a thing?

Failed 'jenkins-worker : install tap2junit'

An exception occurred during task execution. To see the full traceback, use -vvv. The error was: ModuleNotFoundError: No module named 'pkg_resources'
{"changed": false, "msg": "Failed to import the required Python library (setuptools) on test-digitalocean-freebsd11-x64-1's Python /usr/local/bin/python. Please read module documentation and install in the appropriate location. If the required library is installed, but Ansible is using the wrong Python interpreter, please consult the documentation on ansible_python_interpreter"}

Failed uninstall node

FAILED! => {"changed": false, "cmd": "node -v", "delta": "0:00:01.462308", "end": "2021-01-27 20:56:35.373427", "failed_when_result": true, "rc": 0, "start": "2021-01-27 20:56:33.911119", "stderr": "", "stderr_lines": [], "stdout": "v9.11.2", "stdout_lines": ["v9.11.2"]}

No "containers" config

(these probably need to be removed, they were part of my next-gen containerisation experiment iirc)

  • test-digitalocean-ubuntu2004_docker-x64-1
  • test-digitalocean-ubuntu2004_docker-x64-2
@AshCripps
Member

I'm currently running into some Ansible issues myself, with this error:

* Failed to parse /Users/ash/github/build/ansible/plugins/inventory/nodejs_yaml.py with ini plugin: /Users/ash/github/build/ansible/plugins/inventory/nodejs_yaml.py:25: Expected key=value host variable assignment, got: __future__

But once I get it working I'll get on to ansibling the Macs.

@richardlau
Member

FTR IBM i and z/OS do not have sudo available/installed.

@richardlau
Member

Ran the Ansible scripts against our LinuxONE (rhel-s390x) hosts and they run into an error in the baselayout : run ccache installer task. I've opened #2533 to cover that separately. The scripts did get past the package update task, so sudo has been updated on all of the LinuxONE hosts (to sudo-1.8.23-10.el7_9.1.s390x).

@richardlau
Member

I've updated the AIX hosts. All IBM platforms that have sudo have been updated to patch the advisory.

@AshCripps
Member

I've run against all the Macs. The only error I faced is this one on release-nearform-macos10.15-x64-1, which I'm currently investigating:

TASK [jenkins-worker : download slave.jar] ********************************************************************************************************************************************************************************************
task path: /Users/ash/github/build/ansible/roles/jenkins-worker/tasks/main.yml:165
fatal: [release-nearform-macos10.15-x64-1]: FAILED! => {"changed": false, "dest": "/Users/iojs/slave.jar", "elapsed": 0, "gid": 0, "group": "wheel", "mode": "0644", "msg": "Request failed: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1056)>", "owner": "root", "size": 877037, "state": "file", "uid": 0, "url": "https://ci.nodejs.org/jnlpJars/slave.jar"}
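
The `CERTIFICATE_VERIFY_FAILED ... unable to get local issuer certificate` message means the Python that Ansible used on the host could not find a CA certificate to validate the ci.nodejs.org chain. As one possible starting point for the investigation (an assumption, not the fix that was eventually applied), this sketch reports where that Python's `ssl` module looks for CA certificates. It only tells you something useful when run with the same interpreter Ansible uses on the host; `ca_locations` is a name made up for this example.

```python
import os
import ssl


def ca_locations():
    """Describe the default CA file/directory the ssl module would consult."""
    paths = ssl.get_default_verify_paths()
    return {
        # None when the default file or directory does not exist.
        "cafile": paths.cafile,
        "capath": paths.capath,
        # Compiled-in OpenSSL defaults, whether or not they exist on disk.
        "openssl_cafile": paths.openssl_cafile,
        "openssl_capath": paths.openssl_capath,
        # Environment variables that override the defaults, if set.
        "env_cafile": os.environ.get(paths.openssl_cafile_env),
        "env_capath": os.environ.get(paths.openssl_capath_env),
    }


if __name__ == "__main__":
    for name, value in ca_locations().items():
        print(f"{name}: {value}")
```

If `cafile` and `capath` both come back as `None`, that interpreter has no default trust store to verify against, which would be consistent with the error above.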

@richardlau
Member

I'm looking at test-nearform_intel-ubuntu1604-x64-1.

Manually running `apt-get update` fails with an expired gpg key:
# apt-get update
Hit:1 http://ppa.launchpad.net/ubuntu-toolchain-r/test/ubuntu xenial InRelease
Get:2 http://security.ubuntu.com/ubuntu xenial-security InRelease [109 kB]
Hit:3 http://ie.archive.ubuntu.com/ubuntu xenial InRelease
Hit:4 http://ie.archive.ubuntu.com/ubuntu xenial-updates InRelease
Hit:5 http://ie.archive.ubuntu.com/ubuntu xenial-backports InRelease
Get:6 https://ftp.heanet.ie/mirrors/cran.r-project.org/bin/linux/ubuntu xenial/ InRelease [3,607 B]
Err:6 https://ftp.heanet.ie/mirrors/cran.r-project.org/bin/linux/ubuntu xenial/ InRelease
  The following signatures were invalid: KEYEXPIRED 1602869253  KEYEXPIRED 1602869253  KEYEXPIRED 1602869253
Fetched 113 kB in 0s (124 kB/s)
Reading package lists... Done
W: An error occurred during the signature verification. The repository is not updated and the previous index files will be used. GPG error: https://ftp.heanet.ie/mirrors/cran.r-project.org/bin/linux/ubuntu xenial/ InRelease: The following signatures were invalid: KEYEXPIRED 1602869253  KEYEXPIRED 1602869253  KEYEXPIRED 1602869253
W: Failed to fetch https://ftp.heanet.ie/mirrors/cran.r-project.org/bin/linux/ubuntu/xenial/InRelease  The following signatures were invalid: KEYEXPIRED 1602869253  KEYEXPIRED 1602869253  KEYEXPIRED 1602869253
W: Some index files failed to download. They have been ignored, or old ones used instead.
#
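
The number after `KEYEXPIRED` in the output above is a Unix timestamp. A small sketch for turning those into readable dates (`expired_key_times` is a helper name made up for this example):

```python
import re
from datetime import datetime, timezone


def expired_key_times(apt_output):
    """Return the distinct KEYEXPIRED timestamps in apt output as UTC datetimes."""
    stamps = sorted({int(s) for s in re.findall(r"KEYEXPIRED (\d+)", apt_output)})
    return [datetime.fromtimestamp(s, tz=timezone.utc) for s in stamps]


if __name__ == "__main__":
    line = (
        "The following signatures were invalid: "
        "KEYEXPIRED 1602869253  KEYEXPIRED 1602869253"
    )
    for when in expired_key_times(line):
        print(when.date().isoformat())  # prints 2020-10-16
```

So the signing key for the CRAN mirror expired on 16 October 2020, a few months before this run.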

@richardlau
Member

richardlau commented Feb 19, 2021

`apt-key list` shows the expired key, and refreshing the keys from the keyserver clears it:

root@test-nearform--intel-ubuntu1604-x64-1:~# apt-key list | grep "expired: "
pub   2048R/E084DAB9 2010-10-19 [expired: 2020-10-16]
root@test-nearform--intel-ubuntu1604-x64-1:~# apt-key adv --keyserver keyserver.ubuntu.com --refresh-keys
Executing: /tmp/tmp.mckX5xKLZr/gpg.1.sh --keyserver
keyserver.ubuntu.com
--refresh-keys
gpg: refreshing 7 keys from hkp://keyserver.ubuntu.com
gpg: requesting key 437D05B5 from hkp server keyserver.ubuntu.com
gpg: requesting key C0B21F32 from hkp server keyserver.ubuntu.com
gpg: requesting key EFE21092 from hkp server keyserver.ubuntu.com
gpg: requesting key FBB75451 from hkp server keyserver.ubuntu.com
gpg: requesting key E084DAB9 from hkp server keyserver.ubuntu.com
gpg: requesting key AF4F7421 from hkp server keyserver.ubuntu.com
gpg: requesting key BA9EF27F from hkp server keyserver.ubuntu.com
gpg: key 437D05B5: "Ubuntu Archive Automatic Signing Key <ftpmaster@ubuntu.com>" 43 new signatures
gpg: key C0B21F32: "Ubuntu Archive Automatic Signing Key (2012) <ftpmaster@ubuntu.com>" 17 new signatures
gpg: key EFE21092: "Ubuntu CD Image Automatic Signing Key (2012) <cdimage@ubuntu.com>" 65 new signatures
gpg: key FBB75451: "Ubuntu CD Image Automatic Signing Key <cdimage@ubuntu.com>" 93 new signatures
gpg: key E084DAB9: "Michael Rutter <marutter@gmail.com>" 2 new signatures
gpg: key AF4F7421: "Sylvestre Ledru - Debian LLVM packages <sylvestre@debian.org>" 1 new signature
gpg: key BA9EF27F: "Launchpad Toolchain builds" not changed
gpg: Total number processed: 7
gpg:              unchanged: 1
gpg:         new signatures: 221
root@test-nearform--intel-ubuntu1604-x64-1:~# apt-key list | grep "expired: "
root@test-nearform--intel-ubuntu1604-x64-1:~#

Refreshing keys has allowed the create worker ansible playbook to successfully run on test-nearform_intel-ubuntu1604-x64-1.
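
To spot this on other hosts before the playbook fails, the `grep "expired: "` step can be sketched as a small parser. This assumes the `apt-key list` line format shown in the transcript above (Ubuntu 16.04); newer gpg versions print keys differently, and `expired_keys` is a helper name made up for this example.

```python
import re

# Matches lines such as:
#   pub   2048R/E084DAB9 2010-10-19 [expired: 2020-10-16]
EXPIRED_RE = re.compile(
    r"^pub\s+\S+/(?P<key_id>[0-9A-F]+)\s+\S+\s+"
    r"\[expired: (?P<date>\d{4}-\d{2}-\d{2})\]"
)


def expired_keys(apt_key_list_output):
    """Return (key_id, expiry_date) pairs for expired `pub` lines."""
    found = []
    for line in apt_key_list_output.splitlines():
        match = EXPIRED_RE.match(line.strip())
        if match:
            found.append((match.group("key_id"), match.group("date")))
    return found


if __name__ == "__main__":
    sample = "pub   2048R/E084DAB9 2010-10-19 [expired: 2020-10-16]"
    print(expired_keys(sample))  # prints [('E084DAB9', '2020-10-16')]
```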

@richardlau
Member

Removed the smartos 15/16 hosts from the list in the description as they've been removed: #2552

@richardlau
Member

richardlau commented Mar 17, 2021

Failed update

  • test-digitalocean-freebsd11-x64-2 ("pkg: repository meta /var/db/pkg/FreeBSD.meta has wrong version 2")

Fixed by running

$ pkg --version
1.10.5
$ sudo pkg bootstrap -f
The package management tool is not yet installed on your system.
Do you want to fetch and install it now? [y/N]: y
Bootstrapping pkg from pkg+http://pkg.FreeBSD.org/FreeBSD:11:amd64/latest, please wait...
Verifying signature with trusted certificate pkg.freebsd.org.2013102301... done
Installing pkg-1.16.3...
Newer FreeBSD version for package pkg:
To ignore this error set IGNORE_OSVERSION=yes
- package: 1104001
- running kernel: 1102000
Ignore the mismatch and continue? [y/N]:
package pkg is already installed, forced install
Extracting pkg-1.16.3: 100%
$ pkg --version
1.16.3
$

1.16.3 matches the version of pkg on test-digitalocean-freebsd11-x64-1.
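
The "Newer FreeBSD version for package pkg" prompt in the transcript compares two OSVERSION numbers. A sketch for reading them, assuming the major/minor digit layout FreeBSD has used since version 10 (`decode_osversion` is a helper name made up for this example):

```python
def decode_osversion(value):
    """Split a FreeBSD OSVERSION such as 1104001 into (major, minor).

    Assumes the layout used by FreeBSD 10 and later: two digits of major
    version, two digits of minor version, then a three-digit counter.
    """
    return value // 100000, (value // 1000) % 100


if __name__ == "__main__":
    print(decode_osversion(1104001))  # package: prints (11, 4)
    print(decode_osversion(1102000))  # running kernel: prints (11, 2)
```

Read that way, the pkg package was built for FreeBSD 11.4 while the host was still running an 11.2 kernel, which is why the bootstrap asked for confirmation.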

Running the playbook on test-digitalocean-freebsd11-x64-2 now errors at the same point as test-digitalocean-freebsd11-x64-1:

An exception occurred during task execution. To see the full traceback, use -vvv. The error was: ModuleNotFoundError: No module named 'pkg_resources'
fatal: [test-digitalocean-freebsd11-x64-2]: FAILED! => {"changed": false, "msg": "Failed to import the required Python library (setuptools) on test-digitalocean-freebsd11-x64-2's Python /usr/local/bin/python. Please read the module documentation and install it in the appropriate location. If the required library is installed, but Ansible is using the wrong Python interpreter, please consult the documentation on ansible_python_interpreter"}
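
The error says the interpreter at /usr/local/bin/python cannot import `pkg_resources`, which ships with setuptools. A quick way to confirm which interpreter is affected is to run a check like the following with that exact Python (a minimal sketch; `module_available` is a helper name made up for this example):

```python
import importlib.util
import sys


def module_available(name):
    """True if `name` can be found by this interpreter's import system."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    # Shows which interpreter actually ran the check.
    print(sys.executable)
    print("pkg_resources available:", module_available("pkg_resources"))
```

If it reports that the module is unavailable, either setuptools needs installing for that interpreter or `ansible_python_interpreter` should point at one that has it, as the error message suggests.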

@richardlau
Member

Failed 'baselayout : centos7 | install ius'

"Name or service not known", is IUS still a thing?

  • test-rackspace-centos7-x64-1
  • release-digitalocean-centos7-x64-1

- name: centos7 | install ius
  when: "arch != 'arm64' and arch != 'ppc64'"
  yum:
    name: "https://centos{{ ansible_distribution_major_version }}.iuscommunity.org/ius-release.rpm"
    state: present

I'm not sure what we're installing from there, but the "centos*.iuscommunity.org" URLs were retired in 2020: iusrepo/announce#18
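
"Name or service not known" is a DNS failure, which fits the host names having been retired. A sketch that rebuilds the URL from the task's template and checks whether its host still resolves (`ius_release_url` and `host_resolves` are names made up for this example, and the `resolver` parameter exists only so the check can be exercised without network access):

```python
import socket
from urllib.parse import urlsplit


def ius_release_url(major_version):
    """Mirror the URL template used by the `install ius` task."""
    return f"https://centos{major_version}.iuscommunity.org/ius-release.rpm"


def host_resolves(url, resolver=socket.getaddrinfo):
    """Return True if the URL's host name resolves, False on a DNS failure."""
    host = urlsplit(url).hostname
    try:
        resolver(host, 443)
    except socket.gaierror:
        return False
    return True
```

Calling `host_resolves(ius_release_url(7))` from one of the affected hosts would show whether the failure is in DNS or somewhere later in the yum task.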

@richardlau
Member

richardlau commented Apr 13, 2021

Re. "Failed 'not secret' (Jenkins secret not in secrets/inventory.yml)": having double-checked that we no longer have them in either the public or release Jenkins, I've removed all the centos5 entries from secrets/inventory.yml, in addition to the ones marked "should remove", and added the missing secrets for the others in the list.

Edit: CentOS 5 was removed from Jenkins in #1984

richardlau added a commit that referenced this issue Apr 14, 2021
Removed:
release-requireio-osx1010-x64-1
release-scaleway-ubuntu1604-armv7l-1
release-scaleway-ubuntu1604-armv7l-2
test-requireio_rvagg-ubuntu1404-arm64_odroidxu-1
test-requireio_rvagg-ubuntu1404-arm64_odroidxu-2
test-requireio_rvagg-ubuntu1404-arm64_odroidxu3-1
test-requireio-osx1010-x64-1

Refs: #2531
richardlau added a commit that referenced this issue Apr 23, 2021
Remove:
test-mininodes-ubuntu1604-arm64_odroid_c2-1
test-mininodes-ubuntu1604-arm64_odroid_c2-2
test-mininodes-ubuntu1604-arm64_odroid_c2-3
test-scaleway-ubuntu1804-armv7l-1
test-scaleway-ubuntu1804-armv7l-2
test-scaleway-ubuntu1804-armv7l-3

Refs: #2531
@richardlau
Member

Failed uninstall node

  • test-digitalocean-ubuntu1604-x86-1
FAILED! => {"changed": false, "cmd": "node -v", "delta": "0:00:01.462308", "end": "2021-01-27 20:56:35.373427", "failed_when_result": true, "rc": 0, "start": "2021-01-27 20:56:33.911119", "stderr": "", "stderr_lines": [], "stdout": "v9.11.2", "stdout_lines": ["v9.11.2"]}

This was a snap install of Node.js.

root@test-digitalocean-ubuntu1604-x86-1:~# which node
/snap/bin/node
root@test-digitalocean-ubuntu1604-x86-1:~#

Removed via `snap remove node`.
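
The task failed because `node -v` still succeeded (`rc: 0`, `failed_when_result: true`) after the uninstall step, so some other Node.js was still on the PATH. A sketch for finding out where a leftover `node` comes from. Only the snap case is taken from this issue; `node_install_kind` and the other labels are made up for this example.

```python
import shutil


def node_install_kind(path):
    """Classify a node binary path; only the snap case comes from this issue."""
    if path is None:
        return "not installed"
    if path.startswith("/snap/"):
        return "snap"
    return "other"


if __name__ == "__main__":
    # shutil.which returns None when no `node` is on the PATH.
    path = shutil.which("node")
    print(path, "->", node_install_kind(path))
```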

richardlau added a commit that referenced this issue May 17, 2021
- Update `python` to point to `python2` (for consistency with other
  platforms in our CI; use `python3` for Python 3).
- Use `py38-pip` to match the version of Python 3 installed.
- Create a `pip3` symlink as Ansible's `pip` task fails without it.
- `-slaveLog` Jenkins agent parameter has been obsoleted. Use the
  `-o` parameter to `daemon` to redirect stdout/stderr to the log file.

Refs: #2531
richardlau added a commit that referenced this issue May 24, 2021
Tasks registering variables in Ansible will always set the variable
regardless of any `when` clause(s). This means if more than one task
registers the same variable, the last one "wins". Fix the bootstrap
role on Fedora by correctly registering `has_libselinux` in both
`fedora30` and non-`fedora30` cases.

Refs: #2531
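
The `register` behaviour described in that commit message is easy to trip over, so here is a tiny Python model of it. This is not Ansible code and not the actual bootstrap role; the two tasks and their conditions are hypothetical stand-ins. A task whose `when` is false is skipped but still overwrites the registered variable with a "skipped" result.

```python
def run_tasks(tasks, facts):
    """Tiny model of Ansible's `register`: skipped tasks still set the variable."""
    registered = {}
    for task in tasks:
        if task["when"](facts):
            result = {"skipped": False, "rc": task["rc"]}
        else:
            # A skipped task still registers a result, marked as skipped.
            result = {"skipped": True}
        registered[task["register"]] = result
    return registered


# Two hypothetical tasks registering the same variable under opposite conditions.
tasks = [
    {"when": lambda f: f["os"] == "fedora30", "rc": 0, "register": "has_libselinux"},
    {"when": lambda f: f["os"] != "fedora30", "rc": 0, "register": "has_libselinux"},
]

if __name__ == "__main__":
    # The first task runs, but the skipped second task "wins".
    print(run_tasks(tasks, {"os": "fedora30"}))
    # prints {'has_libselinux': {'skipped': True}}
```

On the modelled `fedora30` host the real result of the first task is lost, which is the kind of clash the commit describes fixing.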
@github-actions

This issue is stale because it has been open many days with no activity. It will be closed soon unless the stale label is removed or a comment is made.
