Runner seems to have difficulties to process large outputs #998

stanislav-zaprudskiy · 2022-02-15T15:29:06Z

We use ansible-runner (tried both 2.1.1 and devel) in combination with AWX (19.4) all running in Kubernetes.

Apart of normal AWX jobs, ansible-runner EEs also execute inventory sync jobs using our custom inventory plugin. The command under the hood is similar to that below:

ansible-inventory --list --export -i /runner/project/awx.sh --output /runner/artifacts/<job-id>/output.json --playbook-dir /runner/project/ -vvv

And whenever the plugin generates "large" artifacts, we run into the following errors in AWX (varying from case to case - see below 3 the distinct examples), even though the corresponding EE Pod finishes successfully (e.g. {"status": "successful", "runner_ident": "<job-id>"} message followed by base64 encoded zipfile and successful termination of the Pod).

Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 1499, in run
    res = receptor_job.run()
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 3029, in run
    res = self._run_internal(receptor_ctl)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 3081, in _run_internal
    res = list(first_future.done)[0].result()
  File "/usr/lib64/python3.8/concurrent/futures/_base.py", line 432, in result
    return self.__get_result()
  File "/usr/lib64/python3.8/concurrent/futures/_base.py", line 388, in __get_result
    raise self._exception
  File "/usr/lib64/python3.8/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/utils/common.py", line 1102, in wrapper_cleanup_new_process
    return func(*args, **kwargs)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 3128, in processor
    return ansible_runner.interface.run(
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/ansible_runner/interface.py", line 257, in run
    r.run()
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/ansible_runner/streaming.py", line 249, in run
    self.artifacts_callback(data)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/ansible_runner/streaming.py", line 228, in artifacts_callback
    unstream_dir(self._input, length, self.artifact_dir)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/ansible_runner/utils/streaming.py", line 69, in unstream_dir
    with zipfile.ZipFile(tmp.name, "r") as archive:
  File "/usr/lib64/python3.8/zipfile.py", line 1269, in __init__
    self._RealGetContents()
  File "/usr/lib64/python3.8/zipfile.py", line 1336, in _RealGetContents
    raise BadZipFile("File is not a zip file")
zipfile.BadZipFile: File is not a zip file

Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 1499, in run
    res = receptor_job.run()
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 3029, in run
    res = self._run_internal(receptor_ctl)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 3081, in _run_internal
    res = list(first_future.done)[0].result()
  File "/usr/lib64/python3.8/concurrent/futures/_base.py", line 432, in result
    return self.__get_result()
  File "/usr/lib64/python3.8/concurrent/futures/_base.py", line 388, in __get_result
    raise self._exception
  File "/usr/lib64/python3.8/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/utils/common.py", line 1102, in wrapper_cleanup_new_process
    return func(*args, **kwargs)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 3128, in processor
    return ansible_runner.interface.run(
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/ansible_runner/interface.py", line 257, in run
    r.run()
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/ansible_runner/streaming.py", line 249, in run
    self.artifacts_callback(data)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/ansible_runner/streaming.py", line 228, in artifacts_callback
    unstream_dir(self._input, length, self.artifact_dir)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/ansible_runner/utils/streaming.py", line 64, in unstream_dir
    data = source.read(chunk_size)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/ansible_runner/utils/base64io.py", line 282, in read
    results.write(base64.b64decode(data))
  File "/usr/lib64/python3.8/base64.py", line 87, in b64decode
    return binascii.a2b_base64(s)
binascii.Error: Incorrect padding

Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 1499, in run
    res = receptor_job.run()
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 3029, in run
    res = self._run_internal(receptor_ctl)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 3081, in _run_internal
    res = list(first_future.done)[0].result()
  File "/usr/lib64/python3.8/concurrent/futures/_base.py", line 432, in result
    return self.__get_result()
  File "/usr/lib64/python3.8/concurrent/futures/_base.py", line 388, in __get_result
    raise self._exception
  File "/usr/lib64/python3.8/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/utils/common.py", line 1102, in wrapper_cleanup_new_process
    return func(*args, **kwargs)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 3128, in processor
    return ansible_runner.interface.run(
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/ansible_runner/interface.py", line 257, in run
    r.run()
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/ansible_runner/streaming.py", line 249, in run
    self.artifacts_callback(data)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/ansible_runner/streaming.py", line 228, in artifacts_callback
    unstream_dir(self._input, length, self.artifact_dir)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/ansible_runner/utils/streaming.py", line 86, in unstream_dir
    archive.extract(info.filename, path=target_directory)
  File "/usr/lib64/python3.8/zipfile.py", line 1630, in extract
    return self._extract_member(member, path, pwd)
  File "/usr/lib64/python3.8/zipfile.py", line 1700, in _extract_member
    with self.open(member, pwd=pwd) as source, \
  File "/usr/lib64/python3.8/zipfile.py", line 1530, in open
    fheader = zef_file.read(sizeFileHeader)
  File "/usr/lib64/python3.8/zipfile.py", line 763, in read
    self._file.seek(self._pos)
OSError: [Errno 22] Invalid argument

Failures are reproducible on /runner/artifacts/<job-id>/output.json files ranging from 80 to 400 MB (raw), for which the corresponding zipfile ranges from 5 to 11 MB (extracted from K8s logs). And no errors are detected on generated output.json inventory files of few KBs.

Are there reasonable thresholds or recommendations for artifacts size? Is there anything which we could configure for now, except of re-designing the inventories/plugin?

The text was updated successfully, but these errors were encountered:

AlanCoding · 2022-03-22T17:04:58Z

I can't reproduce this

https://github.com/AlanCoding/test-playbooks/blob/master/inventories/dyn_inventory_large_1000.py

This creates random keys / values in hostvars to maximize the compressed size. Locally looking at the sizes:

$ zip alan/output.zip alan/output.json
updating: alan/output.json (deflated 32%)
(env) [alancoding@alan-red-hat test-playbooks]$ ls -l --block-size=M alan
total 390M
-rw-rw-r--. 1 alancoding alancoding 233M Mar 22 12:46 output.json
-rw-rw-r--. 1 alancoding alancoding 158M Mar 22 12:46 output.zip

These should exceed your values sufficiently. Yet I'm able to run this in AWX, as a container_group job without a problem.

stanislav-zaprudskiy · 2022-03-23T07:50:35Z

I'll check your example script on our AWX deployment, and will report back with results.

stanislav-zaprudskiy · 2022-04-04T07:57:48Z

@AlanCoding, thank you for looking into that. Below are the details of our additional testing.

We did check the inventory script you referenced, and we ran into the same issues as reported originally. The stack traces are actually produced by AWX task container, as we could now see - the jobs in AWX are just marked as in Error state w/o any details in UI/API at all (another major problem we've been observing randomly for months with any other regular jobs executed via AWX, which we were not able to troubleshoot yet).

We suspected PostgeSQL, which we run separately and as a cluster (3 nodes), so we did the similar tests on a separate fresh clean installation of AWX with PostgreSQL managed by the operator. Issue was still there.

Next, we suspected that multi-node K8s cluster may be the cause, so we made a new fresh AWX installation making sure that AWX, operator managed PostgreSQL and automation-job Pods all run on the same k8s node (though, there were other nodes in the cluster running different unrelated Pods). Issue was still there.

We also played around EEs - but at the end performed all tests using the default awx-ee:latest EE (using e.g. receptor:devel among other). Issue was still there.

Finally, we performed the same test but now using K8s from Docker for Mac (running on a laptop), and this time all worked w/o a hitch. Both your example inventory and our own inventory scripts made it to completion and registered hosts in AWX. All previous (failing) tests, however, were performed on a normal multi-node K8s installation in our OpenStack cloud.

Also, looking more thoroughly on the automation-job- Pod logs reveals that base64 encoded zip archive stream is not being streamed to "the end" on normal K8s compared to Docker for Mac K8s - I see that using kubectl logs command. It's just cut down somewhere in-between on normal K8s, while on Docker for Mac K8s I could spot eof: true message at the end of the stream. ~~Apparently, this causes the zip archive stream to either be failing on decoding, or on subsequent unzipping (original stack traces).~~ - I now have cases with {"eof": true} reported in kubectl logs, but Error job and errors in AWX.

Would you have any ideas to check further? Anything we could do on top to better understand the cause?

kubelet config diff docker-desktop vs cloud node

{                                                                               {
  "enableServer": true,                                                           "enableServer": true,
  "staticPodPath": "/etc/kubernetes/manifests",                               <
  "syncFrequency": "1m0s",                                                        "syncFrequency": "1m0s",
  "fileCheckFrequency": "20s",                                                    "fileCheckFrequency": "20s",
  "httpCheckFrequency": "30s",                                                |   "httpCheckFrequency": "20s",
  "address": "0.0.0.0",                                                           "address": "0.0.0.0",
  "port": 10250,                                                                  "port": 10250,
  "tlsCertFile": "/var/lib/kubelet/pki/kubelet.crt",                              "tlsCertFile": "/var/lib/kubelet/pki/kubelet.crt",
  "tlsPrivateKeyFile": "/var/lib/kubelet/pki/kubelet.key",                        "tlsPrivateKeyFile": "/var/lib/kubelet/pki/kubelet.key",
                                                                              >   "rotateCertificates": true,
  "authentication": {                                                             "authentication": {
    "x509": {                                                                       "x509": {
      "clientCAFile": "/run/config/pki/ca.crt"                                |       "clientCAFile": "/etc/kubernetes/certs/kubelet-clients-ca.pem"
    },                                                                              },
    "webhook": {                                                                    "webhook": {
      "enabled": true,                                                                "enabled": true,
      "cacheTTL": "2m0s"                                                              "cacheTTL": "2m0s"
    },                                                                              },
    "anonymous": {                                                                  "anonymous": {
      "enabled": false                                                        |       "enabled": true
    }                                                                               }
  },                                                                              },
  "authorization": {                                                              "authorization": {
    "mode": "Webhook",                                                              "mode": "Webhook",
    "webhook": {                                                                    "webhook": {
      "cacheAuthorizedTTL": "5m0s",                                                   "cacheAuthorizedTTL": "5m0s",
      "cacheUnauthorizedTTL": "30s"                                                   "cacheUnauthorizedTTL": "30s"
    }                                                                               }
  },                                                                              },
  "registryPullQPS": 5,                                                           "registryPullQPS": 5,
  "registryBurst": 10,                                                            "registryBurst": 10,
  "eventRecordQPS": 5,                                                            "eventRecordQPS": 5,
  "eventBurst": 10,                                                               "eventBurst": 10,
  "enableDebuggingHandlers": true,                                                "enableDebuggingHandlers": true,
  "healthzPort": 10248,                                                           "healthzPort": 10248,
  "healthzBindAddress": "127.0.0.1",                                              "healthzBindAddress": "127.0.0.1",
  "oomScoreAdj": -999,                                                            "oomScoreAdj": -999,
  "clusterDomain": "cluster.local",                                               "clusterDomain": "cluster.local",
  "clusterDNS": [                                                                 "clusterDNS": [
    "10.96.0.10"                                                              |     "198.18.128.2"
  ],                                                                              ],
  "streamingConnectionIdleTimeout": "4h0m0s",                                     "streamingConnectionIdleTimeout": "4h0m0s",
  "nodeStatusUpdateFrequency": "30s",                                         |   "nodeStatusUpdateFrequency": "10s",
  "nodeStatusReportFrequency": "30s",                                         |   "nodeStatusReportFrequency": "5m0s",
  "nodeLeaseDurationSeconds": 40,                                             |   "nodeLeaseDurationSeconds": 20,
  "imageMinimumGCAge": "2m0s",                                                    "imageMinimumGCAge": "2m0s",
  "imageGCHighThresholdPercent": 85,                                              "imageGCHighThresholdPercent": 85,
  "imageGCLowThresholdPercent": 80,                                               "imageGCLowThresholdPercent": 80,
  "volumeStatsAggPeriod": "1m0s",                                                 "volumeStatsAggPeriod": "1m0s",
  "cgroupRoot": "kubepods",                                                   <
  "cgroupsPerQOS": true,                                                          "cgroupsPerQOS": true,
  "cgroupDriver": "cgroupfs",                                                 |   "cgroupDriver": "systemd",
  "cpuManagerPolicy": "none",                                                     "cpuManagerPolicy": "none",
  "cpuManagerReconcilePeriod": "10s",                                             "cpuManagerReconcilePeriod": "10s",
  "memoryManagerPolicy": "None",                                                  "memoryManagerPolicy": "None",
  "topologyManagerPolicy": "none",                                                "topologyManagerPolicy": "none",
  "topologyManagerScope": "container",                                            "topologyManagerScope": "container",
  "runtimeRequestTimeout": "2m0s",                                                "runtimeRequestTimeout": "2m0s",
  "hairpinMode": "promiscuous-bridge",                                            "hairpinMode": "promiscuous-bridge",
  "maxPods": 110,                                                                 "maxPods": 110,
  "podPidsLimit": -1,                                                             "podPidsLimit": -1,
  "resolvConf": "/etc/resolv.conf",                                               "resolvConf": "/etc/resolv.conf",
  "cpuCFSQuota": true,                                                            "cpuCFSQuota": true,
  "cpuCFSQuotaPeriod": "100ms",                                                   "cpuCFSQuotaPeriod": "100ms",
  "nodeStatusMaxImages": 50,                                                      "nodeStatusMaxImages": 50,
  "maxOpenFiles": 1000000,                                                        "maxOpenFiles": 1000000,
  "contentType": "application/vnd.kubernetes.protobuf",                           "contentType": "application/vnd.kubernetes.protobuf",
  "kubeAPIQPS": 5,                                                                "kubeAPIQPS": 5,
  "kubeAPIBurst": 10,                                                             "kubeAPIBurst": 10,
  "serializeImagePulls": true,                                                    "serializeImagePulls": true,
  "evictionHard": {                                                               "evictionHard": {
    "imagefs.available": "15%",                                                     "imagefs.available": "15%",
    "memory.available": "100Mi",                                                    "memory.available": "100Mi",
    "nodefs.available": "10%",                                                      "nodefs.available": "10%",
    "nodefs.inodesFree": "5%"                                                       "nodefs.inodesFree": "5%"
  },                                                                              },
  "evictionPressureTransitionPeriod": "5m0s",                                     "evictionPressureTransitionPeriod": "5m0s",
  "enableControllerAttachDetach": true,                                           "enableControllerAttachDetach": true,
  "makeIPTablesUtilChains": true,                                                 "makeIPTablesUtilChains": true,
  "iptablesMasqueradeBit": 14,                                                    "iptablesMasqueradeBit": 14,
  "iptablesDropBit": 15,                                                          "iptablesDropBit": 15,
  "failSwapOn": false,                                                        |   "featureGates": {
                                                                              >     "CSIMigration": true,
                                                                              >     "CSIMigrationOpenStack": true,
                                                                              >     "ExpandCSIVolumes": true,
                                                                              >     "IPv6DualStack": false
                                                                              >   },
                                                                              >   "failSwapOn": true,
  "containerLogMaxSize": "10Mi",                                                  "containerLogMaxSize": "10Mi",
  "containerLogMaxFiles": 5,                                                      "containerLogMaxFiles": 5,
  "configMapAndSecretChangeDetectionStrategy": "Watch",                           "configMapAndSecretChangeDetectionStrategy": "Watch",
  "systemReservedCgroup": "systemreserved",                                   |   "enforceNodeAllocatable": [
  "kubeReservedCgroup": "podruntime",                                         |     "pods"
  "volumePluginDir": "/usr/libexec/kubernetes/kubelet-plugins/volume/exec/",  |   ],
                                                                              >   "volumePluginDir": "/var/lib/kubelet/volumeplugins",
  "logging": {                                                                    "logging": {
    "format": "text"                                                                "format": "text"
  },                                                                              },
  "enableSystemLogHandler": true,                                                 "enableSystemLogHandler": true,
  "shutdownGracePeriod": "0s",                                                    "shutdownGracePeriod": "0s",
  "shutdownGracePeriodCriticalPods": "0s",                                        "shutdownGracePeriodCriticalPods": "0s",
  "enableProfilingHandler": true,                                                 "enableProfilingHandler": true,
  "enableDebugFlagsHandler": true,                                                "enableDebugFlagsHandler": true,
  "kind": "KubeletConfiguration",                                                 "kind": "KubeletConfiguration",
  "apiVersion": "kubelet.config.k8s.io/v1beta1"                                   "apiVersion": "kubelet.config.k8s.io/v1beta1"
}

This time we run AWX 20.0.1 (but clearly see the problem starting 19.4.0), K8s 1.21.5 (and saw the problem on 1.20.8 earlier), and for ansible-runner it's stable-2.10-latest and all other combinations (e.g. awx-ee:latest).
As for Docker for Mac K8s the version is 1.21.3.

stanislav-zaprudskiy · 2022-04-05T16:12:08Z

I did some more tests using K8s cluster in GCP and AWS (managed by Gardener), and for both cases I ran into the same failures using the inventory script referenced.

Steps to reproduce:

Get a K8s cluster

Deploy AWX in basic configuration

spec:
  admin_password_secret: awx-admin-password
  admin_user: admin
  annotations: 'checksum/custom-certs.yaml: 8ea379be3c8912aff3144b84eac92edd23ae36d1f2a8ba755a61b2d25d4923e9'
  bundle_cacert_secret: awx-custom-certs
  create_preload_data: true
  ee_resource_requirements: {}
  garbage_collect_secrets: false
  hostname: awx-local.some.hostname
  image_pull_policy: IfNotPresent
  ingress_type: none
  loadbalancer_port: 80
  loadbalancer_protocol: http
  nodeport_port: 30080
  postgres_configuration_secret: awx-postgres-configuration
  projects_persistence: false
  projects_storage_access_mode: ReadWriteMany
  projects_storage_size: 8Gi
  replicas: 1
  route_tls_termination_mechanism: Edge
  service_type: ClusterIP
  task_privileged: false
  task_resource_requirements: {}
  web_resource_requirements: {}

Add new project https://github.com/AlanCoding/test-playbooks.git
Add new inventory
Add new inventory source for the inventory, sourced from the project created earlier, using inventory file inventories/dyn_inventory_large_1000.py
Sync the inventory
A new automation-job will be spawned using the default AWX EE quay.io/ansible/awx-ee:latest (and in the same namespace as AWX).
Upon completion of the script run, the automation-job Pod finishes successfully. However, the sync job is in Error status w/o any errors in the output.

At the same time, AWX Pod register the following messages:

[pod/awx-654b597b58-jj42n/awx-task] 2022-04-05 15:58:00,937 DEBUG    [6d2eb1fe2036428987ec6b89a6fca66b] awx.main.scheduler Running task manager.
[pod/awx-654b597b58-jj42n/awx-ee] DEBUG 2022/04/05 15:58:00 Stdout complete - closing channel for: DavZpxzf
[pod/awx-654b597b58-jj42n/awx-ee] WARNING 2022/04/05 15:58:00 Could not read in control service: read unix /var/run/receptor/receptor.sock->@: use of closed network connection
[pod/awx-654b597b58-jj42n/awx-ee] DEBUG 2022/04/05 15:58:00 Client disconnected from control service @
[pod/awx-654b597b58-jj42n/awx-ee] WARNING 2022/04/05 15:58:00 Could not close connection: close unix /var/run/receptor/receptor.sock->@: use of closed network connection
[pod/awx-654b597b58-jj42n/awx-task] 2022-04-05 15:58:01,069 DEBUG    [6d2eb1fe2036428987ec6b89a6fca66b] awx.main.dispatch task 3e857f4e-9626-4eda-95aa-510e20906189 starting awx.main.analytics.analytics_tasks.send_subsystem_metrics(*[])
[pod/awx-654b597b58-jj42n/awx-task] 2022-04-05 15:58:01,150 DEBUG    [6d2eb1fe2036428987ec6b89a6fca66b] awx.main.scheduler Starting Scheduler
[pod/awx-654b597b58-jj42n/awx-task] 2022-04-05 15:58:01,241 DEBUG    [6d2eb1fe2036428987ec6b89a6fca66b] awx.main.scheduler Finishing Scheduler
[pod/awx-654b597b58-jj42n/awx-ee] DEBUG 2022/04/05 15:58:01 Client connected to control service @
[pod/awx-654b597b58-jj42n/awx-task] 2022-04-05 15:58:01,890 ERROR    [1ab2556ef2364c7d95745517224ca0cb] awx.main.tasks.jobs inventory_update 8 (running) Exception occurred while running task
[pod/awx-654b597b58-jj42n/awx-task] Traceback (most recent call last):
[pod/awx-654b597b58-jj42n/awx-task]   File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/jobs.py", line 560, in run
[pod/awx-654b597b58-jj42n/awx-task]     res = receptor_job.run()
[pod/awx-654b597b58-jj42n/awx-task]   File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/receptor.py", line 269, in run
[pod/awx-654b597b58-jj42n/awx-task]     res = self._run_internal(receptor_ctl)
[pod/awx-654b597b58-jj42n/awx-task]   File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/receptor.py", line 345, in _run_internal
[pod/awx-654b597b58-jj42n/awx-task]     res = list(first_future.done)[0].result()
[pod/awx-654b597b58-jj42n/awx-task]   File "/usr/lib64/python3.9/concurrent/futures/_base.py", line 439, in result
[pod/awx-654b597b58-jj42n/awx-task]     return self.__get_result()
[pod/awx-654b597b58-jj42n/awx-task]   File "/usr/lib64/python3.9/concurrent/futures/_base.py", line 391, in __get_result
[pod/awx-654b597b58-jj42n/awx-task]     raise self._exception
[pod/awx-654b597b58-jj42n/awx-task]   File "/usr/lib64/python3.9/concurrent/futures/thread.py", line 58, in run
[pod/awx-654b597b58-jj42n/awx-task]     result = self.fn(*self.args, **self.kwargs)
[pod/awx-654b597b58-jj42n/awx-task]   File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/utils/common.py", line 1168, in wrapper_cleanup_new_process
[pod/awx-654b597b58-jj42n/awx-task]     return func(*args, **kwargs)
[pod/awx-654b597b58-jj42n/awx-task]   File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/receptor.py", line 401, in processor
[pod/awx-654b597b58-jj42n/awx-task]     return ansible_runner.interface.run(
[pod/awx-654b597b58-jj42n/awx-task]   File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/ansible_runner/interface.py", line 210, in run
[pod/awx-654b597b58-jj42n/awx-task]     r.run()
[pod/awx-654b597b58-jj42n/awx-task]   File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/ansible_runner/streaming.py", line 254, in run
[pod/awx-654b597b58-jj42n/awx-task]     self.artifacts_callback(data)
[pod/awx-654b597b58-jj42n/awx-task]   File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/ansible_runner/streaming.py", line 233, in artifacts_callback
[pod/awx-654b597b58-jj42n/awx-task]     unstream_dir(self._input, length, self.artifact_dir)
[pod/awx-654b597b58-jj42n/awx-task]   File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/ansible_runner/utils/streaming.py", line 70, in unstream_dir
[pod/awx-654b597b58-jj42n/awx-task]     with zipfile.ZipFile(tmp.name, "r") as archive:
[pod/awx-654b597b58-jj42n/awx-task]   File "/usr/lib64/python3.9/zipfile.py", line 1257, in __init__
[pod/awx-654b597b58-jj42n/awx-task]     self._RealGetContents()
[pod/awx-654b597b58-jj42n/awx-task]   File "/usr/lib64/python3.9/zipfile.py", line 1324, in _RealGetContents
[pod/awx-654b597b58-jj42n/awx-task]     raise BadZipFile("File is not a zip file")
[pod/awx-654b597b58-jj42n/awx-task] zipfile.BadZipFile: File is not a zip file
[pod/awx-654b597b58-jj42n/awx-task] 2022-04-05 15:58:01,893 DEBUG    [1ab2556ef2364c7d95745517224ca0cb] awx.main.tasks.jobs inventory_update 8 (running) finished running, producing 12 events.
[pod/awx-654b597b58-jj42n/awx-task] 2022-04-05 15:58:01,894 DEBUG    [1ab2556ef2364c7d95745517224ca0cb] awx.analytics.job_lifecycle inventoryupdate-8 post run
[pod/awx-654b597b58-jj42n/awx-task] 2022-04-05 15:58:01,932 DEBUG    [1ab2556ef2364c7d95745517224ca0cb] awx.main.models.inventory Going to update inventory computed fields, pk=2
[pod/awx-654b597b58-jj42n/awx-task] 2022-04-05 15:58:01,941 DEBUG    [1ab2556ef2364c7d95745517224ca0cb] awx.main.models.inventory Finished updating inventory computed fields, pk=2, in 0.008 seconds
[pod/awx-654b597b58-jj42n/awx-task] 2022-04-05 15:58:01,943 DEBUG    [1ab2556ef2364c7d95745517224ca0cb] awx.analytics.job_lifecycle inventoryupdate-8 finalize run
[pod/awx-654b597b58-jj42n/awx-task] 2022-04-05 15:58:01,956 WARNING  [1ab2556ef2364c7d95745517224ca0cb] awx.main.dispatch inventory_update 8 (error) encountered an error (rc=None), please see task stdout for details.
[pod/awx-654b597b58-jj42n/awx-task] 2022-04-05 15:58:01,958 DEBUG    [1ab2556ef2364c7d95745517224ca0cb] awx.main.dispatch task bbf88b96-8726-4711-b356-22f9399401e6 starting awx.main.tasks.system.handle_work_error(*['bbf88b96-8726-4711-b356-22f9399401e6'])
[pod/awx-654b597b58-jj42n/awx-task] 2022-04-05 15:58:01,959 DEBUG    [1ab2556ef2364c7d95745517224ca0cb] awx.main.tasks.system Executing error task id bbf88b96-8726-4711-b356-22f9399401e6, subtasks: [{'type': 'inventory_update', 'id': 8}]
[pod/awx-654b597b58-jj42n/awx-task] 2022-04-05 15:58:02,045 DEBUG    [1ab2556ef2364c7d95745517224ca0cb] awx.main.dispatch task bbf88b96-8726-4711-b356-22f9399401e6 starting awx.main.tasks.system.handle_work_success(*[])
[pod/awx-654b597b58-jj42n/awx-task] 2022-04-05 15:58:02,049 DEBUG    [1ab2556ef2364c7d95745517224ca0cb] awx.main.dispatch task 0620a91e-e5c9-4ec1-8bb0-7421b5945e86 starting awx.main.scheduler.tasks.run_task_manager(*[])

I also tried adding && sleep 60 after ansible-runner worker ... args for the Job container, but that doesn't solve the issue.

However, I have to admit, that scripts fail not necessarily 100% of the time. For our own inventory script we had some few successful runs long in the past. And also for the newly deployed Gardner AWS cluster it worked for the first run, but all further attempts were failing (10+), as well as for the new fresh AWS cluster it was failing all the time (as well as for new GCP)

I additionally tried reserving and allocating more memory for job Pods (2GB), and using OpenStack's Cinder storage for /runner folder on the runner - all w/o any success.

AlanCoding · 2022-04-08T14:49:54Z

Thanks for digging into this. I'm now even more convinced that this is the same thing which has been reported elsewhere in AWX. There are lots of duplicate issues that lay out the issue in an unclear way, but after searching through, I believe ansible/awx#11338 is the most accurate articulation of this bug. I suspect that there is a route to pinning down the error by combining the reproducer here with the tricks outlined there.

In my prior comment, I had deployed the operator on an OpenShift cluster, which was relatively quiet and didn't have much log history before that. I would like to get back around to this, and figure out how to force the issue with minikube.

shanemcd · 2022-04-21T12:26:23Z

I am pretty sure this is ansible/receptor#597

shanemcd · 2022-04-25T15:50:59Z

@stanislav-zaprudskiy The latest version of awx-ee now has a patched version of receptor in it. Can you pull the latest version and try it out? You may need to use image_pull_policy on your AWX custom resource object: https://github.com/ansible/awx-operator#deploying-a-specific-version-of-awx

shanemcd · 2022-04-25T15:57:44Z

Never mind, I just saw your comment on ansible/receptor#597. That is perplexing.

Can you provide some information about your workload?

How many hosts?
How many job events? (You can find this at /api/v2/jobs/<id>)
Any information you can provide about the facts being gathered.

shanemcd · 2022-04-25T17:16:47Z

@stanislav-zaprudskiy Have you tried the workaround described in ansible/awx#11338 (comment)?

stanislav-zaprudskiy · 2022-04-26T12:06:39Z

@shanemcd,

We have 2 installations ATM: one running v20.x of AWX, and another v15.x. We are in process of migration of all workloads from the older installation to the newer one (for various reasons), and this problem with inventories is currently hindering the progress.

How many hosts?

In the new installation we already have 20k hosts, but many inventories can't sync, so it's not a final number. The old installation has about 40k hosts, and that's something which we'd probably get eventually in the new set-up. Not all hosts are unique though - unique would be around 10k.

How many job events? (You can find this at /api/v2/jobs/)

Would you elaborate on job events please? I took some random successful job /api/v2/jobs/23154/job_events/, for which it reports 964 as count. Shall I take some specific job example?

Any information you can provide about the facts being gathered.

As for facts - we don't gather them, but rather all the details come from inventory source scripts. We pull the data from Netbox using our own custom plugin to populate it.

Have you tried the workaround described in ansible/awx#11338 (comment)?

Not, we haven't.

I was concerned that at some point in time we might get a log, which will involve rotation anyway (big inventory log; or a smaller one, but which will have to cross the border of containerLogMaxSize , etc)
containerLogMaxSize value is the same on a our OpenStack Kubernetes node and on a docker-desktop node (find kubelet configs diff here - Runner seems to have difficulties to process large outputs #998 (comment)). However, things work on docker-desktop - which may indicate that this parameter is unrelated.
I was concerned about the exact impact which increase of the parameter may cause on the nodes (performance-wise), so I was reluctant to bump it to 500MB or something which some people indicated helped them.
Finally, there is no direct interface interface in our cloud Kubernetes service which allows tinkering around with kubelet parameters 😄 I'd have to go a long route to verify if it solves the issue.

Tioborto · 2022-04-26T14:48:26Z

Hello @shanemcd,
To complete with another use case (we are another company), we run a template on a large inventory (~ 8000 hosts). This produces a large number of events (9725) with a page count (389) on /api/v2/jobs/{id}/job_events
We use the last version of ansible-runner and the last version of AWX (21.0.0).

{
    "id": 995231,
    "type": "job",
    "summary_fields": {
        "inventory": {
            "id": 484,
            "name": "linux",
            "description": "",
            "has_active_failures": true,
            "total_hosts": 8024,
            "hosts_with_active_failures": 119,
            "total_groups": 0,
            "has_inventory_sources": false,
            "total_inventory_sources": 0,
            "inventory_sources_with_failures": 0,
            "organization_id": 2,
            "kind": ""
        },
        "execution_environment": {
            "id": 40,
            "name": "Global EE Test",
            "description": "",
            "image": "....jfrog.io/adeo-awx-ee:0.0.0-main-global"
        },
        "instance_group": {
            "id": 4,
            "name": "default",
            "is_container_group": true
        }
    },
    "name": "dxp-check-connectivity-server-ask-inventory",
    "job_type": "run",
    "inventory": 484,
    "project": 754,
    "playbook": "check-connectivity/server.yaml",
    "scm_branch": "",
    "forks": 0,
    "limit": "",
    "verbosity": 0,
    "extra_vars": "{}",
    "job_tags": "",
    "force_handlers": false,
    "skip_tags": "",
    "start_at_task": "",
    "timeout": 0,
    "use_fact_cache": false,
    "organization": 2,
    "unified_job_template": 2143,
    "launch_type": "manual",
    "status": "error",
    "execution_environment": 40,
    "failed": true,
    "started": "2022-04-26T10:07:41.381674Z",
    "finished": "2022-04-26T11:03:48.440833Z",
    "canceled_on": null,
    "elapsed": 3367.059,
    "job_args": "[\"ssh-agent\", \"sh\", \"-c\", \"trap 'rm -f /runner/artifacts/995231/ssh_key_data' EXIT && ssh-add /runner/artifacts/995231/ssh_key_data && rm -f /runner/artifacts/995231/ssh_key_data && ansible-playbook -u ansibleremoteuser -i /runner/inventory/hosts -e @/runner/env/extravars check-connectivity/server.yaml\"]",
    "job_cwd": "/runner/project",
    "job_env": {
        "AWX_SERVICE_PORT_80_TCP_ADDR": "172.x.x.x",
        "AWX_OPERATOR_CONTROLLER_MANAGER_METRICS_SERVICE_PORT": "tcp://172.x.x.x:8443",
        "HOSTNAME": "automation-job-995231-76ptk",
        "AWX_SERVICE_PORT": "tcp://172.x.x.x:80",
        "AWX_SERVICE_PORT_80_TCP_PORT": "80",
        "KUBERNETES_PORT_443_TCP_PROTO": "tcp",
        "KUBERNETES_PORT_443_TCP_ADDR": "172.x.x.x",
        "container": "oci",
        "AWX_SERVICE_PORT_80_TCP_PROTO": "tcp",
        "KUBERNETES_PORT": "tcp://172.x.x.x:443",
        "PWD": "/runner",
        "HOME": "/var/lib/awx",
        "AWX_SERVICE_PORT_80_TCP": "tcp://172.x.x.x:80",
        "AWX_OPERATOR_CONTROLLER_MANAGER_METRICS_SERVICE_PORT_8443_TCP_PORT": "8443",
        "KUBERNETES_SERVICE_PORT_HTTPS": "443",
        "AWX_SERVICE_SERVICE_PORT": "80",
        "KUBERNETES_PORT_443_TCP_PORT": "443",
        "AWX_SERVICE_SERVICE_HOST": "172.x.x.x",
        "KUBERNETES_PORT_443_TCP": "tcp://172.x.x.x:443",
        "AWX_OPERATOR_CONTROLLER_MANAGER_METRICS_SERVICE_PORT_8443_TCP": "tcp://172.x.x.x:8443",
        "AWX_OPERATOR_CONTROLLER_MANAGER_METRICS_SERVICE_SERVICE_PORT_HTTPS": "8443",
        "AWX_OPERATOR_CONTROLLER_MANAGER_METRICS_SERVICE_SERVICE_PORT": "8443",
        "AWX_OPERATOR_CONTROLLER_MANAGER_METRICS_SERVICE_PORT_8443_TCP_PROTO": "tcp",
        "AWX_OPERATOR_CONTROLLER_MANAGER_METRICS_SERVICE_SERVICE_HOST": "172.x.x.x",
        "SHLVL": "0",
        "KUBERNETES_SERVICE_PORT": "443",
        "AWX_OPERATOR_CONTROLLER_MANAGER_METRICS_SERVICE_PORT_8443_TCP_ADDR": "172.x.x.x",
        "PATH": "/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
        "AWX_SERVICE_SERVICE_PORT_HTTP": "80",
        "KUBERNETES_SERVICE_HOST": "172.x.x.x",
        "LC_CTYPE": "C.UTF-8",
        "ANSIBLE_FACT_CACHE_TIMEOUT": "0",
        "ANSIBLE_FORCE_COLOR": "True",
        "ANSIBLE_HOST_KEY_CHECKING": "False",
        "ANSIBLE_INVENTORY_UNPARSED_FAILED": "True",
        "ANSIBLE_PARAMIKO_RECORD_HOST_KEYS": "False",
        "AWX_PRIVATE_DATA_DIR": "/tmp/awx_995231_ndl6nifl",
        "JOB_ID": "995231",
        "INVENTORY_ID": "484",
        "PROJECT_REVISION": "d260ddaf2d6bff",
        "ANSIBLE_RETRY_FILES_ENABLED": "False",
        "MAX_EVENT_RES": "700000",
        "AWX_HOST": "https://.../",
        "ANSIBLE_SSH_CONTROL_PATH_DIR": "/runner/cp",
        "ANSIBLE_COLLECTIONS_PATHS": "/runner/requirements_collections:~/.ansible/collections:/usr/share/ansible/collections",
        "ANSIBLE_ROLES_PATH": "/runner/requirements_roles:~/.ansible/roles:/usr/share/ansible/roles:/etc/ansible/roles",
        "PYTHONPATH": ":/usr/local/lib/python3.8/site-packages/ansible_runner/config/../callbacks",
        "ANSIBLE_CALLBACK_PLUGINS": "/usr/local/lib/python3.8/site-packages/ansible_runner/config/../callbacks",
        "ANSIBLE_STDOUT_CALLBACK": "awx_display",
        "AWX_ISOLATED_DATA_DIR": "/runner/artifacts/995231",
        "RUNNER_OMIT_EVENTS": "False",
        "RUNNER_ONLY_FAILED_EVENTS": "False"
    },
    "job_explanation": "Job terminated due to error",
    "execution_node": "",
    "controller_node": "awx-6f76575fd9-57k9r",
    "result_traceback": "",
    "event_processing_finished": true,
    "work_unit_id": "M495xlC7",
    "job_template": 2143,
    "passwords_needed_to_start": [],
    "allow_simultaneous": false,
    "artifacts": {},
    "scm_revision": "d260ddaf2d6bff9e4faa2dc375b8e1c0bb4fd7a1",
    "instance_group": 4,
    "diff_mode": false,
    "job_slice_number": 0,
    "job_slice_count": 1,
    "webhook_service": "",
    "webhook_credential": null,
    "webhook_guid": "",
    "host_status_counts": {},
    "playbook_counts": {
        "play_count": 1,
        "task_count": 1
    },
    "custom_virtualenv": null
}

For us, it's possible to try the workaround because we deployed our instance on a GKE cluster managed by another team (not doing specific).
At your disposal if you need more information

shanemcd · 2022-04-27T14:50:19Z

Now I'm almost positive that @AlanCoding was correct - it has to be ansible/awx#11338.

Unfortunately we do not currently have the resources to rearchitect this part of our application, especially since it is not affecting our customers on OpenShift, and the fact that there is a workaround. Hopefully they will fix kubernetes/kubernetes#59902 soon.

shanemcd · 2022-04-27T14:50:52Z

I'm going to close this as I think enough evidence has been provided to prove that the problem is not within ansible-runner.

stanislav-zaprudskiy · 2022-04-28T15:12:48Z

@shanemcd, we did more research around receptor and made the test inventory script working using streammethod: tcp and some socat magic on the EE which makes ansible-runner to read the incoming data from the receptor and stream results back to it.

I was going to perform some more tests and then to roll out the change in our productive environment, and will be also looking for any side-effects. However, would you come up with anything inherently bad with this approach of streammethod: tcp? Reading through ansible/receptor#278, I understand that there are certain issues for the cases when receptor restarts during the communication - but for AWX workflows that would mean a failure of a job and a need to restart it manually. WDYT?

shanemcd · 2022-04-29T19:26:43Z

@stanislav-zaprudskiy Nice digging, consider me impressed. If this helps you maybe we could look into exposing it via awx-operator. It's just more complex and we'd need to think about what the inputs look like... and certs. I used the k8s logger because we get access to it for free after the operator creates the service account. If you want to chat through anything, email is on my profile.

github-actions bot added the needs_triage New item that needs to be triaged label Feb 15, 2022

Shrews removed the needs_triage New item that needs to be triaged label Apr 5, 2022

AlanCoding mentioned this issue Apr 8, 2022

AWX stops gathering job output if kubernetes starts a new log ansible/awx#11338

Closed

3 tasks

shanemcd mentioned this issue Apr 21, 2022

Work results stream sometimes exits before all data has been received ansible/receptor#597

Closed

shanemcd closed this as completed Apr 27, 2022

washburnjoe mentioned this issue Dec 22, 2022

EC2 Inventory Errors, but no error shown ansible/awx#13357

Closed

9 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Runner seems to have difficulties to process large outputs #998

Runner seems to have difficulties to process large outputs #998

stanislav-zaprudskiy commented Feb 15, 2022

AlanCoding commented Mar 22, 2022

stanislav-zaprudskiy commented Mar 23, 2022

stanislav-zaprudskiy commented Apr 4, 2022 •

edited

Loading

stanislav-zaprudskiy commented Apr 5, 2022 •

edited

Loading

AlanCoding commented Apr 8, 2022

shanemcd commented Apr 21, 2022

shanemcd commented Apr 25, 2022

shanemcd commented Apr 25, 2022 •

edited

Loading

shanemcd commented Apr 25, 2022

stanislav-zaprudskiy commented Apr 26, 2022

Tioborto commented Apr 26, 2022 •

edited

Loading

shanemcd commented Apr 27, 2022

shanemcd commented Apr 27, 2022

stanislav-zaprudskiy commented Apr 28, 2022

shanemcd commented Apr 29, 2022

Runner seems to have difficulties to process large outputs #998

Runner seems to have difficulties to process large outputs #998

Comments

stanislav-zaprudskiy commented Feb 15, 2022

AlanCoding commented Mar 22, 2022

stanislav-zaprudskiy commented Mar 23, 2022

stanislav-zaprudskiy commented Apr 4, 2022 • edited Loading

stanislav-zaprudskiy commented Apr 5, 2022 • edited Loading

AlanCoding commented Apr 8, 2022

shanemcd commented Apr 21, 2022

shanemcd commented Apr 25, 2022

shanemcd commented Apr 25, 2022 • edited Loading

shanemcd commented Apr 25, 2022

stanislav-zaprudskiy commented Apr 26, 2022

Tioborto commented Apr 26, 2022 • edited Loading

shanemcd commented Apr 27, 2022

shanemcd commented Apr 27, 2022

stanislav-zaprudskiy commented Apr 28, 2022

shanemcd commented Apr 29, 2022

stanislav-zaprudskiy commented Apr 4, 2022 •

edited

Loading

stanislav-zaprudskiy commented Apr 5, 2022 •

edited

Loading

shanemcd commented Apr 25, 2022 •

edited

Loading

Tioborto commented Apr 26, 2022 •

edited

Loading