SSM is unable to recover after corrupted credentials #650

@kkurczewski

Description

Agent version: 3.3.2299.0 (installed via Greengrass SystemManager component)

My company uses SSM managed instances which happen to have a flaky internet connection. Recently we lost the connection to one of our machines. Before we said "bye bye" to it, I gathered some information about the issue.

I managed to get these logs from the journal:

...
...
...
Nov 03 09:46:32 [REDACTED] amazon-ssm-agent.amazon-ssm-agent[1426230]: 2025-11-03 09:46:32.7813 INFO [CredentialRefresher] Next credential rotation will be>
-- Reboot --
Nov 03 12:40:12 [REDACTED] amazon-ssm-agent.amazon-ssm-agent[6712]: 2025/11/03 12:40:12 Found config file at /etc/amazon/ssm/amazon-ssm-agent.json.
Nov 03 12:40:12 [REDACTED] amazon-ssm-agent.amazon-ssm-agent[6712]: 2025/11/03 12:40:12 Found config file at /etc/amazon/ssm/amazon-ssm-agent.json.
Nov 03 12:40:12 [REDACTED] amazon-ssm-agent.amazon-ssm-agent[6712]: Applying config override from /etc/amazon/ssm/amazon-ssm-agent.json.
Nov 03 12:40:12 [REDACTED] amazon-ssm-agent.amazon-ssm-agent[6712]: 2025/11/03 12:40:12 processing appconfig overrides
Nov 03 12:40:12 [REDACTED] amazon-ssm-agent.amazon-ssm-agent[6712]: 2025-11-03 12:40:12.7541 ERROR Agent failed to assume any identity
Nov 03 12:40:13 [REDACTED] amazon-ssm-agent.amazon-ssm-agent[6712]: 2025-11-03 12:40:12.7546 ERROR failed to find identity, retrying: failed to find agent >
Nov 03 12:40:13 [REDACTED] amazon-ssm-agent.amazon-ssm-agent[6712]: 2025-11-03 12:40:13.2555 INFO Checking if agent identity type OnPrem can be assumed
Nov 03 12:40:13 [REDACTED] amazon-ssm-agent.amazon-ssm-agent[6712]: 2025-11-03 12:40:12.7546 ERROR failed to find identity, retrying: failed to find agent >
Nov 03 12:40:13 [REDACTED] amazon-ssm-agent.amazon-ssm-agent[6712]: 2025-11-03 12:40:13.2555 INFO Checking if agent identity type OnPrem can be assumed
...
Nov 04 11:38:37 [REDACTED] amazon-ssm-agent.amazon-ssm-agent[490780]: 2025-11-04 11:38:37.4277 ERROR failed to get identity: failed to find agent identity
Nov 04 11:38:37 [REDACTED] amazon-ssm-agent.amazon-ssm-agent[490780]: 2025-11-04 11:38:37.4278 ERROR Error occurred when starting amazon-ssm-agent: failed >
Nov 04 11:38:37 [REDACTED] systemd[1]: snap.amazon-ssm-agent.amazon-ssm-agent.service: Succeeded.
Nov 04 11:39:07 [REDACTED] systemd[1]: snap.amazon-ssm-agent.amazon-ssm-agent.service: Scheduled restart job, restart counter is at 2647.

Before the reboot I saw many "Next credential rotation" messages, so everything seemed healthy. After the reboot the SSM agent entered a restart loop caused by "failed to find agent identity".

Next I ran diagnostics:

sudo ssm-cli get-diagnostics
{
  "DiagnosticsOutput": [
    {
      "Check": "EC2 IMDS",
      "Status": "Failed",
      "Note": "Failed to query IMDS: instance id request timeout"
    },
    {
      "Check": "Hybrid instance registration",
      "Status": "Skipped",
      "Note": "Instance does not have hybrid registration"
    },
    {
      "Check": "Connectivity to ssm endpoint",
      "Status": "Skipped",
      "Note": "Unable to fetch AWS region details"
    },
    {
      "Check": "Connectivity to ec2messages endpoint",
      "Status": "Skipped",
      "Note": "Unable to fetch AWS region details"
    },
    {
      "Check": "Connectivity to ssmmessages endpoint",
      "Status": "Skipped",
      "Note": "Unable to fetch AWS region details"
    },
    {
      "Check": "Connectivity to s3 endpoint",
      "Status": "Skipped",
      "Note": "Unable to fetch AWS region details"
    },
    {
      "Check": "Connectivity to kms endpoint",
      "Status": "Skipped",
      "Note": "Unable to fetch AWS region details"
    },
    {
      "Check": "Connectivity to logs endpoint",
      "Status": "Skipped",
      "Note": "Unable to fetch AWS region details"
    },
    {
      "Check": "Connectivity to monitoring endpoint",
      "Status": "Skipped",
      "Note": "Unable to fetch AWS region details"
    },
    {
      "Check": "AWS Credentials",
      "Status": "Skipped",
      "Note": "No credentials available"
    },
    {
      "Check": "Agent service",
      "Status": "Failed",
      "Note": "Agent is installed as snap service but is not running"
    },
    {
      "Check": "Proxy configuration",
      "Status": "Skipped",
      "Note": "No proxy configuration detected"
    },
    {
      "Check": "SSM Agent version",
      "Status": "Success",
      "Note": "SSM Agent version is 3.3.2299.0"
    }
  ]
}
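To make the output above easier to scan, here is a minimal Python sketch that lists only the checks reporting "Failed". The helper name failed_checks is hypothetical, and the sample is an abbreviated copy of the report above; it assumes the JSON shape shown there ("DiagnosticsOutput" entries with "Check", "Status" and "Note").

```python
import json


def failed_checks(diagnostics_json: str) -> list[tuple[str, str]]:
    """Return (check, note) pairs for every entry whose Status is "Failed"."""
    report = json.loads(diagnostics_json)
    return [
        (item["Check"], item["Note"])
        for item in report["DiagnosticsOutput"]
        if item["Status"] == "Failed"
    ]


# Abbreviated copy of the diagnostics output from this report.
sample = """
{
  "DiagnosticsOutput": [
    {"Check": "EC2 IMDS", "Status": "Failed",
     "Note": "Failed to query IMDS: instance id request timeout"},
    {"Check": "AWS Credentials", "Status": "Skipped",
     "Note": "No credentials available"},
    {"Check": "Agent service", "Status": "Failed",
     "Note": "Agent is installed as snap service but is not running"},
    {"Check": "SSM Agent version", "Status": "Success",
     "Note": "SSM Agent version is 3.3.2299.0"}
  ]
}
"""

for check, note in failed_checks(sample):
    print(f"{check}: {note}")
```

For the full report only two checks fail (EC2 IMDS and Agent service); everything else is skipped, mostly because no region or credentials could be resolved.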

This suggests that the credentials are broken. Using the SSM troubleshooting page I found the log location, and in the logs I found a peculiar entry:

2025-11-03 12:35:15 ERROR Registration failed due to error registering the instance with AWS SSM. CredentialsEndpointError: failed to load credentials
caused by: SerializationError: failed to decode error message
        status code: 500, request id:
caused by: UnmarshalError: failed decoding error message
        00000000  46 61 69 6c 65 64 20 74  6f 20 67 65 74 20 63 6f  |Failed to get co|
00000010  6e 6e 65 63 74 69 6f 6e                           |nnection|

caused by: invalid character 'F' looking for beginning of value
2025-11-03 12:35:21 ERROR Registration failed due to error registering the instance with AWS SSM. CredentialsEndpointError: failed to load credentials
caused by: SerializationError: failed to decode error message
        status code: 500, request id:
caused by: UnmarshalError: failed decoding error message
        00000000  46 61 69 6c 65 64 20 74  6f 20 67 65 74 20 63 6f  |Failed to get co|
00000010  6e 6e 65 63 74 69 6f 6e                           |nnection|

caused by: invalid character 'F' looking for beginning of value
2025-11-03 12:35:27 ERROR Registration failed due to error registering the instance with AWS SSM. CredentialsEndpointError: failed to load credentials
caused by: SerializationError: failed to decode error message
        status code: 500, request id:
caused by: UnmarshalError: failed decoding error message
        00000000  46 61 69 6c 65 64 20 74  6f 20 67 65 74 20 63 6f  |Failed to get co|
00000010  6e 6e 65 63 74 69 6f 6e                           |nnection|

caused by: invalid character 'F' looking for beginning of value

I assume this means that the credentials provider received "Failed to get connection" instead of valid credentials. Since then the SSM agent has been stuck in a crash loop and has not been able to come back up.
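As a sanity check on that reading, the hex bytes from the log can be decoded by hand. The sketch below is Python, not the agent's own code; the variable names and the is_json helper are made up for illustration. It shows that the response body is the plain text "Failed to get connection", which is not JSON, consistent with the decoder in the log complaining about the leading 'F'.

```python
import json

# Bytes copied from the two hex dump rows in the log entry above.
hex_dump = (
    "46 61 69 6c 65 64 20 74 6f 20 67 65 74 20 63 6f "
    "6e 6e 65 63 74 69 6f 6e"
)

# bytes.fromhex ignores the spaces between byte pairs.
body = bytes.fromhex(hex_dump).decode("ascii")


def is_json(text: str) -> bool:
    """Return True if text parses as JSON, False otherwise."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


print(repr(body))
print("valid JSON:", is_json(body))
```

Python words its parse error differently from the message in the log, but the cause is the same: the body starts with a bare 'F' rather than a JSON value.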


Note: This is not the first time I have seen this unmarshaling error, and it always results in a lost connection.

I don't know what happens at a deeper level or why SSM doesn't recover, but it makes SSM unusable.
