Description
Agent version: 3.3.2299.0 (installed via Greengrass SystemManager component)
My company uses SSM managed instances that happen to have a flaky internet connection. Recently we lost the connection to one of our machines. Before we said "bye bye", I gathered some information about the issue.
From the journal I managed to get these logs:
...
...
...
Nov 03 09:46:32 [REDACTED] amazon-ssm-agent.amazon-ssm-agent[1426230]: 2025-11-03 09:46:32.7813 INFO [CredentialRefresher] Next credential rotation will be>
-- Reboot --
Nov 03 12:40:12 [REDACTED] amazon-ssm-agent.amazon-ssm-agent[6712]: 2025/11/03 12:40:12 Found config file at /etc/amazon/ssm/amazon-ssm-agent.json.
Nov 03 12:40:12 [REDACTED] amazon-ssm-agent.amazon-ssm-agent[6712]: 2025/11/03 12:40:12 Found config file at /etc/amazon/ssm/amazon-ssm-agent.json.
Nov 03 12:40:12 [REDACTED] amazon-ssm-agent.amazon-ssm-agent[6712]: Applying config override from /etc/amazon/ssm/amazon-ssm-agent.json.
Nov 03 12:40:12 [REDACTED] amazon-ssm-agent.amazon-ssm-agent[6712]: 2025/11/03 12:40:12 processing appconfig overrides
Nov 03 12:40:12 [REDACTED] amazon-ssm-agent.amazon-ssm-agent[6712]: 2025-11-03 12:40:12.7541 ERROR Agent failed to assume any identity
Nov 03 12:40:13 [REDACTED] amazon-ssm-agent.amazon-ssm-agent[6712]: 2025-11-03 12:40:12.7546 ERROR failed to find identity, retrying: failed to find agent >
Nov 03 12:40:13 [REDACTED] amazon-ssm-agent.amazon-ssm-agent[6712]: 2025-11-03 12:40:13.2555 INFO Checking if agent identity type OnPrem can be assumed
Nov 03 12:40:13 [REDACTED] amazon-ssm-agent.amazon-ssm-agent[6712]: 2025-11-03 12:40:12.7546 ERROR failed to find identity, retrying: failed to find agent >
Nov 03 12:40:13 [REDACTED] amazon-ssm-agent.amazon-ssm-agent[6712]: 2025-11-03 12:40:13.2555 INFO Checking if agent identity type OnPrem can be assumed
...
Nov 04 11:38:37 [REDACTED] amazon-ssm-agent.amazon-ssm-agent[490780]: 2025-11-04 11:38:37.4277 ERROR failed to get identity: failed to find agent identity
Nov 04 11:38:37 [REDACTED] amazon-ssm-agent.amazon-ssm-agent[490780]: 2025-11-04 11:38:37.4278 ERROR Error occurred when starting amazon-ssm-agent: failed >
Nov 04 11:38:37 [REDACTED] systemd[1]: snap.amazon-ssm-agent.amazon-ssm-agent.service: Succeeded.
Nov 04 11:39:07 [REDACTED] systemd[1]: snap.amazon-ssm-agent.amazon-ssm-agent.service: Scheduled restart job, restart counter is at 2647.
Before the reboot I saw many "Next credential rotation" messages, so everything seemed healthy. After the reboot, the SSM agent entered a restart loop caused by "failed to find agent identity".
Next I ran diagnostics:
sudo ssm-cli get-diagnostics
{
  "DiagnosticsOutput": [
    {
      "Check": "EC2 IMDS",
      "Status": "Failed",
      "Note": "Failed to query IMDS: instance id request timeout"
    },
    {
      "Check": "Hybrid instance registration",
      "Status": "Skipped",
      "Note": "Instance does not have hybrid registration"
    },
    {
      "Check": "Connectivity to ssm endpoint",
      "Status": "Skipped",
      "Note": "Unable to fetch AWS region details"
    },
    {
      "Check": "Connectivity to ec2messages endpoint",
      "Status": "Skipped",
      "Note": "Unable to fetch AWS region details"
    },
    {
      "Check": "Connectivity to ssmmessages endpoint",
      "Status": "Skipped",
      "Note": "Unable to fetch AWS region details"
    },
    {
      "Check": "Connectivity to s3 endpoint",
      "Status": "Skipped",
      "Note": "Unable to fetch AWS region details"
    },
    {
      "Check": "Connectivity to kms endpoint",
      "Status": "Skipped",
      "Note": "Unable to fetch AWS region details"
    },
    {
      "Check": "Connectivity to logs endpoint",
      "Status": "Skipped",
      "Note": "Unable to fetch AWS region details"
    },
    {
      "Check": "Connectivity to monitoring endpoint",
      "Status": "Skipped",
      "Note": "Unable to fetch AWS region details"
    },
    {
      "Check": "AWS Credentials",
      "Status": "Skipped",
      "Note": "No credentials available"
    },
    {
      "Check": "Agent service",
      "Status": "Failed",
      "Note": "Agent is installed as snap service but is not running"
    },
    {
      "Check": "Proxy configuration",
      "Status": "Skipped",
      "Note": "No proxy configuration detected"
    },
    {
      "Check": "SSM Agent version",
      "Status": "Success",
      "Note": "SSM Agent version is 3.3.2299.0"
    }
  ]
}
This suggests that the credentials are broken. Using the SSM troubleshooting page, I located the agent logs and found a peculiar entry:
2025-11-03 12:35:15 ERROR Registration failed due to error registering the instance with AWS SSM. CredentialsEndpointError: failed to load credentials
caused by: SerializationError: failed to decode error message
status code: 500, request id:
caused by: UnmarshalError: failed decoding error message
00000000 46 61 69 6c 65 64 20 74 6f 20 67 65 74 20 63 6f |Failed to get co|
00000010 6e 6e 65 63 74 69 6f 6e |nnection|
caused by: invalid character 'F' looking for beginning of value
2025-11-03 12:35:21 ERROR Registration failed due to error registering the instance with AWS SSM. CredentialsEndpointError: failed to load credentials
caused by: SerializationError: failed to decode error message
status code: 500, request id:
caused by: UnmarshalError: failed decoding error message
00000000 46 61 69 6c 65 64 20 74 6f 20 67 65 74 20 63 6f |Failed to get co|
00000010 6e 6e 65 63 74 69 6f 6e |nnection|
caused by: invalid character 'F' looking for beginning of value
2025-11-03 12:35:27 ERROR Registration failed due to error registering the instance with AWS SSM. CredentialsEndpointError: failed to load credentials
caused by: SerializationError: failed to decode error message
status code: 500, request id:
caused by: UnmarshalError: failed decoding error message
00000000 46 61 69 6c 65 64 20 74 6f 20 67 65 74 20 63 6f |Failed to get co|
00000010 6e 6e 65 63 74 69 6f 6e |nnection|
caused by: invalid character 'F' looking for beginning of value
I assume this means that the credentials provider received "Failed to get connection" instead of valid credentials. Since then the SSM agent has been stuck in a crash loop and has not been able to come back up.
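To back up that reading of the hex dump, here is a small Python sketch (not the agent's own Go code) that decodes the bytes from the log and checks whether the result parses as JSON. That the decoder expected a JSON body is my assumption, based on the "invalid character 'F' looking for beginning of value" message; the helper names `decode_dump` and `is_json` are made up for illustration.

```python
import json

# Hex bytes copied from the two dump rows in the error log above.
HEX_DUMP = (
    "46 61 69 6c 65 64 20 74 6f 20 67 65 74 20 63 6f "
    "6e 6e 65 63 74 69 6f 6e"
)


def decode_dump(hex_dump: str) -> str:
    """Turn a space-separated hex dump into the ASCII text it encodes."""
    return bytes.fromhex(hex_dump).decode("ascii")


def is_json(text: str) -> bool:
    """Return True if the text parses as JSON, False otherwise."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


body = decode_dump(HEX_DUMP)
print(body)           # the plain-text body hidden in the dump
print(is_json(body))  # whether a JSON decoder could accept it
```

The decoded body is a plain-text error string rather than a JSON document, which is consistent with the decoder failing on the very first character.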
Note: This is not the first time I have seen this unmarshaling error, and it always results in a lost connection.
I don't know what happens at a deeper level or why the SSM agent doesn't recover, but it makes SSM unusable.