Description
I've noticed that occasionally a call to create/start a VM via the Oxide CLI seems to hang and then fail after about a minute. Inspecting the Nexus log, I see that the request timed out:
{"msg":"request completed","v":0,"name":"nexus","level":30,"time":"2022-05-25T05:32:37.432223873Z","hostname":"oxz_nexus","pid":20389,"uri":"//organizations/o/projects/p/instances/vm1/start","method":"POST","req_id":"85612f14-783f-40ea-982f-cc888d2320f8","remote_addr":"[fd00:1122:3344:101::1]:34100","local_addr":"[fd00:1122:3344:101::3]:12220","component":"dropshot_external","name":"e6bff1ff-24fb-49dc-a54e-c6a350cd4d6c","error_message_external":"Internal Server Error","error_message_internal":"CommunicationError: error sending request for url (http://[fd00:1122:3344:101::1]:12345/instances/8724453d-7959-44ed-809f-1bc9d02ebe9a): operation timed out","response_code":"500"}
Looking at the sled agent, I can see it hang after hitting this line:
{"msg":"Started propolis in zone: oxz_propolis-server_5d23c6ef-b4c0-4f0a-90de-fb5001133a5c","v":0,"name":"SledAgent","level":30,"time":"2022-05-25T05:31:45.446013495Z","hostname":"feldspar","pid":14194,"instance_id":"8724453d-7959-44ed-809f-1bc9d02ebe9a","component":"InstanceManager"}
{"msg":"halt_and_remove_logged: Previous zone state: Running","v":0,"name":"SledAgent","level":30,"time":"2022-05-25T05:32:39.406791793Z","hostname":"feldspar","pid":14194,"zone":"oxz_propolis-server_5d23c6ef-b4c0-4f0a-90de-fb5001133a5c","instance_id":"8724453d-7959-44ed-809f-1bc9d02ebe9a","component":"InstanceManager"}
{"msg":"Stopped and uninstalled zone","v":0,"name":"SledAgent","level":30,"time":"2022-05-25T05:32:39.408091477Z","hostname":"feldspar","pid":14194,"zone":"oxz_propolis-server_5d23c6ef-b4c0-4f0a-90de-fb5001133a5c","instance_id":"8724453d-7959-44ed-809f-1bc9d02ebe9a","component":"InstanceManager"}
It just sits there for about a minute, then removes the zone and cleans up. I was able to snag the Propolis logs during this hang, and we see this:
bnaecker@feldspar : ~/omicron $ pfexec zlogin $(zoneadm list | tail -1) 'tail +0 -f $(svcs -L propolis-server)'
[ May 24 22:31:45 Enabled. ]
[ May 24 22:31:45 Executing start method ("ctrun -l child -o noorphan,regent /opt/oxide/propolis-server/bin/propolis-server run /var/svc/manifest/site/propolis-server/config.toml unknown &"). ]
[ May 24 22:31:45 Method "start" exited with status 0. ]
error: Invalid value for '<PROPOLIS_IP:PORT>': invalid IP address syntax
[ May 24 22:31:45 Stopping because all processes in service exited. ]
[ May 24 22:31:45 Executing stop method (:kill). ]
[ May 24 22:31:45 Executing start method ("ctrun -l child -o noorphan,regent /opt/oxide/propolis-server/bin/propolis-server run /var/svc/manifest/site/propolis-server/config.toml unknown &"). ]
[ May 24 22:31:45 Method "start" exited with status 0. ]
error: Invalid value for '<PROPOLIS_IP:PORT>': invalid IP address syntax
[ May 24 22:31:45 Stopping because all processes in service exited. ]
[ May 24 22:31:45 Executing stop method (:kill). ]
[ May 24 22:31:45 Executing start method ("ctrun -l child -o noorphan,regent /opt/oxide/propolis-server/bin/propolis-server run /var/svc/manifest/site/propolis-server/config.toml unknown &"). ]
[ May 24 22:31:45 Method "start" exited with status 0. ]
error: Invalid value for '<PROPOLIS_IP:PORT>': invalid IP address syntax
[ May 24 22:31:45 Stopping because all processes in service exited. ]
[ May 24 22:31:45 Executing stop method (:kill). ]
[ May 24 22:31:45 Executing start method ("ctrun -l child -o noorphan,regent /opt/oxide/propolis-server/bin/propolis-server run /var/svc/manifest/site/propolis-server/config.toml unknown &"). ]
[ May 24 22:31:45 Method "start" exited with status 0. ]
error: Invalid value for '<PROPOLIS_IP:PORT>': invalid IP address syntax
[ May 24 22:31:45 Stopping because all processes in service exited. ]
[ May 24 22:31:45 Executing stop method (:kill). ]
[ May 24 22:31:45 Restarting too quickly, changing state to maintenance. ]
That shows that the propolis-server
binary was started with a listening address of "unknown"
, which is obviously wrong. That comes from the default value of the config/server_addr
SMF property. The sled agent is supposed to be setting that property here to the actual IP address that Nexus provided for that Propolis server. That doesn't seem to always happen.
I modified the sled agent to print the values of that property for both the generic propolis-server
SMF service, and for the specific instance of that service that it adds. Those seem to always show:
{"msg":"Generic Propolis server SMF properties: config/server_addr astring [fd00:1122:3344:101::17]:12400\n","v":0,"name":"SledAgent","level":30,"time":"2022-05-25T05:31:45.331659446Z","hostname":"feldspar","pid":14194,"instance_id":"8724453d-7959-44ed-809f-1bc9d02ebe9a","component":"InstanceManager"}
{"msg":"Propolis server SMF instance properties: ","v":0,"name":"SledAgent","level":30,"time":"2022-05-25T05:31:45.408559664Z","hostname":"feldspar","pid":14194,"instance_id":"8724453d-7959-44ed-809f-1bc9d02ebe9a","component":"InstanceManager"}
That is, the "generic" SMF service svc:/system/illumos/propolis-server
seems to get the provided value. The specific instance of the service, in this case svc:/system/illumos/propolis-server:vm-5d23c6ef-b4c0-4f0a-90de-fb5001133a5c
has an empty string as its property value. Presumably this is because the instance is supposed to inherit properties from its parent configuration, unless they've been overridden. Unfortunately, those are always the values, when the instance provision succeeds or hangs.
However, when it hangs, I've also seen that inspecting the property value with svccfg
directly from the global zone shows that it remains unmodified:
bnaecker@feldspar : ~/omicron $ pfexec zlogin $(zoneadm list | tail -1) 'svccfg -v -s propolis-server listprop config/server_addr'
config/server_addr astring unknown
I really don't have much else yet, but it's getting late and I wanted to record my debugging state. I've experimented a bit with setting the property directly on the instance, but that fails with the error: svccfg: No such property group "config"
. I've also tried calling svcprop refresh
, but that seems to have no effect.
Overall, it appears there's a race or some other inconsistency in how the actual SMF service instance we're launching gets the value of its property config/server_addr
. My current steps to repro this are, unfortunately, just starting/stopping an instance a ton of times. I can usually hit this after a median of about 5 retries, but it's extremely variable.