Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Device permission update broken for some devices #2133

Closed
agners opened this issue Sep 15, 2022 · 23 comments
Closed

Device permission update broken for some devices #2133

agners opened this issue Sep 15, 2022 · 23 comments

Comments

@agners
Copy link
Member

agners commented Sep 15, 2022

RaspberryMatic users report issues with Homeatic devices not being able to passed into the RaspberryMatic Add-on (see jens-maus/RaspberryMatic#1942).

The supervisor reports the following error:

22-09-15 13:34:43 ERROR (MainThread) [supervisor.docker.addon] Can't set cgroup permission on the host for addon_de838cd8_raspberrymatic
22-09-15 13:34:43 ERROR (MainThread) [asyncio] Task exception was never retrieved
future: <Task finished name='Task-14879' coro=<DockerAddon._hardware_events() done, defined at /usr/src/supervisor/supervisor/jobs/decorator.py:85> exception=DockerError("Can't set cgroup permission on the host for addon_de838cd8_raspberrymatic")>
Traceback (most recent call last):
  File "/usr/src/supervisor/supervisor/docker/addon.py", line 720, in _hardware_events
    await self.sys_dbus.agent.cgroup.add_devices_allowed(
  File "/usr/src/supervisor/supervisor/dbus/agent/cgroup.py", line 19, in add_devices_allowed
    await self.dbus.CGroup.AddDevicesAllowed(container_id, permission)
  File "/usr/src/supervisor/supervisor/utils/dbus.py", line 174, in call_dbus
    raise DBusFatalError(reply.body[0])
supervisor.exceptions.DBusFatalError: Error calling runc for '87542225ef424bd38c4d49ff3db5a8c46b277fdee1c3efd1a6aee5edd57d9d80': exit status 1, output time="2022-09-15T11:34:43Z" level=warning msg="could not find device group for '99/204' in /proc/devices -- temporarily ignoring rule: {99 204 -1 rwm true}"
time="2022-09-15T11:34:43Z" level=info msg="found more than one filter (2) attached to a cgroup -- removing extra filters!"
time="2022-09-15T11:34:43Z" level=info msg="removing old filter 0 from cgroup" id=139 name= run_count=0 runtime=0s tag=f231b56d360e591c type=CGroupDevice
time="2022-09-15T11:34:43Z" level=info msg="removing old filter 1 from cgroup" id=140 name= run_count=0 runtime=0s tag=be0b8d7ca6afd4d8 type=CGroupDevice
time="2022-09-15T11:34:43Z" level=warning msg="could not find device group for '99/204' in /proc/devices -- temporarily ignoring rule: {99 204 -1 rwm true}"
time="2022-09-15T11:34:43Z" level=info msg="found more than one filter (2) attached to a cgroup -- removing extra filters!"
time="2022-09-15T11:34:43Z" level=info msg="removing old filter 0 from cgroup" id=144 name= run_count=0 runtime=0s tag=839fa1b30f0f67c3 type=CGroupDevice
time="2022-09-15T11:34:43Z" level=info msg="removing old filter 1 from cgroup" id=147 name= run_count=0 runtime=0s tag=be0b8d7ca6afd4d8 type=CGroupDevice
time="2022-09-15T11:34:43Z" level=error msg="failed to call BPF_PROG_DETACH (BPF_CGROUP_DEVICE) on old filter program: can't detach program: no such file or directory"
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/usr/src/supervisor/supervisor/jobs/decorator.py", line 141, in wrapper
    raise err
  File "/usr/src/supervisor/supervisor/jobs/decorator.py", line 139, in wrapper
    return await self._method(*args, **kwargs)
  File "/usr/src/supervisor/supervisor/docker/addon.py", line 727, in _hardware_events
    raise DockerError(
supervisor.exceptions.DockerError: Can't set cgroup permission on the host for addon_de838cd8_raspberrymatic
22-09-15 13:34:43 ERROR (MainThread) [supervisor.docker.addon] Can't set cgroup permission on the host for addon_de838cd8_raspberrymatic
@jens-maus
Copy link
Contributor

Thanks for having pulled the 9.0 for the time being while we are going to investigate that issue. I will keep you updated here as soon as I did some investigation myself to try to reproduce this issue.

@agners
Copy link
Member Author

agners commented Sep 15, 2022

@jens-maus unfortunately, I really can't find my only Homeatic device 😢

Is that the device which makes use of the raw UART driver? So maybe a raw UART driver & runc problem?

I tried passing regular GPIO UART's (ttyAMA0) to an add-on as well as USB devices like deCONZ, and that seems to work. I am a bit out of idea how I can reproduce it here.

I've looked through the source code, and it seems that runc requires the device to be present in /proc/devices. If you can reproduce it at your end, can you check how /proc/devices looks when the device is plugged in?

@agners
Copy link
Member Author

agners commented Sep 15, 2022

Further investigation actually uncovered that the root problem is this error:

time="2022-09-15T11:34:43Z" level=error msg="failed to call BPF_PROG_DETACH (BPF_CGROUP_DEVICE) on old filter program: can't detach program: no such file or directory"

This causes runc to fail, which probably causes rules to not get applied correctly. This is likely caused by parallel calls to OS Agent/runc, due to multiple devices getting added at once. The following fix prevents Supervisor from calling OS Agent/runc twice at once home-assistant/supervisor#3868.

@jens-maus
Copy link
Contributor

great, looking forward to development builds to be able to test this...

@agners
Copy link
Member Author

agners commented Sep 15, 2022

@jens-maus Supervisor dev build 2022.09.dev1501 should include the fix.

@jens-maus
Copy link
Contributor

thx, however, I always tend to forget how to update to these dev builds.. can you please help me out here :)

@agners
Copy link
Member Author

agners commented Sep 15, 2022

Sure :)

ha su options --channel dev
ha su reload
ha su info

This should show latest version is 2022.09.dev1501. Then

ha su update

Will update to that.

@dieterxy
Copy link

I had the same issue. I Changed to the Dev Channel as described above. Update to 2022.09.dev1501 allown was not sufficient, but with the directly following dev updates it now works again with the following versions:
Home Assistant 2022.10.0.dev20220916
Supervisor 2022.09.dev1602
Operating System 10.0.dev20220912
Frontend 20220916.0.dev - latest

Thanks for the fast solution.

@agners
Copy link
Member Author

agners commented Sep 16, 2022

Hm, Supervisor 2022.09.dev1501 should really be enough, the newer dev versions contain just unrelated dependency updates. Did you restart the add-on after updating the Supervisor?

@agners
Copy link
Member Author

agners commented Sep 16, 2022

@jens-maus have you been able to test with the latest Sueprvisor?

@dieterxy
Copy link

Sorry, I really don't know. Usually I restart after updating, but I'm new to HA and restarting is really not that automated for me yet :-(
So it might be the case that I forgot to restart.

@pvizeli
Copy link
Member

pvizeli commented Sep 17, 2022

I had the same issue. I Changed to the Dev Channel as described above. Update to 2022.09.dev1501 allown was not sufficient, but with the directly following dev updates it now works again with the following versions: Home Assistant 2022.10.0.dev20220916 Supervisor 2022.09.dev1602 Operating System 10.0.dev20220912 Frontend 20220916.0.dev - latest

Thanks for the fast solution.

I tested the supervisor and other things quite a bit. That doesn't solve the issue. That it work for one person would maybe mean there is an timing issue? I guess to understand what the driver make and how they interact with userland could help to understand it better

@SmilyHM
Copy link

SmilyHM commented Sep 18, 2022

Hi,

I'm now on

  • Home Assistant 2022.10.0.dev.20220918
  • Supervisor 2022.09.dev.1703
  • Operating System 10.0.dev20220912
  • Frontend 20220918.0.dev - latest

It seems to work now. Addon-Log still show

Starting rfd: ....................ERROR
Starting HMIPServer: ..............OK

but RF + HMIP working both atm.

The only thing is: My Duty Cycle grows and grows (round about 1 percent / minute). Now at 55 percent. Don't know what's normal with my not very big setup, but can't imagine it would ever be abouve 20 (or 10)...
(Carrier sense is 0 - 2 percent)

Best regards
Sascha

@jens-maus
Copy link
Contributor

@pvizeli @agners

I tested the supervisor and other things quite a bit. That doesn't solve the issue. That it work for one person would maybe mean there is an timing issue? I guess to understand what the driver make and how they interact with userland could help to understand it better

I investigate the issue a bit myself and to me it parly looks like a timing issue and is definitly related to the point that within the addon the two /dev/mmd_* devices cannot be accessed sometimes... Remember, we had something like this in the past (see jens-maus/RaspberryMatic#1707 and jens-maus/RaspberryMatic#1752)? And back then I was able to solve/workaround the issue by adding a sleep 5 after loading the eq3_char_loop kernel module because it seems the dynamic permission setup in HomeAssistantOS (e.g. cgroups) for the /dev/eq3loop device took a while before another process could access it.

And now, this time it looks like something similar things are happening when the dynamically generated /dev/mmd_bidcos and /dev/mmd_hmip devices – which are generated once the multimacd process starts up – are accessed. It seems that the dynamic permission setup within HomeAssistant to provide a certain addon permission to a device seem to take some time and now with cgroupv2 this might have gone worse.... At least that's what I am noticing/guessing here.

One of the strange things is, that while debugging I can see that sometimes /dev/mmd_bidcos cannot be used and sometimes /dev/mmd_hmip cannot be used and sometimes the other way around, but it seems that both cannot be access at the same time as it seems. And if, e.g. /dev/mmd_hmip cannot be accessed the accessing process HMIPServer.jar spits out error messages like:

2022-09-18 17:38:39,857 de.eq3.cbcs.server.local.base.internal.LocalServerAdapterInitialization ERROR [vert.x-eventloop-thread-3] Error 'LOCAL_ADAPTER_NO_SUCH_PORT' while trying to open port '/dev/mmd_hmip': de.eq3.cbcs.lib.commdevice.CommDeviceException: Exception while trying to open serial port. Check configured port '/dev/mmd_hmip'

And when both mmd devices are present they are looking like:

crw-------    1 root     root      240,   2 Sep 18 17:29 /dev/mmd_bidcos
crw-------    1 root     root      240,   1 Sep 18 17:29 /dev/mmd_hmip

So can this be related to the point that these device nodes "just" differ in the minor number? I don't think so, but who knows....

And perhaps you guys (@pvizeli or @agners) have some idea how I can more deeply analyze this situation within HomeAssistantOS?!?! Because I am partly out of ideas on how to further debug these permission/device access issues...

jens-maus added a commit to jens-maus/RaspberryMatic that referenced this issue Sep 18, 2022
provide underlying operating systems a bit more time for settlement and
permission setup to the dynamically generated /dev/mmd_* devices. This
should help in case of newer HomeAssistantOS 9.0+ environments which
seem to require a bit more time before rfd/HMIPServer can successfully
access these multimacd devices. This fixes #1942 and refs home-assistant/operating-system#2133
@jens-maus
Copy link
Contributor

Ok, after further investigation I updated to the latest supervisor dev build (2022.09.dev1703) and noticed that this version seem to have improved my observation that either /dev/mmd_bidcos or /dev/mmd_hmip works, but never both. With this newer supervisor in place I could now introduce an additional sleep 5 call to provide the dynamic addon permission setup somewhat more time when these devices are dynamically generated and voila these two changes (newer supervisor + sleep 5 after dynamic device generation) seem to have resulted in a working "RaspberryMatic CCU" add-on bootup again.

However, I am still curious why these additional sleep X calls are actually necessary in case of HomeAssistantOS/HomeAssistant while they are not required for non-HomeAssistant based RaspberryMatic setup?!? Perhaps you guys (@pvizeli @agners) could check if these is still room for improvement here so that these awkward workaround for timing issues in permission setup to dynamically generated devices are not necessary. It would perhaps already be enough to somehow halt/delay the execution of an add-on (if this is possible) while the dynamic permission setup is currently underway.... But these are HomeAssistant internals that you both are probably more familiar with.

Nevertheless, it would be great if the newer supervisor with the latest integrated fixes is released soon, so that "RaspberryMatic CCU" Add-on users can update accordingly and use newer add-on versions as well (with the sleep) so that they get their HomeMatic setup running again with HAOS 9+

@jens-maus
Copy link
Contributor

FYI: I just downgraded to the latest supervisor 2022.09.1 beta version and this version also seem to be recent enough to get the "RaspberryMatic CCU" add-on going again – however, unfortunately only with the additional sleep 5 workaround as mentioned above.

@pvizeli
Copy link
Member

pvizeli commented Sep 19, 2022

FYI: I just downgraded to the latest supervisor 2022.09.1 beta version and this version also seem to be recent enough to get the "RaspberryMatic CCU" add-on going again – however, unfortunately only with the additional sleep 5 workaround as mentioned above.

Yeah, what I said, the error in the log is not the issue that we looking for. I fixed it with the supervisor, but that doesn't solve our issue. Docker now using BPF but also systemd to control the access with docker > 20.10.8 and new distros. I guess these 2 system that needs to know about the device's access take more time to get it in place.

Thank you @jens-maus for looking into and finding the issue & fix this. We act on kernel events, I don't think we can speed the device access more as it is now and I'm also not sure if the BPF is the part which is slower or the systemd cgroup implementation. Since cgroup1 is EOF and we already push back it 1y, we have to move forward with cgroup2 and find out how to life with it.

@agners
Copy link
Member Author

agners commented Sep 19, 2022

One of the strange things is, that while debugging I can see that sometimes /dev/mmd_bidcos cannot be used and sometimes /dev/mmd_hmip cannot be used and sometimes the other way around, but it seems that both cannot be access at the same time as it seems.

That is probably the same issue which caused the error failed to call BPF_PROG_DETACH (BPF_CGROUP_DEVICE) on old filter program: can't detach program: no such file or directory", basically a race between the two filters getting applied. From what I understand, this should be resolved by home-assistant/supervisor#3868. With that, even when two devices are created "at the same time", both devices correctly get permissions applied. Since that PR is in 2022.09.1, this lines up with your observations.

With this newer supervisor in place I could now introduce an additional sleep 5 call to provide the dynamic addon permission setup somewhat more time when these devices are dynamically generated and voila these two changes (newer supervisor + sleep 5 after dynamic device generation) seem to have resulted in a working "RaspberryMatic CCU" add-on bootup again.

It is very likely that the current dynamic device permission setting implementation is slower. We call runc through OS Agent to make it work with cgroupsv2, in cgroupsv1 this was a simple sysfs write.

However, 5s seems like a very long time 😰. Is 5s "the optimized" value, as in, 1s through 4s is not enough?

In general, I hope the device cgroups permission update gets accepted into upstream runc, and gains native containerd and Docker support. There will always be a window when the device is being created until it is accessible. But it should definitely not be 5s.

Ideally you should implement some method to check if the device is readable (maybe test -r works for the device?) and wait until it is.

HomeAssistantOS/HomeAssistant while they are not required for non-HomeAssistant based RaspberryMatic setup?!?

From what I understand in pure RaspberryMatic setup there is no container involved, so as soon as the device has been created it will be accessible? In the end, the permission update is (and was) and asynchronous process running in parallel. There is always a potential for a race condition here. A check along with a busy wait is probably the best approach.

@pvizeli
Copy link
Member

pvizeli commented Sep 19, 2022

Ideally you should implement some method to check if the device is readable (maybe test -r works for the device?) and wait until it is.

Yeah, I used a similar approach on an add-on to wait until the device came online (that was not about access, but maybe that changed now too, as we have to use dynamic access more often because whitelisting devices doesn't work anymore with cgroup2)

@jens-maus
Copy link
Contributor

Ideally you should implement some method to check if the device is readable (maybe test -r works for the device?) and wait until it is.

Yeah, I used a similar approach on an add-on to wait until the device came online (that was not about access, but maybe that changed now too, as we have to use dynamic access more often because whitelisting devices doesn't work anymore with cgroup2)

Unfortunatley, a simply test -r check in a while loop does not help here, because this test never seem to return a "permission denied" error or similar. The point is, that when the /dev/mmd_* devices are generated they appear instantly with read-write permission in the /dev/ space, but it still seems to take some time (5 seconds is just to be on the very very safe side) until they are actually usable by the accessing applications within the add-on. And AFAIK test -r only checks for simply file permissions and not if a file/device can actually be read or not. Thus, so far I haven't found any other means of "solving" this race condition. But I am open for more suggestions of course :)

@agners
Copy link
Member Author

agners commented Sep 19, 2022

Right, it needs something which calls open() on the device file. head -n 0 seams to work without reading, and has a proper exit code:

$ docker run --rm -it -v /dev:/dev --device /dev/ttyUSB0 alpine "/usr/bin/head" -n 0 /dev/ttyUSB0
$ echo $?
0
$ docker run --rm -it -v /dev:/dev alpine "/usr/bin/head" -n 0 /dev/ttyUSB0
head: /dev/ttyUSB0: Operation not permitted
$ echo $?
1

strace shows its not really reading with -n 0

openat(AT_FDCWD, "/dev/ttyUSB0", O_RDONLY) = 3
close(3)                                = 0

jens-maus added a commit to jens-maus/RaspberryMatic that referenced this issue Sep 19, 2022
"head -c0" which should help to deal way better with the dynamic device
generation process, especially in case of using HomeAssistantOS as the
based OS. This refs home-assistant/operating-system#2133 (comment)
@jens-maus
Copy link
Contributor

Right, it needs something which calls open() on the device file. head -n 0 seams to work without reading, and has a proper exit code:

Thanks for this great hint. Funny enough, I was already working/evaluating using head -c0 at the time you are bringing it up here as well. This, together with my own tests convinced me that rewriting the static sleep with for-loops using head -c0 as device permission access checks is a proper approach. Thus, this resulted in jens-maus/RaspberryMatic@213b049 which seem to work fine now. So thanks again @agners for your hint!

@agners
Copy link
Member Author

agners commented Oct 20, 2022

Closing this issue as the main problem of the race condition is addressed by home-assistant/supervisor#3868.

There is still a problem with the hot-plug functionality. Unfortunately the new runc device permission updates essentially does not allow to set arbitrary <major>:* permission entries at startup. We can work around this limitation in Supervisor as well. I created an issue for that so it can be addressed: home-assistant/supervisor#3960.

Closing this issue as there is nothing to be done on OS side for now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants