Reduce AGNOS system image size to <1GB #225
Comments
I don't know that I've ever seen quite such a sprawling mess, and I've been doing embedded for a long time. SELinux, PAM, source code all in an image?
@Doncuppjr Thanks for the feedback! And I'm not being sarcastic. I had close to zero experience with embedded myself when I contributed. If you're interested, you can take this bounty or even outline a plan to clean up the "mess". The plan can then be broken into bounty-sized tasks. Also, comma.ai is hiring a Systems Engineer. Your embedded experience could be valuable, so feel free to check out the opportunities at comma.ai/jobs. The easiest way to get a job is doing one of the bounties.
I am on the hunt for a new gig, but this is a big thing. Your kernel compile is fast and looks fairly optimized, but I wonder why you still have selinux enabled. 2+ hours to make an image is really hamstringing this project. Iteration is speed. The previous suggestion about using |
QEMU? Jeeez, there is like an entire development environment in this thing. |
@Doncuppjr Thanks for your feedback. There are definitely things that can be trimmed, that's what this issue is for. That said, on-device development is very important to the project, and so is commonality with Ubuntu environments. Iteration speed is good, but to the extent tradeoffs are made, they're made to favor speed for openpilot over AGNOS, which doesn't need to change very often. There's definitely ground to gain, things that should be stripped out (no idea why QEMU is on there) but we'll never run on busybox (just as an example). Since you find yourself on the hunt for a new gig anyway, and you clearly have some experience in the subject, specific actionable PRs would be welcomed. |
Interesting that you mentioned busybox. If speed is your goal, I wouldn't count that one out. It can very much be tuned for speed, especially when linked against glibc rather than musl. Still, it is often necessary to have full-size utilities for some actions, and busybox is certainly not the only way to do embedded, but it is very convenient. What it really looks like happened here is that someone took the package set for a development workstation and simply pushed it out as the Comma OS. There is all kinds of desktop stuff in there, from window managers to qemu, includes for compiling, and even man pages. You've got semanage on there, but you're missing the other parts of SELinux, and of course there's the package manager and its databases and caches. A better approach is something dependency-driven. You add what you know you need, and then support it with its dependencies (not necessarily what a package manager says are dependencies). It's a different way of looking at OS image building, but the result is something much smaller and faster. It's very possible to use Ubuntu binaries, and it's a lot less effort to use precompiled stuff than to compile it yourself.
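To make the dependency-driven idea concrete, here is a minimal sketch in Python. It assumes you maintain your own curated dependency map; the package names and the `DEPS` table below are hypothetical and are not taken from AGNOS. Starting from what you know you need, it walks the map and collects only what is reachable.

```python
from collections import deque

def dependency_closure(seeds, deps):
    """Return every package reachable from `seeds` through the `deps` map."""
    needed = set()
    queue = deque(seeds)
    while queue:
        pkg = queue.popleft()
        if pkg in needed:
            continue  # already visited
        needed.add(pkg)
        queue.extend(deps.get(pkg, ()))
    return needed

# Hypothetical, hand-curated dependency map (illustrative only, not AGNOS data).
DEPS = {
    "openpilot": ["python3", "libc"],
    "python3": ["libc", "libssl"],
    "libssl": ["libc"],
    "qemu": ["libc", "libglib"],
    "window-manager": ["libx11", "libc"],
}

image = dependency_closure(["openpilot"], DEPS)
print(sorted(image))  # ['libc', 'libssl', 'openpilot', 'python3']
```

Anything not reachable from the seed set (here `qemu` and `window-manager`) simply never lands in the image, which is the opposite of starting from a full workstation package set and picking it apart.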
Some of that is required or beneficial, some it can go (qemu), some of it doesn't make sense. Are you actually looking at an AGNOS device?
This was actually done, back in the day, although it's been a few years and could benefit from another look. This is another matter where PRs with specific, measurable, tested gains would be welcome. |
Yes, I have a Comma 3 whose display went black a few days ago, so it's had me hopping around the ecosystem trying to figure out if it's just the display and/or the logic board. I actually downloaded the AGNOS builder and made an image. 2 hours is not good. I am running Linux in a VM, but I gave it 16 cores and 32 GB of RAM. An optimized approach should have popped it out in about 10-15 minutes without caches. Cached, it should be 1-2 minutes max.
You could also generate multiple images, which is common. Have a dev image with lots of utilities, and a prod version that is just what's needed to perform the necessary functions that an end user needs. |
I hope you get back up and running soon!
Faster build times are always nice, but the current build times in CI are much more reasonable. The task in this issue is to shrink the image, not necessarily speed up building, though a smaller image is likely to build faster. The root problem being solved is to deliver updates to end users quickly and cheaply, over extreme-range Wi-Fi and metered LTE. The path we take there is at the discretion of whoever steps up and does the work. A fresh build-up rather than a pick-apart is a perfectly valid direction. |
Tasks to optimize the OS
|
Those are the broad strokes. In all likelihood, there are services being started that will muddy the waters about what is actually needed. There are also likely files that don't get touched for a person in one locale but will be needed by a person in another locale.
I think it's also worth noting that using overlayfs with squashfs is not really required. The overlay allows the filesystem to appear read-write. That is convenient, but not required. If you know the locations where things are actually going to be written, then you don't have to make the whole filesystem read-write; you can leave it read-only and just mount tmpfs at the write locations.
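As a rough illustration of the tmpfs approach, the Python sketch below generates fstab entries for a set of writable paths on an otherwise read-only root. The paths and the `64m` size are assumptions chosen for the example, not the locations AGNOS actually writes to.

```python
def tmpfs_fstab_lines(write_paths, size="64m"):
    """Build fstab lines that mount a tmpfs at each known write location."""
    lines = ["# root stays read-only; only these paths are writable (and volatile)"]
    for path in sorted(set(write_paths)):
        # Standard fstab fields: device, mount point, type, options, dump, pass
        lines.append(f"tmpfs {path} tmpfs rw,nosuid,nodev,size={size} 0 0")
    return lines

# Hypothetical write locations; the real list must come from observing the system.
for line in tmpfs_fstab_lines(["/tmp", "/var/log", "/run"]):
    print(line)
```

Note that tmpfs contents live in memory and are lost on reboot, so anything that must persist would still need its own writable partition.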
If that's the goal, I'd focus on differential updates (aka deltas) instead of system image size for two reasons:
|
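To make the idea of a delta concrete, here is a deliberately naive Python sketch that compares two images in fixed-size blocks and reports which blocks would need to be sent. The block size and the in-memory "images" are assumptions for illustration; this is not how AGNOS updates work, and real delta tools also handle inserted or shifted data, which this sketch does not.

```python
import hashlib

def changed_blocks(old, new, block_size=4096):
    """Return indices of fixed-size blocks in `new` that differ from `old`."""
    changed = []
    n_blocks = (len(new) + block_size - 1) // block_size  # round up
    for i in range(n_blocks):
        start = i * block_size
        old_blk = old[start:start + block_size]
        new_blk = new[start:start + block_size]
        # Compare by hash, as a client would when it only has a block manifest
        if hashlib.sha256(old_blk).digest() != hashlib.sha256(new_blk).digest():
            changed.append(i)
    return changed

# Toy images: eight zeroed 4 KiB blocks, with one byte flipped in block 3.
old_image = bytes(4096 * 8)
patched = bytearray(old_image)
patched[4096 * 3 + 10] = 0xFF
new_image = bytes(patched)

delta = changed_blocks(old_image, new_image)
print(delta, f"{len(delta) * 4096} of {len(new_image)} bytes to send")
```

With a small change between releases, only the touched blocks travel over the network, regardless of how large the full image is.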
Reopening since SquashFS didn't work out |
True, but I inadvertently left out another requirement: reasonable download and transfer times for a complete device reimage over USB. For better or worse, the maintainers of various openpilot forks have been known to install or change things that require a full wipe to get back to stock condition. |
I see. For that problem, I'm curious why agnos doesn't simply prevent application-level code from borking its recovery partition (or the system partition for that matter). Not only would that improve the USB-recovery experience (zero download), but even better it would mean far fewer users ever even need USB-recovery to begin with. Move users to the happy path > improve the unhappy path. I don't know everything about agnos, but there's no shortage of ways to accomplish this on linux in general. |
Disclaimer: I don't speak for comma. There is already mild protection insofar as the relevant partitions are mounted readonly. Remounting is trivial, but it does take explicit intent. There are many other measures that could be taken, using inherently fixed images like SquashFS with an optional overlay, or even revoking root access and using a signed trusted boot chain like Android, but at high cost and limited benefit. For complex reasons, the ability to change the system image in-situ is probably 50::50 bug::feature. The engineering resources needed to change that don't directly advance the task of solving self-driving cars, and we need reliable flash-to-stock in the field anyway. |
Convenience things, like valgrind, should be the last to go. First up is anything we're just wasting space on, like package caches or AGNOS build-time dependencies. If we have to start cutting these things, I think we'll want to set up an "AGNOS dev" package that will install all those extras together.
scripts/check_space.sh is a nice tool for debugging this and will allow you to poke around like this:
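The output of the script is not reproduced in this thread. As a separate, stand-alone illustration of the same kind of poking around, the Python sketch below sums file sizes under each top-level directory of a given root. It is not the actual check_space.sh, and the function name `dir_sizes` is hypothetical.

```python
import os

def dir_sizes(root):
    """Return (directory name, total bytes) pairs under `root`, largest first."""
    sizes = {}
    for entry in os.scandir(root):
        if not entry.is_dir(follow_symlinks=False):
            continue
        total = 0
        for dirpath, _dirnames, filenames in os.walk(entry.path):
            for name in filenames:
                path = os.path.join(dirpath, name)
                if os.path.islink(path):
                    continue  # don't count symlink targets
                try:
                    total += os.path.getsize(path)
                except OSError:
                    pass  # file vanished or is unreadable; skip it
        sizes[entry.name] = total
    return sorted(sizes.items(), key=lambda kv: kv[1], reverse=True)
```

Pointing `dir_sizes` at a mounted system image would give a quick ranking of where the space goes. It reports apparent file sizes, so hard links are counted more than once and the totals can differ from what `du` reports.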