clee edited this page Sep 12, 2010 · 10 revisions

Why are you doing this?

Now that is a good question, and one I ask myself once in a while. Let’s just say that yum and RPM are very entrenched in the distributions that ship them, and there is no denying that replacing RPM with all its quirks is a daunting task. Even so, I think it is worthwhile. The RPM code is notoriously horrible and unmaintainable; it is slow (try `rpm -qa` to list all packages and ask yourself if it really should take that long); it corrupts its own database; its API is obfuscated; and, oh, it uses Berkeley DB. On top of that we have yum, which uses a different database (SQLite) to represent roughly the same data as RPM, is written in Python, and is also quite slow. On my older systems, I can’t run a full yum update and have to resort to updating manually in smaller steps (say, first all packages starting with ‘a’, and so on), and even then it often breaks down halfway through installation and leaves me with two versions of every RPM. The fact that we now have tools to clean up broken yum “transactions” is a testament to the failure of this tool chain.

Is razor really replacing rpm?

Yes and no. The RPM package format is pretty solid and extensible, and razor doesn’t change that. The RPM spec file format, the build process, and so on are also outside the scope of the razor project. What razor does replace is what you could call the RPM runtime: the database of packages on your system, the installation process, and the query functionality.

The motivation for extending the scope to include the RPM runtime is that a lot of the problems people attribute to yum are actually deficiencies in the current (non-razor) RPM runtime. Slow installation, corrupted package databases, and inconsistent dependency-solving rules between RPM and yum are all issues with the RPM runtime.

Why not use a standard database library for the storage backend?

It really comes down to what you need from the storage backend: what types of modifications and what kinds of queries you expect to run against the database. And it turns out that for maintaining a list of packages with their metadata, you don’t need a lot of complexity. The only queries we need are essentially getting a sorted list of packages and dependencies, but those have to run in linear time with minimal overhead. We don’t need complicated locking or dynamic insertion, since a database is immutable. To do a transaction, we just batch up all modifications and build a new database when we’re done.
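The batch-and-rebuild idea above can be sketched in a few lines. This is only an illustration under stated assumptions, not razor's actual code: the `Package` type and the `apply_transaction` function are hypothetical names, and the "database" here is just a sorted in-memory tuple. Only the small batch of pending installs is sorted; the existing database is merged in a single linear pass and is never modified.

```python
from typing import Iterable, NamedTuple, Tuple


class Package(NamedTuple):
    # Hypothetical record; a real database stores far more metadata.
    name: str
    version: str


def apply_transaction(
    db: Tuple[Package, ...],
    install: Iterable[Package] = (),
    remove: Iterable[str] = (),
) -> Tuple[Package, ...]:
    """Build a new sorted database; `db` itself is left untouched."""
    removed = set(remove)
    pending = sorted(install)  # only the batch is sorted
    installed_names = {p.name for p in pending}
    out = []
    i = 0
    for pkg in db:  # single pass over the already sorted database
        # Emit pending packages that sort before the current one.
        while i < len(pending) and pending[i].name < pkg.name:
            out.append(pending[i])
            i += 1
        if pkg.name in removed or pkg.name in installed_names:
            continue  # removed, or superseded by an upgrade
        out.append(pkg)
    out.extend(pending[i:])  # whatever sorts after the last old package
    return tuple(out)


old_db = (Package("bash", "4.1"), Package("glibc", "2.12"), Package("zlib", "1.2.3"))
new_db = apply_transaction(
    old_db,
    install=[Package("glibc", "2.13"), Package("curl", "7.21")],
    remove=["zlib"],
)
for pkg in new_db:
    print(pkg.name, pkg.version)
```

Because the old database is never touched, a failed transaction simply means the new one is thrown away, and readers never see a half-updated state.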

Another issue is that since the file format is part of the standard, in the same way that the RPM file format is, we need to control the file format to prevent accidental breakage when a database library changes its on-disk format (which seems to happen with every Berkeley DB release).

Finally, for simple operations such as listing all packages (useful for tab completion, among other things), nothing beats the speed of just mmap()ing the file and immediately iterating through the list of packages – no setup required.
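To make the mmap() point concrete, here is a minimal sketch. The on-disk layout is invented purely for illustration (a 4-byte package count followed by fixed-size, NUL-padded name and version records) and is not razor's real file format; `write_db` and `list_packages` are hypothetical helpers.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical layout, NOT razor's format: a little-endian package count,
# then fixed-size records holding a NUL-padded name and version.
HEADER = struct.Struct("<I")
RECORD = struct.Struct("<32s16s")


def write_db(path, packages):
    """Write (name, version) pairs to `path`, sorted by name."""
    with open(path, "wb") as f:
        f.write(HEADER.pack(len(packages)))
        for name, version in sorted(packages):
            f.write(RECORD.pack(name.encode(), version.encode()))


def list_packages(path):
    """Map the file read-only and walk the records in place."""
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        (count,) = HEADER.unpack_from(m, 0)
        names = []
        for i in range(count):
            name, _version = RECORD.unpack_from(m, HEADER.size + i * RECORD.size)
            names.append(name.rstrip(b"\0").decode())
        return names


path = os.path.join(tempfile.mkdtemp(), "packages.db")
write_db(path, [("zlib", "1.2.3"), ("bash", "4.1"), ("glibc", "2.12")])
print(list_packages(path))
```

There is no parsing step and no connection to open: the reader maps the file and reads records straight out of the page cache, which is what makes operations like tab completion cheap.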