Roadmap
Deadlocks here, deadlocks there. And a lot of bloat.
PostgreSQL uses an MVCC design. Every time something in a row changes, it writes a new copy of the whole row with the change applied and marks the old row version for cleanup by (auto)vacuum.
Our current design looks like this:
mb=# \d filearr
Table "public.filearr"
Column | Type | Collation | Nullable | Default
---------+------------------------+-----------+----------+-------------------------------------
id | integer | | not null | nextval('filearr_id_seq'::regclass)
path | character varying(512) | | not null |
mirrors | smallint[] | | |
Indexes:
"filearr_pkey" PRIMARY KEY, btree (id)
"filearr_path_key" UNIQUE CONSTRAINT, btree (path)
Referenced by:
TABLE "hash" CONSTRAINT "hash_file_id_fkey" FOREIGN KEY (file_id) REFERENCES filearr(id)
mb=# \d hash
Table "public.hash"
Column | Type | Collation | Nullable | Default
---------------+----------------------+-----------+----------+---------
file_id | integer | | not null |
mtime | integer | | not null |
size | bigint | | not null |
md5 | bytea | | not null |
sha1 | bytea | | not null |
sha256 | bytea | | not null |
sha1piecesize | integer | | not null |
sha1pieces | bytea | | not null |
btih | bytea | | not null |
pgp | text | | not null |
zblocksize | smallint | | not null |
zhashlens | character varying(8) | | |
zsums | bytea | | not null |
Indexes:
"hash_pkey" PRIMARY KEY, btree (file_id)
Foreign-key constraints:
"hash_file_id_fkey" FOREIGN KEY (file_id) REFERENCES filearr(id)
mb=#
Every time a value is added to the mirrors column, PostgreSQL copies the whole row (plus index data, I guess). On top of that, whenever we scan files for a mirror we create a temporary table with all the files on this mirror. This can be limited to just the files in the subdirectory that we scan. Once the scan is done, it needs to merge its changes back into the filearr table. For this it tries to acquire a full table lock. With a sufficiently high number of scans running in parallel, we see a lot of "waiting for lock" for the scans. While the table is fully locked, autovacuum is blocked as well. That's the main reason why we see a lot of bloat warnings for this table in monitoring.
It would be nice to find a way to make scanning lock free. One possible option might be:
Move the path column from filearr to the hash table and rename that table to files.
Then add a 2nd table which just has the file_id and the server_id.
And maybe a flag to mark the tuple for deleting during scan.
The new files table would only be written once or twice, depending on how we seed the entries for new files (see below on event based processing).
Otherwise it would be static.
Working with that int/int mapping table should be doable 100% lock free or at most with rowlocks.
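To make the proposal a bit more concrete, here is a minimal sketch of the two-table layout and a mark-and-sweep scan that only touches the mapping rows of one mirror. It uses Python's built-in sqlite3 as a stand-in for PostgreSQL so that it runs anywhere. The table name file_server, the column name stale, and the function scan_mirror are assumptions made up for illustration, and `INSERT OR REPLACE` is SQLite syntax, not what we would use in production.

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL. Only file_id, server_id and
# path come from the proposal; every other name here is hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE files (
    id   INTEGER PRIMARY KEY,
    path TEXT NOT NULL UNIQUE
    -- the hash columns (mtime, size, md5, ...) would live here as well
);
CREATE TABLE file_server (
    file_id   INTEGER NOT NULL REFERENCES files(id),
    server_id INTEGER NOT NULL,
    stale     INTEGER NOT NULL DEFAULT 0,  -- "mark for deletion during scan"
    PRIMARY KEY (file_id, server_id)
);
""")


def scan_mirror(conn, server_id, seen_paths):
    """Mark-and-sweep over the mapping rows of a single mirror."""
    with conn:  # one transaction per scan, no table-level lock requested
        # 1. Mark all current mappings of this mirror as deletion candidates.
        conn.execute(
            "UPDATE file_server SET stale = 1 WHERE server_id = ?",
            (server_id,),
        )
        # 2. Confirm (or create) the mapping for every file the scan saw.
        for path in seen_paths:
            row = conn.execute(
                "SELECT id FROM files WHERE path = ?", (path,)
            ).fetchone()
            if row is None:
                continue  # file unknown to the files table, skip it
            conn.execute(
                "INSERT OR REPLACE INTO file_server (file_id, server_id, stale) "
                "VALUES (?, ?, 0)",
                (row[0], server_id),
            )
        # 3. Sweep the mappings that were not confirmed by this scan.
        conn.execute(
            "DELETE FROM file_server WHERE server_id = ? AND stale = 1",
            (server_id,),
        )
```

The files table is never written by the scan. All writes are confined to rows with the scanned server_id, which is what should let PostgreSQL get by with row locks.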
We packaged hypopg to test out which indexes we would need.
... or we should really work event based.
Right now we have cron jobs that run over our full 16TB tree. Those cron jobs take ages and cause a lot of I/O. This leads to some fun consequences:
- There can be a really long delay between "new files on mirrors" and "we finally have those hashes and all that in the DB"
- Even though we have some optimizations, such as "only calculate checksums for files with a changed mtime", this still checks a lot of files without need.
- We still write the hash files to disk and to the DB.
But why?
We already have an event based system in place for doing things after publishing. (Hello repopusher! We will get to you in more depth in a bit.) Maybe we can just hook up the whole "scan this subdirectory and do the initial DB entries" as a kind of repopusher style job.
We can also try to reuse the repodata files. They already contain the file list and checksums.
For this we might need to adapt the whole repopusher framework. Rudi has a cleaned up and more feature rich version of it internally which we could use instead of the current code. But maybe we could make the whole system more generalized.
The idea would be to have a system like RQ (Python) or Sidekiq (Ruby) (they are compatible with each other). We take the initial job and multiplex it into different jobs based on a config for the subdirectory.
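A minimal sketch of that multiplexing step, independent of any particular queue library. The config layout, the path prefixes, and the job names are all hypothetical. A real version would enqueue the resulting jobs instead of returning them.

```python
from dataclasses import dataclass

# Hypothetical per-subdirectory config: which jobs one incoming
# "something was published" event fans out into.
JOB_CONFIG = {
    "repositories/": ["seed_db", "hash_tree", "push_to_mirrors"],
    "distribution/": ["seed_db", "hash_tree"],
}


@dataclass(frozen=True)
class Job:
    kind: str  # which worker should pick this up
    path: str  # subdirectory the job applies to


def multiplex(path, config=JOB_CONFIG):
    """Turn one initial job for `path` into the jobs configured for it."""
    # The longest matching prefix wins, so a nested subdirectory can
    # override the settings of its parent.
    matches = [prefix for prefix in config if path.startswith(prefix)]
    if not matches:
        return []
    prefix = max(matches, key=len)
    return [Job(kind, path) for kind in config[prefix]]
```

For example, `multiplex("repositories/home:foo/")` yields three jobs with this config, while a path outside the configured prefixes yields none.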
This whole concept could then also be integrated in some kind of 2 stage publishing:
- first stage: publishing everything but the repomd.xml
- once we have the repository pushed to N mirrors
- tell backend to publish the repomd.xml file
If we don't want to have the state handling on the backend, we can have the backend publish into a staging area, and the job queue is responsible for filling the production tree from there:
- schedule job for initial DB entries (saves us from having to use UPSERT)
- schedule job to sync the first stage to the live tree (hardlink sync)
- schedule job for hashing of the subdirectory (death to all cronjobs!)
- schedule job in a per mirror queue for the repopusher
- if we have enough mirrors seeded, publish changed repomd.xml
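The last step, "publish once enough mirrors are seeded", could be tracked with very little state. A minimal sketch, assuming each per mirror repopusher job reports back when its first-stage push has finished. The class and method names are made up, and the threshold is a policy choice that the text leaves open as N.

```python
class TwoStagePublish:
    """Track first-stage pushes for one repository and signal when the
    changed repomd.xml may be published."""

    def __init__(self, min_mirrors):
        self.min_mirrors = min_mirrors  # the "N" from the text
        self.seeded = set()             # mirrors that have the new files
        self.published = False

    def mirror_seeded(self, mirror_id):
        """Record a finished first-stage push.

        Returns True exactly once: on the call that reaches the
        threshold, so the caller knows to publish repomd.xml now.
        """
        self.seeded.add(mirror_id)
        if not self.published and len(self.seeded) >= self.min_mirrors:
            self.published = True
            return True
        return False
```

Using a set means a mirror that reports twice is only counted once.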
If this concept is applied not only to repositories but to all dynamic trees, we would probably see a much improved user experience.
What happens right now is:
- updates get published
- packagekit and friends see new update and start downloading
- no mirrors yet so the fallback mirror will be hammered.
- first mirror comes in and gets hammered now (usually widehat)
- load starts to even out as more mirrors pick up the files and are seen by the scanner.
Right now we still perform protocol downgrades (HTTPS -> HTTP), which is bad for security.
We already have a lot of metadata in our DB. What is missing so that mod_autoindex_mb could work completely without a file tree? Currently there is the trick with 0 byte files, but this seems ugly. If we make the fallback mirror handling in the apache config nicer, we could make mod_autoindex_mb work entirely from the DB. That way we could e.g. deploy multiple redirector nodes in multiple data centers without each node requiring 16+TB of disk space. With the current DB sizes we are talking about roughly 100-150GB (with space for growth) for a mirrorbrain instance. Each node would have a replica of the main PostgreSQL DB.
Right now we have 3 parts accessing the DB: "mb", "scanner.pl", and "mod_mirrorbrain". At least the first 2 could be unified in one language. Maybe we could even move the whole "mod_mirrorbrain" into a small app server and then have one ORM for all 3. Long term it doesn't even have to be Python anymore.
Unless ...
This is what Fedora is using. What I have learned about it so far:
- They lack HTTPS/IPv6 tracking and some other nice features we already have
- Their DB schema is slightly more complex than ours
- Their mirroring works differently from ours.
dnf asks the mirror master only for the repomd.xml file. The server always replies with a metalink file, and the package manager then remembers that file and uses the mirrors directly. This also means their DB only tracks repomd.xml files. The metalink file has the checksum of the repomd.xml file, so dnf can verify it got an up-to-date copy. The rest of the trust then relies on signed RPMs and maybe a signed repomd.xml (need to verify).
Even if we wanted to switch to their model, we would probably still have to support the old model.
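The verification step boils down to comparing a digest. A minimal sketch of the idea, not dnf's actual code: the function name is made up, and parsing the metalink XML to get the expected checksum is left out.

```python
import hashlib


def repomd_is_current(repomd_bytes, expected_hex, algorithm="sha256"):
    """Check a repomd.xml fetched from a mirror against the checksum
    that the metalink file from the mirror master announced."""
    digest = hashlib.new(algorithm, repomd_bytes).hexdigest()
    # Hex digests are case-insensitive, so normalize before comparing.
    return digest == expected_hex.strip().lower()
```

A mirror that still serves an older repomd.xml fails this check, and the client can move on to the next mirror in the list.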
A potential replacement for mirrorpinky, which we never finished or deployed. It would give our mirror admins self-management capabilities.
Normally mirrorbrain requires a matching directory layout underneath the base dir of a mirror. Rudi asked for support for serving a subtree of the main mirror on a different vhost. One way to do this might be:
ServerName updatesonly
MirrorBrainStripPrefix /updates
When querying the database and redirecting we would need to prepend the prefix again.
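The path mapping itself is small. A sketch in Python of what the module would have to do, assuming MirrorBrainStripPrefix (a proposed directive, it does not exist yet) carries the prefix and that paths are handled with a leading slash. The function names are hypothetical.

```python
def to_tree_path(request_path, prefix):
    """Path as requested on the subtree vhost -> path in the main tree.

    This is the form needed for the DB query and for the redirect URL.
    """
    prefix = "/" + prefix.strip("/")  # normalize "updates/" to "/updates"
    return prefix.rstrip("/") + "/" + request_path.lstrip("/")


def to_vhost_path(tree_path, prefix):
    """Path in the main tree -> path as seen on the subtree vhost.

    Returns None if the path is outside the subtree the vhost serves.
    """
    prefix = "/" + prefix.strip("/")
    if prefix == "/":
        return tree_path
    if tree_path == prefix or tree_path.startswith(prefix + "/"):
        return tree_path[len(prefix):] or "/"
    return None
```

With the config above, a request for /15.0/foo.rpm on the updatesonly vhost would be looked up and redirected as /updates/15.0/foo.rpm.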