Skip to content
This repository has been archived by the owner on Oct 27, 2019. It is now read-only.

DB schema tracking bug #44

Open
eksopl opened this issue Apr 30, 2012 · 7 comments
Open

DB schema tracking bug #44

eksopl opened this issue Apr 30, 2012 · 7 comments

Comments

@eksopl
Copy link
Owner

eksopl commented Apr 30, 2012

Tagging @woxxy, @oohnoitz.

I believe this is what we decided to support, at least for now.

  • What indexes can I drop for MySQL+Sphinx and MySQL+FT? There's a lot of unnecessary indexes there for MySQL+Sphinx, but I don't know what you guys need on your end. Same for MySQL+FT, but that one actually uses most of the indexes. I want to separate indexes that are only needed for MySQL+FT from the ones that are needed by both.

  • I believe we agreed that we would drop support for MyISAM on the main table. Fuuka, Asagi and FoolFuuka all need changes in order to support the MySQL InnoDB+MyISAM FT search scheme, for non-Sphinx environments. You guys good with that?

    MySQL+FT schema. This table would not be created for MySQL+Sphinx environments. I'm guessing:
    a_search: doc_id | title | comment

    Email, username, tripcode and can all be handled by MySQL exact matches with regular MySQL btree indexes. I'm offering filename search on Fuuka, and I can't decide if I should go for exact matches or fulltext matches on that. If the latter, a_search on MySQL+FT would need to also have the media field there. What do you guys think?

  • Field names for thumb / images are driving me up the wall. These names are all wrong, they've always been wrong and I keep making mistakes in the code because the names are so incoherent with each other. I'm not entirely sold on media / preview / orig_filename. With the image deduplication scheme, preview should actually be preview_orig and orig_filename kinda makes sense as media_orig. (media should matchpreview. The fact that media_filename is the field that actually matches with preview and media is the media filename has always been a great source of error for me). Then on the images table, media_filename should be media and preview_op and preview_reply are just dandy. I'm sorry for keeping on changing the names on you guys, but this really is the last one that makes sense, don't you think?

    media -> name of full image on 4chan
    preview -> name of thumb on 4chan
    preview_op -> name of OP thumb on 4chan that got saved locally
    preview_reply -> name of reply thumb on 4chan that got saved locally
    media_filename -> filename of image in the hard drive of the user for that post (the filename that shows up in the post, ie: the ONE FIELD that is deserving of having filename in its name)

    So in the main table, we'd have: media_orig | preview_orig | media_filename
    And in the _images table: media_id | media_hash | media | preview_op | preview_reply

    This makes a SHITLOAD more sense.

    I know this is confusing because I'm switching the names of media with media_filename, but the thing is, THOSE NAMES HAVE ALWAYS BEEN SWAPPED, and I keep getting so fucking confused because of it. I really, really want to settle this once and for all, and since we're making DB changes, hey, perfect moment.

    Do you guys want to kill me for this? You can link to this page the next time you have to go down for DB changes and a thread inevitably pops up on /jp/. Add that "Eksopl is the one pushing changes after changes of time consuming database because he wants to give the impression that foolz is always" to fuel the conspiracies, if you want.

  • What name did we settle on for 2ch-like IDs? poster_hash? I believe we were avoiding _id so it doesn't get mixed up with actual ids we use in the database.

  • On that note, we were going to rename the id field to something, right? What was it? poster_ip?

  • We agreed that, at least for now, we're going to make it so the auxiliary tables can be fully recreated from the main table, so fields like height/width that could be moved to the _images table aren't going to be moved at this time (in case of a botched up migration, an archive owner can just drop aux tables and run scripts to recreate them). We're adding poster_hash, I can't say I care about locked, because it's only ever paired with stickies, EXIF can wait because no one is archiving /p/ and deleted media info, while interesting, can also wait until a later date, when not so much is happening at once. Ttimestamp of thread expiration sounds neat (I lifted that idea after looking at the schema of nih). However, I don't want to litter the main table with a field that's only going to be used when parent = 0. Should we wait on that until a later date so we can put it on the _threads table?

  • You can pull request the migration script you guys made once you're ready to roll. As I posted in roadmap, I won't merge until htmlnew is rolled out, though. We also need scripts to drop indexes on Sphinx and create the extra table on MyISAM FT, but I can do those.

Anything else?

@oohnoitz
Copy link
Contributor

This is all of the input I have on the matter. You should wait for @woxxy or anyone else to comment on it as well. Also, it might be a good idea to list a few drop-in replacements for MySQL.

  • These are the indexes required for the entire inserting/updating/deleting/generating processes of both posts and auxiliary tables. It also includes the changes made in Asagi to prevent duplicate media. However, we should run a few more tests to determine if we can remove any additional fields that won't affect the insert time. A few of these indexes are listed because of the changes done in r142 to generate our auxiliary tables. These can be dropped at a later time and will be marked with an asterisk.

    MySQL = doc_id_index (PRIMARY), num_subnum_index, num_index, subnum_index, parent_index, timestamp_index, trip_index*, email_index*, media_id_index, media_hash_index*
    MySQL+FT = name_index, fullname_index, media_filename_index (optional)
    MySQL+Sphinx = no additional indexes required

    The media_filename_index would only be there in case you wish to support redirecting 404 image links to the archive. It isn't really required at all.

  • I do think that was the general consensus due to row-level-locking, crash recovery and a few other features. It was also due to the fact that table-level-locking caused some issues with slow queries hanging fuuka/foolfuuka. Although, this was mainly due to the reports daemon and media_hash searches. (We fixed the media_hash issue with the use of media_id though.)

    Anyway, I would include the media field in the MySQL+FT database schema. This will help those that actually know the entire filename or partially know the filename. You might as well provide support for the entire user-base.

  • The entire media field names have been confusing for all of us. It really needs to be made more consistent and clear so everyone knows what each field holds. Anyway, I would have done the same changes as you have listed.

  • For all poster related information, I would go with poster_ip and poster_hash as you have stated. It really is best that we avoid naming any field with the _id suffix or any type of combination unless it references another table.

  • Yes, the ability to regenerate all auxiliary tables is nice. However, we should also be thinking of removing the duplicate data as much as possible to minimize the size of our databases. Until we have a completely stable database schema that won't change for a long time, removal of duplicate data can wait. If anything, we should also create a script that could regenerate the main table from the auxiliary tables as well.

@eksopl
Copy link
Owner Author

eksopl commented May 1, 2012

I just remembered another thing.

I hate, hate, HATE the fact that the one-to-many association between board_threads and board is broken, because parent in board can be 0. Ideally, parent would change to thread_num or something in both tables (or the name can keep being parent, whatevs), and then we'd have another field for op. This would allow joining board_threads with board in a much better way than the current board.num = board_threads.parent or board.parent = board_threads.parent join condition. MySQL does stupid things when you ask it to make this join.

However, we use parent=0 everyfuckingwhere, including Sphinx index definitions. How do you guys want to solve this?

@woxxy
Copy link

woxxy commented May 1, 2012

Can do, let's also have superfun changing all ->parent to ->thread_num.

The rest is what we agreed on plus a few columns' name changed. The issue is that we'll want to change the names of the indexes to match the new ones, which I am already doing in the migration. Foolz on Percona now so that doesn't need an index rebuilding anymore (yes, I switched to Percona because of this!).

@woxxy
Copy link

woxxy commented May 9, 2012

Onto fulltext search, the _search table:

CREATE TABLE IF NOT EXISTS `%%BOARD%%_search` (
    doc_id int unsigned NOT NULL auto_increment, 
    num int unsigned NOT NULL, 
    subnum int unsigned NOT NULL, 
    thread_num int unsigned NOT NULL DEFAULT '0', 
    media_filename text, 
    comment text,
    PRIMARY KEY (doc_id),

    INDEX num_index (num),
    INDEX subnum_index (subnum),
    INDEX thread_num_index (thread_num),
    FULLTEXT media_filename_fulltext(media_filename),
    FULLTEXT comment_fulltext(comment)
) engine=MyISAM CHARSET=%%CHARSET%%;

If this will use triggers, please let's add a MySQL session variable so we can turn it off. Probably it would be necessary anyway since not everyone will be using the _search table.

@oohnoitz
Copy link
Contributor

Since I was running mytop and innotop for the past few days, it seems that MySQL has some issues with long sorting time due to filesort. MySQL uses leftmost index prefixes and should be abused. Anyway, take a look at the new index schema below and let me know what you think since I'm not a DBA.

PRIMARY* (doc_id)
thread_num_subnum_index (thread_num, num, subnum)
num_subnum_index* (num, subnum)
subnum_index (subnum)
op_index (op)
media_id_index (media_id)
media_hash_index (media_hash)
media_orig_index (media_orig)
name_trip_index (name, trip)
trip_index (trip)
email_index (email)
poster_ip_index (poster_ip)
timestamp_index (timestamp)
timestamp_expired_index (timestamp_expired)

* = UNIQUE

@eksopl
Copy link
Owner Author

eksopl commented May 18, 2012

Thanks for this, index disorganization has been bothering me too.

I'm guessing trip and email can be dropped for people running Sphinx, correct? Also, what do you need the index in expired for, by the way?

@oohnoitz
Copy link
Contributor

Yea, it is a bit annoying when I look at the indexes. I thought that having index_ prefixed would be better instead of as a suffix to make it a bit more readable, since you can ignore the prefix on all index instead. It is just a preference thing, but we should try to remain consistent.

The name, trip and email indexes would be useful for regenerating the auxiliary tables. However, it could be dropped for Sphinx and they don't plan on regenerating those tables. Also, not sure what we need the timestamp_expired_index for, @woxxy had it in the indexes when I was looking at them so I included it.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

3 participants