Database size spirals exponentially #4
Comments
How many hashes have you inserted, and what number of hashes did you set it up for?
We're working with a set of 1M hashes, and the init is done accordingly.
@artfwo reports TreeDB is in fact faster, quite the opposite of what one would've expected (possibly due to reduced I/O from the smaller disk footprint?).
Ah. Of course we'll have more collisions with real hashes than with random ones. The database can be compacted with kchashmgr; there are probably lots of holes in it.
I agree that we should use TreeDB for now if that actually works, possibly with 1 kB or 2 kB alignment. The code actually used TreeDB originally, but that had more fragmentation than HashDB with random hashes... I think another reason was also to avoid the overhead of the balanced tree, but while that makes sense for well-distributed keys, for a realistic distribution it probably isn't much of a win.

The way this works is that each partition value is a key, and the value is a list of hashes that match that key. On random hashes the probability of collisions is very low, so you'd typically have a single hash in the list. On realistic hashes some patterns dominate, so you end up with a lot of hashes sharing the same partition values, and these lists get very long. The file size then grows badly, since each addition means a large list must be moved, and it's unlikely that there is a big enough hole for it somewhere. This can be improved by setting the alignment size to e.g. 2 kB when tuning the database, meaning that blocks don't have to be moved as often and it's more likely that an existing hole will fit them. The default alignment for HashDB is 8 bytes, while it is 256 bytes for TreeDB, which might explain the improved behaviour.

The other mistuning in the current code is that it lets each hash bucket contain a linear list of keys instead of a BTree. This was for the misplaced reason of getting a smaller database, but of course it means that lookup times are much worse when you have a lot of collisions. See "Tuning the file hash/tree database" at http://fallabs.com/kyotocabinet/spex.html for details.

A bigger improvement would be to add multiple levels of data, much like the classic Unix filesystems work. I.e. you start by adding data items to the first level until you have filled it. Then you instead create a new data level and just link to it from the first level; that level holds links to data levels with further keys, and so on. By setting the alignment to the same size as each data area you can completely avoid fragmentation as more and more hashes are added.
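For reference, a minimal sketch of that kind of tuning with the Kyoto Cabinet C++ API (the file name and the 2 kB figure are just illustrative, and the tuning calls have to happen before the database is opened):

```cpp
#include <kctreedb.h>
#include <iostream>

int main() {
    kyotocabinet::TreeDB db;

    // Alignment is given as a power of two: 2^11 = 2048 bytes, so growing
    // per-partition hash lists are moved far less often and freed blocks are
    // more likely to be reusable. (HashDB defaults to 2^3 = 8 bytes and
    // TreeDB to 2^8 = 256 bytes, as noted above.)
    db.tune_alignment(11);

    // The "linear list instead of a BTree" mistuning mentioned above
    // corresponds, I believe, to passing the TLINEAR option to tune_options();
    // simply not setting it keeps the default tree-structured buckets.

    if (!db.open("hashes.kct",
                 kyotocabinet::TreeDB::OWRITER | kyotocabinet::TreeDB::OCREATE)) {
        std::cerr << "open error: " << db.error().name() << std::endl;
        return 1;
    }
    db.close();
    return 0;
}
```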
Question also posed to the tokyocabinet-users group at https://groups.google.com/forum/#!forum/tokyocabinet-users
This may or may not be relevant: some good results using Redis to store 300M records in 5 GB (uint32 values though, I guess): http://instagram-engineering.tumblr.com/post/12202313862/storing-hundreds-of-millions-of-simple-key-value-pairs
Redis keeps all data in memory, so it's not very suitable here.
I've done some experimentation using LevelDB to replace Kyoto Cabinet, and the results so far are quite promising. Code branch at https://github.com/commonsmachinery/hmsearch/tree/leveldb and pretty graphs at https://docs.google.com/spreadsheets/d/12SMJ2s6rKdykwHkztzJ4b1w3mcsJ0d2TkN5SIi00Kk4/edit#gid=0. I've tried this with the 5.5M hashes we have in our current database, and it worked like a charm, with a resulting database size of ~900 MB. I haven't done any extensive tests of lookup speed, but running a lookup on the 5.5M data set takes ~35-45 ms on my laptop.
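For what it's worth, a minimal sketch of one way a LevelDB backend can sidestep the long-value-list problem entirely: store each (partition value, hash) pair as its own key and scan by key prefix at lookup time. The key layout and names here are my own assumptions for illustration, not necessarily what the leveldb branch actually does:

```cpp
#include <leveldb/db.h>
#include <cassert>
#include <iostream>
#include <string>

int main() {
    leveldb::DB* db;
    leveldb::Options options;
    options.create_if_missing = true;
    leveldb::Status s = leveldb::DB::Open(options, "/tmp/hmsearch-sketch", &db);
    assert(s.ok());

    // Key = partition value + separator + full hash, with an empty value:
    // adding a colliding hash is then a fresh small write instead of
    // rewriting one ever-growing list value.
    std::string partition = "3f09a2c";            // hypothetical partition value
    std::string hash = "0123...full-hash...cdef"; // placeholder
    db->Put(leveldb::WriteOptions(), partition + ":" + hash, "");

    // Lookup: iterate over all keys sharing this partition-value prefix.
    std::string prefix = partition + ":";
    leveldb::Iterator* it = db->NewIterator(leveldb::ReadOptions());
    for (it->Seek(prefix); it->Valid() && it->key().starts_with(prefix); it->Next()) {
        std::cout << it->key().ToString() << std::endl;
    }
    delete it;
    delete db;
    return 0;
}
```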
See the comment on the leveldb code here: 12aee02#commitcomment-8401852. LevelDB might still handle fragmentation much better than Kyoto Cabinet, so it's worth testing how it behaves when appending hashes. Otherwise, to solve this comfortably we need a database backend that can hold multiple values for a single key in an index efficiently. It could be a relational database, except that we don't really have a need for tables, only for the indices. Redis has a very nice list value type that should handle this well performance-wise, but since it requires holding all data in memory (its whole raison d'être), that's most likely not feasible.
One reason for originally going with a simple file database was to avoid adding infrastructure, but that is only true as long as we don't do online updates - then we'd need something that all index nodes can access anyway. It's starting to sound like a full database should be used after all.

Summary of the rambling below: switching to Postgres with char(N) fields is probably the easiest change, and the more performant one. A MongoDB branch should rework the partition-processing code to use 64-bit integers.

In a relational database, the data model for the hash DB would be flipped to look like this:
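Something along these lines, perhaps (a sketch only: column names, types and widths are assumptions, and the six-partition count is just taken from the 256+6 figure mentioned further down):

```cpp
#include <libpq-fe.h>

// SQL that a Postgres-backed hm_init could issue: one row per hash, one
// indexed char(N) column per partition (widths here are placeholders).
static const char *kSchemaSql =
    "CREATE TABLE hashes ("
    "  hash       char(64) PRIMARY KEY,"  // the full hash, hex-encoded
    "  partition1 char(11),"
    "  partition2 char(11),"
    "  partition3 char(11),"
    "  partition4 char(11),"
    "  partition5 char(11),"
    "  partition6 char(11));"
    "CREATE INDEX hashes_p1 ON hashes (partition1);"
    "CREATE INDEX hashes_p2 ON hashes (partition2);"
    "CREATE INDEX hashes_p3 ON hashes (partition3);"
    "CREATE INDEX hashes_p4 ON hashes (partition4);"
    "CREATE INDEX hashes_p5 ON hashes (partition5);"
    "CREATE INDEX hashes_p6 ON hashes (partition6);";

bool hm_init_pg(PGconn *conn) {
    PGresult *res = PQexec(conn, kSchemaSql);
    bool ok = (PQresultStatus(res) == PGRES_COMMAND_OK);
    PQclear(res);
    return ok;
}
```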
This would be created by hm_init. Insert is straightforward. Lookup would still have to be done partition by partition to stick to the hmsearch algorithm, but it could do a single query for each partition, generating all the 1-variations and then just querying with:
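For illustration, something like this per partition (a sketch assuming the schema above and libpq; a real implementation should of course use bound parameters instead of string-building):

```cpp
#include <libpq-fe.h>
#include <string>
#include <vector>

// 'values' holds the exact value of partition i first, then all its
// 1-variations. Returns every candidate hash matching any of them,
// fetched in a single round trip.
std::vector<std::string> lookup_partition(PGconn *conn, int i,
                                          const std::vector<std::string> &values) {
    std::string col = "partition" + std::to_string(i);
    std::string sql = "SELECT hash, " + col + " FROM hashes WHERE " + col + " IN (";
    for (size_t j = 0; j < values.size(); ++j)
        sql += (j ? ", '" : "'") + values[j] + "'";
    sql += ")";

    std::vector<std::string> hashes;
    PGresult *res = PQexec(conn, sql.c_str());
    if (PQresultStatus(res) == PGRES_TUPLES_OK) {
        for (int r = 0; r < PQntuples(res); ++r) {
            // Column 1 holds the stored partition value; comparing it with
            // values[0] tells whether this was an exact or a 1-variation match.
            hashes.push_back(PQgetvalue(res, r, 0));
        }
    }
    PQclear(res);
    return hashes;
}
```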
And then sort out whether or not an exact match was found by comparing the returned partitionN values. I'm not sure what the best datatype for this would be; it might be a plain char(N), since we know the exact partition lengths. MongoDB could store this too and do the same kind of lookup. The first stab at implementing this did use MongoDB, but didn't use this data model.
I think Postgres makes sense; this issue then turns more into a longer-term issue of supporting a Postgres backend.
Before I do anything stupid, what's the problem with having a table like so: create table hashes (hash, partition int, hashkey)? Adding a hash would then generate N inserts, but we'd only need a single select on lookup, pulling information from all partitions. We'd possibly lose out on size, since it would store the hash N times, and we'd have N times more rows in the table, which could impact performance.
The SQL query would be a bit messy, with lots of IN and AND and OR. With some GROUP BY you might even be able to implement the entire hmsearch algorithm in a single query! That said, I don't think we need to be scared of doing multiple roundtrips. When a single user queries, it is quite a long process anyway, from the hashing starting out in the browser until getting a response with annotations, so there's not that much (relative) scope for shortening the time here. Requests from multiple users could parallelise nicely too (at least if there are multiple database connections), so we don't have to give a very fast response to an individual user to keep up with load from many users.
Of course, one possibility with the single-column table is to just generate the 256+6 (for max error 10) partition values, throw all of them into a single query, and sort out the result when getting the response. Brutal, but might be efficient.
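A rough sketch of the shape of that single query, assuming the hashes(hash, partition, hashkey) table proposed above (placeholders only, not working code):

```cpp
// One OR branch per partition; $1, $2, ... stand for the exact partition
// value plus its 1-variations, 256+6 bound values in total.
static const char *kSingleLookupSql =
    "SELECT hash, partition, hashkey "
    "  FROM hashes "
    " WHERE (partition = 1 AND hashkey IN ($1, $2, $3)) "
    "    OR (partition = 2 AND hashkey IN ($4, $5, $6)) "
    // ... and so on for the remaining partitions ...
    ";";
// The client then walks the returned rows, compares each hashkey against the
// exact partition values to tell exact matches from 1-variations, and finally
// checks the real Hamming distance of every candidate hash. With a GROUP BY
// hash and a HAVING clause on the match count, even more of the filtering
// could be pushed into the database, as suggested above.
```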
Original issue description:

The database size grows exponentially when using HashDB on hashed images. Some improvement can be had by changing the order in which records are inserted, but for larger data sets this doesn't matter. Completely random hashes result in smaller db sizes than real-world hashed image data.

Changing from a HashDB to a TreeDB in Kyoto Cabinet (branch kyoto-tree) makes the db size significantly smaller and makes it grow linearly with new inserts, but at the cost of complexity (O(log N) lookups instead of O(1)).