Database size spirals exponentially #4

Open · jonasob opened this issue Oct 31, 2014 · 16 comments

jonasob (Member) commented Oct 31, 2014

The database size grows exponentially when using HashDB on hashed images. Some improvement can be gained by changing the order in which records are inserted, but for larger data sets this doesn't matter. Completely random hashes result in smaller db sizes than real-world hashed image data.

Changing from a HashDB to a TreeDB in Kyoto Cabinet (branch kyoto-tree) makes the db significantly smaller, and it grows linearly with new inserts, but at the cost of lookup complexity (O(log N) instead of O(1)).
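
For reference, the switch is essentially a change of database class in the Kyoto Cabinet C++ API. A minimal sketch, not the actual kyoto-tree code (the file name and key/value encoding here are just illustrative):

    // Sketch: swapping HashDB for TreeDB in Kyoto Cabinet.
    #include <kctreedb.h>   // was <kchashdb.h> for kyotocabinet::HashDB
    #include <iostream>

    int main() {
      kyotocabinet::TreeDB db;   // was kyotocabinet::HashDB
      if (!db.open("hashes.kct",
                   kyotocabinet::TreeDB::OWRITER | kyotocabinet::TreeDB::OCREATE)) {
        std::cerr << "open error: " << db.error().name() << std::endl;
        return 1;
      }
      // Each partition value is a key; matching hashes are appended to its value.
      db.append("partition-value", "hash-bytes");
      db.close();
      return 0;
    }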

petli (Member) commented Oct 31, 2014

How many hashes have you inserted, and what number of hashes did you set up the databases for? (third argument to hm_init)

jonasob (Member Author) commented Oct 31, 2014

We're working with a set of 1M hashes. The init is done accordingly, so with 1000000. I also tried 1 :) And 100000000 :)

With the hashes returned from gen_hashes.py, the database size grows more reasonably, at a close-to-linear rate, but with real-world data the HashDB version grows exponentially.


jonasob (Member Author) commented Oct 31, 2014

@artfwo reports that TreeDB is in fact faster, quite the opposite of what one would've expected (possibly due to reduced I/O from the smaller disk footprint?).

jonasob (Member Author) commented Oct 31, 2014

It's tempting to just go with TreeDB and replace this in the catalog as well. Appreciate any thoughts, @petli! To me and @artfwo it seems like a simple enough switch that moves us forward. We can leave this issue open and dig into it further when there's time.

petli (Member) commented Oct 31, 2014

Ah. Of course we'll have more collisions with real hashes than random ones...

The database can be compacted with kchashmgr. There are probably lots of holes in it due to the append operation on collisions.

petli (Member) commented Oct 31, 2014

I agree that we should use TreeDB for now if that actually works. Possibly with 1 kB or 2 kB alignment.

The code actually used TreeDB originally, but that had more fragmentation than HashDB with random hashes... I think another reason was to avoid the overhead of the balanced tree, but while that makes sense for well-distributed keys, for a realistic distribution it probably isn't much of a win.

The way this works is that each partition value is a key, and the value is a list of hashes that match that key. On random hashes the probability of collisions is very low, so you'd typically have a single hash in the list. On realistic hashes you have some patterns that dominate. On those, you end up with a lot of hashes with the same partition values, so these lists get very long.

The file size then grows badly, since each addition means a large list must be moved, and it's likely that there isn't a big enough hole anywhere for it. This can be improved by setting the alignment size when tuning the database to e.g. 2 kB, meaning that blocks don't have to be moved as often, and it's more likely that an existing hole can be found for them. The default for HashDB is 8 bytes, while it is 256 bytes for TreeDB. This might explain the improved behaviour.

The other mistuning in the current code is that it lets each hash bucket contain a linear list of keys instead of a B-tree. This was for the misplaced reason of having a smaller database, but of course it means that lookup times are much worse when you have a lot of collisions.

See "Tuning the file hash/tree database" at http://fallabs.com/kyotocabinet/spex.html for details.

A bigger improvement would be to add multiple levels of data, much like classic Unix filesystems work. That is, you start by adding data items to the first level until you have filled it. Then you create a new data level and just link to it from the first level; this one holds links to data levels with further keys. And so on. By setting the alignment to the same size as each data area you can completely avoid fragmentation as more and more hashes are added.
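
For concreteness, a minimal sketch of the alignment tuning against the Kyoto Cabinet C++ API (assuming the TreeDB backend; tune_alignment takes the power of two, so 11 gives 2 kB, and it has to be called before open):

    // Sketch: tune TreeDB alignment to 2 kB (2^11 bytes) so the growing
    // per-partition hash lists don't have to be relocated as often.
    // Defaults: HashDB apow=3 (8 bytes), TreeDB apow=8 (256 bytes).
    #include <kctreedb.h>
    #include <iostream>

    int main() {
      kyotocabinet::TreeDB db;
      db.tune_alignment(11);   // must be set before open()
      if (!db.open("hashes.kct",
                   kyotocabinet::TreeDB::OWRITER | kyotocabinet::TreeDB::OCREATE)) {
        std::cerr << "open error: " << db.error().name() << std::endl;
        return 1;
      }
      db.close();
      return 0;
    }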

jonasob (Member Author) commented Nov 2, 2014

The question has also been posed to the tokyocabinet-users group at https://groups.google.com/forum/#!forum/tokyocabinet-users

jonasob (Member Author) commented Nov 2, 2014

This may or may not be relevant: some good results using Redis to store 300M records in 5 GB (uint32 values though, I guess): http://instagram-engineering.tumblr.com/post/12202313862/storing-hundreds-of-millions-of-simple-key-value-pairs

petli (Member) commented Nov 2, 2014

Redis keeps all data in memory, so it's not very suitable here.

jonasob (Member Author) commented Nov 2, 2014

I've done some experimentation using LevelDB to replace Kyoto Cabinet. The results so far are quite promising. Code branch at https://github.com/commonsmachinery/hmsearch/tree/leveldb and pretty graphs at https://docs.google.com/spreadsheets/d/12SMJ2s6rKdykwHkztzJ4b1w3mcsJ0d2TkN5SIi00Kk4/edit#gid=0

I've tried this with the 5.5M hashes we have in our current database, and it worked like a charm, with a resulting database size of ~900 MB. I did not do any extensive tests for lookup speed, but running a lookup on a 5.5M data set takes ~35-45 ms on my laptop.
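
Presumably the core of the LevelDB variant is still one value per partition key; since LevelDB has no native append, adding a hash becomes a read-modify-write. A rough sketch, not the exact branch code (key and value encoding are illustrative):

    // Sketch: each partition value is a LevelDB key; its value is the
    // concatenated list of hashes matching that partition value.
    #include <leveldb/db.h>
    #include <cassert>
    #include <string>

    int main() {
      leveldb::Options options;
      options.create_if_missing = true;

      leveldb::DB* db = nullptr;
      leveldb::Status s = leveldb::DB::Open(options, "hashes.ldb", &db);
      assert(s.ok());

      const std::string key = "partition-value";
      const std::string hash = "hash-bytes";

      // Read-modify-write: fetch the current list (if any), append, write back.
      std::string list;
      s = db->Get(leveldb::ReadOptions(), key, &list);
      if (!s.ok() && !s.IsNotFound()) return 1;  // real error, not just a new key
      list.append(hash);
      s = db->Put(leveldb::WriteOptions(), key, list);
      assert(s.ok());

      delete db;
      return 0;
    }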

petli (Member) commented Nov 2, 2014

See comment here on the leveldb code: 12aee02#commitcomment-8401852

LevelDB might still handle fragmentation much better than Kyoto Cabinet, so it's worth testing how it behaves when appending hashes.

Otherwise, to solve this comfortably we need a database backend which can hold multiple values for a single key in an index efficiently. It could be a relational database, except that we don't really need tables, only the indices.

Redis has a very nice list value type that should handle this well performance-wise, but since it requires holding all data in memory (its whole raison d'être) that's most likely not feasible.

petli (Member) commented Nov 3, 2014

One reason for originally going with a simple file database was to avoid adding infrastructure, but that only holds as long as we don't do online updates; then we'd need something that all index nodes can access anyway. It's starting to sound like a full database should be used after all.

Summary of the rambling below: switching to Postgres with char(N) fields is probably the easiest change, and the more performant one. A MongoDB branch would need to rework the partition-processing code to use 64-bit integers.

In a relational database the data model for the hash DB would be flipped to look like this:

create table hashes (hash, partition1 index, partition2 index, ... partitionN index).

This would be created by hm_init. Insert is straightforward. Lookup would still have to be done partition-by-partition to stick to the hmsearch algorithm. It could do a single lookup for each partition though, generating all the 1-variations and then just querying with:

select hash, partitionN from hashes where partitionN in (exactMatch, first1var, second1var..., last1var)

Then sort out whether or not an exact match was found by comparing the returned partitionN values.

Not sure what the best datatype for this would be. It might be a plain char(N), since we know the exact partition lengths.

MongoDB could store this too and do find...$in queries, but a strongly-typed database like Postgres should be a fair bit more efficient.

The first stab at implementing this did use MongoDB, but didn't use the $in query. It was thus very inefficient, and was abandoned. There was also quite a large overhead for the indices needed, since everything was stored as variable-length strings. This could be improved if we assume that partitions always fit into 64-bit integers (true for 256-bit keys with max_error > 5) and save them as numbers instead of strings: http://api.mongodb.org/perl/0.42/MongoDB/DataTypes.html#Numbers
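
To make the shape concrete, a minimal libpqxx sketch of that flipped model (the table and column names, the char widths and the two-partition schema are all assumptions, not decisions):

    // Sketch: one row per hash, one indexed char(N) column per partition.
    #include <pqxx/pqxx>
    #include <iostream>

    int main() {
      pqxx::connection conn("dbname=hmsearch");
      pqxx::work txn(conn);

      // hm_init equivalent (only two partitions shown here).
      txn.exec("CREATE TABLE IF NOT EXISTS hashes ("
               "  hash char(64) PRIMARY KEY,"
               "  partition1 char(16),"
               "  partition2 char(16))");
      txn.exec("CREATE INDEX hashes_p1 ON hashes (partition1)");
      txn.exec("CREATE INDEX hashes_p2 ON hashes (partition2)");

      // Lookup, partition by partition: the exact value plus all 1-variations.
      pqxx::result rows = txn.exec(
          "SELECT hash, partition1 FROM hashes "
          "WHERE partition1 IN ('exact-match-val1', '1-variation-val1')");
      for (const auto& row : rows) {
        std::cout << row[0].as<std::string>() << "\n";
      }
      txn.commit();
      return 0;
    }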

jonasob (Member Author) commented Nov 3, 2014

I think Postgres makes sense; this then turns more into a longer-term task of supporting a Postgres backend.

jonasob (Member Author) commented Nov 3, 2014

Before I do anything stupid, what's the problem with having a table like this:

create table hashes (hash, partition int, hashkey)

Adding a hash would then generate N inserts, but we'd only have a single select on lookup that pulls information from all partitions. We'd possibly lose out on size, since it'll store the hash N times, and we'd have N times more rows in the table, which could impact performance.

jonasob assigned jonasob and unassigned petli Nov 3, 2014
petli (Member) commented Nov 3, 2014

The SQL query would be a bit messy, with lots of IN and AND and OR. With some GROUP BY you might be able to implement the entire hmsearch algorithm in a single query!

I don't think we need to be scared of doing multiple roundtrips. When a single user queries, it is in any case quite a long process from the hashing starting in the browser until a response with annotations comes back, so there's not that much (relative) scope for shortening the time here. Requests from multiple users could parallelise nicely too (at least if there are multiple database connections), so we don't have to give a very fast response to an individual user to keep up with load from many users.

petli (Member) commented Nov 3, 2014

Of course, one possibility with the single-column table is to just generate the 256+6 (for max error 10) partition values, and throw all of them into a single

select hash, partition, hashkey from hashes where hashkey in (...)

and sort out the result when getting the response. Brutal, but might be efficient.
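
For completeness, a similar libpqxx sketch of the single-column variant (again with hypothetical names and widths; the IN list would really contain all the generated values):

    // Sketch: "tall" model, N rows per hash. Insert fans out to one row per
    // partition; lookup is one big IN over all generated partition values.
    #include <pqxx/pqxx>
    #include <iostream>

    int main() {
      pqxx::connection conn("dbname=hmsearch");
      pqxx::work txn(conn);

      txn.exec("CREATE TABLE IF NOT EXISTS hashes ("
               "  hash char(64), partition int, hashkey char(16))");
      txn.exec("CREATE INDEX hashes_hashkey ON hashes (hashkey)");

      // Adding one hash = N inserts, one per partition.
      txn.exec_params("INSERT INTO hashes VALUES ($1, $2, $3)",
                      "the-hash", 1, "partition1-value");
      txn.exec_params("INSERT INTO hashes VALUES ($1, $2, $3)",
                      "the-hash", 2, "partition2-value");

      // Single lookup over every generated value; the caller then sorts out
      // which partitions matched exactly and which only at distance 1.
      pqxx::result rows = txn.exec(
          "SELECT hash, partition, hashkey FROM hashes "
          "WHERE hashkey IN ('partition1-value', 'partition2-value')");
      for (const auto& row : rows) {
        std::cout << row[0].as<std::string>() << " " << row[1].as<int>() << "\n";
      }
      txn.commit();
      return 0;
    }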
