Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

proof of concept for BTree index #12

Closed
wants to merge 1 commit into from
Closed

proof of concept for BTree index #12

wants to merge 1 commit into from

Conversation

jxieeducation
Copy link
Contributor

This is an extremely simple proof of concept for #9

This allows the backend to adopt a BTree model using the BTrees pip library (sudo pip install BTrees, did not add to the requirements.txt yet).
While attempting to add the BTree, I found a few problems with using a BTree.

  1. querying the whole set: if user tries to store a tuple, it's easy to add both elements to the hash. however, with trees, it's much more complicated.
  2. search: if an object's built in compare function is wrong / raises an error, the BTree can't do anything about it, because it needs to compare the elements.
    It's easy to make a BTree of the hashes of the elements, however, that pretty much defeats the purpose of using a BTree in the first place.

While writing the simple add / remove functions, I assumed that caching is not needed for BTrees, because it offers little adv. as the tree goes down.
I also believe a BTree will also be clunky when elements get large, because it runs in O(n), while the hashmap runs in O(1).

Please let me know what you think?

@adewes
Copy link
Owner

adewes commented Apr 24, 2014

Hey @jxieeducation ,

first of all thanks for your contribution, this is really great! Here are my thoughts:

  1. I had a look at the BTrees documentation and it seems to rely heavily on ZODB and its persistence module, so I'm not entirely sure if it's the right thing to use here. Basically, I think we want a BTree that can be incrementally written and read from disk (to avoid loading it into memory when the database starts up)
  2. BTrees allow lookup (as well as insertion and deletion) operations in O(log n) time (http://en.wikipedia.org/wiki/B-tree), as opposed to O(1) for lookup in hash tables. The main advantage of a BTree is that it can be written and read incrementally, thus without loading its entire content into memory first, and that range queries can be performed efficiently.
  3. Like with the normal file-based index, we need to insert key/value pairs into the BTree to be able to do rich comparisons between values, search operations and range queries. I think the update function of OOBTree allows for storing key/value pairs in the tree.

Summarizing, I think it would be interesting to see if there's a more lightweight BTree library available for Python or if we can even implement a simple BTree structure ourselves. Feel free however to further investigate the OOBTrees package, if you do I would be interested to see how exactly we would insert key/value pairs and persist the tree on disk. So it would be great if you could write a simple demo that defines a fully functioning BTree-based index class and contains some unit tests that we can run to check its correctness.

Good work!

@jxieeducation
Copy link
Contributor Author

Thanks, I totally agree that this is not the library that we are looking for. Will be coming up with a simple BTree then.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants