
Lack of transaction parameter in Database's get and put methods #1

Closed
jnwatson opened this issue Feb 18, 2013 · 10 comments
@jnwatson
Owner

I notice that Database.get and Database.put do not take a Transaction parameter. It would be nice if there were an optional parameter allowing a separate Transaction to be passed in.

In order to use multiple transactions, do you intend the client of py-lmdb to have to close and re-open the Database object? That seems to me the only way to provide a new transaction handle to the underlying mdb_get and mdb_put functions. Or perhaps I don't understand something about mdb (for example, I'm not sure why a transaction is passed into mdb_open: is it so the database creation itself can be rolled back?)

BTW, thanks for your work so far! I'm pleasantly surprised to have seen mdb deliver performance comparable to regular Python dictionaries for my usage.

@dw
Collaborator

dw commented Feb 18, 2013

Hi,

The API is slightly broken around that at present. Internally, the Database instance captures the Transaction that was used to open() it, so any method calls on the database are in fact transactional. The breakage is that at the C level the database handle outlives the transaction that created it, and repeated calls to the handle-open function return the same handle, which doesn't match what we're doing at all.

I've looked at fixing this, but it requires a design that deviates slightly from the underlying C API, which is why I've resisted (my initial use case also did not benefit from multiple transactions within a single Python process). We could, however, discuss that here.

My idea would be to move Transaction.open() to Environment.open() and have it internally start its own transaction, defaulting to read-only mode if create=False. The problem is that in MDB, creation of a database outlives the transaction it was created within, even if that transaction is rolled back. Howard suggested this may change in the future, so I'd like to avoid exposing this detail to Python; otherwise it will lead to user surprise ("I created a database then rolled the txn back, but it's still there!").

The remaining Database get()/put()/etc. methods would then move to the Transaction object, with an optional db= parameter introduced on each. This is basically the inverse of your suggestion, but it more closely matches the underlying API.

env = lmdb.connect()
db1 = env.open('my-database')
db2 = env.open('their-database')
txn = env.begin()
txn.put('foo', 'bar', db=db1)
txn.commit()

I still hate having to pass the optional db= argument in, something doesn't feel right about it.

@jnwatson
Owner Author

I'm not sure what the right answer is, so I thought I'd look at what other database packages do. The below is just my quick survey.

First, there happens to be a suggested standard Python API for databases, PEP 249:
http://www.python.org/dev/peps/pep-0249/

I don't think this applies to mdb at all; it seems to govern SQL-style databases.

BerkeleyDB is very similar to mdb in its C API.
PyBSDDB uses a txn=None optional parameter in its database get/put methods:
http://www.jcea.es/programacion/pybsddb_doc/db.html#db-methods

ZODB seems to have a single global Transaction that you can continually commit (without starting a new one). You can also have subtransactions, but I don't see a way to run parallel independent transactions. There's also a separate transaction package that implements a context manager (e.g. with transaction).

I'll keep looking.

I do like the idea of an auto-commit mode that commits every operation, and I think it's a common use case.
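To make the suggestion concrete, here is a minimal sketch of what such an auto-commit mode could look like. Everything here is invented for illustration (AutoCommitDB and FakeTxn are not part of py-lmdb), and a dict-backed stand-in replaces the real environment so the sketch is self-contained:

```python
class FakeTxn:
    """Stand-in for a write transaction over an in-memory store."""
    def __init__(self, store):
        self.store = store
        self.pending = {}

    def put(self, key, value):
        self.pending[key] = value

    def commit(self):
        self.store.update(self.pending)


class AutoCommitDB:
    """Hypothetical auto-commit wrapper: each put runs inside its own
    short-lived transaction that commits immediately."""
    def __init__(self):
        self.store = {}

    def put(self, key, value):
        txn = FakeTxn(self.store)   # begin a fresh transaction
        txn.put(key, value)
        txn.commit()                # commit after every operation

    def get(self, key):
        return self.store.get(key)


db = AutoCommitDB()
db.put('foo', 'bar')
print(db.get('foo'))  # -> bar
```

The convenience cost is one transaction per operation, which is exactly the trade-off discussed later in this thread.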

@dw
Collaborator

dw commented Feb 18, 2013

Hi again & thanks for the survey! It's really helpful.

Your mention of PyBSDDB is particularly interesting: as you say, MDB is designed to clone Berkeley's interface, so there's no reason we shouldn't do the same in Python. I'm going on holiday for a few days, but will look more closely at PyBSDDB's interface when I return.

DBAPI PEP is definitely a no-go, it doesn't make sense at all for BDB/MDB etc (unless you want to implement an SQL engine on top ;).

@dw
Collaborator

dw commented Mar 25, 2013

Hi there,

I have finally gotten around to updating the library. The old Cython binding has been replaced with a cffi binding, enabling compatibility with PyPy. I settled on a compromise for the interface:

  • Database get/set/put/delete methods are moved to Transaction.
  • Transaction defaults to the main database.

This means it's possible to work with the main database simply with:

env = lmdb.connect()
with env.begin() as txn:
    txn.put(...)
    for key, value in txn.cursor():
        pass
    # etc

Working with sub-databases can be accomplished by using the db= parameter. I believe this is a fair compromise, since most users will only want a single keyspace.

db = env.open('sub-database')
with env.begin() as txn:
    txn.put(k, v, db=db)

@dw
Collaborator

dw commented Mar 25, 2013

Finally, I forgot to mention: making txn= a parameter is pointless since, unlike BDB, MDB does not support 'transactionless' operations, and emulating them would encourage bad behavior. Users would end up creating many transactions during a bulk insert without knowing the transactions exist, rather than being forced to think explicitly about how their transaction is formed.
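The contrast can be sketched with an in-memory stand-in (the Txn class below is invented for illustration, not the lmdb API): hidden per-operation transactions cost one commit per write, while an explicit transaction lets the user choose one commit for the whole batch.

```python
class Txn:
    """In-memory stand-in for a write transaction (not the lmdb API)."""
    total_commits = 0

    def __init__(self, store):
        self.store = store
        self.pending = {}

    def put(self, k, v):
        self.pending[k] = v

    def commit(self):
        self.store.update(self.pending)
        Txn.total_commits += 1


store = {}

# Hidden per-operation transactions: 1000 commits for 1000 writes.
for i in range(1000):
    t = Txn(store)
    t.put(i, i)
    t.commit()
implicit_commits = Txn.total_commits   # 1000 so far

# Explicit transaction chosen by the user: one commit for the batch.
t = Txn(store)
for i in range(1000, 2000):
    t.put(i, i)
t.commit()
print(Txn.total_commits - implicit_commits)  # -> 1
```

With real LMDB the difference is not just bookkeeping: each commit is a durable, synchronized operation, so hiding the transaction boundary would silently multiply that cost.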

@wbolster
Contributor

Is your goal to provide a mostly 1-to-1 translation of the C API, or a Pythonic wrapper? The .begin() doesn't really feel Pythonic to me, and neither does the explicit .cursor() for iteration.

Fwiw, in Plyvel (Python bindings to LevelDB; see https://plyvel.readthedocs.org/) I made the WriteBatch (similar to a transaction, it seems) specific to a database, even though technically it is not linked to a database until the batch is written/applied. The end result is a more natural API (in my opinion), at the expense of slightly less flexibility.

@dw
Collaborator

dw commented Mar 26, 2013

Hi Wouter,

In LMDB it is impossible to iterate without starting a transaction, and the intent of the iteration dictates what kind of transaction should be created: in particular, a read-write transaction will block all other writers. Since one of the major benefits of LMDB over LevelDB is its support for interprocess concurrency, implicit write transactions would never make sense, and implicit read transactions require synchronization that becomes expensive – to the tune of 19.8 microseconds with my current wrapper (or about 50k requests/sec).

I had thought about providing some kind of 'DatabaseTransaction' object that bound both objects together, but I could not think of a good use case where this would be beneficial in any meaningful way.

With regard to transactions, opening a database potentially requires a write transaction, which means any 'Pythonic' (I detest that term - it is meaningless) interface that tries to blur these lines must deal with upgrading the user's read-only transaction as necessary, and suchlike. There is little room for compromise here that does not result in reduced concurrency, efficiency, or user surprise. The extra 6 characters is more than worth it IMHO.

@dw dw closed this as completed Mar 26, 2013
@dw
Collaborator

dw commented Mar 26, 2013

I forgot to add a further point: LMDB's Cursor object inherently binds to a specific database and supports a put operation, although the latter is not implemented yet. This roughly serves the case of having a 'bound writable database within a transaction'.

@jnwatson
Owner Author

Looks fine to me.

I didn't have any trouble getting your previous Cython binding built for PyPy, at least enough for a smoke test. In a single-process lmdb benchmark I wrote, PyPy ran 24% slower than regular python -O.

However, a more serious multi-process benchmark I wrote, running under PyPy, gets into what looks like a deadlock, locking the database for good (even after restarting and accessing it with regular Python). I saw no such behavior with regular Python.

In other words, I welcome the cffi binding.

Nic

@dw
Collaborator

dw commented Apr 8, 2013

Hi Nic,

Just a small ping to note the CPython binding has been completely rewritten in C, and has much better performance for basically every operation (simple tests get 600k random reads/sec). Right now CPython is faster than PyPy/cffi, although I'd like to make cffi competitive again somehow.
