This repository has been archived by the owner on Feb 8, 2023. It is now read-only.

idea: support for transactional groups of writes #106

Closed
ianopolous opened this issue Feb 1, 2016 · 41 comments


@ianopolous
Member

I had a brief discussion with @whyrusleeping and wanted to start the ball rolling on a public discussion here.

Motivation:
The ability to implement an ACID database or other transactional system in a highly concurrent environment. In my use case with Peergos, I am imagining many users concurrently writing to the same IPFS daemon, which means that GC will effectively be triggered at random relative to each thread. Each user will be building up trees of objects using object.patch before eventually committing (pinning).

Implementation:
What I had in mind was an optional parameter to all IPFS write commands: a transaction ID (and maybe a timeout as well). All writes within a transaction are tagged with this temporary ID. IPFS just keeps a map from transaction ID to a set of hashes, and these act as additional object roots during GC. When a transaction is complete you call completeTransaction(tid), which drops the mapping (or you let it time out). This guarantees that an object you've just written won't be GC'd, regardless of GC activity, before you are finished with it.
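
For illustration, here is a minimal sketch in Go of the bookkeeping this proposal implies (hypothetical code, not anything in go-ipfs; all names are made up). Tagged writes land in a per-transaction set with a deadline, the garbage collector treats every live set as extra roots, and completing a transaction drops the mapping:

```go
package txroots

import (
	"sync"
	"time"
)

// txRoots sketches the proposed daemon-side state: a map from transaction
// ID to the set of hashes written under it, plus an expiry deadline.
type txRoots struct {
	mu  sync.Mutex
	txs map[string]*tx
}

type tx struct {
	deadline time.Time
	hashes   map[string]struct{} // hashes written under this transaction
}

// Tag records a write under a transaction and pushes the deadline out
// again, so an active transaction can keep asking for more time.
func (t *txRoots) Tag(tid, hash string, timeout time.Duration) {
	t.mu.Lock()
	defer t.mu.Unlock()
	x, ok := t.txs[tid]
	if !ok {
		x = &tx{hashes: map[string]struct{}{}}
		t.txs[tid] = x
	}
	x.hashes[hash] = struct{}{}
	x.deadline = time.Now().Add(timeout)
}

// Complete is completeTransaction(tid): drop the mapping so anything
// not otherwise pinned becomes collectable.
func (t *txRoots) Complete(tid string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	delete(t.txs, tid)
}

// GCRoots is what GC would consult: every hash in a live (non-expired)
// transaction counts as an extra root.
func (t *txRoots) GCRoots(now time.Time) []string {
	t.mu.Lock()
	defer t.mu.Unlock()
	var roots []string
	for tid, x := range t.txs {
		if now.After(x.deadline) {
			delete(t.txs, tid) // timed out: no longer protected
			continue
		}
		for h := range x.hashes {
			roots = append(roots, h)
		}
	}
	return roots
}
```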

@jbenet
Member

jbenet commented Feb 1, 2016

Good ideas 👍

@whyrusleeping
Member

this could be done on top of the files api, where the 'token' is just part of the path. and then we have the root of the database namespace 'pinned' via the soon-to-be-created ipfs files pin command

@jbenet
Member

jbenet commented Dec 22, 2016

@ianopolous thanks for bringing this up again (in irc) -- i'm still not entirely sure what the best way to do this is. we should outline the needs and constraints, how operations function (how they may fail, how to recover), and how to implement proper consistency/safety (oplogs? journals?)

this also relates to ipfs-cluster and the "transaction oplog" we're making there. cc @hsanjuan

@wking

wking commented Dec 22, 2016 via email

@ianopolous
Member Author

@wking I don't think that would give you any guarantees. Consider the example where a user is doing a single add of a large file, which can take arbitrarily long. Halfway through writing this file, someone else calls GC. I believe (correct me if I'm wrong, @whyrusleeping) this would gc all the objects written for the first half of the file. Even if that is handled correctly internally and the sub-objects can't be gc'd, it shows that the user needs to control the timeout of the transaction.

There isn't any overlap between my proposal and object pinning: it would work in conjunction with pinning.
e.g.

add object X with transactionID T, giving multihash H
pin (H)
completeTransaction(T)

or a more complex example:

add object X with transactionID T, giving multihash H1
add object Y (which has a merkle link to H1) with transactionID T, giving multihash H2
pin -r (H2)
completeTransaction(T)

To me this is a very simple mental model which gives you real guarantees and thus allows you to construct algorithms which are robust under concurrency.

@wking

wking commented Dec 23, 2016 via email

@ianopolous
Member Author

With a configurable expiration time, you can still handle wide, bushy trees safely.

The timeout would have to be configurable on a call-by-call basis, which is exactly what I'm suggesting.

All the calls I'm suggesting have a user-controlled timeout (see the first comment); in my code examples I left it as an implicit parameter. This means the behaviour is well defined across ipfs crashes and restarts, assuming that the write command returning means the transactionID + timeout and hash have been "written to disk". Upon restart, ipfs just continues the timeout. To be clear, all transactions would have a timeout controlled by the user. This would also play well with ipfs-cluster, where "written to disk" means safely committed to as many nodes as the cluster setup requires, or erasure coded, or whatever.

Let's assume that ipfs can handle the internal writing of a large file correctly under concurrency. The timeout would need to run from completion of the write of the final root object, whichever write command was called. The timeout for a transaction would be updated with every write in the same transaction, so you can effectively "keep asking for more time" (another property lacking in an expiring-pin model).

Expiring pins cannot help here. Consider the following interleaving:

add object X with transactionID T, timeout t, giving multihash H
gc (from another thread)
pin (H)
completeTransaction(T)

We've lost our object immediately because of another thread calling gc. This can't be fixed by any modification to the pin call, because the damage is already done before the pin call starts.

@wking

wking commented Dec 23, 2016

The timeout for a transaction would be updated with every write in the same transaction, so you can effectively "keep asking for more time" (another property lacking in an expiring-pin model).

I agree that you'd need a way to ask for more time, but I don't see why you couldn't get that with expiring pins. And it would give you a way to guard an existing object:

err = pin(object.hash, expiration=estimated_time_to_patch + buffer)
if err == DoesNotExist:
    push(object.content, expiration=estimated_time_to_patch + buffer)

... the damage is already done before the pin call starts.

Yeah, you'd need either a default expiring pin on all objects that lasts long enough for the explicit pin placement, or a way to set (expiring) pins atomically with the push.

But these expiring pins look a lot like your expiring transactions. The trade-off is that transactions can be explicitly closed/canceled if you finish early but require per-(object, transaction) tracking on the server, while expiring pins are not cancelable (unless you're OK with Bob removing Alice's pin) and only require a single value (the max expiration time) to be tracked per object.
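
To make that trade-off concrete, here is a hypothetical sketch in Go of the two bookkeeping shapes (illustrative types only, not go-ipfs code):

```go
package pinmodels

import "time"

// Transactions: the server tracks a set of objects per transaction, and a
// whole group can be released early by closing its transaction.
type txTable map[string]map[string]struct{} // tid -> set of hashes

// Expiring pins: the server tracks one value per object, the maximum
// expiration seen so far. Nothing to cancel: pins simply lapse, and
// extending one is a monotonic max.
type expiryTable map[string]time.Time // hash -> max expiration

// extend lengthens an expiring pin; it can never shorten one, which is
// why Bob cannot remove Alice's pin under this model.
func extend(t expiryTable, hash string, until time.Time) {
	if until.After(t[hash]) {
		t[hash] = until
	}
}
```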

@jbenet
Member

jbenet commented Dec 23, 2016

We've considered a similar heuristic, like "don't gc things added in the last few seconds", which may help some but may introduce other problems.

I think good transactional semantics are useful to define, and ipfs (at least the repo business) is quite simple; this should be easy.

I think a place to start would be to go over the core api, refine the functions we want to "transactionize" (i suspect they're likely the same as for ipfs-cluster cc @hsanjuan) and then propose how we would do the transaction on each.

How we do GC can definitely be rethought quite a bit to support these operations safely.

@ianopolous
Member Author

For a concrete example of a complex operation I'd like to put in a transaction, have a look at Peergos' write to a merkle-btree, the recursive put method: https://github.com/Peergos/Peergos/blob/master/src/peergos/shared/merklebtree/TreeNode.java#L80

Note that it's written in an asynchronous style using CompletableFutures so that it can be automatically compiled to JavaScript and use native Promises in a browser. Ordinarily I'd just use synchronous Java code, which is easier to read. You can ignore the UserPublicKey argument; the ContentAddressedStorage argument is IPFS.

Essentially, it has to modify ~log(tree size) nodes, resulting in a new root, which I then pin. Non-leaf nodes have merkle links to other nodes, and leaf nodes have merkle links to larger data objects.

A much simpler use case at my day job is a single add-and-pin of a large file.

@ianopolous
Member Author

@whyrusleeping Another, much simpler, idea I had to solve all these use cases is to give every write command an option to also pin (recursively) the result (the assumption is that each command is internally atomic with respect to gc). This would let the user build up arbitrary structures and then, once the final root is written and pinned, explicitly unpin all the intermediate pins; see the sketch below. Obviously the burden is then on the client to keep track of all their intermediate pins in a transaction, but that is easy.
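
A sketch in Go of the client-side bookkeeping this needs, assuming a hypothetical client whose write call pins its result atomically (the store interface and all names below are illustrative, not a real API):

```go
package pinning

// store stands in for a client of the ipfs daemon; PutPinned is the
// assumed write-and-pin-in-one-atomic-step command described above.
type store interface {
	PutPinned(data []byte) (hash string, err error)
	Unpin(hash string) error
}

// writeTree writes the intermediate objects and then the root, all
// pinned, and finally releases the intermediate pins. If the root's pin
// is recursive, the intermediates remain reachable through it.
func writeTree(s store, intermediates [][]byte, root []byte) (string, error) {
	var temp []string // the client's burden: remember every intermediate pin
	for _, d := range intermediates {
		h, err := s.PutPinned(d)
		if err != nil {
			return "", err
		}
		temp = append(temp, h)
	}
	rootHash, err := s.PutPinned(root)
	if err != nil {
		return "", err
	}
	for _, h := range temp { // safe: everything is reachable from the pinned root
		if err := s.Unpin(h); err != nil {
			return "", err
		}
	}
	return rootHash, nil
}
```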

@jbenet
Member

jbenet commented Dec 24, 2016

@ianopolous yeah i think that would work. that -- as a baseline -- should be possible. i would like to support larger transaction batches though

@whyrusleeping
Member

whyrusleeping commented Dec 26, 2016 via email

@whyrusleeping
Member

cc @Stebalien, I remember we were talking about making the concept of 'sessions' extend up into the API. That could potentially support this use case.

@Stebalien
Member

Yes: ipfs/kubo#4138

Really, we already have blockservice sessions (you added them when you added bitswap sessions); we just need to refine them a bit and ensure that blocks don't get garbage collected while a session that has observed them is open.

@ianopolous
Member Author

@whyrusleeping asked me to write my ideal api for this so here goes:

ipfs transaction start $duration_millis => tid
ipfs transaction close $tid

Then write commands (block put in our case) get an optional tid parameter so usage would look like:

ipfs transaction start $duration_millis => tid
ipfs block put .... tid=$tid
...
ipfs block put .... tid=$tid
ipfs pin add $hash
ipfs transaction close $tid

@momack2
Contributor

momack2 commented Sep 27, 2018

@whyrusleeping does any of the new session work you've been doing affect this issue? This seems to be the main blocker for data safety in Peergos, which is something they want to work on in Q4. If it is applicable, would it be ready on that timeline? If not, is there something we could do to unblock them in the nearer term (other GC improvements?)?

@Stebalien
Member

Not really. We'll probably use sessions with this but it's mostly orthogonal.

@momack2
Contributor

momack2 commented Oct 4, 2018

Hey @michaelavila! I hear you are working on something close to this in pinning land. Peergos has some work that depends on this capability (effectively the ability to add temporary pins to a "don't garbage collect this yet" set that we clean up later) and we'd like to unblock them by early/mid Q4 so they can do more work on top of that. Would you be capable/interested in owning this?

@michaelavila

@momack2 I don't mind taking on this task, but I'm out 10/20-11/4 and we have a few weeks of team gatherings in different countries before that, all of which I think will affect timing, which appears to be a constraint here.

@momack2
Contributor

momack2 commented Oct 17, 2018

Let's discuss on Thursday at the London IPFS meetup to see if the current very basic "don't GC until this is done" solution is sufficient for Peergos folks to make progress. If something more adaptive is needed we can find another owner.

@hsanjuan
Contributor

Isn't there an mfs workaround for this, consisting of putting the cids of things you want to protect from GC into mfs? This puts them in a gc ignore list. Once you're done, you can remove them from mfs. I was told about this when asking about similar issues with GC when doing "refs pinning" (refs -r + pin add at the end).

@hsanjuan
Contributor

So putting and removing things from mfs effectively gives you a session for that thing(?)

@ianopolous
Member Author

This sounds like a possible way to implement the desired api. Can you tag the hashes in mfs with the transaction id, so we can have multiple concurrent sessions/transactions in flight?

Presumably we could just use whatever mechanism mfs uses to add them to a gc ignore list directly, though?

@hsanjuan
Contributor

You don't need transaction ids. Just put the object in an mfs folder (with a random name known by your transaction actor) so it's not GCed. At the end of the transaction, the actor removes its folder. I think GC gets an ignore list with all the things in MFS, so as long as someone has put it in there it should be safe.

I'm just saying that if you want this functionality, there's a potential workaround that doesn't need any changes to the codebase and isn't crazy to implement on top of ipfs right now (again, in theory; maybe I'm missing something).

@ianopolous
Member Author

We need transaction ids just to group objects that can be released together, without affecting concurrent transactions. As you say, the id could be used as the folder name in mfs.

In our case, we have to implement every ipfs call we use in our own codebase, and we'd rather not do temporary workarounds (we've been waiting 3 years already for safe transactions in ipfs, so a little longer is fine :-) ). But it sounds like one could use mfs (or what it uses under the hood) to implement our proposed calls in ipfs; a sketch follows.
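
For illustration, a sketch in Go of that workaround driven over the daemon's HTTP API, using the transaction id as the mfs folder name. The /api/v0/files/* endpoints are real; the API address, folder layout, and function names are assumptions:

```go
package mfstx

import (
	"fmt"
	"net/http"
	"net/url"
)

const api = "http://127.0.0.1:5001/api/v0" // assumed local daemon address

func post(path string, q url.Values) error {
	resp, err := http.Post(api+path+"?"+q.Encode(), "", nil)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("%s: %s", path, resp.Status)
	}
	return nil
}

// Begin creates an mfs folder named after the transaction id.
func Begin(tid string) error {
	return post("/files/mkdir", url.Values{
		"arg": {"/tx-" + tid}, "parents": {"true"},
	})
}

// Protect links a cid into the transaction's folder so GC ignores it.
func Protect(tid, cid string) error {
	return post("/files/cp", url.Values{
		"arg": {"/ipfs/" + cid, "/tx-" + tid + "/" + cid},
	})
}

// Close removes the folder; anything not pinned elsewhere becomes
// collectable again, ending the "transaction".
func Close(tid string) error {
	return post("/files/rm", url.Values{
		"arg": {"/tx-" + tid}, "recursive": {"true"},
	})
}
```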

@michaelavila

michaelavila commented Nov 15, 2018

@ianopolous are you ok with using a flag to pass the transaction id in?

e.g.:

ipfs transaction start $duration_millis => tid
ipfs block put --transaction=$tid ...
...
ipfs block put --transaction=$tid ...
ipfs pin add $hash
ipfs transaction close $tid

We can have a shorthand for the transaction flag too: --transaction=<tid> | -t <tid>.

@ianopolous
Member Author

@michaelavila Yep, that's how I imagined it'd be implemented.

@cboddy
Member

cboddy commented Nov 29, 2018

@michaelavila that's exciting! Just wondering: is there any work proceeding on this feature, or when might such a feature be expected?

@momack2
Contributor

momack2 commented Dec 4, 2018

@michaelavila @eingenito @Stebalien - can I ask that we freeze an interface for folks to implement against sooner rather than later (e.g. by EOWeek if possible), despite not having this implemented yet? If we can agree on that, we can do useful work in parallel that lines up along that interface. =]

@eingenito

eingenito commented Dec 5, 2018

@cboddy and @momack2 - @michaelavila is currently not working on this feature, and it's unlikely to be addressed in the next two months with priorities as-is, which means that agreeing on the command-line API might not buy much. Particularly since actually trying to invoke ipfs with these arguments will just cause it to complain.

That said, how about:
ipfs transaction start [--duration=<millis>] so we can reserve the right to create open-ended transactions
ipfs block put [--transaction-id=<tid>] <data> to specify the transaction, maybe leaving room for a boolean transaction flag or something like that later
ipfs transaction commit <tid> to commit an identified transaction

-t isn't a great short name because it's already used by ipfs add, and that's a command that might well want to participate in a transaction, so maybe just punt on that for now?

Also, does pinning participate in the transaction? ipfs pin add [--transaction-id=<tid>]. I know that under the hood the first implementation is likely to just defer GC until transaction commit, but that's not something the user would automatically know (everyone here would, of course).

What do people think?

@ianopolous
Member Author

@eingenito We don't need a commit; that is covered by pin.

The idea is you write a bunch of related ipld objects inside the transaction, pin what you need at the end, and then close the transaction (which just allows the rest to be GC'd). The pin command doesn't need awareness of the transaction (though clearly the intermediate objects need to be pinnable).

FWIW, we only care about the http api, and specifically only the block put command.

@momack2
Contributor

momack2 commented Dec 14, 2018

I think we're agreed on @michaelavila's api from #106 (comment), with the caveat that --transaction-id=<tid> (without the short -t option) seems safer. (@Stebalien - shout if you disagree)

@ianopolous is there anything else blocking you?

@ianopolous
Member Author

@momack2 That's great! That's all we need to move forward. Thank you!

@ianopolous
Member Author

@momack2 Is there any possibility of progressing an implementation of this?

@momack2
Contributor

momack2 commented Jul 12, 2019

Our team is very focused on the identified needs for package managers this quarter, so I don't think we have development bandwidth to take this on - but if you want to drive an implementation proposal and bring it to an upcoming core implementations meeting for discussion / design feedback, I think we could help review, advise, and unblock from that perspective.

@ianopolous
Member Author

I have good news @momack2! We've discovered a way to run our own (fully concurrent) GC algorithm in Peergos on any ipfs instance, regardless of which blockstore it is using and without direct access to the blockstore (previously we could do this only on S3 blockstores via direct access). This means we don't need the built-in ipfs gc, and as a consequence we don't need this transaction api (or, actually, even the pin api any more).

@momack2
Contributor

momack2 commented Jul 22, 2020

Very interesting! How did you achieve that? Might be something other teams are equally excited about!

@ianopolous
Member Author

It will take a while to explain, but we plan to write a blog post about it at some point.

@bertrandfalguiere

Was there a blog post in the end? I would be interested in reading it. :)

@ianopolous
Member Author

This got lost in the pandemic, but thanks for the reminder @bertrandfalguiere. I should have time for this in the next few months.
