idea: support for transactional groups of writes #106
Good ideas 👍 |
this could be done on top of the files api, where the 'token' is just part of the path element. And then we have the root of the database namespace 'pinned' with the soon-to-be-created |
@ianopolous thanks for bringing this up again (in irc) -- i'm still not entirely sure what the best way to do this is. we should outline the needs and constraints, how operations function (how they may fail, how to recover), and how to implement proper consistency/safety (oplogs? journals?) this also relates to ipfs-cluster and the "transaction oplog" we're making there. cc @hsanjuan |
Another approach to this would be to use creation / access timestamps per
object, and only collect garbage older than $TIMEOUT [1]. That is
less stateful than tracking transactions, which is nice for robustness
and simplicity. However, the timestamp-based approach doesn't give
you a way to release objects once you no longer need to protect them
from garbage collection.
And, as @whyrusleeping points out [2], there's a lot of overlap
between your proposed transaction semantics and existing pinning
behavior. Maybe you're asking for pins that have an expiration date?
Like [3], but for pins instead of signatures.
This issue is related to #11.
[1]: https://github.com/ipfs/go-ipfs/pull/1274/files#r31234525
[2]: #106 (comment)
[3]: https://github.com/ipfs/specs/tree/7c897fb2057e998d61d92d5e8fb34df152104673/iprs-interplanetary-record-system#signed-expiring-after-a-time-to-live
|
@wking I don't think that would give you any guarantees. Consider the example where a user is doing a single add of a large file, which can take arbitrarily long. Half way through writing this file, someone else calls GC. I believe (correct me if I'm wrong @whyrusleeping) that this would gc all the objects written for the first half of the file. Even if that is handled correctly internally, and the sub objects aren't able to be gc'd, it proves that the user needs to control the timeout of the transaction. There isn't any overlap between my proposal and object pinning: it would work in conjunction with pinning.
```
add object X with transactionID T, giving multihash H
pin (H)
completeTransaction(T)
```
or a more complex example:
```
add object X with transactionID T, giving multihash H1
add object Y (which has a merkle link to H1) with transactionID T, giving multihash H2
pin -r (H2)
completeTransaction(T)
```
To me this is a very simple mental model which gives you real guarantees and thus allows you to construct algorithms which are robust under concurrency. |
On Thu, Dec 22, 2016 at 02:20:11PM -0800, Ian Preston wrote:
> Consider the example where a user is doing a single add of a large
> file…
This is not an atomic push of a single object. The file gets chunked
and sharded. You push chunks 1, 2, 3, …, and then fan-out objects i,
ii, iii, …. You only need the expiring pin (which can be explicit or
creation-time based) on one object to last long enough for you to push
the parent (after which the expiring pin on the parent will protect
its ancestors from garbage collection). With deep, skinny trees
(e.g. Git histories), the time-until-you-push-a-parent is going to be
fairly constant. With wide, bushy trees (e.g. huge files with
multiple fan-out layers), the time-to-parent is going to increase
exponentially by fan-out layer. But you can still handle wide, bushy
trees safely if the expiration times are configurable. For example, “I
won't get a parent for this object
for around 2 hours. Please don't GC it for the next three hours”. Or
maybe “My network is flaky. Please keep everything I push for 24
hours”.
> …
> completeTransaction(T)
>
> To me this is a very simple mental model which gives you real
> guarantees and thus allows you to construct algorithms which are
> robust under concurrency.
The issue here is “what happens if the client dies while pushing and
never releases the transaction?”. Unless your transactions expire (in
which case, they're back to looking like expiring pins), the server
will eventually get bogged down tracking an always-increasing number
of opened-but-abandoned transactions.
|
The timeout would have to be configurable on a call by call basis, which is exactly what I'm suggesting. All the calls I'm suggesting have a user-controlled timeout (see the first comment); in my code examples I left it as an implicit parameter. This means the behaviour is well defined for ipfs crashes and restarts, assuming the return from the write command means that the transactionID + timeout and hash have been "written to disk". Upon restart, ipfs just continues the timeout. To be clear, all transactions would have a timeout controlled by the user. This would also play well with ipfs-cluster, where "write to disk" means safely committed to as many nodes as the cluster setup requires, or erasure coded, or whatever.
Let's assume that ipfs can handle the internal writing of a large file correctly under concurrency. The timeout would need to run from completion of the write of the final root object, whichever write command was called. The timeout for a transaction would get updated with every write in the same transaction, so you can effectively "keep asking for more time" (another property lacking in an expiring-pin model).
Expiring pins cannot help here. Consider the following interleaving:
```
add object X with transactionID T, timeout t, giving multihash H
gc (from another thread)
pin (H)
completeTransaction(T)
```
We've lost our object immediately because of another thread calling gc. This can't be fixed by any modification to the pin call, because the damage is already done before the pin call starts. |
I agree that you'd need a way to ask for more time, but I don't see why you couldn't get that with expiring pins. And it would give you a way to guard an existing object:
```
err = pin(object.hash, expiration=estimated_time_to_parent + buffer)
if err == DoesNotExist:
    push(object.content, expiration=estimated_time_to_parent + buffer)
```
Yeah, you'd need either a default expiring pin on all objects that lasts long enough for the explicit pin placement, or a way to set (expiring) pins atomically with the push. But these expiring pins look a lot like your expiring transactions. The trade-off is that transactions can be explicitly closed/canceled if you finish early but require per-(object, transaction) tracking on the server, while expiring pins are not cancelable (unless you're OK with Bob removing Alice's pin) and only require a single value (the max expiration time) to be tracked per object. |
We've considered a similar heuristic like "dont gc things added in less than sec" that may help some, but may introduce other problems. I think good transactional semantics are useful to define, and ipfs (at least the repo business) is quite simple; this should be easy. I think a place to start would be to go over the core api, refine the functions we want to "transactionize" (i suspect they're likely the same as for ipfs-cluster cc @hsanjuan) and then propose how we would do the transaction on each. How we do GC can definitely be rethought quite a bit to support these operations safely. |
For a concrete example of a complex operation I'd like to put in a transaction, have a look at Peergos' write to a merkle-btree, the recursive put method: https://github.com/Peergos/Peergos/blob/master/src/peergos/shared/merklebtree/TreeNode.java#L80 Note that's written in an asynchronous way using CompletableFutures so that it can be automatically compiled to Javascript and use native Promises in a browser. Ordinarily I'd just use synchronous Java code which is easier to read. You can ignore the UserPublicKey argument, and the ContentAddressedStorage argument is IPFS. Essentially, it has to modify ~ log(tree size) nodes resulting in a new root which I then pin. Non leaf nodes have merkle links to other nodes, and leaf nodes have merkle links to larger data objects. A much simpler use case at my day job is a single add-and-pin of a large file. |
@whyrusleeping Another, much simpler, idea I had to solve all these use cases is to give every write command an option to also pin (recursively) the result (the assumption is that each command itself is atomic internally with respect to gc). This would then allow the user to build up arbitrary structures and then, when the final root is written and pinned, explicitly unpin all the child pins. Obviously the burden is then on the client to keep track of all their intermediate pins in a transaction, but that is easy. |
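For illustration, that pin-on-every-write workaround could look roughly like this from the CLI. This is only a sketch: the `--pin` options are the proposed addition (not an option write commands had at the time), and the file names and variables are placeholders.
```
# Sketch: build up a structure, pinning every intermediate result as it is
# written (assumes write commands grow a --pin option, as proposed above).
EMPTY=$(ipfs object new unixfs-dir)        # existing command: empty unixfs dir
LEAF=$(ipfs block put --pin < leaf.bin)    # --pin is the proposed option
ROOT=$(ipfs object patch --pin "$EMPTY" add-link data "$LEAF")

# Once the final root is pinned recursively, the client releases the
# intermediate pins it has been tracking.
ipfs pin add -r "$ROOT"
ipfs pin rm "$LEAF"
```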
@ianopolous yeah i think that would work. that -- as a baseline -- should be possible. i would like to support larger transaction batches though |
Let's make an issue to add pinning options to all write commands for now.
That will at least help the short term
|
cc @Stebalien, I remember we were talking about making the concept of 'sessions' extend up into the API. That could potentially support this use case. |
Yes: ipfs/kubo#4138 Really, we already have blockservice sessions (you added them when you added bitswap sessions), we just need to refine them a bit and ensure that blocks don't get garbage collected when a session that has observed them is open. |
@whyrusleeping asked me to write my ideal api for this so here goes:
Then write commands (block put in our case) get an optional tid parameter so usage would look like:
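For illustration, usage against the HTTP API (which is what Peergos uses) might look roughly like the following. The `tid` parameter and the `transaction/start` / `transaction/close` endpoints are hypothetical names for the proposal, not an existing API; `jq` is only used here to pull fields out of the JSON responses.
```
# Open a transaction and remember its id (hypothetical endpoint).
TID=$(curl -s -X POST "http://127.0.0.1:5001/api/v0/transaction/start" | jq -r .Tid)

# Writes carry the transaction id so GC treats them as roots until the
# transaction is closed ("tid" is the proposed optional parameter).
curl -s -X POST -F file=@child.bin \
  "http://127.0.0.1:5001/api/v0/block/put?tid=$TID"
ROOT=$(curl -s -X POST -F file=@root.bin \
  "http://127.0.0.1:5001/api/v0/block/put?tid=$TID" | jq -r .Key)

# Pin the final root, then close the transaction so the intermediate
# blocks become collectable again (also a hypothetical endpoint).
curl -s -X POST "http://127.0.0.1:5001/api/v0/pin/add?arg=$ROOT&recursive=true"
curl -s -X POST "http://127.0.0.1:5001/api/v0/transaction/close?arg=$TID"
```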
|
@whyrusleeping does any of the new session work you've been doing affect this issue? Seemingly this is a main blocker for data safety in Peergos, which is something they want to work on in Q4. If it is applicable, would it be ready on that timeline? If not, is there something we could do to unblock them in the nearer term (other GC improvements?)? |
Not really. We'll probably use sessions with this but it's mostly orthogonal. |
Hey @michaelavila! I hear you are working on something close to this in pinning land. Peergos has some work that depends on this capability (effectively the ability to add temporary pins to a "don't garbage collect this yet" set that we clean up later) and we'd like to unblock them by early/mid Q4 so they can do more work on top of that. Would you be capable/interested in owning this? |
@momack2 I don't mind taking on this task, but I'm out 10/20-11/4 and we have a few weeks of team gatherings in different countries before that. All of which I think will have an effect on timing, which appears to be a constraint here. |
Let's discuss on Thursday at the London IPFS meetup to see if the current very basic "don't GC until this is done" solution is sufficient for Peergos folks to make progress. If something more adaptive is needed we can find another owner. |
Isn't there an |
So putting or removing things from mfs effectively gives you a |
This sounds like it might be a possibility to implement the desired api. Can you tag the hashes in mfs with the transaction id - so we can have multiple concurrent sessions/transactions in flight? Presumably we could just use whatever mechanism mfs does to add them to a gc ignore list directly though? |
You don't need transaction ids. Just put the object in an mfs folder (with a random name known by your transaction actor) so it's not GCed. At the end of the transaction, the actor removes its folder. I think GC gets an ignore list with all the things on MFS, so as long as someone put it in there it should be safe. I'm just saying that if you want to have this functionality there's a potential workaround that doesn't need any changes to the codebase, and it's not crazy to implement on top of ipfs right now (again, in theory, maybe I'm missing something). |
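A minimal sketch of that MFS workaround, using only existing `ipfs files` commands; the folder naming scheme and the `$CHILD_CID`/`$ROOT_CID` variables are placeholders.
```
# Create a scratch folder for this "transaction"; anything linked under MFS
# is treated as a GC root, so these blocks won't be collected.
TX="/tx-$(uuidgen)"
ipfs files mkdir -p "$TX"

# Protect each object written during the transaction by linking it into the
# folder (ipfs files cp only adds a link, it doesn't copy data).
ipfs files cp "/ipfs/$CHILD_CID" "$TX/child"
ipfs files cp "/ipfs/$ROOT_CID" "$TX/root"

# Commit: pin the final root, then drop the scratch folder so the
# intermediate objects become collectable again.
ipfs pin add -r "$ROOT_CID"
ipfs files rm -r "$TX"
```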
We need transaction ids just to group objects that can be released together, without affecting concurrent transactions. As you say, the id could be used as the folder name in mfs. In our case, we have to implement all new calls we use in our codebase and we'd rather not do temporary workarounds (we've been waiting 3 years already for safe transactions in ipfs, so a little longer is fine :-) ). But it sounds like one could use mfs (or what it uses under the hood) to implement our proposed calls in ipfs. |
@ianopolous are you ok with using a flag to pass the transaction id in? e.g.:
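Presumably something along these lines (the flag name and the id are placeholders; no such flag exists yet):
```
ipfs block put --transaction=8cfe3791 < block.bin
```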
We can have a shorthand for the transaction flag too: |
@michaelavila Yep, that's how I imagined it'd be implemented. |
@michaelavila that's exciting! Just wondering if there is any work proceeding on this feature, or when such a feature may be expected? |
@michaelavila @eingenito @Stebalien - can I ask that we freeze an interface for folks to implement against sooner rather than later (ex, by EOWeek if possible) despite not having this implemented yet? If we can agree on that, we can do useful work in parallel that lines up along that interface. =] |
@cboddy and @momack2 - @michaelavila is currently not working on this feature, and it's unlikely to be addressed in the next two months with priorities as-is, which means that agreeing on the command line API might not buy much. Particularly since actually trying to invoke ipfs with these arguments will just cause it to complain. That said, how about:
Also, does pinning participate in the transaction? What do people think? |
@eingenito We don't need a commit, that is covered by pin. The idea is you write a bunch of related ipld objects inside the transaction, pin what you need at the end, and then close the transaction (which just allows the rest to be GC'd). The pin command doesn't need awareness of the transaction (though clearly the intermediate objects need to be pinnable). FWIW We only care about the http api and specifically only the block put command. |
I think we are agreed on @michaelavila's api from #106 (comment) with the caveat that @ianopolous is there anything else blocking you? |
@momack2 That's great! That's all we need to move forward. Thank you! |
@momack2 Is there any possibility to progress an implementation of this? |
Our team is very focused on the identified needs for package managers this quarter, so I don't think we have development bandwidth to take this on - but if you want to drive an implementation proposal and bring it to an upcoming core implementations meeting for discussion / design feedback, I think we could help review, advise, and unblock from that perspective. |
I have good news @momack2 ! We've discovered a way to run our own (fully concurrent) GC algorithm in Peergos on any ipfs instance (regardless of which blockstore it is using, and without direct access to the blockstore - previously we could do this only on s3 blockstores by direct access). This means we don't need the built-in ipfs gc, and as a consequence we don't need this transaction api, (or actually even the pin api any more). |
Very interesting! How did you achieve that? Might be something other teams are equally excited about! |
It will take a while to explain, but we plan to write a blog post about it at some point. |
Is there a blog post in the end? I would be interested in reading it. :) |
This got lost in the pandemic, but thanks for the reminder @bertrandfalguiere I should have time in the next few months for this. |
I had a brief discussion with @whyrusleeping and wanted to start the ball rolling on a public discussion here.
Motivation:
The ability to implement an ACID database or other transactional system in a highly concurrent environment. In my use case with Peergos I am imagining many users concurrently writing to the same IPFS daemon. This means that the GC will essentially need to be called at random relative to each thread. Each user will be building up trees of objects using object.patch, before eventually committing (pinning).
Implementation:
What I had in mind was an optional parameter to all IPFS write commands: a transaction ID (and maybe a timeout as well). Then all writes within a transaction are tagged with this temporary ID. IPFS just keeps a map from transaction ID to a set of hashes, and these form new object roots during GC. Then when a transaction is complete you call completeTransaction(tid), which dumps the mapping (or you let it time out). This gives you guarantees that an object you've just written won't be GC'd, regardless of the GC activity, before you are finished with it.
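A rough sketch of how that could look from the CLI under this proposal. The `--tid`/`--timeout` flags and the `completeTransaction` command are hypothetical, and the `$OLD_NODE`/`$OLD_ROOT`/`$DATA` hashes are placeholders for an object.patch-style tree update.
```
# Each write is tagged with a transaction id and a timeout; the daemon keeps
# tid -> {hashes} and treats those hashes as extra GC roots until the
# transaction completes or times out. All flags here are hypothetical.
NEW_NODE=$(ipfs object patch --tid=T1 --timeout=30m "$OLD_NODE" add-link entry "$DATA")
NEW_ROOT=$(ipfs object patch --tid=T1 --timeout=30m "$OLD_ROOT" add-link child "$NEW_NODE")

# Commit by pinning the new root, then release the temporary roots.
ipfs pin add -r "$NEW_ROOT"
ipfs completeTransaction T1   # hypothetical command; remaining objects become GC-able
```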