What integration to have between Store and Blob protocol implementations #1343

Open
vasco-santos opened this issue Mar 25, 2024 · 5 comments


@vasco-santos
Contributor

vasco-santos commented Mar 25, 2024

A follow-up to some discussion on #1339 about how to integrate the new world (the blob/* protocol) with the old world (the store/* protocol).

store protocol persisted state

Since we shipped w3up, the store/* protocol implementation is backed by two state stores:

  • storeTable:
    • a dynamo table
    • indexed by space and link (CID with CAR codec)
  • carStoreBucket:

blob protocol persisted state

On the other side, we are now implementing the blob/* protocol, which is less opinionated about the bag of blocks ingested. The blob protocol therefore receives multihash bytes and returns multihash bytes, even though it will naturally need to encode the multihash internally (for instance in base64).

The blob protocol needs persisted state quite similar to that of the store protocol. To untie it from the "store" and "car" naming, we are currently using names closer to the blob protocol:

  • allocationStorage instead of storeTable
  • blobStorage instead of carStoreBucket

Note that the indexing SHOULD be quite similar, and discussing it is likely out of scope for this issue. The main point is that the index keys will now be different for the same CARs uploaded through the two protocols.

Integrate new world with old world

The main problem we want to solve here is how to make both worlds work together, or whether doing so is desirable at all.

When the store/add handler is called, the carStoreBucket is checked so that we know whether that CAR is already stored. If so, we do not need to receive the bytes. We also check whether storeTable has a mapping of the CAR link to that space. Depending on the result of these operations, we do one of the following:

  • If bytes are stored, and CAR is allocated to space, nothing is done (NOOP)
  • If bytes are stored, but CAR is not allocated to space, CAR link is allocated to the space but bytes do not need to be sent over
  • If bytes are not stored, the CAR link is allocated to space and client MUST write the bytes to the provided target.
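The three cases above can be sketched as a small decision function. This is only an illustration: `decideStoreAdd`, its parameters, and the returned shape are hypothetical names and are not taken from the store/add handler itself.

```javascript
// Hypothetical sketch of the store/add decision described above.
// `bytesStored`: result of the carStoreBucket check.
// `allocatedInSpace`: result of the storeTable check for (space, link).
function decideStoreAdd({ bytesStored, allocatedInSpace }) {
  if (bytesStored && allocatedInSpace) {
    // Bytes are stored and the CAR is allocated to the space: NOOP.
    return { allocate: false, upload: false }
  }
  if (bytesStored) {
    // Bytes are stored but not allocated: allocate, no upload needed.
    return { allocate: true, upload: false }
  }
  // Bytes are not stored: the client MUST write them to the provided target.
  // Assumption: allocation is skipped if the mapping already exists.
  return { allocate: !allocatedInSpace, upload: true }
}
```

For example, `decideStoreAdd({ bytesStored: true, allocatedInSpace: false })` yields an allocation without an upload.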

In the blob/add handler, we MUST do the same set of verifications as above. However, we MAY want to keep the two worlds decoupled, both when allocating in the user's space and when requesting bytes to be written, even for content we have already received as a CAR.

We can check whether we already received a CAR with the same bytes (in other words, we can derive the CAR CID from the multihash by creating a CID with the CAR codec). However, this also means:

  • more complex logic
  • higher cost, as we will need to perform two get/head operations on each store (one with the multihash-encoded key, and one injecting the CAR codec) to see whether we have the bytes, or whether they are allocated in the space
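To make the extra cost concrete, here is a minimal sketch of the double lookup. It assumes a binary CIDv1 layout (version `0x01`, then the CAR multicodec `0x0202` as an unsigned varint, then the multihash). The `Set`-based stores, the hex key encoding, and the function names are hypothetical stand-ins, not the real bucket or table APIs.

```javascript
// CAR multicodec 0x0202 encoded as an unsigned varint is [0x82, 0x04].
const CAR_CODE_VARINT = [0x82, 0x04]

// Build binary CIDv1 bytes: <version 1><CAR codec varint><multihash>.
function carCidBytesFromMultihash(multihash) {
  return Uint8Array.from([0x01, ...CAR_CODE_VARINT, ...multihash])
}

// Hypothetical key encoding (hex); the real stores use their own schemas.
const toHex = (bytes) =>
  Array.from(bytes, (b) => b.toString(16).padStart(2, '0')).join('')

// `blobKeys` and `carKeys` are in-memory Sets standing in for the new
// and old stores. Returns where the bytes were found and how many
// lookups it took.
function findStoredBytes(multihash, { blobKeys, carKeys }) {
  let lookups = 1
  if (blobKeys.has(toHex(multihash))) return { found: 'blob', lookups }
  lookups += 1 // second head operation, this time with the CAR codec injected
  if (carKeys.has(toHex(carCidBytesFromMultihash(multihash)))) {
    return { found: 'car', lookups }
  }
  return { found: null, lookups }
}
```

Every miss on the blob key costs a second lookup, and the same pattern would repeat for the allocation check.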

Note that this is tied to bucket lookups for now, but the same applies when looking up claims for that content.

Alternatively, we could just start from scratch with a new bucket in R2 or other write targets. This would also tie in nicely with the previous discussions that a new bucket should exist once nucleation happens, instead of having the nucleated entity billed for historical content.

I would like your opinions so we can reach a decision. cc @hannahhoward @alanshaw @Gozala @reidlw

@Gozala
Contributor

Gozala commented Mar 26, 2024

I would suggest trying an amortized migration from CAR → Blob. Specifically, I suggest the following:

  1. Check for the blob as the first path
  2. If the blob record is not present, fall back to the CAR-based check; if a record is found, update it to the blob schema so that the first check passes next time.
  3. Once we reach 0 CAR records, remove the fallback path
    • We can speed this up by running a script that adds blobs corresponding to the CARs we have
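A rough sketch of steps 1 and 2, using in-memory `Map`s as stand-ins for the real record stores. The function name, the key parameters, and the record shape are all hypothetical; whether the legacy CAR record is deleted after copying is not specified in the proposal, so this sketch leaves it in place.

```javascript
// Hypothetical sketch of the amortized CAR -> Blob migration.
// `blobRecords` and `carRecords` are Maps standing in for the new and
// legacy record stores; `blobKey` and `carKey` identify the same bytes
// under each schema.
function lookupWithMigration(blobKey, carKey, { blobRecords, carRecords }) {
  // Step 1: check for the blob record first.
  const blob = blobRecords.get(blobKey)
  if (blob) return { record: blob, migrated: false }

  // Step 2: fall back to the CAR-based check.
  const car = carRecords.get(carKey)
  if (!car) return { record: null, migrated: false }

  // Write the record under the blob schema so the first check passes
  // next time. A real migration would also convert the record's fields.
  const record = { ...car }
  blobRecords.set(blobKey, record)
  return { record, migrated: true }
}
```

After the first fallback hit for a given key, subsequent calls return from step 1 and never touch the legacy store.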

This way the extra costs will be temporary, although sadly they are incurred on every new write, which is not great. I also suspect we can manage the dynamo queries without doing one as CAR and another as blob, but I don't believe that would work for S3.

@Gozala
Contributor

Gozala commented Mar 26, 2024

Actually, now that I'm thinking about it, we probably need to move from checking whether we have the CAR/Blob in S3 to checking whether we have it in location claims, don't we? In the future we will not have it in S3 but in R2, so perhaps we should be checking the index instead. We do need to consider, however, that we may have content in S3/R2 before we have it indexed.

@vasco-santos
Contributor Author

> Actually now that I'm thinking about it we probably need to move from looking if we have CAR/Blob in S3 to looking if we have it in location claims, don't we ? Because in the future we will not have it in S3 but we will have it in R2, so perhaps we should be checking index instead. We do need to consider that we may have content in S3/R2 before we have it indexed however.

Yes, we need to look for claims, whether location claims or others. But the exact same problem happens there: the dynamo/allocation store has the same issue, and so does the claim, which would be for either the CAR CID or a CID for the multihash (TBD, we talked about raw, right?).

@alanshaw
Member

Personally I'd punt on de-duping against old data. There's already a lot more to implement here than I'd imagined, and dealing with de-duping might make the code messy and hard to follow and would leave us with dependencies on buckets we may not be using in the future. When we get to the state where we're uploading to a node on a decentralized network, de-duping will be at the level of the node you're uploading to, not some global store.

If necessary we can implement de-duping with old data at a later date.

@hannahhoward

Agree with Alan here. I'm fine with not worrying about deduping for now.

I would rather handle the migration in a script when we feel it's safe to deprecate store/add.

vasco-santos added a commit to storacha/w3infra that referenced this issue Apr 18, 2024
This PR creates stores and wires up new `upload-api` running `blob/add`,
`web3.storage/blob/allocate`, `web3.storage/blob/accept` and
`ucan/conclude` capabilities. Tests are also imported from `upload-api`
implementation and run here.

As agreed on storacha/w3up#1343, there
won't be any deduping between old world and new world. Therefore, we
have a new `allocations` table, and use a different key schema in `carpark`.
As previously discussed, we are writing blobs keyed by their `base58btc`
encoded multihash, as
`${base58btcEncodedMultihash}/${base58btcEncodedMultihash}.blob`. I
added the `.blob` suffix but I am open to other suggestions. Depending on
how we progress with the reads side, we can consider creating a new
bucket to fully isolate new content?
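
As an illustration of the key schema above, here is a minimal, dependency-free sketch. `base58btcEncode` and `blobKey` are hypothetical helper names, not functions from the PR, and the sketch assumes the key uses the plain base58btc string without a multibase prefix, which the description does not specify.

```javascript
// Bitcoin base58 alphabet (no 0, O, I, or l).
const ALPHABET = '123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz'

// Encode bytes as base58btc. Each leading zero byte becomes a leading '1'.
function base58btcEncode(bytes) {
  let zeros = 0
  while (zeros < bytes.length && bytes[zeros] === 0) zeros++
  // Interpret the bytes as one big-endian integer.
  let n = 0n
  for (const b of bytes) n = n * 256n + BigInt(b)
  let out = ''
  while (n > 0n) {
    out = ALPHABET[Number(n % 58n)] + out
    n /= 58n
  }
  return '1'.repeat(zeros) + out
}

// Build a key following `${base58btcEncodedMultihash}/${base58btcEncodedMultihash}.blob`.
// Assumption: no multibase prefix is added to the encoded multihash.
function blobKey(multihash) {
  const encoded = base58btcEncode(multihash)
  return `${encoded}/${encoded}.blob`
}
```

Because the key is derived from the multihash alone, it differs from any CAR CID based key for the same bytes, which is consistent with the decision not to dedupe across the two worlds.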

The `receipts` and `tasks` storage end up being more complicated as they
need to follow
https://github.com/web3-storage/w3infra/blob/main/docs/ucan-invocation-stream.md#buckets,
and this is essentially the same as what happens on
https://github.com/web3-storage/w3infra/blob/main/upload-api/ucan-invocation.js#L66
but at a different level as this is a proactive write of tasks and
receipts.