
Proposal: Chunked upload #620

Closed
Gozala opened this issue Oct 15, 2021 · 9 comments
Labels
kind/enhancement (A net-new feature or improvement to an existing feature), pi/upload-v2, stack/api-protocols, topic/upload (related to uploads)

Comments

@Gozala
Contributor

Gozala commented Oct 15, 2021

I would like to propose a chunked upload feature to work around the upload size limitation. The idea is to provide a new API endpoint

PUT /${sha256(content)}

supplied with Content-Range headers to provide pwrite-like functionality. The widely available sha256 can be used to identify the upload.

After all chunks are written, the user sends a request with a 0-0 range to flush, which would basically perform the equivalent of POST /upload but use the written bytes as the body.

This API endpoint is supposed to be agnostic of the MIME type; in other words, you could upload CAR-formatted DAGs or arbitrary files. Behavior would be the same as with POST /upload.
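To make the proposed shape concrete, here is a minimal client-side sketch of that flow. Only the `PUT /${sha256(content)}` + `Content-Range` + 0-0 flush mechanics come from the proposal itself; the base URL, auth header and chunk size are assumptions for illustration.

```ts
// Minimal sketch of the proposed flow: PUT /${sha256(content)} with a
// Content-Range per chunk, then a 0-0 range to flush.
// Assumes fetch and WebCrypto are available (browser or Node 18+).

const API = 'https://api.web3.storage' // hypothetical base URL
const CHUNK_SIZE = 10 * 1024 * 1024 // arbitrary 10 MiB chunks

async function sha256Hex(bytes: Uint8Array): Promise<string> {
  const digest = await crypto.subtle.digest('SHA-256', bytes)
  return [...new Uint8Array(digest)].map(b => b.toString(16).padStart(2, '0')).join('')
}

async function chunkedUpload(content: Uint8Array, token: string): Promise<Response> {
  const hash = await sha256Hex(content)

  // Write each chunk at its offset, pwrite style.
  for (let offset = 0; offset < content.length; offset += CHUNK_SIZE) {
    const chunk = content.subarray(offset, Math.min(offset + CHUNK_SIZE, content.length))
    await fetch(`${API}/${hash}`, {
      method: 'PUT',
      headers: {
        Authorization: `Bearer ${token}`,
        'Content-Range': `bytes ${offset}-${offset + chunk.length - 1}/${content.length}`
      },
      body: chunk
    })
  }

  // A 0-0 range signals a flush: the service checks the hash and performs the
  // equivalent of POST /upload over the written bytes.
  return fetch(`${API}/${hash}`, {
    method: 'PUT',
    headers: {
      Authorization: `Bearer ${token}`,
      'Content-Range': `bytes 0-0/${content.length}`
    }
  })
}
```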

Sentiments

  1. What if the user never finishes the upload?
    • We can garbage collect puts that have not been flushed for X hours.
  2. Can this be used to overcome upload quotas?
    • web3.storage could count uploaded chunks towards the upload quota.
  3. What happens when chunks are flushed?
    • They get garbage collected and subtracted from the upload quota.
    • 💔 This will not support the use case where you write, flush, then overwrite bytes and flush again. That is OK, we're not building another fs.
  4. Why sha256 as opposed to PUT /${cid}?
    1. Users may not know what the CID will be, just like they don't when uploading a file.
    2. Users may chunk differently and get a different CID.
    3. Avoids the question of what the CID should be when uploading a CAR: the DAG root or the CAR file CID?
  5. Why not multihash instead of sha256?
    • multihash is cool, but we may not have the hashing function used by the user on the backend.
    • We could potentially support multihash and reject hashing algorithms we don't support, but I'm not sure it would provide real value here.
    • Alternatively, we could support multihash with just sha256; this would leave some room for extension in the future.
  6. Doesn't carbites already solve this?
    1. It does, but it requires codecs to do a tree walk, and the uploader may not have those.
    2. Some people write in languages we don't have carbites for.
    3. This can support the CAR format or whatever else we may use in the future.
  7. What if the user failed or forgot to upload some ranges?
    • Flush will fail if the hash of the content does not match the claimed sha256(content) (see the sketch below).
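For that last point, a rough sketch of what the flush check could look like on the service side; the `ChunkStore` interface is a hypothetical stand-in for whatever intermediate storage ends up holding the written ranges.

```ts
import { createHash } from 'node:crypto'

// Hypothetical intermediate store holding the chunks written so far for a given upload.
interface ChunkStore {
  // Yields the written bytes for `uploadHash` in offset order.
  readChunks(uploadHash: string): AsyncIterable<Uint8Array>
}

// On a 0-0 "flush" request: recompute sha256 over the assembled bytes and
// reject the flush if it does not match the hash the upload was addressed by.
async function verifyFlush(store: ChunkStore, uploadHash: string): Promise<void> {
  const hash = createHash('sha256')
  for await (const chunk of store.readChunks(uploadHash)) {
    hash.update(chunk)
  }
  if (hash.digest('hex') !== uploadHash) {
    throw new Error('flush rejected: content hash does not match claimed sha256(content)')
  }
  // ...otherwise hand the assembled bytes to the existing POST /upload pipeline.
}
```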
@Gozala Gozala added kind/enhancement (A net-new feature or improvement to an existing feature) and need/triage (Needs initial labeling and prioritization) labels Oct 15, 2021
@alanshaw
Contributor

alanshaw commented Oct 18, 2021

I'm kinda down with this but I'm not convinced yet that the effort is justified and I have implementation questions!

I can see that this solves:

  1. Leaky abstraction problem.
    • Yes, it's a pain, but like I said before if you're encoding data with a non-default encoder then you likely have the decoder to hand so it's probably not blocking.
    • Right now, we cannot even store data in IPFS that is not pb, cbor or raw. IPFS will error importing it because it decodes the blocks. We should lobby for this to be fixed.
    • Could we instead have the client set an X-Root-Cid header on each request to allow use of the "simple" carbites strategy, which requires no decoders? (A rough sketch of this follows the list below.)
  2. Enables folks with big data to upload files without uploading DAGs.
    • One thing I like about the current setup is that it encourages people to move to content addressing. You can use simple fetch for small data initially and as your data requirements and understanding of IPFS grows you upgrade to uploading DAGs.
    • We should be doing all we can to encourage transferring DAGs and not plain files. This is the opposite of that.
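For reference, a rough sketch of what that alternative could look like from the client side; the `X-Root-Cid` header is the one suggested above, while the endpoint, content type and how the CAR gets split into parts are assumptions left out of scope.

```ts
// Sketch of the alternative suggested above: every request carries the root CID
// in an X-Root-Cid header, so the service never needs to decode blocks. The
// endpoint, content type and CAR-splitting strategy are assumptions.
async function uploadCarParts(parts: AsyncIterable<Uint8Array>, rootCid: string, token: string) {
  for await (const part of parts) {
    await fetch('https://api.web3.storage/car', {
      method: 'POST',
      headers: {
        Authorization: `Bearer ${token}`,
        'Content-Type': 'application/vnd.ipld.car',
        'X-Root-Cid': rootCid
      },
      body: part
    })
  }
}
```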

Implementation questions:

  • Where do the chunks get stored before the final flush?
  • I'm not clear on how we re-assemble the chunks and import into IPFS on flush given the Cloudflare workers environment and a 30s execution time.
  • User has to wait for transfer from intermediate store to IPFS on flush - their upload time is doubled no?
  • How does this work with a directory of files? Can you sha256 a directory? Do we get people to tar stuff and then extract it?

Devil's advocate:

  • We're primarily targeting the web and JS; is it less work to invest in building carbites for other languages if/when the need arises?

@Gozala
Contributor Author

Gozala commented Oct 18, 2021

Yes, it's a pain, but like I said before if you're encoding data with a non-default encoder then you likely have the decoder to hand so it's probably not blocking.

That is true only if all of your application stack is in JS (and specifically uses the new IPLD stack).

Right now, we cannot even store data in IPFS that is not pb, cbor or raw. IPFS will error importing it because it decodes the blocks. We should lobby for this to be fixed.

We should absolutely fix that. Doesn't that affect carbites just the same, though?

Could we instead have the client set an X-Root-Cid header on each request to allow use of the "simple" carbites strategy which requires no decoders?

We could. I do not think that would be better though, not for the users. The beauty of the proposed API is that it allows chunked uploading of bytes regardless of what they represent. Having to provide CIDs would imply:

  1. The user having to know what the content is (is it a DAG, or maybe an image).
  2. Uncertainty about how the user chunked it (not applicable for DAGs).

Avoiding CID here avoids all that.

One thing I like about the current setup is that it encourages people to move to content addressing. You can use simple fetch for small data initially and as your data requirements and understanding of IPFS grows you upgrade to uploading DAGs.

Using sha256 is content addressing; sure, it's not all the way to IPFS and DAGs, but on the flip side it is a lot more accessible. Better yet, it composes with DAG encoding: our client API can just take a CAR writer and do the rest (hash it and chunk-upload it).
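As a sketch of that composition (everything hypothetical except the @ipld/car `CarWriter` API): serialize the DAG to a CAR, hash the bytes, and push them through the same sha256-addressed chunked upload as any other file. `chunkedUpload` refers to the hypothetical client from the earlier sketch.

```ts
import { CarWriter } from '@ipld/car'
import type { CID } from 'multiformats/cid'

// From the earlier sketch: hypothetical sha256-addressed chunked upload client.
declare function chunkedUpload(content: Uint8Array, token: string): Promise<Response>

// Serialize a DAG to a CAR with @ipld/car, then upload the bytes exactly like
// any other file. Buffering the whole CAR keeps the sketch short; a real client
// would stream and hash incrementally.
async function uploadDagAsCar(
  root: CID,
  blocks: AsyncIterable<{ cid: CID; bytes: Uint8Array }>,
  token: string
): Promise<Response> {
  const { writer, out } = CarWriter.create([root])

  const writing = (async () => {
    for await (const block of blocks) await writer.put(block)
    await writer.close()
  })()

  const parts: Uint8Array[] = []
  for await (const part of out) parts.push(part)
  await writing

  // Concatenate the CAR bytes so they can be hashed and chunk-uploaded.
  const car = new Uint8Array(parts.reduce((n, p) => n + p.length, 0))
  let offset = 0
  for (const part of parts) { car.set(part, offset); offset += part.length }

  return chunkedUpload(car, token)
}
```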

We should be doing all we can to encourage transferring DAGs and not plain files. This is the opposite of that.

I am not disagreeing, yet I would like to ask: why should we?

This is not the opposite of encouraging transferring DAGs (it supports DAGs just as well); it is rather meeting users where they are. If they can give us a DAG, that is great, but if they don't have one we can still take their file and turn it into a DAG on the server, without making our Cloudflare limits their problem.

Where do the chunks get stored before the final flush?

I don't know, it's an implementation detail. It could be S3, or maybe a Postgres binary file store.

  • I'm not clear on how we re-assemble the chunks and import into IPFS on flush given the Cloudflare workers environment and a 30s execution time.

This needs exploring as well, but I imagine we could more or less pipe chunks into /ipfs/add (for CARs it would be /dag/import).

If 30s is a problem, we can consider an alternative strategy of doing chunking on write (which would have to vary between CARs and other files) so that on flush we just have to assemble.
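A rough sketch of that flush-time hand-off, assuming the go-ipfs HTTP API (`/api/v0/add`, `/api/v0/dag/import`) and a hypothetical chunk store; whether this fits within a worker's execution limits is exactly the open question.

```ts
// Hypothetical intermediate store, same shape as in the flush sketch above.
interface ChunkStore {
  readChunks(uploadHash: string): AsyncIterable<Uint8Array>
}

// Assemble the written bytes on flush and hand them to an IPFS node over its
// HTTP API: /api/v0/dag/import for CAR payloads, /api/v0/add for plain files.
// Buffering everything is only to keep the sketch short.
async function importOnFlush(store: ChunkStore, uploadHash: string, isCar: boolean, ipfsApi: string) {
  const parts: Uint8Array[] = []
  for await (const chunk of store.readChunks(uploadHash)) parts.push(chunk)

  const body = new FormData()
  body.append('file', new Blob(parts))

  const endpoint = isCar ? '/api/v0/dag/import' : '/api/v0/add?cid-version=1'
  const res = await fetch(`${ipfsApi}${endpoint}`, { method: 'POST', body })
  if (!res.ok) throw new Error(`import failed: ${res.status}`)
  return res.text() // newline-delimited JSON containing the resulting root CID(s)
}
```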

  • User has to wait for transfer from intermediate store to IPFS on flush - their upload time is doubled no?

Depends 🤷‍♂️? If we have an S3 backup, maybe they don't need to wait for anything; we can just give back the CID and handle the actual pinning asynchronously.

It might also be good to prefer uploading DAGs when possible, but when that is not an option, this is better than not being able to upload at all.

  • How does this work with a directory of files? Can you sha256 a directory? Do we get people to tar stuff and then extract it?

I have to admit I have not considered directories; maybe they should just be out of scope. If we really want to support directories here as well, we could ask for the sha256 of a form-data encoding of the directory, but I find that less compelling because it doesn't really meet users where they are in this case.

@mikeal

mikeal commented Oct 19, 2021

it’s definitely better/easier for us when they just upload CAR files, assuming they are using typical codecs. we wouldn’t want to change anything we’ve already done, but i could see us adding this feature in 2022.

This API endpoint is supposed to be agnostic of the MIME type; in other words, you could upload CAR-formatted DAGs or arbitrary files. Behavior would be the same as with POST /upload.

i would want to have a separate endpoint, or very explicit querystring params, for regular files and CAR files:

  • we’ve seen people encode CAR files into unixfs on purpose and i don’t see an easy way to support that if we’re switching the behavior between file encoding and CAR uploading using mimetype
  • we’ll want to accept some unixfs settings as params when encoding files, but obviously not when accepting CAR files
  • i could see us adding features in the future for tar and tar.gz that decompress the archive formats and encode the entire directory structure.

@atopal atopal removed the need/triage (Needs initial labeling and prioritization) label Oct 22, 2021
@dchoi27
Contributor

dchoi27 commented Jan 10, 2022

@Gozala do we want to keep this open, or merge the discussion into #980 and #837?

@Gozala
Contributor Author

Gozala commented Jan 26, 2022

We have been discussing the .storage API v2 and specifically a need for some sort of session identifier to do multiple uploads, which is probably what will supersede this. This is probably a good place to discuss what that may look like, so I'll do that in the following comment.

@Gozala
Contributor Author

Gozala commented Jan 26, 2022

So the revived idea to allow chunked uploads is the following:

  1. Define a new API endpoint for uploads, e.g. PATCH /.
  2. The payload for that endpoint would be some (yet to be determined) encoding of the following structure:
     interface Transaction {
        data: CarFile // CAR file containing a set of blocks
        instructions: Instruction[]
     }
    
     type Instruction = {
         name: PublicKey // the w3name / IPNS key to update
         value: CID      // the CID the name should point to
         seqno?: number  // sequence number, as in an IPNS record
         ttl?: number    // TTL, as in an IPNS record
     }
  3. The payload will contain UCAN(s) with the necessary capabilities to perform the submitted transaction (which would include signatures from the private keys).

That way the client could (a rough sketch follows this list):

  1. Generate a "session keypair".
  2. Upload chunks of content encoded as CARs and update the session (which is a w3name / IPNS key) to point to a DAG referencing all the DAGs uploaded so far (for the given session).
  3. Have the last upload update the session (the w3name / IPNS key) to point to the complete DAG representing the content.
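A very rough sketch of that client flow; apart from the general shape (a PATCH / carrying a CAR plus name-update instructions signed by the session key), every helper and encoding below is a placeholder for things that are yet to be specified.

```ts
import type { CID } from 'multiformats/cid'

// Placeholders for pieces that are yet to be specified.
interface SessionKey { name: string; sign(bytes: Uint8Array): Promise<Uint8Array> }
declare function generateSessionKey(): Promise<SessionKey> // hypothetical: keypair doubles as the w3name / IPNS name
declare function carForPart(part: Uint8Array, previousPartRoots: CID[]): Promise<{
  car: Uint8Array    // blocks for this part plus an updated manifest block
  partRoot: CID      // root of this part's sub-DAG
  manifestRoot: CID  // root of a DAG referencing every part uploaded so far
}> // hypothetical
declare function encodeTransaction(
  tx: { data: Uint8Array; instructions: { name: string; value: CID; seqno: number }[] },
  key: SessionKey
): Promise<Uint8Array> // hypothetical UCAN-signed payload encoding

async function uploadInSession(parts: Uint8Array[], endpoint = 'https://api.web3.storage/') {
  const key = await generateSessionKey() // 1. session keypair
  const partRoots: CID[] = []
  let manifestRoot: CID | undefined

  for (const [i, part] of parts.entries()) {
    const encoded = await carForPart(part, partRoots)
    partRoots.push(encoded.partRoot)
    manifestRoot = encoded.manifestRoot

    // 2./3. every PATCH uploads a chunk and repoints the session name; after the
    // last one the name points at the DAG for the complete content.
    const body = await encodeTransaction(
      { data: encoded.car, instructions: [{ name: key.name, value: manifestRoot, seqno: i }] },
      key
    )
    await fetch(endpoint, { method: 'PATCH', body })
  }
  return manifestRoot // root of the complete DAG
}
```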

@mikeal

mikeal commented Jan 26, 2022

love it.

so all the state is managed by the client updating an IPNS key, but we have a single endpoint for users to transactionally upload the data and update the state of the IPNS key

@mikeal

mikeal commented Jan 26, 2022

would we be able to just put the entire signed IPNS record in the UCAN they send for the upload?

@elizabeth-griffiths
Member

We're closing this issue. If you still need help please open a new issue.
