feat: buffered writer #70

Merged Mar 31, 2022 · 19 commits

Changes from 2 commits
api.ts: 24 changes (19 additions & 5 deletions)
@@ -1,17 +1,18 @@
-import { CID } from 'multiformats/cid'
+import { CID } from "multiformats/cid"
 
 export type { CID }
+/* Generic types for interfacing with block storage */
 
-export type Block = { cid: CID, bytes: Uint8Array }
+export type Block = { cid: CID; bytes: Uint8Array }
 
 export type BlockHeader = {
-  cid: CID,
-  length: number,
+  cid: CID
+  length: number
   blockLength: number
 }
 
 export type BlockIndex = BlockHeader & {
-  offset: number,
+  offset: number
   blockOffset: number
 }

@@ -36,6 +37,19 @@ export interface BlockWriter
   close(): Promise<void>
 }
 
+export interface CarBufferWriter {
+  write(block: Block): void
+  close(): Uint8Array
+}
+
+export interface CarBufferWriterOptions {
+  roots?: CID[] // defaults to []
+  byteOffset?: number // defaults to 0
+  byteLength?: number // defaults to buffer.byteLength
+
+  headerCapacity?: number // defaults to size needed for provided roots
+}
+
 export interface WriterChannel {
   writer: BlockWriter
   out: AsyncIterable<Uint8Array>
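For orientation, the intended call pattern for the new interface looks roughly like this (a hypothetical sketch against the createWriter factory defined in lib/buffer-writer.js below, not part of the diff):

import { CID } from 'multiformats/cid'
import { createWriter } from './lib/buffer-writer.js'

// bafkqaaia is the identity CID of the empty byte string (it also shows up
// in the review thread below); it keeps the sketch self-contained
const root = CID.parse('bafkqaaia')

// Options fall back to the defaults noted above; headerCapacity is
// estimated from the provided roots when omitted
const writer = createWriter(new ArrayBuffer(1024), { roots: [root] })
writer.write({ cid: root, bytes: new Uint8Array(0) })
const car = writer.close() // Uint8Array containing the finished CARv1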
lib/buffer-writer.js: 148 changes (148 additions & 0 deletions)
@@ -0,0 +1,148 @@
import varint from 'varint'
import * as CBOR from '@ipld/dag-cbor'

/**
 * @typedef {import('../api').CID} CID
 * @typedef {import('../api').Block} Block
 * @typedef {import('../api').CarBufferWriter} Writer
 * @typedef {import('../api').CarBufferWriterOptions} Options
 * @typedef {import('./coding').CarEncoder} CarEncoder
 */

export class CarBufferWriter {
  /**
   * @param {Uint8Array} bytes
   * @param {number} byteOffset - Offset at which the first block will be
   * written; everything before it is reserved for the header.
   * @param {CID[]} roots
   */
  constructor (bytes, byteOffset, roots = []) {
    this.bytes = bytes
    /** @private */
    this.byteOffset = byteOffset
    /** @private */
    this.roots = roots
    /** @private - total number of bytes reserved for the header */
    this.headerCapacity = byteOffset
    /** @private - estimated header size for the roots registered so far */
    this.headerSize = estimateHeaderCapacity(roots.length, totalByteLength(roots))
  }

  /**
   * @param {CID} root
   */
  addRoot (root) {
    // The root's CID bytes plus the per-root CBOR overhead
    const byteLength = root.bytes.byteLength + ROOT_EXTRA_SIZE
    if (this.headerSize + byteLength > this.headerCapacity) {
      throw new RangeError('Root will not fit')
    }
    this.roots.push(root)
    this.headerSize += byteLength
  }

  /**
   * Write a `Block` (a `{ cid:CID, bytes:Uint8Array }` pair) to the archive.
   * Throws if there is not enough capacity.
   *
   * @param {Block} block A `{ cid:CID, bytes:Uint8Array }` pair.
   */
  write ({ cid, bytes }) {
    // Each section is varint(cid length + block length) | cid | block bytes
    const size = varint.encode(cid.bytes.length + bytes.length)
    const byteLength = size.length + cid.bytes.byteLength + bytes.byteLength
    if (this.byteOffset + byteLength > this.bytes.byteLength) {
      throw new RangeError('Buffer overflow')
    } else {
      this.bytes.set(size, this.byteOffset)
      this.byteOffset += size.length

      this.bytes.set(cid.bytes, this.byteOffset)
      this.byteOffset += cid.bytes.byteLength

      this.bytes.set(bytes, this.byteOffset)
      this.byteOffset += bytes.byteLength
    }
  }

  close () {
    const { roots } = this
    const headerBytes = CBOR.encode({ version: 1, roots })
    const varintBytes = varint.encode(headerBytes.length)

    const headerByteLength = varintBytes.length + headerBytes.byteLength
    // Spare capacity once the actual header size is known; negative means
    // the header needs more room than was reserved
    const offset = this.headerCapacity - headerByteLength

    if (offset >= 0) {
      // Header fits: write it flush against the first block
      this.bytes.set(varintBytes, offset)
      this.bytes.set(headerBytes, offset + varintBytes.length)

      return this.bytes.subarray(offset, this.byteOffset)
    } else if (this.bytes.byteLength + offset - this.byteOffset >= 0) {
      // Header does not fit, but the buffer has spare room at the end:
      // slide the blocks right by -offset bytes to make space
      this.bytes.set(
        this.bytes.subarray(this.headerCapacity, this.byteOffset),
        headerByteLength
      )
      this.byteOffset -= offset
      this.bytes.set(varintBytes, 0)
      this.bytes.set(headerBytes, varintBytes.length)
      return this.bytes.subarray(0, this.byteOffset)
    } else {
      throw new RangeError('Header does not fit')
    }
  }
}
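Because write() throws on overflow, callers sizing a buffer up front may want to know how many bytes a block will occupy. Mirroring the arithmetic in write() above (an illustrative helper, not part of the PR):

const blockByteLength = ({ cid, bytes }) =>
  varint.encodingLength(cid.bytes.length + bytes.length) +
  cid.bytes.length +
  bytes.length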

// Number of bytes required for an empty header: the CBOR encoding of
// { version: 1, roots: [] } is a map head (1) + "version" (8) + 1 (1) +
// "roots" (6) + an empty array head (1)
const EMPTY_HEADER_SIZE = 17
// Number of bytes used for a CIDv1 with a sha-256 digest
const DEFAULT_CID_SIZE = 36
// Number of bytes added per root: tag 42 (2) + byte-string head (2) +
// the multibase identity prefix (1); exact only for common CID sizes,
// see the review thread below
const ROOT_EXTRA_SIZE = 5
Gozala marked this conversation as resolved.

/**
 * @param {number} count - Number of roots
 * @param {number} [capacity] - Total byteLength of all roots
 */
export const estimateHeaderCapacity = (
  count,
  capacity = count * DEFAULT_CID_SIZE
) => {
  // Number of bytes added per root
  const rootsExtra = count * ROOT_EXTRA_SIZE
Member:

This'll be out when you go above 24 roots, and then again at 256: CBOR will be adding extra bytes to fit in the array length.

You could fix it by adding 1 byte for >=24 and another byte for >=256. There's another boundary at 65,536 where you'd need to add 2 more bytes, but maybe by that time you should be doing a throw.

There's also a difference in encoding for different sizes of CIDs at the same boundaries: a tiny CID under 24 bytes is going to be 1 byte more compact, and a huge one over 256 is going to add another byte. These are going to be uncommon, but not out of the question. Unfortunately your arguments won't account for that; maybe the documentation just needs to make it clear that this estimate is only good for sensible defaults.

It seems that the cost of a bad estimate is the huge content shuffle in close(), which could be invisibly very expensive. Documenting that would be good too.

So, some solid tests for this function with various CIDs might be good!

Member:

I added an EncodedLength() calculator to the Go dag-cbor, which lets you figure out exactly how many bytes an object will take up once encoded. We could do the same for JS because it's not too hard, but it may not end up being much cheaper than just doing a dag-cbor encode of a faked object and checking the byte count.

Contributor Author:

> This'll be out when you go above 24 roots, and then again at 256: CBOR will be adding extra bytes to fit in the array length.
>
> You could fix it by adding 1 byte for >=24 and another byte for >=256. There's another boundary at 65,536 where you'd need to add 2 more bytes, but maybe by that time you should be doing a throw.

I ended up publishing before I did a last push, so I'm not sure whether this comment predates the fix I added to account for the varintSize. If so, I assume that last change should address this, or is it something else?

> There's also a difference in encoding for different sizes of CIDs at the same boundaries: a tiny CID under 24 bytes is going to be 1 byte more compact, and a huge one over 256 is going to add another byte. These are going to be uncommon, but not out of the question. Unfortunately your arguments won't account for that; maybe the documentation just needs to make it clear that this estimate is only good for sensible defaults.

Yeah, I'm not sure there is a good way to account for all of that. The only thing I have considered is overestimating a bit, which I think is better than underestimating. Alternatively, we could return a range and let the user decide which one to go with.

> It seems that the cost of a bad estimate is the huge content shuffle in close(), which could be invisibly very expensive. Documenting that would be good too.

I do not think that overhead is huge; as far as I remember, browsers optimize the case of moving bytes within the same buffer.

> So, some solid tests for this function with various CIDs might be good!

> I added an EncodedLength() calculator to the Go dag-cbor, which lets you figure out exactly how many bytes an object will take up once encoded. We could do the same for JS because it's not too hard, but it may not end up being much cheaper than just doing a dag-cbor encode of a faked object and checking the byte count.

Yeah, I was thinking about that as well, but decided it was not worth the effort. However, I think it would be great to have one.

Member:

> If so, I assume that last change should address this, or is it something else?

Something else; this is all about CBOR encoding compression. Arrays have their length included at the start: if the length is <24, that number is included in the single byte at the start that says "this is an array". Between 24 and 256 it's in its own byte, so the array takes up 2 bytes to get started; 256 to 65,536 takes up another byte, etc. The same rule applies for byte arrays, so CIDs have this property too. A CID that's less than 24 bytes long will have one less byte in the prelude, and a CID greater than 256 bytes long will have an extra one. This is why a size estimator would be handy, because the rules are specific and get a bit weird and surprising in some edges. But the rules are fixed, so if you lock them down then you're set.

Unfortunately, if you start down this road then you're not far off making a full estimator for arbitrary objects!

So maybe just get this right for the obvious cases without easy failure modes, and we can get a proper estimator into the dag-cbor codec to use later.

  const headerSize = capacity + rootsExtra + EMPTY_HEADER_SIZE
  const varintSize = varint.encodingLength(headerSize)
  return varintSize + headerSize
}
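For illustration, the boundary rule discussed in the thread above can be written down directly (a sketch with a name of my choosing, not part of the PR):

// Size of a CBOR head (major-type byte plus length argument): lengths < 24
// fit in the initial byte, <= 0xff add 1 byte, <= 0xffff add 2,
// <= 0xffffffff add 4, and anything larger adds 8
const cborHeadSize = length =>
  length < 24
    ? 1
    : length < 0x100
      ? 2
      : length < 0x10000
        ? 3
        : length < 0x100000000
          ? 5
          : 9

This is why the fixed ROOT_EXTRA_SIZE of 5 only holds for the common case: it assumes a 2-byte head for each root's byte string, which is correct for CIDs that encode to between 24 and 255 bytes.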

/**
 * @param {CID[]} cids
 */
const totalByteLength = cids => {
Gozala marked this conversation as resolved.
  let total = 0
  for (const cid of cids) {
    total += cid.bytes.byteLength
Member:
We could fix this too now that you have the nice arrayLengthEncodeSize(), which will work for CIDs too. I think if you subtract 2 from ROOT_EXTRA_SIZE (the byte array is >24 and <256, so it requires a two-byte CBOR prelude, <bytes><length>; the extra bytes in ROOT_EXTRA_SIZE are to do with tags, mainly) and then add it back by adding arrayLengthEncodeSize(cid.length) here, you get a more accurate CID length. The problem with doing that is that your API allows for an optional rootsByteLength, but that could probably be fixed by adding those 2 to DEFAULT_CID_SIZE.

If you add bafkqaaia and bafkqbbacaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa to your tests then you should see the impact of the bad sizes here.

Identity CIDs are the most likely way we're going to end up here; it would be silly to use one as a root, but it's allowed, and it's not out of the question that someone finds a use-case for it.

  }
  return total
}
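As the thread suggests, the estimate can also be cross-checked exactly by encoding a throwaway header and measuring it. A sketch using the CBOR and varint imports already in this module (the helper name is mine):

const exactHeaderByteLength = roots => {
  const headerBytes = CBOR.encode({ version: 1, roots })
  return varint.encodingLength(headerBytes.length) + headerBytes.length
}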

/**
 * @param {ArrayBuffer} buffer
 * @param {Options} [options]
Gozala marked this conversation as resolved.
 * @returns {Writer}
 */
export const createWriter = (
  buffer,
  {
    roots = [],
    byteOffset = 0,
    byteLength = buffer.byteLength,
    headerCapacity = estimateHeaderCapacity(
      roots.length,
      totalByteLength(roots)
    )
  } = {}
) => {
  const bytes = new Uint8Array(buffer, byteOffset, byteLength)

  // The roots given here are already covered by the estimated header
  // capacity, so they go to the constructor rather than through addRoot(),
  // which would reserve space for them a second time
  return new CarBufferWriter(bytes, headerCapacity, roots)
}
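End to end, a buffer-backed write might look like this (a hedged sketch: the multiformats helpers shown are standard, but exact import paths may differ; assumes an ESM module with top-level await):

import { CID } from 'multiformats/cid'
import * as raw from 'multiformats/codecs/raw'
import { sha256 } from 'multiformats/hashes/sha2'
import { createWriter } from './lib/buffer-writer.js'

const bytes = new TextEncoder().encode('hello world')
const cid = CID.create(1, raw.code, await sha256.digest(bytes))

// Reserve header capacity for one root, write the block, then close to get
// the finished archive: varint | CBOR header | blocks
const writer = createWriter(new ArrayBuffer(1024), { roots: [cid] })
writer.write({ cid, bytes })
const car = writer.close()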