feat: store pins in datastore instead of a DAG #2771
Ad hoc testing script. Add a buffer without pinning it, time how long it takes to pin it. Store the times and work out the average time taken every 1,000 pins:

```javascript
'use strict'

const last = require('it-last')
const drain = require('it-drain')
const { createController } = require('ipfsd-ctl')

async function main () {
  const ipfs = (await createController({
    type: 'go',
    ipfsBin: require('go-ipfs-dep').path(),
    ipfsHttpModule: require('ipfs-http-client'),
    disposable: false
  }))

  await ipfs.init()
  await ipfs.start()

  let times = []
  let chunk = 0

  for (let i = 0; i < 83000; i++) {
    const buf = Buffer.from(`${Math.random()}`)

    // add the block without pinning it so only pin.add is timed
    const result = await last(ipfs.api.add(buf, {
      pin: false
    }))

    const start = Date.now()
    const res = await ipfs.api.pin.add(result.cid)

    // some versions of the API return an async iterable from pin.add;
    // drain it so the whole pin operation is included in the timing
    if (res[Symbol.asyncIterator]) {
      await drain(res)
    }

    const mem = process.memoryUsage()

    times.push({
      ...mem,
      elapsed: Date.now() - start
    })

    chunk++

    if (chunk === 1000) {
      // print the averages for this chunk as CSV, then reset
      const sum = times.reduce((acc, curr) => {
        acc.elapsed += curr.elapsed
        acc.rss += curr.rss
        acc.heapTotal += curr.heapTotal
        acc.heapUsed += curr.heapUsed
        acc.external += curr.external
        return acc
      }, { elapsed: 0, rss: 0, heapTotal: 0, heapUsed: 0, external: 0 })

      console.info(`${i + 1}, ${sum.elapsed / times.length}, ${sum.rss / times.length}, ${sum.heapTotal / times.length}, ${sum.heapUsed / times.length}, ${sum.external / times.length}`)

      chunk = 0
      times = []
    }
  }

  await ipfs.stop()
}

main()
```

Results: 10k pins, DAG vs datastore, ranges from a 20-300x speedup in the time taken to add a single pin. After 100k pins there doesn't seem to be much performance degradation when storing in the datastore, whereas the DAG method degrades significantly after 8192 pins (see #2197 for discussion of that). The next significant performance jump vs DAGs would probably be after the first layer of buckets is full, e.g. 256 buckets of 8192 pins = 2,097,152 pins. That'll probably take a bit of time to benchmark...
Next steps:
That's a very cool speed improvement! Some observations:
I guess you could store only the CID version/codec in the pin? I was thinking of changing the pin type to be an integer too, so there are definitely some improvements that can be made, this is just a first pass.
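For illustration, a minimal sketch of what collapsing the string `type` into an integer might look like, along the lines of the `depth` field the updated PR description below ends up using; the helper names here are hypothetical, not part of this PR:

```javascript
// Hypothetical helpers: map the string pin type to/from a numeric depth,
// as in the pin record format described later (0 = direct pin,
// Infinity = recursive pin, 1+ = pinned to a fixed depth).
function typeToDepth (type) {
  return type === 'recursive' ? Infinity : 0
}

function depthToType (depth) {
  if (depth === Infinity) return 'recursive'
  if (depth === 0) return 'direct'
  return `depth-${depth}`
}

console.log(typeToDepth('recursive')) // Infinity
console.log(depthToType(0)) // 'direct'
```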
@Stebalien has talked about making a similar change to this too, so it's only slightly ahead of the go-ipfs repo. At any rate, go-ipfs is switching to badger by default, which js-ipfs can't read, so I'm not sure how much of a priority that is any more.
I guess you can't share your entire list of pins by sharing one CID, but also now you don't have to share your entire list of pins, you can share individual ones. Grouping multiple pins as pinsets could be added back in as a new feature, and the human-readable names would make this nicer to work with. Something like:

```console
$ ipfs pin add Qmfoo
pinned Qmfoo recursively
$ ipfs pin-set add my-super-fun-pinset Qmfoo
$ ipfs pin-set list my-super-fun-pinset
my-super-fun-pinset Qmqux
Qmfoo
Qmbar
Qmbaz
```

You could even have the root of a pinset be an IPNS name to allow pulling updates from the network. That'd be neat.
This can be fixed :).
At the moment, I think this is causing strictly more harm than help. It's been 6 years and I have yet to see someone use this. Ideally, everything would be stored in an IPLD-backed graph database. However, we aren't there yet in terms of tooling. We could get part way there by creating an IPLD-backed datastore (datastore -> IPLD HAMT -> datastore) but that will throw away the type information.
Any reason to store the CID?
Base64url? go-ipfs, at least, now has hyper-optimized base58. However, it's still slower than base64 (and takes more space).
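For a rough sense of the trade-off, a sketch comparing encoded key lengths, assuming the `multibase` package; the multihash here is fabricated by hand from a sha2-256 digest just for the comparison:

```javascript
'use strict'

const multibase = require('multibase')
const crypto = require('crypto')

// build a sha2-256 multihash by hand: 0x12 = sha2-256, 0x20 = 32-byte digest
const digest = crypto.createHash('sha256').update('hello').digest()
const multihash = Buffer.concat([Buffer.from([0x12, 0x20]), digest])

for (const name of ['base58btc', 'base64url', 'base32']) {
  const encoded = Buffer.from(multibase.encode(name, multihash))
  console.log(name, encoded.length, encoded.toString())
}

// base64url and base32 re-pack bits at a fixed radix, so they encode in a
// single pass; base58 needs repeated bignum-style division, which is why
// it stays slower even when hyper-optimized
```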
Force-pushed from 43316eb to 80986f9
Some more graphs. I pinned 83k single blocks using the test script above (originally intended to be 100k but the js-dag benchmark took too long to run and I had to get on an aeroplane). The initial hump at 8192 pins is there, then a consistent performance degradation over time. At 83k pins, js is taking 2.5s to add a pin. Go has the same degradation but it is significantly less pronounced. The js-dag implementation stores the pinsets in memory, js-datastore does not. There is an increase in memory usage over time but it may not be hitting the v8 GC threshold, or there's a leak somewhere...
The sizes appear to be comparable, or perhaps the differences are statistically insignificant compared to the block size. After completing the benchmark and running `repo gc` I see:

```console
# js-dag
.jsipfs $ du -hs
367M    .

# go-dag
.ipfs $ du -hs
353M    .

# js-datastore
.jsipfs $ du -hs
344M    .
```
Yes, this is the idea behind storing them CBOR encoded rather than protobufs.
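To illustrate why CBOR helps here, a minimal sketch of the round trip (not the PR's actual code), assuming the `borc` CBOR codec and the in-memory datastore from `interface-datastore`; because CBOR records carry no fixed schema, optional fields can be added later without migrating existing records:

```javascript
'use strict'

const cbor = require('borc')
const { Key, MemoryDatastore } = require('interface-datastore')

async function main () {
  const pins = new MemoryDatastore()

  // a pin record; unlike a protobuf there is no fixed schema, so optional
  // fields like `comments` can simply be omitted or added later
  const record = {
    cid: Buffer.from('0170deadbeef', 'hex'), // placeholder bytes for the full CID
    type: 'recursive',
    comments: 'my first pin'
  }

  await pins.put(new Key('/some-pin-key'), cbor.encode(record))

  const loaded = cbor.decode(await pins.get(new Key('/some-pin-key')))
  console.log(loaded.type) // 'recursive'
}

main()
```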
Good suggestion, names are not unique so `comments` might be a better field name.
If we're not going to let the user query by name we probably shouldn't do this.
My thinking was that by using the multihash of a block as the pin identifier (not the full CID), it becomes cheap to calculate if a given block has already been pinned (assuming the user has hashed it with the same algorithm). The full CID is stored so we can show the user what they used to pin the block when they do a `pin ls`.
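A sketch of that idea, assuming the `cids`, `multibase` and `interface-datastore` packages (`pinKey` is a hypothetical helper, not this PR's code): the v0 and v1 forms of the same CID wrap the same multihash, so they resolve to one datastore key and the "already pinned?" check is a single lookup:

```javascript
'use strict'

const CID = require('cids')
const multibase = require('multibase')
const { Key } = require('interface-datastore')

// hypothetical helper: derive the datastore key from the multihash only,
// so different CID wrappers of the same block collide on purpose
function pinKey (cid) {
  const encoded = Buffer.from(multibase.encode('base32', cid.multihash))
  return new Key(`/${encoded.toString()}`)
}

const v0 = new CID('QmYwAPJzv5CZsnA625s3Xf2nemtYgPpHdWEz79ojWnPbdG')
const v1 = v0.toV1()

console.log(pinKey(v0).toString() === pinKey(v1).toString()) // true
```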
cc @hsanjuan
Adds a `.pins` datastore to `ipfs-repo` and uses that to store pins as cbor binary keyed by base64 stringified multihashes (n.b. not CIDs). Each pin has several fields:

```javascript
{
  cid: // buffer, the full CID pinned
  type: // string, 'recursive' or 'direct'
  comments: // string, human-readable comments for the pin
}
```

BREAKING CHANGES:

* pins are now stored in a datastore, a repo migration will be necessary
* ipfs.pins.add now returns an async generator
* ipfs.pins.rm now returns an async generator

Depends on:

- [ ] ipfs/js-ipfs-repo#221
Force-pushed from a5ec30e to 582be49
The changes in ipfs/js-ipfs#2771 mean that the input/output of `ipfs.pins.add` and `ipfs.pins.rm` are now streaming so this PR updates to the new API.
Adds a `.pins` datastore to `ipfs-repo` and uses that to store pins as cbor binary keyed by base32 encoded multihashes (n.b. not CIDs).

### Format

As stored in the datastore, each pin has several fields:

```javascript
{
  codec: // optional Number, the codec from the CID that this multihash was pinned with, if omitted, treated as 'dag-pb'
  version: // optional Number, the version number from the CID that this multihash was pinned with, if omitted, treated as v0
  depth: // Number, Infinity = recursive pin, 0 = direct, 1+ = pinned to a depth
  comments: // optional String, user-friendly description of the pin
  metadata: // optional Object, user-defined data for the pin
}
```

Notes: `.codec` and `.version` are stored so we can recreate the original CID when listing pins.

### Metadata

The intention is for us to be able to add extra fields that have technical meaning to the root of the object, and the user can store application-specific data in the `metadata` field.

### CLI

```console
$ ipfs pin add bafyfoo --metadata key1=value1,key2=value2
$ ipfs pin add bafyfoo --metadata-format=json --metadata '{"key1":"value1","key2":"value2"}'
$ ipfs pin list
bafyfoo
$ ipfs pin list -l
CID      Name    Type       Metadata
bafyfoo  My pin  Recursive  {"key1":"value1","key2":"value2"}
$ ipfs pin metadata Qmfoo --format=json
{"key1":"value1","key2":"value2"}
```

### HTTP API

* `/api/v0/pin/add` route adds new `metadata` argument, accepts a json string
* `/api/v0/pin/metadata` returns metadata as json

### Future tech

* Pins could be stored in namespaces, e.g. `/default/C19A797...`, `/my-namespace/C19A797...`, listed with `ipfs pin ls --namespace=my-namespace`
* Metadata could be queried, e.g. `ipfs pin query metadata.key1=value1`

### Core API

* `ipfs.pin.addAll` accepts and returns an async iterator

```javascript
// pass a cid or IPFS Path with options
const { cid } = await ipfs.pin.add(new CID('/ipfs/Qmfoo'), { recursive: false, metadata: { key: 'value' }, timeout: 2000 })

// pass an iterable of CIDs
const [{ cid: cid1 }, { cid: cid2 }] = await all(ipfs.pin.addAll([
  new CID('/ipfs/Qmfoo'),
  new CID('/ipfs/Qmbar')
], { timeout: '2s' }))

// pass an iterable of objects with options
const [{ cid: cid1 }, { cid: cid2 }] = await all(ipfs.pin.addAll([
  { cid: new CID('/ipfs/Qmfoo'), recursive: true, comments: 'A recursive pin' },
  { cid: new CID('/ipfs/Qmbar'), recursive: false, comments: 'A direct pin' }
], { timeout: '2s' }))
```

* `ipfs.pin.rmAll` accepts and returns an async generator (other input types are available)

```javascript
// pass an IPFS Path or CID
const { cid } = await ipfs.pin.rm(new CID('/ipfs/Qmfoo/file.txt'))

// pass options
const { cid } = await ipfs.pin.rm(new CID('/ipfs/Qmfoo'), { recursive: true })

// pass an iterable of CIDs or objects with options
const [{ cid }] = await all(ipfs.pin.rmAll([{ cid: new CID('/ipfs/Qmfoo'), recursive: true }]))
```

Bonus: lets us pipe the output of one command into another:

```javascript
await pipe(
  ipfs.pin.ls({ type: 'recursive' }),
  (source) => ipfs.pin.rmAll(source)
)

// or

await all(ipfs.pin.rmAll(ipfs.pin.ls({ type: 'recursive' })))
```

BREAKING CHANGES:

* pins are now stored in a datastore, a repo migration will occur on startup
* All deps of this module now use Uint8Arrays in place of node Buffers