Deliver a fast, accessible snapshot of ImageNet #2

ajbouh · 2018-03-27T16:15:24Z

We need this to support an end-to-end benchmark at scale.

This issue is related to propelml/propel#417, which tracks a similar effort that's dependent on js-ipfs.

@diasdavid has posted an initial import of ImageNet: propelml/propel#417 (comment)

This import requires the use of directory sharding (as he outlines in his comment). ipfs/kubo#4871 implies that it won't work in the current implementation of IPTF without some adjustments.

The current implementation of ipfs ls doesn't support sharding, so we'll also need to wait on ipfs/kubo#4874 before we can experiment with this snapshot in IPTF.


ajbouh · 2018-03-27T23:29:49Z

I pulled the latest go-ipfs code and I've been trying to get ipfs ls --resolve-type=false QmXNHWdf9qr7A67FZQTFVb6Nr1Vfp4Ct3HXLgthGG61qy1 to succeed for the better part of this afternoon.

I learned from @Stebalien that --resolve-type=false is needed to avoid fetching the first block of every file in the directory, though the current sharded directory implementation seems to ignore this option. @Stebalien wrote a fix for this: ipfs/kubo#4884

Running with this fix I've made some progress, but the maximum size of ipfs bitswap wantlist always seems to be 1. I'm on a wired connection to a 250Mbps cable modem and ipfs ls ... has printed nothing even after many minutes of waiting.

A huge thank you to @Stebalien for his help debugging this so far, but perhaps sharded directories aren't quite ready for primetime yet? @ry @diasdavid

ajbouh · 2018-03-28T19:02:47Z

More data from experiments after my machine downloaded all the directory shards:

Initial time to list the directory was about 70 seconds.

~/gopath/src/github.com/ipfs/go-ipfs$ time ipfs ls --resolve-type=false QmXNHWdf9qr7A67FZQTFVb6Nr1Vfp4Ct3HXLgthGG61qy1 | wc -l
1281167

real    1m9.764s
user    0m11.004s
sys    0m0.396s

Rerunning was about 46 seconds each time.

~/gopath/src/github.com/ipfs/go-ipfs$ time ipfs ls --resolve-type=false QmXNHWdf9qr7A67FZQTFVb6Nr1Vfp4Ct3HXLgthGG61qy1 | wc -l
1281167

real    0m45.935s
user    0m11.256s
sys    0m0.444s

Trying to just list the first 10 took about the same amount of time as listing all of them.

~/gopath/src/github.com/ipfs/go-ipfs$ time ipfs ls --resolve-type=false QmXNHWdf9qr7A67FZQTFVb6Nr1Vfp4Ct3HXLgthGG61qy1 | head -n 10
 … snip …
real    0m46.718s
user    0m10.868s
sys    0m0.460s

ajbouh · 2018-03-28T19:10:55Z

Also worth recording that the ImageNet dataset has some issues:

CMYK JPEG files sometimes screw up image loaders (e.g. MATLAB imread), and for the ILSVRC CLS-LOC training set, it takes long time to sort out these CMYK files. To this end, we list all known CMYK JPEG files as follows:

n01739381_1309.JPEG
n02077923_14822.JPEG
n02447366_23489.JPEG
n02492035_15739.JPEG
n02747177_10752.JPEG
n03018349_4028.JPEG
n03062245_4620.JPEG
n03347037_9675.JPEG
n03467068_12171.JPEG
n03529860_11437.JPEG
n03544143_17228.JPEG
n03633091_5218.JPEG
n03710637_5125.JPEG
n03961711_5286.JPEG
n04033995_2932.JPEG
n04258138_17003.JPEG
n04264628_27969.JPEG
n04336792_7448.JPEG
n04371774_5854.JPEG
n04596742_4225.JPEG
n07583066_647.JPEG
n13037406_4650.JPEG

Also, n02105855_2933.JPEG is actually a PNG file, which may crash the image loader as well if not configured generic enough.

– https://da-data.blogspot.com/2016/02/cleaning-imagenet-dataset-collected.html
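A rough way to flag such files before training is to inspect the file headers directly. The sketch below is only an illustration, not something from the linked post: it assumes a file that starts with the PNG signature is a PNG regardless of its extension, and it treats a JPEG whose frame header declares 4 color components as a likely CMYK (or YCCK) image. The function names are hypothetical.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
# Start-of-frame markers; their segments carry the component count.
# 0xC4 (DHT), 0xC8 (JPG) and 0xCC (DAC) are not frame headers.
SOF_MARKERS = {0xC0, 0xC1, 0xC2, 0xC3, 0xC5, 0xC6, 0xC7,
               0xC9, 0xCA, 0xCB, 0xCD, 0xCE, 0xCF}


def jpeg_component_count(data):
    """Return the component count from a JPEG frame header, or None."""
    if not data.startswith(b"\xff\xd8"):
        return None
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            return None  # lost sync with the marker stream
        marker = data[pos + 1]
        if marker == 0xFF:  # fill byte before a marker
            pos += 1
            continue
        if marker == 0xDA:
            return None  # scan data started without a frame header
        if marker == 0x01 or 0xD0 <= marker <= 0xD9:
            pos += 2  # standalone marker, no length field
            continue
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        if marker in SOF_MARKERS:
            # marker(2) length(2) precision(1) height(2) width(2) components(1)
            idx = pos + 9
            return data[idx] if idx < len(data) else None
        pos += 2 + length
    return None


def classify(data):
    """Classify raw file bytes as png, jpeg, jpeg-4-component, or unknown."""
    if data.startswith(PNG_SIGNATURE):
        return "png"
    count = jpeg_component_count(data)
    if count is None:
        return "unknown"
    return "jpeg-4-component" if count == 4 else "jpeg"
```

To use it, read each file's bytes and pass them to classify. Metadata segments can precede the frame header, so reading only the first few bytes may not be enough for the JPEG check.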

ajbouh · 2018-03-29T02:20:48Z

Based on the experiments above, it seems like we need to either fix remote sharded directories or work around the problem with manual sharding.

Fixing remote directory shard enumeration

Speed up fetch by having more than one element in the ipfs bitswap wantlist at a time.

@Stebalien We discussed this briefly. Who is the right person to follow up with about this?

Once we have the root directory structure cached locally, we still need to be able to enumerate files without loading the entire list into memory first. There are two possible ways to do this:

adjust the ls api to be streaming (see Implement a streaming ls api ipfs/kubo#4882)
create a manifest file manually after import with something like

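$ export ROOT_CID=...
$ ipfs ls --resolve-type=false $ROOT_CID > manifest
# -Q prints only the final hash, so the variable holds just the CID
$ MANIFEST_CID=$(ipfs add -Q manifest)
# add-link needs a name for the new link; "manifest" is used here
$ ROOT_CID=$(ipfs object patch $ROOT_CID add-link manifest $MANIFEST_CID)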
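On the consuming side, a manifest like the one written by the ipfs ls step above could be read lazily, one entry at a time, instead of loading the whole listing into memory. This is a minimal sketch under an assumption: each manifest line holds three whitespace-separated columns (CID, size, name), which should be checked against the actual ipfs ls output. The Entry and iter_manifest names are hypothetical.

```python
from typing import Iterable, Iterator, NamedTuple


class Entry(NamedTuple):
    cid: str
    size: str  # kept as text; the exact size format is not assumed here
    name: str


def iter_manifest(lines: Iterable[str]) -> Iterator[Entry]:
    """Yield one Entry per manifest line without buffering the whole file."""
    for line in lines:
        # Split into at most three fields so names containing spaces survive.
        parts = line.rstrip("\n").split(None, 2)
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        yield Entry(*parts)
```

Passing an open file object (for example, open("manifest")) keeps memory use flat, and wrapping the generator in itertools.islice gives the first 10 entries without reading the rest.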

Manual sharding

Alternatively, I can write a script to manually re-shard the ImageNet dataset into directories of < 1k entries.
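As a sketch of what such a re-sharding script might compute, the snippet below assigns each file a two-level directory by its position in the sorted listing. The layout and the names shard_path and plan_shards are assumptions made for illustration, not an agreed design. Two levels are used because about 1.28 million files split into leaves of 999 would need more than 1k leaf directories, so a single level would leave the root over the limit.

```python
from typing import Dict, Iterable


def shard_path(index: int, max_entries: int = 999) -> str:
    """Map a file's position to a 'top/sub' directory holding <= max_entries files."""
    if max_entries < 1:
        raise ValueError("max_entries must be at least 1")
    leaf = index // max_entries            # which leaf directory the file falls in
    top, sub = divmod(leaf, max_entries)   # group leaf directories under a top level
    return f"{top:03d}/{sub:03d}"


def plan_shards(names: Iterable[str], max_entries: int = 999) -> Dict[str, str]:
    """Return a mapping from original file name to its sharded relative path."""
    return {
        name: f"{shard_path(i, max_entries)}/{name}"
        for i, name in enumerate(sorted(names))
    }
```

With the 1281167 entries counted above, the last file (index 1281166) lands in 001/283, so the root holds 2 directories and no directory exceeds 999 entries. The root only stays under the limit while the total file count is at most max_entries cubed.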

cc @diasdavid @lgierth Thoughts?

daviddias · 2018-03-29T02:30:27Z

@victorbjelkholm has been the champion in helping me get this dataset fully pinned on the IPFS Infrastructure.

@victorbjelkholm, what about spinning a node with a large enough disk and giving @ajbouh user access so that he can pin things directly? I'm sure this would save a bunch of round trips.

victorb · 2018-03-29T10:16:37Z

what about spinning a node with a large enough disk

Sure thing. All I need is how big a disk is required, what kind of instance (it would be on AWS), and a public key to give access, and I'll get it created in a few moments.

daviddias · 2018-03-29T18:29:22Z

@victorbjelkholm The disk should be bigger than 500GB; 1TB just to be safe. I'll leave the decision on where to host to you. However, beware of the bandwidth costs, since this machine will be piping out this dataset multiple times.

ajbouh · 2018-07-17T21:09:40Z

@victorbjelkholm Any progress on this?

ajbouh · 2020-11-17T20:19:41Z

Looks like the relevant IPFS conversation has continued on ipfs/kubo#6523

ajbouh mentioned this issue Mar 28, 2018

Implement a streaming ls api ipfs/kubo#4882

Closed

ajbouh mentioned this issue Apr 2, 2018

Sharded directory fetching is unusably slow ipfs/kubo#4908

Closed

ajbouh mentioned this issue Jul 17, 2018

Datastore benchmarks ipfs/kubo#4870

Closed
