Deliver a fast, accessible snapshot of ImageNet #2
I pulled the latest go-ipfs code and I've been trying to get [...]
I learned from @Stebalien that [...]
Running with this fix I've made some progress, but the maximum size of [...]
A huge thank you to @Stebalien for his help debugging this so far, but perhaps sharded directories aren't quite ready for primetime yet? @ry @diasdavid
More data from experiments after my machine downloaded all the directory shards: Initial time to list the directory was about 70 seconds.
Rerunning was about 46 seconds each time.
Trying to just list the first 10 took about the same amount of time as listing all of them.
Also worth recording that the ImageNet dataset has some issues:
– https://da-data.blogspot.com/2016/02/cleaning-imagenet-dataset-collected.html
Based on the experiments above, it seems like we need to either fix remote sharded directories or work around it with manual sharding.
Fixing remote directory shard enumeration
@Stebalien We discussed this briefly. Who is the right person to follow up with about this? Once we have the root directory structure cached locally, we still need to be able to enumerate files without loading the entire list into memory first. There are two possible ways to do this:
Manual sharding
Alternatively, I can write a script to manually re-shard the ImageNet dataset into directories of < 1k entries. cc @diasdavid @lgierth Thoughts?
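For reference, a minimal sketch of what such a re-sharding script could look like, assuming the dataset is a flat directory of files on local disk. The names `plan_shards`, `apply_plan`, the `shard-NNNNN` naming scheme, and the cap of 999 entries are all hypothetical choices for illustration, not something agreed on in this thread.

```python
import shutil
from pathlib import Path
from typing import Dict, Iterable, List

# Assumed cap: "< 1k entries" is read here as at most 999 files per directory.
MAX_ENTRIES = 999


def plan_shards(names: Iterable[str], max_entries: int = MAX_ENTRIES) -> Dict[str, List[str]]:
    """Group file names into numbered shard directories of at most max_entries each."""
    if max_entries < 1:
        raise ValueError("max_entries must be at least 1")
    shards: Dict[str, List[str]] = {}
    # Sort first so the same input always produces the same layout.
    for index, name in enumerate(sorted(names)):
        shard = f"shard-{index // max_entries:05d}"  # hypothetical naming scheme
        shards.setdefault(shard, []).append(name)
    return shards


def apply_plan(src: Path, dst: Path, plan: Dict[str, List[str]]) -> None:
    """Copy files from the flat src directory into dst/<shard>/ according to the plan."""
    for shard, names in plan.items():
        target = dst / shard
        target.mkdir(parents=True, exist_ok=True)
        for name in names:
            shutil.copy2(src / name, target / name)


if __name__ == "__main__":
    # Small self-contained demonstration with made-up file names.
    demo = plan_shards(f"img{i}.jpg" for i in range(2500))
    for shard, names in demo.items():
        print(shard, len(names))
```

Note that this only produces one level of shard directories. If the number of shards itself exceeds the cap, a second level would be needed, which this sketch does not handle.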
@victorbjelkholm has been the champion in helping me get this dataset fully pinned on the IPFS infrastructure. @victorbjelkholm, what about spinning up a node with a large enough disk and giving @ajbouh user access so that he can pin things directly? I'm sure this would save a bunch of round trips.
Sure thing. All I need to know is how big a disk is required and what kind of instance to use (it would be on AWS), plus a public key to give access, and I'll get it created in a few moments.
@victorbjelkholm The disk should be bigger than 500 GB; 1 TB just to be safe. I'll leave the decision on where to host to you. However, beware of the bandwidth costs: this machine will be piping out this dataset multiple times.
@victorbjelkholm Any progress on this?
Looks like the relevant IPFS conversation has continued on ipfs/kubo#6523
We need this to support an end-to-end benchmark at scale.
This issue is related to propelml/propel#417, which tracks a similar effort that's dependent on js-ipfs.
@diasdavid has posted an initial import of ImageNet: propelml/propel#417 (comment)
This import requires the use of directory sharding (as he outlines in his comment). ipfs/kubo#4871 implies that it won't work in the current implementation of IPTF without some adjustments.
The current implementation of ipfs ls doesn't support sharding, so we'll also need to wait on ipfs/kubo#4874 before we can experiment with this snapshot in IPTF.