
ipfs ls doesn't work for sharded directories #4874

Closed
Stebalien opened this issue Mar 24, 2018 · 8 comments

@Stebalien
Member

While we can't implement a streaming ls :(, we can, at least, run the ls locally and send back the buffered results.

@Stebalien
Member Author

So, @lgierth suggested just buffering it in memory and yoloing. @diasdavid suggested adding a --stream flag to avoid breaking the API.

@Stebalien
Member Author

@whyrusleeping @kevina thoughts?

@ajbouh

ajbouh commented Mar 27, 2018

I'm working to bring large dataset snapshots hosted on IPFS into machine learning frameworks like TensorFlow. There's working code to do this over at github.com/tesserai/iptf.

The high-level design for IPTF is a C++ class that implements basic filesystem operations. I've written some code to bind the Go interfaces in the mfs package to a C++ class with the proper interface.

Latency and throughput are both important for the IPTF use case, as during training we need to keep the CPU and GPU as close to 100% utilization as possible. The major limiting factor in many cases is being able to fetch and decode image data quickly enough.

Under the hood, this process does something like the following: walk a directory hierarchy in random order, randomly choose which files to load, decode those files, pass them along to the training logic, and continue the random walk. It's important that this be fast and not require fetching the entire directory's contents before returning.

Here's the internal method used to visit each child during one of these walks: https://github.com/tesserai/iptf/blob/b9a6b3d65a5d34546ff86876bce416338dd4827b/iptf/go/ro/raw_file_system.go#L136

@Stebalien
Member Author

  1. How many files are we talking about?
  2. Are you planning on operating on a random sample or do you actually need to do a random walk over the entire dataset?

Unfortunately, while we support random lookups, we don't support seeking in sharded directories (although we arguably should). That is, if you know the name of the file in question, you can fetch it efficiently. However, if you want the Nth file, you'll have to traverse the entire directory, counting files until you reach it (at the moment, the underlying directory structure doesn't record the information we'd need to seek to the Nth file efficiently).
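To make that cost concrete, here's a minimal Go sketch with the directory listing modeled as a channel of names (a hypothetical stand-in for the real iterator): finding the Nth entry means consuming everything before it.

```go
package main

import "fmt"

// nthEntry consumes a stream of directory entry names until the n-th
// (0-indexed) one is reached. This mirrors why "give me the Nth file"
// is linear today: the sharded directory records nothing that would
// let us skip ahead, so every preceding entry must be visited.
func nthEntry(names <-chan string, n int) (string, bool) {
	i := 0
	for name := range names {
		if i == n {
			return name, true
		}
		i++
	}
	return "", false // stream ended before reaching n
}

func main() {
	ch := make(chan string, 4)
	for _, name := range []string{"a.jpg", "b.jpg", "c.jpg", "d.jpg"} {
		ch <- name
	}
	close(ch)
	name, ok := nthEntry(ch, 2)
	fmt.Println(name, ok) // prints "c.jpg true"
}
```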

However, if you don't have that many files (the list can easily fit in memory), this isn't an issue. Once this issue is fixed, you'll be able to fetch the entire directory and randomize it in memory.

Alternatively, if you don't need to walk the entire dataset, you can (once this issue is fixed) stream the results and sample the stream.

If neither of these apply, we'll have to look into adding seek support to sharded directories (which we'll probably want anyways).

@Kubuxu
Member

Kubuxu commented Mar 27, 2018

@Stebalien does ls not work at all or is it just slow/buffers everything? If it is the former the issue should be renamed.

@ajbouh

ajbouh commented Mar 27, 2018

@Stebalien the interface we implement is defined here: https://github.com/tesserai/iptf/blob/b9a6b3d65a5d34546ff86876bce416338dd4827b/iptf/go/api/api.go#L54

It includes things like:

```go
EachChildName(dir string, visit func(name string) error) error
Stat(fname string) (*Stat, error)
NewRandomAccessFile(fname string) (io.ReaderAt, error)
```

So the ability to stream through the names, sample one, stat it, and then load the data from that file would be enough.

The actual implementation of sampling is (currently) done at a higher level, so it's reasonable to assume that we have the names of files by the time we want to load the file.

Re: how many files, I don't actually know that since I can't inspect the CID that @diasdavid posted:

We have successfully added and pinned ImageNet to IPFS. The following CID (Content Identifier) has it QmXNHWdf9qr7A67FZQTFVb6Nr1Vfp4Ct3HXLgthGG61qy1 and it is being pinned by this node /ip4/145.239.144.121/tcp/4001/ipfs/QmY9BWdiEn43iNx6nJEvgwrioJUYPQnUAoqvqKVRjavK4h, you can do a direct ipfs swarm connect to it to speed up the discovery.

Any idea how complicated doing this is going to be? I'd prefer something that works today, so we can get an end-to-end benchmark up and running.

I posted a script to shard a smaller dataset before importing it to IPFS: https://gist.github.com/ajbouh/80cf1e2c87fff205283e527568da4ce4

You can explore it here: https://gateway.ipfs.io/ipfs/QmaePis1poDFG3C9m3GZcJA6P9L6gRZTVsPed2XsnZTvgC/

That smaller dataset had ~50k files in a single directory, which I sharded into 50 directories of ~1000 files each.

@Stebalien
Member Author

does ls not work at all or is it just slow/buffers everything? If it is the former the issue should be renamed.

Ah. I was looking at file ls (the deprecated version), not ls. It looks like ls does support listing sharded directories, just not streaming them.

So the ability to stream through the names, sample one, stat it, and then load the data from that file would be enough.

Got it. That won't require modifying unixfs itself so it should be doable.

Any idea how complicated doing this is going to be? I'd prefer something that works today, so we can get an end-to-end benchmark up and running.

...

That smaller dataset had ~50k files in a single directory, which I sharded into 50 directories of ~1000 files each.

So, given that I was wrong and that the non-streaming API works, you can probably just use that for now. 50-100k files should actually fit in memory quite easily (on the order of tens of megabytes, at most).

A streaming API will require some discussion: #4882

@Stebalien
Member Author

(closing as listing files does, in fact, work).
