
ipfs ls doesn't work for sharded directories #4874

Closed
Stebalien opened this issue Mar 24, 2018 · 8 comments

@Stebalien
Member

While we can't implement a streaming ls :(, we can, at least, run the ls locally and send back the buffered results.

@Stebalien
Member Author

So, @lgierth suggested just buffering it in memory and yoloing. @diasdavid suggested adding a --stream flag to avoid breaking the API.

@Stebalien
Member Author

@whyrusleeping @kevina thoughts?

@ajbouh

ajbouh commented Mar 27, 2018

I'm working to bring large dataset snapshots hosted on IPFS into machine learning frameworks like TensorFlow. There's working code to do this over at github.com/tesserai/iptf.

The high-level design for IPTF is a C++ class that implements basic filesystem operations. I've written some code to bind the Go interfaces in the mfs package to a C++ class with the proper interface.

Latency and throughput are both important for the IPTF use case, as during training we need to keep the CPU and GPU as close to 100% utilization as possible. The major limiting factor in many cases is being able to fetch and decode image data quickly enough.

Under the hood, this process does something like the following: walk a directory hierarchy in random order, randomly choose which files to load, decode those files, pass them along to the training logic, and continue the random walk. It's important that this be fast and not require fetching the entire directory's contents before returning.

Here's the internal method used to visit each child during one of these walks: https://github.com/tesserai/iptf/blob/b9a6b3d65a5d34546ff86876bce416338dd4827b/iptf/go/ro/raw_file_system.go#L136

@Stebalien
Member Author

  1. How many files are we talking about?
  2. Are you planning on operating on a random sample or do you actually need to do a random walk over the entire dataset?

Unfortunately, while we support random lookups, we don't support seeking in sharded directories (although we arguably should). That is, if you know the name of the file in question, you can fetch it efficiently. However, if you want the Nth file, you'll have to traverse the entire directory, counting files until you reach it (at the moment, the underlying directory structure doesn't record the information we'd need to seek to the Nth file efficiently).
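To make that cost concrete, here's a minimal Go sketch with the directory listing modeled as a channel of names (a hypothetical stand-in for the real iterator): finding the Nth entry means consuming everything before it.

```go
package main

import "fmt"

// nthEntry consumes a stream of directory entry names until the n-th
// (0-indexed) one is reached. This mirrors why "give me the Nth file"
// is linear today: the sharded directory records nothing that would
// let us skip ahead, so every preceding entry must be visited.
func nthEntry(names <-chan string, n int) (string, bool) {
	i := 0
	for name := range names {
		if i == n {
			return name, true
		}
		i++
	}
	return "", false // stream ended before reaching n
}

func main() {
	ch := make(chan string, 4)
	for _, name := range []string{"a.jpg", "b.jpg", "c.jpg", "d.jpg"} {
		ch <- name
	}
	close(ch)
	name, ok := nthEntry(ch, 2)
	fmt.Println(name, ok) // prints "c.jpg true"
}
```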

However, if you don't have that many files (the list can easily fit in memory), this isn't an issue. Once this issue is fixed, you'll be able to fetch the entire directory and randomize it in memory.

Alternatively, if you don't need to walk the entire dataset, you can (once this issue is fixed) stream the results and sample the stream.

If neither of these apply, we'll have to look into adding seek support to sharded directories (which we'll probably want anyways).

@Kubuxu
Member

Kubuxu commented Mar 27, 2018

@Stebalien does ls not work at all or is it just slow/buffers everything? If it is the former the issue should be renamed.

@ajbouh

ajbouh commented Mar 27, 2018

@Stebalien the interface we implement is defined here: https://github.com/tesserai/iptf/blob/b9a6b3d65a5d34546ff86876bce416338dd4827b/iptf/go/api/api.go#L54

It includes things like:

```go
EachChildName(dir string, visit func(name string) error) error
Stat(fname string) (*Stat, error)
NewRandomAccessFile(fname string) (io.ReaderAt, error)
```

So the ability to stream through the names, sample one, stat it, and then load the data from that file would be enough.

The actual implementation of sampling is (currently) done at a higher level, so it's reasonable to assume that we have the names of files by the time we want to load the file.

Re: how many files, I don't actually know that since I can't inspect the CID that @diasdavid posted:

We have successfully added and pinned ImageNet to IPFS. The following CID (Content Identifier) has it QmXNHWdf9qr7A67FZQTFVb6Nr1Vfp4Ct3HXLgthGG61qy1 and it is being pinned by this node /ip4/145.239.144.121/tcp/4001/ipfs/QmY9BWdiEn43iNx6nJEvgwrioJUYPQnUAoqvqKVRjavK4h, you can do a direct ipfs swarm connect to it to speed up the discovery.

Any idea how complicated doing this is going to be? I'd prefer something that works today, so we can get an end-to-end benchmark up and running.

I posted a script to shard a smaller dataset before importing it to IPFS: https://gist.github.com/ajbouh/80cf1e2c87fff205283e527568da4ce4

You can explore it here: https://gateway.ipfs.io/ipfs/QmaePis1poDFG3C9m3GZcJA6P9L6gRZTVsPed2XsnZTvgC/

That smaller dataset had ~50k files in a single directory, which I sharded into 50 directories of ~1000 files each.

@Stebalien
Member Author

does ls not work at all or is it just slow/buffers everything? If it is the former the issue should be renamed.

Ah. I was looking at file ls (the deprecated version), not ls. It looks like ls does support listing sharded directories, just not streaming them.

So the ability to stream through the names, sample one, stat it, and then load the data from that file would be enough.

Got it. That won't require modifying unixfs itself so it should be doable.

Any idea how complicated doing this is going to be? I'd prefer something that works today, so we can get an end-to-end benchmark up and running.

...

That smaller dataset had ~50k files in a single directory, which I sharded into 50 directories of ~1000 files each.

So, given that I was wrong and that the non-streaming API works, you can probably just use that for now. 50-100k files should actually fit in memory quite easily (on the order of tens of megabytes, at most).

A streaming API will require some discussion: #4882

@Stebalien
Member Author

(closing as listing files does, in fact, work).
