ipfs ls doesn't work for sharded directories #4874
So, @lgierth suggested just buffering it in memory and yoloing. @diasdavid suggested adding a …

@whyrusleeping @kevina thoughts?
I'm working to bring large dataset snapshots hosted on IPFS into machine learning frameworks like TensorFlow. There's working code to do this over in github.com/tesserai/iptf

The high-level design for IPTF is a C++ class that implements basic filesystem operations. I've written some code to bind the go interfaces in the mfs package to a C++ class with the proper interface.

Latency and throughput are both important for the IPTF use case, as during training we need to keep the CPU and GPU as close to 100% utilization as possible. The major limiting factor in many cases is being able to fetch and decode image data quickly enough. Under the hood, this process does something like the following: walk a directory hierarchy in a random order, randomly choosing which files to load; decode those files; pass them along to the training logic; and continue the random walk. It's important that this be fast and not require fetching the entire directory's contents before returning.

Here's the internal method used to visit each child during one of these walks: https://github.com/tesserai/iptf/blob/b9a6b3d65a5d34546ff86876bce416338dd4827b/iptf/go/ro/raw_file_system.go#L136
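A minimal sketch of that kind of walk, assuming a hypothetical `FileSystem` interface in place of the real IPTF one (see the linked raw_file_system.go for the actual shape):

```go
package main

import "math/rand"

// FileSystem is a hypothetical stand-in for the interface IPTF binds to.
type FileSystem interface {
	// EachChildName calls visit for every entry name in dir.
	EachChildName(dir string, visit func(name string) error) error
	// IsDir reports whether path names a directory.
	IsDir(path string) bool
}

// randomWalk lists dir, shuffles the entries, then recurses into
// subdirectories and calls load on files, so the tree is visited in a
// random order without buffering more than one directory's listing.
func randomWalk(fs FileSystem, dir string, load func(path string) error) error {
	var names []string
	err := fs.EachChildName(dir, func(name string) error {
		names = append(names, name)
		return nil
	})
	if err != nil {
		return err
	}
	rand.Shuffle(len(names), func(i, j int) { names[i], names[j] = names[j], names[i] })
	for _, name := range names {
		path := dir + "/" + name
		if fs.IsDir(path) {
			err = randomWalk(fs, path, load)
		} else {
			err = load(path)
		}
		if err != nil {
			return err
		}
	}
	return nil
}
```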
Unfortunately, while we support random lookups, we don't support seeking in sharded directories (although we arguably should). That is, if you know the name of the file in question, you can get it efficiently. However, if you want the Nth file, you'll have to traverse the entire directory counting files until you reach it (at the moment, the underlying directory structure doesn't record the information we'd need to seek to the Nth file efficiently).

However, if you don't have that many files (i.e., the list can easily fit in memory), this isn't an issue. Once this issue is fixed, you'll be able to fetch the entire directory and randomize it in memory. Alternatively, if you don't need to walk the entire dataset, you can (once this issue is fixed) stream the results and sample the stream. If neither of these apply, we'll have to look into adding seek support to sharded directories (which we'll probably want anyways).
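For the "sample the stream" option, reservoir sampling picks k names uniformly from a listing of unknown length in a single pass; a sketch assuming a streaming `EachChildName`-style callback:

```go
package main

import "math/rand"

// sampleStream returns k names chosen uniformly at random from the
// stream produced by each, without knowing the stream's length in
// advance (classic reservoir sampling, Algorithm R).
func sampleStream(k int, each func(visit func(name string) error) error) ([]string, error) {
	reservoir := make([]string, 0, k)
	n := 0
	err := each(func(name string) error {
		n++
		if len(reservoir) < k {
			reservoir = append(reservoir, name)
		} else if j := rand.Intn(n); j < k {
			// The nth item replaces a random slot with probability k/n,
			// which keeps every item equally likely to survive.
			reservoir[j] = name
		}
		return nil
	})
	return reservoir, err
}
```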
@Stebalien does …
@Stebalien the interface we implement is defined here: https://github.com/tesserai/iptf/blob/b9a6b3d65a5d34546ff86876bce416338dd4827b/iptf/go/api/api.go#L54

It includes things like:

```go
EachChildName(dir string, visit func(name string) error) error
Stat(fname string) (*Stat, error)
NewRandomAccessFile(fname string) (io.ReaderAt, error)
```

So the ability to stream through the names, sample one, stat it, and then load the data from that file would be enough. The actual implementation of sampling is (currently) done at a higher level, so it's reasonable to assume that we have the names of files by the time we want to load the file.

Re: how many files, I don't actually know that since I can't inspect the CID that @diasdavid posted:
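A hedged usage sketch of those three calls together (the `FS` interface restates the quoted methods so the sketch is self-contained, and the `Stat` fields are guesses, not the real api.go types):

```go
package main

import (
	"errors"
	"io"
	"math/rand"
)

// Stat stands in for the real *Stat in api.go, whose fields aren't shown here.
type Stat struct{ Size int64 }

// FS restates the quoted interface so this sketch compiles standalone.
type FS interface {
	EachChildName(dir string, visit func(name string) error) error
	Stat(fname string) (*Stat, error)
	NewRandomAccessFile(fname string) (io.ReaderAt, error)
}

// loadRandomChild streams the names in dir, picks one uniformly at
// random, stats it, and reads up to 4 KiB from the front of the file.
func loadRandomChild(fs FS, dir string) ([]byte, error) {
	var names []string
	if err := fs.EachChildName(dir, func(name string) error {
		names = append(names, name)
		return nil
	}); err != nil {
		return nil, err
	}
	if len(names) == 0 {
		return nil, errors.New("empty directory")
	}
	path := dir + "/" + names[rand.Intn(len(names))]
	if _, err := fs.Stat(path); err != nil {
		return nil, err
	}
	f, err := fs.NewRandomAccessFile(path)
	if err != nil {
		return nil, err
	}
	buf := make([]byte, 4096)
	n, err := f.ReadAt(buf, 0)
	if err == io.EOF { // a short file returns EOF alongside the partial read
		err = nil
	}
	return buf[:n], err
}
```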
Any idea how complicated doing this is going to be? I'd prefer something that works today, so we can get an end-to-end benchmark up and running.

I posted a script to shard a smaller dataset before importing it to IPFS: https://gist.github.com/ajbouh/80cf1e2c87fff205283e527568da4ce4

You can explore it here: https://gateway.ipfs.io/ipfs/QmaePis1poDFG3C9m3GZcJA6P9L6gRZTVsPed2XsnZTvgC/

That smaller dataset had ~50k files in a single directory, which I sharded into 50 directories of ~1000 files each.
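The linked gist is the real script; as a rough illustration of the same idea, files can be bucketed into a fixed number of subdirectories by hashing their names, so the split is deterministic:

```go
package main

import (
	"fmt"
	"hash/fnv"
	"os"
	"path/filepath"
)

const numShards = 50

// shardDir moves every file in src into src/shard-XX subdirectories,
// choosing the shard by an FNV hash of the file name. This only
// sketches the approach; the linked gist is the script actually used.
func shardDir(src string) error {
	entries, err := os.ReadDir(src)
	if err != nil {
		return err
	}
	for _, e := range entries {
		if e.IsDir() {
			continue
		}
		h := fnv.New32a()
		h.Write([]byte(e.Name()))
		shard := filepath.Join(src, fmt.Sprintf("shard-%02d", h.Sum32()%numShards))
		if err := os.MkdirAll(shard, 0o755); err != nil {
			return err
		}
		if err := os.Rename(filepath.Join(src, e.Name()), filepath.Join(shard, e.Name())); err != nil {
			return err
		}
	}
	return nil
}
```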
Ah. I was looking at …
Got it. That won't require modifying unixfs itself so it should be doable.
...
So, given that I was wrong and that the non-streaming API works, you can probably just use that for now. A listing of 50–100k files should actually fit in memory quite easily (on the order of tens of megabytes, at most). A streaming API will require some discussion: #4882
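Buffering and shuffling the full listing might look like this sketch, which shells out to the `ipfs ls` CLI and assumes its usual `<hash> <size> <name>` output lines (names containing spaces would need smarter parsing):

```go
package main

import (
	"math/rand"
	"os/exec"
	"strings"
)

// shuffledNames runs `ipfs ls <cid>`, keeps the name column, and
// returns the names in random order, trading memory for simplicity.
func shuffledNames(cid string) ([]string, error) {
	out, err := exec.Command("ipfs", "ls", cid).Output()
	if err != nil {
		return nil, err
	}
	var names []string
	for _, line := range strings.Split(strings.TrimSpace(string(out)), "\n") {
		fields := strings.Fields(line)
		if len(fields) >= 3 {
			names = append(names, fields[len(fields)-1])
		}
	}
	rand.Shuffle(len(names), func(i, j int) { names[i], names[j] = names[j], names[i] })
	return names, nil
}
```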
(closing as listing files does, in fact, work).
While we can't implement streaming `ls` :(, we can, at least, just ls locally and send back the results.