Slow ipfs cat on 251 MB .ipfs #2174

Closed
Faleidel opened this issue Jan 7, 2016 · 22 comments
Labels
kind/bug A bug in existing code (including security flaws) topic/perf Performance

Comments

@Faleidel commented Jan 7, 2016

rmongeau-715:~ mongeaur$ echo "test123123" | ipfs add
added QmNgb837adF9NFqTiZbhL7Pcc7Xs2kemmC2ASL5msCQccC QmNgb837adF9NFqTiZbhL7Pcc7Xs2kemmC2ASL5msCQccC
rmongeau-715:~ mongeaur$ time ipfs cat QmNgb837adF9NFqTiZbhL7Pcc7Xs2kemmC2ASL5msCQccC
test123123

real    0m1.957s
user    0m0.014s
sys 0m0.012s

This is on an ipfs node with a .ipfs size of 251.7 MB.

On a new empty ipfs node the same thing is instantaneous.

Is it normal that it's so slow? Is ipfs.io/ipfs as slow as that?

PS: running 0.4-dev

@rht (Contributor) commented Jan 21, 2016

What are the time stats for ipfs cat on an empty .ipfs, ipfs cat on the 251 MB .ipfs, and tree on .ipfs?

@Faleidel (Author)

I now run v0.4 and things are much better.

With almost empty .ipfs :
real 0m0.024s
user 0m0.015s
sys 0m0.009s

With 251MB .ipfs :
real 0m0.377s
user 0m0.016s
sys 0m0.009s

But this is still a 20x factor.

@rht (Contributor) commented Jan 21, 2016

How long does it take to tree?

@Faleidel (Author)

Sorry, but what is this tree thing?

@rht (Contributor) commented Jan 22, 2016

tree refers to https://en.wikipedia.org/wiki/Tree_(Unix); this is to check how long it takes to traverse the entire .ipfs contents.

@Faleidel (Author)

On a small .ipfs :
5071 directories, 5296 files

real    0m6.761s
user    0m0.192s
sys 0m0.446s

With 251MB .ipfs :
11452 directories, 12042 files

real    0m24.079s
user    0m0.491s
sys 0m1.231s

@rht (Contributor) commented Jan 22, 2016

I see, that would be the worst case. Or rather, the hash lookup should be closer to running find . -name <some_hash>.data multiple times (once per chunk of the hash).

edit: clarification

@rht (Contributor) commented Jan 22, 2016

Tested the find

small .ipfs:
ipfs cat 0.03s user 0.01s system 72% cpu 0.052 total
find . -name 0.00s user 0.00s system 62% cpu 0.003 total

~110MB .ipfs
ipfs cat 0.05s user 0.03s system 88% cpu 0.092 total
find . -name 0.01s user 0.02s system 97% cpu 0.027 total

~475MB .ipfs
ipfs cat 0.06s user 0.05s system 100% cpu 0.107 total
find . -name 0.01s user 0.02s system 97% cpu 0.039 total

The scaling looks reasonable.

I can't reproduce the bug (~2 s ipfs cat) you described.
I did time ipfs cat QmNgb837adF9NFqTiZbhL7Pcc7Xs2kemmC2ASL5msCQccC on the ~110 MB and ~475 MB .ipfs; they scale just like the three cases above.

@dignifiedquire (Member) commented Jun 1, 2016

I'm seeing a similar issue, currently looking at these stats:

$ time ipfs cat Qmc7yHAskid65yTEtebm1PjWH4fHmyUnB6GfmbmV29F9ev
console.log('hello from a text file')
ipfs cat Qmc7yHAskid65yTEtebm1PjWH4fHmyUnB6GfmbmV29F9ev  0.01s user 0.01s system 0% cpu 16.724 total

$ time ipfs get Qmc7yHAskid65yTEtebm1PjWH4fHmyUnB6GfmbmV29F9ev
Saving file(s) to Qmc7yHAskid65yTEtebm1PjWH4fHmyUnB6GfmbmV29F9ev
2.00 KB 0
ipfs get Qmc7yHAskid65yTEtebm1PjWH4fHmyUnB6GfmbmV29F9ev  0.01s user 0.01s system 86% cpu 0.025 total

With a .ipfs folder size of 2.89 GB containing 273,210 items.

My version is

$ ipfs version --commit
ipfs version 0.4.2-c814478

So it seems there is something specific about cat related to the repo size that is not the case for get.


Update:

@Kubuxu (Member) commented Jun 1, 2016

Here is a profiling report: https://ipfs.io/ipfs/QmUBq6RW1fLcuUpKqrkGTdaXaVKGzFgFwNVPG84QaLGyTY/ipfs.svg

Looks like for some reason ipfs cat tries to get the space used, and that is slow.

ipfs cat runs ConditionalGC, which is slow in such a big repo.

It is due to func (r *FSRepo) GetStorageUsage(). @whyrusleeping, what would be the correct way to resolve this issue? Should we cache GetStorageUsage?
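
For context, the expensive part here is that computing the repo's storage usage essentially amounts to walking every block file and summing sizes, so the cost grows with the number of blocks. Below is a minimal sketch of that shape, illustrative only (not the actual go-ipfs code; the blocks path is an assumption):

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// storageUsage walks every file under the given path and sums their sizes.
// This is the shape of a naive GetStorageUsage: cost grows linearly with
// the number of blocks, which is why it gets slow on multi-GB repos.
func storageUsage(repoPath string) (uint64, error) {
	var total uint64
	err := filepath.Walk(repoPath, func(_ string, info os.FileInfo, err error) error {
		if err != nil {
			return err
		}
		if !info.IsDir() {
			total += uint64(info.Size())
		}
		return nil
	})
	return total, err
}

func main() {
	// Illustrative path; the real repo location comes from $IPFS_PATH.
	size, err := storageUsage(filepath.Join(os.Getenv("HOME"), ".ipfs", "blocks"))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("blocks take up %d bytes\n", size)
}
```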

@rht (Contributor) commented Jun 1, 2016

It does, since ipfs cat runs a conditional GC to check whether the repo's max size has been exceeded: https://github.com/ipfs/go-ipfs/blob/c814478fb98f7ac5628dff3881821eb9904041ee/core/commands/cat.go#L45.

A middle ground that keeps this feature is to cache information about the repo size. Additionally, this could be used to report the repo's dedup ratio (much like in zfs). zfs has quota as an option as well.
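
To make the "conditional" part concrete: the check boils down to comparing the measured repo size against a watermark of the configured storage limit, and only the comparison is cheap; measuring the size is what hurts. A minimal sketch (function and parameter names are illustrative, not go-ipfs's actual API):

```go
package sketch

// maybeGC sketches a conditional GC: trigger a full collection only when
// the measured usage crosses a watermark fraction of the configured
// maximum. Note that the expensive step is computing `usage` on each call.
func maybeGC(usage, storageMax, watermarkPercent uint64, gc func() error) error {
	if storageMax == 0 {
		return nil // no limit configured, nothing to enforce
	}
	threshold := storageMax / 100 * watermarkPercent
	if usage < threshold {
		return nil // comfortably under the watermark, skip the GC
	}
	return gc()
}
```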

@Kubuxu (Member) commented Jun 1, 2016

I am thinking about using a custom walk function and using ModTime to filter out changes.

As files are immutable, we would only need to store the sizes of directories (a rough sketch of this idea is below).

@dignifiedquire, can you run ls -1 ~/.ipfs/blocks/ | wc -l to check how many directories are in your repo, and find ~/.ipfs/blocks/ -type f | wc -l to check how many blocks there are?
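
A rough sketch of that idea, assuming the flatfs layout where ~/.ipfs/blocks holds one level of prefix directories full of immutable block files (the cache type and how it is persisted are hypothetical):

```go
package sketch

import (
	"os"
	"path/filepath"
	"time"
)

// dirEntry caches the last-seen ModTime and summed file size of one
// blocks/ prefix directory.
type dirEntry struct {
	ModTime time.Time
	Size    uint64
}

// cachedUsage sums the blocks directory, re-walking a prefix directory
// only when its ModTime differs from the cached one. Since block files
// are immutable, an unchanged directory keeps its cached size.
func cachedUsage(blocksPath string, cache map[string]dirEntry) (uint64, error) {
	prefixes, err := os.ReadDir(blocksPath)
	if err != nil {
		return 0, err
	}
	var total uint64
	for _, p := range prefixes {
		if !p.IsDir() {
			continue
		}
		info, err := p.Info()
		if err != nil {
			return 0, err
		}
		if c, ok := cache[p.Name()]; ok && c.ModTime.Equal(info.ModTime()) {
			total += c.Size // unchanged since the last scan
			continue
		}
		var size uint64
		files, err := os.ReadDir(filepath.Join(blocksPath, p.Name()))
		if err != nil {
			return 0, err
		}
		for _, f := range files {
			fi, err := f.Info()
			if err != nil {
				return 0, err
			}
			size += uint64(fi.Size())
		}
		cache[p.Name()] = dirEntry{ModTime: info.ModTime(), Size: size}
		total += size
	}
	return total, nil
}
```

On an unchanged repo this touches only the prefix directories rather than every block file, which is where the rough "factor of 4" estimate below comes from.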

@dignifiedquire (Member)

@Kubuxu there you go

$ ls -1 ~/.ipfs/blocks/ | wc -l
   62947

$ find ~/.ipfs/blocks/ -type f | wc -l
  210380

@rht (Contributor) commented Jun 1, 2016

I am thinking about using a custom walk function and using ModTime to filter out changes.

This still doesn't address the fact that all the blocks have to be traversed.

As stated, I'm for caching (list of keys, reference count, and size) into a dedup table, saved into a file in $IPFS_PATH. This could result in faster ipfs refs (it uses AllKeysChan, which requires a full repo traversal), faster blockstore.Has, keeping track of the dedup ratio, and so on.
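
A sketch of what such a cached table could look like, persisted as a file under $IPFS_PATH (the file name, field names, and JSON encoding are all illustrative assumptions, not an actual repo format):

```go
package sketch

import (
	"encoding/json"
	"os"
	"path/filepath"
)

// blockInfo is one entry of a hypothetical cached key table: the block's
// size on disk and how many DAG nodes reference it (for dedup accounting).
type blockInfo struct {
	Size uint64 `json:"size"`
	Refs uint64 `json:"refs"`
}

// keyTable maps multihash -> blockInfo. Keeping it alongside the repo
// would avoid a full traversal for repo-size checks, AllKeysChan-style
// listings, and Has() lookups.
type keyTable map[string]blockInfo

func loadKeyTable(ipfsPath string) (keyTable, error) {
	data, err := os.ReadFile(filepath.Join(ipfsPath, "keytable.json"))
	if os.IsNotExist(err) {
		return keyTable{}, nil // no cache yet, start empty
	}
	if err != nil {
		return nil, err
	}
	t := keyTable{}
	return t, json.Unmarshal(data, &t)
}

func (t keyTable) save(ipfsPath string) error {
	data, err := json.MarshalIndent(t, "", "  ")
	if err != nil {
		return err
	}
	return os.WriteFile(filepath.Join(ipfsPath, "keytable.json"), data, 0o600)
}
```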

@Kubuxu (Member) commented Jun 1, 2016

We don't need to traverse all blocks. If the ModTime of a directory is the same as last recorded, we can skip walking it. That alone should improve times by a factor of 4, based on @dignifiedquire's data.

I am all for full caching, but it requires changes to the repo format (and more things to keep in sync, and so on).

The quick and dirty option would be to disable ConditionalGC in ipfs cat (it is already disabled inside ipfs get) and rely only on the interval GC until the repo format gets upgraded with a cache.

@rht (Contributor) commented Jun 1, 2016

I see, though the leap from "keeping track of the ModTime of directories" to "keeping track of the keys and their properties" is not that far. The full caching can be implemented as an in-memory table for the lifetime of the daemon (or of a daemon-less command), then serialized/synced only when the repo closes.

ConditionalGC is disabled in ipfs get because the total size of the hash can't readily be calculated (unlike in ipfs cat, where the hash must refer to a file). The only compelling reason to disable ConditionalGC everywhere is that, at the moment, the repo is rarely used under constrained disk space by most users (e.g., on embedded devices). Which is the case, and so I think it is fine for now.

@whyrusleeping (Member)

I don't even think we need to traverse the directory. We can likely just keep a log of our current size based on blocks added and removed (track datastore puts and deletes) and write it out periodically to the datastore.

I don't think we care about a perfectly exact size on disk; it's just a rough estimate to determine whether we want to GC now or later.
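
A minimal sketch of that approach as a wrapper around a put/delete store; the Datastore interface below is a deliberately simplified stand-in, not the real go-datastore API:

```go
package sketch

import "sync"

// Datastore is a simplified stand-in for the real datastore interface,
// just enough to show the bookkeeping.
type Datastore interface {
	Put(key string, value []byte) error
	Delete(key string) error
	GetSize(key string) (int, error)
}

// sizeTracking wraps a Datastore and keeps a running estimate of bytes
// stored, updated on every Put and Delete. The number only needs to be
// roughly right for GC decisions and can be flushed to disk periodically.
type sizeTracking struct {
	Datastore
	mu    sync.Mutex
	bytes int64
}

func newSizeTracking(d Datastore) *sizeTracking {
	return &sizeTracking{Datastore: d}
}

func (s *sizeTracking) Put(key string, value []byte) error {
	if err := s.Datastore.Put(key, value); err != nil {
		return err
	}
	s.mu.Lock()
	s.bytes += int64(len(value))
	s.mu.Unlock()
	return nil
}

func (s *sizeTracking) Delete(key string) error {
	if size, err := s.Datastore.GetSize(key); err == nil {
		s.mu.Lock()
		s.bytes -= int64(size)
		s.mu.Unlock()
	}
	return s.Datastore.Delete(key)
}

// Estimate returns the current running size estimate in bytes.
func (s *sizeTracking) Estimate() int64 {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.bytes
}
```

Flushing Estimate() to the datastore every so often would give the conditional GC a cheap, roughly correct number without any directory walk.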

@whyrusleeping added the kind/bug label on Jun 1, 2016
@Kubuxu (Member) commented Jun 20, 2016

This is quite an important bug; it makes big repos on my server almost unusable.
I am for a temporary solution right now and introducing a proper fix later.

@Kubuxu added the topic/perf label on Jun 20, 2016
@whyrusleeping added this to the Resource Constraints milestone on Aug 9, 2016
@kevina (Contributor) commented Aug 9, 2016

Is there a reason ipfs cat needs to perform a conditional GC? That seems like a rather odd thing to do for an (essentially) read-only command.

I am thinking a quick fix would be to just disable the conditional GC. A better fix would be to run the conditional GC every now and then on commands that add blocks, such as ipfs add, rather than ipfs cat.

@whyrusleeping (Member)

The reason it's on ipfs cat (albeit a poor reason) is that ipfs cat will fetch content from the network and store it on disk before sending it back to you. We wanted to make sure we always tried our hardest to have enough space to execute the command.

I think removing that call for now is okay until we find a better way to do the conditional GC stuff.

@kevina (Contributor) commented Aug 9, 2016

Ahh, forgot that it could fetch from the network.

@whyrusleeping (Member)

Going to close this for now since we no longer do a conditional GC for cat.
