Memory leak in unpack? #57
Ok, so I figured it out, with help from Lines 117 to 119 in a0d722c. The lazy bytestring that points to the file contents and the one that points to the rest of the archive appear to be the culprit.

Then I attempted a prototype of converting to streamly, and it worked in constant memory. But I'm having efficiency issues that need sorting out: hasufell@bd21bf9#diff-cafc5a404a211b294e8976f28f99048766774cc6414507cc2f5032dc5b5bc376R71

@Bodigrim I think going with streamly and getting rid of the lazy bytestring machinery is the way forward.
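For intuition, here is a minimal, hypothetical sketch of the pattern being described, assuming the entry payload and the remainder of the archive come from splitting one lazy ByteString (an illustration, not the library's actual code):

```haskell
import Data.Int (Int64)
import qualified Data.ByteString.Lazy as BL

-- Hypothetical illustration: splitting the archive stream into an
-- entry's payload and the remaining input. If the whole payload is
-- retained (e.g. stored in an entry record) while it is written out,
-- the entire entry stays resident, so peak memory is bounded by the
-- size of the largest file in the archive rather than being constant.
splitEntry :: Int64 -> BL.ByteString -> (BL.ByteString, BL.ByteString)
splitEntry size archive = BL.splitAt size archive
```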
I'm afraid that streamly is too heavy a dependency to impose on all clients of tar.
Why? Is there an upper bound on the number of dependencies? I don't think so.
There is no hard limit. But I am afraid of a situation when … On the other hand, I do understand that being unable to unpack a GHC tar archive is pretty much a deal breaker :)
I'm more afraid of …
@Bodigrim I agree that … Any new dependencies would become new upstream dependencies for 67 other packages. We have to consider them carefully. I'm not fundamentally opposed to adding dependencies, fwiw.
It's nowhere near streamly: https://github.com/composewell/streaming-benchmarks#monadic-streams
Streamly is used in production as well.
Sorry, what do you mean by 67 packages? AFAICS the only additional packages it would pull in would be roughly …
I've never done anything like that. Would you be volunteering to implement a solution that way? edit: also relevant composewell/streamly#533
Yes, benchmarks exist and they say things. Whether they say useful things is a question left up to the implementation, and …
Links?
Reverse dep lookup says many packages depend on tar.
It's not within my bandwidth currently.
Tar isn't complicated. And tbf, it's mostly IO-bound, I'd say. So even using …
I don't have a complete list. I've also used it myself in a performance-critical setting on an event-sourced platform.
Yes.
Well, the bandwidth and motivation of contributors have an impact on viable options.
This doesn't appear to be a leak, just an algorithm whose footprint is bounded by the largest file in the tarball, which for many use cases should be just fine. (In particular, many things using this lib will need to have the unpacked files resident in memory anyway in order to operate on them.) In my opinion it would suffice simply to document this. A more purely streaming interface will have different tradeoffs, and there's certainly a use case for that too, but that could be provided by another library just fine.
I strongly disagree with that. tar takes up the most common name that people searching for a proper tar library will come across, yet it is currently the worst such library, with a plethora of bugs, and it fails to correctly unpack most tarballs found in the wild. The only use case that seems to somewhat work is that of cabal-install, and even there things break if you happen to create a filepath that is too long.

The state of this library is embarrassing, and it is not enough to point out that it works for small archives and corner cases. It should either become the go-to library or be incorporated into cabal-install and then abandoned on Hackage, so users don't waste their time tracking down functional bugs and performance issues.

I'm sorry if that sounds harsh, but these things affect people in production who made the wrong choice of relying on this lib without investigating its state. Existing users aren't the only concern; potential users are too.
@Bodigrim, @davean and myself just picked this package up from Duncan weeks ago to see about improving it. You are browbeating the people who are actually trying to help improve the state of this library, and it is completely uncalled for and unacceptable that you're acting with such hostility. I have repeatedly seen borderline abusive behavior from you, and I am not going to tolerate it any further. If you can't voice your negativity in a constructive manner, don't voice it at all.
That's untrue. I haven't brought up a single ad-hominem argument, and I have explained why I think simply documenting the current behavior is not enough: that is too low a standard for a core library.

That's a strong claim to make in public without proper backup, and I do not appreciate it.

You mean all the patches I wrote are not constructive? :)
Let’s all take a step back and at least recognize that we all want tar to be better and that we mostly disagree on the path, not the end point. I’m willing to put some time into going through some of the outstanding PRs that have languished too long, to help triage them, if that would provide an amicable way forward.
Folks, please let's be kinder to each other. I suggest we wait some time for composewell/streamly#533. Let's focus on other issues in the meantime.
Streamly is now split into streamly-core, which only depends on boot libraries: https://hackage.haskell.org/package/streamly-core
@hasufell, could you give d94a988 a try? I'm not really sure why ;) but it seems to make things better.
I've added a CI job to test memory consumption on large files, seems all good. |
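As a sketch of what such a check can look like (not necessarily what the actual CI job does), one option is to run the unpacker under a hard RTS heap cap, so the job fails if memory use grows with archive size:

```haskell
-- Hypothetical memory-bound test, assuming the public
-- Codec.Archive.Tar API.
-- Build with:  ghc -O2 -rtsopts UnpackBound.hs
-- Run with:    ./UnpackBound big.tar +RTS -M300m
-- The -M flag makes the RTS abort once the heap exceeds 300 MB,
-- turning excessive memory use into a test failure.
import qualified Codec.Archive.Tar as Tar
import qualified Data.ByteString.Lazy as BL
import System.Environment (getArgs)

main :: IO ()
main = do
  [archive] <- getArgs
  Tar.unpack "out" . Tar.read =<< BL.readFile archive
```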
It seems to work. |
Unpacking a decompressed GHC bindist (e.g. ghc-8.4.3-x86_64-fedora27-linux.tar.xz) with the following line, run with profiling enabled, shows a memory consumption of ~266 MB:
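The original one-liner did not survive formatting; a plausible reconstruction, assuming the public Codec.Archive.Tar API and a hypothetical file name, is:

```haskell
-- Hypothetical repro: unpack the decompressed bindist with profiling,
-- e.g.  ghc -O2 -prof -fprof-auto Repro.hs
--       ./Repro +RTS -s -hc
import qualified Codec.Archive.Tar as Tar
import qualified Data.ByteString.Lazy as BL

main :: IO ()
main = Tar.unpack "unpacked" . Tar.read
   =<< BL.readFile "ghc-8.4.3-x86_64-fedora27-linux.tar"
```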
The profiling info is attached here. All I see is PINNED memory. I can't make any sense out of it.
This also happens in my fork (which is the only tar package that can extract GHC bindists without errors): https://hackage.haskell.org/package/tar-bytestring
This does not happen with the 01-index.tar archive (which is probably packed with tar? GHC bindists are not).