
Memory leak in unpack? #57

Closed · hasufell opened this issue Aug 30, 2020 · 21 comments

hasufell (Member) commented Aug 30, 2020

Running the following command with profiling enabled on a decompressed GHC bindist (e.g. ghc-8.4.3-x86_64-fedora27-linux.tar.xz) shows a memory consumption of ~266 MB:

htar +RTS -p -hc -RTS -x -f ghc-8.4.3-x86_64-fedora27-linux.tar

The profiling info is attached here. All I see is PINNED memory. I can't make any sense out of it.

This also happens in my fork (which is the only tar package that can extract GHC bindists without errors): https://hackage.haskell.org/package/tar-bytestring

This does not happen with:

  • libarchive
  • cabal's 01-index.tar archive (which is probably packed with this tar library? GHC bindists are not)
hasufell (Member Author) commented

@dcoutts

hasufell changed the title from "Memork leak in unpack?" to "Memory leak in unpack?" on Aug 30, 2020
hasufell (Member Author) commented Apr 6, 2021

OK, so I figured it out with help from #haskell. The memory leaks here are expected and cannot be fixed as long as we use lazy ByteString:

let content = LBS.take size (LBS.drop 512 bs)     -- file payload after the 512-byte header
    padding = (512 - size) `mod` 512              -- entries are padded to 512-byte blocks
    bs'     = LBS.drop (512 + size + padding) bs  -- remainder of the archive

The lazy ByteString that points to the file contents and the one that points to the rest (bs' here) share the same thunk. bs' is passed on to the next iteration of unfoldEntries and so blocks stream fusion.

Then BS.writeFile in unpack will force the entire file contents into memory. It will be freed on the next iteration though. So the memory peak of unpacking is determined by the largest file.
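
For example, a 100-byte file gets padding = (512 - 100) `mod` 512 = 412 bytes, so the entry occupies exactly two 512-byte blocks. A minimal sketch of the loop shape described above (hypothetical names, not the actual tar internals; assume the entry size has already been parsed from the 512-byte header):

    import qualified Data.ByteString.Lazy as LBS
    import Data.Int (Int64)

    -- Hypothetical sketch: slice one entry off the front, write it out,
    -- then recurse on the tail. Peak residency is bounded by the largest
    -- entry: writing forces the whole payload while the tail bs' still
    -- references the remainder of the archive.
    unpackLoop :: (LBS.ByteString -> Int64) -> LBS.ByteString -> IO ()
    unpackLoop parseSize bs
      | LBS.null bs = pure ()
      | otherwise = do
          let size    = parseSize bs                     -- from the entry header
              content = LBS.take size (LBS.drop 512 bs)
              padding = (512 - size) `mod` 512
              bs'     = LBS.drop (512 + size + padding) bs
          LBS.writeFile "entry.bin" content              -- hypothetical target path
          unpackLoop parseSize bs'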

I've attempted a prototype of converting to streamly and it worked in constant memory. But I'm having efficiency issues that need sorting out: hasufell@bd21bf9#diff-cafc5a404a211b294e8976f28f99048766774cc6414507cc2f5032dc5b5bc376R71

@Bodigrim I think going with streamly and getting rid of the Entries type is the way to go. I'll do more experimentation.
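
For reference, the Entries type and unfoldEntries have roughly this shape in Codec.Archive.Tar (paraphrased from the tar package; strictness annotations omitted):

    data Entries e
      = Next Entry (Entries e)
      | Done
      | Fail e

    unfoldEntries :: (a -> Either e (Maybe (Entry, a))) -> a -> Entries e

A streaming rewrite would replace this cons-list of entries with an effectful stream, so the tail is produced on demand rather than captured by each iteration.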

Bodigrim (Contributor) commented Apr 7, 2021

I'm afraid that streamly brings in too many dependencies. Any chance we can get away with a more lightweight (and probably low-tech) solution?

hasufell (Member Author) commented Apr 7, 2021

> I'm afraid that streamly brings in too many dependencies.

Why? Is there an upper bound on the number of dependencies?

> Any chance we can get away with a more lightweight (and probably low-tech) solution?

I don't think so.

Bodigrim (Contributor) commented Apr 7, 2021

There is no hard limit. But I am afraid of a situation where cabal-install or hackage-server cannot upgrade to a new GHC because one of tar's dependencies is lagging behind. This is not a dig at streamly (FWIW I think it is excellently maintained), rather a general consideration that such decisions should not be taken lightly.

On the other hand, I do understand that being unable to unpack a GHC tar archive is pretty much a deal breaker :)

CC @emilypi @fgaz @gbaz

hasufell (Member Author) commented Apr 7, 2021

I'm more afraid of tar blowing up users' machines due to memory leaks than I am of adding a well-maintained, high-performance library as a dependency.

emilypi (Member) commented Apr 7, 2021

@Bodigrim I agree that streamly brings in too much. Contrary to what Julian says, there are a few options:

  1. A low-tech, zero non-boot-dependency solution would be vector streams (see the sketch after this list):

    type Octets = Data.Vector.Fusion.Stream.Monadic.Stream IO Word8

  2. A lighter-cost but still very performant solution would be the streaming package, which doesn't add too many non-boot packages: http://hackage.haskell.org/package/streaming. This one has stood the test of production for long-running apps and large streams.
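
A minimal sketch of option 1 (assuming the vector package; octetsFromHandle and the chunk size are illustrative, not an existing API):

    import qualified Data.ByteString as BS
    import qualified Data.Vector.Fusion.Stream.Monadic as S
    import Data.Word (Word8)
    import System.IO (Handle)

    type Octets = S.Stream IO Word8

    -- Hypothetical helper: read the handle in fixed-size chunks and
    -- flatten them into a fusible stream of bytes.
    octetsFromHandle :: Handle -> Octets
    octetsFromHandle h = S.concatMap (S.fromList . BS.unpack) chunks
      where
        chunks :: S.Stream IO BS.ByteString
        chunks = S.unfoldrM step ()
        step () = do
          c <- BS.hGetSome h 32768
          pure (if BS.null c then Nothing else Just (c, ()))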

> Why? Is there an upper bound on the number of dependencies?

Any new dependencies would add new dependencies upstream of 67 other packages. We have to consider them carefully. I'm not fundamentally opposed to adding dependencies, FWIW.

fgaz (Member) commented Apr 7, 2021

Dep tree for reference: [dependency tree image attached]

hasufell (Member Author) commented Apr 7, 2021

> A lighter-cost but still very performant solution would be the streaming package

It's nowhere near streamly: https://github.com/composewell/streaming-benchmarks#monadic-streams

> This one has stood the test of production for long-running apps and large streams.

Streamly is used in production as well.

> Any new dependencies would add new dependencies upstream of 67 other packages.

Sorry, what do you mean by 67 packages? AFAICS the only additional packages it would pull in would be roughly network, monad-control and heaps.

> A low-tech, zero non-boot-dependency solution would be vector streams

I've never done anything like that. Would you be volunteering to implement a solution that way?

Edit: also relevant: composewell/streamly#533

emilypi (Member) commented Apr 7, 2021

> It's nowhere near streamly

Yes, benchmarks exist and they say things. Whether they say useful things is a question left up to the implementation, and streamly's benchmarks are not super useful for gauging complicated behaviors. As a rule, anyone implementing solutions claiming to be a panacea needs more vetting than otherwise. To be clear, it's not off the table.

> Streamly is used in production as well

Links?

> Sorry, what do you mean by 67 packages? AFAICS the only additional packages it would pull in would be roughly network, monad-control and heaps.

Reverse-dep lookup says many packages depend on tar, and you would be inserting these new dependencies into their build stream.

> Would you be volunteering to implement a solution that way?

It's not within my bandwidth currently.

hasufell (Member Author) commented Apr 7, 2021

> Yes, benchmarks exist and they say things. Whether they say useful things is a question left up to the implementation, and streamly's benchmarks are not super useful for gauging complicated behaviors.

Tar isn't complicated. And to be fair, it's mostly IO-bound, I'd say. So even using conduit would theoretically be an option.

> Links?

I don't have a complete list.

I've also used it myself in a performance-critical setting of an event-sourced platform.

> Reverse-dep lookup says many packages depend on tar, and you would be inserting these new dependencies into their build stream.

Yes.

> It's not within my bandwidth currently.

Well, the bandwidth and motivation of contributors have an impact on viable options.

gbaz commented Apr 7, 2021

This doesn't appear to be a leak, just an algorithm whose footprint is bounded by the largest file in the tarball, which for many use cases should be just fine. (In particular, many things using this lib will need to have the unpacked files resident in memory anyway, to operate on them.) In my opinion it would suffice simply to document this. A more purely streaming interface will have different tradeoffs, and there's certainly a use case for that too, but that could be provided by another library just fine.

hasufell (Member Author) commented Apr 8, 2021

I strongly disagree with that.

'tar' occupies the most common name that people searching for a proper tar library will come across, but it is currently the worst option: it has a plethora of bugs and fails to correctly unpack most tarballs found in the wild.

The only use case that seems to work somewhat is that of cabal-install, and even there things break if you happen to create a filepath that is too long.

The state of this library is embarrassing, and it is not enough to point out that it works for small archives and corner cases.

It should either become the go-to library, or be incorporated into cabal-install and then abandoned on Hackage, so users don't waste their time tracking down functional bugs and performance issues.

I'm sorry if that sounds harsh, but these things affect people in production who made the wrong choice of relying on this lib without investigating its state. Existing users aren't the only concern. Potential users are too.

emilypi (Member) commented Apr 8, 2021

> I'm sorry if that sounds harsh, but these things affect people in production who made the wrong choice of relying on this lib without investigating its state. Existing users aren't the only concern. Potential users are too.

@Bodigrim, @davean and I just picked this package up from Duncan weeks ago to see about improving it. You are browbeating the people who are actually trying to help improve the state of this library, and it is completely uncalled for and unacceptable that you're acting with such hostility. I have repeatedly seen borderline abusive behavior from you, and I am not going to tolerate it any further. If you can't voice your negativity in a constructive manner, don't voice it at all.

hasufell (Member Author) commented Apr 8, 2021

> @Bodigrim, @davean and I just picked this package up from Duncan weeks ago to see about improving it. You are browbeating the people who are actually trying to help improve the state of this library, and it is completely uncalled for and unacceptable that you're acting with such hostility.

That's untrue. I haven't brought up a single ad-hominem argument, and I have explained why simply documenting the current behavior is not enough: that's too low a standard for a core library.

> I have repeatedly seen borderline abusive behavior from you

That's a strong claim to make in public without proper backup, and I do not appreciate it.

> If you can't voice your negativity in a constructive manner, don't voice it at all.

You mean all the patches I wrote are not constructive? :)

cartazio commented Apr 8, 2021

Let's all take a step back and at least recognize that we all want tar to be better, and that we mostly disagree on the path, not the end point.

I'm willing to put some time into going through some of the outstanding PRs that have languished too long, to help triage them, if that would provide an amicable way forward.

Bodigrim (Contributor) commented Apr 8, 2021

Folks, please let's be kinder to each other.

I suggest we wait some time for composewell/streamly#533. Let's focus on other issues in the meantime.

hasufell (Member Author) commented

Streamly is now split into streamly-core, which only depends on boot libraries: https://hackage.haskell.org/package/streamly-core
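
For illustration, a minimal sketch against the streamly-core surface (Stream.unfoldrM, Stream.fold and Fold.foldl' are from Streamly.Data.Stream and Streamly.Data.Fold; chunksFromHandle and the chunk size are made up for this example):

    import qualified Data.ByteString as BS
    import qualified Streamly.Data.Fold as Fold
    import qualified Streamly.Data.Stream as Stream
    import System.IO (Handle)

    -- Hypothetical helper: the archive as a stream of strict chunks,
    -- produced on demand so nothing retains the whole file.
    chunksFromHandle :: Handle -> Stream.Stream IO BS.ByteString
    chunksFromHandle h = Stream.unfoldrM step ()
      where
        step () = do
          c <- BS.hGetSome h 32768
          pure (if BS.null c then Nothing else Just (c, ()))

    -- Example: total archive size, folded in constant memory.
    totalBytes :: Handle -> IO Int
    totalBytes h =
      Stream.fold (Fold.foldl' (\n c -> n + BS.length c) 0) (chunksFromHandle h)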

Bodigrim (Contributor) commented

@hasufell could you give d94a988 a try? I'm not really sure why ;) but it seems to make things better:

$ cabal run htar -- -x -f ghc-8.4.3-x86_64-fedora27-linux.tar +RTS -s -M30M
   2,742,724,576 bytes allocated in the heap
      14,786,128 bytes copied during GC
       1,970,912 bytes maximum residency (175 sample(s))
         908,576 bytes maximum slop
              23 MiB total memory in use (4 MiB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0       337 colls,     0 par    0.008s   0.008s     0.0000s    0.0001s
  Gen  1       175 colls,     0 par    0.016s   0.019s     0.0001s    0.0003s

  INIT    time    0.001s  (  0.001s elapsed)
  MUT     time    1.565s  (  1.860s elapsed)
  GC      time    0.024s  (  0.027s elapsed)
  EXIT    time    0.000s  (  0.001s elapsed)
  Total   time    1.590s  (  1.890s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    1,752,331,400 bytes per MUT second

  Productivity  98.5% of total user, 98.4% of total elapsed

Bodigrim (Contributor) commented

I've added a CI job to test memory consumption on large files; all seems good.

hasufell (Member Author) commented Jan 2, 2024

It seems to work.
