
Memory leak in unpack? #57

Closed · hasufell opened this issue Aug 30, 2020 · 21 comments

hasufell (Member) commented Aug 30, 2020

Running the following command with profiling enabled on a decompressed GHC bindist (e.g. ghc-8.4.3-x86_64-fedora27-linux.tar.xz) shows a memory consumption of ~266 MB:

htar +RTS -p -hc -RTS -x -f ghc-8.4.3-x86_64-fedora27-linux.tar

The profiling info is attached here. All I see is PINNED memory. I can't make any sense out of it.

This also happens in my fork (which is the only tar package that can extract GHC bindists without errors): https://hackage.haskell.org/package/tar-bytestring

This does not happen with:

  • libarchive
  • cabal's 01-index.tar archive (which is probably packed with this tar library? GHC bindists are not)
hasufell (Member Author) commented

@dcoutts

hasufell changed the title from "Memork leak in unpack?" to "Memory leak in unpack?" on Aug 30, 2020
hasufell (Member Author) commented Apr 6, 2021

OK, so I figured it out with help from #haskell. The memory leaks here are expected and cannot be fixed as long as we use lazy ByteString:

let content = LBS.take size (LBS.drop 512 bs)     -- file payload after the 512-byte header
    padding = (512 - size) `mod` 512              -- entries are padded to 512-byte blocks
    bs'     = LBS.drop (512 + size + padding) bs  -- remainder of the archive

The lazy ByteString that points to the file contents and the one that points to the rest (bs' here) share the same thunk. bs' is passed on to the next iteration of unfoldEntries and so blocks stream fusion.

Then BS.writeFile in unpack will force the entire file contents into memory. It will be freed on the next iteration though. So the memory peak of unpacking is determined by the largest file.
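
For example, a 100-byte file gets padding = (512 - 100) `mod` 512 = 412 bytes, so the entry occupies exactly two 512-byte blocks. A minimal sketch of the loop shape described above (hypothetical names, not the actual tar internals; assume the entry size has already been parsed from the 512-byte header):

    import qualified Data.ByteString.Lazy as LBS
    import Data.Int (Int64)

    -- Hypothetical sketch: slice one entry off the front, write it out,
    -- then recurse on the tail. Peak residency is bounded by the largest
    -- entry: writing forces the whole payload while the tail bs' still
    -- references the remainder of the archive.
    unpackLoop :: (LBS.ByteString -> Int64) -> LBS.ByteString -> IO ()
    unpackLoop parseSize bs
      | LBS.null bs = pure ()
      | otherwise = do
          let size    = parseSize bs                     -- from the entry header
              content = LBS.take size (LBS.drop 512 bs)
              padding = (512 - size) `mod` 512
              bs'     = LBS.drop (512 + size + padding) bs
          LBS.writeFile "entry.bin" content              -- hypothetical target path
          unpackLoop parseSize bs'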

I've attempted a prototype of converting to streamly and it worked in constant memory. But I'm having efficiency issues that need sorting out: hasufell@bd21bf9#diff-cafc5a404a211b294e8976f28f99048766774cc6414507cc2f5032dc5b5bc376R71

@Bodigrim I think going with streamly and getting rid of the Entries type is the way to go. I'll do more experimentation.
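
For reference, the Entries type and unfoldEntries have roughly this shape in Codec.Archive.Tar (paraphrased from the tar package; strictness annotations omitted):

    data Entries e
      = Next Entry (Entries e)
      | Done
      | Fail e

    unfoldEntries :: (a -> Either e (Maybe (Entry, a))) -> a -> Entries e

A streaming rewrite would replace this cons-list of entries with an effectful stream, so the tail is produced on demand rather than captured by each iteration.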

Bodigrim (Contributor) commented Apr 7, 2021

I'm afraid that streamly brings in too many dependencies. Any chance we can get away with a more lightweight (and probably low-tech) solution?

hasufell (Member Author) commented Apr 7, 2021

> I'm afraid that streamly brings in too many dependencies.

Why? Is there an upper bound on the number of dependencies?

> Any chance we can get away with a more lightweight (and probably low-tech) solution?

I don't think so.

Bodigrim (Contributor) commented Apr 7, 2021

There is no hard limit. But I am afraid of a situation where cabal-install or hackage-server cannot upgrade to a new GHC because one of tar's dependencies is lagging behind. This is not a dig at streamly (FWIW I think it is excellently maintained), rather a general consideration that such decisions should not be taken lightly.

On the other hand, I do understand that being unable to unpack a GHC tar archive is pretty much a deal breaker :)

CC @emilypi @fgaz @gbaz

hasufell (Member Author) commented Apr 7, 2021

I'm more afraid of tar blowing up users' machines due to memory leaks than I am of adding a well-maintained, high-performance library as a dependency.

emilypi (Member) commented Apr 7, 2021

@Bodigrim I agree that streamly brings in too much. Contrary to what Julian says, there are a few options:

  1. A low-tech, zero non-boot-dependency solution would be vector streams (see the sketch after this list):

    type Octets = Data.Vector.Fusion.Stream.Monadic.Stream IO Word8

  2. A lighter-cost but still very performant solution would be the streaming package, which doesn't add too many non-boot packages: http://hackage.haskell.org/package/streaming. This one has stood the test of production for long-running apps and large streams.
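
A minimal sketch of option 1 (assuming the vector package; octetsFromHandle and the chunk size are illustrative, not an existing API):

    import qualified Data.ByteString as BS
    import qualified Data.Vector.Fusion.Stream.Monadic as S
    import Data.Word (Word8)
    import System.IO (Handle)

    type Octets = S.Stream IO Word8

    -- Hypothetical helper: read the handle in fixed-size chunks and
    -- flatten them into a fusible stream of bytes.
    octetsFromHandle :: Handle -> Octets
    octetsFromHandle h = S.concatMap (S.fromList . BS.unpack) chunks
      where
        chunks :: S.Stream IO BS.ByteString
        chunks = S.unfoldrM step ()
        step () = do
          c <- BS.hGetSome h 32768
          pure (if BS.null c then Nothing else Just (c, ()))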

> Why? Is there an upper bound on the number of dependencies?

Any new dependencies would add new dependencies upstream of 67 other packages. We have to consider them carefully. I'm not fundamentally opposed to adding dependencies, FWIW.

fgaz (Member) commented Apr 7, 2021

Dep tree for reference: [dependency tree image attached]

hasufell (Member Author) commented Apr 7, 2021

> A lighter-cost but still very performant solution would be the streaming package

It's nowhere near streamly: https://github.com/composewell/streaming-benchmarks#monadic-streams

> This one has stood the test of production for long-running apps and large streams.

Streamly is used in production as well.

> Any new dependencies would add new dependencies upstream of 67 other packages.

Sorry, what do you mean by 67 packages? AFAICS the only additional packages it would pull in would be roughly network, monad-control and heaps.

> A low-tech, zero non-boot-dependency solution would be vector streams

I've never done anything like that. Would you be volunteering to implement a solution that way?

Edit: also relevant: composewell/streamly#533

emilypi (Member) commented Apr 7, 2021

> It's nowhere near streamly

Yes, benchmarks exist and they say things. Whether they say useful things is a question left up to the implementation, and streamly's benchmarks are not super useful for gauging complicated behaviors. As a rule, anyone implementing solutions claiming to be a panacea needs more vetting than otherwise. To be clear, it's not off the table.

> Streamly is used in production as well

Links?

> Sorry, what do you mean by 67 packages? AFAICS the only additional packages it would pull in would be roughly network, monad-control and heaps.

Reverse-dep lookup says many packages depend on tar, and you would be inserting these new dependencies into their build stream.

> Would you be volunteering to implement a solution that way?

It's not within my bandwidth currently.

hasufell (Member Author) commented Apr 7, 2021

> Yes, benchmarks exist and they say things. Whether they say useful things is a question left up to the implementation, and streamly's benchmarks are not super useful for gauging complicated behaviors.

Tar isn't complicated. And to be fair, it's mostly IO-bound, I'd say. So even using conduit would theoretically be an option.

> Links?

I don't have a complete list.

I've also used it myself in a performance-critical setting of an event-sourced platform.

> Reverse-dep lookup says many packages depend on tar, and you would be inserting these new dependencies into their build stream.

Yes.

> It's not within my bandwidth currently.

Well, the bandwidth and motivation of contributors have an impact on viable options.

gbaz commented Apr 7, 2021

This doesn't appear to be a leak, just an algorithm whose footprint is bounded by the largest file in the tarball, which for many use cases should be just fine. (In particular, many things using this lib will need to have the unpacked files resident in memory anyway, to operate on them.) In my opinion it would suffice simply to document this. A more purely streaming interface will have different tradeoffs, and there's certainly a use case for that too, but that could be provided by another library just fine.

hasufell (Member Author) commented Apr 8, 2021

I strongly disagree with that.

'tar' occupies the most common name that people searching for a proper tar library will come across, but it is currently the worst option: it has a plethora of bugs and fails to correctly unpack most tarballs found in the wild.

The only use case that seems to work somewhat is that of cabal-install, and even there things break if you happen to create a filepath that is too long.

The state of this library is embarrassing, and it is not enough to point out that it works for small archives and corner cases.

It should either become the go-to library, or be incorporated into cabal-install and then abandoned on Hackage, so users don't waste their time tracking down functional bugs and performance issues.

I'm sorry if that sounds harsh, but these things affect people in production who made the wrong choice of relying on this lib without investigating its state. Existing users aren't the only concern. Potential users are too.

emilypi (Member) commented Apr 8, 2021

> I'm sorry if that sounds harsh, but these things affect people in production who made the wrong choice of relying on this lib without investigating its state. Existing users aren't the only concern. Potential users are too.

@Bodigrim, @davean and I just picked this package up from Duncan weeks ago to see about improving it. You are browbeating the people who are actually trying to help improve the state of this library, and it is completely uncalled for and unacceptable that you're acting with such hostility. I have repeatedly seen borderline abusive behavior from you, and I am not going to tolerate it any further. If you can't voice your negativity in a constructive manner, don't voice it at all.

hasufell (Member Author) commented Apr 8, 2021

> @Bodigrim, @davean and I just picked this package up from Duncan weeks ago to see about improving it. You are browbeating the people who are actually trying to help improve the state of this library, and it is completely uncalled for and unacceptable that you're acting with such hostility.

That's untrue. I haven't brought up a single ad-hominem argument, and I have explained why simply documenting the current behavior is not enough: that's too low a standard for a core library.

> I have repeatedly seen borderline abusive behavior from you

That's a strong claim to make in public without proper backup, and I do not appreciate it.

> If you can't voice your negativity in a constructive manner, don't voice it at all.

You mean all the patches I wrote are not constructive? :)

cartazio commented Apr 8, 2021

Let's all take a step back and at least recognize that we all want tar to be better, and that we mostly disagree on the path, not the end point.

I'm willing to put some time into going through some of the outstanding PRs that have languished too long, to help triage them, if that would provide an amicable way forward.

Bodigrim (Contributor) commented Apr 8, 2021

Folks, please let's be kinder to each other.

I suggest we wait some time for composewell/streamly#533. Let's focus on other issues in the meantime.

hasufell (Member Author) commented

Streamly is now split into streamly-core, which only depends on boot libraries: https://hackage.haskell.org/package/streamly-core
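
For illustration, a minimal sketch against the streamly-core surface (Stream.unfoldrM, Stream.fold and Fold.foldl' are from Streamly.Data.Stream and Streamly.Data.Fold; chunksFromHandle and the chunk size are made up for this example):

    import qualified Data.ByteString as BS
    import qualified Streamly.Data.Fold as Fold
    import qualified Streamly.Data.Stream as Stream
    import System.IO (Handle)

    -- Hypothetical helper: the archive as a stream of strict chunks,
    -- produced on demand so nothing retains the whole file.
    chunksFromHandle :: Handle -> Stream.Stream IO BS.ByteString
    chunksFromHandle h = Stream.unfoldrM step ()
      where
        step () = do
          c <- BS.hGetSome h 32768
          pure (if BS.null c then Nothing else Just (c, ()))

    -- Example: total archive size, folded in constant memory.
    totalBytes :: Handle -> IO Int
    totalBytes h =
      Stream.fold (Fold.foldl' (\n c -> n + BS.length c) 0) (chunksFromHandle h)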

Bodigrim (Contributor) commented

@hasufell could you give d94a988 a try? I'm not really sure why ;) but it seems to make things better:

$ cabal run htar -- -x -f ghc-8.4.3-x86_64-fedora27-linux.tar +RTS -s -M30M
   2,742,724,576 bytes allocated in the heap
      14,786,128 bytes copied during GC
       1,970,912 bytes maximum residency (175 sample(s))
         908,576 bytes maximum slop
              23 MiB total memory in use (4 MiB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0       337 colls,     0 par    0.008s   0.008s     0.0000s    0.0001s
  Gen  1       175 colls,     0 par    0.016s   0.019s     0.0001s    0.0003s

  INIT    time    0.001s  (  0.001s elapsed)
  MUT     time    1.565s  (  1.860s elapsed)
  GC      time    0.024s  (  0.027s elapsed)
  EXIT    time    0.000s  (  0.001s elapsed)
  Total   time    1.590s  (  1.890s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    1,752,331,400 bytes per MUT second

  Productivity  98.5% of total user, 98.4% of total elapsed

Bodigrim (Contributor) commented

I've added a CI job to test memory consumption on large files; all seems good.

hasufell (Member Author) commented Jan 2, 2024

It seems to work.
