Memory Usage Explosion by zstd #675
6 comments · 11 replies
-
Decompressing takes memory. Decompressing 9000 streams at once takes a huge amount of memory.
-
Alternatively, content must be encoded with a very small window, but that requires you to be in control of the encoding as well. You can enforce a max decoder window, but decoding will of course fail if that limit cannot be satisfied.
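For example, a minimal sketch assuming both sides use klauspost/compress: the encoder is capped to a small window, and the decoder enforces that same cap via `WithDecoderMaxWindow` (available in recent releases). The 1 MiB figure and the helper names are illustrative choices, not values from this thread.

```go
package example

import (
	"io"

	"github.com/klauspost/compress/zstd"
)

// compressSmallWindow encodes with a 1 MiB window, so any decoder
// needs at most ~1 MiB of history per stream.
func compressSmallWindow(dst io.Writer, src io.Reader) error {
	enc, err := zstd.NewWriter(dst, zstd.WithWindowSize(1<<20))
	if err != nil {
		return err
	}
	if _, err := io.Copy(enc, src); err != nil {
		enc.Close()
		return err
	}
	return enc.Close() // flushes and finalizes the frame
}

// decompressCapped rejects frames that would need more than a
// 1 MiB window instead of silently allocating for them.
func decompressCapped(dst io.Writer, src io.Reader) error {
	dec, err := zstd.NewReader(src, zstd.WithDecoderMaxWindow(1<<20))
	if err != nil {
		return err
	}
	defer dec.Close()
	_, err = io.Copy(dst, dec)
	return err
}
```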
-
I switched to using https://github.com/DataDog/zstd, and it works fine, so the memory issue is specific to this implementation of zstd. In the reduce step, the loader has to read through all the map files because each file contains sorted data, so it's essentially 9000 streams that the reducer merges to achieve a global sort. With DataDog zstd, each stream takes around 1 MiB, so the whole thing still easily fits in memory.
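For context, a minimal sketch of that fan-in setup, assuming the cgo-based DataDog/zstd streaming reader; `wrapStreams` is a hypothetical helper, and the ~1 MiB-per-stream figure comes from the observation above.

```go
package example

import (
	"bufio"
	"io"

	"github.com/DataDog/zstd"
)

// wrapStreams wraps each already-open map file in a streaming decoder
// for the k-way merge in the reduce phase. Every returned reader must
// be Closed once its stream is exhausted; the underlying files are
// closed separately by the caller.
func wrapStreams(files []io.Reader) []io.ReadCloser {
	streams := make([]io.ReadCloser, len(files))
	for i, f := range files {
		streams[i] = zstd.NewReader(bufio.NewReader(f))
	}
	return streams
}
```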
-
I am still unable to reproduce the issue. Are you calling Close when you are done with a decoder? I added a benchmark that tests allocations on the decoder. Once you have Read from it, it will keep the buffer around until you call Close.
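In other words (a minimal sketch, assuming klauspost/compress; `drainFile` is a hypothetical helper), close each decoder as soon as its stream is done so its window buffer can be released:

```go
package example

import (
	"io"
	"os"

	"github.com/klauspost/compress/zstd"
)

func drainFile(path string) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()

	dec, err := zstd.NewReader(f, zstd.WithDecoderConcurrency(1))
	if err != nil {
		return err
	}
	// Close releases the decoder's buffers; without it, every finished
	// stream keeps its window in memory.
	defer dec.Close()

	_, err = io.Copy(io.Discard, dec)
	return err
}
```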
-
I am also facing the same issue. I am trying to decompress Reddit zstd archives from pushshift.io, and they require a 2 GB window size. Decoding works fine with this repo, but memory usage goes up to 4 GB just for decompression.
-
@klauspost Thank you for the suggestion. I will recompress it.
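For reference, a minimal sketch of what that recompression pass could look like, assuming klauspost/compress: decode the large-window archive once, then re-encode it with a small window so later reads stay cheap. The 8 MiB target window and the `recompress` helper are illustrative assumptions, not values from the thread.

```go
package example

import (
	"io"
	"os"

	"github.com/klauspost/compress/zstd"
)

func recompress(srcPath, dstPath string) error {
	in, err := os.Open(srcPath)
	if err != nil {
		return err
	}
	defer in.Close()
	out, err := os.Create(dstPath)
	if err != nil {
		return err
	}
	defer out.Close()

	// Allow the oversized 2 GB window for this one-off pass.
	dec, err := zstd.NewReader(in,
		zstd.WithDecoderConcurrency(1),
		zstd.WithDecoderMaxWindow(2<<30))
	if err != nil {
		return err
	}
	defer dec.Close()

	// Re-encode with an 8 MiB window; decoding this output later needs
	// only ~8 MiB of history instead of 2 GB.
	enc, err := zstd.NewWriter(out, zstd.WithWindowSize(8<<20))
	if err != nil {
		return err
	}
	if _, err := io.Copy(enc, dec); err != nil {
		enc.Close()
		return err
	}
	return enc.Close()
}
```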
-
Hey @klauspost,
I just switched to using zstd compression in the Outserv boot (/ Dgraph bulk) loader. My map phase produces ~9000 files, which are read concurrently by the reduce phase.
I was using snappy before, and it was working just fine. When I try to read these files via zstd, my memory usage just blows up (to 90 GiB).
I'm opening my reader like so:

```go
dec, err := zstd.NewReader(fd, zstd.WithDecoderConcurrency(1), zstd.WithDecoderLowmem(true))
```
So I'm already trying my best to decrease memory usage. Any suggestions?
P.S. Tangential, but perhaps consider using jemalloc for such big allocations instead of Go memory.