Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Zed service crash caused by large vacuum operation #4940

Closed
philrz opened this issue Dec 15, 2023 · 1 comment · Fixed by #4974
Closed

Zed service crash caused by large vacuum operation #4940

philrz opened this issue Dec 15, 2023 · 1 comment · Fixed by #4974
Labels
bug Something isn't working

Comments

@philrz
Copy link
Contributor

philrz commented Dec 15, 2023

tl;dr

A zed serve can be made to crash and exit if asked to vacuum 300k objects.

Details

Repro is with Zed commit 59b72d8. I bumped into this issue while running tests related to #4934.

After starting a zed serve, I post 300k objects as separate commits using the record-poster tool. Then I delete all but 5 of the records, compact the pool, and initiate a vacuum to clean up the storage consumed by the deleted data.

$ ./record-poster.py --exitafter 300000
Posting: {"ts": "2023-12-15T18:46:15.641456Z", "message": "tripaschal steekkannen boschbok verificative procivic"}
Posting: {"ts": "2023-12-15T18:46:15.645073Z", "message": "Santos cardiotrophia"}
...

$ zed -version
Version: v1.11.1-27-g59b72d88

$ zed delete -where "ts < $(zed query -z 'from TestPool | head 5 | tail 1 | yield ts')"
2Zan02CJGhRiiwjXNZoLr71STL6 delete committed

$ zed manage
{"level":"info","ts":1702668889.402373,"logger":"pool","msg":"updating pool","name":"TestPool","id":"2ZahDG0Ucix60okBvQn0cU4cu6Q","branch":"main","interval":60,"vectors":false}
{"level":"info","ts":1702668890.182494,"logger":"pool","msg":"compaction completed","name":"TestPool","id":"2ZahDG0Ucix60okBvQn0cU4cu6Q","branch":"main","interval":60,"vectors":false,"runs_found":1,"objects_compacted":5,"vectors_created":0}

$ zed vacuum -f
Post "http://localhost:9867/pool/TestPool/revision/main/vacuum": read tcp [::1]:50304->[::1]:9867: read: connection reset by peer

At this point I find the zed serve process has generated the contents of the large/attached dump.txt.gz and exited. Here's a brief excerpt of the very top of the dump:

runtime/cgo: pthread_create failed: Resource temporarily unavailable
SIGABRT: abort
PC=0x7ff8196c57a2 m=2 sigcode=0

goroutine 0 [idle]:
runtime: g 0: unknown pc 0x7ff8196c57a2
stack: frame={sp:0x70000ee2cc18, fp:0x0} stack=[0x70000e62d340,0x70000ee2cf40)
0x000070000ee2cb18:  0x0000000000000000  0x0000000000000000 
0x000070000ee2cb28:  0x0000000000000000  0x0000000000000000 
0x000070000ee2cb38:  0x00007ff85cd270f0  0x000000000000003c 
0x000070000ee2cb48:  0x00007ff85cd263e0  0x000070000ee2cb70 
0x000070000ee2cb58:  0x00007ff8195ae977  0x0000000000000045 
0x000070000ee2cb68:  0x0000000002a98fe0  0x000070000ee2cba0 
0x000070000ee2cb78:  0x00007ff8195b08d8  0x00007ff85cd270f0 
0x000070000ee2cb88:  0x000070000ee2ccb0  0x0000000000000023 
0x000070000ee2cb98:  0x0000000000800000  0x000070000ee2cc80 
0x000070000ee2cba8:  0x00007ff8195b087a  0x0000000000000000 
0x000070000ee2cbb8:  0x000070081162a000  0x00007ff819621302 
0x000070000ee2cbc8:  0x00007ff8592ea790  0x0000000000000000 
0x000070000ee2cbd8:  0x0000000001312d00 <net/http.(*Request).ParseMultipartForm+0x00000000000003e0>  0x00007ff85b878190 
0x000070000ee2cbe8:  0x000000000000003c  0x000070000ee2cc10 
0x000070000ee2cbf8:  0x00007ff8196bfea6  0x000000000000003c 
0x000070000ee2cc08:  0x0000000000000000  0x000070000ee2cc50 
0x000070000ee2cc18: <0x00007ff8196fdf30  0x000070000ee2d000 
0x000070000ee2cc28:  0x000070000ee2ccb0  0x000070000ee2cc58 
0x000070000ee2cc38:  0x0000000000000003  0x000070000ee2cc80 
0x000070000ee2cc48:  0x00007ff81961ca49  0x000070000ee2cca0 
0x000070000ee2cc58:  0x00000000fffff9df  0xffffffff00000018 
0x000070000ee2cc68:  0x00007ff85cd270f0  0x0000000000000023 
0x000070000ee2cc78:  0x0000000000800000  0x000070000ee2cca0 
0x000070000ee2cc88:  0x00000000010011be <_cgo_sys_thread_start.cold.1+0x000000000000002e>  0x0000000000000000 
0x000070000ee2cc98:  0x000070000ee2ccf8  0x000070000ee2cd20 
0x000070000ee2cca8:  0x000000000100115e <_cgo_sys_thread_start+0x000000000000009e>  0x0000000054484441 
0x000070000ee2ccb8:  0x0000000000000000  0x0000000000000000 
0x000070000ee2ccc8:  0x0000000000800000  0x00000000000008ff 
0x000070000ee2ccd8:  0x0000000010010101  0x0000000000000000 
0x000070000ee2cce8:  0x0000000000000000  0x0000000000000000 
0x000070000ee2ccf8:  0xffffffff00000000  0x000070000ee2cd98 
0x000070000ee2cd08:  0x0000000000000100  0x000000c0000069c0 
runtime: g 0: unknown pc 0x7ff8196c57a2
stack: frame={sp:0x70000ee2cc18, fp:0x0} stack=[0x70000e62d340,0x70000ee2cf40)
0x000070000ee2cb18:  0x0000000000000000  0x0000000000000000 
0x000070000ee2cb28:  0x0000000000000000  0x0000000000000000 
0x000070000ee2cb38:  0x00007ff85cd270f0  0x000000000000003c 
0x000070000ee2cb48:  0x00007ff85cd263e0  0x000070000ee2cb70 
0x000070000ee2cb58:  0x00007ff8195ae977  0x0000000000000045 
0x000070000ee2cb68:  0x0000000002a98fe0  0x000070000ee2cba0 
0x000070000ee2cb78:  0x00007ff8195b08d8  0x00007ff85cd270f0 
0x000070000ee2cb88:  0x000070000ee2ccb0  0x0000000000000023 
0x000070000ee2cb98:  0x0000000000800000  0x000070000ee2cc80 
0x000070000ee2cba8:  0x00007ff8195b087a  0x0000000000000000 
0x000070000ee2cbb8:  0x000070081162a000  0x00007ff819621302 
0x000070000ee2cbc8:  0x00007ff8592ea790  0x0000000000000000 
0x000070000ee2cbd8:  0x0000000001312d00 <net/http.(*Request).ParseMultipartForm+0x00000000000003e0>  0x00007ff85b878190 
0x000070000ee2cbe8:  0x000000000000003c  0x000070000ee2cc10 
0x000070000ee2cbf8:  0x00007ff8196bfea6  0x000000000000003c 
0x000070000ee2cc08:  0x0000000000000000  0x000070000ee2cc50 
0x000070000ee2cc18: <0x00007ff8196fdf30  0x000070000ee2d000 
0x000070000ee2cc28:  0x000070000ee2ccb0  0x000070000ee2cc58 
0x000070000ee2cc38:  0x0000000000000003  0x000070000ee2cc80 
0x000070000ee2cc48:  0x00007ff81961ca49  0x000070000ee2cca0 
0x000070000ee2cc58:  0x00000000fffff9df  0xffffffff00000018 
0x000070000ee2cc68:  0x00007ff85cd270f0  0x0000000000000023 
0x000070000ee2cc78:  0x0000000000800000  0x000070000ee2cca0 
0x000070000ee2cc88:  0x00000000010011be <_cgo_sys_thread_start.cold.1+0x000000000000002e>  0x0000000000000000 
0x000070000ee2cc98:  0x000070000ee2ccf8  0x000070000ee2cd20 
0x000070000ee2cca8:  0x000000000100115e <_cgo_sys_thread_start+0x000000000000009e>  0x0000000054484441 
0x000070000ee2ccb8:  0x0000000000000000  0x0000000000000000 
0x000070000ee2ccc8:  0x0000000000800000  0x00000000000008ff 
0x000070000ee2ccd8:  0x0000000010010101  0x0000000000000000 
0x000070000ee2cce8:  0x0000000000000000  0x0000000000000000 
0x000070000ee2ccf8:  0xffffffff00000000  0x000070000ee2cd98 
0x000070000ee2cd08:  0x0000000000000100  0x000000c0000069c0 

goroutine 1 [semacquire, 49 minutes]:
runtime.gopark(0x4ae42a40?, 0x3fd5f18?, 0xc0?, 0x0?, 0xc000100480?)
	/usr/local/opt/go/libexec/src/runtime/proc.go:398 +0xce fp=0xc000499820 sp=0xc000499800 pc=0x103ddce
runtime.goparkunlock(...)
	/usr/local/opt/go/libexec/src/runtime/proc.go:404
runtime.semacquire1(0xc0005a1898, 0x0?, 0x1, 0x0, 0x0?)
	/usr/local/opt/go/libexec/src/runtime/sema.go:160 +0x218 fp=0xc000499888 sp=0xc000499820 pc=0x104f258
sync.runtime_Semacquire(0xe8?)
	/usr/local/opt/go/libexec/src/runtime/sema.go:62 +0x25 fp=0xc0004998c0 sp=0xc000499888 pc=0x106d1a5
sync.(*WaitGroup).Wait(0xc0005a1860?)
	/usr/local/opt/go/libexec/src/sync/waitgroup.go:116 +0x48 fp=0xc0004998e8 sp=0xc0004998c0 pc=0x108fe08
github.com/brimdata/zed/pkg/httpd.(*Server).Wait(...)
	/Users/phil/work/zed/pkg/httpd/server.go:59
github.com/brimdata/zed/cmd/zed/serve.(*Command).Run(0xc0000fdc20, {0x2a88c80?, 0x2?, 0x2?})
	/Users/phil/work/zed/cmd/zed/serve/command.go:122 +0x879 fp=0xc000499d00 sp=0xc0004998e8 pc=0x23f24d9
github.com/brimdata/zed/pkg/charm.path.run({0xc0004020f0?, 0x2, 0x2}, {0xc000040290?, 0x0, 0x0?})
	/Users/phil/work/zed/pkg/charm/path.go:11 +0x79 fp=0xc000499dc8 sp=0xc000499d00 pc=0x228ef79
github.com/brimdata/zed/pkg/charm.(*Spec).ExecRoot(0xc000434380?, {0xc000040250, 0x5, 0x5})
	/Users/phil/work/zed/pkg/charm/charm.go:63 +0x3f fp=0xc000499e28 sp=0xc000499dc8 pc=0x228c8ff
main.main()
	/Users/phil/work/zed/cmd/zed/main.go:67 +0xfbb fp=0xc000499f40 sp=0xc000499e28 pc=0x23f791b
runtime.main()
	/usr/local/opt/go/libexec/src/runtime/proc.go:267 +0x2bb fp=0xc000499fe0 sp=0xc000499f40 pc=0x103d95b
runtime.goexit()
	/usr/local/opt/go/libexec/src/runtime/asm_amd64.s:1650 +0x1 fp=0xc000499fe8 sp=0xc000499fe0 pc=0x1071a01

goroutine 2 [force gc (idle), 12 minutes]:
runtime.gopark(0x2e79b5b3f5b?, 0x0?, 0x0?, 0x0?, 0x0?)
	/usr/local/opt/go/libexec/src/runtime/proc.go:398 +0xce fp=0xc000074fa8 sp=0xc000074f88 pc=0x103ddce
runtime.goparkunlock(...)
	/usr/local/opt/go/libexec/src/runtime/proc.go:404
runtime.forcegchelper()
	/usr/local/opt/go/libexec/src/runtime/proc.go:322 +0xb3 fp=0xc000074fe0 sp=0xc000074fa8 pc=0x103dc33
runtime.goexit()
	/usr/local/opt/go/libexec/src/runtime/asm_amd64.s:1650 +0x1 fp=0xc000074fe8 sp=0xc000074fe0 pc=0x1071a01
created by runtime.init.6 in goroutine 1
	/usr/local/opt/go/libexec/src/runtime/proc.go:310 +0x1a

goroutine 3 [GC sweep wait]:
runtime.gopark(0x396b001?, 0x396b0c0?, 0xc?, 0x14?, 0x1?)
	/usr/local/opt/go/libexec/src/runtime/proc.go:398 +0xce fp=0xc000075778 sp=0xc000075758 pc=0x103ddce
runtime.goparkunlock(...)
	/usr/local/opt/go/libexec/src/runtime/proc.go:404
runtime.bgsweep(0x0?)
	/usr/local/opt/go/libexec/src/runtime/mgcsweep.go:321 +0xdf fp=0xc0000757c8 sp=0xc000075778 pc=0x10287bf
runtime.gcenable.func1()
	/usr/local/opt/go/libexec/src/runtime/mgc.go:200 +0x25 fp=0xc0000757e0 sp=0xc0000757c8 pc=0x101d905
runtime.goexit()
	/usr/local/opt/go/libexec/src/runtime/asm_amd64.s:1650 +0x1 fp=0xc0000757e8 sp=0xc0000757e0 pc=0x1071a01
created by runtime.gcenable in goroutine 1
	/usr/local/opt/go/libexec/src/runtime/mgc.go:200 +0x66

goroutine 4 [GC scavenge wait]:
runtime.gopark(0xbfeb88?, 0xbe3449?, 0x0?, 0x0?, 0x0?)
	/usr/local/opt/go/libexec/src/runtime/proc.go:398 +0xce fp=0xc000075f70 sp=0xc000075f50 pc=0x103ddce
runtime.goparkunlock(...)
	/usr/local/opt/go/libexec/src/runtime/proc.go:404
runtime.(*scavengerState).park(0x396c4e0)
	/usr/local/opt/go/libexec/src/runtime/mgcscavenge.go:425 +0x49 fp=0xc000075fa0 sp=0xc000075f70 pc=0x1026029
runtime.bgscavenge(0x0?)
	/usr/local/opt/go/libexec/src/runtime/mgcscavenge.go:658 +0x59 fp=0xc000075fc8 sp=0xc000075fa0 pc=0x10265d9
runtime.gcenable.func2()
	/usr/local/opt/go/libexec/src/runtime/mgc.go:201 +0x25 fp=0xc000075fe0 sp=0xc000075fc8 pc=0x101d8a5
runtime.goexit()
	/usr/local/opt/go/libexec/src/runtime/asm_amd64.s:1650 +0x1 fp=0xc000075fe8 sp=0xc000075fe0 pc=0x1071a01
created by runtime.gcenable in goroutine 1
	/usr/local/opt/go/libexec/src/runtime/mgc.go:201 +0xa5

goroutine 5 [finalizer wait, 49 minutes]:
runtime.gopark(0x198?, 0x278e560?, 0x1?, 0xef?, 0x0?)
	/usr/local/opt/go/libexec/src/runtime/proc.go:398 +0xce fp=0xc000074620 sp=0xc000074600 pc=0x103ddce
runtime.runfinq()
	/usr/local/opt/go/libexec/src/runtime/mfinal.go:193 +0x107 fp=0xc0000747e0 sp=0xc000074620 pc=0x101c927
runtime.goexit()
	/usr/local/opt/go/libexec/src/runtime/asm_amd64.s:1650 +0x1 fp=0xc0000747e8 sp=0xc0000747e0 pc=0x1071a01
created by runtime.createfing in goroutine 1
	/usr/local/opt/go/libexec/src/runtime/mfinal.go:163 +0x3d
...

My notes indicate I originally reproduced the problem after having only posted 100k objects. On a more recent attempt with 100k it didn't repro, but it's repro'ed reliably every time with 300k. I also recognize this is a particularly abusive test, but the community zync user I was helping with #4934 had a pool with about 12k objects to vacuum, so this test isn't too far off from what we might experience in the wild.

@philrz
Copy link
Contributor Author

philrz commented Jan 10, 2024

Verified in Zed commit a5ec530.

After loading with 300k commits like before, I can now delete & vacuum without crashing.

$ zed -version
Version: v1.12.0-20-ga5ec5304

$ zed delete -where "ts < $(zed query -z 'from TestPool | head 5 | tail 1 | yield ts')"
2ajyc6ruLPov0Aj3gQHJJSxmiJg delete committed

$ zed manage
{"level":"info","ts":1704846384.195723,"logger":"pool","msg":"updating pool","name":"TestPool","id":"2ajthiimxtAdFh1J1dNH1BVe3cK","branch":"main","interval":60,"vectors":false}
{"level":"info","ts":1704846384.206878,"logger":"pool","msg":"compaction completed","name":"TestPool","id":"2ajthiimxtAdFh1J1dNH1BVe3cK","branch":"main","interval":60,"vectors":false,"runs_found":1,"objects_compacted":5,"vectors_created":0}

$ zed vacuum -f
vacuumed 300000 objects

Thanks @mattnibs!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant