
MacOS X Panic when running test for github.com/elastic/beats/libbeat/publisher/queue/spool #8490

Closed
ph opened this issue Sep 28, 2018 · 4 comments

ph commented Sep 28, 2018

This issue was reported by @ruflin during review of PR #8376.

The spooler fails when running the tests on macOS (complete gist at https://gist.github.com/ph/d2fd8eecde3c16ee720939d483e7b654):

12:53:51 command [go test -race -cover -coverprofile /var/folders/n9/rjyq805d3xs95862s8msvlqh0000gp/T/gotestcover-102501828 github.com/elastic/beats/libbeat/publisher/queue/spool]: exit status 1
12:53:51 SIGQUIT: quit
12:53:51 PC=0x108da8b m=0 sigcode=0
12:53:51 
12:53:51 goroutine 0 [idle]:
12:53:51 runtime.mach_semaphore_wait(0x1303, 0x116289a, 0x78000bf1ef, 0x0, 0x0, 0x1b06160, 0x7fff5fbfebd8, 0x1087cf3, 0xffffffffffffffff, 0x1162a8f, ...)
12:53:51 	/var/lib/jenkins/.gvm/versions/go1.10.3.darwin.amd64/src/runtime/sys_darwin_amd64.s:540 +0xb
12:53:51 runtime.semasleep1(0xffffffffffffffff, 0x1162a8f)
12:53:51 make[1]: *** [unit-tests] Error 1
12:53:51 	/var/lib/jenkins/.gvm/versions/go1.10.3.darwin.amd64/src/runtime/os_darwin.go:438 +0x52
12:53:51 runtime.semasleep.func1()
12:53:51 make: *** [testsuite] Error 2
12:53:51 	/var/lib/jenkins/.gvm/versions/go1.10.3.darwin.amd64/src/runtime/os_darwin.go:457 +0x33
12:53:51 runtime.semasleep(0xffffffffffffffff, 0x7fff5fbfec48)
12:53:51 	/var/lib/jenkins/.gvm/versions/go1.10.3.darwin.amd64/src/runtime/os_darwin.go:456 +0x44
12:53:51 runtime.notesleep(0x1b067a8)
12:53:51 	/var/lib/jenkins/.gvm/versions/go1.10.3.darwin.amd64/src/runtime/lock_sema.go:167 +0xe9
12:53:51 runtime.stopm()
12:53:51 	/var/lib/jenkins/.gvm/versions/go1.10.3.darwin.amd64/src/runtime/proc.go:1952 +0xe5
12:53:51 + cleanup
12:53:51 + echo 'Running cleanup...'
12:53:51 runtime.findrunnable(0xc42003cf00, 0x0)
12:53:51 	/var/lib/jenkins/.gvm/versions/go1.10.3.darwin.amd64/src/runtime/proc.go:2415 +0x2c8
12:53:51 + rm -rf /var/folders/n9/rjyq805d3xs95862s8msvlqh0000gp/T/tmp.sfkroiny
12:53:51 runtime.schedule()
12:53:51 	/var/lib/jenkins/.gvm/versions/go1.10.3.darwin.amd64/src/runtime/proc.go:2541 +0x151
12:53:51 runtime.park_m(0xc42007ca80)
12:53:51 	/var/lib/jenkins/.gvm/versions/go1.10.3.darwin.amd64/src/runtime/proc.go:2604 +0x81
12:53:51 runtime.mcall(0x0)
12:53:51 	/var/lib/jenkins/.gvm/versions/go1.10.3.darwin.amd64/src/runtime/asm_amd64.s:351 +0x5b
12:53:51 
12:53:51 goroutine 1 [chan receive, 9 minutes]:
12:53:51 testing.(*T).Run(0xc42015a000, 0x1751b98, 0x13, 0x1766818, 0xc42006dc00)
12:53:51 	/var/lib/jenkins/.gvm/versions/go1.10.3.darwin.amd64/src/testing/testing.go:825 +0x597
12:53:51 testing.runTests.func1(0xc42015a000)
12:53:51 	/var/lib/jenkins/.gvm/versions/go1.10.3.darwin.amd64/src/testing/testing.go:1063 +0xa5
12:53:51 testing.tRunner(0xc42015a000, 0xc42006dd48)
12:53:51 	/var/lib/jenkins/.gvm/versions/go1.10.3.darwin.amd64/src/testing/testing.go:777 +0x16e
12:53:51 testing.runTests(0xc4201253e0, 0x1afa7a0, 0x1, 0x1, 0xc4200e4080)
12:53:51 	/var/lib/jenkins/.gvm/versions/go1.10.3.darwin.amd64/src/testing/testing.go:1061 +0x4e2
12:53:51 testing.(*M).Run(0xc4200e4080, 0x0)
12:53:51 	/var/lib/jenkins/.gvm/versions/go1.10.3.darwin.amd64/src/testing/testing.go:978 +0x2ce
12:53:51 main.main()
12:53:51 	_testmain.go:108 +0x325
12:53:51 

ph commented Sep 28, 2018

Well, this looks more like a deadlock: the job was killed after 10 minutes of execution and generated this trace.

SIGQUIT: quit
 *** Test killed with quit: ran too long (10m0s).

@ph ph assigned urso Sep 28, 2018

ph commented Sep 28, 2018

@urso Let's look at this on Monday. I need to look back a bit more at the code; I wonder if a scheduled write to disk, under some circumstances on macOS (a VM?), is unable to complete and gets stuck in the loop.

Looking at the goroutine dump, we have one goroutine waiting on a write tx and another in the for loop of our Darwin-specific file-sync implementation.

sync.runtime_Semacquire(0xc4202c247c)
12:53:51 	/var/lib/jenkins/.gvm/versions/go1.10.3.darwin.amd64/src/runtime/sema.go:56 +0x39
12:53:51 sync.(*WaitGroup).Wait(0xc4202c2470)
12:53:51 	/var/lib/jenkins/.gvm/versions/go1.10.3.darwin.amd64/src/sync/waitgroup.go:129 +0xb3
12:53:51 github.com/elastic/beats/vendor/github.com/elastic/go-txfile.(*txWriteSync).Wait(0xc4202c2460, 0xc4203fe180, 0x0)
12:53:51 	/var/lib/jenkins/workspace/elastic+beats+pull-request+multijob-darwin/beat/libbeat/label/macosx/src/github.com/elastic/beats/vendor/github.com/elastic/go-txfile/write.go:283 +0x3f
12:53:51 github.com/elastic/beats/vendor/github.com/elastic/go-txfile.(*Tx).tryCommitChanges(0xc420190410, 0x0, 0x0)
12:53:51 	/var/lib/jenkins/workspace/elastic+beats+pull-request+multijob-darwin/beat/libbeat/label/macosx/src/github.com/elastic/beats/vendor/github.com/elastic/go-txfile/tx.go:428 +0xb7b
12:53:51 github.com/elastic/beats/vendor/github.com/elastic/go-txfile.(*Tx).commitChanges(0xc420190410, 0x0, 0x0)
12:53:51 	/var/lib/jenkins/workspace/elastic+beats+pull-request+multijob-darwin/beat/libbeat/label/macosx/src/github.com/elastic/beats/vendor/github.com/elastic/go-txfile/tx.go:323 +0xb5

12:53:51 goroutine 220 [syscall]:
12:53:51 syscall.Syscall(0x5c, 0x6, 0x33, 0x0, 0xc420099940, 0xc420198ce8, 0x10cc49f)
12:53:51 	/var/lib/jenkins/.gvm/versions/go1.10.3.darwin.amd64/src/syscall/asm_darwin_amd64.s:16 +0x5
12:53:51 github.com/elastic/beats/vendor/golang.org/x/sys/unix.FcntlInt(0x6, 0x33, 0x0, 0x10926b5, 0x1095f37, 0xc420198d68)
12:53:51 	/var/lib/jenkins/workspace/elastic+beats+pull-request+multijob-darwin/beat/libbeat/label/macosx/src/github.com/elastic/beats/vendor/golang.org/x/sys/unix/fcntl.go:17 +0x6d
12:53:51 github.com/elastic/beats/vendor/github.com/elastic/go-txfile/internal/vfs/osfs.(*File).doSync(0xc4207f70f0, 0xc4202c2402, 0xc420198d88, 0x1096067)
12:53:51 	/var/lib/jenkins/workspace/elastic+beats+pull-request+multijob-darwin/beat/libbeat/label/macosx/src/github.com/elastic/beats/vendor/github.com/elastic/go-txfile/internal/vfs/osfs/sync_darwin.go:46 +0x6e
12:53:51 github.com/elastic/beats/vendor/github.com/elastic/go-txfile/internal/vfs/osfs.(*File).Sync(0xc4207f70f0, 0x2, 0x29b1330, 0xc4207f70f0)
12:53:51 	/var/lib/jenkins/workspace/elastic+beats+pull-request+multijob-darwin/beat/libbeat/label/macosx/src/github.com/elastic/beats/vendor/github.com/elastic/go-txfile/internal/vfs/osfs/sync_darwin.go:40 +0x46

I remember that we had discussions about error handling in that part of the code.
Maybe I am wrong; if so, I would like to know what you are looking at to debug this issue.


urso commented Dec 11, 2019

We've seen this problem on Travis, Jenkins Linux, and others. The cause is overloaded CI: these are stress tests that fail because fsync blocks the Go runtime too often and for too long, and go test times out tests after 10 minutes. We haven't seen the issue for quite some time, I think.

@urso urso mentioned this issue Dec 11, 2019

jlind23 commented Mar 31, 2022

Backlog grooming: Closing this for now until further activity.

@jlind23 jlind23 closed this as completed Mar 31, 2022