nsqd: mystery bug #346

tj · 2014-05-19T18:57:57Z

not much info yet, basically it only happens when one of our uploads to s3 fails, but we're not even REQing much in this case (300 or so), so it's not overwhelming nsqd like I had previously thought

2014/05/19 18:50:16 LOOKUPD(10.0.1.172:4160): sending heartbeat
2014/05/19 18:50:16 LOOKUPD(10.0.1.173:4160): sending heartbeat
2014/05/19 18:50:25 PROTOCOL(V2): [10.0.1.62:37278] exiting ioloop
2014/05/19 18:50:25 ERROR: client(10.0.1.62:37278) - read tcp 10.0.1.62:37278: i/o timeout

followed by ~50:

E_REQ_FAILED REQ 067bbe8f8d422005 failed ID not in flight - ID not in flight

pretty positive it's something I'm doing but I'll keep updating.

Ahh just got an EPIPE on the client end, and nsqadmin does seem to have periods where it'll display 0 (instead of ~3,000,000) for the topic/chan, there's no sign of nsqd going down though and cpu usage is minimal

followed by:

2014/05/19 19:06:58 PROTOCOL(V2): [10.0.1.62:37739] exiting ioloop
2014/05/19 19:06:58 ERROR: client(10.0.1.62:37739) - read tcp 10.0.1.62:37739: i/o timeout
2014/05/19 19:06:58 PROTOCOL(V2): [10.0.1.62:37739] exiting messagePump
2014/05/19 19:06:58 PROTOCOL(V2): [10.0.1.62:37739] messagePump error - use of closed network connection

The text was updated successfully, but these errors were encountered:

mreiferson · 2014-05-19T19:57:18Z

@visionmedia curious what OS you're running on? also what nsqd version? built with which version of Go?

tj · 2014-05-19T20:23:24Z

hmm interesting nsqd is sitting on a futex(), appears to be locked up, low cpu/mem

tj · 2014-05-19T20:24:08Z

ubuntu@red-extract-1:~$ uname -a
Linux red-extract-1 3.5.0-23-generic #35-Ubuntu SMP Thu Jan 24 13:15:40 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
ubuntu@red-extract-1:~$ nsqd -version
nsqd v0.2.28 (built w/go1.2.1)

mreiferson · 2014-05-19T20:29:16Z

which version of Ubuntu is that?

You mean locked up and it doesn't recover or you mean that it locks up and eventually recovers and you believe that's contributing to the timeouts?

While you're experiencing this issue, if you can capture a CPU profile from the HTTP endpoint /debug/pprof/profile (it will run for 30s and then dump the profile as the HTTP response) and also a goroutine profile from /debug/pprof/goroutine that would be really helpful to start debugging.

Thanks!

tj · 2014-05-19T20:36:17Z

super weird the only activity in our worker seems to be redis, the rest just might be a side-effect of this, I should probably stop trying to shove 1m events into one process but we need the throughput haha. I didn't see this redis issue earlier so this could be it

tj · 2014-05-19T20:45:54Z

think im going to close this actually, finally got a consistent failure going, cpu is pegged in our worker but I still can't really imagine how it's stuck for 1m hmm

mreiferson · 2014-05-19T20:50:13Z

well, if you do suspect nsqd while you're debugging, please grab those profiling outputs as I think it would help diagnose!

Good luck 🔥 🚒

tj · 2014-05-19T20:58:31Z

on a side note i think nsqadmin is timing out requests a little too eagerly as well, making the /stats request locally is working fine but the UI gives me 0 sometimes like I mentioned there, I thought that was a result of blocking but it seems ok

mreiferson · 2014-05-19T21:00:06Z

yea, nsqadmin isn't very flexible or informative when it times out HTTP requests - we should definitely open an issue for that (or just update/tag #323)

tj · 2014-05-19T21:03:19Z

goroutine:

goroutine profile: total 235
1 @ 0x600160 0x5fff59 0x5fcfc4 0x56efef 0x56f49e 0x47c997 0x55643e 0x5543e7 0x419dd0
1 @ 0x419c06 0x40672b 0x406858 0x40150a 0x41789f 0x419dd0
1 @ 0x40d6c6 0x414353 0x419dd0
1 @ 0x40d6c6 0x4268dd 0x419dd0
1 @ 0x40d6c6 0x425772 0x4499ce 0x419dd0
1 @ 0x419c06 0x406412 0x406858 0x40160b 0x419dd0
1 @ 0x419c06 0x407474 0x4070c2 0x486dcc 0x4993e9 0x4a151e 0x419dd0
10 @ 0x419c06 0x407474 0x4070c2 0x47bece 0x419dd0
4 @ 0x419c06 0x406412 0x4068ab 0x497600 0x499679 0x4a151e 0x419dd0
4 @ 0x419c06 0x407474 0x4070c2 0x497579 0x4996a9 0x4a151e 0x419dd0
86 @ 0x419c06 0x424916 0x42427a 0x5166c4 0x516720 0x517aa0 0x526795 0x4d6a30 0x4d75d4 0x487ab0 0x495511 0x419dd0
7 @ 0x419c06 0x407474 0x4070c2 0x47524a 0x419dd0
8 @ 0x419c06 0x406412 0x4068ab 0x474b70 0x4987b9 0x4a151e 0x419dd0
8 @ 0x419c06 0x407474 0x4070c2 0x4755dd 0x4752dd 0x4987e9 0x4a151e 0x419dd0
8 @ 0x419c06 0x407474 0x4070c2 0x4755dd 0x47538d 0x498819 0x4a151e 0x419dd0
85 @ 0x419c06 0x407474 0x4070c2 0x48a04e 0x419dd0
1 @ 0x419c06 0x40619d 0x40680c 0x47503b 0x419dd0
1 @ 0x419c06 0x407474 0x4070c2 0x482dd7 0x499419 0x4a151e 0x419dd0
1 @ 0x419c06 0x424916 0x42427a 0x5166c4 0x516720 0x519832 0x52d0c7 0x52d197 0x4a083a 0x49948c 0x4a151e 0x419dd0
1 @ 0x419c06 0x424916 0x42427a 0x5166c4 0x516720 0x519832 0x52d0c7 0x52d197 0x5565c1 0x49e64b 0x4995ed 0x4a151e 0x419dd0
1 @ 0x419c06 0x407474 0x4070c2 0x494fad 0x499649 0x4a151e 0x419dd0
1 @ 0x419c06 0x4264e4 0x4263b1 0x56e69b 0x47cb01 0x55643e 0x5543e7 0x419dd0
1 @ 0x419c06 0x424916 0x42427a 0x5166c4 0x516770 0x518c02 0x526885 0x4d84d1 0x4d87e4 0x49f76c 0x4885f1 0x488430 0x489fd2 0x419dd0
1 @ 0x40d6c6 0x4088e8 0x408a24 0x60046e 0x419dd0

and this is what I get back from /debug/pprof/profile:

'?q@?p@N?HНA?Ea?A^d=?a?A3?Ba?A

tj · 2014-05-19T21:08:10Z

and the process just exploded OOM haha, I'll try just toning down the v8 GC abuse and see how that goes

mreiferson · 2014-05-19T21:10:03Z

wait, the nsqd process just exploded?

That /debug/pprof/profile output should probably just be redirected to a file, it's binary (email it to me?).

tj · 2014-05-19T21:13:37Z

nope just our node process, I'm going to run it for a while capped at 100k messages and see if this issues pops up at all

tj · 2014-05-19T21:25:42Z

ah fuck, sorry for the noise i forgot v8 has a heap limit of ~1.9G or so, gc was just being a bitch

mreiferson added the question label May 19, 2014

tj closed this as completed May 19, 2014

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

nsqd: mystery bug #346

nsqd: mystery bug #346

tj commented May 19, 2014

mreiferson commented May 19, 2014

tj commented May 19, 2014

tj commented May 19, 2014

mreiferson commented May 19, 2014

tj commented May 19, 2014

tj commented May 19, 2014

mreiferson commented May 19, 2014

tj commented May 19, 2014

mreiferson commented May 19, 2014

tj commented May 19, 2014

tj commented May 19, 2014

mreiferson commented May 19, 2014

tj commented May 19, 2014

tj commented May 19, 2014

nsqd: mystery bug #346

nsqd: mystery bug #346

Comments

tj commented May 19, 2014

mreiferson commented May 19, 2014

tj commented May 19, 2014

tj commented May 19, 2014

mreiferson commented May 19, 2014

tj commented May 19, 2014

tj commented May 19, 2014

mreiferson commented May 19, 2014

tj commented May 19, 2014

mreiferson commented May 19, 2014

tj commented May 19, 2014

tj commented May 19, 2014

mreiferson commented May 19, 2014

tj commented May 19, 2014

tj commented May 19, 2014