EXC_BAD_ACCESS crash in try_gc when under heavy load #1483

agarman · 2016-12-23T20:00:34Z

I'm getting this intermittent error when sending load into a simple web server I created in Pony.

(lldb) target create "./pony0.10.0"
Current executable set to './pony0.10.0' (x86_64).
(lldb) settings set -- target.run-args  "8080"
(lldb) r
Process 19060 launched: './pony0.10.0' (x86_64)
Listening on ::1:8080
Process 19060 stopped
* thread #2: tid = 0x67912, 0x0000000100036e06 pony0.10.0`pony_traceunknown + 9, stop reason = EXC_BAD_ACCESS (code=1, address=0x74736fa8)
    frame #0: 0x0000000100036e06 pony0.10.0`pony_traceunknown + 9
pony0.10.0`pony_traceunknown:
->  0x100036e06 <+9>:  cmpq   $0x0, 0x40(%rdx)
    0x100036e0b <+14>: je     0x100036e14               ; <+23>
    0x100036e0d <+16>: movq   0x18(%rdi), %rax
    0x100036e11 <+20>: popq   %rbp
(lldb) bt
* thread #2: tid = 0x67912, 0x0000000100036e06 pony0.10.0`pony_traceunknown + 9, stop reason = EXC_BAD_ACCESS (code=1, address=0x74736fa8)
  * frame #0: 0x0000000100036e06 pony0.10.0`pony_traceunknown + 9
    frame #1: 0x0000000100001f26 pony0.10.0`Array_u3_t2_String_val_String_val_collections__MapEmpty_val_collections__MapDeleted_val_Trace + 118
    frame #2: 0x00000001000361cc pony0.10.0`ponyint_gc_handlestack + 70
    frame #3: 0x00000001000329f5 pony0.10.0`try_gc + 249
    frame #4: 0x000000010003868e pony0.10.0`run_thread + 280
    frame #5: 0x00007fffcc735aab libsystem_pthread.dylib`_pthread_body + 180
    frame #6: 0x00007fffcc7359f7 libsystem_pthread.dylib`_pthread_start + 286
    frame #7: 0x00007fffcc7351fd libsystem_pthread.dylib`thread_start + 13

Here is the minified version of code I'm running.

use "net/http"
use "time"

actor Main
    new create(env: Env) =>
    let service = try env.args(1) else "50000" end
    let logger = DiscardLog
    let handler = Handler.create("1") //.create(redis)

    let auth = try
      env.root as AmbientAuth
    else
      env.out.print("unable to use network")
      return
    end

    Server(auth, Info(env), handler, logger
      where service=service, reversedns=auth
    )

class Info
  let _env: Env

  new iso create(env: Env) => _env = env

  fun listening(server: Server ref) =>
    try
      (let host, let service) = server.local_address().name()
      _env.out.print("Listening on " + host + ":" + service)
    else
      server.dispose()
    end

  fun not_listening(server: Server ref) => None

  fun closed(server: Server ref) => None

class Handler is RequestHandler
  let _x: String val

  new val create(x: String val) =>
    _x = x

  fun now(): String =>
    Date(where seconds=Time.seconds())
    .format("%FT%TZ")
  fun apply(request: Payload) =>
    let result: String =
      match request.url.path
        | "/nr" => now()
        else "ponyc thinks it can get here. It can't!"
      end

    let response = Payload.response()
    response.add_chunk(result)
    response.add_chunk("\n")

    (consume request).respond(consume response)

Repeatedly running wrk with following options produce the issue.

wrk -t3 -c800 -d30s http://127.0.0.1:8080/nr

Information about my environment.

$ uname -a
Darwin Admins-MacBook-Pro.local 16.3.0 Darwin Kernel Version 16.3.0: Thu Nov 17 20:23:58 PST 2016; root:xnu-3789.31.2~1/RELEASE_X86_64 x86_64

$ ponyc --version
0.10.0 [release]
compiled with: llvm 3.9.0 -- Apple LLVM version 8.0.0 (clang-800.0.42.1)

The text was updated successfully, but these errors were encountered:

Praetonus · 2016-12-24T23:33:07Z

I strongly suspect this is the same GC issue as #1118.

SeanTAllen · 2016-12-25T17:47:46Z

@Praetonus agreed.

agarman · 2017-01-11T15:39:29Z

I tried branch fix-1118 with my code. There is still an issue.

Process 69308 launched: '/Users/agarman/src/ponyc/http-test' (x86_64)
Listening on ::1:50000
Process 69308 stopped
* thread #2: tid = 0x1f489b0, 0x0000000100025ea6 http-test`ponyint_gc_markimmutable + 326, stop reason = EXC_BAD_ACCESS (code=EXC_I386_GPFLT)
    frame #0: 0x0000000100025ea6 http-test`ponyint_gc_markimmutable + 326
http-test`ponyint_gc_markimmutable:
->  0x100025ea6 <+326>: movq   0x20(%r8), %rcx
    0x100025eaa <+330>: testq  %rcx, %rcx
    0x100025ead <+333>: je     0x100025f60               ; <+512>
    0x100025eb3 <+339>: movq   %rcx, -0x30(%rbp)
(lldb) bt
* thread #2: tid = 0x1f489b0, 0x0000000100025ea6 http-test`ponyint_gc_markimmutable + 326, stop reason = EXC_BAD_ACCESS (code=EXC_I386_GPFLT)
  * frame #0: 0x0000000100025ea6 http-test`ponyint_gc_markimmutable + 326
    frame #1: 0x000000010002a4bb http-test`run_thread + 1515
    frame #2: 0x00007fffcc735aab libsystem_pthread.dylib`_pthread_body + 180
    frame #3: 0x00007fffcc7359f7 libsystem_pthread.dylib`_pthread_start + 286
    frame #4: 0x00007fffcc7351fd libsystem_pthread.dylib`thread_start + 13

agarman · 2017-03-11T21:06:53Z

I can no longer reproduce with:

0.11.0 [release]
compiled with: llvm 3.9.1 -- Apple LLVM version 8.0.0 (clang-800.0.42.1)

Also latency is much improved:

2.5ms in 0.11.0 versus 25ms in 0.10.0
99% of <4ms versus <60ms
only downside is that the 1% maxed at 250ms versus 125ms in 0.10.0

Closing the issue.

agarman · 2017-03-15T13:03:38Z

Reopening after 0.11.1 release

$ ponyc --version
0.11.1 [release]
compiled with: llvm 3.8.1 -- Apple LLVM version 8.0.0 (clang-800.0.42.1)

$ uname -a
Darwin Andrews-MacBook-Pro.local 16.4.0 Darwin Kernel Version 16.4.0: Thu Dec 22 22:53:21 PST 2016; root:xnu-3789.41.3~3/RELEASE_X86_64 x86_64

$ lldb -f httptest
(lldb) target create "httptest"
Current executable set to 'httptest' (x86_64).
(lldb) r
Process 25651 launched: '/Users/agarman/src/web-tests/httptest/httptest' (x86_64)
Process 25651 stopped
* thread #2: tid = 0x93fbb3, 0x0000000100023a13 httptest`ponyint_gc_release + 199, stop reason = EXC_BAD_ACCESS (code=1, address=0x10)
frame #0: 0x0000000100023a13 httptest`ponyint_gc_release + 199
httptest`ponyint_gc_release:
->  0x100023a13 <+199>: subq   %rcx, 0x10(%rax)
0x100023a17 <+203>: movq   0x28(%rbx), %rdx
0x100023a1b <+207>: movq   0x18(%rbx), %rsi
0x100023a1f <+211>: movq   0x20(%rbx), %rcx
(lldb) bt
* thread #2: tid = 0x93fbb3, 0x0000000100023a13 httptest`ponyint_gc_release + 199, stop reason = EXC_BAD_ACCESS (code=1, address=0x10)
* frame #0: 0x0000000100023a13 httptest`ponyint_gc_release + 199
frame #1: 0x000000010001f4b3 httptest`handle_message + 139
frame #2: 0x0000000100025dd6 httptest`run_thread + 262
frame #3: 0x00007fffe5b32aab libsystem_pthread.dylib`_pthread_body + 180
frame #4: 0x00007fffe5b329f7 libsystem_pthread.dylib`_pthread_start + 286
frame #5: 0x00007fffe5b321fd libsystem_pthread.dylib`thread_start + 13
(lldb)

SeanTAllen · 2017-03-15T13:08:45Z

@agarman there's nothing in the 0.11.1 release that should impact on this. I assume that this doesn't always happen?

Are you building from source?

agarman · 2017-03-15T13:13:01Z

use "net/http"
use "time"

// TODO: link hiredis on linux
use "path:/usr/local/Cellar/hiredis/0.13.3/lib/" if osx
use "path:./hiredis" if linux
use "lib:hiredis"


actor Main
  """
  A simple HTTP server.
  """
  new create(env: Env) =>
    let service = try env.args(1) else "50000" end
    let limit = try env.args(2).usize() else 100 end
    let host = "localhost"
    let redis = Redis.connect()

    //let logger = CommonLog(env.out)
    // let logger = ContentsLog(env.out)
    let logger = DiscardLog

    let auth = try
      env.root as AmbientAuth
    else
      env.out.print("unable to use network")
      return
    end

    // Start the top server control actor.
    HTTPServer(
      auth,
      ListenHandler(env),
      BackendMaker.create(env, redis),
      logger
      where service=service, host=host, limit=limit, reversedns=auth)


class ListenHandler
  let _env: Env

  new iso create(env: Env) =>
    _env = env

  fun ref listening(server: HTTPServer ref) =>
    try
      (let host, let service) = server.local_address().name()
    else
      _env.out.print("Couldn't get local address.")
      server.dispose()
    end

  fun ref not_listening(server: HTTPServer ref) =>
    _env.out.print("Failed to listen.")

  fun ref closed(server: HTTPServer ref) =>
    _env.out.print("Shutdown.")


class BackendMaker is HandlerFactory
  let _env: Env
  let _redis: Redis val

  new val create(env: Env, redis: Redis val) =>
    _env = env
    _redis = redis

  fun apply(session: HTTPSession): HTTPHandler^ =>
    BackendHandler.create(_env, session, _redis)


class BackendHandler is HTTPHandler
  """
  Notification class for a single HTTP session.  A session can process
  several requests, one at a time.
  """
  let _env: Env
  let _session: HTTPSession
  let _redis: Redis val

  fun now(): String =>
    Date(where seconds=Time.seconds())
    .format("%FT%TZ")

  fun seen(p: String): String =>
    let x = _redis.get(p)
    if x == "" then
      let now' = now()
      _redis.set(p, now')
      now'
    else
      x
    end

  new ref create(env: Env, session: HTTPSession, redis: Redis val) =>
    """
    Create a context for receiving HTTP requests for a session.
    """
    _env = env
    _session = session
    _redis = redis

  fun ref apply(request: Payload val) =>
    """
    Start processing a request.
    """

    let result: String =
      match request.url.path
        | "/nr" => now() // SEGFAULT when invoking this route.
        | "/ping" => _redis.ping()
        | let p: String => seen(p)
        else "ponyc thinks it can get here. It can't!"
      end

    let response = Payload.response()
    response.add_chunk(result)
    response.add_chunk("\n")

    _session(consume response)

// IGNORE CODE BELOW... Error occurs on testing /nr route.

class Redis
  """
  Real redis code & ffi removed.
  """
  let _ctx: String

  new val connect(host: String = "127.0.0.1", port: U64 = 6379) =>
    _ctx = "ctx"

  fun ping() : String =>
    "pong" 

  fun get(key: String) : String =>
    "got"

  fun set(key: String, value: String) : String =>
    "set"

agarman · 2017-03-15T13:14:14Z

@SeanTAllen this is an intermittent error. It only occurs when load testing.

agarman · 2017-03-15T13:15:33Z

@SeanTAllen I had a series of load tests run for 30 minutes after 0.11.0 release. The error didn't occur. After the 0.11.1 release it occured three times in the first 5 minutes of testing.

agarman · 2017-03-15T13:19:05Z

@SeanTAllen answer your last question... ponyc is from brew. Though when there's a relevant patch, I have tested against a local build of ponyc as well.

SeanTAllen · 2017-03-15T13:25:42Z

@agarman this is very interesting. When did you install 0.11.1 via homebrew, they updated the formula and switched it from using llvm 3.8 to 3.9 yesterday so, perhaps this is an llvm version related issue.

agarman · 2017-03-15T15:30:54Z

@SeanTAllen I did a brew upgrade this morning.
ponyc --version still states that it's using llvm 3.8.1, though llvm 4.0 was installed via brew upgrade.

ilovezfs · 2017-03-15T15:33:44Z

the change was 3.9 -> 3.8 while everything else was upgraded 3.9 -> 4.0

SeanTAllen · 2017-03-15T16:14:38Z

We had an issue with a similar crashing bug with 3.8.1 at Sendence. I'm going to discuss in the sync today if we can/should downgrade to 3.7.1 (in homebrew formulae) and deprecate 3.8.1 support on OSX.

agarman · 2017-03-15T17:24:17Z

The issue is not isolated to llvm 3.8.1:

$ ./build/release/ponyc  --version
0.11.1-036cafa [release]
compiled with: llvm 3.7.1 -- Apple LLVM version 8.0.0 (clang-800.0.42.1)

$ lldb -f httptest 
(lldb) target create "httptest"
Current executable set to 'httptest' (x86_64).
(lldb) r
Process 34033 launched: '/Users/agarman/src/ponyc/httptest' (x86_64)
Process 34033 stopped
* thread #2: tid = 0x95b14f, 0x000000010002627d httptest`ponyint_gc_release + 157, stop reason = EXC_BAD_ACCESS (code=1, address=0x10)
frame #0: 0x000000010002627d httptest`ponyint_gc_release + 157
httptest`ponyint_gc_release:
->  0x10002627d <+157>: subq   %rcx, 0x10(%rax)
0x100026281 <+161>: cmpq   $0x0, 0x18(%r14)
0x100026286 <+166>: je     0x1000262fb               ; <+283>
0x100026288 <+168>: movq   0x20(%r14), %rcx
(lldb) bt
* thread #2: tid = 0x95b14f, 0x000000010002627d httptest`ponyint_gc_release + 157, stop reason = EXC_BAD_ACCESS (code=1, address=0x10)
* frame #0: 0x000000010002627d httptest`ponyint_gc_release + 157
frame #1: 0x000000010001ee07 httptest`handle_message + 263
frame #2: 0x000000010002a8f0 httptest`run_thread + 720
frame #3: 0x00007fffe5b32aab libsystem_pthread.dylib`_pthread_body + 180
frame #4: 0x00007fffe5b329f7 libsystem_pthread.dylib`_pthread_start + 286
frame #5: 0x00007fffe5b321fd libsystem_pthread.dylib`thread_start + 13
(lldb)

SeanTAllen · 2017-03-15T17:29:05Z

Wonderful so it was fixed in 3.9 but existed prior to that.

SeanTAllen · 2017-03-15T17:33:23Z

Thank you for investing time in this @agarman. If you install LLVM 3.9 and build pony from source on OSX, you shouldn't have an issue. We are going to be figuring out how we want to handle this. We need to get LLVM 4 support finalized but beyond that, issues like this with changing LLVM dependencies might happen with homebrew so, we eventually need a solution to that.

ilovezfs · 2017-03-15T17:35:12Z

FWIW: this is why I notified you 27 days ago about the 4.0 problem #1592. This was the only upstream that did not have a 4.0 fix by release date.

agarman · 2017-03-15T18:02:56Z

@SeanTAllen still an issue in 3.9.0 ... there's no release artifact for 3.9.1 for OSX

$ ./build/release/ponyc --version
0.11.1-036cafa [release]
compiled with: llvm 3.9.0 -- Apple LLVM version 8.0.0 (clang-800.0.42.1)

$ lldb -f httptest 
(lldb) target create "httptest"
Current executable set to 'httptest' (x86_64).
(lldb) r
Process 39638 launched: '/Users/agarman/src/ponyc/httptest' (x86_64)
Process 39638 stopped
* thread #2: tid = 0x960ef8, 0x000000010002621d httptest`ponyint_gc_release + 157, stop reason = EXC_BAD_ACCESS (code=1, address=0x10)
frame #0: 0x000000010002621d httptest`ponyint_gc_release + 157
httptest`ponyint_gc_release:
->  0x10002621d <+157>: subq   %rcx, 0x10(%rax)
0x100026221 <+161>: cmpq   $0x0, 0x18(%r14)
0x100026226 <+166>: je     0x10002629b               ; <+283>
0x100026228 <+168>: movq   0x20(%r14), %rcx
(lldb) bt
* thread #2: tid = 0x960ef8, 0x000000010002621d httptest`ponyint_gc_release + 157, stop reason = EXC_BAD_ACCESS (code=1, address=0x10)
* frame #0: 0x000000010002621d httptest`ponyint_gc_release + 157
frame #1: 0x000000010001eda7 httptest`handle_message + 263
frame #2: 0x000000010002a890 httptest`run_thread + 720
frame #3: 0x00007fffe5b32aab libsystem_pthread.dylib`_pthread_body + 180
frame #4: 0x00007fffe5b329f7 libsystem_pthread.dylib`_pthread_start + 286
frame #5: 0x00007fffe5b321fd libsystem_pthread.dylib`thread_start + 13
(lldb)

ilovezfs · 2017-03-17T14:03:57Z

I have it depending on 3.9.1 again now: Homebrew/homebrew-core@de282db

SeanTAllen · 2017-03-17T14:08:06Z

awesome. @agarman you should try updating your ponyc via homebrew.

agarman · 2017-03-17T14:21:11Z

$ uname -a 
Darwin Andrews-MacBook-Pro.local 16.4.0 Darwin Kernel Version 16.4.0: Thu Dec 22 22:53:21 PST 2016; root:xnu-3789.41.3~3/RELEASE_X86_64 x86_64

$ ponyc --version
0.11.3 [release]
compiled with: llvm 3.9.1 -- Apple LLVM version 8.0.0 (clang-800.0.42.1)

$ lldb -f httptest
(lldb) target create "httptest"
Current executable set to 'httptest' (x86_64).
(lldb) r
Process 63193 launched: '/Users/agarman/src/web-tests/httptest/httptest' (x86_64)
Process 63193 stopped
* thread #2: tid = 0x9df048, 0x0000000100023a16 httptest`ponyint_gc_release + 199, stop reason = EXC_BAD_ACCESS (code=1, address=0x8)
frame #0: 0x0000000100023a16 httptest`ponyint_gc_release + 199
httptest`ponyint_gc_release:
->  0x100023a16 <+199>: subq   %rcx, 0x8(%rax)
0x100023a1a <+203>: movq   0x28(%rbx), %rdx
0x100023a1e <+207>: movq   0x18(%rbx), %rsi
0x100023a22 <+211>: movq   0x20(%rbx), %rcx
(lldb) bt
* thread #2: tid = 0x9df048, 0x0000000100023a16 httptest`ponyint_gc_release + 199, stop reason = EXC_BAD_ACCESS (code=1, address=0x8)
* frame #0: 0x0000000100023a16 httptest`ponyint_gc_release + 199
frame #1: 0x000000010001f453 httptest`handle_message + 139
frame #2: 0x0000000100025dd2 httptest`run_thread + 262
frame #3: 0x00007fffe5b32aab libsystem_pthread.dylib`_pthread_body + 180
frame #4: 0x00007fffe5b329f7 libsystem_pthread.dylib`_pthread_start + 286
frame #5: 0x00007fffe5b321fd libsystem_pthread.dylib`thread_start + 13
(lldb)

agarman · 2017-03-17T14:26:19Z

Reproduce with:

wrk -t3 -c4000 -d600s --latency http://localhost:50000/nr

SeanTAllen · 2017-03-17T14:29:19Z

sounds like it previously not happening with 3.9 was purely coincidental?

agarman · 2017-03-17T14:34:18Z

(lldb) thread backtrace all
  thread #1: tid = 0x9df032, 0x00007fffe5a49fda libsystem_kernel.dylib`__semwait_signal + 10, queue = 'com.apple.main-thread'
    frame #0: 0x00007fffe5a49fda libsystem_kernel.dylib`__semwait_signal + 10
    frame #1: 0x00007fffe5b34855 libsystem_pthread.dylib`pthread_join + 425
    frame #2: 0x00000001000270a0 pony0.11.0`pony_start + 284
    frame #3: 0x00007fffe591b255 libdyld.dylib`start + 1

* thread #2: tid = 0x9df048, 0x0000000100023a16 pony0.11.0`ponyint_gc_release + 199, stop reason = EXC_BAD_ACCESS (code=1, address=0x8)
  * frame #0: 0x0000000100023a16 pony0.11.0`ponyint_gc_release + 199
    frame #1: 0x000000010001f453 pony0.11.0`handle_message + 139
    frame #2: 0x0000000100025dd2 pony0.11.0`run_thread + 262
    frame #3: 0x00007fffe5b32aab libsystem_pthread.dylib`_pthread_body + 180
    frame #4: 0x00007fffe5b329f7 libsystem_pthread.dylib`_pthread_start + 286
    frame #5: 0x00007fffe5b321fd libsystem_pthread.dylib`thread_start + 13

  thread #3: tid = 0x9df046, 0x00007fffe5a4ae2a libsystem_kernel.dylib`kevent + 10
    frame #0: 0x00007fffe5a4ae2a libsystem_kernel.dylib`kevent + 10
    frame #1: 0x00000001000203fb pony0.11.0`ponyint_asio_backend_dispatch + 80
    frame #2: 0x00007fffe5b32aab libsystem_pthread.dylib`_pthread_body + 180
    frame #3: 0x00007fffe5b329f7 libsystem_pthread.dylib`_pthread_start + 286
    frame #4: 0x00007fffe5b321fd libsystem_pthread.dylib`thread_start + 13

  thread #4: tid = 0x9df047, 0x00007fffe5a4b892 libsystem_kernel.dylib`writev + 10
    frame #0: 0x00007fffe5a4b892 libsystem_kernel.dylib`writev + 10
    frame #1: 0x000000010002498d pony0.11.0`pony_os_writev + 12
    frame #2: 0x0000000100012ea3 pony0.11.0`___lldb_unnamed_symbol406$$pony0.11.0 + 275
    frame #3: 0x0000000100012d77 pony0.11.0`___lldb_unnamed_symbol405$$pony0.11.0 + 583
    frame #4: 0x000000010000318e pony0.11.0`net_TCPConnection_Dispatch + 334
    frame #5: 0x000000010001f4b3 pony0.11.0`handle_message + 235
    frame #6: 0x0000000100025dd2 pony0.11.0`run_thread + 262
    frame #7: 0x00007fffe5b32aab libsystem_pthread.dylib`_pthread_body + 180
    frame #8: 0x00007fffe5b329f7 libsystem_pthread.dylib`_pthread_start + 286
    frame #9: 0x00007fffe5b321fd libsystem_pthread.dylib`thread_start + 13

  thread #5: tid = 0x9df049, 0x0000000100022712 pony0.11.0`detect + 677
    frame #0: 0x0000000100022712 pony0.11.0`detect + 677
    frame #1: 0x0000000100021b5a pony0.11.0`cycle_dispatch + 420
    frame #2: 0x000000010001f4b3 pony0.11.0`handle_message + 235
    frame #3: 0x0000000100025dd2 pony0.11.0`run_thread + 262
    frame #4: 0x00007fffe5b32aab libsystem_pthread.dylib`_pthread_body + 180
    frame #5: 0x00007fffe5b329f7 libsystem_pthread.dylib`_pthread_start + 286
    frame #6: 0x00007fffe5b321fd libsystem_pthread.dylib`thread_start + 13

  thread #6: tid = 0x9df04a, 0x000000010000e740 pony0.11.0`___lldb_unnamed_symbol262$$pony0.11.0 + 416
    frame #0: 0x000000010000e740 pony0.11.0`___lldb_unnamed_symbol262$$pony0.11.0 + 416
    frame #1: 0x000000010000ddd8 pony0.11.0`___lldb_unnamed_symbol255$$pony0.11.0 + 72
    frame #2: 0x000000010000d6a8 pony0.11.0`___lldb_unnamed_symbol251$$pony0.11.0 + 56
    frame #3: 0x0000000100011df4 pony0.11.0`___lldb_unnamed_symbol382$$pony0.11.0 + 212
    frame #4: 0x00000001000123bd pony0.11.0`___lldb_unnamed_symbol391$$pony0.11.0 + 285
    frame #5: 0x00000001000033c0 pony0.11.0`net_TCPConnection_Dispatch + 896
    frame #6: 0x000000010001f4b3 pony0.11.0`handle_message + 235
    frame #7: 0x0000000100025dd2 pony0.11.0`run_thread + 262
    frame #8: 0x00007fffe5b32aab libsystem_pthread.dylib`_pthread_body + 180
    frame #9: 0x00007fffe5b329f7 libsystem_pthread.dylib`_pthread_start + 286
    frame #10: 0x00007fffe5b321fd libsystem_pthread.dylib`thread_start + 13
(lldb)

agarman · 2017-03-17T14:40:23Z

@SeanTAllen yeah, I closed it before increasing load. There were changes by @dipinhora and @pdtwonotes that may have made the issue harder to track down. But it eventually occurs whenever I run the stress test now (sometimes in a few seconds, sometimes after a few minutes).

SeanTAllen · 2017-03-29T20:26:32Z

@agarman can you test this with the latest master?

agarman · 2017-03-30T13:32:11Z

With large number of connections (4000+) the test program hit the EXC_BAD_ACCESS almost immediately. This is regression from previous tests.

Previously, I thought the problem was resolved because I couldn't reproduce with low connection count (<1000). There's a regression here as well. -c400 triggered EXC_BAD_ACCESS in seconds.

Same behavior with --debug compile.

$ ./build/release/ponyc --version
0.12.1-71148ff [release]

$ lldb -f web-test 
(lldb) target create "web-test"
Current executable set to 'web-test' (x86_64).
(lldb) r
Process 18517 launched: '/Users/agarman/src/ponyc/web-test' (x86_64)
Process 18517 stopped
* thread #2: tid = 0x111558d, 0x00000001000257cd web-test`ponyint_gc_release + 157, stop reason = EXC_BAD_ACCESS (code=1, address=0x8)
    frame #0: 0x00000001000257cd web-test`ponyint_gc_release + 157
web-test`ponyint_gc_release:
->  0x1000257cd <+157>: subq   %rcx, 0x8(%rax)
    0x1000257d1 <+161>: cmpq   $0x0, 0x18(%r14)
    0x1000257d6 <+166>: je     0x10002584b               ; <+283>
    0x1000257d8 <+168>: movq   0x20(%r14), %rcx
(lldb) bt
* thread #2: tid = 0x111558d, 0x00000001000257cd web-test`ponyint_gc_release + 157, stop reason = EXC_BAD_ACCESS (code=1, address=0x8)
  * frame #0: 0x00000001000257cd web-test`ponyint_gc_release + 157
    frame #1: 0x000000010001e267 web-test`handle_message + 263
    frame #2: 0x0000000100029b90 web-test`run_thread + 720
    frame #3: 0x00007fffe5b32aab libsystem_pthread.dylib`_pthread_body + 180
    frame #4: 0x00007fffe5b329f7 libsystem_pthread.dylib`_pthread_start + 286
    frame #5: 0x00007fffe5b321fd libsystem_pthread.dylib`thread_start + 13

Theodus · 2017-04-07T15:00:44Z

I can confirm on MacOS 16.4.0

$ ponyc --version
0.12.3-0040b85a8 [debug]
compiled with: llvm 3.9.1 -- Apple LLVM version 8.1.0 (clang-802.0.38)

Process 3628 stopped
* thread #2, stop reason = EXC_BAD_ACCESS (code=1, address=0x8)
    frame #0: 0x0000000100028749 httpserver`ponyint_gc_release(gc=0x00000001197f02a8, aref=0x0000000118d69840) at gc.c:671
   668 	    void* p = obj->address;
   669 	    object_t* obj_local = ponyint_objectmap_getobject(&gc->local, p, &index);
   670 	
-> 671 	    pony_assert(obj_local->rc >= obj->rc);
   672 	    obj_local->rc -= obj->rc;
   673 	  }
   674

SeanTAllen · 2017-04-07T15:02:00Z

Thanks. There's another fix coming down the line that I hope might address this.

agarman · 2017-04-22T18:17:42Z

$ ponyc -version
0.13.1 [release]
compiled with: llvm 3.9.1 -- Apple LLVM version 8.1.0 (clang-802.0.42)

@SeanTAllen - wasn't sure if the fix you mentioned was in 0.13.1 or not, so I tested again...

* thread #2, stop reason = EXC_BAD_ACCESS (code=1, address=0x8)
    frame #0: 0x0000000100022dcf web-test`ponyint_gc_release + 197
web-test`ponyint_gc_release:
->  0x100022dcf <+197>: subq   %rcx, 0x8(%rax)
    0x100022dd3 <+201>: movq   0x28(%rbx), %rdx
    0x100022dd7 <+205>: movq   0x18(%rbx), %rsi
    0x100022ddb <+209>: movq   0x20(%rbx), %rcx
(lldb)

agarman · 2017-04-28T18:15:07Z

#1872

@dipinhora

I'm going to update this commit if it holds true. I'd like some additional testing. @dipinhora @sylvanc can you test #1872 for this. @Cloven can you see if this fixes the issue for you. @agarman @Theodus I don't think this will fix #1483 but can you check and see as there is a chance it might.

agarman · 2017-04-30T18:42:14Z

Fixed by #1876

If you were being facetious, you could describe the Pony runtime as a series of hashmaps that are held together by some code. Hash performance and correctness can have a great impact on everything else in the runtime because they are at the basis of most everything else in the runtime. This change fixes a number of issues that appears to be garbage collection bugs but were in fact, problems with invariant violation in the underlying hash implementation. It should be noted that while the rest of this comment discuss invariant violations that exist in our Robin Hood hash implementation, some of the bugs that this closes predate the Robin Hood implementation. This leads me to believe that the previous implementation had some subtle problem that could occur under some rare interleaving of operations. How this occurred is unknown at this time and probably always will be unless someone wants to go back to the previous version and use what we learned here to diagnose the state of the code at that time. This patch closes issues #1781, #1872, and #1483. It's the result of teamwork amongst myself, Sylvan Clebch and Dipin Hora. History should show that we were all involved in this resolution. The skinny: When garbage collecting items from our hash, that is, removing deleted items to free up space, we can end up violating hash invariants. Previously, one of these invariants was correctly fixed, however, it incorrectly assumed another invariant held but that is not the case. Post garbage collection, if any items have been deleted from our hash, we do an "optimize" operation on each hash item. We check to see if the location the item would hash to is now an empty bucket. If it is, we move the item to that location thereby restoring the expected chaining. There is, however, a problem with doing this. It's possible over time to violate another invariant when fixing the first violation. For a given item at a given location in the hash, each item has a probe value. An invariant of our data structure is that items at earlier locations in the hash will always have an equal or lower probe value for that location than items that come later. For example, items: "foo" and "bar". Given a hashmap whose size is 8, where "foo" would made to index 1 and "bar" would map to index "2". When looking at the probe values for "foo" and "bar" at index 1, "foo" would have a probe value of "0" as it is at the location it hashes to whereas "bar" would have a probe value of "7". The value is the number of indexes away from our "natural" hash index that the item is. When search the hash, we can use this probe value to not do a linear search of all indexes for the a given key. Once we find an item whose probe value for a given index is higher than ours, we know that the key can't be in the map past that index. Except our course for when we are restoring invariants after a delete. It's possible, due to the sequential nature of our "optimize" repair step, to violate this "always lower probe value" invariant. The previous implementation of "optimize_item" assumed that in invariant held true. By not detecting the invariant violation and fixing it, we could end up with maps where a key existed in it but it wouldn't be found. When the map in question was an object map used to hold gc'able items, this would result in an error that appears to be a gc error. See #1781, #1872, and #1483. Closes #1781 Closes #1872 Closes #1483

If you were being facetious, you could describe the Pony runtime as a series of hashmaps that are held together by some code. Hash performance and correctness can have a great impact on everything else in the runtime because they are at the basis of most everything else in the runtime. This change fixes a number of issues that appears to be garbage collection bugs but were in fact, problems with invariant violation in the underlying hash implementation. It should be noted that while the rest of this comment discuss invariant violations that exist in our Robin Hood hash implementation, some of the bugs that this closes predate the Robin Hood implementation. This leads me to believe that the previous implementation had some subtle problem that could occur under some rare interleaving of operations. How this occurred is unknown at this time and probably always will be unless someone wants to go back to the previous version and use what we learned here to diagnose the state of the code at that time. This patch closes issues #1781, #1872, and #1483. It's the result of teamwork amongst myself, Sylvan Clebch and Dipin Hora. History should show that we were all involved in this resolution. The skinny: When garbage collecting items from our hash, that is, removing deleted items to free up space, we can end up violating hash invariants. Previously, one of these invariants was correctly fixed, however, it incorrectly assumed another invariant held but that is not the case. Post garbage collection, if any items have been deleted from our hash, we do an "optimize" operation on each hash item. We check to see if the location the item would hash to is now an empty bucket. If it is, we move the item to that location thereby restoring the expected chaining. There is, however, a problem with doing this. It's possible over time to violate another invariant when fixing the first violation. For a given item at a given location in the hash, each item has a probe value. An invariant of our data structure is that items at earlier locations in the hash will always have an equal or lower probe value for that location than items that come later. For example, items: "foo" and "bar". Given a hashmap whose size is 8, where "foo" would made to index 1 and "bar" would map to index "2". When looking at the probe values for "foo" and "bar" at index 1, "foo" would have a probe value of "0" as it is at the location it hashes to whereas "bar" would have a probe value of "7". The value is the number of indexes away from our "natural" hash index that the item is. When search the hash, we can use this probe value to not do a linear search of all indexes for the a given key. Once we find an item whose probe value for a given index is higher than ours, we know that the key can't be in the map past that index. Except our course for when we are restoring invariants after a delete. It's possible, due to the sequential nature of our "optimize" repair step, to violate this "always lower probe value" invariant. The previous implementation of "optimize_item" assumed that in invariant held true. By not detecting the invariant violation and fixing it, we could end up with maps where a key existed in it but it wouldn't be found. When the map in question was an object map used to hold gc'able items, this would result in an error that appears to be a gc error. See #1781, #1872, and #1483. It should be noted, that because of the complex chain of events that needs to occur to trigger this problem that we were unable to devise a unit test to catch this problem. If we had property based testing for the Pony runtime, this most likely would have been caught. Hopefully, PR #1840 to add rapidcheck into Pony happens soon. Closes #1781 Closes #1872 Closes #1483

SeanTAllen · 2017-05-06T12:26:17Z

Fix for this has been released as part of 0.14.0

Praetonus added bug: 1 - needs investigation triggers release Major issue that when fixed, results in an "emergency" release labels Dec 24, 2016

agarman mentioned this issue Jan 11, 2017

Recurse when tracing an iso that has become a val #1507

Merged

agarman closed this as completed Mar 11, 2017

agarman reopened this Mar 15, 2017

ilovezfs mentioned this issue Mar 15, 2017

ponyc: depend on llvm@3.7 Homebrew/homebrew-core#11112

Closed

4 tasks

SeanTAllen added the needs discussion during sync label Mar 17, 2017

SeanTAllen removed the needs discussion during sync label Mar 29, 2017

SeanTAllen mentioned this issue Apr 29, 2017

Possible #1872 issue fix #1876

Closed

SeanTAllen mentioned this issue May 4, 2017

Fix invariant violation in Pony runtime hashes #1886

Merged

SeanTAllen closed this as completed in #1886 May 5, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

EXC_BAD_ACCESS crash in try_gc when under heavy load #1483

EXC_BAD_ACCESS crash in try_gc when under heavy load #1483

agarman commented Dec 23, 2016

Praetonus commented Dec 24, 2016

SeanTAllen commented Dec 25, 2016

agarman commented Jan 11, 2017 •

edited

Loading

agarman commented Mar 11, 2017

agarman commented Mar 15, 2017

SeanTAllen commented Mar 15, 2017

agarman commented Mar 15, 2017 •

edited by jemc

Loading

agarman commented Mar 15, 2017

agarman commented Mar 15, 2017

agarman commented Mar 15, 2017

SeanTAllen commented Mar 15, 2017

agarman commented Mar 15, 2017 •

edited

Loading

ilovezfs commented Mar 15, 2017

SeanTAllen commented Mar 15, 2017

agarman commented Mar 15, 2017

SeanTAllen commented Mar 15, 2017

SeanTAllen commented Mar 15, 2017

ilovezfs commented Mar 15, 2017

agarman commented Mar 15, 2017

ilovezfs commented Mar 17, 2017

SeanTAllen commented Mar 17, 2017

agarman commented Mar 17, 2017

agarman commented Mar 17, 2017

SeanTAllen commented Mar 17, 2017

agarman commented Mar 17, 2017

agarman commented Mar 17, 2017

SeanTAllen commented Mar 29, 2017

agarman commented Mar 30, 2017

Theodus commented Apr 7, 2017

SeanTAllen commented Apr 7, 2017

agarman commented Apr 22, 2017

agarman commented Apr 28, 2017

agarman commented Apr 30, 2017

SeanTAllen commented May 6, 2017

EXC_BAD_ACCESS crash in try_gc when under heavy load #1483

EXC_BAD_ACCESS crash in try_gc when under heavy load #1483

Comments

agarman commented Dec 23, 2016

Praetonus commented Dec 24, 2016

SeanTAllen commented Dec 25, 2016

agarman commented Jan 11, 2017 • edited Loading

agarman commented Mar 11, 2017

agarman commented Mar 15, 2017

SeanTAllen commented Mar 15, 2017

agarman commented Mar 15, 2017 • edited by jemc Loading

agarman commented Mar 15, 2017

agarman commented Mar 15, 2017

agarman commented Mar 15, 2017

SeanTAllen commented Mar 15, 2017

agarman commented Mar 15, 2017 • edited Loading

ilovezfs commented Mar 15, 2017

SeanTAllen commented Mar 15, 2017

agarman commented Mar 15, 2017

SeanTAllen commented Mar 15, 2017

SeanTAllen commented Mar 15, 2017

ilovezfs commented Mar 15, 2017

agarman commented Mar 15, 2017

ilovezfs commented Mar 17, 2017

SeanTAllen commented Mar 17, 2017

agarman commented Mar 17, 2017

agarman commented Mar 17, 2017

SeanTAllen commented Mar 17, 2017

agarman commented Mar 17, 2017

agarman commented Mar 17, 2017

SeanTAllen commented Mar 29, 2017

agarman commented Mar 30, 2017

Theodus commented Apr 7, 2017

SeanTAllen commented Apr 7, 2017

agarman commented Apr 22, 2017

agarman commented Apr 28, 2017

agarman commented Apr 30, 2017

SeanTAllen commented May 6, 2017

agarman commented Jan 11, 2017 •

edited

Loading

agarman commented Mar 15, 2017 •

edited by jemc

Loading

agarman commented Mar 15, 2017 •

edited

Loading