Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

executeBatch can cause a dead lock to a nil deref causing c.prepMu not to be unlocked. #179

Merged

Conversation

nemosupremo
Copy link
Contributor

Currently in a high load situation flight or flight.info can be nil when execute batch is called (the prepareStatement call can take a long time or even timeout, during this time flight.info can be nil).

As executebatch searches for the prepared statement to reset, it can come across such a statement and panic. This panic causes c.prepMu not to be unlocked causing a dead lock.

@0x6e6562
Copy link
Contributor

This is a an interesting corner case - do you know if there is a tractable way of reproducing the issue?

@nemosupremo
Copy link
Contributor Author

Because golang map iterator is randomized the code path becomes pretty nondeterministic.

To reproduce this you can (while being faithful to production conditions)
1.) Make sure you are only using 1 connection
2.) Add some garbage prepared statements cassandra_test.go's injectInvalidPreparedStatement comes to mind
3.) Place a sleep in conn.go prepareStatement after the exec for 5-10 seconds.
4.) In one goroutine run a query (to create the nil flight)
5.) In the other run an executeBatch the the garbage prepared statements.

Due to the random iteration it may not always happen but essentially what you want is while one routine is in the middle of executing a prepare request for executeBatch to iterate over that prepare request.

@0x6e6562
Copy link
Contributor

The change looks good to me - I might have a look to see if your instructions can reproduce the symptom.

@0x6e6562
Copy link
Contributor

So I had a go at trying to replicate your steps.

Here is the injected sleep:

diff --git a/conn.go b/conn.go
index be78566..3e603fa 100644
--- a/conn.go
+++ b/conn.go
@@ -327,6 +327,9 @@ func (c *Conn) prepareStatement(stmt string, trace Tracer) (*queryInfo, error) {
        c.prepMu.Unlock()

        resp, err := c.exec(&prepareFrame{Stmt: stmt}, trace)
+
+       time.Sleep(5 * time.Second)
+
        if err != nil {
                flight.err = err
        } else {

And here is the test routine:

func TestNilFlight(t *testing.T) {
    session := createSession(t)
    defer session.Close()
    stmt, conn := injectInvalidPreparedStatement(t, session, "test_reprepare_statement")

    for i := 0; i < 10000; i++ {

        var w sync.WaitGroup
        w.Add(2)

        go func() {
            batch := session.NewBatch(UnloggedBatch)
            batch.Query(stmt, "bar")
            if err := conn.executeBatch(batch); err != nil {
                w.Done()
                t.Fatalf("Failed to execute query for reprepare statement: %v", err)
            }
            w.Done()
        }()

        go func() {
            query := session.Query("select host_id from system.local")
            if err := conn.executeQuery(query).Close(); err != nil {
                w.Done()
                t.Fatalf("Failed to execute query: %v", err)
            }
            w.Done()
        }()

        w.Wait()
    }
}

I've run this against the current master (i.e. still containing the bug) but unfortunately I haven't been able to reproduce the error - any idea why?

@nemosupremo
Copy link
Contributor Author

You actually have to muck around a bit more with the internals. For example you need to clear c.prep and call injectInvalidPreparedStatement within the loop (meaning you also have to drop table test_reprepare_statement within the loop as well. But the way I will describe doesn't require repeating the test multiple times (at the expense of real world performance).

  1. Add a time.Sleep(1 * time.Second) before batch := session.NewBatch(UnloggedBatch) in your test function. What we essentially want is -
    1. We issue a fresh query that isn't prepared (and for whatever reason it takes a long time to prepare).
    2. We issue a batch query where our query needs to be reprepared

Now even then using your function the panic may still not occur because the map iterator may always select the right prepared statement sidestepping the error. Still we can cause the error by removing the break like so:

func (c *Conn) executeBatch(batch *Batch) error {
...
    for stmt, flight := range c.prep {
            if bytes.Equal(flight.info.id, x.StatementId) {
                found = true
                delete(c.prep, stmt)
                //break
            }
        }
...

Now since every flight has a different id, and there is only one flight that satisfies the conditional, we can agree that the break is optional and is just a performance benefit (if you have found the right flight you don't have to search the others). If we remove the break we don't have to worry about somehow forcing the invalid flight to come before the one we are looking for. With that change made, using this function:

func TestNilFlight(t *testing.T) {
    session := createSession(t)
    defer session.Close()

    for i := 0; i < 10000; i++ {
        stmt, conn := injectInvalidPreparedStatement(t, session, "test_reprepare_statement")
        var w sync.WaitGroup
        w.Add(2)

        go func() {
            time.Sleep(2 * time.Second)
            batch := session.NewBatch(UnloggedBatch)
            batch.Query(stmt, "bar")
            if err := conn.executeBatch(batch); err != nil {
                w.Done()
                t.Fatalf("Failed to execute query for reprepare statement: %v", err)
            }
            w.Done()
        }()

        go func() {
            query := session.Query("select host_id from system.local")
            if err := conn.executeQuery(query).Close(); err != nil {
                w.Done()
                t.Fatalf("Failed to execute query: %v", err)
            }
            w.Done()
        }()

        w.Wait()
        conn.executeQuery(session.Query("DROP TABLE test_reprepare_statement"))
    }
}

Should panic with

panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xb code=0x1 addr=0x0 pc=0x43cddb]

goroutine 21 [running]:
runtime.panic(0x5bd1e0, 0x7fd428)
    /usr/local/go/src/pkg/runtime/panic.c:266 +0xb6
github.com/channelmeter/gocql.(*Conn).executeBatch(0xc21006f300, 0xc210055870, 0x3d, 0xc210037a40)
    /home/nimi/go/src/github.com/channelmeter/gocql/conn.go:513 +0x85b

on the first try.

@0x6e6562
Copy link
Contributor

If I comment out the break statement on line 514, I do indeed get the same panic on line 511.

@0x6e6562
Copy link
Contributor

That said, I've been running the test with break commented back in, and it exceeds the maximum timeout of the testing framework (600 secs).

@0x6e6562
Copy link
Contributor

I'm not 100% sure that changing the program flow (by commenting out the break statement) is a good way to verify this bug. I'm happier to accept the fact that we need to inject a pause into the actual production code to reproduce the symptom, since adverse scheduling is an issue that you're much more likely to see under high load.

So whilst the change you are suggesting looks pretty good, it would make me hesitate less to merge the patch if either:

  • We could reproduce the issue by only introducing pauses into the program flow, or
  • One of the other maintainers considers the knock on effects of the patch and they are happy with it, or
  • I analyze the patch more myself and end up convincing myself it won't result in a different bug

Sorry that I appear to be blocking this change, but my primary focus is maintaining the stability of the library (even though it could well be more stable with this change, as it seems to be helping you in a real world deployment).

@nemosupremo
Copy link
Contributor Author

The problem isn't the presence of the break, the problem is the iteration order. I'm having a hard time reproducing the error because I can't influence the iteration order of a map.

Consider we have this map (iterated in this order):
{"select * from b":SomeData{id=4}, "update b set a=b":nil},

and lets say we are looking for id=4. We iterate the map, we find SomeData{id=4}, we break and go on as planned.

Now if we were to iterate in this order (maps aren't guaranteed to be any order):
{"update b set a=b":nil, "select * from b":SomeData{id=4}},

and we are still looking for id=4, without the nil check, we try and dereference nil, crash, never unlock the mutex and dead lock.

This function will also crash (doesn't require the removal of the break) but depends on session.Pool.Pick(nil) returning the same connection twice in a row.

func TestNilFlight(t *testing.T) {
    session := createSession(t)
    defer session.Close()

    for i := 0; i < 10000; i++ {
        var w sync.WaitGroup
        w.Add(2)

        go func() {
            query := session.Query("SELECT host_id from system.local")
            if err := session.Pool.Pick(nil).executeQuery(query).Close(); err != nil {
                w.Done()
                t.Fatalf("Failed to execute query: %v", err)
            }
            w.Done()
        }()

        stmt, conn := injectInvalidPreparedStatement(t, session, "test_reprepare_statement")

        go func() {
            time.Sleep(1 * time.Second)
            batch := session.NewBatch(UnloggedBatch)
            batch.Query(stmt, "bar")
            if err := conn.executeBatch(batch); err != nil {
                w.Done()
                t.Fatalf("Failed to execute query for reprepare statement: %v", err)
            }
            w.Done()
        }()

        w.Wait()
        conn.executeQuery(session.Query("DROP TABLE test_reprepare_statement"))
    }
}

@nemosupremo
Copy link
Contributor Author

Secondly, in the batch goroutine, if you recover and the panic does not crash the app, you will no longer be able to execute any more queries on conn (it will hang forever).

@0x6e6562
Copy link
Contributor

I think gocql is probably better off with this change, so what I propose is that we merge this if nobody is against the idea. So if either @phillipCouto, @tux21b or @Zariel don't raise any objections within a reasonable timeframe, I'll merge this.

@phillipCouto
Copy link
Contributor

Ok so I just reviewed it and I think this can fly without a test case. Reason is flight being nil or flight.info being nil shouldn't trigger a panic and gocql should continue to operate and treat it as the flight information as missing and worst case reprepare the statement.

This LGTM.

Really sorry for the long delay on reviewing this.

phillipCouto pushed a commit that referenced this pull request Jun 17, 2014
executeBatch can cause a dead lock to a nil deref causing c.prepMu not to be unlocked.
@phillipCouto phillipCouto merged commit d97fc30 into apache:master Jun 17, 2014
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants