panic in freelist.go #513

xiang90 · 2016-02-16T23:13:15Z

In etcd test suite, we keep on hard killing processes.

Also we are keeping on snapshot by using Tx.Copy.

We found boltdb panics when open right after a hard kill around 1 out of 10000

here is the stack trace

goroutine 1 [running]:
runtime.throw(0x153208, 0x5)
    /usr/local/go/src/runtime/panic.go:527 +0x90 fp=0xc82003bb20 sp=0xc82003bb08
runtime.sigpanic()
    /usr/local/go/src/runtime/sigpanic_unix.go:21 +0x1e4 fp=0xc82003bb70 sp=0xc82003bb20
github.com/boltdb/bolt.(*freelist).read(0xc8200661e0, 0x6496000)
    /Users/xiangli/go/src/github.com/boltdb/bolt/freelist.go:162 +0x27 fp=0xc82003bc10 sp=0xc82003bb70
github.com/boltdb/bolt.Open(0xc8200620c0, 0x51, 0x180, 0x2303a0, 0x51, 0x0, 0x0)
    /Users/xiangli/go/src/github.com/boltdb/bolt/db.go:216 +0x5ee fp=0xc82003bd58 sp=0xc82003bc10
main.main()

Here is the db file: https://storage.googleapis.com/failure-archive/agent2/2016-02-14T18%3A34%3A44Z/agent.etcd/member/snap/db

The text was updated successfully, but these errors were encountered:

xiang90 · 2016-02-16T23:15:31Z

We are using ee4a088.

xiang90 · 2016-02-16T23:26:06Z

After more investigation, I found the size of the file does not seem to be correct. I am not 100% sure this is boltdb issue.

benbjohnson · 2016-02-17T00:40:06Z

@xiang90 There's a bolt.Tx.Size() that you can use with the Tx.Copy so you can verify the correct size.

xiang90 · 2016-02-17T00:41:40Z

@benbjohnson
https://github.com/coreos/etcd/pull/4539/files

It is what we just added. We will report back soon.

xiang90 · 2016-02-17T17:27:41Z

@benbjohnson We reproduced this a few times again. The size we got from the WriteTo is the same as the file we saved. (I plan to get the hash now. But I think probably the hashes will just match)

So the problem is that the file written by tx.WriteTo cannot be reopened. Is is possible that we misused boltdb API that can corrupt a tx?

Any idea about this issue would be appreciated!

xiang90 · 2016-02-19T07:13:10Z

@benbjohnson After more investigation, I found out the a problem.

For all the panic cases, we get a db with highwater mark X in its metadata. Its freelist page is X-1. But the size of db is (X-3)*page_size. So when db tries to access the freelist, it panics.

Here is an example

meta &{magic:3977042669 version:2 pageSize:4096 flags:0 root:{root:70665 sequence:0} freelist:70666 pgid:70667 txid:1509 checksum:2153854730997130652}
access position 289447936 -> pgid 70666
db size 289439744 -> pgid 70664

After adding some logs to writeTo, I found that writeTo writes less than it should.

func (tx *Tx) WriteTo(w io.Writer) (n int64, err error) {
    // Attempt to open reader with WriteFlag
    f, err := os.OpenFile(tx.db.path, os.O_RDONLY|tx.WriteFlag, 0)
    if err != nil {
        return 0, err
    }
    defer func() { _ = f.Close() }()

    // Copy the meta pages.
    tx.db.metalock.Lock()
    n, err = io.CopyN(w, f, int64(tx.db.pageSize*2))

        // DEBUGGING 
        // fmt.Println(meta) -> &{magic:3977042669 version:2 pageSize:4096 flags:0 root:{root:70665 sequence:0} freelist:70666 pgid:70667 txid:1509 checksum:2153854730997130652}

    tx.db.metalock.Unlock()
    if err != nil {
        return n, fmt.Errorf("meta copy: %s", err)
    }

    // Copy data pages.
    wn, err := io.CopyN(w, f, tx.Size()-int64(tx.db.pageSize*2))
    n += wn
    if err != nil {
        return n, err
    }

        // DEBUGGING 
        // fmt.Println(n) -> 289439744. This should be 70667 * 4096 = 289452032

    return n, f.Close()
}

It seems like writeTo write a newer metadata page with an old Tx.

I am not very familiar with the codebase and the logic and I might be wrong. Any help would be appreciated!

xiang90 · 2016-02-19T07:22:57Z

@benbjohnson I feel we should write the meta as the tx.meta rather than doing a copy. But I am not sure.

benbjohnson · 2016-02-19T15:22:39Z

@xiang90 Are you running Tx.WriteTo() from a writeable transaction or a read-only transaction?

xiang90 · 2016-02-19T15:28:06Z

@benbjohnson A readonly one. We cannot afford the cost to block other readers/writers.

This commit changes `Tx.WriteTo()` to use the transaction's in-memory meta page instead of copying from the disk. This is needed because the transaction uses the size from its meta page but writes the current meta page on disk which may have allocated additional pages since the transaction started. Fixes boltdb#513

benbjohnson · 2016-02-19T16:23:10Z

@xiang90 I think you're right about the tx.meta. Can you try the fix here: #515

This commit changes `Tx.WriteTo()` to use the transaction's in-memory meta page instead of copying from the disk. This is needed because the transaction uses the size from its meta page but writes the current meta page on disk which may have allocated additional pages since the transaction started. Fixes boltdb#513

gyuho mentioned this issue Feb 16, 2016

etcdserver: panic after raft applied incoming snapshot etcd-io/etcd#4527

Closed

benbjohnson mentioned this issue Feb 19, 2016

Use tx.meta during Tx.WriteTo() #515

Merged

benbjohnson closed this as completed in #515 Mar 10, 2016

daviian mentioned this issue Oct 16, 2017

Gitea 1.1.3 crashes with SIGBUS go-gitea/gitea#2311

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

panic in freelist.go #513

panic in freelist.go #513

xiang90 commented Feb 16, 2016

xiang90 commented Feb 16, 2016

xiang90 commented Feb 16, 2016

benbjohnson commented Feb 17, 2016

xiang90 commented Feb 17, 2016

xiang90 commented Feb 17, 2016

xiang90 commented Feb 19, 2016

xiang90 commented Feb 19, 2016

benbjohnson commented Feb 19, 2016

xiang90 commented Feb 19, 2016

benbjohnson commented Feb 19, 2016

panic in freelist.go #513

panic in freelist.go #513

Comments

xiang90 commented Feb 16, 2016

xiang90 commented Feb 16, 2016

xiang90 commented Feb 16, 2016

benbjohnson commented Feb 17, 2016

xiang90 commented Feb 17, 2016

xiang90 commented Feb 17, 2016

xiang90 commented Feb 19, 2016

xiang90 commented Feb 19, 2016

benbjohnson commented Feb 19, 2016

xiang90 commented Feb 19, 2016

benbjohnson commented Feb 19, 2016