panic in freelist.go #513
Comments
We are using ee4a088. |
After more investigation, I found that the size of the file does not seem to be correct. I am not 100% sure this is a boltdb issue. |
@xiang90 There's a |
@benbjohnson It is what we just added. We will report back soon. |
@benbjohnson We reproduced this a few more times. The size returned by WriteTo matches the size of the file we saved. (I plan to compare hashes next, but I expect they will match.) So the problem is that the file written by tx.WriteTo cannot be reopened. Is it possible that we misused the boltdb API in a way that corrupts a tx? Any ideas about this issue would be appreciated! |
@benbjohnson After more investigation, I found the problem. In every panic case, we get a db whose metadata has high water mark X and freelist page X-1, but the size of the db file is (X-3)*page_size. So when the db tries to access the freelist, it panics. Here is an example
After adding some logs to writeTo, I found that writeTo writes less than it should.
It seems like writeTo writes a newer metadata page with an old Tx. I am not very familiar with the codebase or its logic, so I might be wrong. Any help would be appreciated! |
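The mismatch described above can be checked mechanically. The following is a minimal sketch, not Bolt's own code: the `meta` struct and `checkSnapshot` function are hypothetical stand-ins, and it assumes the high water mark is one past the last allocated page ID, so a consistent file holds at least that many pages.

```go
package main

import "fmt"

// meta is a hypothetical, simplified stand-in for the meta page fields
// discussed in this thread. It is NOT Bolt's actual struct.
type meta struct {
	pageSize uint64 // bytes per page
	freelist uint64 // page ID where the freelist is stored
	pgid     uint64 // high water mark (assumed: one past the last allocated page ID)
}

// checkSnapshot returns an error if the meta page refers to pages that
// cannot exist in a file of fileSize bytes.
func checkSnapshot(m meta, fileSize uint64) error {
	if m.pageSize == 0 {
		return fmt.Errorf("invalid page size 0")
	}
	pages := fileSize / m.pageSize // number of whole pages present in the file
	if m.freelist >= pages {
		return fmt.Errorf("freelist page %d is beyond the end of the file (%d pages)", m.freelist, pages)
	}
	if m.pgid > pages {
		return fmt.Errorf("high water mark %d exceeds file length (%d pages)", m.pgid, pages)
	}
	return nil
}

func main() {
	const pageSize = 4096 // assumed page size for illustration
	const x = 100         // hypothetical high water mark X

	m := meta{pageSize: pageSize, freelist: x - 1, pgid: x}

	// The failing case from the report: the file holds only X-3 pages.
	fmt.Println(checkSnapshot(m, (x-3)*pageSize))

	// A consistent file holds all X pages.
	fmt.Println(checkSnapshot(m, x*pageSize))
}
```

Running a check like this on a snapshot before trying to open it would turn the panic into an ordinary error, though it does not address the root cause.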
@benbjohnson I feel we should write the meta as the tx.meta rather than doing a copy. But I am not sure. |
@xiang90 Are you running |
@benbjohnson A readonly one. We cannot afford the cost to block other readers/writers. |
This commit changes `Tx.WriteTo()` to use the transaction's in-memory meta page instead of copying from the disk. This is needed because the transaction uses the size from its meta page but writes the current meta page on disk which may have allocated additional pages since the transaction started. Fixes boltdb#513
In the etcd test suite, we repeatedly hard-kill processes.
We also take snapshots continuously using Tx.Copy.
We found that boltdb panics on open right after a hard kill, in roughly 1 out of 10000 runs.
Here is the stack trace
Here is the db file: https://storage.googleapis.com/failure-archive/agent2/2016-02-14T18%3A34%3A44Z/agent.etcd/member/snap/db