Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CLI: setattr races with write #317

Open
slovichon opened this issue Jun 21, 2015 · 0 comments
Open

CLI: setattr races with write #317

slovichon opened this issue Jun 21, 2015 · 0 comments
Labels

Comments

@slovichon
Copy link
Contributor

On 2012-12-20 12:35:00 -0500, Zhihui Zhang wrote:

I hit this while debugging my biorq patch,  I think it may be a bug in the code already:

[1355953451:002213 msfsthr20:7fff13fff700:def main.c mslfsop_write 2316] fid: 0x048c00000005969e, size from 2560 to 6062
[1355953451:002226 msfsthr20:7fff13fff700:fcmh main.c mslfsop_write 2338] fcmh@0x8b2820 f+g=0x048c00000005969e:0 flg=0x63104:BAS+ ref=4 sz=6062 bsz=262144 mode=0100664 : write: buf=0x7fffec6e9060 rc=0 sz=3502 off=2560
[1355953451:004472 mseqpollthr:7fff8effd700:rpc events.c pscrpc_reply_in_callback 281] req@0x7ffed8001570 x196539/t0 cb=(nil) c0 o38->@54321-128.182.99.29@tcp10:10 lens 304/304 ref 1 res 0 ret 0 fl Rpc:/0/0 replyc 0 rc 0/0 to=60 sent=1355953450 :: type 1, status 0 initiator ;54321-128.182.99.29@tcp10;
[1355953451:004513 msfsthr13:7ffef17fa700:rpc rpcclient.c pscrpc_check_reply 595] req@0x7ffed8001570 x196539/t0 cb=(nil) c0 o38->@54321-128.182.99.29@tcp10:10 lens 304/304 ref 1 res 0 ret 0 fl Rpc:R/0/0 replyc 0 rc 0/0 to=60 sent=1355953450 :: REPLIED:
[1355953451:004599 msfsthr13:7ffef17fa700:rpc rpcclient.c pscrpc_check_reply 623] req@0x7ffed8001570 x196539/t0 cb=(nil) c0 o38->@54321-128.182.99.29@tcp10:10 lens 304/304 ref 1 res 0 ret 0 fl Rpc:R/0/0 replyc 0 rc 0/0 to=60 sent=1355953450 :: rc=1
[1355953451:004620 msfsthr13:7ffef17fa700:rpc rpcclient.c pscrpc_queue_wait 805] req@0x7ffed8001570 x196539/t0 cb=(nil) c0 o38->@54321-128.182.99.29@tcp10:10 lens 304/304 ref 1 res 0 ret 0 fl Rpc:R/0/0 replyc 0 rc 0/0 to=60 sent=1355953450 :: completed (rc=0), replied=1
[1355953451:004641 msfsthr13:7ffef17fa700:rpc rpcclient.c pscrpc_unregister_reply 637] req@0x7ffed8001570 x196539/t0 cb=(nil) c0 o38->@54321-128.182.99.29@tcp10:10 lens 304/304 ref 1 res 0 ret 0 fl Rpc:R/0/0 replyc 0 rc 0/0 to=60 sent=1355953450 :: receving_reply=0
[1355953451:004674 msfsthr13:7ffef17fa700:def main.c mslfsop_setattr 2201] sstb@0x8b2820 fg:0x048c00000005969e:0 dev:0 mode:0100664 nlink:1 uid:1000 gid:1000 rdev:0 sz:2560 blksz:2728 blkcnt:0 atime:1355953452:689921000 mtime:1355771694:0 ctime:1355953452:702393000 fcmh 0x8b2820 post setattr
[1355953451:004703 msfsthr13:7ffef17fa700:def fuse.c pscfs_reply_setattr 1314] returned item 0x11e1a90 to pool fsrq
[1355953451:004719 msfsthr13:7ffef17fa700:rpc niobuf.c _pscrpc_free_req 698] req@0x7ffed8001570 x196539/t0 cb=(nil) c0 o38->@54321-128.182.99.29@tcp10:10 lens 304/304 ref 0 res 0 ret 0 fl Interpret:R/0/0 replyc 0 rc 0/0 to=60 sent=1355953450 :: freeing (rqcomp_compcnt=-999)
[1355953451:004738 msfsthr13:7ffef17fa700:def main.c mslfsop_setattr 2245] csvc@0x7fff30000d10 fl=0x2:O ref:0 decref
[1355953451:004888 msfsthr20:7fff13fff700:def fuse.c pscfs_fuse_handle_flush 772] got item 0x11e1a40 from pool fsrq
[1355953451:004916 msfsthr20:7fff13fff700:fcmh main.c mslfsop_flush 1497] fcmh@0x8b2820 f+g=0x048c00000005969e:0 flg=0x62104:BA+ ref=3 sz=2560 bsz=2728 mode=0100664 : flushing (mfh=0x7ffee4001e90)
[1355953451:004938 msfsthr20:7fff13fff700:fcmh main.c mslfsop_flush 1503] fcmh@0x8b2820 f+g=0x048c00000005969e:0 flg=0x62104:BA+ ref=3 sz=2560 bsz=2728 mode=0100664 : done flushing (mfh=0x7ffee4001e90)

On 2012-12-20 12:37:29 -0500, Zhihui Zhang wrote:

post setattr  -----> sz:2560

Before that: size from 2560 to 6062

Maybe there are some callback intween not captured by the trace

Anyway, I beef up the trace so we can find more evidence next time.

Right now I suspect MDS.

On 2012-12-20 14:24:50 -0500, Zhihui Zhang wrote:

[1355953451:000903 msfsthr20:7fff13fff700:bmap mount_slash_int.c msl_dio_cb 1147] biorq@0xb96610 fl=0x4400042:wdp ref=1 o=2560 l=3502 r=0 buf=0x7fffec6e9060 q=0x7fff04000cd0 sliod=ffffffff np=0 b=0x7ffee8001ae0 ex=1355953451:007160032 : destroying (ref=1)
[1355953451:000925 msfsthr20:7fff13fff700:rpc niobuf.c _pscrpc_free_req 698] req@0x7fff04001eb0 x196538/t0 cb=(nil) c0 o42->@54321-128.182.99.27@tcp10:30 lens 248/176 ref 0 res 0 ret 0 fl Complete:R/0/0 replyc 0 rc 0/0 to=60 sent=1355953450 :: freeing (rqcomp_compcnt=-999)
[1355953451:000954 msfsthr20:7fff13fff700:def mount_slash_int.c msl_pages_dio_getput 1274] csvc@0x7fff20000d80 fl=0x2:O ref:11 decref
[1355953451:000969 msfsthr20:7fff13fff700:bmap mount_slash_int.c msl_io 2300] biorq@0xb96610 fl=0x4400042:wdp ref=0 o=2560 l=3502 r=0 buf=0x7fffec6e9060 q=0x7fff04000cd0 sliod=ffffffff np=0 b=0x7ffee8001ae0 ex=1355953451:007160032 : destroying (ref=0)
[1355953451:000986 msfsthr20:7fff13fff700:bmap mount_slash_int.c msl_io 2300] biorq = 0xb96610, fsrq = 0x7fff04000cd0, pfr = 0x11e19a0
[1355953451:000999 msfsthr20:7fff13fff700:bmap mount_slash_int.c msl_io 2300] biorq@0xb96610 fl=0x4400142:wdDp ref=0 o=2560 l=3502 r=0 buf=0x7fffec6e9060 q=0x7fff04000cd0 sliod=ffffffff np=0 b=0x7ffee8001ae0 ex=1355953451:007160032 : destroying (nwaiters=0)
[1355953451:001017 msfsthr20:7fff13fff700:bmap mount_slash_int.c msl_io 2300] bmap@0x7ffee8001ae0 bno:0 flg:0x20a:WDT fid:0x048c00000005969e opcnt=2 remove biorq=0xb96610 nitems_pndg(0)
[1355953451:001039 msfsthr20:7fff13fff700:bmap mount_slash_int.c msl_io 2300] returned item 0xb96610 to pool biorq
[1355953451:001101 msfsthr13:7ffef17fa700:def fuse.c pscfs_fuse_handle_getattr 808] got item 0x11e19f0 from pool fsrq
[1355953451:001123 msfsthr13:7ffef17fa700:fcmh main.c msl_stat 589] fcmh@0x8b2820 f+g=0x048c00000005969e:0 flg=0x62104:BA+ ref=4 sz=2560 bsz=0 mode=0100664 : attrs retrieved from local cache
[1355953451:001164 msfsthr13:7ffef17fa700:def fuse.c pscfs_reply_getattr 1191] returned item 0x11e19f0 to pool fsrq
[1355953451:001192 msfsthr13:7ffef17fa700:def fuse.c pscfs_fuse_handle_setattr 979] got item 0x11e1a90 from pool fsrq
[1355953451:001211 msfsthr13:7ffef17fa700:def rpc_cli.c slc_rmc_getcsvc 157] csvc@0x7fff30000d10 fl=0x2:O ref:1 incref
[1355953451:001227 msfsthr13:7ffef17fa700:rpc rpcclient.c pscrpc_prep_req_pool 228] request 0x7ffed8001570 request->rq_reqmsg 0x7ffed8001050
[1355953451:001240 msfsthr13:7ffef17fa700:rpc rpcclient.c pscrpc_import_get 112] import get 0x7fff30000e90 refcount=3
[1355953451:001271 msfsthr13:7ffef17fa700:rpc rpcclient.c pscrpc_queue_wait 748] req@0x7ffed8001570 x196539/t0 cb=(nil) c0 o38->@54321-128.182.99.29@tcp10:10 lens 304/304 ref 1 res 0 ret 0 fl New:/0/0 replyc deadbeef rc 0/0 to=60 sent=0 :: sending..
[1355953451:001290 msfsthr13:7ffef17fa700:rpc rpcclient.c pscrpc_queue_wait 787] request 0x7ffed8001570 request->rq_reqmsg 0x7ffed8001050
[1355953451:001304 msfsthr13:7ffef17fa700:rpc niobuf.c pscrpc_send_rpc 524] LNetMEAttach() gave handle 2e079a
[1355953451:001349 msfsthr13:7ffef17fa700:rpc rpcclient.c pscrpc_queue_wait 796] req@0x7ffed8001570 x196539/t0 cb=(nil) c0 o38->@54321-128.182.99.29@tcp10:10 lens 304/304 ref 2 res 0 ret 0 fl Rpc:/0/0 replyc 0 rc 0/0 to=60 sent=1355953450 :: sleeping for 60 sec
[1355953451:001353 mseqpollthr:7fff8effd700:rpc events.c pscrpc_request_out_callback 58] req@0x7ffed8001570 x196539/t0 cb=(nil) c0 o38->@54321-128.182.99.29@tcp10:10 lens 304/304 ref 2 res 0 ret 0 fl Rpc:/0/0 replyc 0 rc 0/0 to=60 sent=1355953450 :: type 4, status 0
[1355953451:002195 msfsthr20:7fff13fff700:def fuse.c pscfs_reply_write 1351] returned item 0x11e19a0 to pool fsrq
[1355953451:002213 msfsthr20:7fff13fff700:def main.c mslfsop_write 2316] fid: 0x048c00000005969e, size from 2560 to 6062

On 2012-12-20 14:28:09 -0500, Zhihui Zhang wrote:

msfsthr13: a setattr gets in while the write is still in progress.  I suspect it uses the premature size value to set it.

On 2013-08-05 12:57:25 -0400, Jared Yanovich wrote:

If an application is by nature racing a nonzero truncate(2) with write(2) activity, this may be possible.  I think such behavior would be undefined according to POSIX.

Since even mslfsop_setattr() does this:

         fcmh_setattrf(c, &mp->attr, FCMH_SETATTRF_SAVELOCAL |
            FCMH_SETATTRF_HAVELOCK);

Meaning local values will be preserved.

Further, we probably need "pre setattr" and "post setattr" psclogs to figure out when the SETATTR gets issued.

My suspicion is that something is doing SETSIZE unbeknownst to us that is wiping out the SAVELOCAL.  We specifically prohibit nonzero truncate(2) right now so it cannot be an application truncate(2) call.  This should be easy to trace if I have the steps to reproduce it.
@slovichon slovichon added the bug label Jun 21, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

1 participant