deadlock between lookup thread / VRT_AddDirector() and VDI_Event(..., VCL_EVENT_COLD) #110
As is, this issue is hard. I could easily add a workaround, but that "workaround" basically also exists in Varnish-Cache, because a shortcut is taken in So for all practical purposes, this issue exists primarily in VTC, but we can not guarantee that it does not happen in production deployments. Holding the vcl_mtx goes back to varnishcache/varnish-cache#3094 / varnishcache/varnish-cache@465f2f8 and I do now wonder if this is actually a good idea for sending |
More pondering: When sending the cold event, we need to ensure that the list of directors operated upon is complete, so holding the edit: the actual temperature change happens outside |
Hi, I agree to add the test case 😄
some notes: we can not join the resolver thread after the COLD transition has completed, because by then (during discard) the resolver context may have become invalid (for the same reason we can not detach it). I am pretty much out of ideas even including changes to varnish-cache: for example, turning |
In VRT_AddDirector, we create the new vcldir with an initial reference, which we need to drop if we can not add it.

Compare:

    VRT_AddDirector()
    ...
    vdir->refcnt++;

    vcldir_free()
    ...
    AZ(vdir->refcnt);

Noticed when testing other experimental changes while working on nigoroll/libvmod-dynamic#110

    #5 0x000055820c8cb845 in VAS_Fail (func=0x55820c904559 "vcldir_free", file=0x55820c903a47 "cache/cache_vrt_vcl.c", line=150, cond=0x55820c90459a "(vdir->refcnt) == 0", kind=VAS_ASSERT) at vas.c:67
    #6 0x000055820c83a442 in vcldir_free (vdir=0x7f662aa53140) at cache/cache_vrt_vcl.c:150
    #7 0x000055820c839fe1 in VRT_AddDirector (ctx=0x7f662befe250, m=0x55820c965260 <vbe_methods_noprobe>, priv=0x7f662aa20780, fmt=0x55820c900f7f "%s") at cache/cache_vrt_vcl.c:219
    #8 0x000055820c7c7c4d in VRT_new_backend_clustered (ctx=0x7f662befe250, vc=0x0, vrt=0x7f662befdd10, via=0x0) at cache/cache_backend.c:737
    #9 0x000055820c7c8632 in VRT_new_backend (ctx=0x7f662befe250, vrt=0x7f662befdd10, via=0x0) at cache/cache_backend.c:755
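The reference-count mismatch described in this commit message can be illustrated with a minimal, self-contained sketch. It is not the varnish-cache code: `struct dir_sketch` and the `dir_sketch_*` functions are hypothetical stand-ins, and the only thing carried over from the message is the rule that an object created with one initial reference must have that reference dropped before a free routine that asserts `refcnt == 0` is called.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct vcldir; only the refcount matters here. */
struct dir_sketch {
	unsigned	refcnt;
};

/* Creation hands out one initial reference. */
static struct dir_sketch *
dir_sketch_new(void)
{
	struct dir_sketch *d = calloc(1, sizeof *d);

	assert(d != NULL);
	d->refcnt = 1;
	return (d);
}

/* Freeing insists that no reference is left, in the spirit of AZ(vdir->refcnt). */
static void
dir_sketch_free(struct dir_sketch **dp)
{
	assert(dp != NULL && *dp != NULL);
	assert((*dp)->refcnt == 0);
	free(*dp);
	*dp = NULL;
}

/* Error path: drop the initial reference first, then free. */
static void
dir_sketch_surplus(struct dir_sketch **dp)
{
	assert(dp != NULL && *dp != NULL);
	assert((*dp)->refcnt == 1);
	(*dp)->refcnt--;
	dir_sketch_free(dp);
}
```

Calling `dir_sketch_free()` directly on a freshly created object would trip the assertion, which is the failure mode shown in the backtrace; `dir_sketch_surplus()` is the assumed shape of the error path that avoids it.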
If VRT_AddDirector() was called from handling a VCL_COLD event or, indirectly, from another thread which the VCL_COLD event handler was waiting for, varnishd would deadlock and prevent any CLI or director changes, because VRT_AddDirector() requires the vcl_mtx, which is held during vcl_BackendEvent() to ensure a consistent view of the director list.

Because of the early return from VRT_AddDirector() this likely only happened in VTC mode, but the underlying race existed nevertheless.

This patch _almost_ fixes the issue with the intent of making it highly unlikely to occur without getting too involved with the vcl temperature controls: We now check the same conditions under which vcl_set_state() would transition the temperature to COOLING and, if they apply, use Lck_Trylock() in a try/wait loop instead of Lck_Lock(), avoiding the deadlock.

The patch presumably still does not fix the problem entirely, because the reads of vcl->busy and vcl->temp before the Lck_Trylock() could still be outdated.

With the temperature controls otherwise unchanged, the only alternative idea I could come up with was to always use a try/wait loop, which I dismissed due to the performance impact (overhead and added latency).

Ref nigoroll/libvmod-dynamic#110
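The try/wait idea from this commit message can be sketched with plain POSIX threads. This is an illustration under stated assumptions, not the actual patch: `struct vcl_sketch`, its fields, the `sketch_lock_for_add()` function and the 10 ms back-off are all hypothetical, and `pthread_mutex_trylock()` merely stands in for Lck_Trylock(). The sketch blocks on the mutex only when a COLD transition cannot be pending, and otherwise polls so that it can notice the transition and back out instead of deadlocking.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical stand-in for the VCL state consulted before locking. */
struct vcl_sketch {
	pthread_mutex_t	mtx;		/* plays the role of vcl_mtx */
	atomic_int	busy;		/* plays the role of vcl->busy */
	atomic_bool	is_warm;	/* assumed "temperature is warm" flag */
	atomic_bool	cooling;	/* assumed "temperature is COOLING" flag */
};

/*
 * Returns 0 with s->mtx held, or -1 without the lock if the VCL is
 * already cooling and the caller should undo its work.
 */
static int
sketch_lock_for_add(struct vcl_sketch *s)
{
	const struct timespec pause = { 0, 10 * 1000 * 1000 };	/* 10 ms */
	int r;

	for (;;) {
		if (atomic_load(&s->cooling))
			return (-1);
		if (atomic_load(&s->busy) == 0 && atomic_load(&s->is_warm)) {
			/*
			 * A COLD transition could start at any time and hold
			 * the mutex while waiting for us: never block here.
			 */
			if (pthread_mutex_trylock(&s->mtx) == 0)
				return (0);
			(void)nanosleep(&pause, NULL);
			continue;
		}
		/* No COLD transition expected: a blocking lock is fine. */
		r = pthread_mutex_lock(&s->mtx);
		assert(r == 0);
		(void)r;
		return (0);
	}
}
```

As the commit message notes for the real patch, the state reads can already be outdated by the time the lock attempt happens, so a loop of this shape narrows the window rather than closing it.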
At this point I think this can only be fixed in varnish-cache, see varnishcache/varnish-cache#4048
Fixed via varnishcache/varnish-cache#4048
@delthas reported another issue based on a test case which I could have looked at in more detail earlier (I did not because I did not want to use example.com, but I really should have):

(I have slightly modified the test case - @delthas, do you agree to add it?)

The issue here is that VDI_Event(..., VCL_EVENT_COLD) waits for the lookup threads to finish while holding the vcl_mtx, which prevents the lookup threads from ... finishing: