Investigate duped ops due to socket reuse in odsp driver #3627

Closed
heliocliu opened this issue Sep 14, 2020 · 9 comments
Labels
area: driver Driver related issues hotpatch

@heliocliu
Contributor

heliocliu commented Sep 14, 2020

See #3605, which adds some telemetry. (There is no Teams thread.)

There's an issue in which multiple connections re-send pending ops to the server, leading to data loss. In one observed instance, during network instability, multiple active reconnections re-sent pending ops and did not correctly identify those ops as local, leading to duplication.

Ops list
seq: 5120  cid: 7591168a-25ed-4b0e-a60f-7af2dd3cf2b7 --------------join--------------
seq: 5126  cid: 7591168a-25ed-4b0e-a60f-7af2dd3cf2b7 cseq: 5     #
seq: 5127  cid: 7591168a-25ed-4b0e-a60f-7af2dd3cf2b7 cseq: 6
seq: 5135  cid: 7591168a-25ed-4b0e-a60f-7af2dd3cf2b7 cseq: 13    O
seq: 5136  cid: 7591168a-25ed-4b0e-a60f-7af2dd3cf2b7 cseq: 14    P
seq: 5140  cid: 7591168a-25ed-4b0e-a60f-7af2dd3cf2b7 cseq: 17    O
seq: 5141  cid: 7591168a-25ed-4b0e-a60f-7af2dd3cf2b7 cseq: 18    p
seq: 5145  cid: 7591168a-25ed-4b0e-a60f-7af2dd3cf2b7 cseq: 21    [
seq: 5146  cid: 7591168a-25ed-4b0e-a60f-7af2dd3cf2b7 cseq: 22    I
seq: 5147  cid: 7591168a-25ed-4b0e-a60f-7af2dd3cf2b7 cseq: 23    m
seq: 5148  cid: 7591168a-25ed-4b0e-a60f-7af2dd3cf2b7 cseq: 24    p
seq: 5149  cid: 7591168a-25ed-4b0e-a60f-7af2dd3cf2b7 cseq: 25    o
seq: 5150  cid: 7591168a-25ed-4b0e-a60f-7af2dd3cf2b7 cseq: 26    r
seq: 5151  cid: 7591168a-25ed-4b0e-a60f-7af2dd3cf2b7 cseq: 27    t
seq: 5152  cid: 7591168a-25ed-4b0e-a60f-7af2dd3cf2b7 cseq: 28    a
seq: 5153  cid: 7591168a-25ed-4b0e-a60f-7af2dd3cf2b7 cseq: 29    n
seq: 5154  cid: 7591168a-25ed-4b0e-a60f-7af2dd3cf2b7 cseq: 30    t
seq: 5155  cid: 7591168a-25ed-4b0e-a60f-7af2dd3cf2b7 cseq: 31    ]
seq: 5157  cid: 7591168a-25ed-4b0e-a60f-7af2dd3cf2b7 cseq: 32
seq: 5158  cid: 7591168a-25ed-4b0e-a60f-7af2dd3cf2b7 cseq: 33    O
seq: 5159  cid: 7591168a-25ed-4b0e-a60f-7af2dd3cf2b7 cseq: 34    p
seq: 5160  cid: 7591168a-25ed-4b0e-a60f-7af2dd3cf2b7 cseq: 35    e
seq: 5161  cid: 7591168a-25ed-4b0e-a60f-7af2dd3cf2b7 cseq: 36    n
seq: 5162  cid: 7591168a-25ed-4b0e-a60f-7af2dd3cf2b7 cseq: 37
seq: 5163  cid: 7591168a-25ed-4b0e-a60f-7af2dd3cf2b7 cseq: 38    Q
seq: 5164  cid: 7591168a-25ed-4b0e-a60f-7af2dd3cf2b7 cseq: 39    u
seq: 5165  cid: 7591168a-25ed-4b0e-a60f-7af2dd3cf2b7 cseq: 40    e
seq: 5166  cid: 7591168a-25ed-4b0e-a60f-7af2dd3cf2b7 cseq: 41    s
seq: 5167  cid: 7591168a-25ed-4b0e-a60f-7af2dd3cf2b7 cseq: 42    t
seq: 5168  cid: 7591168a-25ed-4b0e-a60f-7af2dd3cf2b7 cseq: 43    i
seq: 5169  cid: 7591168a-25ed-4b0e-a60f-7af2dd3cf2b7 cseq: 44    o
seq: 5170  cid: 7591168a-25ed-4b0e-a60f-7af2dd3cf2b7 cseq: 45    n
seq: 5171  cid: 7591168a-25ed-4b0e-a60f-7af2dd3cf2b7 cseq: 46    s
seq: 5177  cid: 7591168a-25ed-4b0e-a60f-7af2dd3cf2b7 cseq: 50    1
seq: 5178  cid: 7591168a-25ed-4b0e-a60f-7af2dd3cf2b7 cseq: 51    .
seq: 5179  cid: 7591168a-25ed-4b0e-a60f-7af2dd3cf2b7 cseq: 52
seq: 5193  cid: 7591168a-25ed-4b0e-a60f-7af2dd3cf2b7 cseq: 66    D
seq: 5194  cid: 7591168a-25ed-4b0e-a60f-7af2dd3cf2b7 cseq: 67    o
seq: 5198  cid: 7591168a-25ed-4b0e-a60f-7af2dd3cf2b7 cseq: 70    D
seq: 5199  cid: 7591168a-25ed-4b0e-a60f-7af2dd3cf2b7 cseq: 71    o
seq: 5200  cid: 7591168a-25ed-4b0e-a60f-7af2dd3cf2b7 cseq: 72
seq: 5201  cid: 7591168a-25ed-4b0e-a60f-7af2dd3cf2b7 cseq: 73    h
seq: 5202  cid: 7591168a-25ed-4b0e-a60f-7af2dd3cf2b7 cseq: 74    o
seq: 5203  cid: 7591168a-25ed-4b0e-a60f-7af2dd3cf2b7 cseq: 75    s
seq: 5204  cid: 7591168a-25ed-4b0e-a60f-7af2dd3cf2b7 cseq: 76    t
seq: 5205  cid: 7591168a-25ed-4b0e-a60f-7af2dd3cf2b7 cseq: 77    s
seq: 5206  cid: 7591168a-25ed-4b0e-a60f-7af2dd3cf2b7 cseq: 78
seq: 5207  cid: 156c88f8-db41-4579-89b1-850147df8c09 --------------join--------------
seq: 5208  cid: 7591168a-25ed-4b0e-a60f-7af2dd3cf2b7 cseq: 79    h
seq: 5209  cid: 7591168a-25ed-4b0e-a60f-7af2dd3cf2b7 cseq: 80    a
seq: 5210  cid: 7591168a-25ed-4b0e-a60f-7af2dd3cf2b7 cseq: 81    v
seq: 5211  cid: 7591168a-25ed-4b0e-a60f-7af2dd3cf2b7 cseq: 82    e
seq: 5212  cid: 7591168a-25ed-4b0e-a60f-7af2dd3cf2b7 cseq: 83
seq: 5213  cid: 7591168a-25ed-4b0e-a60f-7af2dd3cf2b7 cseq: 84    t
seq: 5214  cid: 7591168a-25ed-4b0e-a60f-7af2dd3cf2b7 cseq: 85    h
seq: 5215  cid: 7591168a-25ed-4b0e-a60f-7af2dd3cf2b7 cseq: 86    e
seq: 5216  cid: 7591168a-25ed-4b0e-a60f-7af2dd3cf2b7 cseq: 87
seq: 5217  cid: 7591168a-25ed-4b0e-a60f-7af2dd3cf2b7 cseq: 88    c
seq: 5218  cid: 7591168a-25ed-4b0e-a60f-7af2dd3cf2b7 cseq: 89    a
seq: 5219  cid: 7591168a-25ed-4b0e-a60f-7af2dd3cf2b7 cseq: 90    p
seq: 5220  cid: 7591168a-25ed-4b0e-a60f-7af2dd3cf2b7 cseq: 91    a
seq: 5221  cid: 7591168a-25ed-4b0e-a60f-7af2dd3cf2b7 cseq: 92    b
seq: 5222  cid: 7591168a-25ed-4b0e-a60f-7af2dd3cf2b7 cseq: 93    i
seq: 5223  cid: 7591168a-25ed-4b0e-a60f-7af2dd3cf2b7 cseq: 94    l
seq: 5224  cid: 7591168a-25ed-4b0e-a60f-7af2dd3cf2b7 cseq: 95    i
seq: 5225  cid: 7591168a-25ed-4b0e-a60f-7af2dd3cf2b7 cseq: 96    t
seq: 5226  cid: 7591168a-25ed-4b0e-a60f-7af2dd3cf2b7 cseq: 97    y
seq: 5227  cid: 7591168a-25ed-4b0e-a60f-7af2dd3cf2b7 cseq: 98
seq: 5228  cid: 7591168a-25ed-4b0e-a60f-7af2dd3cf2b7 cseq: 99    t
seq: 5229  cid: 7591168a-25ed-4b0e-a60f-7af2dd3cf2b7 cseq: 100   o
seq: 5230  cid: 7591168a-25ed-4b0e-a60f-7af2dd3cf2b7 cseq: 101
seq: 5231  cid: 7591168a-25ed-4b0e-a60f-7af2dd3cf2b7 cseq: 102   d
seq: 5232  cid: 7591168a-25ed-4b0e-a60f-7af2dd3cf2b7 cseq: 103   o
seq: 5233  cid: 7591168a-25ed-4b0e-a60f-7af2dd3cf2b7 cseq: 104
seq: 5234  cid: 7591168a-25ed-4b0e-a60f-7af2dd3cf2b7 cseq: 105   a
seq: 5235  cid: 7591168a-25ed-4b0e-a60f-7af2dd3cf2b7 cseq: 106
seq: 5236  cid: 7591168a-25ed-4b0e-a60f-7af2dd3cf2b7 cseq: 107   (
seq: 5237  cid: 7591168a-25ed-4b0e-a60f-7af2dd3cf2b7 cseq: 108   f
seq: 5238  cid: 7591168a-25ed-4b0e-a60f-7af2dd3cf2b7 cseq: 109   a
seq: 5239  cid: 7591168a-25ed-4b0e-a60f-7af2dd3cf2b7 cseq: 110   s
seq: 5240  cid: 7591168a-25ed-4b0e-a60f-7af2dd3cf2b7 cseq: 111   t
seq: 5241  cid: 7591168a-25ed-4b0e-a60f-7af2dd3cf2b7 cseq: 112   )
seq: 5242  cid: 7591168a-25ed-4b0e-a60f-7af2dd3cf2b7 cseq: 113
seq: 5243  cid: 7591168a-25ed-4b0e-a60f-7af2dd3cf2b7 cseq: 114   a
seq: 5244  cid: 7591168a-25ed-4b0e-a60f-7af2dd3cf2b7 cseq: 115   s
seq: 5245  cid: 7591168a-25ed-4b0e-a60f-7af2dd3cf2b7 cseq: 116   y
seq: 5246  cid: 7591168a-25ed-4b0e-a60f-7af2dd3cf2b7 cseq: 117   n
seq: 5247  cid: 7591168a-25ed-4b0e-a60f-7af2dd3cf2b7 cseq: 118   c
seq: 5248  cid: 7591168a-25ed-4b0e-a60f-7af2dd3cf2b7 cseq: 119
seq: 5249  cid: 7591168a-25ed-4b0e-a60f-7af2dd3cf2b7 cseq: 120   t
seq: 5250  cid: 7591168a-25ed-4b0e-a60f-7af2dd3cf2b7 cseq: 121   a
seq: 5251  cid: 7591168a-25ed-4b0e-a60f-7af2dd3cf2b7 cseq: 122   s
seq: 5252  cid: 7591168a-25ed-4b0e-a60f-7af2dd3cf2b7 cseq: 123   k
seq: 5253  cid: 7591168a-25ed-4b0e-a60f-7af2dd3cf2b7 cseq: 124
seq: 5254  cid: 7591168a-25ed-4b0e-a60f-7af2dd3cf2b7 cseq: 125   b
seq: 5255  cid: 7591168a-25ed-4b0e-a60f-7af2dd3cf2b7 cseq: 126   e
seq: 5256  cid: 7591168a-25ed-4b0e-a60f-7af2dd3cf2b7 cseq: 127   f
seq: 5257  cid: 7591168a-25ed-4b0e-a60f-7af2dd3cf2b7 cseq: 128   o
seq: 5258  cid: 7591168a-25ed-4b0e-a60f-7af2dd3cf2b7 cseq: 129   r
seq: 5259  cid: 7591168a-25ed-4b0e-a60f-7af2dd3cf2b7 cseq: 130   e
seq: 5260  cid: 7591168a-25ed-4b0e-a60f-7af2dd3cf2b7 cseq: 131
seq: 5261  cid: 7591168a-25ed-4b0e-a60f-7af2dd3cf2b7 cseq: 132   k
seq: 5262  cid: 7591168a-25ed-4b0e-a60f-7af2dd3cf2b7 --------------leave--------------
seq: 5264  cid: 156c88f8-db41-4579-89b1-850147df8c09 cseq: 1     h
seq: 5265  cid: 156c88f8-db41-4579-89b1-850147df8c09 cseq: 2     a
seq: 5266  cid: 156c88f8-db41-4579-89b1-850147df8c09 cseq: 3     v
seq: 5267  cid: 156c88f8-db41-4579-89b1-850147df8c09 cseq: 4     e
seq: 5268  cid: 156c88f8-db41-4579-89b1-850147df8c09 cseq: 5
seq: 5269  cid: 156c88f8-db41-4579-89b1-850147df8c09 cseq: 6     t
seq: 5270  cid: 156c88f8-db41-4579-89b1-850147df8c09 cseq: 7     h
seq: 5271  cid: 156c88f8-db41-4579-89b1-850147df8c09 cseq: 8     e
seq: 5272  cid: 156c88f8-db41-4579-89b1-850147df8c09 cseq: 9
seq: 5273  cid: 156c88f8-db41-4579-89b1-850147df8c09 cseq: 10    c
seq: 5274  cid: 156c88f8-db41-4579-89b1-850147df8c09 cseq: 11    a
seq: 5275  cid: 156c88f8-db41-4579-89b1-850147df8c09 cseq: 12    p
seq: 5276  cid: 156c88f8-db41-4579-89b1-850147df8c09 cseq: 13    a
seq: 5277  cid: 156c88f8-db41-4579-89b1-850147df8c09 cseq: 14    b
seq: 5278  cid: 156c88f8-db41-4579-89b1-850147df8c09 cseq: 15    i
seq: 5279  cid: 156c88f8-db41-4579-89b1-850147df8c09 cseq: 16    l
seq: 5280  cid: 156c88f8-db41-4579-89b1-850147df8c09 cseq: 17    i
seq: 5281  cid: 156c88f8-db41-4579-89b1-850147df8c09 cseq: 18    t
seq: 5282  cid: 156c88f8-db41-4579-89b1-850147df8c09 cseq: 19    y
seq: 5283  cid: 156c88f8-db41-4579-89b1-850147df8c09 cseq: 20
seq: 5284  cid: 156c88f8-db41-4579-89b1-850147df8c09 cseq: 21    t
seq: 5285  cid: 156c88f8-db41-4579-89b1-850147df8c09 cseq: 22    o
seq: 5286  cid: 156c88f8-db41-4579-89b1-850147df8c09 cseq: 23
seq: 5287  cid: 156c88f8-db41-4579-89b1-850147df8c09 cseq: 24    d
seq: 5288  cid: 156c88f8-db41-4579-89b1-850147df8c09 cseq: 25    o
seq: 5289  cid: 156c88f8-db41-4579-89b1-850147df8c09 cseq: 26
seq: 5290  cid: 156c88f8-db41-4579-89b1-850147df8c09 cseq: 27    a
seq: 5291  cid: 156c88f8-db41-4579-89b1-850147df8c09 cseq: 28
seq: 5292  cid: 156c88f8-db41-4579-89b1-850147df8c09 cseq: 29    (
seq: 5293  cid: 156c88f8-db41-4579-89b1-850147df8c09 cseq: 30    f
seq: 5294  cid: 156c88f8-db41-4579-89b1-850147df8c09 cseq: 31    a
seq: 5295  cid: 156c88f8-db41-4579-89b1-850147df8c09 cseq: 32    s
seq: 5296  cid: 156c88f8-db41-4579-89b1-850147df8c09 cseq: 33    t
seq: 5297  cid: 156c88f8-db41-4579-89b1-850147df8c09 cseq: 34    )
seq: 5298  cid: 156c88f8-db41-4579-89b1-850147df8c09 cseq: 35
seq: 5299  cid: 156c88f8-db41-4579-89b1-850147df8c09 cseq: 36    a
seq: 5300  cid: 156c88f8-db41-4579-89b1-850147df8c09 cseq: 37    s
seq: 5301  cid: 156c88f8-db41-4579-89b1-850147df8c09 cseq: 38    y
seq: 5302  cid: 156c88f8-db41-4579-89b1-850147df8c09 cseq: 39    n
seq: 5303  cid: 156c88f8-db41-4579-89b1-850147df8c09 cseq: 40    c
seq: 5304  cid: 156c88f8-db41-4579-89b1-850147df8c09 cseq: 41
seq: 5305  cid: 156c88f8-db41-4579-89b1-850147df8c09 cseq: 42    t
seq: 5306  cid: 156c88f8-db41-4579-89b1-850147df8c09 cseq: 43    a
seq: 5307  cid: 156c88f8-db41-4579-89b1-850147df8c09 cseq: 44    s
seq: 5308  cid: 156c88f8-db41-4579-89b1-850147df8c09 cseq: 45    k
seq: 5309  cid: 156c88f8-db41-4579-89b1-850147df8c09 cseq: 46
seq: 5310  cid: 156c88f8-db41-4579-89b1-850147df8c09 cseq: 47    b
seq: 5311  cid: 156c88f8-db41-4579-89b1-850147df8c09 cseq: 48    e
seq: 5312  cid: 156c88f8-db41-4579-89b1-850147df8c09 cseq: 49    f
seq: 5313  cid: 156c88f8-db41-4579-89b1-850147df8c09 cseq: 50    o
seq: 5314  cid: 156c88f8-db41-4579-89b1-850147df8c09 cseq: 51    r
seq: 5315  cid: 156c88f8-db41-4579-89b1-850147df8c09 cseq: 52    e
seq: 5316  cid: 156c88f8-db41-4579-89b1-850147df8c09 cseq: 53
seq: 5317  cid: 156c88f8-db41-4579-89b1-850147df8c09 cseq: 54    k
seq: 5318  cid: 156c88f8-db41-4579-89b1-850147df8c09 cseq: 55    i
seq: 5319  cid: 156c88f8-db41-4579-89b1-850147df8c09 cseq: 56    l
seq: 5320  cid: 156c88f8-db41-4579-89b1-850147df8c09 cseq: 57    l
seq: 5321  cid: 156c88f8-db41-4579-89b1-850147df8c09 cseq: 58    i
seq: 5322  cid: 156c88f8-db41-4579-89b1-850147df8c09 cseq: 59    n
seq: 5323  cid: 156c88f8-db41-4579-89b1-850147df8c09 cseq: 60    g
seq: 5324  cid: 156c88f8-db41-4579-89b1-850147df8c09 cseq: 61
seq: 5325  cid: 156c88f8-db41-4579-89b1-850147df8c09 cseq: 62    t
seq: 5326  cid: 156c88f8-db41-4579-89b1-850147df8c09 cseq: 63    h
seq: 5327  cid: 156c88f8-db41-4579-89b1-850147df8c09 cseq: 64    e
seq: 5328  cid: 156c88f8-db41-4579-89b1-850147df8c09 cseq: 65
seq: 5329  cid: 156c88f8-db41-4579-89b1-850147df8c09 cseq: 66    i
seq: 5330  cid: 156c88f8-db41-4579-89b1-850147df8c09 cseq: 67    f
seq: 5331  cid: 156c88f8-db41-4579-89b1-850147df8c09 cseq: 68    r
seq: 5332  cid: 156c88f8-db41-4579-89b1-850147df8c09 cseq: 69    a
seq: 5333  cid: 156c88f8-db41-4579-89b1-850147df8c09 cseq: 70    m
seq: 5334  cid: 156c88f8-db41-4579-89b1-850147df8c09 cseq: 71    e
seq: 5345  cid: 156c88f8-db41-4579-89b1-850147df8c09 --------------leave--------------
The suspicion is that the odsp driver reuses its socket across multiple connections, so old messages get flushed after the corresponding connection is expected to already be closed. There is also a lot of potential for other lurking issues due to connection reuse and similar patterns.
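The ordering problem in the ops list above can be checked mechanically. Below is a minimal TypeScript sketch, assuming the line format shown in the list. `OpLogEntry`, `parseOpLine`, and `opsAfterReconnectJoin` are hypothetical names for illustration and are not part of the odsp driver or the loader. It also assumes the caller already knows which two client ids belong to the same reconnecting client.

```typescript
// One parsed line of the ops list. Field names mirror the log format above.
interface OpLogEntry {
    seq: number;
    cid: string;
    kind: "join" | "leave" | "op";
    cseq?: number;
    text?: string;
}

// Parse a line such as "seq: 5208  cid: 7591168a-... cseq: 79    h".
// Returns undefined for lines that do not match the assumed format.
function parseOpLine(line: string): OpLogEntry | undefined {
    const marker = /^seq:\s*(\d+)\s+cid:\s*(\S+)\s+-+(join|leave)-+\s*$/.exec(line);
    if (marker) {
        const kind = marker[3] === "join" ? "join" : "leave";
        return { seq: Number(marker[1]), cid: marker[2], kind };
    }
    const op = /^seq:\s*(\d+)\s+cid:\s*(\S+)\s+cseq:\s*(\d+)\s*(.*)$/.exec(line);
    if (op) {
        return { seq: Number(op[1]), cid: op[2], kind: "op", cseq: Number(op[3]), text: op[4] };
    }
    return undefined;
}

// Return the ops from oldClientId that were sequenced after newClientId joined.
// For a single client reconnecting, this list is expected to be empty.
function opsAfterReconnectJoin(
    entries: OpLogEntry[],
    oldClientId: string,
    newClientId: string,
): OpLogEntry[] {
    const joins = entries.filter((e) => e.kind === "join" && e.cid === newClientId);
    if (joins.length === 0) {
        return [];
    }
    const joinSeq = joins[0].seq;
    return entries.filter((e) => e.kind === "op" && e.cid === oldClientId && e.seq > joinSeq);
}

// A short excerpt of the ops list above.
const excerpt = [
    "seq: 5206  cid: 7591168a-25ed-4b0e-a60f-7af2dd3cf2b7 cseq: 78",
    "seq: 5207  cid: 156c88f8-db41-4579-89b1-850147df8c09 --------------join--------------",
    "seq: 5208  cid: 7591168a-25ed-4b0e-a60f-7af2dd3cf2b7 cseq: 79    h",
    "seq: 5209  cid: 7591168a-25ed-4b0e-a60f-7af2dd3cf2b7 cseq: 80    a",
    "seq: 5262  cid: 7591168a-25ed-4b0e-a60f-7af2dd3cf2b7 --------------leave--------------",
    "seq: 5264  cid: 156c88f8-db41-4579-89b1-850147df8c09 cseq: 1     h",
];

const entries = excerpt
    .map((line) => parseOpLine(line))
    .filter((e): e is OpLogEntry => e !== undefined);

const late = opsAfterReconnectJoin(
    entries,
    "7591168a-25ed-4b0e-a60f-7af2dd3cf2b7",
    "156c88f8-db41-4579-89b1-850147df8c09",
);
console.log(late.map((e) => e.seq));
```

On this excerpt the check flags the two old-client ops at seq 5208 and 5209, which were sequenced after the new client id's join at seq 5207. In a document with several independent clients, ops from one client after another client's join are normal, so the check is only meaningful for a pair of ids known to come from the same reconnect.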
heliocliu added the area: driver label on Sep 14, 2020
@anthony-murphy
Contributor

anthony-murphy commented Sep 14, 2020

specifically, on reconnect we see:

  1. new client id join message
  2. ops sent from the old client id
  3. the same ops sent by the new client id as part of reconnect.
  4. old client leave message

2 should never happen after 1. the join message acts as a barrier in the op stream, so the client only watches for ops from itself until it sees its new join message. in this case it doesn't recognize the ops in step 2 as its own, so it resends them when the reconnection completes in 3.
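The barrier behaviour described above can be sketched as follows. This is a simplified model, not the loader's actual reconnect code: `SequencedMessage`, `ReconnectTracker`, and their members are hypothetical names, and the pending ops are reduced to their client sequence numbers.

```typescript
// Simplified shape of a message in the op stream (an assumption for this sketch).
interface SequencedMessage {
    type: "join" | "leave" | "op";
    clientId: string;
    clientSequenceNumber?: number;
}

class ReconnectTracker {
    // Client sequence numbers of local ops not yet seen in the op stream.
    private pending: number[];
    private sawNewJoin = false;
    private readonly oldClientId: string;
    private readonly newClientId: string;
    // Ops from the old client id that arrived after the new join (step 2 after step 1).
    public readonly violations: SequencedMessage[] = [];

    constructor(oldClientId: string, newClientId: string, pendingCseqs: number[]) {
        this.oldClientId = oldClientId;
        this.newClientId = newClientId;
        this.pending = pendingCseqs.slice();
    }

    public process(msg: SequencedMessage): void {
        if (msg.type === "join" && msg.clientId === this.newClientId) {
            // The join acts as the barrier: the client stops watching for its old id.
            this.sawNewJoin = true;
            return;
        }
        if (msg.type !== "op" || msg.clientId !== this.oldClientId) {
            return;
        }
        if (this.sawNewJoin) {
            // Not recognized as local, so the op stays pending and gets resent.
            this.violations.push(msg);
        } else {
            // Before the barrier, an op from the old id acknowledges a pending op.
            const cseq = msg.clientSequenceNumber;
            this.pending = this.pending.filter((c) => c !== cseq);
        }
    }

    // Ops the client would resend under the new client id once reconnect completes.
    public opsToResend(): number[] {
        return this.pending.slice().sort((a, b) => a - b);
    }
}

const tracker = new ReconnectTracker("old-id", "new-id", [79, 80, 81]);
tracker.process({ type: "op", clientId: "old-id", clientSequenceNumber: 79 }); // acked before the join
tracker.process({ type: "join", clientId: "new-id" });
tracker.process({ type: "op", clientId: "old-id", clientSequenceNumber: 80 }); // arrives after the barrier
console.log(tracker.violations.length, tracker.opsToResend());
```

In this run, op 80 is already sequenced under the old client id but still appears in `opsToResend()`, so it would be submitted a second time under the new id. That is the duplication seen in the ops list.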

@danielroney
Contributor

Per discussion with Matt, removing hotpatch tag and pushing out to October.

@anthony-murphy
Contributor

@danielroney and @ChumpChief this is causing active data corruption issues. i think it should be a hotpatch.

@heliocliu can you provide the loader versions and the expected log entry for each version to detect this issue, as it has changed a few times?

@heliocliu
Contributor Author

heliocliu commented Sep 30, 2020

So for 0.24/0.25, there should be a telemetry event named matchedOldClientIdInRemoteMessage.

For 0.26/0.27, there will be a container error with message either messageClientIdMissingFromQuorum (indicating we got a message from a client that's not in the quorum and should be) or messageClientIdShouldHaveLeft (indicating we got a message from a client that should NOT be in the quorum but is).
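As a rough illustration of the two 0.26/0.27 error conditions, here is a minimal sketch. The two error strings come from the comment above; everything else is an assumption. `checkMessageClientId` and `replacedClientIds` are hypothetical, and the thread does not show how the loader actually decides that a client "should have left". The sketch assumes it means a client id this client already replaced through a reconnect.

```typescript
type QuorumCheckError =
    | "messageClientIdMissingFromQuorum"
    | "messageClientIdShouldHaveLeft";

// Classify the sender of an incoming op against the current quorum.
// quorumMembers: client ids currently in the quorum.
// replacedClientIds: ids assumed to be stale (hypothetical bookkeeping for this sketch).
function checkMessageClientId(
    clientId: string,
    quorumMembers: string[],
    replacedClientIds: string[],
): QuorumCheckError | undefined {
    if (quorumMembers.indexOf(clientId) === -1) {
        // A message from a client that is not in the quorum and should be.
        return "messageClientIdMissingFromQuorum";
    }
    if (replacedClientIds.indexOf(clientId) !== -1) {
        // A message from a client that should not be in the quorum but is.
        return "messageClientIdShouldHaveLeft";
    }
    return undefined;
}

const quorum = ["client-a", "client-b"];
const replaced = ["client-a"];
console.log(checkMessageClientId("client-b", quorum, replaced));
console.log(checkMessageClientId("client-a", quorum, replaced));
console.log(checkMessageClientId("client-c", quorum, replaced));
```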

@markfields
Member

markfields commented Oct 9, 2020

@heliocliu Any updates here? Is there concrete work we should do this month to move along this investigation or mitigate the issue? Or are we waiting for releases to get picked up by OWH and deployed by partners to see telemetry?

@heliocliu
Contributor Author

@markfields Don't really have any updates here... The old telemetry suggested this issue wasn't as prevalent as feared, and the new telemetry (which hasn't been picked up yet afaik) introduces some throws, so we should have more to work with come 0.26 integration.

@ChumpChief
Contributor

0.26 integration is done but not deployed yet, so we'll keep an eye on this telemetry after that goes out.

@ChumpChief
Contributor

0.26 bump deployed yesterday, no hits on the new telemetry yet. Will continue to monitor. It's possible we just haven't happened to hit it, or that other issues are masking it.

curtisman modified the milestones: October 2020, November 2020 on Oct 28, 2020
@ChumpChief
Contributor

No hits on the new telemetry (>=0.26), and no hits on the old telemetry (<=0.25) since 10/27, so it seems plausible that #3787 was successful in mitigating. Closing as there are no recent hits.
