Fix for Duplicate IP issues #2105
This commit contains fixes for duplicate IPs, with 3 issues addressed:
1) A race condition when the datastore is not present, in cases like swarmkit.
2) Byte offset calculation: depending on where the start bit falls in the bit sequence, the calculation was adding extra bytes to the offset when the start bit was in the middle of one of the instances in a block.
3) Finding the available bit was returning the last bit in the current instance in a block if the block was not full and the current bit was after the last available bit.
Signed-off-by: Abhinandan Prativadi <abhi@docker.com>
This commit fixes a panic due to concurrent map access.
Signed-off-by: Abhinandan Prativadi <abhi@docker.com>
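The commit message does not say which map was involved, so the following is only a generic sketch of the usual remedy in Go: every read and write of a shared map goes through one mutex. The `registry` type and its methods are hypothetical names used for illustration, not libnetwork code.

```go
package main

import (
	"fmt"
	"sync"
)

// registry is a hypothetical type used only for this illustration.
type registry struct {
	mu sync.Mutex
	m  map[string]int
}

func newRegistry() *registry {
	return &registry{m: make(map[string]int)}
}

// incr updates the map while holding the mutex.
func (r *registry) incr(key string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.m[key]++
}

// get reads the map while holding the same mutex.
func (r *registry) get(key string) int {
	r.mu.Lock()
	defer r.mu.Unlock()
	return r.m[key]
}

// countConcurrently starts n goroutines that each increment the same key once
// and returns the final value after all of them have finished.
func countConcurrently(n int) int {
	r := newRegistry()
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			r.incr("10.0.0.0/24")
		}()
	}
	wg.Wait()
	return r.get("10.0.0.0/24")
}

func main() {
	fmt.Println(countConcurrently(100)) // 100
}
```

Without the mutex, unsynchronized writes to a Go map from several goroutines can abort the program with a fatal "concurrent map writes" error.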
Codecov Report
@@ Coverage Diff @@
## master #2105 +/- ##
=========================================
Coverage ? 40.43%
=========================================
Files ? 139
Lines ? 22376
Branches ? 0
=========================================
Hits ? 9047
Misses ? 12000
Partials ? 1329
ping @aboch
@@ -498,24 +506,40 @@ func (h *Handle) UnmarshalJSON(data []byte) error {
func getFirstAvailable(head *sequence, start uint64) (uint64, uint64, error) {
	// Find sequence which contains the start bit
	byteStart, bitStart := ordinalToPos(start)
As a reminder for myself: byteStart and bitStart here are relative to the first byte of the block that the ordinal start falls into.
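For readers following along, the ordinal-to-position conversion can be sketched as a division by 8. This is a minimal sketch that assumes `ordinalToPos` simply splits the ordinal into a byte offset and a bit offset within that byte; the diff above does not show its body.

```go
package main

import "fmt"

// ordinalToPos is sketched here under the assumption that it splits a bit
// ordinal into a byte offset and a bit offset inside that byte.
func ordinalToPos(ordinal uint64) (uint64, uint64) {
	return ordinal / 8, ordinal % 8
}

func main() {
	byteStart, bitStart := ordinalToPos(42)
	fmt.Println(byteStart, bitStart) // 5 2
}
```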
@@ -498,24 +506,40 @@ func (h *Handle) UnmarshalJSON(data []byte) error {
func getFirstAvailable(head *sequence, start uint64) (uint64, uint64, error) {
	// Find sequence which contains the start bit
	byteStart, bitStart := ordinalToPos(start)
	current, _, _, inBlockBytePos := findSequence(head, byteStart)
	current, _, precBlocks, inBlockBytePos := findSequence(head, byteStart)
current is the pointer to the block that byteStart falls into.
precBlocks is the number of block instances that precede byteStart within that same block.
Example: with head->0xFFFF00FF|10 and an ordinal of 42, the byte start is 5; in this case precBlocks is 1, indicating that the ordinal falls in the second compressed block instance.
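The example in this comment can be reproduced with a small sketch. The `sequence` struct below is a simplified stand-in and `locate` is a hypothetical helper, not the real `findSequence`; it assumes each block is a 32-bit word (4 bytes) repeated `count` times.

```go
package main

import "fmt"

const blockBytes = 4 // assumption: each block is a 32-bit word

// sequence is a simplified stand-in for a run-length-encoded node:
// the same 32-bit block repeated count times.
type sequence struct {
	block uint32
	count uint64
	next  *sequence
}

// locate is a hypothetical helper. It returns the node that contains bytePos,
// the number of block instances preceding bytePos inside that node, and the
// byte offset inside the instance. It returns a nil node if bytePos is out of range.
func locate(head *sequence, bytePos uint64) (*sequence, uint64, uint64) {
	for cur := head; cur != nil; cur = cur.next {
		span := cur.count * blockBytes
		if bytePos < span {
			return cur, bytePos / blockBytes, bytePos % blockBytes
		}
		bytePos -= span
	}
	return nil, 0, 0
}

func main() {
	head := &sequence{block: 0xFFFF00FF, count: 10}
	byteStart := uint64(42 / 8) // ordinal 42 falls in byte 5
	cur, precBlocks, inBlockBytePos := locate(head, byteStart)
	fmt.Println(cur == head, precBlocks, inBlockBytePos) // true 1 1
}
```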
goto next
}
if err != nil {
	// There are some more instances of the same block, so add the offset
This is only possible in the case of serial allocation, where the start can be within the block itself.
LGTM
First set of comments. Still going through the actual RLE changes.
if store != nil {
	h.Unlock() // The lock is acquired in the GetObject
It appears store.GetObject will modify h (through the SetValue and SetIndex calls). If so, shouldn't we keep h locked while that is happening? Also, it appears the lock acquired in GetObject is a datastore lock rather than the object's lock. Is h.Unlock() needed and correct before the GetObject call invoked on h?
The way the store logic works in libnetwork is that the in-memory data structure gets overwritten by the store data. So there need not be a lock going into the GetObject
if store != nil {
	h.Unlock() // The lock is acquired in the GetObject
	if err := store.GetObject(datastore.Key(h.Key()...), h); err != nil && err != datastore.ErrKeyNotFound {
h.Unlock() will be needed at this point since we are returning. Potentially in a later patch, do you think the code can be restructured a little to just defer the unlock (instead of the unlock/lock sequences) after grabbing a lock on h before the for loop? I.e.:
h.Lock()
defer h.Unlock()
for {
    .
    .
    .
}
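A minimal sketch of the lock-once-and-defer shape suggested here, using a hypothetical `handle` type with an embedded `sync.Mutex` and a plain bool slice rather than the real bit sequence. It only illustrates the non-store path; it does not model the store calls, which are the reason given in this thread for releasing the lock mid-function.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// handle is a hypothetical stand-in for the real Handle type.
type handle struct {
	sync.Mutex
	bits []bool
}

// setAny marks the first free bit as used and returns its ordinal.
// The lock is taken once and released by defer on every return path,
// so no explicit unlock is needed before the early return.
func (h *handle) setAny() (uint64, error) {
	h.Lock()
	defer h.Unlock()
	for i, used := range h.bits {
		if !used {
			h.bits[i] = true
			return uint64(i), nil
		}
	}
	return 0, errors.New("no bit available")
}

func main() {
	h := &handle{bits: make([]bool, 2)}
	for i := 0; i < 3; i++ {
		ordinal, err := h.setAny()
		fmt.Println(ordinal, err)
	}
}
```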
Yes, I hate the sequence too. But because the swarmkit and native cases are handled differently, it's kind of needed like this. I will work on the refactor next.
if err := nh.writeToStore(); err != nil {
	if _, ok := err.(types.RetryError); !ok {
		return ret, fmt.Errorf("internal failure while setting the bit: %v", err)
if h.store != nil {
Do we need to check h.store again at this point? It seems we already check earlier in the function. If we lock h outside the for loop, early in the function, I think h.store should not become nil midway?
This is to differentiate between the store and non-store code paths. I need to unlock before writing to the store; otherwise it will lead to a deadlock.
Got it regarding the logic to differentiate store/non-store codepaths.
For the locks, it looks like we are calling nh.writeToStore() below, which will lock nh's mutex and not that of h, right?
Yep, that's right. All the more reason not to have a lock around the operation, right? I retained the logic needed for the store code path the way it was.
This commit contains test cases to verify the changes and to solidify the library. Signed-off-by: Abhinandan Prativadi <abhi@docker.com>
LGTM
@abhi @fcrisciani
then doing a nslookup from a running replica for some-xyz-service, shows an extra IP that does not show on a
@alexafshar typically I would say no. But this fix will only be effective on a network that is not already corrupted. So the suggestion is to check that the network does not show duplicates; if it does, the best option is to start with a new network.
While investigating the duplicate IP issue in the IPAM library, the following bugs were found:
For example:
Consider the block where curr is 16 and the sequence is `sequence:0xfffeffff, count:10, next:{sequence:0x0fffffff, count:6}`. In that case `getAvailableBit(16)` would return 31 as the next available bit, and we would end up with duplicate IPs, vxlan IDs, etc.
Tests are added to verify the above scenarios.
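The intended behavior for this example can be sketched with a standalone helper. `firstFree` is a hypothetical function, not the library's `getAvailableBit`; it assumes bit 0 is the most significant bit of the 32-bit block, which matches the example where bit 15 is the only free bit in 0xfffeffff.

```go
package main

import (
	"errors"
	"fmt"
)

const blockLen = 32 // assumption: blocks are 32-bit words

var errNoBitAvailable = errors.New("no bit available")

// firstFree is a hypothetical helper. It returns the position of the first
// unset bit at or after from, counting position 0 as the most significant bit.
// If every bit from that position onward is set, it reports an error instead
// of returning a bit that is already in use.
func firstFree(block uint32, from uint64) (uint64, error) {
	for pos := from; pos < blockLen; pos++ {
		mask := uint32(1) << (blockLen - 1 - pos)
		if block&mask == 0 {
			return pos, nil
		}
	}
	return 0, errNoBitAvailable
}

func main() {
	fmt.Println(firstFree(0xfffeffff, 0))  // 15 <nil>
	fmt.Println(firstFree(0xfffeffff, 16)) // 0 no bit available
}
```

With the search starting at 16, the caller gets an error and can move on to the next block rather than reusing bit 31.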
Signed-off-by: Abhinandan Prativadi <abhi@docker.com>