-
Notifications
You must be signed in to change notification settings - Fork 15
Conversation
9a2d432
to
595724c
Compare
While we try to reconnect to proxy on errors while reading from proxy socket, I think we should do the same for failures to write to the proxy socket. While this is a good start for ensuring high availability, I would like to mention that we need to keep track of messages for which the shim has not received any acknowledgement from the proxy and attempt to resend them. Right now the shim does not keep track of the responses received from proxy for the messages that it sends. I will open a separate issue for that so that it can be tackled in a separate PR. |
src/shim.c
Outdated
@@ -703,6 +704,9 @@ handle_proxy_message(struct cc_shim *shim) | |||
|
|||
fr = read_frame(shim); | |||
if ( !fr) { | |||
if (! reconnect_to_proxy(shim)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're making lots of assumption by calling this after you got a fr == NULL
. Indeed, looking at read_frame()
function, there are several reasons why we could get an error, and they are not all related to the proxy disconnection.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sboeuf, thanks for the review.
Okay. Re-connection when header length or header version is wrong doesn't make much sense. If I change for these cases instead of returning false to abort()? Would it make sense for you?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I understand you well, you want to change the return false
with abort()
in case you get header length or header version which is wrong ? Because you consider those cases like real issues and we should just terminate the shim, right ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the delayed response.
So read_frame() can return NULL in case when read_wire_data() fails --> we re-connect.
Also read_frame() return NULL in cases:
shim == NULL or shim->proxy_sock_fd < 0
fr->header.header_len < MIN_HEADER_WORD_SIZE
fr->header.version != PROXY_VERSION
My proposition is to handle these cases the same as failures with allocation --> calling abort(). Doing this way will make read_frame() return NULL only in cases when read_wire_date() failed.
UPDATE:
Maybe it's not the best idea to address this problem as part of this PR. For simplicity I reworked the patch to handle only the case when recv() returns 0 ("a stream socket peer has performed an orderly shutdown").
src/shim.c
Outdated
shim_error("Could not send connect command to proxy\n"); | ||
goto try_again; | ||
} | ||
add_pollfd(poll_fds, PROXY_SOCK_INDEX, shim->proxy_sock_fd, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add a comment explaining why this is being called again for PROXY_SOCK_INDEX
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay. Will add a comment here. Thanks.
src/shim.c
Outdated
// How many seconds shim tries to reconnect to proxy | ||
const int RECONNECT_TIMEOUT_S = 10; | ||
// Period between each attempt in milliseconds | ||
const int RECONNECT_POLL_MS = 333; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be useful if these two variables could be specified as getopt options, defaulting to #define
'd values you're using here (as that would allow the defaults to be modified at build time and runtime).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the review. Makes sense. Will redo the patch.
src/shim.c
Outdated
// Period between each attempt in milliseconds | ||
const int RECONNECT_POLL_MS = 333; | ||
|
||
shim_warning("Reconnecting to proxy\n"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you reference RECONNECT_TIMEOUT_S
in this message so it is clear from the log how long the shim will wait before giving up. I appreciate that you are logging that on error but our "rule of thumb" is: "if the code is going to sleep / block, always add a log call before that action in case 'the impossible' happens" 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it. Will rework.
595724c
to
bd82046
Compare
@sboeuf, I've updated the patch. Could you please review it. Thanks. |
@dvoytik sure, I'll do that tomorrow. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few comments, but the PR looks globally good to me.
src/shim.c
Outdated
|
||
// Period between attempts to reconnect to proxy in milliseconds | ||
#ifndef RECONNECT_POLL_MS | ||
#define RECONNECT_POLL_MS 333 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where does this value come from ? I think we should go with something more simple: 100
, 200
, 300
, ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll set it to 100.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good to me.
src/shim.c
Outdated
@@ -954,6 +966,51 @@ connect_to_proxy(struct cc_shim *shim) | |||
return false; | |||
} | |||
|
|||
inline void sleep_ms(int ms) | |||
{ | |||
struct timespec ts = { 0, ms * 1000000L }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might worth changing 1000000L
with 1e6
for more readability.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
gcc 6.4.1 complains, it seems 1e6
is a double literal. I'll replace it with:
struct timespec ts = { 0, ms * 1000 * 1000 };
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh okay, then that's fine, don't change it.
src/shim.c
Outdated
shim->timeout); | ||
int time = 0; | ||
|
||
try_again: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better to avoid using goto
when possible for more clarity in the code. Use a loop instead:
do {
...
} while ()`
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. I'll rework this piece.
src/shim.c
Outdated
@@ -971,6 +1028,7 @@ print_usage(void) { | |||
printf(" -c, --container-id Container id\n"); | |||
printf(" -d, --debug Enable debug output\n"); | |||
printf(" -t, --token Connection token passed by cc-proxy\n"); | |||
printf(" -r, --rtimeout Reconnection timeout to cc-proxy in seconds\n"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
About the long option name, I think you can go with something longer like reconnect-timeout
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do.
src/shim.c
Outdated
try_again: | ||
sleep_ms(RECONNECT_POLL_MS); | ||
time += RECONNECT_POLL_MS; | ||
if (time >= shim->timeout * 1000) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
1e3
instead of 1000
sounds clearer to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting. For me it's completely opposite :) Nevertheless, I'm not sure it's a good idea to mix double/float literals with integer variables.
src/shim.c
Outdated
@@ -983,6 +1041,7 @@ main(int argc, char **argv) | |||
.container_id = NULL, | |||
.proxy_sock_fd = -1, | |||
.token = NULL, | |||
.timeout = RECONNECT_TIMEOUT_S, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please align this with other assignments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do.
586c4da
to
18213cd
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two tiny comments.
@jodh-intel @amshinde please take a look.
src/shim.c
Outdated
} | ||
|
||
shim_debug("Sending connect command\n"); | ||
if (! send_connect_command(shim)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since that's the second place in the code where we use both connect_to_proxy()
with send_connect_command()
right after (which is logical), I think we could factorize this through a new function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dvoytik you have updated the PR, but I don't think you addressed this comment. Am I missing something ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dvoytik ping ?
src/shim.c
Outdated
@@ -971,6 +1031,7 @@ print_usage(void) { | |||
printf(" -c, --container-id Container id\n"); | |||
printf(" -d, --debug Enable debug output\n"); | |||
printf(" -t, --token Connection token passed by cc-proxy\n"); | |||
printf(" -r, --reconnect-timeout Reconnection timeout to cc-proxy in seconds\n"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please align this with the rest of the output.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay.
0666c12
to
cf3da3d
Compare
@amshinde - could you re-review? |
cf3da3d
to
0df68ba
Compare
@jodh-intel, sure, no problem. I've updated the PR according to your comments. Thanks. |
@dvoytik in case you missed that in the old comments, you've missed one comment I left two weeks ago:
Does that make sense ? |
0df68ba
to
df6d2ca
Compare
@sboeuf, sorry, indeed I missed that comment. Thanks for reminding. I've updated the PR. |
Hi @dvoytik - are you still planning to work on this? |
df6d2ca
to
b8325e8
Compare
@jodh-intel, this PR was ready to merge one month ago in my opinion. I've rebased it again on the latest master. If all tests pass then it should be merged. This PR is needed by clearcontainers/proxy#107 |
@sboeuf pls re-review - we'll need your status ack :-) |
Yes this is fine. Let's wait for the CI on this PR and we can merge it. |
This still needs an ack from a project approver. @sameo, @grahamwhaley, @amshinde - could you take a look? |
pingety-ping! 😄 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
I'd like to hear from @amshinde though for final ack, as primary original author/owner of the proxy iirc
src/shim.c
Outdated
@@ -955,6 +968,71 @@ connect_to_proxy(struct cc_shim *shim) | |||
} | |||
|
|||
/* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nitpick: @jodh-intel do these need to be \*!
comment starts in order to get the marked up comment notionally processed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep - good catch.
Aside: I just ran doxygen and found that there are a couple of errors that need correcting for master
too [ #82]).
nice work btw @dvoytik :-) |
Try to reconnect to proxy if connection is lost (e.g. proxy died). Timeout is defined either during configure phase (--with-timeout) or by passing --reconnect-timeout command line parameter. This patch is needed to implement high availability of cc-proxy. Fixes clearcontainers#53. Signed-off-by: Dmitry Voytik <dmitry.voytik@huawei.com>
b8325e8
to
ec00fcf
Compare
@grahamwhaley, thank you! :) |
Hi @amshinde - please can you review? |
@amshinde - I've tested this PR in combination with clearcontainers/proxy#107 and it works a treat! ;-) |
@jodh-intel thanks! |
Try to re-connect to proxy if connection is lost (e.g. proxy died)
with RECONNECT_TIMEOUT_S timeout.
This patch is needed to implement high availability of cc-proxy.
Fixes #53.
Signed-off-by: Dmitry Voytik dmitry.voytik@huawei.com