Skip to content
This repository has been archived by the owner on May 6, 2020. It is now read-only.

connect: Try to re-connect to proxy #54

Merged
merged 1 commit into from
Oct 10, 2017

Conversation

dvoytik
Copy link
Contributor

@dvoytik dvoytik commented Aug 1, 2017

Try to re-connect to proxy if connection is lost (e.g. proxy died)
with RECONNECT_TIMEOUT_S timeout.

This patch is needed to implement high availability of cc-proxy.

Fixes #53.

Signed-off-by: Dmitry Voytik dmitry.voytik@huawei.com

@sboeuf sboeuf changed the title connect: Try to recconect to proxy connect: Try to re-connect to proxy Aug 1, 2017
@sboeuf sboeuf self-requested a review August 1, 2017 16:57
@amshinde
Copy link
Contributor

amshinde commented Aug 3, 2017

While we try to reconnect to proxy on errors while reading from proxy socket, I think we should do the same for failures to write to the proxy socket.

While this is a good start for ensuring high availability, I would like to mention that we need to keep track of messages for which the shim has not received any acknowledgement from the proxy and attempt to resend them. Right now the shim does not keep track of the responses received from proxy for the messages that it sends. I will open a separate issue for that so that it can be tackled in a separate PR.

src/shim.c Outdated
@@ -703,6 +704,9 @@ handle_proxy_message(struct cc_shim *shim)

fr = read_frame(shim);
if ( !fr) {
if (! reconnect_to_proxy(shim)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're making lots of assumption by calling this after you got a fr == NULL. Indeed, looking at read_frame() function, there are several reasons why we could get an error, and they are not all related to the proxy disconnection.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sboeuf, thanks for the review.

Okay. Re-connection when header length or header version is wrong doesn't make much sense. If I change for these cases instead of returning false to abort()? Would it make sense for you?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand you well, you want to change the return false with abort() in case you get header length or header version which is wrong ? Because you consider those cases like real issues and we should just terminate the shim, right ?

Copy link
Contributor Author

@dvoytik dvoytik Aug 10, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delayed response.

So read_frame() can return NULL in case when read_wire_data() fails --> we re-connect.
Also read_frame() return NULL in cases:

shim == NULL or shim->proxy_sock_fd < 0
fr->header.header_len < MIN_HEADER_WORD_SIZE
fr->header.version != PROXY_VERSION

My proposition is to handle these cases the same as failures with allocation --> calling abort(). Doing this way will make read_frame() return NULL only in cases when read_wire_date() failed.

UPDATE:
Maybe it's not the best idea to address this problem as part of this PR. For simplicity I reworked the patch to handle only the case when recv() returns 0 ("a stream socket peer has performed an orderly shutdown").

src/shim.c Outdated
shim_error("Could not send connect command to proxy\n");
goto try_again;
}
add_pollfd(poll_fds, PROXY_SOCK_INDEX, shim->proxy_sock_fd,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a comment explaining why this is being called again for PROXY_SOCK_INDEX.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay. Will add a comment here. Thanks.

src/shim.c Outdated
// How many seconds shim tries to reconnect to proxy
const int RECONNECT_TIMEOUT_S = 10;
// Period between each attempt in milliseconds
const int RECONNECT_POLL_MS = 333;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be useful if these two variables could be specified as getopt options, defaulting to #define'd values you're using here (as that would allow the defaults to be modified at build time and runtime).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review. Makes sense. Will redo the patch.

src/shim.c Outdated
// Period between each attempt in milliseconds
const int RECONNECT_POLL_MS = 333;

shim_warning("Reconnecting to proxy\n");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you reference RECONNECT_TIMEOUT_S in this message so it is clear from the log how long the shim will wait before giving up. I appreciate that you are logging that on error but our "rule of thumb" is: "if the code is going to sleep / block, always add a log call before that action in case 'the impossible' happens" 😄

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. Will rework.

@dvoytik
Copy link
Contributor Author

dvoytik commented Aug 17, 2017

@sboeuf, I've updated the patch. Could you please review it. Thanks.

@sboeuf
Copy link
Contributor

sboeuf commented Aug 17, 2017

@dvoytik sure, I'll do that tomorrow.

Copy link
Contributor

@sboeuf sboeuf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few comments, but the PR looks globally good to me.

src/shim.c Outdated

// Period between attempts to reconnect to proxy in milliseconds
#ifndef RECONNECT_POLL_MS
#define RECONNECT_POLL_MS 333
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where does this value come from ? I think we should go with something more simple: 100, 200, 300, ...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll set it to 100.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good to me.

src/shim.c Outdated
@@ -954,6 +966,51 @@ connect_to_proxy(struct cc_shim *shim)
return false;
}

inline void sleep_ms(int ms)
{
struct timespec ts = { 0, ms * 1000000L };
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might worth changing 1000000L with 1e6 for more readability.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gcc 6.4.1 complains, it seems 1e6 is a double literal. I'll replace it with:

struct timespec ts = { 0, ms * 1000 * 1000 };

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh okay, then that's fine, don't change it.

src/shim.c Outdated
shim->timeout);
int time = 0;

try_again:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better to avoid using goto when possible for more clarity in the code. Use a loop instead:

do {
   ...
} while ()`

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. I'll rework this piece.

src/shim.c Outdated
@@ -971,6 +1028,7 @@ print_usage(void) {
printf(" -c, --container-id Container id\n");
printf(" -d, --debug Enable debug output\n");
printf(" -t, --token Connection token passed by cc-proxy\n");
printf(" -r, --rtimeout Reconnection timeout to cc-proxy in seconds\n");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

About the long option name, I think you can go with something longer like reconnect-timeout

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do.

src/shim.c Outdated
try_again:
sleep_ms(RECONNECT_POLL_MS);
time += RECONNECT_POLL_MS;
if (time >= shim->timeout * 1000) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1e3 instead of 1000 sounds clearer to me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting. For me it's completely opposite :) Nevertheless, I'm not sure it's a good idea to mix double/float literals with integer variables.

src/shim.c Outdated
@@ -983,6 +1041,7 @@ main(int argc, char **argv)
.container_id = NULL,
.proxy_sock_fd = -1,
.token = NULL,
.timeout = RECONNECT_TIMEOUT_S,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please align this with other assignments.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do.

@dvoytik dvoytik force-pushed the try_to_reconnect branch 2 times, most recently from 586c4da to 18213cd Compare August 21, 2017 15:05
Copy link
Contributor

@sboeuf sboeuf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two tiny comments.
@jodh-intel @amshinde please take a look.

src/shim.c Outdated
}

shim_debug("Sending connect command\n");
if (! send_connect_command(shim)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since that's the second place in the code where we use both connect_to_proxy() with send_connect_command() right after (which is logical), I think we could factorize this through a new function.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dvoytik you have updated the PR, but I don't think you addressed this comment. Am I missing something ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dvoytik ping ?

src/shim.c Outdated
@@ -971,6 +1031,7 @@ print_usage(void) {
printf(" -c, --container-id Container id\n");
printf(" -d, --debug Enable debug output\n");
printf(" -t, --token Connection token passed by cc-proxy\n");
printf(" -r, --reconnect-timeout Reconnection timeout to cc-proxy in seconds\n");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please align this with the rest of the output.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay.

@dvoytik dvoytik force-pushed the try_to_reconnect branch 2 times, most recently from 0666c12 to cf3da3d Compare August 22, 2017 09:02
@jodh-intel
Copy link
Contributor

jodh-intel commented Aug 30, 2017

There are a couple of minor issues:

  • I'd add the new option in alphabetical order everywhere in the code.
  • Lots of strings hard-coding cc-proxy (I'd consider #define'ing it to avoid the duplications and make maintenance easier).

But these are nits, so:

lgtm

Approved with PullApprove

@jodh-intel
Copy link
Contributor

@amshinde - could you re-review?

@dvoytik
Copy link
Contributor Author

dvoytik commented Aug 31, 2017

@jodh-intel, sure, no problem. I've updated the PR according to your comments. Thanks.

@sboeuf
Copy link
Contributor

sboeuf commented Sep 1, 2017

@dvoytik in case you missed that in the old comments, you've missed one comment I left two weeks ago:

Since that's the second place in the code where we use both connect_to_proxy() with send_connect_command() right after (which is logical), I think we could factorize this through a new function.

Does that make sense ?

@dvoytik
Copy link
Contributor Author

dvoytik commented Sep 1, 2017

@sboeuf, sorry, indeed I missed that comment. Thanks for reminding. I've updated the PR.

@jodh-intel
Copy link
Contributor

Hi @dvoytik - are you still planning to work on this?

@dvoytik
Copy link
Contributor Author

dvoytik commented Oct 2, 2017

@jodh-intel, this PR was ready to merge one month ago in my opinion. I've rebased it again on the latest master. If all tests pass then it should be merged. This PR is needed by clearcontainers/proxy#107

@grahamwhaley
Copy link
Contributor

@sboeuf pls re-review - we'll need your status ack :-)

@sboeuf
Copy link
Contributor

sboeuf commented Oct 2, 2017

Yes this is fine. Let's wait for the CI on this PR and we can merge it.
LGTM

@jodh-intel
Copy link
Contributor

This still needs an ack from a project approver. @sameo, @grahamwhaley, @amshinde - could you take a look?

@jodh-intel
Copy link
Contributor

pingety-ping! 😄

Copy link
Contributor

@grahamwhaley grahamwhaley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm
I'd like to hear from @amshinde though for final ack, as primary original author/owner of the proxy iirc

src/shim.c Outdated
@@ -955,6 +968,71 @@ connect_to_proxy(struct cc_shim *shim)
}

/*
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick: @jodh-intel do these need to be \*! comment starts in order to get the marked up comment notionally processed?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep - good catch.

Aside: I just ran doxygen and found that there are a couple of errors that need correcting for master too [ #82]).

@grahamwhaley
Copy link
Contributor

nice work btw @dvoytik :-)

Try to reconnect to proxy if connection is lost (e.g. proxy died).
Timeout is defined either during configure phase (--with-timeout) or
by passing --reconnect-timeout command line parameter.

This patch is needed to implement high availability of cc-proxy.

Fixes clearcontainers#53.

Signed-off-by: Dmitry Voytik <dmitry.voytik@huawei.com>
@dvoytik
Copy link
Contributor Author

dvoytik commented Oct 9, 2017

@grahamwhaley, thank you! :)

@jodh-intel
Copy link
Contributor

Hi @amshinde - please can you review?

@jodh-intel
Copy link
Contributor

@amshinde - I've tested this PR in combination with clearcontainers/proxy#107 and it works a treat! ;-)

@amshinde
Copy link
Contributor

amshinde commented Oct 10, 2017

@jodh-intel thanks!
lgtm

Approved with PullApprove Approved with PullApprove

@amshinde amshinde merged commit ff07765 into clearcontainers:master Oct 10, 2017
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants