
create retry function #401

Closed
chavafg opened this issue Jun 11, 2018 · 8 comments

Comments

@chavafg
Contributor

chavafg commented Jun 11, 2018

We sometimes hit network issues when setting up the kata-containers environment, and the jobs fail even before the tests start. Adding a retry function for these cases would help minimize this type of failure.

@chavafg
Contributor Author

chavafg commented Jun 11, 2018

/cc @grahamwhaley

@grahamwhaley
Contributor

Great. Yes, I think this might be helpful in some cases. Quite often we are seeing network timeouts from:

  • docker pulls
  • apt and dnf updates
  • curl or wget pulls

I think there is no guarantee this will fix things, but the only way to find out is to try it. We'd probably want a few retries spread over a decent period of time: when a network is down it is sometimes very transient and sometimes a longer-term issue. So something like the following (sketched in code below):

  • retry again immediately
  • retry after 1 minute
  • retry after 5 minutes
  • if that did not work, then fail.
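
A rough, purely illustrative sketch of that schedule (the retry_with_backoff name is just a placeholder, not something in the tree):

# Illustrative only: first attempt, then retry immediately,
# then after 1 minute and after 5 minutes, then give up.
retry_with_backoff() {
        local delay
        "$@" && return 0
        for delay in 0 60 300; do
                sleep "${delay}"
                "$@" && return 0
        done
        echo "Giving up on command [$*]" >&2
        return 1
}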

@grahamwhaley
Contributor

I mulled over this a touch more (as I might need it for one of my CI systems, which seems to be suffering more network-timeout-related failures than I'd like - and not from inside the containers, but for things like docker pull and gpg key fetches...).

I think we could implement this one of two ways:

  1. write a retry function and prefix all the relevant calls with it (maybe added over multiple PRs - there are likely many places we'd need to do this):
    retry curl http://my_magic_url
  2. write a function and then alias the relevant commands such as docker, curl, wget, apt, dnf etc. in our global lib.sh, so that every call to a network-related command automatically gets a retry/timeout wrapped around it:
    alias wget=retry_wget

I like (2) as it is a simple quick fix that will (hopefully) cover all invocations, but it will also hide the retry functionality from anybody reading or implementing a script.
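
To make the trade-off concrete, roughly what each option would look like (the retry name is only illustrative):

# Option 1: edit every relevant call site
retry curl -sLO "http://my_magic_url"
retry sudo apt-get update

# Option 2: wrap the commands once in lib.sh; call sites stay unchanged
curl() { retry command curl "$@"; }
wget() { retry command wget "$@"; }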

Any preference @chavafg ? :-)

@grahamwhaley
Contributor

I had a peek at this, adding something like the following to lib.sh (note: bash functions are preferable to, and have basically replaced, 'alias', so functions it was...)

kata_retry() {
        local count
        local wait_time=0

        # First try is outside the loop to make it less verbose
        command "$@" && return

        # Now we have failed, so go into the retry loop
        for count in {1..3}; do
                echo "Retry $count of [$*]"

                command "$@" && return
                echo "Failed attempt $count of cmd [$*]"

                case "$count" in
                        1) wait_time=3 ;;
                        2) wait_time=6 ;;
                        # We do not die() here, as we don't know what the caller wants to do.
                        *) info "Timed out retrying command [$*]"
                           return 1 ;;
                esac

                echo "Sleep for ${wait_time}s before retry"
                sleep "${wait_time}"
        done
}

# Example wrapper: any plain (non-sudo) docker call then picks up the retry
docker() {
        kata_retry docker "$@"
}

I then went to check which commands might need function redirections - and realised that a number of them (take yum, for instance) are executed via sudo, which does not inherit shell function overrides.
At that point, I suspect we might have to go with hand-editing all the relevant invocation sites. Thoughts @jodh-intel @chavafg ?
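
To illustrate the sudo problem (sketch only, reusing the kata_retry draft above; some-package is a placeholder):

# A lib.sh override like this is not seen by sudo, because sudo runs
# the real yum in a new process that knows nothing about shell functions:
yum() { kata_retry yum "$@"; }

sudo yum -y install some-package        # runs the real yum, no retry

# So the retry would have to be added at the call site instead:
kata_retry sudo yum -y install some-package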

@chavafg
Contributor Author

chavafg commented Jun 15, 2018

Hi @grahamwhaley, sorry for my late response here.
So it seems that option 1 is better, although it may require more work, right?
Also, with option 2, would we be able to handle things like systemctl restart docker, where docker is not really the command?

I'd prefer option 1; I think it will be more readable...
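
(A purely illustrative sketch of that case: a call-site retry covers the whole command line, whereas a docker() override would not fire here at all.)

# Option 1: the whole command line is retried, regardless of
# which word happens to be "docker":
kata_retry sudo systemctl restart docker

# Option 2: a docker() function in lib.sh would never be invoked here,
# because "docker" is only an argument to systemctl, and the command
# runs under sudo anyway, bypassing shell function lookup.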

@marcov
Contributor

marcov commented Sep 19, 2018

+1 for adding this feature. Recently I have been hitting a few curl failures with timeouts or SSL errors.

BTW, do you know whether often-used packages like golang are cached? That would help avoid network-induced failures and speed up jobs.

@grahamwhaley
Contributor

Hi @marcov, afaik nothing is cached. For our Jenkins CIs that is predominantly because we (deliberately) use a clean, fresh machine/instance/VM for each build (see kata-containers/ci#39 for some details as to why). Thus, we'd need some external cache layer outside of our build instances - a proxy cache or something.
Travis iirc had some caching ability, but we struggled to get it to work reliably. Also see kata-containers/ci#64 for some thoughts on having a cache for Jenkins, which could also be used for things like the golang binary install/download.

@GabyCT
Contributor

GabyCT commented Nov 15, 2019

Closing this issue as it has been resolved.

@GabyCT GabyCT closed this as completed Nov 15, 2019