fix(healthcheck) record `last_run` when healthcheck is scheduled rather than completed #72

onematchfox · 2021-06-01T07:41:44Z

The active_check_timer in this library is currently executed every 0.1 seconds.

lua-resty-healthcheck/lib/resty/healthcheck.lua

Lines 1470 to 1493 in 9e9a936

    
    active_check_timer, err = resty_timer({
      recurring = true,
      interval = CHECK_INTERVAL,
      detached = false,
      expire = function()
        self:renew_periodic_lock()
        local cur_time = ngx_now()
        for _, checker_obj in ipairs(hcs) do
          if checker_obj.checks.active.healthy.active and
            (checker_obj.checks.active.healthy.last_run +
            checker_obj.checks.active.healthy.interval <= cur_time)
          then
            checker_callback(checker_obj, "healthy")
          end

          if checker_obj.checks.active.unhealthy.active and
            (checker_obj.checks.active.unhealthy.last_run +
            checker_obj.checks.active.unhealthy.interval <= cur_time)
          then
            checker_callback(checker_obj, "unhealthy")
          end
        end
      end,
    })

This loop checks the value of last_run to determine whether or not to schedule a healthcheck. However, last_run is only set once the check completes:

lua-resty-healthcheck/lib/resty/healthcheck.lua

Lines 1031 to 1041 in 9e9a936

    
    local timer, err = resty_timer({
      interval = 0,
      recurring = false,
      immediate = false,
      detached = true,
      expire = function()
        self:log(DEBUG, "checking ", health_mode, " targets: #", #list_to_check)
        self:active_check_targets(list_to_check)
        self.checks.active[health_mode].last_run = ngx_now()
      end,
    })

If the healthcheck for a set of targets takes longer than 0.1 seconds to complete, additional healthchecks are scheduled instead of waiting for the configured interval to pass. This can cause multiple issues and can even lead to a thundering-herd-style problem on upstreams that are having issues.

To begin with, if the unhealthy failures threshold is set to a number lower than the total number of checks scheduled in the time it takes one healthcheck to complete, a target will be marked as UNHEALTHY immediately. E.g. I have a timeout of 1 second set; if an upstream breaches that time because of some network glitch, I could potentially have 10 healthchecks scheduled in the period it takes the first to complete. If my unhealthy timeouts/failures threshold happens to be 5, my target could quite easily be marked as UNHEALTHY after what should have been a single healthcheck.

The problem is compounded when running a cluster of Kong nodes, particularly if one checks unhealthy targets more often than healthy ones (because one wants to mark the upstream/target as healthy as quickly as possible). In my case this prevented an upstream from coming back up again.
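To make the arithmetic concrete, here is a minimal simulation sketch in plain Lua (no OpenResty APIs). The names `TICK_MS`, `INTERVAL_MS`, `CHECK_DURATION_MS` and `simulate` are illustrative assumptions and do not exist in the library; the values mirror the example above (0.1 s timer, 1 s interval, a probe that takes 1 s). It only models the scheduling condition, not the real probe logic.

```lua
-- Simulation only: time is an integer counter in milliseconds.
local TICK_MS = 100            -- assumed: period of active_check_timer (0.1 s)
local INTERVAL_MS = 1000       -- assumed: configured healthcheck interval
local CHECK_DURATION_MS = 1000 -- assumed: a slow probe that takes 1 s

-- Counts how many checks are scheduled between t = 0 and t = total_ms.
-- record_on_schedule = false models recording last_run on completion,
-- record_on_schedule = true models recording it when the check is scheduled.
local function simulate(record_on_schedule, total_ms)
  local last_run = -INTERVAL_MS -- so that the first check is due at t = 0
  local scheduled = 0
  local pending = {}            -- completion times of in-flight checks

  for now = 0, total_ms, TICK_MS do
    -- finish any in-flight checks whose completion time has passed
    local still_pending = {}
    for _, done_at in ipairs(pending) do
      if done_at <= now then
        if not record_on_schedule then
          last_run = now
        end
      else
        still_pending[#still_pending + 1] = done_at
      end
    end
    pending = still_pending

    -- same shape of condition as the timer loop quoted above
    if last_run + INTERVAL_MS <= now then
      scheduled = scheduled + 1
      pending[#pending + 1] = now + CHECK_DURATION_MS
      if record_on_schedule then
        last_run = now
      end
    end
  end

  return scheduled
end

print(simulate(false, 900)) --> 10
print(simulate(true, 900))  --> 1
```

Within the first 900 ms, before the first slow probe has returned, recording on completion schedules a check on every tick, while recording at scheduling time schedules exactly one.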

Supersedes #70 (resolves that issue in addition to the one described above).

Resolves Kong/kong#7409

Logs:

kong_1            | Waiting for PostgreSQL on 'kong-database:5432'.  up!
kong_1            | Database already bootstrapped
kong_1            | Database is already up-to-date
kong_1            | 2021/06/01 07:39:21 [debug] 1#0: [lua] globalpatches.lua:10: installing the globalpatches
kong_1            | 2021/06/01 07:39:21 [debug] 1#0: [lua] client.lua:455: init(): [dns-client] (re)configuring dns client
kong_1            | 2021/06/01 07:39:21 [debug] 1#0: [lua] client.lua:460: init(): [dns-client] staleTtl = 4
kong_1            | 2021/06/01 07:39:21 [debug] 1#0: [lua] client.lua:463: init(): [dns-client] validTtl = nil
kong_1            | 2021/06/01 07:39:21 [debug] 1#0: [lua] client.lua:467: init(): [dns-client] noSynchronisation = false
kong_1            | 2021/06/01 07:39:21 [debug] 1#0: [lua] client.lua:486: init(): [dns-client] query order = LAST, SRV, A, CNAME
kong_1            | 2021/06/01 07:39:21 [debug] 1#0: [lua] client.lua:541: init(): [dns-client] adding AAAA-record from 'hosts' file: ip6-allrouters = [ff02::2]
kong_1            | 2021/06/01 07:39:21 [debug] 1#0: [lua] client.lua:541: init(): [dns-client] adding AAAA-record from 'hosts' file: ip6-localhost = [::1]
kong_1            | 2021/06/01 07:39:21 [debug] 1#0: [lua] client.lua:541: init(): [dns-client] adding AAAA-record from 'hosts' file: ip6-loopback = [::1]
kong_1            | 2021/06/01 07:39:21 [debug] 1#0: [lua] client.lua:526: init(): [dns-client] adding A-record from 'hosts' file: vagranthost = 10.0.2.2
kong_1            | 2021/06/01 07:39:21 [debug] 1#0: [lua] client.lua:541: init(): [dns-client] adding AAAA-record from 'hosts' file: ip6-mcastprefix = [ff00::0]
kong_1            | 2021/06/01 07:39:21 [debug] 1#0: [lua] client.lua:526: init(): [dns-client] adding A-record from 'hosts' file: localhost = 127.0.0.1
kong_1            | 2021/06/01 07:39:21 [debug] 1#0: [lua] client.lua:541: init(): [dns-client] adding AAAA-record from 'hosts' file: localhost = [::1]
kong_1            | 2021/06/01 07:39:21 [debug] 1#0: [lua] client.lua:541: init(): [dns-client] adding AAAA-record from 'hosts' file: ip6-localnet = [fe00::0]
kong_1            | 2021/06/01 07:39:21 [debug] 1#0: [lua] client.lua:541: init(): [dns-client] adding AAAA-record from 'hosts' file: ip6-allnodes = [ff02::1]
kong_1            | 2021/06/01 07:39:21 [debug] 1#0: [lua] client.lua:526: init(): [dns-client] adding A-record from 'hosts' file: b3445413d294 = 172.26.128.5
kong_1            | 2021/06/01 07:39:21 [debug] 1#0: [lua] client.lua:585: init(): [dns-client] nameserver 127.0.0.11
kong_1            | 2021/06/01 07:39:21 [debug] 1#0: [lua] client.lua:590: init(): [dns-client] attempts = 5
kong_1            | 2021/06/01 07:39:21 [debug] 1#0: [lua] client.lua:597: init(): [dns-client] no_random = true
kong_1            | 2021/06/01 07:39:21 [debug] 1#0: [lua] client.lua:606: init(): [dns-client] timeout = 2000 ms
kong_1            | 2021/06/01 07:39:21 [debug] 1#0: [lua] client.lua:610: init(): [dns-client] ndots = 1
kong_1            | 2021/06/01 07:39:21 [debug] 1#0: [lua] client.lua:612: init(): [dns-client] search =
kong_1            | 2021/06/01 07:39:21 [debug] 1#0: [lua] client.lua:618: init(): [dns-client] badTtl = 1 s
kong_1            | 2021/06/01 07:39:21 [debug] 1#0: [lua] client.lua:620: init(): [dns-client] emptyTtl = 30 s
kong_1            | 2021/06/01 07:39:21 [debug] 1#0: [lua] globalpatches.lua:263: randomseed(): seeding PRNG from OpenSSL RAND_bytes()
kong_1            | 2021/06/01 07:39:21 [debug] 1#0: [lua] client.lua:455: init(): [dns-client] (re)configuring dns client
kong_1            | 2021/06/01 07:39:21 [debug] 1#0: [lua] client.lua:460: init(): [dns-client] staleTtl = 4
kong_1            | 2021/06/01 07:39:21 [debug] 1#0: [lua] client.lua:463: init(): [dns-client] validTtl = nil
kong_1            | 2021/06/01 07:39:21 [debug] 1#0: [lua] client.lua:467: init(): [dns-client] noSynchronisation = false
kong_1            | 2021/06/01 07:39:21 [debug] 1#0: [lua] client.lua:486: init(): [dns-client] query order = LAST, SRV, A, CNAME
kong_1            | 2021/06/01 07:39:21 [debug] 1#0: [lua] client.lua:541: init(): [dns-client] adding AAAA-record from 'hosts' file: ip6-allrouters = [ff02::2]
kong_1            | 2021/06/01 07:39:21 [debug] 1#0: [lua] client.lua:541: init(): [dns-client] adding AAAA-record from 'hosts' file: ip6-localhost = [::1]
kong_1            | 2021/06/01 07:39:21 [debug] 1#0: [lua] client.lua:541: init(): [dns-client] adding AAAA-record from 'hosts' file: ip6-loopback = [::1]
kong_1            | 2021/06/01 07:39:21 [debug] 1#0: [lua] client.lua:526: init(): [dns-client] adding A-record from 'hosts' file: vagranthost = 10.0.2.2
kong_1            | 2021/06/01 07:39:21 [debug] 1#0: [lua] client.lua:541: init(): [dns-client] adding AAAA-record from 'hosts' file: ip6-mcastprefix = [ff00::0]
kong_1            | 2021/06/01 07:39:21 [debug] 1#0: [lua] client.lua:526: init(): [dns-client] adding A-record from 'hosts' file: localhost = 127.0.0.1
kong_1            | 2021/06/01 07:39:21 [debug] 1#0: [lua] client.lua:541: init(): [dns-client] adding AAAA-record from 'hosts' file: localhost = [::1]
kong_1            | 2021/06/01 07:39:21 [debug] 1#0: [lua] client.lua:541: init(): [dns-client] adding AAAA-record from 'hosts' file: ip6-localnet = [fe00::0]
kong_1            | 2021/06/01 07:39:21 [debug] 1#0: [lua] client.lua:541: init(): [dns-client] adding AAAA-record from 'hosts' file: ip6-allnodes = [ff02::1]
kong_1            | 2021/06/01 07:39:21 [debug] 1#0: [lua] client.lua:526: init(): [dns-client] adding A-record from 'hosts' file: b3445413d294 = 172.26.128.5
kong_1            | 2021/06/01 07:39:21 [debug] 1#0: [lua] client.lua:585: init(): [dns-client] nameserver 127.0.0.11
kong_1            | 2021/06/01 07:39:21 [debug] 1#0: [lua] client.lua:590: init(): [dns-client] attempts = 5
kong_1            | 2021/06/01 07:39:21 [debug] 1#0: [lua] client.lua:597: init(): [dns-client] no_random = true
kong_1            | 2021/06/01 07:39:21 [debug] 1#0: [lua] client.lua:606: init(): [dns-client] timeout = 2000 ms
kong_1            | 2021/06/01 07:39:21 [debug] 1#0: [lua] client.lua:610: init(): [dns-client] ndots = 1
kong_1            | 2021/06/01 07:39:21 [debug] 1#0: [lua] client.lua:612: init(): [dns-client] search =
kong_1            | 2021/06/01 07:39:21 [debug] 1#0: [lua] client.lua:618: init(): [dns-client] badTtl = 1 s
kong_1            | 2021/06/01 07:39:21 [debug] 1#0: [lua] client.lua:620: init(): [dns-client] emptyTtl = 30 s
kong_1            | 2021/06/01 07:39:22 [debug] 1#0: [lua] plugins.lua:245: load_plugin(): Loading plugin: request-transformer
kong_1            | 2021/06/01 07:39:22 [notice] 1#0: using the "epoll" event method
kong_1            | 2021/06/01 07:39:22 [notice] 1#0: openresty/1.19.3.1
kong_1            | 2021/06/01 07:39:22 [notice] 1#0: built by gcc 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)
kong_1            | 2021/06/01 07:39:22 [notice] 1#0: OS: Linux 4.15.0-143-generic
kong_1            | 2021/06/01 07:39:22 [notice] 1#0: getrlimit(RLIMIT_NOFILE): 1048576:1048576
kong_1            | 2021/06/01 07:39:22 [notice] 1#0: start worker processes
kong_1            | 2021/06/01 07:39:22 [notice] 1#0: start worker process 43
kong_1            | 2021/06/01 07:39:22 [debug] 43#0: *1 [lua] globalpatches.lua:263: randomseed(): seeding PRNG from OpenSSL RAND_bytes()
kong_1            | 2021/06/01 07:39:22 [debug] 43#0: *1 [lua] events.lua:211: do_event_json(): worker-events: handling event; source=resty-worker-events, event=started, pid=43, data=nil
kong_1            | 2021/06/01 07:39:22 [notice] 43#0: *1 [lua] warmup.lua:92: single_dao(): Preloading 'services' into the core_cache..., context: init_worker_by_lua*
kong_1            | 2021/06/01 07:39:22 [notice] 43#0: *1 [lua] warmup.lua:129: single_dao(): finished preloading 'services' into the core_cache (in 76ms), context: init_worker_by_lua*
kong_1            | 2021/06/01 07:39:22 [notice] 43#0: *2 [lua] warmup.lua:34: warming up DNS entries ..., context: ngx.timer
kong_1            | 2021/06/01 07:39:22 [debug] 43#0: *3 [lua] base.lua:1503: new(): [upstream:mock 1] balancer_base created
kong_1            | 2021/06/01 07:39:22 [debug] 43#0: *3 [lua] round_robin.lua:165: new(): [upstream:mock 1] round_robin balancer created
kong_1            | 2021/06/01 07:39:22 [debug] 43#0: *3 [lua] base.lua:917: newHost(): [upstream:mock 1] created a new host for: mock.example.com
kong_1            | 2021/06/01 07:39:22 [debug] 43#0: *3 [lua] base.lua:655: queryDns(): [upstream:mock 1] querying dns for mock.example.com
kong_1            | 2021/06/01 07:39:22 [debug] 43#0: *3 [lua] base.lua:570: f(): [upstream:mock 1] dns record type changed for mock.example.com, nil -> 1
kong_1            | 2021/06/01 07:39:22 [debug] 43#0: *3 [lua] base.lua:370: newAddress(): [upstream:mock 1] new address for host 'mock.example.com' created: 0.0.0.0:443 (weight 10)
kong_1            | 2021/06/01 07:39:22 [debug] 43#0: *3 [lua] base.lua:634: f(): [upstream:mock 1] updating balancer based on dns changes for mock.example.com
kong_1            | 2021/06/01 07:39:22 [debug] 43#0: *3 [lua] base.lua:644: f(): [upstream:mock 1] querying dns and updating for mock.example.com completed
kong_1            | 2021/06/01 07:39:22 [debug] 43#0: *3 [lua] healthcheck.lua:1127: log(): [healthcheck] (bdd6a8cd-50a2-4650-aba2-2e2cf172dc85:mock) Got initial target list (0 targets)
kong_1            | 2021/06/01 07:39:22 [debug] 43#0: *3 [lua] healthcheck.lua:1127: log(): [healthcheck] (bdd6a8cd-50a2-4650-aba2-2e2cf172dc85:mock) active check flagged as active
kong_1            | 2021/06/01 07:39:22 [debug] 43#0: *3 [lua] healthcheck.lua:1127: log(): [healthcheck] (bdd6a8cd-50a2-4650-aba2-2e2cf172dc85:mock) starting timer to check active checks
kong_1            | 2021/06/01 07:39:22 [debug] 43#0: *3 [lua] healthcheck.lua:1127: log(): [healthcheck] (bdd6a8cd-50a2-4650-aba2-2e2cf172dc85:mock) Healthchecker started!
kong_1            | 2021/06/01 07:39:22 [debug] 43#0: *3 [lua] events.lua:211: do_event_json(): worker-events: handling event; source=lua-resty-healthcheck [bdd6a8cd-50a2-4650-aba2-2e2cf172dc85:mock], event=healthy, pid=43, data=table: 0x7fdc2035e248
kong_1            | 2021/06/01 07:39:22 [debug] 43#0: *3 [lua] healthcheck.lua:1127: log(): [healthcheck] (bdd6a8cd-50a2-4650-aba2-2e2cf172dc85:mock) event: target added 'mock.example.com(0.0.0.0:443)'
kong_1            | 2021/06/01 07:39:22 [debug] 43#0: *3 [lua] healthcheck.lua:1127: log(): [healthcheck] (bdd6a8cd-50a2-4650-aba2-2e2cf172dc85:mock) event: target status 'mock.example.com(0.0.0.0:443)' from 'false' to 'true'
kong_1            | 2021/06/01 07:39:22 [debug] 43#0: *3 [lua] balancer.lua:850: create_balancers(): initialized 1 balancer(s), 0 error(s)
kong_1            | 2021/06/01 07:39:22 [notice] 43#0: *2 [lua] warmup.lua:58: finished warming up DNS entries' into the cache (in 750ms), context: ngx.timer
kong_1            | 2021/06/01 07:39:23 [debug] 43#0: *40 [lua] healthcheck.lua:1127: log(): [healthcheck] (bdd6a8cd-50a2-4650-aba2-2e2cf172dc85:mock) worker 0 (pid: 43) starting active check timer
kong_1            | 2021/06/01 07:39:27 [debug] 43#0: *45 [lua] init.lua:288: [cluster_events] polling events from: 1622533162.126
kong_1            | 2021/06/01 07:39:32 [debug] 43#0: *53 [lua] init.lua:288: [cluster_events] polling events from: 1622533162.126
kong_1            | 2021/06/01 07:39:37 [debug] 43#0: *61 [lua] init.lua:288: [cluster_events] polling events from: 1622533162.126
kong_1            | 2021/06/01 07:39:37 [debug] 43#0: *41 [lua] healthcheck.lua:1127: log(): [healthcheck] (bdd6a8cd-50a2-4650-aba2-2e2cf172dc85:mock) checking unhealthy targets: nothing to do
kong_1            | 2021/06/01 07:39:42 [debug] 43#0: *69 [lua] init.lua:288: [cluster_events] polling events from: 1622533162.126
kong_1            | 2021/06/01 07:39:47 [debug] 43#0: *77 [lua] init.lua:288: [cluster_events] polling events from: 1622533162.126
kong_1            | 2021/06/01 07:39:52 [debug] 43#0: *85 [lua] init.lua:288: [cluster_events] polling events from: 1622533162.126
kong_1            | 2021/06/01 07:39:52 [debug] 43#0: *41 [lua] healthcheck.lua:1127: log(): [healthcheck] (bdd6a8cd-50a2-4650-aba2-2e2cf172dc85:mock) checking unhealthy targets: nothing to do
kong_1            | 2021/06/01 07:39:52 [debug] 43#0: *89 [lua] healthcheck.lua:1127: log(): [healthcheck] (bdd6a8cd-50a2-4650-aba2-2e2cf172dc85:mock) checking healthy targets: #1
kong_1            | 2021/06/01 07:39:52 [debug] 43#0: *89 [lua] healthcheck.lua:1127: log(): [healthcheck] (bdd6a8cd-50a2-4650-aba2-2e2cf172dc85:mock) Checking mock.example.com 0.0.0.0:443 (currently healthy)
kong_1            | 2021/06/01 07:39:52 [debug] 43#0: *91 [lua] healthcheck.lua:1127: log(): [healthcheck] (bdd6a8cd-50a2-4650-aba2-2e2cf172dc85:mock) checking healthy targets: #1
kong_1            | 2021/06/01 07:39:52 [debug] 43#0: *91 [lua] healthcheck.lua:1127: log(): [healthcheck] (bdd6a8cd-50a2-4650-aba2-2e2cf172dc85:mock) Checking mock.example.com 0.0.0.0:443 (currently healthy)
kong_1            | 2021/06/01 07:39:52 [debug] 43#0: *93 [lua] healthcheck.lua:1127: log(): [healthcheck] (bdd6a8cd-50a2-4650-aba2-2e2cf172dc85:mock) checking healthy targets: #1
kong_1            | 2021/06/01 07:39:52 [debug] 43#0: *93 [lua] healthcheck.lua:1127: log(): [healthcheck] (bdd6a8cd-50a2-4650-aba2-2e2cf172dc85:mock) Checking mock.example.com 0.0.0.0:443 (currently healthy)
kong_1            | 2021/06/01 07:39:52 [debug] 43#0: *95 [lua] healthcheck.lua:1127: log(): [healthcheck] (bdd6a8cd-50a2-4650-aba2-2e2cf172dc85:mock) checking healthy targets: #1
kong_1            | 2021/06/01 07:39:52 [debug] 43#0: *95 [lua] healthcheck.lua:1127: log(): [healthcheck] (bdd6a8cd-50a2-4650-aba2-2e2cf172dc85:mock) Checking mock.example.com 0.0.0.0:443 (currently healthy)
kong_1            | 2021/06/01 07:39:52 [debug] 43#0: *97 [lua] healthcheck.lua:1127: log(): [healthcheck] (bdd6a8cd-50a2-4650-aba2-2e2cf172dc85:mock) checking healthy targets: #1
kong_1            | 2021/06/01 07:39:52 [debug] 43#0: *97 [lua] healthcheck.lua:1127: log(): [healthcheck] (bdd6a8cd-50a2-4650-aba2-2e2cf172dc85:mock) Checking mock.example.com 0.0.0.0:443 (currently healthy)
kong_1            | 2021/06/01 07:39:52 [debug] 43#0: *89 [lua] healthcheck.lua:1127: log(): [healthcheck] (bdd6a8cd-50a2-4650-aba2-2e2cf172dc85:mock) Reporting 'mock.example.com (0.0.0.0:443)' (got HTTP 200)
kong_1            | 2021/06/01 07:39:53 [debug] 43#0: *91 [lua] healthcheck.lua:1127: log(): [healthcheck] (bdd6a8cd-50a2-4650-aba2-2e2cf172dc85:mock) Reporting 'mock.example.com (0.0.0.0:443)' (got HTTP 200)
kong_1            | 2021/06/01 07:39:53 [debug] 43#0: *93 [lua] healthcheck.lua:1127: log(): [healthcheck] (bdd6a8cd-50a2-4650-aba2-2e2cf172dc85:mock) Reporting 'mock.example.com (0.0.0.0:443)' (got HTTP 200)
kong_1            | 2021/06/01 07:39:53 [debug] 43#0: *95 [lua] healthcheck.lua:1127: log(): [healthcheck] (bdd6a8cd-50a2-4650-aba2-2e2cf172dc85:mock) Reporting 'mock.example.com (0.0.0.0:443)' (got HTTP 200)
kong_1            | 2021/06/01 07:39:53 [debug] 43#0: *97 [lua] healthcheck.lua:1127: log(): [healthcheck] (bdd6a8cd-50a2-4650-aba2-2e2cf172dc85:mock) Reporting 'mock.example.com (0.0.0.0:443)' (got HTTP 200)
kong_1            | 2021/06/01 07:39:57 [debug] 43#0: *103 [lua] init.lua:288: [cluster_events] polling events from: 1622533162.126
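The burst is easy to miss when reading the logs by eye, so here is a small sketch for counting "checking ... targets: #N" lines per second and per health mode. The function name `count_checks_per_second` is a hypothetical helper, and the pattern assumes the debug log format shown above; it is not part of the library.

```lua
-- Hypothetical helper: tallies scheduled checks per "<timestamp> <mode>".
-- Lines such as "checking unhealthy targets: nothing to do" are ignored
-- because they do not contain "targets: #<count>".
local function count_checks_per_second(lines)
  local counts = {}
  for _, line in ipairs(lines) do
    local ts, mode = line:match(
      "(%d+/%d+/%d+ %d+:%d+:%d+) .-checking (%a+) targets: #%d+")
    if ts then
      local key = ts .. " " .. mode
      counts[key] = (counts[key] or 0) + 1
    end
  end
  return counts
end

-- Three lines copied from the log output above.
local sample = {
  "kong_1            | 2021/06/01 07:39:37 [debug] 43#0: *41 [lua] healthcheck.lua:1127: log(): [healthcheck] (bdd6a8cd-50a2-4650-aba2-2e2cf172dc85:mock) checking unhealthy targets: nothing to do",
  "kong_1            | 2021/06/01 07:39:52 [debug] 43#0: *89 [lua] healthcheck.lua:1127: log(): [healthcheck] (bdd6a8cd-50a2-4650-aba2-2e2cf172dc85:mock) checking healthy targets: #1",
  "kong_1            | 2021/06/01 07:39:52 [debug] 43#0: *91 [lua] healthcheck.lua:1127: log(): [healthcheck] (bdd6a8cd-50a2-4650-aba2-2e2cf172dc85:mock) checking healthy targets: #1",
}

local counts = count_checks_per_second(sample)
print(counts["2021/06/01 07:39:52 healthy"]) --> 2
```

Run against the full log above, the same key would count every "checking healthy targets: #1" line stamped 07:39:52, which is where a single check was expected.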

onematchfox · 2021-06-01T08:27:01Z

Reopening to kick off Travis again. Running OpenResty 1.19.3-2 locally and tests pass fine:

$ make test
PATH=/usr/local/openresty/nginx/sbin:$PATH prove -I../test-nginx/lib -r t
t/00-new.t ................................... ok     
t/01-start-stop.t ............................ ok     
t/02-add_target.t ............................ ok     
t/03-get_target_status.t ..................... ok   
t/04-report_success.t ........................ ok     
t/05-report_failure.t ........................ ok     
t/06-report_http_status.t .................... ok     
t/07-report_tcp_failure.t .................... ok     
t/08-report_timeout.t ........................ ok     
t/09-active_probes.t ......................... ok     
t/10-garbagecollect.t ........................ ok   
t/11-clear.t ................................. ok     
t/12-set_target_status.t ..................... ok   
t/13-integration.t ........................... ok   
t/14-tls_active_probes.t ..................... ok   
t/15-get_virtualhost_target_status.t ......... ok     
t/16-set_all_target_statuses_for_hostname.t .. ok   
t/17-mtls.t .................................. ok   
All tests successful.
Files=18, Tests=312, 59 wallclock secs ( 0.08 usr  0.02 sys +  2.79 cusr  0.66 csys =  3.55 CPU)
Result: PASS

Prevents a thundering herd issue whereby additional healthchecks are scheduled in the time in which it takes the healthcheck to complete.

locao · 2021-06-02T17:19:19Z

Hi @onematchfox! Thank you for your submission and detailed explanation about the issue! I'm adding this to our tasks, so your PR will be reviewed in the next few days.

ghost · 2021-06-07T20:46:32Z

I've decided to write a small regression test case. The test isn't perfect but should do. If you run it against the release/1.4.1 branch, it fails, i.e., BAD is returned in the response_body: #73
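For readers who want the idea without opening #73, the property being tested can be sketched as follows. This is not the test from #73; `interval_respected` is a hypothetical helper that takes probe timestamps in milliseconds and returns "OK" or "BAD", mirroring the response body values mentioned above.

```lua
-- Hypothetical helper: returns "OK" when every gap between consecutive
-- probe timestamps is at least `interval` (minus an optional tolerance),
-- and "BAD" as soon as two probes are closer together than that.
local function interval_respected(timestamps, interval, tolerance)
  tolerance = tolerance or 0
  for i = 2, #timestamps do
    if timestamps[i] - timestamps[i - 1] < interval - tolerance then
      return "BAD"
    end
  end
  return "OK"
end

-- Probes one second apart respect a 1000 ms interval.
print(interval_respected({ 0, 1000, 2000 }, 1000))    --> OK
-- Probes fired on every 100 ms tick do not.
print(interval_respected({ 0, 100, 200, 1000 }, 1000)) --> BAD
```

Timestamps are integers here to keep the comparison exact; with `ngx.now()` style floating-point seconds, a small tolerance would likely be needed.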

This is a squashed commit that realigns master branch with 3.0.0 release. In order to do so the master branch was reverted back to 1.3.0 release (commit: dc2a6b6) and then the 3.0.0 release branch was merged to it (up to commit: a2bec67). Below you can see all the details of the squashed commits. --------- * release 1.4.0 * fix(healthcheck) use single timer for all active checks (#62) * fix(healthcheck) use single timer for all active checks * tests(*) removed tests that are not needed * docs(*) docs for release 1.4.0 * chore(ci) use newer openresty and luarocks releases (#68) * fix(healthcheck) single worker actively checks the status (#67) * release 1.4.1 * fix(healthcheck) record `last_run` when healthcheck is scheduled (#72) Prevents a thundering herd issue whereby additional healthchecks are scheduled in the time in which it takes the healthcheck to complete. * tests(active-probes) interval is respected (#73) * fix(healthcheck) record `last_run` when healthcheck is scheduled Prevents a thundering herd issue whereby additional healthchecks are scheduled in the time in which it takes the healthcheck to complete. * tests(active-probes) interval is respected Co-authored-by: Brian Fox <brianhfox@gmail.com> * fix(healthcheck) remove event watcher when stopping hc (#74) Co-authored-by: Brian Fox <brianhfox@gmail.com> Co-authored-by: Brian Fox <brianhfox@gmail.com> * tests(*) avoid some flakiness (#75) * release 1.4.2 * chore(*) add GitHub Actions workflows (#82) * chore(*) add GitHub Actions workflows * fix(healthcheck) lint error * Simplify start of the checking timer (#85) * simplify start of the checking timer, ensuring only one worker actively sends healthchecks. one timer per worker, but before doing anything, tries to acquire an expiration lock. if fails, try again later. if the "winning" worker ever fails to renew it, some other worker would get it. 
* chore(rockspec) added rockspec for release 1.5.0-1 Also: - updated scm-1 rockspec - bumped openresty version in CI tests * feat(*) add header support for active checks * feat(active) support map headers * feat(healthcheck) delayed_clear function (#88) Added new function delayed_clear. This function marks all targets to be removed, but do not actually remove them. If before the delay parameter any of them is re-added, it is unmarked for removal. This function makes it possible to keep target state during config changes, where the targets might be removed and then re-added. * chore(readme) 1.5.0 release (#91) * chore(readme) 1.5.0 release * docs(*) release 1.5.0 Also added docs missing to delayed_clear() function. * fix(healthcheck) Use pair instead ipair for hcs weak table (#93) * release 1.5.1 (#95) * chore(readme) update badges (#98) * docs(readme) updated with 1.4.x changes * chore(workflows) updates for 1.6.0 release - added latest openresty to the CI matrix - added tests for when lua-resty-worker-events or lua-resty-events are used * feat(healthcheck) support setting the events module (#105) * feat(healthcheck) support setting the events module * fix(healthcheck) defaults to lua-resty-worker-events * tests(workflows) fixed manual deps install * fix(healthcheck) check empty opts * chore(workflows) use last luarocks * test(workflows) use pre-built deps, test with or 1.13-1.21 * chore(workflows) install lua-resty-events in ci * tests(workflows) debug * fixed tests and resty-events usage * init resty-events in init_worker * fix(tests) init events module (#107) * add init_worker in 03-get_target_status.t * fix 03-get_target_status.t * fix 03-get_target_status_with_sleeps.t * fix 04-report_success.t * fix 05/06 * fix 07/08 * fix 09 * change 10 * fix 11 * fix 12 * change 13 * fix 15 * partial fix 16 * change 17 * fix 18 * change 13 * fix 16 * style 05 * fix 01/02 * use string.buffer in OpenResty 1.21.4.1 (#109) * use string.buffer in OpenResty 1.21.4.1 * remove 
cjson require * fix(healthcheck) use the events module set in defaults * tests(with_resty-events) disabled tests that need more work * fix(healthcheck) avoid breaking when opts are nil * tests(with_resty-events) removed unnecessary test * tests(with_resty-events) increased sleeps Co-authored-by: Chrono <chrono_cpp@me.com> * release 1.6.0 (#110) * docs(readme) release 1.6.0 * fix(rockspec) typo * chore(rockspec) release 1.6.0 * docs(*) release 1.6.0 * chore(*) localize string.format (#111) * fix(healthcheck) support any lua-resty-events 0.1.x (#118) * chore(workflows) bump deps versions * chore(helathcheck) support any lua-resty-events 0.1.x * fix(healthchecker) port 2.x lock fixes to 1.5.x (#113) * fix(healthchecker) port 2.x lock fixes to 1.5.x * chore(healthcheck) remove unused vars * chore(healthcheck) fix indent level * fix(healthcheck) correct duplicate handling in add_target * fix(healthchecker) handle fetch_target_list failure in checker callback * chore(healthcheck) apply suggestions from #112 Co-authored-by: Vinicius Mignot <vinicius.mignot@gmail.com> * chore(healthcheck) increase verbosity for locked function failures (#114) * chore(healthcheck) increase verbosity for locked function failures * tests(healthcheck) add tests for run_locked() * fix(healthcheck) lower the cleanup check frequency the health-check timer also checks if targets must be removed. to safely remove targets, the targets list is locked. if this check runs on every health-check cycle and there are a large number of targets, a bazillion locks will be created. this change avoids that by lowering the frequency the cleanup list is checked. the side-effect is that targets marked for cleanup may exist for more time (2.5s) than expected, and some unexpected active checks could happen. * tests(clear) increase delay for delayed clear tests with less locks the wait for delayed clean is longer. 
* docs(readme) release 1.6.1 * chore(rockspecs) release 1.6.1 * release 1.6.1 * docs(readme) updated build badge * chore(ci) remove old openresty versions * feat(healthcheck) avoid duplication post in rebuild healthcheck scenario * release 1.6.2 * Added support for https_sni in healthcheck.lua (#49) * fix(mtls) use OpenResty's API for mtls (#99) * chore(ci): fix cache path (#136) ${{ env.* }} is not evaluated in `with` causing gha tries to cache `/`. * release 1.6.3 (#135) * release 3.0.0 (#142) * feat(ci/KAG-1800): add lint and sast workflows using shared actions * chore(ci): pin shared code quality actions * chore(*): backport - localize some functions A commit on master 80ee2e1 introduced localizing some functions. This commit backports that one. Backports: #92 * fix(healthcheck): fixed incorrect default http_statuses when new() was called multiple times (#83) * chore(lint): bump kong/public-shared-actions * docs(README): added 1.5.2 and 1.5.3 releases * chore(*) rename readme, add release instructions * chore(healthcheck): fix get_defaults function * fix(test): fix worker-events test * release 3.0.0 * chore(github): cancel in progress workflows when new pushed --------- Co-authored-by: saisatish karra <saisatish.karra@konghq.com> Co-authored-by: Shuoqing Ding <dsq704136@gmail.com> Co-authored-by: Vinicius Mignot <vinicius.mignot@gmail.com> Co-authored-by: Thijs Schreijer <thijs@thijsschreijer.nl> * chore(*): revert commits back to 1.3.0 This reverts the master branch backs to the commit of dc2a6b6 so that we can skip over 2.0.0 release. The 1.3.0 release is the first common commit between master branch and 1.6.x (also 3.0.x) branches. * chore(docs): fix semgrep https warnings * docs(readme): update shield badges Co-authored-by: Vinicius Mignot <vinicius.mignot@gmail.com> * chore(*): add 2.0.0 rockspecs and fix tests Release 2.0.x introduced some rockspecs with fixes. Reverting back to 1.3.0 and reapplying changes from 3.0.0 reversed those fixes. 
This commit reintroduces them. KAG-2704 --------- Co-authored-by: Vinicius Mignot <vinicius.mignot@gmail.com> Co-authored-by: Brian Fox <brianhfox@gmail.com> Co-authored-by: Murillo Paula <murillo@murillopaula.com> Co-authored-by: Javier <javier.guerra@konghq.com> Co-authored-by: Thijs Schreijer <thijs@thijsschreijer.nl> Co-authored-by: Mayo <i@shoujo.io> Co-authored-by: Tomasz Nowak <tomanowa@gmail.com> Co-authored-by: Chrono <chrono_cpp@me.com> Co-authored-by: Michael Martin <flrgh@protonmail.com> Co-authored-by: Jun Ouyang <ouyangjun1999@gmail.com> Co-authored-by: HansK-p <42314815+HansK-p@users.noreply.github.com> Co-authored-by: Qi <call_far@outlook.com> Co-authored-by: Wangchong Zhou <fffonion@gmail.com> Co-authored-by: Aapo Talvensaari <aapo.talvensaari@gmail.com> Co-authored-by: saisatish karra <saisatish.karra@konghq.com> Co-authored-by: Shuoqing Ding <dsq704136@gmail.com>

This was referenced Jun 1, 2021

fix(healthcheck) ensure last_run is updated when there's "nothing to do" #70

Closed

Healthchecks not respecting configured intervals Kong/kong#7409

Closed

onematchfox closed this Jun 1, 2021

onematchfox reopened this Jun 1, 2021

fix(healthcheck) record last_run when healthcheck is scheduled

72c5443

Prevents a thundering herd issue whereby additional healthchecks are scheduled in the time in which it takes the healthcheck to complete.

onematchfox force-pushed the fix/healthcheck-thundering-herd branch from 08cf81d to 72c5443 Compare June 1, 2021 12:17

ghost mentioned this pull request Jun 7, 2021

tests(active-probes) interval is respected #73

Merged

locao changed the base branch from release/1.4.1 to release/1.4.2 June 15, 2021 19:01

locao merged commit 03660ab into Kong:release/1.4.2 Jun 15, 2021

onematchfox deleted the fix/healthcheck-thundering-herd branch June 15, 2021 19:09
