feat: add health check package and update SQS receive func #178

bpeng · 2024-11-04T03:14:41Z

Proposed Changes

Ticket https://github.com/GeoNet/tickets/issues/13878

Changes proposed in this pull request:

Add a health package for service health checks.
Update SQS receiveMessage to accept a callback func, so the caller can do something (e.g. send a heartbeat) while the receive keeps waiting on an empty queue.

Production Changes

The following production changes are required to deploy these changes:

None

Review

Check the box that applies to this code review. If necessary please seek help with adding a checklist guide for the reviewer.
When assigning the code review please consider the expertise needed to review the changes.

This is a content (documentation, web page etc) only change.
This is a minor change (meta data, bug fix, improve test coverage etc).
This is a larger change (new feature, significant refactoring etc). Please use the code review guidelines to add a checklist below to guide the code reviewer.

Code Review Guide

Insert check list here if needed.

CallumNZ · 2024-11-05T22:30:32Z

So from what I can tell, the health package is to be used by services that don't have an HTTP server for their primary function, and simply adds one with an /soh endpoint. Is that right?

How does the ECS health check work? Does ECS call the endpoint at regular intervals?
What would normal usage of this package look like?

bpeng · 2024-11-05T23:59:40Z

The check is scheduled on AWS via Terraform; the health status can be viewed on dev.

CallumNZ · 2024-11-05T22:16:18Z

aws/sqs/sqs.go

+// define a func to callback when no message received from the sqs and the process continues
+// to receive for a long time, for the calling func to do sth in the middle,
+// e.g. send a heartbeat
+type CallBackOperation func()


Seems wrong to couple SQS to the health check.
Can the caller be responsible for running the health check? Like how Mark does it? https://github.com/GeoNet/tickets/issues/13878#issuecomment-1667189581

The health check is run by the caller. This just sends a heartbeat between SQS receives; otherwise the receive waits on the empty queue for ages (like tilde-ingest-backfill) and the health check fails, unless we skip the age check.

Can we not trust that, if the queue is being listened to, all is well?

What is the ECS health check trying to achieve? Is getting a successful response from the health http server not enough? What failure states are possible?

Yes, that is another option, but allowing a heartbeat to be sent in between is also reasonable.

CallumNZ · 2024-11-06T01:01:50Z

health/service.go

+}
+
+func (s *Service) handler(w http.ResponseWriter, r *http.Request) {
+	switch ok := s.state(); {


What's the reason for scoping ok like this? Strange to read.

It's Mark's code: technically correct, but not easy to read.
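
For readers unfamiliar with the form: `switch ok := s.state(); {` is a tagless switch with an init statement, so `ok` is scoped to the switch and each `case` is a boolean expression. A sketch of the same logic written both ways follows; the `state` function and the returned strings are invented for illustration and are not taken from the PR.

```go
package main

import "fmt"

// state is a hypothetical stand-in for s.state().
func state(healthy bool) bool { return healthy }

// terse mirrors the form in the diff: an init statement and no switch tag,
// which makes each case a boolean condition.
func terse(healthy bool) string {
	switch ok := state(healthy); {
	case ok:
		return "ok"
	default:
		return "fail"
	}
}

// plain is the equivalent written as an if statement.
func plain(healthy bool) string {
	if state(healthy) {
		return "ok"
	}
	return "fail"
}

func main() {
	for _, h := range []bool{true, false} {
		fmt.Println(h, terse(h), plain(h))
	}
}
```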

CallumNZ · 2024-11-06T01:05:19Z

health/service.go

+// Alive allows an application to perform a complex task while still sending hearbeats.
+func (s *Service) Alive(ctx context.Context, heartbeat time.Duration) context.CancelFunc {
+	ctx, cancel := context.WithCancel(ctx)
+
+	go func() {
+		defer cancel()
+
+		ticker := time.NewTicker(heartbeat)
+		defer ticker.Stop()
+
+		s.Ok()
+
+		for {
+			select {
+			case <-ticker.C:
+				s.Ok()
+			case <-ctx.Done():
+				return
+			}
+		}
+	}()
+
+	return cancel
+}
+
+// Pause allows an application to stall for a set period of time while still sending hearbeats.
+func (s *Service) Pause(ctx context.Context, deadline, heartbeat time.Duration) context.CancelFunc {
+	ctx, cancel := context.WithTimeout(ctx, deadline)
+
+	go func() {
+		defer cancel()
+
+		ticker := time.NewTicker(heartbeat)
+		defer ticker.Stop()
+
+		s.Ok()
+
+		for {
+			select {
+			case <-ticker.C:
+				s.Ok()
+			case <-ctx.Done():
+				return
+			}
+		}
+	}()
+
+	return cancel
+}


What's the use case for these? Might be unneeded complexity.
If the application has a long-running process, wouldn't you just adjust aged?

Currently not used. I don't know whether Mark had a use case for them, but I would keep them in case we need them later.

I think you could Pause() while you wait to receive a message, for example

CallumNZ · 2024-11-06T01:06:46Z

health/service.go

+	// aged is the time if no updates have happened indicates the service is no longer running.
+	// set to 0 if no age check needed
+	aged time.Duration


This value will need to be coordinated with the ECS health check frequency in terraform, is that right?

junghao · 2024-11-08T01:23:49Z

My suggestion for this:

In sqs.go:

type SQS struct {
	client   *sqs.Client
	Callback *func()
}

Then in (s *SQS) receiveMessage():

for {
	if s.Callback != nil {
		(*s.Callback)()
	}
	.....

Then, from the service, we do:

	sqsClient, err = sqs.New()
	if err != nil {
		log.Fatal(err)
	}

	callback := func() {
		log.Println("callback")
	}

	sqsClient.Callback = &callback

This way the callback is optional: it neither changes the signature of the main SQS function nor adds a new function to hook the callback.

junghao · 2024-11-08T04:47:25Z

aws/sqs/sqs.go

@@ -134,7 +137,12 @@ func (s *SQS) receiveMessage(ctx context.Context, input *sqs.ReceiveMessageInput

 		switch {
 		case r == nil || len(r.Messages) == 0:
-			// no message received
+			// No message received
+			retryCount++


Retrying this doesn't seem to help.

Whether or not it is retried, the caller has to deal with IsNoMessageError and continue the loop with the same sqsClient.receiveMessage() call.

e.g.

m, err := sqsClient.receiveMessage()
if _, ok := err.(*IsNoMessageError); ok {
	continue
}

Whether or not we retry inside the wrapper, the caller's code doesn't change. And retrying inside the wrapper makes the same number of API requests as looping from the caller.

So I suggest that we remove this retry and let the caller decide whether it wants to count the number of retries.

Retrying this doesn't seem to help.

Retrying just lets it wait longer for messages to arrive, with less interaction.

Say the message arrives after 99 calls to aws.receiveMessage.

The existing implementation takes one call to kit.receive, with no need to deal with empty retries.
I would say this is the reason we want a retry inside kit.

Removing the retry completely takes 99 calls to kit.receive, and in main we need to deal with empty retries.

Adding 3 retries takes 33 calls to kit.receive, and in main we need to deal with empty retries.

Reducing the number of calls to kit.receive doesn't save anything, but it adds the complexity of retrying inside kit.
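
The arithmetic in this exchange can be checked with a small simulation. It assumes, as the example does, that the message arrives on the 99th underlying receive, and it models the wrapper as returning a no-message error after a fixed number of empty receives. `queue`, `receive`, `callsUntilMessage`, and `errNoMessage` are invented for this sketch and are not the kit API.

```go
package main

import (
	"errors"
	"fmt"
)

var errNoMessage = errors.New("no message")

// queue simulates a queue whose only message shows up on the arriveAt-th raw receive.
type queue struct {
	raw      int // raw receives made so far
	arriveAt int
}

// receive makes up to maxEmpty raw receives and returns errNoMessage if all are empty.
func (q *queue) receive(maxEmpty int) (string, error) {
	for i := 0; i < maxEmpty; i++ {
		q.raw++
		if q.raw >= q.arriveAt {
			return "msg", nil
		}
	}
	return "", errNoMessage
}

// callsUntilMessage loops the way a caller would, and reports how many
// wrapper calls and raw receives were needed to get the message.
func callsUntilMessage(arriveAt, maxEmpty int) (calls, raw int) {
	q := &queue{arriveAt: arriveAt}
	for {
		calls++
		if _, err := q.receive(maxEmpty); errors.Is(err, errNoMessage) {
			continue // empty result: call the wrapper again
		}
		return calls, q.raw
	}
}

func main() {
	for _, maxEmpty := range []int{1, 3, 99} {
		calls, raw := callsUntilMessage(99, maxEmpty)
		fmt.Printf("maxEmpty=%d wrapper calls=%d raw receives=%d\n", maxEmpty, calls, raw)
	}
}
```

The number of raw receives is the same in every case; only the number of wrapper calls the caller has to loop over changes.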

bpeng · 2024-11-08T21:49:50Z

Callback
In sqs.go:

type SQS struct {
	client   *sqs.Client
	Callback *func()
}

Yes, this is a good option to solve the problem

junghao

Let's use this as a start.

CallumNZ

Looks good 👍

bpeng changed the title from "add health check package" to "feat: add health check package" Nov 4, 2024

bpeng requested a review from junghao November 4, 2024 03:15

bpeng force-pushed the health-check branch 4 times, most recently from 8767a00 to 56ad5fc Compare November 5, 2024 21:46

bpeng changed the title from "feat: add health check package" to "feat: add health check package and update SQS receive func" Nov 5, 2024

bpeng force-pushed the health-check branch from 56ad5fc to d61206e Compare November 5, 2024 22:19

bpeng requested a review from CallumNZ November 6, 2024 00:15

CallumNZ requested changes Nov 6, 2024

bpeng force-pushed the health-check branch from a3217d4 to ab20313 Compare November 6, 2024 21:24

bpeng force-pushed the health-check branch from ab20313 to d45900a Compare November 8, 2024 01:47

junghao reviewed Nov 8, 2024

bpeng force-pushed the health-check branch 6 times, most recently from d19647c to 3263b4f Compare November 12, 2024 20:40

junghao approved these changes Nov 12, 2024

bpeng added 4 commits November 13, 2024 11:57

add health check package for use in different projects

4688778

health check: add tests

b89eed0

changes per review: simplify code

c609cb0

feat!: update SQS receiveMessage to return no messages error when no messages received from the queue

8cd43a0

bpeng force-pushed the health-check branch from 3263b4f to 8cd43a0 Compare November 12, 2024 22:58

CallumNZ approved these changes Nov 12, 2024

sue-h-gns merged commit 9ad1489 into main Nov 13, 2024
7 checks passed

sue-h-gns deleted the health-check branch November 13, 2024 02:52
