Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[azservicebus] infrequently called SendMessage is very slow #17182

Closed
ryepup opened this issue Feb 25, 2022 · 4 comments · Fixed by #17205
Closed

[azservicebus] infrequently called SendMessage is very slow #17182

ryepup opened this issue Feb 25, 2022 · 4 comments · Fixed by #17205
Assignees
Labels
Client This issue points to a problem in the data-plane of the library. customer-reported Issues that are reported by GitHub users external to the Azure organization. question The issue doesn't require a change to the product in order to be resolved. Most issues start as that Service Bus

Comments

@ryepup
Copy link

ryepup commented Feb 25, 2022

Bug Report

  • import path of package in question: github.com/Azure/azure-sdk-for-go/sdk/messaging/azservicebus
  • SDK version: v0.3.5
  • go version: 1.17.6

What happened?

My application pauses for 3-5s when calling Sender.SendMessage after a period of idle time.

I enabled azsb logs and saw this sequence in my logs:

  1. azsb.Retry: (SendMessage) Attempt 0 returned retryable error: link detached, reason: *Error{Condition: amqp:link:detach-forced, Description: The link 'G5:19133466:fDtcYAzOxduFfIftHpSKdnD7KwUGBv8BwSNuZ49lxm8i8dzvEiFJGQ' is force detached. Code: publisher(link57166). Details: AmqpMessagePublisher.IdleTimerExpired: Idle timeout: 00:10:00., Info: map[]}
  2. azsb.Retry: (SendMessage) Attempt 1 sleeping for 3.319745863s
  3. azsb.Conn: Recovering link for error link detached, reason: *Error{Condition: amqp:link:detach-forced, Description: The link 'G5:19133466:fDtcYAzOxduFfIftHpSKdnD7KwUGBv8BwSNuZ49lxm8i8dzvEiFJGQ' is force detached. Code: publisher(link57166). Details: AmqpMessagePublisher.IdleTimerExpired: Idle timeout: 00:10:00., Info: map[]}
  4. azsb.Conn: Recovered links

The sleep varies, but is consistently between 3-5s.

What did you expect or want to happen?

I want "idle timeout" errors to skip this sleep:

The sleep will help with transient network errors, but is ineffective for idle timeouts.

How can we reproduce it?

Here's a main.go to show the timings. Update the consts at the top and go run main.go

package main

import (
	"context"
	"fmt"
	"os"
	"os/signal"
	"syscall"
	"time"

	azlog "github.com/Azure/azure-sdk-for-go/sdk/azcore/log"
	"github.com/Azure/azure-sdk-for-go/sdk/azidentity"
	"github.com/Azure/azure-sdk-for-go/sdk/messaging/azservicebus"
)

const (
	// TODO: fill in your values
	namespace    = "TODO.servicebus.windows.net"
	queueOrTopic = "TODO"
)

func log(format string, a ...interface{}) {
	now := time.Now().Format("04:05.00000")
	fmt.Printf("[%s]: ", now)
	fmt.Printf(format, a...)
	fmt.Println()
}

func main() {
	azlog.SetListener(func(e azlog.Event, s string) {
		log("%v %s", e, s)
	})
	azlog.SetEvents(
		// connection/reconnect related events)
		"azsb.Conn",
		// receiver specific events
		"azsb.Receiver",
		// management link related events
		"azsb.Mgmt",
		// retry related events
		"azsb.Retry",
	)

	ctx, stop := signal.NotifyContext(context.Background(), os.Interrupt, syscall.SIGTERM, syscall.SIGINT)
	defer stop()
	credential, err := azidentity.NewDefaultAzureCredential(nil)
	if err != nil {
		panic(err)
	}
	client, err := azservicebus.NewClient(namespace, credential, &azservicebus.ClientOptions{})
	if err != nil {
		panic(err)
	}
	sender, err := client.NewSender(queueOrTopic, &azservicebus.NewSenderOptions{})
	if err != nil {
		panic(err)
	}
	msg := &azservicebus.Message{
		Body: []byte("hello world"),
	}
	err = sender.SendMessage(ctx, msg)
	if err != nil {
		panic(err)
	}
	log("message sent, waiting for idle timeout")
	time.Sleep(11 * time.Minute)
	log("waited, sending again")
	start := time.Now()
	err = sender.SendMessage(ctx, msg)
	if err != nil {
		panic(err)
	}
	duration := time.Now().Sub(start)
	log("took %v to send the second message", duration)
}

On my machine this produced the following outpiut:

$ go run main.go
[43:49.58115]: message sent, waiting for idle timeout
[54:49.63371]: waited, sending again
[54:49.63391]: azsb.Retry (SendMessage) Attempt 0 returned retryable error: link detached, reason: *Error{Condition: amqp:link:detach-forced, Description: The link 'G28:5874838:AhTqtFXuW4nf6zQGix9AASCwRlhrnPxycGDiFvE8z7Bm_6GkecZSPQ' is force detached. Code: publisher(link27971). Details: AmqpMessagePublisher.IdleTimerExpired: Idle timeout: 00:10:00., Info: map[]}
[54:49.63392]: azsb.Retry (SendMessage) Attempt 1 sleeping for 4.409320575s
[54:54.04693]: azsb.Conn Recovering link for error link detached, reason: *Error{Condition: amqp:link:detach-forced, Description: The link 'G28:5874838:AhTqtFXuW4nf6zQGix9AASCwRlhrnPxycGDiFvE8z7Bm_6GkecZSPQ' is force detached. Code: publisher(link27971). Details: AmqpMessagePublisher.IdleTimerExpired: Idle timeout: 00:10:00., Info: map[]}
[54:55.98979]: azsb.Conn Recovered links
[54:56.06519]: took 6.431364643s to send the second message

Anything we should know about your environment.

I have low traffic, so frequently have long idle periods. I call SendMessage from HTTP handlers. I have one Client and one Sender for my application, shared by all HTTP handlers.

I recently switched to azservicebus, and my request durations went from 0-1s to 5-6s, with this sleep accounting for most of that time:

image

This extra time is causing further timeouts in my API consumers. I'm pursuing two workarounds: get SendMessage out of my HTTP handlers, and sending a "heartbeat" message every 9m to keep the connection open.

@ghost ghost added needs-triage Workflow: This is a new issue that needs to be triaged to the appropriate team. customer-reported Issues that are reported by GitHub users external to the Azure organization. question The issue doesn't require a change to the product in order to be resolved. Most issues start as that labels Feb 25, 2022
@ghost ghost removed the needs-triage Workflow: This is a new issue that needs to be triaged to the appropriate team. label Feb 28, 2022
@jhendrixMSFT jhendrixMSFT added Client This issue points to a problem in the data-plane of the library. needs-triage Workflow: This is a new issue that needs to be triaged to the appropriate team. labels Feb 28, 2022
@ghost ghost removed the needs-triage Workflow: This is a new issue that needs to be triaged to the appropriate team. label Feb 28, 2022
@richardpark-msft
Copy link
Member

Hi @ryepup, thank you for submitting this.

I'll take a look. The trick here is mostly in the azservicebus code understanding that the error might be "old" and thus a reconnect immediately would be okay. I'll be brainstorming some ideas on this.

@ryepup
Copy link
Author

ryepup commented Mar 2, 2022

That PR looks pretty good to me.

I got some more logs and I'm seeing one other error that probably doesn't need a sleep:

(SendMessage) Attempt 0 returned retryable error: link detached, reason: *Error(nil)

I think #17205 handles this nil error too.

richardpark-msft added a commit that referenced this issue Mar 2, 2022
Fixing the 'stale' sender by immediately attempting to recover the link.
    
Note, this changes the semantics of how ResetAttempts() works in that now it resets us all the way back to the 0th attempt but the old behavior wasn't used anywhere that matters. 

I've added in a stress test to show demo the scenario and to validate that our recovery after that is fast and doesn't have a sleep in it.

Fixes #17182
@richardpark-msft
Copy link
Member

That PR looks pretty good to me.

I got some more logs and I'm seeing one other error that probably doesn't need a sleep:

(SendMessage) Attempt 0 returned retryable error: link detached, reason: *Error(nil)

I think #17205 handles this nil error too.

Yes, those should be fine too - all detach errors are treated as immediately retryable (for a single iteration), just to guard against the potential stale link.

There are still some potential scenarios where a SendMessage() can be slow - if the connection has to be recreated, or on first call (since the call itself is lazily initialized). Still, it should work better for your scenario, at the very least, and hopefully helps everyone get more even performance when there are large idle periods between calls.

Once again, thanks for reporting this, we really appreciate it. This scenario has also been added to our internal stress tests as well: https://github.com/Azure/azure-sdk-for-go/blob/main/sdk/messaging/azservicebus/internal/stress/tests/idle_fast_reconnect.go

@ryepup
Copy link
Author

ryepup commented Mar 8, 2022

Thanks! I re-ran my test from above with v0.3.6, the numbers are looking better:

$ go run .
[14:12.53974]: send took 3.218310974s
[14:12.54000]: message sent, waiting for idle timeout
[25:12.56155]: waited, sending again
[25:12.56175]: azsb.Conn Link was previously detached. Attempting quick reconnect to recover from error: link detached, reason: *Error{Condition: amqp:link:detach-forced, Description: The link 'G7:37173493:Yzsn8jmXsb1qM_brSLavATtsg3_HY_fNAqPNh745u3H8eDuNS-JdnQ' is force detached. Code: publisher(link95224). Details: AmqpMessagePublisher.IdleTimerExpired: Idle timeout: 00:10:00., Info: map[]}
[25:12.56176]: azsb.Retry (SendMessage) Resetting attempts
[25:12.56177]: azsb.Retry (SendMessage) Attempt -1 returned retryable error: link detached, reason: *Error{Condition: amqp:link:detach-forced, Description: The link 'G7:37173493:Yzsn8jmXsb1qM_brSLavATtsg3_HY_fNAqPNh745u3H8eDuNS-JdnQ' is force detached. Code: publisher(link95224). Details: AmqpMessagePublisher.IdleTimerExpired: Idle timeout: 00:10:00., Info: map[]}
[25:12.56178]: azsb.Conn Recovering link for error link detached, reason: *Error{Condition: amqp:link:detach-forced, Description: The link 'G7:37173493:Yzsn8jmXsb1qM_brSLavATtsg3_HY_fNAqPNh745u3H8eDuNS-JdnQ' is force detached. Code: publisher(link95224). Details: AmqpMessagePublisher.IdleTimerExpired: Idle timeout: 00:10:00., Info: map[]}
[25:12.56327]: azsb.Conn Recovering link only
[25:14.57436]: azsb.Conn Recovered links
[25:14.66447]: send took 2.102870198s

@github-actions github-actions bot locked and limited conversation to collaborators Apr 11, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Client This issue points to a problem in the data-plane of the library. customer-reported Issues that are reported by GitHub users external to the Azure organization. question The issue doesn't require a change to the product in order to be resolved. Most issues start as that Service Bus
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants