
Buffer full warning while under load #4715

Open
virtualguy opened this issue Oct 6, 2021 · 6 comments
Assignees
adriansmares
Labels
bug (Something isn't working) · c/application server (This is related to the Application Server)

Comments

@virtualguy
Contributor

Summary

When under significant load, the application server reports dropping uplink packets because the buffer is full.

Steps to Reproduce

  1. Send a large volume of uplinks using the simulator, or subscribe to MQTT over a throttled connection

What do you see now?

WARN	Failed to publish message	{"device_uid": "halter.000-load49-00377", "error": "error:pkg/applicationserver/io:buffer_full (buffer is full)", "grpc.method": "HandleUplink", "grpc.service": "ttn.lorawan.v3.NsAs", "namespace": "applicationserver/distribution", "protocol": "applicationpackages", "request_id": "01FHAKXVRR3NECCP0D75DNBTHQ"}

 

What do you want to see instead?

No failed publish logs
...

Environment

TTS v3.14.2

How do you propose to implement this?

Is there any way to tune or increase the MQTT buffers?

...

How do you propose to test this?

Happy to run test branches on our load simulator

Can you do this yourself and submit a Pull Request?

Need guidance

github-actions bot added the needs/triage (We still need to triage this) label Oct 6, 2021
@virtualguy
Contributor Author

I wonder if the issue is in the underlying MQTT server; perhaps the buffer size needs to be larger or configurable to handle higher packet volumes, or high latency between the AS and the subscribed client.

See: https://github.com/TheThingsIndustries/mystique/blob/24778eddf8e34b7da1abc83bc14011285cbb4a2b/pkg/session/session.go#L24
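
For illustration, this is roughly the pattern behind the error (a minimal sketch only, not the actual mystique or TTS code; all names here are made up): messages are pushed into a fixed-size buffered channel with a non-blocking send, and anything that does not fit is dropped.

```go
// Minimal sketch of a "buffer full" drop: a publisher writes into a
// fixed-size buffered channel and never blocks, so a slow subscriber
// causes messages to be dropped once the channel is full.
package main

import (
	"errors"
	"fmt"
)

var errBufferFull = errors.New("buffer is full")

// publisher is a hypothetical stand-in for a per-subscriber session.
type publisher struct {
	buf chan string
}

// newPublisher takes the buffer size as a parameter; in mystique the size
// is currently a package-level constant, which is why it cannot be tuned
// at runtime.
func newPublisher(size int) *publisher {
	return &publisher{buf: make(chan string, size)}
}

// Publish never blocks the producer: if the subscriber has fallen behind
// and the channel is full, the message is dropped and an error is returned.
func (p *publisher) Publish(msg string) error {
	select {
	case p.buf <- msg:
		return nil
	default:
		return errBufferFull
	}
}

func main() {
	p := newPublisher(2)
	for i := 0; i < 4; i++ {
		if err := p.Publish(fmt.Sprintf("uplink %d", i)); err != nil {
			fmt.Println("dropped:", err) // the third and fourth sends are dropped
		}
	}
}
```

With a fixed capacity like this, a slow or high-latency subscriber fills the buffer and every further publish is dropped until it drains.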

adriansmares self-assigned this Oct 12, 2021
adriansmares added the bug (Something isn't working), c/application server (This is related to the Application Server) and needs/triage (We still need to triage this) labels and removed the needs/triage label Oct 12, 2021
@adriansmares
Contributor

The "protocol": "applicationpackages" related buffer_full errors should have been reduced heavily during 3.15.0/3.15.1 - we've had quite a lot of related work such as #4607, #4609, #4680. You will still probably want to tune up the number of workers, and please let me know if you still see them occur (they will be called pool_full now, and you can track the pools using metrics)

On the "protocol":"mqtt" side I would just recommend against using MQTT with high traffic volumes in general. More buffering won't make the connection faster if the subscriber cannot keep up with the publisher.

adriansmares removed the needs/triage label Oct 19, 2021
@virtualguy
Contributor Author

virtualguy commented Oct 22, 2021

This is still occurring on 3.15.1. I have manually set the buffer size in mystique to 1024, and that resolved the majority of the buffer full messages. I'd appreciate it if you have time to look at our metrics (attached), as we still get 'work dropped'. I can ship you a bunch of logs via email if needed.

We are running two dedicated subscribers on Fargate in the same region as the EC2 instance, so they should be able to keep up. What would you recommend in terms of worker counts?

We did have some issues with the HTTP integration locking up on us some time ago, hence the move to MQTT.

[attached screenshot: worker pool metrics]

@adriansmares
Contributor

adriansmares commented Oct 26, 2021

I've checked our internal metrics and we don't really see _fanout drops. I'd recommend increasing --as.packages.workers, --as.webhooks.queue-size and --as.webhooks.workers. Feel free to go into the thousands - the workers are spawned and kept alive only if they are needed.
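
For example (the values below are purely illustrative, and I'm assuming here that you start the stack via the ttn-lw-stack CLI; the equivalent keys can also go in your configuration):

```sh
ttn-lw-stack start \
  --as.packages.workers=1024 \
  --as.webhooks.workers=1024 \
  --as.webhooks.queue-size=4096
```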

Regarding webhook deadlocks - from what we know, it is this infamous beauty: golang/go#32388, whose fix still needs to be backported (golang/go#48650, golang/go#48649).

#4790 will allow you to be more aggressive with how traffic is dropped for different subscribers (such as MQTT), but I consider it to be rather abnormal that you see worker pool drops - we've never experienced work being dropped for webhooks_fanout.

Edit: Can you plot ttn_lw_workerpool_work_queue_size and ttn_lw_workerpool_workers_started - ttn_lw_workerpool_workers_stopped (essentially the worker count)? Is the queue size consistently > 0?
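
If you scrape these with Prometheus, the two plots would be along these lines (exact aggregation labels depend on your setup):

```
# queue backlog per worker pool
ttn_lw_workerpool_work_queue_size

# approximate live worker count (started minus stopped)
ttn_lw_workerpool_workers_started - ttn_lw_workerpool_workers_stopped
```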

@virtualguy
Contributor Author

@adriansmares things have definitely been running better, but we hit performance issues again yesterday and found that I had set the UDP handlers to 32 (double the old default, but significantly less than the new 1024..). Fixing that resolved the work dropped by the UDP handler, but there is a new spike on upstream_handlers. It looks like that pool is capped at 32 workers; based on the attached chart, should we try bumping that to 1024 too?
[attached screenshot: upstream_handlers worker pool metrics]

@adriansmares
Contributor

adriansmares commented Jan 28, 2022

The 32-worker limit is on a per-gateway basis - each gateway has its own worker pool of 32 workers for submission to the upstream (which in this case is the Network Server - this is what the _cluster signifies).

Do you have an actual gateway that could see so many packets? Is this some bridge that actually backs multiple gateways at once, but is represented as a single gateway in TTS? Is it always the same gateway?

For reference, we don't see any drops whatsoever on that pool in any of our deployments - they occur when we restart the Network Server and the peer is not available, but we don't see them at steady state.

You may increase the 32 limit, but it is very abnormal that one gateway can produce so much traffic - this may be a sign that the Network Server cannot keep up with the traffic and perhaps should be scaled up.
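
For reference, the shape of such a per-gateway pool is roughly the following (a simplified sketch, not the actual TTS worker pool implementation; the names, queue size and idle/shutdown behaviour are all made up): work is queued without blocking, workers are spawned on demand up to the cap, and anything that does not fit in the queue counts as dropped work.

```go
// Simplified sketch of a bounded worker pool with on-demand workers and
// drop-on-full semantics, which is what the "work dropped" metric counts.
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

type pool struct {
	dropped    int64 // number of submissions dropped because the queue was full
	queue      chan func()
	maxWorkers int32
	workers    int32
	wg         sync.WaitGroup
}

func newPool(maxWorkers, queueSize int) *pool {
	return &pool{
		queue:      make(chan func(), queueSize),
		maxWorkers: int32(maxWorkers),
	}
}

// Submit enqueues work without blocking; if the queue is full, the work is
// dropped. A new worker is spawned only while we are below the cap.
func (p *pool) Submit(work func()) bool {
	select {
	case p.queue <- work:
	default:
		atomic.AddInt64(&p.dropped, 1)
		return false
	}
	if n := atomic.LoadInt32(&p.workers); n < p.maxWorkers &&
		atomic.CompareAndSwapInt32(&p.workers, n, n+1) {
		p.wg.Add(1)
		go p.worker()
	}
	return true
}

func (p *pool) worker() {
	defer p.wg.Done()
	for work := range p.queue {
		work()
	}
}

func (p *pool) close() {
	close(p.queue)
	p.wg.Wait()
}

func main() {
	p := newPool(32, 64) // e.g. a cap of 32 workers, as in the per-gateway pool above
	var done int64
	for i := 0; i < 1000; i++ {
		p.Submit(func() { atomic.AddInt64(&done, 1) })
	}
	p.close()
	fmt.Printf("processed=%d dropped=%d\n", done, atomic.LoadInt64(&p.dropped))
}
```

The point of the bounded queue is backpressure: raising the cap hides the symptom, but sustained drops mean the downstream component (here the Network Server) is not keeping up.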
