
Buffer full warning while under load #4715

Open
virtualguy opened this issue Oct 6, 2021 · 6 comments
Assignees
adriansmares
Labels
bug (Something isn't working) · c/application server (This is related to the Application Server)

Comments

@virtualguy
Contributor

Summary

When under significant load, the application server reports dropping uplink packets because the buffer is full.

Steps to Reproduce

  1. Send a large volume of uplinks using the simulator, or subscribe to MQTT over a throttled connection

What do you see now?

WARN	Failed to publish message	{"device_uid": "halter.000-load49-00377", "error": "error:pkg/applicationserver/io:buffer_full (buffer is full)", "grpc.method": "HandleUplink", "grpc.service": "ttn.lorawan.v3.NsAs", "namespace": "applicationserver/distribution", "protocol": "applicationpackages", "request_id": "01FHAKXVRR3NECCP0D75DNBTHQ"}

 

What do you want to see instead?

No failed publish logs
...

Environment

TTS v3.14.2

How do you propose to implement this?

Is there any way to tune or increase the MQTT buffers?

...

How do you propose to test this?

Happy to run test branches on our load simulator

Can you do this yourself and submit a Pull Request?

Need guidance

github-actions bot added the needs/triage (We still need to triage this) label Oct 6, 2021
@virtualguy
Contributor Author

I wonder if the issue is in the underlying MQTT server; perhaps the buffer size needs to be larger or configurable to handle higher packet volumes, or high latency between the AS and the subscribed client.

See: https://github.com/TheThingsIndustries/mystique/blob/24778eddf8e34b7da1abc83bc14011285cbb4a2b/pkg/session/session.go#L24
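
For illustration, this is roughly the pattern behind the error (a minimal sketch only, not the actual mystique or TTS code; all names here are made up): messages are pushed into a fixed-size buffered channel with a non-blocking send, and anything that does not fit is dropped.

```go
// Minimal sketch of a "buffer full" drop: a publisher writes into a
// fixed-size buffered channel and never blocks, so a slow subscriber
// causes messages to be dropped once the channel is full.
package main

import (
	"errors"
	"fmt"
)

var errBufferFull = errors.New("buffer is full")

// publisher is a hypothetical stand-in for a per-subscriber session.
type publisher struct {
	buf chan string
}

// newPublisher takes the buffer size as a parameter; in mystique the size
// is currently a package-level constant, which is why it cannot be tuned
// at runtime.
func newPublisher(size int) *publisher {
	return &publisher{buf: make(chan string, size)}
}

// Publish never blocks the producer: if the subscriber has fallen behind
// and the channel is full, the message is dropped and an error is returned.
func (p *publisher) Publish(msg string) error {
	select {
	case p.buf <- msg:
		return nil
	default:
		return errBufferFull
	}
}

func main() {
	p := newPublisher(2)
	for i := 0; i < 4; i++ {
		if err := p.Publish(fmt.Sprintf("uplink %d", i)); err != nil {
			fmt.Println("dropped:", err) // the third and fourth sends are dropped
		}
	}
}
```

With a fixed capacity like this, a slow or high-latency subscriber fills the buffer and every further publish is dropped until it drains.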

adriansmares self-assigned this Oct 12, 2021
adriansmares added the bug (Something isn't working), c/application server (This is related to the Application Server) and needs/triage (We still need to triage this) labels and removed the needs/triage label Oct 12, 2021
@adriansmares
Contributor

The "protocol": "applicationpackages" related buffer_full errors should have been reduced heavily during 3.15.0/3.15.1 - we've had quite a lot of related work such as #4607, #4609, #4680. You will still probably want to tune up the number of workers, and please let me know if you still see them occur (they will be called pool_full now, and you can track the pools using metrics)

On the "protocol":"mqtt" side I would just recommend against using MQTT with high traffic volumes in general. More buffering won't make the connection faster if the subscriber cannot keep up with the publisher.

adriansmares removed the needs/triage label Oct 19, 2021
@virtualguy
Contributor Author

virtualguy commented Oct 22, 2021

This is still occurring on 3.15.1. I have manually set the buffer size in mystique to 1024, and that resolved the majority of the buffer full messages. I'd appreciate it if you have time to look at our metrics (attached), as we still get 'work dropped'. I can ship you a bunch of logs via email if needed.

We are running two dedicated subscribers on Fargate in the same region as the EC2 instance, so they should be able to keep up. What would you recommend in terms of worker counts?

We did have some issues with the HTTP integration locking up on us some time ago, hence the move to MQTT.

[attached screenshot: worker pool metrics]

@adriansmares
Contributor

adriansmares commented Oct 26, 2021

I've checked our internal metrics and we don't really see _fanout drops. I'd recommend increasing --as.packages.workers, --as.webhooks.queue-size and --as.webhooks.workers. Feel free to go into the thousands - the workers are spawned and kept alive only if they are needed.
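
For example (the values below are purely illustrative, and I'm assuming here that you start the stack via the ttn-lw-stack CLI; the equivalent keys can also go in your configuration):

```sh
ttn-lw-stack start \
  --as.packages.workers=1024 \
  --as.webhooks.workers=1024 \
  --as.webhooks.queue-size=4096
```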

Regarding webhook deadlocks - from what we know, it is this infamous beauty: golang/go#32388, whose fix still needs to be backported (golang/go#48650, golang/go#48649).

#4790 will allow you to be more aggressive with how traffic is dropped for different subscribers (such as MQTT), but I consider it to be rather abnormal that you see worker pool drops - we've never experienced work being dropped for webhooks_fanout.

Edit: Can you plot ttn_lw_workerpool_work_queue_size and ttn_lw_workerpool_workers_started - ttn_lw_workerpool_workers_stopped (essentially the worker count)? Is the queue size consistently > 0?
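
If you scrape these with Prometheus, the two plots would be along these lines (exact aggregation labels depend on your setup):

```
# queue backlog per worker pool
ttn_lw_workerpool_work_queue_size

# approximate live worker count (started minus stopped)
ttn_lw_workerpool_workers_started - ttn_lw_workerpool_workers_stopped
```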

@virtualguy
Contributor Author

@adriansmares things have definitely been running better, but we hit performance issues again yesterday and found that I had set the UDP handlers to 32 (double the old default, but significantly less than the new 1024..). Fixing that resolved the work dropped by the UDP handler, but there is a new spike on upstream_handlers. It looks like that pool is capped at 32 workers; based on the attached chart, should we try bumping that to 1024 too?
[attached screenshot: upstream_handlers worker pool metrics]

@adriansmares
Contributor

adriansmares commented Jan 28, 2022

The 32-worker limit is on a per-gateway basis - each gateway has its own worker pool of 32 workers for submission to the upstream (which in this case is the Network Server - this is what the _cluster signifies).

Do you have an actual gateway that could see so many packets? Is this some bridge that actually backs multiple gateways at once, but is represented as a single gateway in TTS? Is it always the same gateway?

For reference, we don't see any drops whatsoever on that pool in any of our deployments - they occur when we restart the Network Server and the peer is not available, but we don't see them at steady state.

You may increase the 32 limit, but it is very abnormal that one gateway can produce so much traffic - this may be a sign that the Network Server cannot keep up with the traffic and perhaps should be scaled up.
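
For reference, the shape of such a per-gateway pool is roughly the following (a simplified sketch, not the actual TTS worker pool implementation; the names, queue size and idle/shutdown behaviour are all made up): work is queued without blocking, workers are spawned on demand up to the cap, and anything that does not fit in the queue counts as dropped work.

```go
// Simplified sketch of a bounded worker pool with on-demand workers and
// drop-on-full semantics, which is what the "work dropped" metric counts.
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

type pool struct {
	dropped    int64 // number of submissions dropped because the queue was full
	queue      chan func()
	maxWorkers int32
	workers    int32
	wg         sync.WaitGroup
}

func newPool(maxWorkers, queueSize int) *pool {
	return &pool{
		queue:      make(chan func(), queueSize),
		maxWorkers: int32(maxWorkers),
	}
}

// Submit enqueues work without blocking; if the queue is full, the work is
// dropped. A new worker is spawned only while we are below the cap.
func (p *pool) Submit(work func()) bool {
	select {
	case p.queue <- work:
	default:
		atomic.AddInt64(&p.dropped, 1)
		return false
	}
	if n := atomic.LoadInt32(&p.workers); n < p.maxWorkers &&
		atomic.CompareAndSwapInt32(&p.workers, n, n+1) {
		p.wg.Add(1)
		go p.worker()
	}
	return true
}

func (p *pool) worker() {
	defer p.wg.Done()
	for work := range p.queue {
		work()
	}
}

func (p *pool) close() {
	close(p.queue)
	p.wg.Wait()
}

func main() {
	p := newPool(32, 64) // e.g. a cap of 32 workers, as in the per-gateway pool above
	var done int64
	for i := 0; i < 1000; i++ {
		p.Submit(func() { atomic.AddInt64(&done, 1) })
	}
	p.close()
	fmt.Printf("processed=%d dropped=%d\n", done, atomic.LoadInt64(&p.dropped))
}
```

The point of the bounded queue is backpressure: raising the cap hides the symptom, but sustained drops mean the downstream component (here the Network Server) is not keeping up.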
