Add fq_codel network packet scheduler algorithm by default #2203

Merged
merged 1 commit into from
Oct 26, 2022

Conversation

agners
Member

@agners agners commented Oct 26, 2022

The fq_codel network scheduler is the de-facto standard nowadays in most distros. Systemd enables the scheduler by default if available. Make sure all boards have the necessary kernel module activated.
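One way to check whether a board's kernel configuration provides fq_codel is to look for the `CONFIG_NET_SCH_FQ_CODEL` option. The following is a minimal sketch in Python; the sample config text and the helper name `qdisc_config_state` are made up for illustration and are not part of this PR.

```python
def qdisc_config_state(config_text, option="CONFIG_NET_SCH_FQ_CODEL"):
    """Return the value of a kernel config option ('y' or 'm'), or None if unset."""
    for line in config_text.splitlines():
        line = line.strip()
        # Match "OPTION=" exactly so that longer option names sharing the
        # same prefix are not picked up by accident.
        if line.startswith(option + "="):
            return line.split("=", 1)[1]
    return None


# Hypothetical excerpt of a kernel .config, for illustration only.
sample = """
CONFIG_NET_SCHED=y
CONFIG_NET_SCH_FQ_CODEL=m
# CONFIG_NET_SCH_CAKE is not set
"""

print(qdisc_config_state(sample))
print(qdisc_config_state(sample, "CONFIG_NET_SCH_CAKE"))
```

A value of `m` means the scheduler is built as a loadable module, `y` means it is built in, and `None` means the option is not enabled in that config.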

@unclehack

These changes have worked for my build of the ova target. The PR looks good.

@agners agners merged commit 97e9a03 into home-assistant:dev Oct 26, 2022
jens-maus added a commit to jens-maus/RaspberryMatic that referenced this pull request Oct 28, 2022
standard in modern linux distros these days and should have better
scheduling capabilities than the previous pfifo_fast scheduler.
(cf. home-assistant/operating-system#2203)
@agners agners deleted the add-fq-codel-by-default branch November 30, 2022 09:38
agners added a commit that referenced this pull request Nov 30, 2022
The fq_codel network scheduler is the de-facto standard nowadays in most
distros. Systemd enables the scheduler by default if available. Make
sure all boards have the necessary kernel module activated.
@RubenKelevra
Contributor

It kinda makes sense to use sch_cake instead, as it is not limited to making individual connections fair. Sch_cake can instead be instructed to group first the destination hosts, allowing fair bandwidth between them and then group the connections of each host and make those fair between each other.

This way, hosts that open multiple connections for a transfer are not given more bandwidth than hosts that open just a single connection.
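To make the difference concrete, here is a small idealized model in Python. It assumes every flow is greedy and the link is the only bottleneck; it is not cake's actual algorithm, and the host names and the 90 Mbit link are invented for the example.

```python
def per_flow_shares(flows_per_host, link_mbit):
    """Pure per-flow fairness: each flow gets an equal slice of the link."""
    total_flows = sum(flows_per_host.values())
    return {host: link_mbit * n / total_flows
            for host, n in flows_per_host.items()}


def per_host_shares(flows_per_host, link_mbit):
    """Host-first fairness: active hosts split the link equally, and each
    host's flows then share that host's slice."""
    active = [host for host, n in flows_per_host.items() if n > 0]
    return {host: (link_mbit / len(active) if n > 0 else 0.0)
            for host, n in flows_per_host.items()}


# Hypothetical scenario: one host opens 8 connections, another just 1.
flows = {"nas": 8, "phone": 1}
print(per_flow_shares(flows, 90))
print(per_host_shares(flows, 90))
```

Under pure per-flow fairness the host with eight connections ends up with eight ninths of the link; with hosts grouped first, both hosts get the same share regardless of how many connections they open.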

@agners
Member Author

agners commented Dec 15, 2022

Not against changing the scheduler algorithm, but I'd prefer if we pick one that at least one major distribution is using by default, to have some real-world testing... 🤷‍♂️

It seems that even OpenWrt still defaults to fq_codel?

@unclehack

fq_codel is enough for HA OS. sch_cake is likely to be far too heavy for smaller and slower systems.

@RubenKelevra
Contributor

RubenKelevra commented Dec 15, 2022

@agners wrote:

It seems that even OpenWrt still defaults to fq_codel?

OpenWRT defaults follow a minimal-flash-usage approach, so there is no GUI for shaping by default. fq_codel is a good choice there, since on the supported hardware there is mostly no queuing going on at Linux's network-device level:

  • LAN-connections are usually handled by the built-in hardware switch – no fair queuing possible on them.
  • Wifi drivers implement their own queuing approaches to combine a lot of small Ethernet packets into large wifi transmissions. Some already do fair queuing based on airtime usage instead (see e.g. Toke's patch for the Ath9k to implement that). But if queuing on the Linux device occurs, sch_cake could still be useful here.
  • Internet connections are usually made via Ethernet to an external modem, so the Ethernet device Linux sees won't run its queue full. Instead, the external modem will buffer and drop, as will the provider-side modem. Here, software shaping has to be set up by the user.

OpenWRT's GUI module to help users configure shaping sets up sch_cake (see the docs).

So in situations where fair queuing makes sense, OpenWRT uses sch_cake by default.

@agners wrote:

Not against changing the scheduler algorithm, but I'd prefer if we pick one that at least one major distribution is using by default, to have some real-world testing... 🤷‍♂️

Distributions will probably never default to sch_cake, as it does 'split-gso' by default. This setting prevents the Ethernet driver from handing jumbo packets to the hardware and instead gives individual Ethernet packets to the hardware. This works fine up to 1GE hardware, but sometimes has trouble with 10GE hardware. The manpage for sch_cake states this too: 'At link speeds higher than 10 Gbps, setting the no-split-gso parameter can increase the maximum achievable throughput by retaining the full GSO packets.'

But without deactivating hardware offloading, fair queuing can't work as expected – that's why sch_cake turns it off by default. fq_codel, on the other hand, has no such option and does not touch it – so without manual user intervention to turn hardware offloading off, the result is suboptimal.

@RubenKelevra
Contributor

@unclehack wrote:

fq_codel is enough for HA OS. sch_cake is likely to be far too heavy for smaller and slower systems.

I'm using sch_cake for shaping my internet connection on my TP-Link C2600, which has a Qualcomm Atheros IPQ8064@1.4 GHz.

It runs DHCP, DNS and sch_cake. Load is 0.06, 0.09, 0.09.

sch_cake is really lightweight, there's hardly any difference to fq_codel and IME it's lighter than htb+fq_codel if you need shaping below the link-bandwidth.

@RubenKelevra
Contributor

If you choose to use sch_cake it makes sense to modify the default settings:

Ethernet

I would recommend these settings: 'regional diffserv8 ethernet nat'.

  • regional because there are probably WLAN clients on the local network as well as clients on the internet – the RTT assumed by the default setting 'internet' is likely too high.
  • diffserv8 to allow for more precise managing of QoS levels than the default.
  • ethernet for the normal Ethernet overhead/mpu.
  • nat because you're doing NAT between the add-ons and the network card.
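For context on the `regional` versus `internet` choice: these keywords are shorthand for an assumed round-trip time. The sketch below lists the presets as I understand them from the tc-cake(8) man page and picks the smallest preset that still covers an expected RTT. The helper name and the selection rule are illustrative assumptions, not something cake does itself.

```python
# RTT presets in milliseconds, as documented in tc-cake(8) to the best of
# my knowledge; "internet" is the default.
CAKE_RTT_PRESETS_MS = {
    "datacentre": 0.1,
    "lan": 1,
    "metro": 10,
    "regional": 30,
    "internet": 100,
    "oceanic": 300,
    "satellite": 1000,
}


def smallest_preset_covering(expected_rtt_ms):
    """Pick the preset with the smallest RTT that is >= the expected RTT.

    Hypothetical helper for illustration; falls back to the largest preset.
    """
    by_rtt = sorted(CAKE_RTT_PRESETS_MS.items(), key=lambda kv: kv[1])
    for name, rtt_ms in by_rtt:
        if rtt_ms >= expected_rtt_ms:
            return name
    return by_rtt[-1][0]


print(smallest_preset_covering(25))
print(smallest_preset_covering(100))
```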

Wifi

On Wifi cards using sch_cake makes even more sense, as the bandwidth to the AP may be the limiting factor on connections. Settings from Ethernet apply here as well.

Virtual interfaces

The virtual interfaces going to the add-ons do not need any kind of special handling, as the bandwidth is much higher than the hardware links on connections.

Btw: Diffserv/QoS

It makes sense to set a higher tag on traffic by the HTTP server serving the HA GUI, to get priority by sch_cake and other hardware on the way of the connection – like the user's router. CS4 would make sense here I guess.
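As a rough illustration of what tagging traffic with CS4 could look like at the socket level, here is a Python sketch. It only covers IPv4 sockets (IPv6 would need `IPV6_TCLASS` instead), the helper names are made up, and it says nothing about how the HA web server is actually structured.

```python
import socket

DSCP_CS4 = 32  # class selector 4, binary 100000


def dscp_to_tos(dscp):
    """Convert a 6-bit DSCP value to the 8-bit TOS byte (DSCP sits in the upper 6 bits)."""
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP must be in the range 0..63")
    return dscp << 2


def mark_socket(sock, dscp):
    """Ask the kernel to tag outgoing IPv4 packets of this socket with the given DSCP."""
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, dscp_to_tos(dscp))


print(hex(dscp_to_tos(DSCP_CS4)))  # prints 0x80
```

A server would call `mark_socket(conn, DSCP_CS4)` on each accepted connection. Whether routers along the path honour the marking is outside the application's control.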

@dtaht

dtaht commented Dec 16, 2022

thx for summoning me!

A lot of the advice we've given out about cake is old - we developed it starting in 2014... on a 600MHz single core mips processor which could barely crack 100Mbits, shaped, and these days people are regularly pumping 20Gbit or more through 10k instances of it (see libreqos.io for a shining example). The luci-app-sqm tool (and the linux-compatible sqm-scripts) are now very overcomplicated if you just want to run cake, but it's established, and thus that's what we use. I wish we'd expose the "dangerous options" now in openwrt at least - as they aren't dangerous anymore, and make the default always be nat, since that's the most common problem we see in the field.

fq_codel and cake are very lightweight when used at line rate, with BQL-enabled ethernet, which is nearly all ethernet cards today. I run cake on everything. Yes, the default gso-splitting mechanism becomes a throughput limitation at 10Gbit but even with 2.5Gbit it's keeping up on things like the R6S, and I care more about low latency, all the time, than anything else. 42 packet GSO burps from IW10 enabled TCPs (nearly all of them) invoke a lot of jitter.

One of my sadnesses is we've never made cake multicore, using it with sch_mq (which is increasingly the top level default), means you get X instances of it, when just one cakemq (especially while shaping) would be more effective.

I have been trying to get folk to standardize on diffserv4, as that treats the diffserv bits as compatibly as all the semi-conflicting diffserv standards treat them. Notably it's the closest thing we have to how wifi maps things, as well as zoom's webrtc recommendations. I wish we'd not made the diffserv8 setting available at all, as the mechanisms have been deprecated since 2003 (tho still common). diffserv3 is still an ok default. Please don't use diffserv8.

No, you shouldn't use cake on the wifi unless it's bloated, and possibly not even then. Multiple wifi chips today already have a native implementation of fq_codel in them (look for an aqm file in /sys/kernel/debug/iee*/phy*/aqm), and the principal intent of the codel algorithm was to dynamically adjust the buffering to the bandwidth, which varies a lot based on the distance from the AP. See the CDF plot here: https://blog.cerowrt.org/post/real_results/ or the paper here: https://www.cs.kau.se/tohojo/airtime-fairness/ - I urge everyone who thinks that "shaping wifi" is an answer to read that...
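The debugfs lookup mentioned above can be scripted. This is a small Python sketch that globs for the aqm files using the same pattern; the function name and the `debugfs_root` parameter are assumptions for illustration, and on a real system debugfs has to be mounted and is usually readable by root only.

```python
import glob
import os


def find_aqm_files(debugfs_root="/sys/kernel/debug"):
    """Return the aqm debugfs files of wifi PHYs, if any are exposed."""
    pattern = os.path.join(debugfs_root, "iee*", "phy*", "aqm")
    return sorted(glob.glob(pattern))


matches = find_aqm_files()
if matches:
    for path in matches:
        print("driver exposes AQM state:", path)
else:
    print("no aqm file found (none exposed, debugfs not mounted, or no permission)")
```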

(It's very frustrating that folk want to shape wifi, which, with movement of a millimeter or two, can have a 10x different bandwidth/buffering issue). Someday more of cake will move into the wifi drivers...

I wish we'd left nat on as the default in cake. Ideally it should "just figure it out", but leaving it on eliminates a major mistake for anyone with nat, in that it makes the per host/per flow fq "just work" for both ipv4 and ipv6. The overhead of having the nat option "on" on a non-natted machine is below 2%.

As for the regional setting vs the internet setting... I don't know. I think we should have had a "continental" setting, closer to about 70ms. I see folk using regional especially when shaping in front of vpns.

So in short, my default would be

  1. sysctl -w net.core.default_qdisc=cake

Which will apply it on all ethernet interfaces, mq or not, at line rate, and you won't notice it's there on most hardware.

Cake works exactly the same shaped, or unshaped.

  2. somehow apply the nat option if natting on that interface, as well as diffserv4 throughout.

  3. And I'd make some tool available, be it luci-app-sqm, to shape traffic on an interface where needed.
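The first two steps could be written down as a list of commands. The Python sketch below only builds and prints them; it does not execute anything. The interface names are placeholders, the helper is hypothetical, and replacing the root qdisc on a multi-queue device changes its queue layout, so treat this as a starting point rather than a recipe.

```python
import shlex


def cake_default_plan(ifaces, diffserv="diffserv4"):
    """Build the commands for a cake-by-default setup.

    ifaces maps an interface name to True if NAT happens on it.
    """
    commands = [["sysctl", "-w", "net.core.default_qdisc=cake"]]
    for iface, does_nat in ifaces.items():
        cmd = ["tc", "qdisc", "replace", "dev", iface, "root", "cake", diffserv]
        if does_nat:
            cmd.append("nat")
        commands.append(cmd)
    return commands


# Placeholder interface names; adjust to the actual system.
for command in cake_default_plan({"eth0": True, "eth1": False}):
    print(shlex.join(command))
```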

I actually wouldn't mark your webserver traffic at all except for things that really needed low latency (voip, gaming), or should run in the background (the LE codepoint). See also, qosify.

I wish we'd made this less complicated over time, but I hope this helps.

@dtaht

dtaht commented Dec 16, 2022

But without deactivating hardware offloading, fair queuing can't work as expected – that's why sch_cake turns it off by default. fq_codel, on the other hand, has no such option and does not touch it – so without manual user intervention to turn hardware offloading off, the result is suboptimal.

To clarify: cake or fq_codel at line rate do NOT turn off hardware offloading, so it is safe to use those at line rate, even though it tends to not be as effective, as I just tried to unpack, above.

GSO-splitting is a response to misguided (I'm opinionated) attempts by ethernet device makers to make single threaded iperf benchmarks look better, by bulking up packets into a GRO superpacket... and has nothing to do with other hardware offloads, so it's safe to have on all the time. In addition to doing better FQ, it also makes the "codel" portion of the aqm algorithm work better, as it was designed to work against single packets not (up to) 42... I often wish we'd put gso-splitting into fq_codel also (I actually have a version that does that, but never submitted it)! Watching GSO "burp" the latency, especially below 300Mbits, is no fun...

Anyway, another big motivation for GSO/GRO/TSO has faded, in that it was also developed to compensate for routing table lookups in linux being so slow prior to linux 4.2 or so: a single GRO packet only needs one routing table lookup. By the time it hits cake, that routing table lookup is already done...

Also, in the real world, on real traffic, GSO superpackets are rarely seen, and honestly if we could just rip out all that extra code doing that work, the devices would get faster in the first place... (not the case for TSO, where the ethernet card does the work), at least in the sub 2.5Gbit markets.

the sqm-scripts and luci-app-sqm attempt to turn hardware offloads off (and don't always succeed) when creating an instance of cake, shaped.
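For readers who want to experiment with offloads by hand, this Python sketch builds an `ethtool -K` command line that switches a set of offload features off. It is not taken from sqm-scripts; the feature list and the interface name are assumptions, and the command is only printed, not run.

```python
import shlex


def offload_off_command(iface, features=("tso", "gso", "gro")):
    """Build an ethtool command that disables the given offload features."""
    cmd = ["ethtool", "-K", iface]
    for feature in features:
        cmd += [feature, "off"]
    return cmd


# Placeholder interface name; running the result requires root.
print(shlex.join(offload_off_command("eth0")))
```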

@RubenKelevra
Contributor

RubenKelevra commented Dec 16, 2022

Hey @dtaht,

thanks for your recommendations!

I actually wouldn't mark your webserver traffic at all except for things that really needed low latency (voip, gaming), or should run in the background (the LE codepoint). See also, qosify.

Well, it kinda is:

Webserver here ships the web interface once and then pushes updates in real time to it. The latency here is important to make it feel snappy. It should also get priority bandwidth wise over other things which may run in the background.

As an example, things running here on my setup which do use upload bandwidth:

  • VPN-server sending stuff to clients out
  • ftp server sending large files
  • Backups of HA itself sent out to Google Drive

@dtaht

dtaht commented Dec 16, 2022

In general I prefer deprioritizing to prioritizing. I'd deprioritize the latter two using the LE codepoint. As for the vpn server, it depends on the vpn. Kernel wireguard and ipsec "do the right things" with cake or fq_codel in the loop. userspace vpn does not (which is why I keep seeing folk put yet another cake shaped instance in front of it, often with lower default rtt settings)

@dtaht

dtaht commented Dec 16, 2022

There are other tunings for "snappy". Notably TCP_NOTSENT_LOWAT, and depending on the structure of the application, tcp_bbr - only useful for longrunning +10sec flows.
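As a sketch of the first of those tunings: `TCP_NOTSENT_LOWAT` is a per-socket option that caps how much not-yet-sent data the kernel will queue for a connection, so the application only writes fresh data when it can go out soon. The 16 KiB threshold and the helper name below are assumptions for illustration, not a recommendation from this thread.

```python
import socket

# Use the constant from the socket module where available; 25 is the Linux
# value and only serves as a fallback on Linux builds that lack the constant.
TCP_NOTSENT_LOWAT = getattr(socket, "TCP_NOTSENT_LOWAT", 25)


def limit_unsent_backlog(sock, max_unsent_bytes=16384):
    """Cap the amount of unsent data the kernel buffers for this TCP socket."""
    sock.setsockopt(socket.IPPROTO_TCP, TCP_NOTSENT_LOWAT, max_unsent_bytes)
```

A server pushing real-time updates would call `limit_unsent_backlog(conn)` on each accepted connection, so that a newer update does not sit behind a large backlog of stale bytes.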

@jens-maus
Contributor

Isn't this discussion slowly getting too academic? IMHO using fq_codel is perfectly fine and also perfectly in line with other linux distros as @agners pointed out. And just hunting for the last few percentages for improvement might be just a bit too much, especially considering the target OS/application and this is HomeAssistant and not a high throughput or low latency requiring application which might justify such long discussions on that topic.

@unclehack

fq_codel has been used in many environments, including IoT. Packets of the same flow can experience reordering caused by the 8 way associative hashing when using cake. This isn't something I'd want in an IoT oriented system. Many devices have really low amounts of RAM and very simple network stacks.

There are other improvements which can be made to an IoT oriented Linux distribution such as HA OS. These include stability improvements, the implementation of relevant features, disk IO performance improvements, file system optimizations, memory usage improvements, reducing SSD/microSD wear and many others.

fq_codel has already improved HA OS' network latency. Users who need extremely low latency for everything are probably not using the right OS. They probably want to set up the HA OS container image on their Linux distribution with the deeply customized networking configuration they desire. HA OS is meant to be used as it is by most users.

@dtaht

dtaht commented Dec 16, 2022

Not true: Packets of the same flow can experience reordering caused by the 8 way associative hashing when using cake. Doesn't happen with fq_codel either.

I'm cool with y'all sticking with fq_codel, btw. Principal benefit to cake was in deprioritizing the flows I mentioned.

@unclehack

Not true: Packets of the same flow can experience reordering caused by the 8 way associative hashing when using cake. Doesn't happen with fq_codel either.

I'm cool with y'all sticking with fq_codel, btw. Principal benefit to cake was in deprioritizing the flows I mentioned.

sch_cake attempts to reduce hash collisions by using the 8 way associative hashing. This is good in general. There's a particular scenario which can cause reordering. The packets of a flow can hash to another Cake flow if a collision has led to an overwrite of the tag on the previous cake flow. Doesn't this lead to the reordering of packets, depending on the order in which the two Cake queues with packets of the same flow are serviced? Packets are still stored in the same order they arrived in the two Cake queues.

@dtaht

dtaht commented Dec 16, 2022

It is incredibly hard to create that scenario. We actually check for it to some extent with the way_cols statistic. Flows collide rarely enough in the first place (at 10Gbit, I think you can have 400 full size packets outstanding, typically, and there are 1024 queues in the besteffort queue scheme). Checking the biggest libreqos.io installation we have (10k subs, a week's worth of data), only 63 had any way_cols at all (and that's the first thing that has to happen before out-of-order delivery could possibly happen), the biggest one had only .1% way_cols relative to packets.

You are right in that fq_codel won't ever have this behavior, but the odds of it happening even once to deliver out of order packets, in cake, are astronomical, and rapidly compensated for well within a few packet deliveries. Someone could write a pretty good paper on making this pathology happen, I think, (stressing a ton of small packets, perhaps simulating 1k+ voip calls), or, say, 40Gbit of bandwidth flowing through a single instance of cake. We are aware that both fq_codel and cake seem to need more queues at > 10Gbit, and that's usually the case, 64 hardware queues are common, and 64 instances (which is way, way, too much, IMHO)

I kind of judge the "birthday problem" cake solves vis-a-vis fq_codel more important than the possible re-ordering problem.
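The "birthday problem" here is the chance that two flows hash into the same queue. The Python sketch below computes that chance for a plain direct-mapped hash with 1024 queues, assuming uniformly random hashing. It models neither fq_codel's nor cake's real hash function; cake's 8-way set-associative scheme exists precisely to avoid most of these collisions.

```python
def collision_probability(flows, queues=1024):
    """Probability that at least two of `flows` flows share a queue,
    assuming each flow is hashed uniformly at random into one of `queues`."""
    if flows > queues:
        return 1.0  # pigeonhole: more flows than queues guarantees a collision
    p_no_collision = 1.0
    for already_placed in range(flows):
        p_no_collision *= (queues - already_placed) / queues
    return 1.0 - p_no_collision


for n in (2, 10, 38, 100):
    print(n, "flows:", round(collision_probability(n), 4))
```

Even a few dozen concurrent flows make at least one collision likely under a direct-mapped scheme, which is why the number of simultaneously active flows matters more than the raw link speed.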

@dtaht

dtaht commented Dec 16, 2022

The expectation was that with cakemq (which remains unwritten) at the cpe and isp head ends, the per-host/per flow fq of cake would take off, and the diffserv treatments for videoconferencing also. As for this application, don't know. I am happy y'all adopted fq_codel at least. More work on the underlying transports might help, I already mentioned TCP_NOTSENT_LOWAT
