@pvts-mat pvts-mat commented Sep 8, 2025

[LTS 9.4]
CVE-2024-26669
VULN-8198

Problem

https://web.git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=32f2a0afa95fae0d1ceec2ff06e0e816939964b8

net/sched: flower: Fix chain template offload

When a qdisc is deleted from a net device the stack instructs the
underlying driver to remove its flow offload callback from the
associated filter block using the 'FLOW_BLOCK_UNBIND' command. The stack
then continues to replay the removal of the filters in the block for
this driver by iterating over the chains in the block and invoking the
'reoffload' operation of the classifier being used. In turn, the
classifier in its 'reoffload' operation prepares and emits a
'FLOW_CLS_DESTROY' command for each filter.

However, the stack does not do the same for chain templates and the
underlying driver never receives a 'FLOW_CLS_TMPLT_DESTROY' command when
a qdisc is deleted. This results in a memory leak [1] which can be
reproduced using [2].
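
The gap described above can be modeled in a few lines of userspace C. This is only a toy sketch, not kernel code: every `demo_*` name is hypothetical, the driver is reduced to two counters, and the relative ordering of template and filter commands is not meant to mirror the kernel. It shows that an unbind replay which covers filters but not chain templates leaves the driver holding a template, while a replay that also covers templates does not.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model, not kernel code: all demo_* names are hypothetical. */
enum demo_cmd {
    DEMO_CLS_REPLACE,
    DEMO_CLS_DESTROY,
    DEMO_TMPLT_CREATE,
    DEMO_TMPLT_DESTROY,
};

/* Objects the driver currently holds on behalf of the stack. */
struct demo_driver {
    int filters;
    int templates;
};

/* Stand-in for the driver's flow offload callback. */
static void demo_driver_cb(struct demo_driver *drv, enum demo_cmd cmd)
{
    switch (cmd) {
    case DEMO_CLS_REPLACE:   drv->filters++;   break;
    case DEMO_CLS_DESTROY:   drv->filters--;   break;
    case DEMO_TMPLT_CREATE:  drv->templates++; break;
    case DEMO_TMPLT_DESTROY: drv->templates--; break;
    }
}

struct demo_chain {
    int n_filters;
    bool has_template;
};

/* Replay all chains to the driver on bind (add) or unbind (!add). */
static void demo_replay(const struct demo_chain *chains, int n, bool add,
                        bool replay_templates, struct demo_driver *drv)
{
    for (int i = 0; i < n; i++) {
        if (replay_templates && chains[i].has_template)
            demo_driver_cb(drv, add ? DEMO_TMPLT_CREATE
                                    : DEMO_TMPLT_DESTROY);
        for (int f = 0; f < chains[i].n_filters; f++)
            demo_driver_cb(drv, add ? DEMO_CLS_REPLACE
                                    : DEMO_CLS_DESTROY);
    }
}

/* Number of templates the driver still holds after the qdisc is deleted. */
int demo_leaked_templates(bool with_fix)
{
    const struct demo_chain chains[] = { { 2, false }, { 1, true } };
    struct demo_driver drv = { 0, 0 };

    /* Initial offload: filters and the template reach the driver. */
    demo_replay(chains, 2, true, true, &drv);
    /* Unbind replay: templates are covered only when with_fix is set. */
    demo_replay(chains, 2, false, with_fix, &drv);

    /* Filters are always cleaned up, with or without the fix. */
    assert(drv.filters == 0);
    return drv.templates;
}
```

In this model `demo_leaked_templates(false)` returns 1 (the template is never destroyed) and `demo_leaked_templates(true)` returns 0.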

Affected: yes

The bug-affected (or rather fix-modified) files are:

  • net/sched/cls_flower.c
  • net/sched/cls_api.c
  • include/net/sch_generic.h

The "flower" traffic class implemented by net/sched/cls_flower.c is enabled with the CONFIG_NET_CLS_FLOWER option, which is set to m in all LTS 9.4 configs:

$ grep 'CONFIG_NET_CLS_FLOWER\b' configs/*.config

configs/kernel-aarch64-64k-debug-rhel.config:CONFIG_NET_CLS_FLOWER=m
configs/kernel-aarch64-64k-rhel.config:CONFIG_NET_CLS_FLOWER=m
configs/kernel-aarch64-debug-rhel.config:CONFIG_NET_CLS_FLOWER=m
configs/kernel-aarch64-rhel.config:CONFIG_NET_CLS_FLOWER=m
configs/kernel-aarch64-rt-debug-rhel.config:CONFIG_NET_CLS_FLOWER=m
configs/kernel-aarch64-rt-rhel.config:CONFIG_NET_CLS_FLOWER=m
configs/kernel-ppc64le-debug-rhel.config:CONFIG_NET_CLS_FLOWER=m
configs/kernel-ppc64le-rhel.config:CONFIG_NET_CLS_FLOWER=m
configs/kernel-s390x-debug-rhel.config:CONFIG_NET_CLS_FLOWER=m
configs/kernel-s390x-rhel.config:CONFIG_NET_CLS_FLOWER=m
configs/kernel-s390x-zfcpdump-rhel.config:CONFIG_NET_CLS_FLOWER=m
configs/kernel-x86_64-debug-rhel.config:CONFIG_NET_CLS_FLOWER=m
configs/kernel-x86_64-rhel.config:CONFIG_NET_CLS_FLOWER=m
configs/kernel-x86_64-rt-debug-rhel.config:CONFIG_NET_CLS_FLOWER=m
configs/kernel-x86_64-rt-rhel.config:CONFIG_NET_CLS_FLOWER=m

Naturally, the CONFIG_NET_CLS option guarding the net/sched/cls_api.c file is enabled as well.

The mainline fix 32f2a0a identifies bbf7383 as the culprit; that commit is natively present in the ciqlts9_4 history.

Solution

The mainline fix cherry-picks cleanly but breaks the kABI. From the check-kabi output:

*** ERROR - ABI BREAKAGE WAS DETECTED ***

The following symbols have been changed (this will cause an ABI breakage):

flow_block_cb_alloc
flow_block_cb_free
flow_block_cb_lookup
flow_block_cb_setup_simple
flow_indr_block_cb_alloc
flow_indr_dev_register
flow_indr_dev_unregister
qdisc_reset

The breakage is caused by the introduction of the tmplt_reoffload field into the tcf_proto_ops struct:
32f2a0a#diff-acd1ebb1376db4fecebf48e5007639bb90d4ae3e36fe459c2d8bb282fcd218f2R378-R381

The field was added nonetheless, although moved to the end of the struct and wrapped in the RH_KABI_EXTEND macro, which is the sole diff from the mainline fix. The rest of this section argues that it was safe to do so.

Modified-whitelisted symbols relation

Using the same method as in #475, the relation between the affected whitelisted symbols and the modified one can be established (see kabi-break-chain.log). It can be summarized as a hierarchical list:

  • [struct] tcf_proto_ops → [struct] tcf_chain → [struct] tcf_block → [struct] Qdisc_class_ops → [struct] Qdisc_ops → [struct] Qdisc
    • [func] qdisc_reset
    • [func] flow_indr_block_cb_alloc
    • [typedef] flow_indr_block_bind_cb_t
      • [func] flow_indr_dev_unregister
      • [func] flow_indr_dev_register
    • [struct] flow_block_offload → [func] flow_block_cb_setup_simple
    • [struct] flow_block_indr → [struct] flow_block_cb
      • [func] flow_block_cb_lookup
      • [func] flow_block_cb_free
      • [func] flow_block_cb_alloc

How to read this summary:

  • Notation "X → Y" means "X is used in the definition of Y".

  • The parent-child relation has the same meaning, that is

    • X
      • Y

    is the same as "X → Y".

  • At the root of the hierarchy is the modified symbol tcf_proto_ops.

  • The leaves are the whitelisted symbols.

For example, tcf_proto_ops influences the definition of flow_block_cb_lookup through the chain [struct] tcf_proto_ops → [struct] tcf_chain → [struct] tcf_block → [struct] Qdisc_class_ops → [struct] Qdisc_ops → [struct] Qdisc → [struct] flow_block_indr → [struct] flow_block_cb → [func] flow_block_cb_lookup.

Here's the initial chain between [struct] tcf_proto_ops and [struct] Qdisc, common to all whitelisted symbols:

[struct] tcf_proto_ops:

struct tcf_proto_ops {

[struct] tcf_chain:

const struct tcf_proto_ops *tmplt_ops;

[struct] tcf_block:

struct tcf_chain *chain;

[struct] Qdisc_class_ops:

struct tcf_block * (*tcf_block)(struct Qdisc *sch,
unsigned long arg,
struct netlink_ext_ack *extack);

[struct] Qdisc_ops:

const struct Qdisc_class_ops *cl_ops;

[struct] Qdisc:

const struct Qdisc_ops *ops;
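
Every link in the chain above is a pointer. As a minimal sketch of what that means for layout, the following C fragment uses hypothetical `demo_*` stand-ins (not the kernel definitions) for an ops table before and after a field is added, and for a container that refers to it through a pointer. The container's size and field offsets stay the same even though the ops table itself grows.

```c
#include <stddef.h>

/* Hypothetical ops table before and after a new callback is added. */
struct demo_ops_old {
    void (*reoffload)(void);
};

struct demo_ops_new {
    void (*reoffload)(void);
    void (*tmplt_reoffload)(void);
};

/* Containers that reach the ops table only through a pointer. */
struct demo_chain_old {
    int index;
    const struct demo_ops_old *tmplt_ops;
    int refcnt;
};

struct demo_chain_new {
    int index;
    const struct demo_ops_new *tmplt_ops;
    int refcnt;
};

/* The pointed-to struct changed size... */
_Static_assert(sizeof(struct demo_ops_old) != sizeof(struct demo_ops_new),
               "ops table grew");

/* ...but the container is laid out exactly as before. */
_Static_assert(sizeof(struct demo_chain_old) == sizeof(struct demo_chain_new),
               "container size unchanged");
_Static_assert(offsetof(struct demo_chain_old, refcnt) ==
               offsetof(struct demo_chain_new, refcnt),
               "field after the pointer did not move");
```

The file compiles only if all three assertions hold, which they do because a pointer's size does not depend on the struct it points to.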

Usage of the tcf_proto_ops struct

Analyzing the symbol's usage within the patched kernel itself says nothing about kABI preservation directly (within a single cohesive build the kABI is always preserved), but it shows how the symbol can be expected to be used in a downstream custom kernel branch.

Below is the complete list of tcf_proto_ops usages in ciqlts9_4, based on gtags, divided by category of use.

Observations and conclusion

  • Whenever tcf_proto_ops is passed as a function argument it's always through a pointer. This ensures preservation of stack addresses.

  • Whenever tcf_proto_ops is used as a local variable it's always by a pointer. This ensures preservation of stack addresses.

  • Whenever tcf_proto_ops is used as a struct field it's always through a pointer. This allows the containing struct to be used without any kABI-related restraints.

  • Whenever tcf_proto_ops is returned from a function it's always through a pointer. In fact this use case showcases how a tcf_proto_ops object can be obtained in general - through the use of tcf_proto_lookup_ops(…) function exclusively. This defines tcf_proto_ops "creation" API excluding automatic allocation on the stack. However, it does internally depend on static allocation, which is addressed in the last point.

  • The tcf_proto_ops struct is used as a static variable in each and every one of the traffic class modules. This usage defines all allocations of the tcf_proto_ops struct. Theoretically, a user could have defined a custom network traffic class, say myprecious, analogous to those defined in the mainline kernel (u32, route, etc.), like

    static struct tcf_proto_ops cls_myprecious_ops __read_mostly = {
    

    This binary module could then be incompatible with the patched kernel. However, this scenario is highly unlikely, and if tcf_proto_ops was ever used in such way then the user would request to put the tcf_proto_ops symbol on the whitelist directly, yet it's missing.

Considering that the tcf_proto_ops struct is allocated statically, but in a highly controlled and limited fashion, closely related to the definitions of kernel-native traffic classes and unlikely to ever be replicated by an out-of-tree code, and that it's deeply buried, exclusively through pointers, in the substructure of the Qdisc struct that the users actually come to contact with in the whitelisted functions, it was determined that modifying it won't cause binary compatibility problems.

kABI check: passed

$ DESCR_TARGET=1 DEBUG=1 RELAXED_DEPS=1 CVE=CVE-2024-26669 ./ninja.sh -d explain _kabi_checked__x86_64--test--ciqlts9_4-CVE-2024-26669

ninja explain: output state/kernels/ciqlts9_4-CVE-2024-26669/x86_64/kabi_checked doesn't exist
ninja explain: state/kernels/ciqlts9_4-CVE-2024-26669/x86_64/kabi_checked is dirty
[0/1] 	Check ABI of kernel [ciqlts9_4-CVE-2024-26669]	_kabi_checked__x86_64--test--ciqlts9_4-CVE-2024-26669
++ uname -m
+ python3 /data/src/ctrliq-github/kernel-dist-git-el-9.4/SOURCES/check-kabi -k /data/src/ctrliq-github/kernel-dist-git-el-9.4/SOURCES/Module.kabi_x86_64 -s vms/x86_64--build--ciqlts9_4/build_files/kernel-src-tree-ciqlts9_4-CVE-2024-26669/Module.symvers
kABI check passed
+ touch state/kernels/ciqlts9_4-CVE-2024-26669/x86_64/kabi_checked

Boot test: passed

boot-test.log

Kselftests: passed relative

Coverage

Only the net-specific tests were run (collections net, net/forwarding, net/mptcp, netfilter).

Reference

kselftests--ciqlts9_4--run1.log

Patch

kselftests--ciqlts9_4-CVE-2024-26669--run1.log
kselftests--ciqlts9_4-CVE-2024-26669--run2.log

Comparison

The test results for the reference and patched kernels are the same.

$ ktests.xsh diff  kselftests*.log --where 'tests.TestCase LIKE "net%"'

Column    File
--------  ----------------------------------------------
Status0   kselftests--ciqlts9_4--run1.log
Status1   kselftests--ciqlts9_4-CVE-2024-26669--run1.log
Status2   kselftests--ciqlts9_4-CVE-2024-26669--run2.log

TestCase                                          Status0  Status1  Status2  Summary
net/forwarding:bridge_locked_port.sh              pass     pass     pass     same
net/forwarding:bridge_mdb.sh                      skip     skip     skip     same
net/forwarding:bridge_mdb_host.sh                 pass     pass     pass     same
net/forwarding:bridge_mdb_max.sh                  skip     skip     skip     same
net/forwarding:bridge_mdb_port_down.sh            pass     pass     pass     same
net/forwarding:bridge_mld.sh                      pass     pass     pass     same
net/forwarding:bridge_port_isolation.sh           pass     pass     pass     same
net/forwarding:bridge_sticky_fdb.sh               pass     pass     pass     same
net/forwarding:bridge_vlan_aware.sh               pass     pass     pass     same
net/forwarding:bridge_vlan_mcast.sh               pass     pass     pass     same
net/forwarding:bridge_vlan_unaware.sh             pass     pass     pass     same
net/forwarding:custom_multipath_hash.sh           fail     fail     fail     same
net/forwarding:ethtool.sh                         skip     skip     skip     same
net/forwarding:ethtool_extended_state.sh          skip     skip     skip     same
net/forwarding:gre_custom_multipath_hash.sh       fail     fail     fail     same
net/forwarding:gre_inner_v4_multipath.sh          pass     pass     pass     same
net/forwarding:gre_multipath.sh                   pass     pass     pass     same
net/forwarding:gre_multipath_nh.sh                fail     fail     fail     same
net/forwarding:gre_multipath_nh_res.sh            fail     fail     fail     same
net/forwarding:hw_stats_l3.sh                     skip     skip     skip     same
net/forwarding:hw_stats_l3_gre.sh                 skip     skip     skip     same
net/forwarding:ip6_forward_instats_vrf.sh         skip     skip     skip     same
net/forwarding:ip6gre_custom_multipath_hash.sh    fail     fail     fail     same
net/forwarding:ip6gre_flat.sh                     pass     pass     pass     same
net/forwarding:ip6gre_flat_key.sh                 pass     pass     pass     same
net/forwarding:ip6gre_flat_keys.sh                pass     pass     pass     same
net/forwarding:ip6gre_hier.sh                     pass     pass     pass     same
net/forwarding:ip6gre_hier_key.sh                 pass     pass     pass     same
net/forwarding:ip6gre_hier_keys.sh                pass     pass     pass     same
net/forwarding:ip6gre_inner_v4_multipath.sh       pass     pass     pass     same
net/forwarding:ipip_flat_gre.sh                   pass     pass     pass     same
net/forwarding:ipip_flat_gre_key.sh               pass     pass     pass     same
net/forwarding:ipip_flat_gre_keys.sh              pass     pass     pass     same
net/forwarding:ipip_hier_gre.sh                   pass     pass     pass     same
net/forwarding:ipip_hier_gre_key.sh               pass     pass     pass     same
net/forwarding:local_termination.sh               skip     skip     skip     same
net/forwarding:loopback.sh                        skip     skip     skip     same
net/forwarding:mirror_gre.sh                      pass     pass     pass     same
net/forwarding:mirror_gre_bound.sh                pass     pass     pass     same
net/forwarding:mirror_gre_bridge_1d.sh            pass     pass     pass     same
net/forwarding:mirror_gre_bridge_1q.sh            pass     pass     pass     same
net/forwarding:mirror_gre_bridge_1q_lag.sh        pass     pass     pass     same
net/forwarding:mirror_gre_changes.sh              pass     pass     pass     same
net/forwarding:mirror_gre_flower.sh               pass     pass     pass     same
net/forwarding:mirror_gre_lag_lacp.sh             pass     pass     pass     same
net/forwarding:mirror_gre_neigh.sh                pass     pass     pass     same
net/forwarding:mirror_gre_nh.sh                   pass     pass     pass     same
net/forwarding:mirror_gre_vlan.sh                 pass     pass     pass     same
net/forwarding:mirror_vlan.sh                     pass     pass     pass     same
net/forwarding:no_forwarding.sh                   pass     pass     pass     same
net/forwarding:pedit_dsfield.sh                   pass     pass     pass     same
net/forwarding:pedit_ip.sh                        pass     pass     pass     same
net/forwarding:pedit_l4port.sh                    pass     pass     pass     same
net/forwarding:q_in_vni_ipv6.sh                   pass     pass     pass     same
net/forwarding:router.sh                          skip     skip     skip     same
net/forwarding:router_bridge.sh                   pass     pass     pass     same
net/forwarding:router_bridge_1d.sh                pass     pass     pass     same
net/forwarding:router_bridge_pvid_vlan_upper.sh   pass     pass     pass     same
net/forwarding:router_bridge_vlan.sh              pass     pass     pass     same
net/forwarding:router_bridge_vlan_upper.sh        pass     pass     pass     same
net/forwarding:router_bridge_vlan_upper_pvid.sh   pass     pass     pass     same
net/forwarding:router_broadcast.sh                pass     pass     pass     same
net/forwarding:router_mpath_nh.sh                 fail     fail     fail     same
net/forwarding:router_mpath_nh_res.sh             pass     pass     pass     same
net/forwarding:router_multicast.sh                skip     skip     skip     same
net/forwarding:router_multipath.sh                fail     fail     fail     same
net/forwarding:router_nh.sh                       pass     pass     pass     same
net/forwarding:router_vid_1.sh                    pass     pass     pass     same
net/forwarding:skbedit_priority.sh                pass     pass     pass     same
net/forwarding:tc_chains.sh                       pass     pass     pass     same
net/forwarding:tc_flower.sh                       pass     pass     pass     same
net/forwarding:tc_flower_cfm.sh                   fail     fail     fail     same
net/forwarding:tc_flower_l2_miss.sh               fail     fail     fail     same
net/forwarding:tc_flower_router.sh                pass     pass     pass     same
net/forwarding:tc_mpls_l2vpn.sh                   pass     pass     pass     same
net/forwarding:tc_shblocks.sh                     pass     pass     pass     same
net/forwarding:tc_tunnel_key.sh                   skip     skip     skip     same
net/forwarding:tc_vlan_modify.sh                  pass     pass     pass     same
net/forwarding:vxlan_asymmetric.sh                pass     pass     pass     same
net/forwarding:vxlan_asymmetric_ipv6.sh           pass     pass     pass     same
net/forwarding:vxlan_bridge_1d.sh                 pass     pass     pass     same
net/forwarding:vxlan_bridge_1d_port_8472.sh       pass     pass     pass     same
net/forwarding:vxlan_bridge_1d_port_8472_ipv6.sh  pass     pass     pass     same
net/forwarding:vxlan_bridge_1q.sh                 pass     pass     pass     same
net/forwarding:vxlan_bridge_1q_ipv6.sh            pass     pass     pass     same
net/forwarding:vxlan_bridge_1q_port_8472.sh       pass     pass     pass     same
net/forwarding:vxlan_bridge_1q_port_8472_ipv6.sh  pass     pass     pass     same
net/forwarding:vxlan_symmetric.sh                 pass     pass     pass     same
net/forwarding:vxlan_symmetric_ipv6.sh            pass     pass     pass     same
net/hsr:hsr_ping.sh                               fail     fail     fail     same
net/mptcp:diag.sh                                 pass     pass     pass     same
net/mptcp:mptcp_connect.sh                        pass     pass     pass     same
net/mptcp:mptcp_sockopt.sh                        pass     pass     pass     same
net/mptcp:pm_netlink.sh                           pass     pass     pass     same
net:altnames.sh                                   pass     pass     pass     same
net:bareudp.sh                                    pass     pass     pass     same
net:big_tcp.sh                                    skip     skip     skip     same
net:cmsg_so_mark.sh                               pass     pass     pass     same
net:devlink_port_split.py                         skip     skip     skip     same
net:drop_monitor_tests.sh                         skip     skip     skip     same
net:fcnal-test.sh                                 skip     skip     skip     same
net:fib-onlink-tests.sh                           pass     pass     pass     same
net:fib_nexthop_multiprefix.sh                    pass     pass     pass     same
net:fib_nexthop_nongw.sh                          pass     pass     pass     same
net:fib_rule_tests.sh                             pass     pass     pass     same
net:fib_tests.sh                                  fail     fail     fail     same
net:fin_ack_lat.sh                                pass     pass     pass     same
net:gre_gso.sh                                    skip     skip     skip     same
net:icmp.sh                                       fail     fail     fail     same
net:icmp_redirect.sh                              pass     pass     pass     same
net:io_uring_zerocopy_tx.sh                       fail     fail     fail     same
net:ip6_gre_headroom.sh                           pass     pass     pass     same
net:ipv6_flowlabel.sh                             pass     pass     pass     same
net:l2_tos_ttl_inherit.sh                         skip     skip     skip     same
net:l2tp.sh                                       pass     pass     pass     same
net:msg_zerocopy.sh                               pass     pass     pass     same
net:netdevice.sh                                  pass     pass     pass     same
net:pmtu.sh                                       fail     fail     fail     same
net:psock_snd.sh                                  pass     pass     pass     same
net:reuseaddr_ports_exhausted.sh                  pass     pass     pass     same
net:reuseport_bpf                                 pass     pass     pass     same
net:reuseport_bpf_cpu                             pass     pass     pass     same
net:reuseport_bpf_numa                            pass     pass     pass     same
net:reuseport_dualstack                           pass     pass     pass     same
net:route_localnet.sh                             pass     pass     pass     same
net:rps_default_mask.sh                           pass     pass     pass     same
net:rtnetlink.sh                                  skip     skip     skip     same
net:run_afpackettests                             pass     pass     pass     same
net:run_netsocktests                              pass     pass     pass     same
net:rxtimestamp.sh                                pass     pass     pass     same
net:so_txtime.sh                                  pass     pass     pass     same
net:srv6_end_next_csid_l3vpn_test.sh              pass     pass     pass     same
net:srv6_hencap_red_l3vpn_test.sh                 pass     pass     pass     same
net:srv6_hl2encap_red_l2vpn_test.sh               pass     pass     pass     same
net:stress_reuseport_listen.sh                    pass     pass     pass     same
net:tcp_fastopen_backup_key.sh                    pass     pass     pass     same
net:test_blackhole_dev.sh                         fail     fail     fail     same
net:test_bpf.sh                                   pass     pass     pass     same
net:test_bridge_neigh_suppress.sh                 skip     skip     skip     same
net:test_vxlan_fdb_changelink.sh                  pass     pass     pass     same
net:test_vxlan_under_vrf.sh                       pass     pass     pass     same
net:tls                                           pass     pass     pass     same
net:traceroute.sh                                 pass     pass     pass     same
net:udpgro.sh                                     fail     fail     fail     same
net:udpgro_bench.sh                               fail     fail     fail     same
net:udpgso.sh                                     pass     pass     pass     same
net:unicast_extensions.sh                         pass     pass     pass     same
net:veth.sh                                       fail     fail     fail     same
net:vrf-xfrm-tests.sh                             pass     pass     pass     same
net:vrf_route_leaking.sh                          pass     pass     pass     same
net:vrf_strict_mode_test.sh                       pass     pass     pass     same
netfilter:bridge_brouter.sh                       skip     skip     skip     same
netfilter:conntrack_icmp_related.sh               fail     fail     fail     same
netfilter:conntrack_tcp_unreplied.sh              fail     fail     fail     same
netfilter:conntrack_vrf.sh                        skip     skip     skip     same
netfilter:ipip-conntrack-mtu.sh                   skip     skip     skip     same
netfilter:ipvs.sh                                 skip     skip     skip     same
netfilter:nf_nat_edemux.sh                        skip     skip     skip     same
netfilter:nft_audit.sh                            fail     fail     fail     same
netfilter:nft_concat_range.sh                     fail     fail     fail     same
netfilter:nft_conntrack_helper.sh                 skip     skip     skip     same
netfilter:nft_fib.sh                              skip     skip     skip     same
netfilter:nft_flowtable.sh                        fail     fail     fail     same
netfilter:nft_meta.sh                             pass     pass     pass     same
netfilter:nft_nat.sh                              skip     skip     skip     same
netfilter:nft_queue.sh                            skip     skip     skip     same
netfilter:rpath.sh                                pass     pass     pass     same

Specific tests: skipped

Comment on lines 397 to 401

RH_KABI_EXTEND(void (*tmplt_reoffload)(struct tcf_chain *chain,
bool add,
flow_setup_cb_t *cb,
void *cb_priv))
Collaborator

Adding new fields to struct tcf_proto_ops is not kABI-safe because this struct is allocated from within drivers themselves.

Example: net/sched/cls_fw.c:423:static struct tcf_proto_ops cls_fw_ops __read_mostly = {

Contributor Author

Yes, but aren't these drivers bundled with the kernel? The way I understand the kABI issue the binary incompatibility can arise if a driver is developed out-of-tree. Is anyone implementing a custom traffic class?

Collaborator

Sorry, just realized you address this above:

Considering that the tcf_proto_ops struct is allocated statically, but in a highly controlled and limited fashion, closely related to the definitions of kernel-native traffic classes and unlikely to ever be replicated by an out-of-tree code, and that it's deeply buried, exclusively through pointers, in the substructure of the Qdisc struct that the users actually come to contact with in the whitelisted functions, it was determined that modifying it won't cause binary compatibility problems.

While I don't disagree, we have no way of proving this. If this assumption is wrong, there will be a clear out-of-bounds access and potentially exploitable indirect branch.
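
To make the concern concrete, here is a hedged C sketch of the hypothetical scenario under discussion: a module that statically allocates the ops table against the old layout and is then loaded into a kernel that expects the new one. The `demo_*` structs are invented for illustration and are much smaller than the real `tcf_proto_ops`. The sketch compares appending the field at the end with inserting it in the middle.

```c
#include <stddef.h>

/* Layout a hypothetical out-of-tree module was compiled against. */
struct demo_ops_old {
    void (*reoffload)(void);
    void *owner;
};

/* New callback appended at the end of the struct. */
struct demo_ops_appended {
    void (*reoffload)(void);
    void *owner;
    void (*tmplt_reoffload)(void);
};

/* New callback inserted in the middle of the struct. */
struct demo_ops_inserted {
    void (*reoffload)(void);
    void (*tmplt_reoffload)(void);
    void *owner;
};

/*
 * Appended: the new field lies past the end of an object allocated with
 * the old layout, so reading it through such an object is out of bounds.
 */
_Static_assert(offsetof(struct demo_ops_appended, tmplt_reoffload) >=
               sizeof(struct demo_ops_old),
               "new field is outside the old object");

/* Appended: fields that already existed keep their offsets. */
_Static_assert(offsetof(struct demo_ops_appended, owner) ==
               offsetof(struct demo_ops_old, owner),
               "existing field did not move");

/* Inserted: fields after the insertion point shift. */
_Static_assert(offsetof(struct demo_ops_inserted, owner) !=
               offsetof(struct demo_ops_old, owner),
               "existing field moved");
```

Either placement is a problem for an object built with the old layout: with the append, the kernel would read a function pointer from memory beyond the object; with the insert, it would additionally misread the existing fields that moved.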

Collaborator

Is anyone implementing a custom traffic class?

I wouldn't be surprised. There are plenty of experimental Linux networking modules out there that exist out-of-tree.

Collaborator

It looks like RH just decided to break the kABI with this fix, taking advantage of the 9.4 to 9.5 transition to do so. So we can't look to RH for guidance on this one it seems.

Collaborator

Ok, let me dig a bit more, not arguing for the patch in the current form, just wanted to better understand this topic - how about this part:

if tcf_proto_ops was ever used in such way then the user would request to put the tcf_proto_ops symbol on the whitelist directly, yet it's missing.

Is it wrong to assume that someone implementing a custom traffic class would ask for tcf_proto_ops to be whitelisted explicitly, as part of driver's API? I'm assuming here we only care about the whitelisted symbols and not trying to keep ABI with everyone.

Oh! Good point. Since register_tcf_proto_ops isn't in the kABI stablelist, and it is the only way a tcf_proto_ops pointer can be reached through struct Qdisc, there's no kABI breakage in the strictest sense.

I think that adding the new member onto the end of struct tcf_proto_ops could be misleading and make it seem like struct tcf_proto_ops needs kABI stability. Instead, I think the new member should be added to the same place as in the upstream patch, and then use RH_KABI_EXCLUDE to appease check-kabi.

Collaborator

Well the KABI checker is mad with this :\

Contributor Author

@pvts-mat pvts-mat Sep 9, 2025

Well the KABI checker is mad with this :\

It seems like - from the descriptions of RH_* macros alone - the RH_KABI_BROKEN_INSERT would be most appropriate

* RH_KABI_BROKEN_INSERT
* RH_KABI_BROKEN_REMOVE
*   Insert a field to the middle of a struct / delete a field from a struct.
*   Note that this breaks kABI! It can be done only when it's certain that
*   no 3rd party driver can validly reach into the struct.  A typical
*   example is a struct that is:  both (a) referenced only through a long
*   chain of pointers from another struct that is part of a whitelisted
*   symbol and (b) kernel internal only, it should have never been visible
*   to genksyms in the first place.

I just tested it and check-kabi is cool with it. (I actually checked RH_KABI_EXCLUDE before committing too; it turns out I checked it for another CVE by mistake, sorry.)

This doesn't answer the question of why RH_KABI_EXCLUDE didn't work while it should though…

Collaborator

I wonder if it would have worked if we stuck the inserted code at the end like you originally did.

Collaborator

Technically shouldn't be different since those two macros are defined the same, which is to make the code disappear when the genksyms pass looks at the header.
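
The mechanism described here can be sketched with a single conditional macro. `DEMO_KABI_HIDE` below is a hypothetical, simplified stand-in and not the actual definition of any `RH_KABI_*` macro; it assumes only that the genksyms pass defines `__GENKSYMS__` while preprocessing. Under that assumption, the checksum pass sees the struct without the new member while the compiler sees it with the member.

```c
#include <stddef.h>

/*
 * Hypothetical stand-in for the RH_KABI_* "hide from genksyms" macros:
 * expands to nothing during the genksyms pass, and to the member
 * declaration during a normal compile.
 */
#ifdef __GENKSYMS__
# define DEMO_KABI_HIDE(_new)
#else
# define DEMO_KABI_HIDE(_new) _new;
#endif

struct demo_ops {
    void (*reoffload)(void);
    DEMO_KABI_HIDE(void (*tmplt_reoffload)(void))
};

#ifndef __GENKSYMS__
/* In a normal compile the member really exists and takes up space. */
_Static_assert(sizeof(struct demo_ops) == 2 * sizeof(void (*)(void)),
               "hidden member is present in the compiled struct");
#endif
```

Because the member only disappears from the checksum input, the macro changes what the kABI check compares and not the binary layout, which is why where the member sits in the struct still matters for real compatibility.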

@pvts-mat pvts-mat force-pushed the ciqlts9_4-CVE-2024-26669 branch from b0f5ac1 to 1d1f18a Compare September 9, 2025 15:43
@pvts-mat pvts-mat requested a review from kerneltoast September 9, 2025 15:51
kerneltoast previously approved these changes Sep 9, 2025
Collaborator

@kerneltoast kerneltoast left a comment

🚢

jira VULN-8198
cve CVE-2024-26669
commit-author Ido Schimmel <idosch@nvidia.com>
commit 32f2a0a
upstream-diff |
  Adding `tmplt_reoffload' field to the `tcf_proto_ops' struct breaks kABI
  for the whitelisted symbols:
  - flow_block_cb_alloc
  - flow_block_cb_free
  - flow_block_cb_lookup
  - flow_block_cb_setup_simple
  - flow_indr_block_cb_alloc
  - flow_indr_dev_register
  - flow_indr_dev_unregister
  - qdisc_reset
  Added it with the `RH_KABI_BROKEN_INSERT' tag anyway because `tcf_proto_ops'
  was not put on the whitelist directly and changing it affects kABI only
  through the `Qdisc' struct used as parameter in the functions listed
  above. Since `register_tcf_proto_ops' isn't in the kABI stablelist, and
  it is the only way a `tcf_proto_ops' pointer can be reached through
  struct `Qdisc', there's no kABI breakage in the strictest sense.

When a qdisc is deleted from a net device the stack instructs the
underlying driver to remove its flow offload callback from the
associated filter block using the 'FLOW_BLOCK_UNBIND' command. The stack
then continues to replay the removal of the filters in the block for
this driver by iterating over the chains in the block and invoking the
'reoffload' operation of the classifier being used. In turn, the
classifier in its 'reoffload' operation prepares and emits a
'FLOW_CLS_DESTROY' command for each filter.

However, the stack does not do the same for chain templates and the
underlying driver never receives a 'FLOW_CLS_TMPLT_DESTROY' command when
a qdisc is deleted. This results in a memory leak [1] which can be
reproduced using [2].

Fix by introducing a 'tmplt_reoffload' operation and have the stack
invoke it with the appropriate arguments as part of the replay.
Implement the operation in the sole classifier that supports chain
templates (flower) by emitting the 'FLOW_CLS_TMPLT_{CREATE,DESTROY}'
command based on whether a flow offload callback is being bound to a
filter block or being unbound from one.

As far as I can tell, the issue happens since cited commit which
reordered tcf_block_offload_unbind() before tcf_block_flush_all_chains()
in __tcf_block_put(). The order cannot be reversed as the filter block
is expected to be freed after flushing all the chains.

[1]
unreferenced object 0xffff888107e28800 (size 2048):
  comm "tc", pid 1079, jiffies 4294958525 (age 3074.287s)
  hex dump (first 32 bytes):
    b1 a6 7c 11 81 88 ff ff e0 5b b3 10 81 88 ff ff  ..|......[......
    01 00 00 00 00 00 00 00 e0 aa b0 84 ff ff ff ff  ................
  backtrace:
    [<ffffffff81c06a68>] __kmem_cache_alloc_node+0x1e8/0x320
    [<ffffffff81ab374e>] __kmalloc+0x4e/0x90
    [<ffffffff832aec6d>] mlxsw_sp_acl_ruleset_get+0x34d/0x7a0
    [<ffffffff832bc195>] mlxsw_sp_flower_tmplt_create+0x145/0x180
    [<ffffffff832b2e1a>] mlxsw_sp_flow_block_cb+0x1ea/0x280
    [<ffffffff83a10613>] tc_setup_cb_call+0x183/0x340
    [<ffffffff83a9f85a>] fl_tmplt_create+0x3da/0x4c0
    [<ffffffff83a22435>] tc_ctl_chain+0xa15/0x1170
    [<ffffffff838a863c>] rtnetlink_rcv_msg+0x3cc/0xed0
    [<ffffffff83ac87f0>] netlink_rcv_skb+0x170/0x440
    [<ffffffff83ac6270>] netlink_unicast+0x540/0x820
    [<ffffffff83ac6e28>] netlink_sendmsg+0x8d8/0xda0
    [<ffffffff83793def>] ____sys_sendmsg+0x30f/0xa80
    [<ffffffff8379d29a>] ___sys_sendmsg+0x13a/0x1e0
    [<ffffffff8379d50c>] __sys_sendmsg+0x11c/0x1f0
    [<ffffffff843b9ce0>] do_syscall_64+0x40/0xe0
unreferenced object 0xffff88816d2c0400 (size 1024):
  comm "tc", pid 1079, jiffies 4294958525 (age 3074.287s)
  hex dump (first 32 bytes):
    40 00 00 00 00 00 00 00 57 f6 38 be 00 00 00 00  @.......W.8.....
    10 04 2c 6d 81 88 ff ff 10 04 2c 6d 81 88 ff ff  ..,m......,m....
  backtrace:
    [<ffffffff81c06a68>] __kmem_cache_alloc_node+0x1e8/0x320
    [<ffffffff81ab36c1>] __kmalloc_node+0x51/0x90
    [<ffffffff81a8ed96>] kvmalloc_node+0xa6/0x1f0
    [<ffffffff82827d03>] bucket_table_alloc.isra.0+0x83/0x460
    [<ffffffff82828d2b>] rhashtable_init+0x43b/0x7c0
    [<ffffffff832aed48>] mlxsw_sp_acl_ruleset_get+0x428/0x7a0
    [<ffffffff832bc195>] mlxsw_sp_flower_tmplt_create+0x145/0x180
    [<ffffffff832b2e1a>] mlxsw_sp_flow_block_cb+0x1ea/0x280
    [<ffffffff83a10613>] tc_setup_cb_call+0x183/0x340
    [<ffffffff83a9f85a>] fl_tmplt_create+0x3da/0x4c0
    [<ffffffff83a22435>] tc_ctl_chain+0xa15/0x1170
    [<ffffffff838a863c>] rtnetlink_rcv_msg+0x3cc/0xed0
    [<ffffffff83ac87f0>] netlink_rcv_skb+0x170/0x440
    [<ffffffff83ac6270>] netlink_unicast+0x540/0x820
    [<ffffffff83ac6e28>] netlink_sendmsg+0x8d8/0xda0
    [<ffffffff83793def>] ____sys_sendmsg+0x30f/0xa80

[2]
 # tc qdisc add dev swp1 clsact
 # tc chain add dev swp1 ingress proto ip chain 1 flower dst_ip 0.0.0.0/32
 # tc qdisc del dev swp1 clsact
 # devlink dev reload pci/0000:06:00.0

Fixes: bbf7383 ("net: sched: traverse chains in block with tcf_get_next_chain()")
	Signed-off-by: Ido Schimmel <idosch@nvidia.com>
	Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 32f2a0a)
	Signed-off-by: Marcin Wcisło <marcin.wcislo@conclusive.pl>

Collaborator

@kerneltoast kerneltoast left a comment

So many kABI macros that all do the same thing @_@

Collaborator

@PlaidCat PlaidCat left a comment

:shipit:

@PlaidCat PlaidCat merged commit 1e2ed2a into ctrliq:ciqlts9_4 Sep 10, 2025
4 of 5 checks passed