Intercept and serve memcached
requests from eBPF.
Warning
This project was made for learning purposes and is not meant for production usage.
Install bpf-linker:
cargo install bpf-linker
Start memcached-ebpf-proxy-cache
:
RUST_LOG=debug cargo xtask run -- --iface lo
memcached-ebpf-proxy-cache maintains a smaller cache in the eBPF layer in front of memcached to service memcached requests right from the eBPF layer. In theory, for small GET requests (< 250 bytes), the networking stack adds a substantial overhead while the request is processed. eBPF allows a way to intercept and respond to the request even before it enters the networking stack.
Note: The phrase In theory is important here.
memcached-ebpf-proxy-cache
works with the memcached
binary protocol.
In NIC's which natively support the eBPF bytecode, these packets can be serviced directly from the NIC.
Fig: Dataflow through memcached-ebpf-proxy-cache
We maintain a BPF_MAP_TYPE_ARRAY for storing the cache entries in our eBPF program. We use it to implement hashmap as follows:
CACHE[ fnv_1_a_hash(KEY) % CACHE_SIZE ] = (extra, KEY, VAL)
where extra
refers to the extra
bytes in a memcached GET
response.
Let's discuss the ingress and egress path seperately.
The ingress path involves the following eBPF programs:
rx_filter
: Ingress entrypoint - filters memcached packets and lets other kind of packets pass through. For memcached GET packets, it makes as tail call tohash_key
. For set packets, it makes a tail call toinvalidate_cache
.hash_key
: Hashes the KEY in packet with the the fnv_1_a_hash algorithm. Looks up the entry athash % CACHE_SIZE
. If cache-hit, then tail call towrite_reply
, else let the packet pass throught with xdp actionXDP_PASS
write_reply
: Modify packet: swap Eth header source dest, ip header source dest, UDP header source dest and rewrite body with (extra, KEY, VAL) tuple. Send the packet back to the requesting NIC withXDP_TX
. This effectively processes the memcached request packet completey within our eBPF program.invalidate_cache
: Reached through tail call fromrx_filter
on SET request. Hash key, lookup cache entry at indexhash % CACHE_SIZE
and mark it as invalid. Invalid cache entries are updated on the egress path in theupdate_cache
eBPF program.
The egress path involves the following eBPF programs:
tx_filter
: Egress entrypoint - filters memcached GET response packets and lets other kinds packets pass through withTC_ACT_OK
. Make a tail call toupdate_cache
for GET response packets.update_cache
: Hashes KEY in reponse and looks up entry athash % CACHE_SIZE
. If entry is invalid, update cache entry with (extra, KEY, VAL) tuple. Once done updating, let the packet pass through withTC_ACT_OK
.
Start tracing UDP packet traffic on port 11211 with the following command:
sudo tcpdump -Xi lo -n udp port 11211
First, test the control behaviour by running the test:
cargo test
Which runs the following rust code:
let get_client = memcache::connect(memcached_get_endpoint).unwrap();
let set_client = memcache::connect(memcached_set_endpoint).unwrap();
const KEY: &str = "key";
const VAL: &str = "val";
// SET once. GET twice
set_client.set(KEY, VAL, 10).unwrap();
assert_eq!(get_client.get::<String>(KEY).unwrap().unwrap(), VAL);
assert_eq!(get_client.get::<String>(KEY).unwrap().unwrap(), VAL);
Running the test should yield a similar packet dump to the following packet dump:
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on lo, link-type EN10MB (Ethernet), snapshot length 262144 bytes
21:20:40.642266 IP 127.0.0.1.58390 > 127.0.0.1.11211: UDP, length 32
0x0000: 4500 003c 82b7 4000 4011 b9f7 7f00 0001 E..<..@.@.......
0x0010: 7f00 0001 e416 2bcb 0028 fe3b 7152 0000 ......+..(.;qR..
0x0020: 0001 0000 800b 0000 0000 0000 0000 0000 ................
0x0030: 0000 0000 0000 0000 0000 0000 ............
21:20:40.642572 IP 127.0.0.1.11211 > 127.0.0.1.58390: UDP, length 38
0x0000: 4500 0042 42f5 4000 4011 f9b3 7f00 0001 E..BB.@.@.......
0x0010: 7f00 0001 2bcb e416 002e fe41 7152 0000 ....+......AqR..
0x0020: 0001 0000 810b 0000 0000 0000 0000 0006 ................
0x0030: 0000 0000 0000 0000 0000 0000 312e 362e ............1.6.
0x0040: 3134 14
21:20:40.642829 IP 127.0.0.1.58390 > 127.0.0.1.11211: UDP, length 35
0x0000: 4500 003f 82b8 4000 4011 b9f3 7f00 0001 E..?..@.@.......
0x0010: 7f00 0001 e416 2bcb 002b fe3e 7153 0000 ......+..+.>qS..
0x0020: 0001 0000 8000 0003 0000 0000 0000 0003 ................
0x0030: 0000 0000 0000 0000 0000 0000 6b65 79 ............key
21:20:40.642951 IP 127.0.0.1.11211 > 127.0.0.1.58390: UDP, length 39
0x0000: 4500 0043 42f6 4000 4011 f9b1 7f00 0001 E..CB.@.@.......
0x0010: 7f00 0001 2bcb e416 002f fe42 7153 0000 ....+..../.BqS..
0x0020: 0001 0000 8100 0000 0400 0000 0000 0007 ................
0x0030: 0000 0000 0000 0000 0000 0115 0000 0000 ................
0x0040: 7661 6c val
21:20:40.643061 IP 127.0.0.1.58390 > 127.0.0.1.11211: UDP, length 32
0x0000: 4500 003c 82b9 4000 4011 b9f5 7f00 0001 E..<..@.@.......
0x0010: 7f00 0001 e416 2bcb 0028 fe3b 7154 0000 ......+..(.;qT..
0x0020: 0001 0000 800b 0000 0000 0000 0000 0000 ................
0x0030: 0000 0000 0000 0000 0000 0000 ............
21:20:40.643153 IP 127.0.0.1.11211 > 127.0.0.1.58390: UDP, length 38
0x0000: 4500 0042 42f7 4000 4011 f9b1 7f00 0001 E..BB.@.@.......
0x0010: 7f00 0001 2bcb e416 002e fe41 7154 0000 ....+......AqT..
0x0020: 0001 0000 810b 0000 0000 0000 0000 0006 ................
0x0030: 0000 0000 0000 0000 0000 0000 312e 362e ............1.6.
0x0040: 3134 14
21:20:40.643187 IP 127.0.0.1.58390 > 127.0.0.1.11211: UDP, length 35
0x0000: 4500 003f 82ba 4000 4011 b9f1 7f00 0001 E..?..@.@.......
0x0010: 7f00 0001 e416 2bcb 002b fe3e 7155 0000 ......+..+.>qU..
0x0020: 0001 0000 8000 0003 0000 0000 0000 0003 ................
0x0030: 0000 0000 0000 0000 0000 0000 6b65 79 ............key
21:20:40.643309 IP 127.0.0.1.11211 > 127.0.0.1.58390: UDP, length 39
0x0000: 4500 0043 42f8 4000 4011 f9af 7f00 0001 E..CB.@.@.......
0x0010: 7f00 0001 2bcb e416 002f fe42 7155 0000 ....+..../.BqU..
0x0020: 0001 0000 8100 0000 0400 0000 0000 0007 ................
0x0030: 0000 0000 0000 0000 0000 0115 0000 0000 ................
0x0040: 7661 6c val
Notice that there are two GET request packets with "key" payload.
Now start memcached-ebpf-proxy-cache
:
RUST_LOG=debug cargo xtask run -- --iface lo
The first GET
should trigger an update_cache
while the second GET
should lead
to a cache hit and trigger write_reply
. Let's see if we can reproduce this behaviour.
Now run cargo test
and inspect the traffic again.
Running the test should yield a similar packet dump to the following packet dump:
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on lo, link-type EN10MB (Ethernet), snapshot length 262144 bytes
21:25:21.492337 IP 127.0.0.1.51924 > 127.0.0.1.11211: UDP, length 32
0x0000: 4500 003c 92c7 4000 4011 a9e7 7f00 0001 E..<..@.@.......
0x0010: 7f00 0001 cad4 2bcb 0028 fe3b 8dbe 0000 ......+..(.;....
0x0020: 0001 0000 800b 0000 0000 0000 0000 0000 ................
0x0030: 0000 0000 0000 0000 0000 0000 ............
21:25:21.492539 IP 127.0.0.1.11211 > 127.0.0.1.51924: UDP, length 38
0x0000: 4500 0042 a004 4000 4011 9ca4 7f00 0001 E..B..@.@.......
0x0010: 7f00 0001 2bcb cad4 002e fe41 8dbe 0000 ....+......A....
0x0020: 0001 0000 810b 0000 0000 0000 0000 0006 ................
0x0030: 0000 0000 0000 0000 0000 0000 312e 362e ............1.6.
0x0040: 3134 14
21:25:21.492749 IP 127.0.0.1.51924 > 127.0.0.1.11211: UDP, length 35
0x0000: 4500 003f 92c8 4000 4011 a9e3 7f00 0001 E..?..@.@.......
0x0010: 7f00 0001 cad4 2bcb 002b fe3e 8dbf 0000 ......+..+.>....
0x0020: 0001 0000 800c 0003 0000 0000 0000 0003 ................
0x0030: 0000 0000 0000 0000 0000 0000 6b65 79 ............key
21:25:21.492874 IP 127.0.0.1.11211 > 127.0.0.1.51924: UDP, length 42
0x0000: 4500 0046 a005 4000 4011 9c9f 7f00 0001 E..F..@.@.......
0x0010: 7f00 0001 2bcb cad4 0032 fe45 8dbf 0000 ....+....2.E....
0x0020: 0001 0000 810c 0003 0400 0000 0000 000a ................
0x0030: 0000 0000 0000 0000 0000 0117 0000 0000 ................
0x0040: 6b65 7976 616c keyval
21:25:21.493021 IP 127.0.0.1.51924 > 127.0.0.1.11211: UDP, length 32
0x0000: 4500 003c 92c9 4000 4011 a9e5 7f00 0001 E..<..@.@.......
0x0010: 7f00 0001 cad4 2bcb 0028 fe3b 8dc0 0000 ......+..(.;....
0x0020: 0001 0000 800b 0000 0000 0000 0000 0000 ................
0x0030: 0000 0000 0000 0000 0000 0000 ............
21:25:21.493178 IP 127.0.0.1.11211 > 127.0.0.1.51924: UDP, length 38
0x0000: 4500 0042 a006 4000 4011 9ca2 7f00 0001 E..B..@.@.......
0x0010: 7f00 0001 2bcb cad4 002e fe41 8dc0 0000 ....+......A....
0x0020: 0001 0000 810b 0000 0000 0000 0000 0006 ................
0x0030: 0000 0000 0000 0000 0000 0000 312e 362e ............1.6.
0x0040: 3134 14
21:25:21.493441 IP 127.0.0.1.11211 > 127.0.0.1.51924: UDP, length 42
0x0000: 4500 0046 92ca 4000 4011 a9da 7f00 0001 E..F..@.@.......
0x0010: 7f00 0001 2bcb cad4 0032 0000 8dc1 0000 ....+....2......
0x0020: 0001 0000 810c 0003 0400 0000 0000 000a ................
0x0030: 0000 0000 0000 0000 0000 0000 0000 0000 ................
0x0040: 6b65 7976 616c keyval
Notice that there is only one GET request and two GETK responses. This is our proof that the last request was completely handled by the eBPF layer.
You might also notice that we are getting both key and value in the reponse. That's because we modify GET requests to GETK requests. This is important because we need the key in the reponse packet in order to be able to update the cache.
Now a word of honesty: this doesn't actually lead to a speedup as we expected, even though we are
able to reduce the number of packets. Actually, we notice an overhead when we run memcached-ebpf-proxy-cache
.
This project was originally inspired by the paper: "BMC: Accelerating Memcached using Safe In-kernel Caching and Pre-stack Processing"
The original paper used the memcached ASCII protocol. Also it made some enhancements to memcached to
be able to utilize SO_REUSEPORT
to enhance multi-threaded performance.
In the project, I wanted to explore how far I could go using only eBPF and the stock memcached
package
that's available in Ubuntu 22.04 or Arch Linux package repository. I also opted for the memcached binary
protocol since that's the one that's mostly used in production. So any lack of performance increase is solely
on this particular implementation. A faithful implementation of the paper should definitely be able to achieve
the published speedup.
The original paper authors also provided their own implementation here: https://github.com/Orange-OpenSource/bmc-cache/
Now back to our benchmarks. In order the run the benchmarks checkout to the feat/memcached-bench
branch.
git checkout feat/memcached-bench
Next compare the performance with and without memcached-ebpf-proxy-cache
.
First run memcached-bench
while memcached-ebpf-proxy-cache
is not running with the following command:
cargo run -p memcached-bench
Next run this command while memcached-ebpf-proxy-cache
is running.
RUST_LOG=debug cargo xtask run -- --iface lo
# in a different terminal
cargo run -p memcached-bench
Doing this yields the following results on my machine (HP Pavilion x360 Convertible 14-ba0xx running Endeavour OS):
# without memcached-ebpf-proxy-cache
100.00% |█████████████████████████████████████████████████████████████████▏| 91.00/91.00 [00:00:00] (11.58 it/s)
Time spent in SET loop: 7.856820009s
100.00% |███████████████████████████████████████████████████████████████▏| 10.00K/10.00K [00:00:00] (8.20K it/s)
Time spent in GET loop: 1.219417s
# with memcached-ebpf-proxy-cache
100.00% |█████████████████████████████████████████████████████████████████▏| 91.00/91.00 [00:00:00] (11.56 it/s)
Time spent in SET loop: 7.874415797s
100.00% |███████████████████████████████████████████████████████████████▏| 10.00K/10.00K [00:00:00] (5.37K it/s)
Time spent in GET loop: 1.861354611s
There's a slight decrease in performance instead of the increase in performance we were expecting.
There can be a couple of reasons for this:
- The binary protocol may be inherently much more efficient than the ASCII protocol, where the overhead due to
the networking stack doesn't matter. Rather the
memcpy()
in the eBPF layer might be incurring an overhead. - Lack of proper
bpf_spin_lock
support in aya-rs - aya-rs currently lacks support for bpf_spin_lock due to this issue as of 22-09-2024. So I improvised and implemented my own spinlock using atomic intrinsicatomic_xchg_seqcst
. My implementation may not be as efficient as the real thing. - We are sending back both the KEY and VAL in GET requests. This can incur a data transfer overhead. (Although, we are still on localhost.)
Regardless this was a fun learning exercise. I learned a lot about:
- XDP and TC packet filtering and processing
- Packet unpacking and restructuring at different protocol levels
- Tail calls
- Different map types: BPF Map type Array, Program Array, Per CPU array etc.
- Atomic intrinsics
- Satisfying the eBPF verifier with proper loop range and memory acccess bounds
I have more or less achieved what I wanted to - which was to understand how to write eBPF programs. So I'll stop here for now. Regardless, all contributions to improve performance are very much welcome.
This repository is licensed under the MIT License. See LICENSE for more details.