Buffer Occupancy #55
Changes from 11 commits
3771848
10355fd
fe346e4
6f4c8e8
2e253b0
34b55ff
54c9158
3289ea4
c32c475
3c0cdd3
84f520a
63ce414
bc3d729
7131207
748ca28
ae57749
5b3a6f5
19a6456
6f725b4
5f7ac99
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -2,7 +2,7 @@ Title : In-band Network Telemetry (INT) Dataplane Specification | |
Title Note : Working draft. Note: consider using tagged versions for implementation. | ||
Title Footer: 2018-05-08 | ||
Author : The P4.org Applications Working Group. Contributions from | ||
Affiliation : *Alibaba, Arista, Barefoot Networks, Dell, Intel, Marvell, Netronome, VMware* | ||
Affiliation : *Alibaba, Arista, Barefoot Networks, Cisco Systems, Dell, Intel, Marvell, Netronome, VMware* | ||
Heading depth: 5 | ||
Pdf Latex: xelatex | ||
Document Class: [11pt]article | ||
|
@@ -318,6 +318,11 @@ simply leaves those decisions to device vendors. | |
- The build-up of traffic in the queue (in bytes, cells, or packets) that the | ||
INT packet observes in the device while being forwarded. | ||
|
||
* Buffer occupancy | ||
- The build-up of traffic in the buffer (in bytes, cells, or packets) that the | ||
INT packet observes in the device while being forwarded. The use case is when the buffer is | ||
shared between multiple queues. | ||
|
||
# INT Headers | ||
|
||
This section specifies the format and location of INT Headers. | ||
|
@@ -752,12 +757,12 @@ hop-by-hop INT header must fit in a single Geneve option. | |
In this section, we define the format for INT hop-by-hop metadata headers, | ||
and the metadata itself. | ||
|
||
INT Metadata Header and Metadata Stack: | ||
INT Metadata Header and Metadata Stack (Version = 1): | ||
|
||
` | ||
0 1 2 3 | ||
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 | ||
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 | ||
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | ||
| Ver |Rep|C|E|M| Reserved | Hop ML |RemainingHopCnt| | ||
|Ver = 1|Rep|C|E|M| Reserved | Hop ML |RemainingHopCnt| | ||
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | ||
| Instruction Bitmap | Reserved | | ||
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | ||
|
@@ -821,7 +826,7 @@ The original packet must have C bit set to 0. | |
switch(es) set the M bit based on knowledge of the network topology | ||
and "Switch ID, Ingress port ID, Egress port ID" tuples in the INT | ||
metadata stack. | ||
- R: Reserved bits. | ||
- R (10b): Reserved bits. | ||
- Hop ML (5b): Per-hop Metadata Length, the length of metadata in 4-Byte words | ||
to be inserted at each INT hop. | ||
- While the largest value of Per-hop Metadata Length is 31, an INT-capable | ||
|
@@ -910,6 +915,170 @@ from (shim header length \* 4). | |
For INT over Geneve it is 8 bytes subtracted from (length in Geneve tunnel | ||
option header \* 4). | ||
|
||
|
||
INT Metadata Header and Metadata Stack (Version = 2): | ||
Comment: @mhira1 I was meaning to propose cutting a 1.x version before merging these changes, and moving INT.mdk to a version 2 draft. Then we don't need to list both v1 and v2 in the same doc. What do you think?
Comment: Makes sense. Let's remove this second paragraph on the v2 header then. |
||
` | ||
0 1 2 3 | ||
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 | ||
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | ||
|Ver = 2|Rep|C|E|M| Reserved | Hop ML |RemainingHopCnt| | ||
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | ||
| Instruction Bitmap | Reserved | | ||
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | ||
| INT Metadata Stack (Each hop inserts Hop ML * 4B of metadata) | | ||
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | ||
| . . . | | ||
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | ||
| Last INT metadata | | ||
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | ||
` | ||
|
||
* INT metadata header is 8 bytes long followed by a stack of INT metadata. | ||
Each metadata is either 4 bytes or 8 bytes in length. Each INT hop adds | ||
the same length of metadata. The total length of the metadata stack is | ||
variable as different packets may traverse different paths and hence | ||
different number of INT hops. | ||
|
||
* The fields in the INT metadata header are interpreted the following way: | ||
- Ver (4b): INT metadata header version. Should be 2 for this version. | ||
- Rep (2b): Replication requested. Support for this request is optional. If | ||
this value is non-zero, the device may replicate the INT packet. This is useful | ||
to explore all the valid physical forwarding paths when multi-path forwarding | ||
techniques (e.g., ECMP, LAG) are used in the network. Note the Rep bits should | ||
be used judiciously (e.g., only for probe packets, not for every data packet). | ||
While we recommend that Rep bits be set only for probe packets, the INT | ||
architecture does not (and perhaps cannot) disallow use of the Rep bits for real | ||
data packets. | ||
- 0: No replication requested. | ||
- 1: Port-level (L2-level) replication requested. If the INT packet is | ||
forwarded through a logical port that is a port-channel (LAG), then replicate | ||
the packet on each physical port in the port-channel and send a single copy per | ||
physical port. | ||
- 2: Next-hop-level (L3-level) replication requested. Forward the packet | ||
to each L3 ECMP next-hop valid for the destination address, with INT headers | ||
replicated in each forwarded copy. | ||
- 3: Port-level and Next-hop-level replication requested. | ||
- C (1b): Copy. | ||
- If replication is requested for data packets, the INT Sink must be | ||
able to distinguish the original packet from replicas so that it can forward | ||
only original packets up the protocol stack, and drop all the replicas. The C | ||
bit must be set to 1 on each copy, whenever an INT hop replicates a packet. | ||
The original packet must have C bit set to 0. | ||
- C bit must be set to 0 in the original packet by INT source | ||
- E (1b): Max Hop Count exceeded. | ||
- This flag must be set if a device cannot prepend its own metadata due to | ||
the Remaining Hop Count reaching zero. | ||
- E bit must be set to 0 by INT source | ||
- M (1b): MTU exceeded | ||
- This flag must be set if a device cannot add all of the requested metadata | ||
because doing so will cause the packet length to exceed egress link MTU. | ||
In this case, the device must not add any metadata to the packet, and set | ||
the M bit in the INT header. Note that it is possible for egress MTU | ||
limitation to prevent INT metadata insertion at multiple hops along a | ||
path. The M bit simply serves as an indication that INT metadata was not | ||
inserted at one or more hops and corrective action such as reconfiguring | ||
MTU at some links may be needed, particularly when INT switches are not | ||
participating in path MTU discovery. The M bit is not aimed at readily | ||
identifying which switch(es) did not insert INT metadata due to egress MTU | ||
limitation. In theory, if this does not occur at consecutive hops, | ||
it may be possible for the monitoring system to derive which | ||
switch(es) set the M bit based on knowledge of the network topology | ||
and "Switch ID, Ingress port ID, Egress port ID" tuples in the INT | ||
metadata stack. | ||
- R (9b): Reserved bits. | ||
Comment: The number of reserved bits is still 10, correct? I think this text with 9 bits comes from your other patch, in which you use one of the reserved bits for the "Sink" bit. This patch does not change the number of reserved bits between v1 and v2. |
||
- Hop ML (5b): Per-hop Metadata Length, the length of metadata in 4-Byte words | ||
to be inserted at each INT hop. | ||
- While the largest value of Per-hop Metadata Length is 31, an INT-capable | ||
device may be limited in the maximum number of instructions it can process | ||
and/or maximum length of metadata it can insert in data packets. An INT | ||
hop that cannot process all instructions must still insert Per-hop | ||
Metadata Length \* 4 bytes, with all-ones reserved value (4 or 8 bytes | ||
of 0xFF depending on the length of metadata) for the metadata | ||
corresponding to instructions it cannot process. An INT hop that | ||
cannot insert Per-hop Metadata Length \* 4 bytes must skip INT | ||
processing altogether and not insert any metadata in the packet. | ||
- Remaining Hop Count (8b): The remaining number of hops that are allowed to | ||
add their metadata to the packet. | ||
- Upon creation of an INT metadata header, the INT Source must set this | ||
value to the maximum number of hops that are allowed to add metadata | ||
instance(s) to the packet. Each INT-capable device on the path, including | ||
the INT Source as well as INT Transit Hops, must decrement the | ||
Remaining Hop Count if and when it pushes its local metadata onto the | ||
stack. | ||
- When a packet is received with the Remaining Hop Count equal to 0, the | ||
device must ignore the INT instruction, pushing no new metadata onto | ||
the stack, and the device must set the E bit. | ||
* INT instructions are encoded as a bitmap in the 16-bit INT Instruction field: | ||
each bit corresponds to a specific standard metadata as specified in Section 3. | ||
- bit0 (MSB): Switch ID | ||
- bit1: Level 1 Ingress Port ID (16 bits) + Egress Port ID (16 bits) | ||
- bit2: Hop latency | ||
- bit3: Queue ID (8 bits) + Queue occupancy (24 bits) | ||
- bit4: Ingress timestamp | ||
- bit5: Egress timestamp | ||
- bit6: Level 2 Ingress Port ID + Egress Port ID (4 bytes each) | ||
- bit7: Egress port Tx utilization | ||
- bit8: Buffer Occupancy (32 bits) | ||
- bit15: Checksum Complement | ||
- The remaining bits are reserved. | ||
Each instruction requests 4 bytes of metadata to be inserted at each hop, | ||
except if bit 6 or bit 14 is set. If bit 6 is set, the instruction requires | ||
8 bytes of metadata. If bit 14 is set, the instruction requires a domain specific | ||
metadata of n bytes, n being a multiple of 4 bytes. Per-hop metadata length is | ||
set accordingly at the INT source. | ||
* Each INT Transit device along the path that supports INT adds its own metadata | ||
values as specified in the instruction bitmap immediately after the INT metadata | ||
header. | ||
- When adding a new metadata, each device must prepend its metadata in | ||
front of the metadata that are already added by the upstream devices. | ||
This is similar to the push operation on a stack. Hence, the most recently | ||
added metadata appears at the top of the stack. The device must add | ||
metadata in the order of bits set in the instruction bitmap. | ||
- If a device is unable to provide a metadata value specified in the | ||
instruction bitmap because its value is not available, it must add a special | ||
all-ones reserved value indicating "invalid" (4 or 8 bytes of 0xFF | ||
depending on metadata length). | ||
- If a device cannot add all the metadata required by the instruction bitmap | ||
(irrespective of the availability of the metadata values that are asked | ||
for), it must skip processing that particular INT packet entirely. This | ||
ensures that each INT Transit device adds either zero bytes or | ||
Per-hop Metadata Length\*4 bytes to the packet. | ||
- Reserved bits in the instruction bitmap are to be handled similarly. If an | ||
INT transit hop receives a reserved bit set in the instruction bitmap (e.g. | ||
set by an INT source that is running a newer version), the transit hop must | ||
either add corresponding metadata filled with the reserved value 0xFFFFFFFF | ||
or must not add any INT metadata to the packet. This means that an | ||
instruction bit marked reserved in this specification may be | ||
used for a 4B metadata in a subsequent minor version while still being | ||
backward compatible with this specification. However, an instruction bit | ||
marked reserved in this specification may be used for an 8B metadata only | ||
in the next major version, breaking backward compatibility and requiring all | ||
INT switches to be upgraded to the new major version. For example, | ||
a version 1.0 INT switch cannot operate alongside version 2.0 INT switches | ||
if a new 8B metadata is introduced in version 2.0, as the version 1.0 | ||
INT switch could insert 0xFFFFFFFF reserved value for an 8B metadata field, | ||
thus breaking the metadata stack length invariance - the length of | ||
metadata stack will not be a multiple of Per-Hop Metadata length \* 4 | ||
in this case. | ||
- If an INT transit hop does not add metadata to a packet due to any of the | ||
above reasons, it must not decrement the remaining INT hop count in the INT | ||
metadata header. | ||
* Summary of the field usage | ||
- The INT Source must set the following fields: | ||
- Ver, Rep, C, M, Per-hop Metadata Length, Remaining Hop Count, | ||
and Instruction Bitmap. | ||
- INT Source must set all reserved bits to zero. | ||
- Intermediate devices can set the following fields: | ||
- C, E, M, Remaining Hop Count | ||
* The length (in bytes) of the INT metadata stack must always | ||
be a multiple of (Per-hop Metadata Length \* 4). This length can be determined | ||
by subtracting the total INT fixed header sizes (12 bytes) | ||
from (shim header length \* 4). | ||
For INT over Geneve it is 8 bytes subtracted from (length in Geneve tunnel | ||
option header \* 4). | ||
|
||
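The header layout and the stack-length rule above can be sketched in a few lines of Python. This is an illustrative decoder, not part of the specification or of the P4 reference code: the names `IntMetadataHeader`, `parse_int_metadata_header`, `stack_length_bytes` and `hops_in_stack` are hypothetical, and the sketch assumes 10 reserved bits in the first word (the width that makes the fields in the diagram sum to 32 bits).

```python
import struct
from typing import NamedTuple


class IntMetadataHeader(NamedTuple):
    ver: int
    rep: int
    c: int
    e: int
    m: int
    reserved: int
    hop_ml: int
    remaining_hop_cnt: int
    instruction_bitmap: int


def parse_int_metadata_header(data: bytes) -> IntMetadataHeader:
    """Decode the fixed 8-byte INT metadata header (network byte order)."""
    if len(data) < 8:
        raise ValueError("INT metadata header needs at least 8 bytes")
    # Word 0 holds the flags and counters; word 1 holds the 16-bit
    # instruction bitmap followed by 16 reserved bits.
    word0, bitmap, _reserved16 = struct.unpack("!IHH", data[:8])
    return IntMetadataHeader(
        ver=(word0 >> 28) & 0xF,       # 4 bits
        rep=(word0 >> 26) & 0x3,       # 2 bits
        c=(word0 >> 25) & 0x1,         # 1 bit
        e=(word0 >> 24) & 0x1,         # 1 bit
        m=(word0 >> 23) & 0x1,         # 1 bit
        reserved=(word0 >> 13) & 0x3FF,  # assumed 10 bits
        hop_ml=(word0 >> 8) & 0x1F,    # 5 bits, in 4-byte words
        remaining_hop_cnt=word0 & 0xFF,  # 8 bits
        instruction_bitmap=bitmap,
    )


def stack_length_bytes(encap_length_words: int, fixed_header_bytes: int = 12) -> int:
    """Metadata stack length: (length field * 4) minus the fixed INT headers.

    Use fixed_header_bytes=12 with the shim header length, or 8 with the
    length from the Geneve tunnel option header.
    """
    return encap_length_words * 4 - fixed_header_bytes


def hops_in_stack(stack_bytes: int, hop_ml: int) -> int:
    """Number of per-hop entries, enforcing the length invariant."""
    per_hop = hop_ml * 4
    if per_hop == 0 or stack_bytes % per_hop:
        raise ValueError("stack length is not a multiple of Hop ML * 4")
    return stack_bytes // per_hop
```

For example, a shim header length of 7 words with Hop ML = 2 leaves 16 bytes of metadata, which corresponds to two hops.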
|
||
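The relationship between the instruction bitmap and the Per-hop Metadata Length set by the INT source can be sketched as follows. This is a rough illustration under stated assumptions: `METADATA_BYTES`, `bit_mask` and `hop_ml_words` are hypothetical names, the size of the domain-specific metadata (bit 14) is passed in by the caller, and any other reserved bit is assumed to cost 4 bytes, matching the 0xFFFFFFFF fill described above.

```python
# Metadata size in bytes for each instruction bit listed in the text.
# Bit 0 is the most significant bit of the 16-bit bitmap.
METADATA_BYTES = {
    0: 4,   # Switch ID
    1: 4,   # Level 1 ingress + egress port IDs
    2: 4,   # Hop latency
    3: 4,   # Queue ID + queue occupancy
    4: 4,   # Ingress timestamp
    5: 4,   # Egress timestamp
    6: 8,   # Level 2 ingress + egress port IDs
    7: 4,   # Egress port Tx utilization
    8: 4,   # Buffer occupancy
    15: 4,  # Checksum complement
}
DOMAIN_SPECIFIC_BIT = 14


def bit_mask(bit: int) -> int:
    """Mask for instruction bit `bit`, where bit 0 is the MSB."""
    return 1 << (15 - bit)


def hop_ml_words(bitmap: int, domain_specific_bytes: int = 0) -> int:
    """Per-hop Metadata Length (in 4-byte words) implied by a bitmap."""
    total = 0
    for bit in range(16):
        if not bitmap & bit_mask(bit):
            continue
        if bit == DOMAIN_SPECIFIC_BIT:
            # Domain-specific metadata is n bytes, n a multiple of 4.
            if domain_specific_bytes <= 0 or domain_specific_bytes % 4:
                raise ValueError("bit 14 needs a positive multiple of 4 bytes")
            total += domain_specific_bytes
        else:
            # Assumption: reserved bits count as 4 bytes each.
            total += METADATA_BYTES.get(bit, 4)
    words = total // 4
    if words > 31:
        raise ValueError("Hop ML is a 5-bit field; at most 31 words fit")
    return words
```

Requesting Switch ID, Level 1 port IDs and Buffer Occupancy (bitmap 0xC080) gives a Hop ML of 3, while requesting only the Level 2 port IDs (bitmap 0x0200) gives 2.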
|
||
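The transit-hop rules above (E bit, M bit, all-ones fill, prepend, decrement) can be modelled roughly as below. This is a simplified sketch, not the reference data plane: `IntState`, `build_hop_metadata` and `transit_hop` are hypothetical names, the order in which the Remaining Hop Count and MTU checks run is an assumption (the text does not fix it), and the domain-specific bit 14 is not modelled (it is treated as an unavailable 4-byte value).

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class IntState:
    hop_ml: int               # Per-hop Metadata Length in 4-byte words
    remaining_hop_cnt: int
    instruction_bitmap: int
    e: int = 0
    m: int = 0
    stack: bytes = b""        # most recently added metadata first


def build_hop_metadata(bitmap: int, local_values: Dict[int, bytes]) -> bytes:
    """Concatenate metadata in instruction-bit order (bit 0 is the MSB)."""
    out = b""
    for bit in range(16):
        if bitmap & (1 << (15 - bit)):
            size = 8 if bit == 6 else 4
            value = local_values.get(bit)
            if value is None or len(value) != size:
                # Value not available: all-ones "invalid" marker.
                value = b"\xff" * size
            out += value
    return out


def transit_hop(state: IntState, packet_len: int, egress_mtu: int,
                local_values: Dict[int, bytes]) -> IntState:
    """Apply one INT transit hop to `state` and return it."""
    if state.remaining_hop_cnt == 0:
        state.e = 1           # hop count exhausted: push nothing
        return state
    hop_bytes = state.hop_ml * 4
    if packet_len + hop_bytes > egress_mtu:
        state.m = 1           # would exceed egress MTU: push nothing
        return state
    metadata = build_hop_metadata(state.instruction_bitmap, local_values)
    if len(metadata) != hop_bytes:
        return state          # cannot add exactly Hop ML * 4 bytes: skip
    state.stack = metadata + state.stack   # prepend, like a stack push
    state.remaining_hop_cnt -= 1           # only decrement after a push
    return state
```

Each call either adds exactly Hop ML \* 4 bytes or nothing, so the stack length stays a multiple of Hop ML \* 4 after every hop.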
# Examples | ||
|
||
This section shows example INT Headers with two hosts (Host1 and Host2), | ||
|
@@ -1219,6 +1388,10 @@ header int_egress_port_tx_util_t { | |
bit<32> egress_port_tx_util; | ||
} | ||
|
||
header int_b_occupancy_t { | ||
bit<32> b_occupancy; | ||
Comment: Please make it consistent with the (example) layout provided in line 866, which has an 8b buffer ID.
Comment: Let us discuss the semantics of Queue and Buffer occupancy. I will make it consistent after the discussion.
Comment: Added Buffer ID (8 bits) and Buffer Occupancy (24 bits). |
||
} | ||
|
||
/* standard ethernet/ip/tcp headers */ | ||
header ethernet_t { | ||
bit<48> dstAddr; | ||
|
@@ -1279,6 +1452,7 @@ struct headers { | |
int_egress_tstamp_t int_egress_tstamp; | ||
int_level2_port_ids_t int_level2_port_ids; | ||
int_egress_port_tx_util_t int_egress_port_tx_util; | ||
int_b_occupancy_t int_b_occupancy; | ||
Comment: nit: fix indent
Comment: Done. |
||
} | ||
|
||
struct empty_metadata_t { | ||
|
@@ -1610,6 +1784,11 @@ control EgressDeparserImpl(packet_out packet, | |
ck.add({hdr.int_egress_port_tx_util.egress_port_tx_util}); | ||
} | ||
|
||
if (hdr.int_b_occupancy.isValid()) { | ||
Comment: nit: fix indent
Comment: Fixed |
||
ck.add({ | ||
hdr.int_b_occupancy.b_occupancy | ||
|
||
}); | ||
} | ||
if (hdr.tcp.isValid()) { | ||
ck.add({ | ||
hdr.tcp.srcPort, | ||
|
@@ -1656,6 +1835,7 @@ control EgressDeparserImpl(packet_out packet, | |
packet.emit(hdr.int_egress_tstamp); | ||
packet.emit(hdr.int_level2_port_ids); | ||
packet.emit(hdr.int_egress_port_tx_util); | ||
packet.emit(hdr.int_b_occupancy); | ||
Comment: nit: fix indent
Comment: Done. |
||
} | ||
} | ||
|
||
|
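The review thread on `int_b_occupancy_t` settles on a Buffer ID (8 bits) followed by Buffer Occupancy (24 bits), mirroring the Queue ID + Queue occupancy word. A minimal Python sketch of that 4-byte encoding is shown here, assuming that split; `pack_buffer_occupancy` and `unpack_buffer_occupancy` are hypothetical helper names, not part of the P4 reference code.

```python
import struct
from typing import Optional, Tuple

# All-ones reserved value that a hop inserts when the metadata is unavailable.
INVALID_WORD = b"\xff\xff\xff\xff"


def pack_buffer_occupancy(buffer_id: int, occupancy: int) -> bytes:
    """Encode Buffer ID (8 bits) + Buffer Occupancy (24 bits), big-endian."""
    if not 0 <= buffer_id <= 0xFF:
        raise ValueError("buffer_id must fit in 8 bits")
    if not 0 <= occupancy <= 0xFFFFFF:
        raise ValueError("occupancy must fit in 24 bits")
    return struct.pack("!I", (buffer_id << 24) | occupancy)


def unpack_buffer_occupancy(word: bytes) -> Optional[Tuple[int, int]]:
    """Decode one metadata word; None means the hop reported 'invalid'.

    Note: buffer_id 0xFF with occupancy 0xFFFFFF is indistinguishable
    from the invalid marker in this encoding.
    """
    if len(word) != 4:
        raise ValueError("buffer occupancy metadata is exactly 4 bytes")
    if word == INVALID_WORD:
        return None
    (value,) = struct.unpack("!I", word)
    return value >> 24, value & 0xFFFFFF
```

The occupancy unit (bytes, cells, or packets) is left to the device, as the metadata definition states, so a collector needs that unit out of band to interpret the 24-bit value.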
Comment: Perhaps remove the word "shared". Reporting both queue occupancy and buffer occupancy makes sense only if the buffer is shared, but technically the buffer could be a per-queue buffer; in that case, both values reported will be the same. The INT source does not know whether downstream switches have shared buffers or not. It can set both bits. Switches that have per-queue buffers (unlikely, as switches typically use shared buffers) will just report the same value for both instructions.
Comment: Maybe word this differently, to say this instruction is to get buffer occupancy. The use case is when the buffer is shared.
Comment: Removed the reference to shared.
Comment: Minor editorial rewording: "The use case is when the buffer is shared between multiple queues."
Comment: Added the minor editorial change suggested.