Buffer Occupancy #55

Merged
merged 20 commits into from
Dec 8, 2018
190 changes: 185 additions & 5 deletions telemetry/specs/INT.mdk
@@ -2,7 +2,7 @@ Title : In-band Network Telemetry (INT) Dataplane Specification
Title Note : Working draft. Note: consider using tagged versions for implementation.
Title Footer: 2018-05-08
Author : The P4.org Applications Working Group. Contributions from
Affiliation : *Alibaba, Arista, Barefoot Networks, Dell, Intel, Marvell, Netronome, VMware*
Affiliation : *Alibaba, Arista, Barefoot Networks, Cisco Systems, Dell, Intel, Marvell, Netronome, VMware*
Heading depth: 5
Pdf Latex: xelatex
Document Class: [11pt]article
Expand Down Expand Up @@ -318,6 +318,11 @@ simply leaves those decisions to device vendors.
- The build-up of traffic in the queue (in bytes, cells, or packets) that the
INT packet observes in the device while being forwarded.

* Buffer occupancy
- The build-up of traffic in the buffer (in bytes, cells, or packets) that the
INT packet observes in the device while being forwarded. Use case is when buffer is
shared between multiple queues.

Contributor

Perhaps remove the word "shared". Reporting both queue occupancy and buffer occupancy makes sense only if the buffer is shared, but technically the buffer could be a per-queue buffer, just that in this case, both values reported will be the same. INT source does not know whether downstream switches have shared buffers or not. It can set both bits. Switches that have per-queue buffers (unlikely, switches typically use shared buffers) will just report the same value for both instructions.

Contributor

Maybe word this differently, to say that this instruction gets the buffer occupancy. The use case is when the buffer is shared.

Contributor Author

Removed the reference to shared.

Contributor

Minor editorial rewording:
The use case is when the buffer is shared between multiple queues.

Contributor Author

Added the minor editorial change suggested.

# INT Headers

This section specifies the format and location of INT Headers.
@@ -752,12 +757,12 @@ hop-by-hop INT header must fit in a single Geneve option.
In this section, we define the format for INT hop-by-hop metadata headers,
and the metadata itself.

INT Metadata Header and Metadata Stack:
INT Metadata Header and Metadata Stack (Version = 1):
`
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Ver |Rep|C|E|M| Reserved | Hop ML |RemainingHopCnt|
|Ver = 1|Rep|C|E|M| Reserved | Hop ML |RemainingHopCnt|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Instruction Bitmap | Reserved |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
@@ -821,7 +826,7 @@ The original packet must have C bit set to 0.
switch(es) set the M bit based on knowledge of the network topology
and "Switch ID, Ingress port ID, Egress port ID" tuples in the INT
metadata stack.
- R: Reserved bits.
- R (10b): Reserved bits.
- Hop ML (5b): Per-hop Metadata Length, the length of metadata in 4-Byte words
to be inserted at each INT hop.
- While the largest value of Per-hop Metadata Length is 31, an INT-capable
@@ -910,6 +915,170 @@ from (shim header length \* 4).
For INT over Geneve it is 8 bytes subtracted from (length in Geneve tunnel
option header \* 4).


INT Metadata Header and Metadata Stack (Version = 2):
Contributor @jklr (Oct 3, 2018)

@mhira1 I was meaning to propose making a 1.x version cut before merging these changes, and moving INT.mdk to a version 2 draft. Then we don't need to list both v1 and v2 in the same doc. What do you think?

Contributor

Makes sense. Let's remove this second paragraph on v2 header then.

`
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|Ver = 2|Rep|C|E|M|     Reserved      | Hop ML  |RemainingHopCnt|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|      Instruction Bitmap       |           Reserved            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| INT Metadata Stack (Each hop inserts Hop ML * 4B of metadata) |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                             . . .                             |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                       Last INT metadata                       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
`

* The INT metadata header is 8 bytes long, followed by a stack of INT metadata.
Each metadata is either 4 bytes or 8 bytes in length. Each INT hop adds
the same length of metadata. The total length of the metadata stack is
variable, as different packets may traverse different paths and hence a
different number of INT hops.

* The fields in the INT metadata header are interpreted the following way:
- Ver (4b): INT metadata header version. Should be 2 for this version.
- Rep (2b): Replication requested. Support for this request is optional. If
this value is non-zero, the device may replicate the INT packet. This is useful
to explore all the valid physical forwarding paths when multi-path forwarding
techniques (e.g., ECMP, LAG) are used in the network. Note the Rep bits should
be used judiciously (e.g., only for probe packets, not for every data packet).
While we recommend that Rep bits be set only for probe packets, the INT
architecture does not (and perhaps cannot) disallow use of the Rep bits for real
data packets.
- 0: No replication requested.
- 1: Port-level (L2-level) replication requested. If the INT packet is
forwarded through a logical port that is a port-channel (LAG), then replicate
the packet on each physical port in the port-channel and send a single copy per
physical port.
- 2: Next-hop-level (L3-level) replication requested. Forward the packet
to each L3 ECMP next-hop valid for the destination address, with INT headers
replicated in each forwarded copy.
- 3: Port-level and Next-hop-level replication requested.
- C (1b): Copy.
- If replication is requested for data packets, the INT Sink must be
able to distinguish the original packet from replicas so that it can forward
only original packets up the protocol stack, and drop all the replicas. The C
bit must be set to 1 on each copy, whenever an INT hop replicates a packet.
The original packet must have C bit set to 0.
- The C bit must be set to 0 in the original packet by the INT source.
- E (1b): Max Hop Count exceeded.
- This flag must be set if a device cannot prepend its own metadata due to
the Remaining Hop Count reaching zero.
- The E bit must be set to 0 by the INT source.
- M (1b): MTU exceeded.
- This flag must be set if a device cannot add all of the requested metadata
because doing so will cause the packet length to exceed egress link MTU.
In this case, the device must not add any metadata to the packet, and set
the M bit in the INT header. Note that it is possible for egress MTU
limitation to prevent INT metadata insertion at multiple hops along a
path. The M bit simply serves as an indication that INT metadata was not
inserted at one or more hops and corrective action such as reconfiguring
MTU at some links may be needed, particularly when INT switches are not
participating in path MTU discovery. The M bit is not aimed at readily
identifying which switch(es) did not insert INT metadata due to egress MTU
limitation. In theory, if this does not occur at consecutive hops,
it may be possible for the monitoring system to derive which
switch(es) set the M bit based on knowledge of the network topology
and "Switch ID, Ingress port ID, Egress port ID" tuples in the INT
metadata stack.
- R (9b): Reserved bits.
Contributor

The number of reserved bits is still 10, correct? I think this text with 9 bits comes from your other patch, in which you use one of the reserved bits for the "Sink" bit. There is no change in the number of reserved bits between v1 and v2 in this patch.

- Hop ML (5b): Per-hop Metadata Length, the length of metadata in 4-Byte words
to be inserted at each INT hop.
- While the largest value of Per-hop Metadata Length is 31, an INT-capable
device may be limited in the maximum number of instructions it can process
and/or maximum length of metadata it can insert in data packets. An INT
hop that cannot process all instructions must still insert Per-hop
Metadata Length \* 4 bytes, with all-ones reserved value (4 or 8 bytes
of 0xFF depending on the length of metadata) for the metadata
corresponding to instructions it cannot process. An INT hop that
cannot insert Per-hop Metadata Length \* 4 bytes must skip INT
processing altogether and not insert any metadata in the packet.
- Remaining Hop Count (8b): The remaining number of hops that are allowed to
add their metadata to the packet.
- Upon creation of an INT metadata header, the INT Source must set this
value to the maximum number of hops that are allowed to add metadata
instance(s) to the packet. Each INT-capable device on the path, including
the INT Source as well as INT Transit Hops, must decrement the
Remaining Hop Count if and when it pushes its local metadata onto the
stack.
- When a packet is received with the Remaining Hop Count equal to 0, the
device must ignore the INT instruction, pushing no new metadata onto
the stack, and the device must set the E bit.
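As a rough illustration of the field list above, the first 32-bit word of the header can be packed and parsed as sketched below. This is a minimal Python sketch, not part of the specification or of this PR: the names `FIELDS`, `pack_word0`, and `unpack_word0` are hypothetical, and the reserved width of 10 bits is an assumption derived from what the other widths (4 + 2 + 1 + 1 + 1 + 5 + 8 = 22) leave in a 32-bit word.

```python
import struct

# Field widths in bits, most significant first, for the first 32-bit word.
# Assumption: 10 reserved bits, i.e. whatever the other fields leave of 32.
FIELDS = [
    ("ver", 4), ("rep", 2), ("c", 1), ("e", 1), ("m", 1),
    ("reserved", 10), ("hop_ml", 5), ("remaining_hop_cnt", 8),
]


def pack_word0(**values):
    """Pack the named fields into 4 bytes in network byte order."""
    word = 0
    for name, width in FIELDS:
        value = values.get(name, 0)  # unset fields (e.g. reserved) default to 0
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word = (word << width) | value
    return struct.pack("!I", word)


def unpack_word0(data):
    """Parse the first 4 bytes of an INT metadata header into a dict."""
    (word,) = struct.unpack("!I", data[:4])
    fields = {}
    shift = 32
    for name, width in FIELDS:
        shift -= width
        fields[name] = (word >> shift) & ((1 << width) - 1)
    return fields
```

For example, `pack_word0(ver=2, hop_ml=3, remaining_hop_cnt=8)` yields the bytes `20 00 03 08`.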
* INT instructions are encoded as a bitmap in the 16-bit INT Instruction field:
each bit corresponds to a specific standard metadata as specified in Section 3.
- bit0 (MSB): Switch ID
- bit1: Level 1 Ingress Port ID (16 bits) + Egress Port ID (16 bits)
- bit2: Hop latency
- bit3: Queue ID (8 bits) + Queue occupancy (24 bits)
- bit4: Ingress timestamp
- bit5: Egress timestamp
- bit6: Level 2 Ingress Port ID + Egress Port ID (4 bytes each)
- bit7: Egress port Tx utilization
- bit8: Buffer Occupancy (32 bits)
- bit15: Checksum Complement
- The remaining bits are reserved.
Each instruction requests 4 bytes of metadata to be inserted at each hop,
except when bit 6 or bit 14 is set. If bit 6 is set, the instruction requires
8 bytes of metadata. If bit 14 is set, the instruction requires domain-specific
metadata of n bytes, n being a multiple of 4. The Per-hop Metadata Length is
set accordingly at the INT source.
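The relationship between the instruction bitmap and the Per-hop Metadata Length can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not text from the specification: `build_bitmap` and `hop_ml_words` are hypothetical helper names, bit 0 is taken to be the most significant bit of the 16-bit field as listed above, and the size of the bit 14 domain-specific metadata is passed in by the caller because the text only says it is n bytes.

```python
def build_bitmap(bits):
    """Return the 16-bit instruction bitmap with the given bit numbers set.

    Bit 0 is the most significant bit of the 16-bit field.
    """
    bitmap = 0
    for bit in bits:
        if not 0 <= bit <= 15:
            raise ValueError(f"bit {bit} is outside the 16-bit field")
        bitmap |= 1 << (15 - bit)
    return bitmap


def hop_ml_words(bitmap, domain_specific_bytes=0):
    """Per-hop Metadata Length, in 4-byte words, implied by the bitmap."""
    words = 0
    for bit in range(16):
        if not bitmap & (1 << (15 - bit)):
            continue
        if bit == 6:      # Level 2 port IDs: 8 bytes
            words += 2
        elif bit == 14:   # domain-specific metadata: n bytes, n a multiple of 4
            if domain_specific_bytes % 4:
                raise ValueError("domain-specific length must be a multiple of 4")
            words += domain_specific_bytes // 4
        else:             # every other instruction: 4 bytes
            words += 1
    return words
```

Requesting Switch ID, Queue ID + Queue occupancy, and Buffer Occupancy (bits 0, 3, and 8) gives the bitmap `0x9080` and a Per-hop Metadata Length of 3 words.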
* Each INT Transit device along the path that supports INT adds its own metadata
values as specified in the instruction bitmap immediately after the INT metadata
header.
- When adding a new metadata, each device must prepend its metadata in
front of the metadata that are already added by the upstream devices.
This is similar to the push operation on a stack. Hence, the most recently
added metadata appears at the top of the stack. The device must add
metadata in the order of bits set in the instruction bitmap.
- If a device is unable to provide a metadata value specified in the
instruction bitmap because its value is not available, it must add a special
all-ones reserved value indicating "invalid" (4 or 8 bytes of 0xFF
depending on metadata length).
- If a device cannot add all the metadata required by the instruction bitmap
(irrespective of the availability of the metadata values that are asked
for), it must skip processing that particular INT packet entirely. This
ensures that each INT Transit device adds either zero bytes or
Per-hop Metadata Length\*4 bytes to the packet.
- Reserved bits in the instruction bitmap are to be handled similarly. If an
INT transit hop receives a reserved bit set in the instruction bitmap (e.g.,
set by an INT source that is running a newer version), the transit hop must
either add corresponding metadata filled with the reserved value 0xFFFFFFFF
or must not add any INT metadata to the packet. This means that an
instruction bit marked reserved in this specification may be
used for a 4B metadata in a subsequent minor version while still being
backward compatible with this specification. However, an instruction bit
marked reserved in this specification may be used for an 8B metadata only
in the next major version, breaking backward compatibility and requiring all
INT switches to be upgraded to the new major version. For example,
a version 1.0 INT switch cannot operate alongside version 2.0 INT switches
if a new 8B metadata is introduced in version 2.0, as the version 1.0
INT switch could insert the 0xFFFFFFFF reserved value for an 8B metadata field,
thus breaking the metadata stack length invariance: the length of the
metadata stack will not be a multiple of Per-hop Metadata Length \* 4
in this case.
- If an INT transit hop does not add metadata to a packet due to any of the
above reasons, it must not decrement the remaining INT hop count in the INT
metadata header.
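The transit-hop rules above (push on top of the stack, all-ones for unavailable values, all-or-nothing insertion, no decrement when nothing is added) can be modelled roughly as below. This is a simplified Python sketch, not a dataplane implementation: `IntHeader` and `transit_hop` are hypothetical names, the order in which the hop-count and MTU conditions are checked is an assumption, and `packet_len` is assumed to be the length that is compared against the egress MTU.

```python
from dataclasses import dataclass

INVALID = 0xFF  # fill byte for the all-ones "invalid" reserved value


@dataclass
class IntHeader:
    hop_ml: int              # Per-hop Metadata Length, in 4-byte words
    remaining_hop_cnt: int
    e: int = 0               # Max Hop Count exceeded
    m: int = 0               # MTU exceeded
    stack: bytes = b""       # metadata stack, most recent hop first


def transit_hop(hdr, metadata, packet_len, egress_mtu):
    """Apply one transit hop to hdr; return True if metadata was pushed.

    metadata is a list of (width_bytes, value_or_None) pairs in instruction
    bitmap order; None marks a value this device cannot provide.
    """
    if hdr.remaining_hop_cnt == 0:
        hdr.e = 1
        return False
    blob = b"".join(
        bytes([INVALID]) * width if value is None else value.to_bytes(width, "big")
        for width, value in metadata
    )
    if len(blob) != hdr.hop_ml * 4:
        return False  # cannot honor the full bitmap: skip INT processing
    if packet_len + len(blob) > egress_mtu:
        hdr.m = 1
        return False
    hdr.stack = blob + hdr.stack  # push: newest metadata goes on top
    hdr.remaining_hop_cnt -= 1
    return True
```

In this model a hop that returns `False` leaves both the stack and the Remaining Hop Count untouched, so the stack length stays a multiple of Hop ML \* 4.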
* Summary of the field usage
- The INT Source must set the following fields:
- Ver, Rep, C, M, Per-hop Metadata Length, Remaining Hop Count,
and Instruction Bitmap.
- INT Source must set all reserved bits to zero.
- Intermediate devices can set the following fields:
- C, E, M, Remaining Hop Count
* The length (in bytes) of the INT metadata stack must always
be a multiple of (Per-hop Metadata Length \* 4). This length can be determined
by subtracting the total INT fixed header sizes (12 bytes)
from (shim header length \* 4).
For INT over Geneve it is 8 bytes subtracted from (length in Geneve tunnel
option header \* 4).
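The length arithmetic above can be checked with a short calculation. The following is a Python sketch under stated assumptions: `metadata_stack_info` is a hypothetical helper, the length field is assumed to be expressed in 4-byte words as the text implies, and the fixed overhead is passed in as 12 bytes by default or 8 bytes for INT over Geneve.

```python
def metadata_stack_info(length_words, hop_ml, fixed_header_bytes=12):
    """Return (stack_len_bytes, hop_count) for a received INT packet.

    length_words: the length field, in 4-byte words (shim header length,
        or the Geneve option length when fixed_header_bytes=8).
    hop_ml: Per-hop Metadata Length from the INT metadata header, in words.
    """
    stack_len = length_words * 4 - fixed_header_bytes
    if stack_len < 0:
        raise ValueError("length field is smaller than the fixed headers")
    per_hop = hop_ml * 4
    if per_hop == 0:
        if stack_len:
            raise ValueError("non-empty stack with Hop ML of 0")
        return 0, 0
    if stack_len % per_hop:
        raise ValueError("stack length is not a multiple of Hop ML * 4")
    return stack_len, stack_len // per_hop
```

With a shim header length of 9 words and Hop ML of 3, the stack is 9 \* 4 - 12 = 24 bytes, i.e. two hops of 12 bytes each.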



# Examples

This section shows example INT Headers with two hosts (Host1 and Host2),
@@ -1219,6 +1388,10 @@ header int_egress_port_tx_util_t {
bit<32> egress_port_tx_util;
}

header int_b_occupancy_t {
    bit<32> b_occupancy;
Contributor

Please make it consistent with the (example) layout provided in line 866, which has 8b buffer id.

Contributor Author

Let us discuss the semantics of Queue and Buffer occupancy. I will make it consistent after the discussion.

Contributor Author

Added Buffer ID (8 bits) and Buffer Occupancy (24 bits)

}
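The layout mentioned in the thread above, an 8-bit Buffer ID followed by a 24-bit Buffer Occupancy, can be sketched outside P4 as follows. This is an illustrative Python sketch only: `pack_buffer_occupancy` and `unpack_buffer_occupancy` are hypothetical names, and rejecting out-of-range values (rather than saturating) is an assumption, since the diff does not say how overflow is handled.

```python
import struct


def pack_buffer_occupancy(buffer_id, occupancy):
    """Pack an 8-bit buffer ID and a 24-bit occupancy into 4 bytes."""
    if not 0 <= buffer_id <= 0xFF:
        raise ValueError("buffer ID does not fit in 8 bits")
    if not 0 <= occupancy <= 0xFFFFFF:
        raise ValueError("occupancy does not fit in 24 bits")
    return struct.pack("!I", (buffer_id << 24) | occupancy)


def unpack_buffer_occupancy(data):
    """Return (buffer_id, occupancy) from 4 bytes of metadata."""
    (word,) = struct.unpack("!I", data[:4])
    return word >> 24, word & 0xFFFFFF
```

Note that buffer ID 0xFF combined with occupancy 0xFFFFFF would coincide with the all-ones value that the specification reserves for "invalid" metadata.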

/* standard ethernet/ip/tcp headers */
header ethernet_t {
bit<48> dstAddr;
@@ -1279,6 +1452,7 @@ struct headers {
int_egress_tstamp_t int_egress_tstamp;
int_level2_port_ids_t int_level2_port_ids;
int_egress_port_tx_util_t int_egress_port_tx_util;
int_b_occupancy_t int_b_occupancy;
Contributor

nit: fix indent

Contributor Author

Done.

}

struct empty_metadata_t {
@@ -1610,6 +1784,11 @@ control EgressDeparserImpl(packet_out packet,
ck.add({hdr.int_egress_port_tx_util.egress_port_tx_util});
}

if (hdr.int_b_occupancy.isValid()) {
Contributor

nit: fix indent

Contributor Author

Fixed

ck.add({
hdr.int_b_occupancy.b_occupancy
});
}
if (hdr.tcp.isValid()) {
ck.add({
hdr.tcp.srcPort,
@@ -1656,6 +1835,7 @@ control EgressDeparserImpl(packet_out packet,
packet.emit(hdr.int_egress_tstamp);
packet.emit(hdr.int_level2_port_ids);
packet.emit(hdr.int_egress_port_tx_util);
packet.emit(hdr.int_b_occupancy);
Contributor

nit: fix indent

Contributor Author

Done.

}
}
