Commit 803c41a

qiuminxu authored and stephanwlee committed
Add a pod viewer tool for analyzing TPU pod performance. (#2111)
As TPU pods are provided to Cloud TPU users, we need a tool that provides a global visualization of the performance of each core, as well as of the communication across the cores of the system.

The pod viewer UI shows:
- A step slider, which lets you select the step you want to dive into.
- A topology graph, which interactively visualizes the TPU cores in the whole TPU system.
- A step breakdown chart, which visualizes the breakdown of a step for all cores. This can be used to track where the bottleneck of the system is and whether a particular core slows down the system.
- A chart showing the latency of all the send-recv channels.
- A chart showing the latency of all-reduce ops.
- Communication links, which visualize the send and recv channels in the topology graph. A channel details card also shows up on the right, providing detailed information about the channel, such as the size of the data transferred, latency, and bandwidth.
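As a rough illustration of how the step breakdown data could be used to spot a core that slows down the system, the sketch below picks the record with the largest total step duration. The `CoreStepRecord` interface and the `slowestCore` function are hypothetical names for this example; only the field names `chipId`, `nodeId`, and `totalDurationUs` are taken from the code in this commit.

```typescript
// Hypothetical, simplified stand-in for a per-core step record.
// Only the field names come from this commit; the interface itself is assumed.
interface CoreStepRecord {
  chipId: number;
  nodeId: number;
  totalDurationUs: number;
}

// Returns the record with the largest total step duration,
// or undefined when there are no records.
function slowestCore(records: CoreStepRecord[]): CoreStepRecord | undefined {
  let slowest: CoreStepRecord | undefined;
  for (const record of records) {
    if (slowest === undefined ||
        record.totalDurationUs > slowest.totalDurationUs) {
      slowest = record;
    }
  }
  return slowest;
}
```

A caller would pass the records for one selected step and highlight the returned core in the topology graph or the breakdown chart.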
1 parent 90a386d commit 803c41a

21 files changed, +1944 -1 lines changed


tensorboard/plugins/profile/BUILD

Lines changed: 1 addition & 0 deletions
@@ -90,6 +90,7 @@ py_binary(
         "profile_demo.google_chart_demo.json",
         "profile_demo.memory_viewer.json",
         "profile_demo.op_profile.json",
+        "profile_demo.pod_viewer.json",
     ],
     srcs_version = "PY2AND3",
     deps = [
Lines changed: 19 additions & 0 deletions
@@ -0,0 +1,19 @@
+package(default_visibility = ["//tensorboard:internal"])
+
+load("//tensorboard/defs:web.bzl", "tf_web_library")
+
+licenses(["notice"]) # Apache 2.0
+
+tf_web_library(
+    name = "details_card",
+    srcs = [
+        "details-card.html",
+        "details-card.ts",
+    ],
+    path = "/pod-viewer",
+    deps = [
+        "//tensorboard/components/tf_imports:polymer",
+        "//tensorboard/plugins/profile/pod_viewer/pod_viewer_common",
+        "@org_polymer_paper_card",
+    ],
+)
Lines changed: 100 additions & 0 deletions
@@ -0,0 +1,100 @@
+<!--
+@license
+Copyright 2019 The TensorFlow Authors. All Rights Reserved.
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+    http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================
+-->
+
+<!--
+  details-card is a card that highlights detailed information of a selection.
+-->
+
+<link rel="import" href="../polymer/polymer.html">
+<link rel="import" href="../paper-card/paper-card.html">
+<link rel="import" href="pod-viewer-common.html">
+
+<dom-module id="details-card">
+  <style>
+    paper-card {
+      max-height: 50vh;
+      overflow-y: auto;
+      --paper-card-header-color: white;
+      --paper-card-header: {
+        background-color: rgb(178,34,34);
+      }
+      width: 100%;
+    }
+
+    .card-content > div {
+      margin-bottom: 1em;
+      top: -5px;
+    }
+
+    .value {
+      font-weight: normal;
+      text-align: right;
+    }
+
+    .info {
+      font-size: 20px;
+      font-weight: bold;
+    }
+
+    .code-style {
+      font-size: 14px;
+      font-weight: normal;
+    }
+  </style>
+  <template>
+    <paper-card id="card" heading="[[_name]]" hidden="[[!_name]]" elevation="2">
+      <template is="dom-repeat" items=[[nodes]] as="node">
+        <div class="card-content info">
+          <div hidden="[[!_isChannel(node)]]">
+            <p>Replica Id: <span class="value">[[node.replicaId]]</span></p>
+          </div>
+          <div hidden="[[_isStep(node)]]">
+            <p>Data Transferred: <span class="value">[[_sizeMiB(node.dataSize)]] MiB</span></p>
+            <p>Latency: <span class="value">[[_format(node.durationUs)]] µs</span></p>
+            <p>BW: <span class="value">
+              [[_bandwidth(node.dataSize, node.durationUs)]] GiB/s</span></p>
+          </div>
+          <div hidden="[[!_isChannel(node)]]">
+            <p>Send Delay: <span class="value">[[_format(node.sendDelayUs)]] µs</span></p>
+            <p>From: <span class="value">Chip[[_chipId(node.srcCoreId)]], Core[[_nodeId(node.srcCoreId)]]</span></p>
+            <p>To: <span class="value">Chip[[_chipId(node.dstCoreId)]], Core[[_nodeId(node.dstCoreId)]]</span></p>
+            <p>Hlo Names: </p>
+            <code class="code-style">
+              <template is="dom-repeat" items=[[node.hloNames]]>
+                "[[item]]"
+              </template>
+            </code>
+          </div>
+          <div hidden="[[!_hasReplicaGroups(node)]]">
+            <p>Replica Groups</p>
+            <code class="code-style">
+              <template is="dom-repeat" items=[[node.replicaGroups]]>
+                {[[item.replicaIds]]}<br>
+              </template>
+            </code>
+          </div>
+          <div hidden="[[!_isStep(node)]]">
+            <template is="dom-repeat" items=[[stepBreakdownLayers]]>
+              <p>[[item.label]]: <span class="value">
+                [[_getStepBreakdownValue(node, item.key)]] µs
+                [[_getStepBreakdownPct(node, item.key)]]</span></p>
+            </template>
+          </div>
+        </div>
+      </template>
+    </paper-card>
+  </template>
+  <script src="details-card.js"></script>
+</dom-module>
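The template above shows or hides each section of the card through `hidden` bindings on `_isChannel`, `_isStep`, and `_hasReplicaGroups`. The sketch below restates that visibility logic as a plain function, which may make it easier to see which sections appear for each kind of node. The three interfaces are simplified, hypothetical stand-ins for the `podviewer.proto` types (which are not shown in this excerpt), and `visibleSections` and the section labels are names invented for this example.

```typescript
// Hypothetical, minimal stand-ins for the podviewer.proto types.
interface ChannelInfo { channelId: number; }
interface AllReduceOpInfo {
  name: string;
  replicaGroups: Array<{replicaIds: number[]}>;
}
interface PodStatsRecord { hostName: string; }
type DetailNode = ChannelInfo | AllReduceOpInfo | PodStatsRecord;

// Type guards that check for a field unique to each node kind.
function isChannel(node: DetailNode): node is ChannelInfo {
  return (node as ChannelInfo).channelId != undefined;
}
function isStep(node: DetailNode): node is PodStatsRecord {
  return (node as PodStatsRecord).hostName != undefined;
}
function hasReplicaGroups(node: DetailNode): boolean {
  const groups = (node as AllReduceOpInfo).replicaGroups;
  return groups != undefined && groups.length > 0;
}

// Lists the card sections that are not hidden for a node,
// in the order the template declares them.
function visibleSections(node: DetailNode): string[] {
  const sections: string[] = [];
  if (isChannel(node)) sections.push('replica-id');
  if (!isStep(node)) sections.push('transfer-stats');
  if (isChannel(node)) sections.push('channel-endpoints');
  if (hasReplicaGroups(node)) sections.push('replica-groups');
  if (isStep(node)) sections.push('step-breakdown');
  return sections;
}
```

Under these assumptions, a channel shows the replica id, transfer statistics, and endpoints; an all-reduce op shows transfer statistics and replica groups; a step record shows only the step breakdown.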
Lines changed: 130 additions & 0 deletions
@@ -0,0 +1,130 @@
+/* Copyright 2019 The TensorFlow Authors. All Rights Reserved.
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+    http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+namespace pod_viewer_details_card {
+
+type DetailNode = podviewer.proto.ChannelInfo | podviewer.proto.PodStatsRecord
+    | podviewer.proto.AllReduceOpInfo;
+
+Polymer({
+  is: 'details-card',
+  properties: {
+    nodes: {
+      type: Array,
+    },
+    _name: {
+      type: String,
+      computed: '_computeName(nodes)',
+    },
+    stepBreakdownLayers: {
+      type: Array,
+      value: () => [
+        {key: 'highFlopsComputeUs', label: 'High flops compute'},
+        {key: 'lowFlopsComputeUs', label: 'Low flops compute'},
+        {key: 'hostInfeedDurationUs', label: 'Infeed'},
+        {key: 'hostOutfeedDurationUs', label: 'Outfeed'},
+        {key: 'crsDurationUs', label: 'All reduce'},
+        {key: 'sendDurationUs', label: 'Send'},
+        {key: 'recvDurationUs', label: 'Recv'},
+      ],
+    },
+  },
+  _isAllReduce(node: DetailNode): node is podviewer.proto.AllReduceOpInfo {
+    return (<podviewer.proto.AllReduceOpInfo>node).replicaGroups != undefined;
+  },
+  _isChannel(node: DetailNode): node is podviewer.proto.ChannelInfo {
+    return (<podviewer.proto.ChannelInfo>node).channelId != undefined;
+  },
+  _isStep(node: DetailNode): node is podviewer.proto.PodStatsRecord {
+    return (<podviewer.proto.PodStatsRecord>node).hostName != undefined;
+  },
+  _hasReplicaGroups(node: podviewer.proto.AllReduceOpInfo): boolean {
+    return node.replicaGroups && node.replicaGroups.length > 0;
+  },
+  _computeName: function(nodes: Array<DetailNode>): string|undefined {
+    if (!nodes || nodes.length == 0) return;
+    const node = nodes[0];
+    if (this._isChannel(node)){
+      return 'Channel # ' + (<podviewer.proto.ChannelInfo>node).channelId;
+    } else if (this._isAllReduce(node)) {
+      return (<podviewer.proto.AllReduceOpInfo>node).name;
+    } else if (this._isStep(node)) {
+      return 'Step breakdown of chip '
+          + (<podviewer.proto.PodStatsRecord>node).chipId
+          + ', core ' + (<podviewer.proto.PodStatsRecord>node).nodeId;
+    }
+    return;
+  },
+  /**
+   * Converts from number of bytes to MiB.
+   */
+  _bytesToMiB: function(numBytes: number): number {
+    return numBytes / 1048576;
+  },
+  /**
+   * Return the formatted data size in MiB.
+   */
+  _sizeMiB: function(dataSize: undefined|number): string|undefined {
+    if (!dataSize) return;
+    return this._format(this._bytesToMiB(dataSize));
+  },
+  /**
+   * Return the formatted link bandwidth in GiB/s.
+   * The link bandwidth here is defined by the data size transferred over the
+   * duration between the start of the send operation to the end of the
+   * recv-done operation.
+   */
+  _bandwidth: function(
+      dataSize: undefined|number, duration: undefined|number):
+      string|undefined {
+    if (!dataSize || !duration) return;
+    return this._format(dataSize / duration / 1073.74);
+  },
+  /**
+   * Return the chip id given the global core id.
+   */
+  _chipId: function(coreId: number): number {
+    return Math.floor(coreId / 2);
+  },
+  /**
+   * Return the node ordinal given the global core id.
+   */
+  _nodeId: function(coreId: number): number {
+    return coreId & 1;
+  },
+  /**
+   * Format a number with two digits after the decimal point.
+   */
+  _format: function(number: undefined|number): string {
+    return number == null ? '' : number.toFixed(2);
+  },
+  /**
+   * Return a formatted value associated with a specific breakdown.
+   */
+  _getStepBreakdownValue:
+      function(node: undefined|podviewer.proto.PodStatsRecord,
+               key: undefined|string): string|undefined {
+    if (!key || !node) return;
+    return this._format(node[key]);
+  },
+  /**
+   * Return the percentage of a specific breakdown.
+   */
+  _getStepBreakdownPct:
+      function(node: undefined|podviewer.proto.PodStatsRecord,
+               key: undefined|string): string|undefined {
+    if (!key || !node || !node.totalDurationUs) return;
+    return (node[key] / node.totalDurationUs * 100).toFixed(2) + '%';
+  },
+});
+
+} // namespace pod_viewer_details_card
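The divisor 1073.74 in `_bandwidth` is not explained in the code. One plausible reading, assuming `dataSize` is in bytes and `duration` is in microseconds, is that it converts bytes per microsecond to GiB per second: one byte/µs is 10^6 bytes/s, and one GiB is 2^30 bytes, so the factor is 2^30 / 10^6 ≈ 1073.74. The sketch below spells that conversion out; the function and constant names are hypothetical and are not part of the commit.

```typescript
// Assumed units: data size in bytes, duration in microseconds.
const BYTES_PER_GIB = 1024 * 1024 * 1024;  // 2^30
const MICROS_PER_SEC = 1e6;

// Converts a transfer of `dataSizeBytes` over `durationUs` microseconds
// into a bandwidth in GiB/s.
function bandwidthGiBPerSec(dataSizeBytes: number, durationUs: number): number {
  const bytesPerSec = (dataSizeBytes / durationUs) * MICROS_PER_SEC;
  return bytesPerSec / BYTES_PER_GIB;
}
```

Under this reading, dividing by the rounded constant 1073.74 differs from the exact factor by only a few parts per million, which is well below the two decimal places the card displays.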
Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
+package(default_visibility = ["//tensorboard:internal"])
+
+load("//tensorboard/defs:web.bzl", "tf_web_library")
+
+licenses(["notice"]) # Apache 2.0
+
+tf_web_library(
+    name = "pod_viewer_common",
+    srcs = [
+        "pod-viewer-common.html",
+        "proto.ts",
+    ],
+    path = "/pod-viewer",
+)
Lines changed: 18 additions & 0 deletions
@@ -0,0 +1,18 @@
+<!--
+@license
+Copyright 2019 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+<script src="proto.js"></script>
