
Conversation

@wchargin (Contributor) commented Aug 18, 2020

Summary:
The data loader behavior maintains a fine-grained cache of key-value
pairs. This is useful because when the set of requested keys expands,
only the new keys need to be fetched. But, until now, the behavior has
been hard-coded to fire a separate request for each key-value pair.
Clients could have either fine-grained cache invalidation or efficient
batched requests, but not both at the same time. This patch enriches the
behavior to support both at once.

In a future change, the scalars dashboard will take advantage of this to
batch requests for multiple runs and a single tag. This will only
require changing the `requestData` function in `tf-scalar-card`.

In doing so, we need to shuffle around the API a bit. Instead of asking
clients to provide `getDataLoadUrl: (Item) => string` plus a separate
function `requestData: (url: string) => Promise<Data>` (where “`Item`”
and “`Data`” are the keys and values, respectively), clients now provide
a single function `requestData` that takes raw `Item`s (now plural),
performs the request(s), and returns the data. The function provides a
stream of key-value pairs, which is represented in callback style for
convenience. (We don’t want to drag Observable into this.)
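
In sketch form, the new contract is a single callback-style function
along these lines (illustrative TypeScript; the parameter names here are
not necessarily the behavior’s exact ones):

```ts
// Illustrative sketch of the callback-style contract described above.
type RequestData<Item, Data> = (
  items: Item[],
  // Called once per key-value pair as results arrive.
  onLoad: (item: Item, data: Data) => void,
  // Called once after every item has been handled.
  onFinish: () => void
) => void;
```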

The purpose of this approach, as opposed to a perhaps more natural
approach that simply adapts `getDataLoadUrl` to return some kind of
request struct with a callback to map a response into key-value pairs,
is to accommodate the variety of existing clients. The structures get
pretty wild: e.g., `tf-line-chart-data-loader` mixes in the behavior but
doesn’t actually provide the needed properties; they’re provided instead
by `tf-scalar-card`, but then `tf-hparams-session-group-details` further
overrides some of those properties of `tf-scalar-card` with an entirely
different backend. It’s a bit wordier for clients, but at least there
are fewer moving pieces to keep track of.

Test Plan:
The scalars, custom scalars, distributions, histograms, PR curves, and
hparams dashboards all work. The fine-grained invalidation on the
scalars dashboard works: e.g., set the tag filter to `mnist` and then to
`mnist|hparams`, and watch only the hparams demo data load; then, set it
to `hparams` and watch the MNIST charts disappear without any repaints
to the hparams demo charts. The post-load callback properly causes
scalar charts’ domains to adjust. The refresh button in the dashboard UI
properly invalidates and re-fetches data.

(Make sure to run with `TB_POLYMER3=1` until that’s the default.)

wchargin-branch: dlb-batch-finegrained

@wchargin (Contributor Author) commented:

If helpful, cf. #4050 for the actual multiplexing, which I’ve done some
testing of but haven’t properly documented yet, so it isn’t yet ready
for official review.

@stephanwlee (Contributor) left a comment:


99% looks good. I just want to see a reply on the condition for the
`dirtyItems` (should an in-flight item be requested again?).

```ts
// A function that takes a datum and returns a string URL for fetching
// data.
getDataLoadUrl!: (datum: Item) => string;
loadDataCallback!: (component: this, item: Item, data: Data) => void;
```
@stephanwlee (Contributor):


Do you think there is a benefit in “batching” this API, too? I.e., call
the callback once for all items.

@wchargin (Contributor Author):


Good question. I considered it, and thought that the `commitChanges`
batching that we already do would be sufficient (for scalars; it’s a
non-issue for histograms/distributions). But on further inspection I see
that the load-data callback actually calls `commitChanges` every time. I
feel like I’m missing something; what’s the point of the batching then?

I’m happy to consider adding batching here, but by default I’d probably
opt to defer it to another PR if we decide that it’s worth doing. Could
be convinced otherwise, though.

@stephanwlee (Contributor):


Ah, I think I made a mistake. I definitely meant to only commit the
changes on `onLoadFinish`. I will send a separate PR shortly.
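
Roughly, the shape of that fix (sketch only; the `pending` buffer and
the types are illustrative, not the behavior’s real fields):

```ts
// Sketch of the intended fix: buffer per-item updates and commit them
// once in onLoadFinish, instead of calling commitChanges per item.
class LoaderSketch<Item, Data> {
  private pending = new Map<Item, Data>();

  loadDataCallback(item: Item, data: Data): void {
    this.pending.set(item, data); // buffer only; no commit here
  }

  onLoadFinish(): void {
    this.commitChanges(); // single commit per load cycle
    this.pending.clear();
  }

  private commitChanges(): void {
    // In the real behavior, this flushes buffered chart updates.
  }
}
```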

@wchargin (Contributor Author):


Forgot to comment here: I did play around with this and everything still
works, and I see a significant improvement in paint time.

It’d be slightly easier for me if I could send that PR after these two
land, just so that I don’t have to futz with conflicts; is that okay
with you?

@stephanwlee (Contributor):


Awesome. Great that it improves the paint time, too.

Please feel free to work on it.

```ts
this._canceller.cancellable(({cancelled}) => {
  if (cancelled) {
    return;
  }
  const dirtyItems = this.dataToLoad.filter((datum) => {
```
@stephanwlee (Contributor):


Sorry about the pre-existing naming I created. Can we, in a follow-up,
rename these (or create an alias) to say “item” instead of “data”? I.e.,
we use both right now, with `dataToLoad`, `getDataLoadName`, and
`dataLoadState`, but “item” in `requestData`.

@wchargin (Contributor Author):


np :-) Yeah, I was planning to rename them to just `K` and `V`
throughout (with a slight potential point of confusion around the cache
keys, but I think that `(K) => string` is a pretty self-explanatory
signature). In this PR, I cleaned up a couple that I was touching
anyway, but tried to avoid changing the load-bearing API names.
Definitely happy to follow up. Could also do that as pre-work, though
I’m not super inclined to do that surgery at this point.
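
Sketched out, the renamed behavior might look something like this
(hypothetical; the exact names aren’t settled):

```ts
// Hypothetical sketch of the planned rename: generic keys K and values
// V, with fine-grained cache keys derived via a (K) => string function.
interface DataLoaderBehavior<K, V> {
  // Derives the string cache key used for fine-grained invalidation.
  getDataLoadName: (key: K) => string;
  // Invoked with each key-value pair as it loads.
  loadDataCallback: (item: K, data: V) => void;
}
```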

```ts
this._canceller.cancellable(({cancelled}) => {
  if (cancelled) {
    return;
  }
  const dirtyItems = this.dataToLoad.filter((datum) => {
```
@stephanwlee (Contributor):


I do not remember your plan, so I will ask here: what is our plan for
the maximum number of runs we batch?

On reload, we expunge the cache (maybe we should only expunge
already-loaded items so we do not double-fetch) and invoke `_loadData`.
If a user has thousands of runs, one batch request can ask for a
thousand entries, which I don’t think will scale nicely (if you want an
anecdote, please ask offline). Do we plan on breaking the request down
into chunks even in this model?

@wchargin (Contributor Author):


Yeah, fair question. The data loader behavior shouldn’t impose a max
batch size, but I was considering adding a batch size of ~100 in the
`requestData` implementation that calls `/scalars_multirun`. This is
just based on some napkin math: per time series, 1000 points, with
JSON-encoded wall time (~18 bytes), step (~5 bytes), value (~18 bytes),
and overhead (~8 bytes), for about 48 bytes per datum and thus about
48 KB per time series. That puts a 100-series response in the
single-digit megabytes, and capping it somewhere seems wise to avoid
unbounded server load.

> Do we plan on breaking the request down into chunks even in this
> model?

I think that it’s fine to call `requestData` with all the items and let
`requestData` decide if/how to break that up into smaller requests.
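
For instance, a `requestData` implementation could chunk the items on
its own, along these lines (the batch size and helper are illustrative,
not part of the behavior):

```ts
// Illustrative chunking inside a requestData implementation: the
// behavior hands over all items at once; the implementation splits them
// into bounded batches before issuing requests.
const BATCH_SIZE = 100; // assumed cap, per the napkin math above

function chunks<T>(items: T[], size: number): T[][] {
  const result: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    result.push(items.slice(i, i + size));
  }
  return result;
}

// e.g., one request per batch (fetchBatch is hypothetical):
// chunks(items, BATCH_SIZE).forEach((batch) => fetchBatch(batch));
```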

@stephanwlee (Contributor):


That plan sounds good. Thanks for sharing it.

@wchargin (Contributor Author) left a comment:


Thanks for the quick review!


@wchargin wchargin requested a review from stephanwlee August 19, 2020 03:53
@wchargin (Contributor Author) commented:

Response to your action item here:
#4045 (comment)

PTAL; if you like, I’m happy to take a stab at any of the refactorings
that I mentioned, but I haven’t started any yet.

wchargin-branch: dlb-batch-finegrained
wchargin-source: f734c80647350bbd68573a116fb3ff0fd7f62079

@wchargin wchargin added the theme:performance label (Performance, scalability, large data sizes, slowness, etc.) Aug 19, 2020
@wchargin wchargin merged commit 9d31562 into master Aug 19, 2020
@wchargin wchargin deleted the wchargin-dlb-batch-finegrained branch August 19, 2020 17:18
wchargin added a commit that referenced this pull request Aug 20, 2020
Summary:
As of this patch, a `tf-scalar-card` can make just one network request
to fetch its data, instead of one request per time series (i.e., one
request per run, since each scalar chart renders a single tag). This
reduces network overhead, improves throughput due to higher request
concurrency, and offers the opportunity for data providers to more
efficiently request the data in batch.

This is implemented with a new POST route `/scalars_multirun`, since the
list of run names may be long. The frontend is configured to batch
requests to at most 64 runs at once, so the multiplexing is bounded.
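
Schematically, the batched fetch looks something like this (the form
fields and exact URL are assumptions for illustration, not taken from
the actual handler):

```ts
// Schematic batched POST against the new route. The form fields
// ("tag", "runs") are assumptions; POST is used so that a long list of
// run names does not run into URL length limits.
async function fetchScalarsMultirun(
  tag: string,
  runs: string[]
): Promise<unknown> {
  const body = new FormData();
  body.append('tag', tag);
  for (const run of runs) {
    body.append('runs', run); // repeated field, one entry per run
  }
  // Relative route name; a real deployment prefixes the plugin path.
  const response = await fetch('scalars_multirun', {method: 'POST', body});
  return response.json();
}
```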

This only affects the scalars plugin. Other scalar chart sources, like
PR curves, custom scalars, and the hparams metrics views, are not
affected.

Supersedes #3835, with the same idea and same general backend approach,
but using the frontend APIs enabled by #4045.

Test Plan:
On the hparams demo with four charts showing, each fetching 50 runs, we
now make only four requests as opposed to 200. On a Google-internal
networked data provider, this improves end-to-end time (measured from
“first spinner appears” to “last spinner disappears”) by about 50%, from
22±1 seconds to 11±1 seconds. (Before this patch, the network requests
were being queued all the way to the end of the test period.)

Changing the batch size to 4 and then running on a dataset with 14 runs
shows that the requests are properly batched, including the last one
with just 2 runs.

Testing hparams, custom scalars, and PR curves shows that they continue
to work, even when multiple time series are requested.

wchargin-branch: scalars-mux-runs
