data-loader-behavior: add fine-grained batching support #4045
Conversation
Summary:

The data loader behavior maintains a fine-grained cache of key-value pairs. This is useful because when the set of requested keys expands, only the new keys need to be fetched. But, until now, the behavior has been hard-coded to fire a separate request for each key-value pair: clients could have either fine-grained cache invalidation or efficient batched requests, but not both at the same time. This patch enriches the behavior to support both at once.

In a future change, the scalars dashboard will take advantage of this to batch requests for multiple runs and a single tag.

In doing so, we need to shuffle around the API a bit. Instead of asking clients to provide `getDataLoadUrl: (Item) => string` plus a separate function `requestData: (url: string) => Promise<Data>` (where `Item` and `Data` are the keys and values, respectively), clients now provide a single function `requestData` that takes raw `Item`s (now plural), performs the request(s), and returns the data. The function provides a stream of key-value pairs, which is represented in callback style for convenience. (We don’t want to drag Observable into this.)

The purpose of this approach, as opposed to a perhaps more natural approach that simply adapts `getDataLoadUrl` to return some kind of request struct with a callback to map a response into key-value pairs, is to accommodate the variety of existing clients. The structures get pretty wild: e.g., `tf-line-chart-data-loader` mixes in the behavior but doesn’t actually provide the needed properties; they’re provided instead by `tf-scalar-card`, but then `tf-hparams-session-group-details` further overrides some of those properties of `tf-scalar-card` with an entirely different backend. It’s a bit wordier for clients, but at least there are fewer moving pieces to keep track of.

Test Plan:

The scalars, custom scalars, distributions, histograms, and hparams dashboards all work. The fine-grained invalidation on the scalars dashboard works: e.g., set the tag filter to `mnist` and then to `mnist|hparams`, and watch only the hparams demo data load; then, set it to `hparams` and watch the MNIST charts disappear without any repaints to the hparams demo charts. The post-load callback properly causes scalar charts’ domains to adjust. The refresh button in the dashboard UI properly invalidates and re-fetches data.

(Make sure to run with `TB_POLYMER3=1` until that’s the default.)

wchargin-branch: dlb-batch-finegrained
wchargin-source: 14ec8abde36c4563a4922209d361fc5bd16b6061
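For concreteness, here is one plausible shape for such a callback-style hook. This is a minimal sketch, not the patch’s exact API: the type alias, parameter names, and endpoint are assumptions for illustration.

```ts
// A minimal sketch (not the patch's exact API) of a callback-style
// requestData hook: it receives all dirty items at once, emits
// key-value pairs as they arrive, and signals completion.
type RequestData<Item, Data> = (
  items: Item[],
  onLoad: (item: Item, data: Data) => void,
  onFinish: () => void
) => void;

// Example client: fetch each item's data (from a hypothetical endpoint)
// and stream results back as they resolve.
const requestData: RequestData<string, number[]> = (
  items,
  onLoad,
  onFinish
) => {
  const requests = items.map((item) =>
    fetch(`/data?item=${encodeURIComponent(item)}`) // hypothetical endpoint
      .then((resp) => resp.json())
      .then((data: number[]) => onLoad(item, data))
  );
  // Signal completion whether or not every request succeeded.
  Promise.all(requests).then(
    () => onFinish(),
    () => onFinish()
  );
};
```

Because the hook receives all items at once, an implementation is free to issue one request per item (as above), one request for everything, or anything in between.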
If helpful, cf. #4050 for the actual multiplexing, which I’ve done some …
stephanwlee
left a comment
99% looks good. I just want to see the reply on the condition for the `dirtyItems` (should an in-flight item be requested again?).
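For context, a sketch of the kind of filter being asked about. The enum and helper here are illustrative stand-ins for the behavior’s actual bookkeeping, not its exact names:

```ts
// Illustrative sketch of the condition under discussion: an item is
// "dirty" (needs fetching) only if it is neither already loaded nor
// currently in flight. All names here are assumed for illustration.
enum DataLoadState {
  NOT_LOADED,
  LOADING,
  LOADED,
}

function computeDirtyItems<Item>(
  dataToLoad: Item[],
  getDataLoadName: (item: Item) => string,
  loadState: Map<string, DataLoadState>
): Item[] {
  return dataToLoad.filter((item) => {
    const state = loadState.get(getDataLoadName(item));
    // Skipping LOADING here means an in-flight item is not re-requested.
    return state !== DataLoadState.LOADED && state !== DataLoadState.LOADING;
  });
}
```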
```ts
// A function that takes a datum and returns a string URL for fetching
// data.
getDataLoadUrl!: (datum: Item) => string;
loadDataCallback!: (component: this, item: Item, data: Data) => void;
```
Do you think there is a benefit in “batching” this API, too? I.e., call the callback once for all items.
Good question. I considered it, and thought that the `commitChanges`
batching that we already do would be sufficient (for scalars; it’s a
non-issue for histograms/distributions). But on further inspection I see
that the load-data callback actually calls `commitChanges` every
time. I feel like I’m missing something; what’s the point of the
batching then?

I’m happy to consider adding batching here, but by default I’ll probably
opt to defer that to another PR if we decide that it’s worth doing.
Could be convinced otherwise, though.
Ah, I think I made a mistake. I definitely meant to only commit the changes on `onLoadFinish`. I will send a separate PR shortly.
Forgot to comment here: I did play around with this and everything still
works, and I see a significant improvement in paint time.
It’d be slightly easier for me if I could send that PR after these two
land, just so that I don’t have to futz with conflicts; is that okay
with you?
Awesome. It’s great that it improves the paint time, too.
Please feel free to work on it.
(Resolved comment thread on tensorboard/components_polymer3/tf_dashboard_common/data-loader-behavior.ts.)
```ts
this._canceller.cancellable(({cancelled}) => {
  if (cancelled) {
    return;
  }
  const dirtyItems = this.dataToLoad.filter((datum) => {
```
Sorry about the pre-existing naming I created. Can we, in a follow-up, rename these (or create an alias) to say `item` instead of `data`? I.e., we currently use both: `dataToLoad`, `getDataLoadName`, and `dataLoadState`, but `item` in `requestData`.
np :-) Yeah, I was planning to rename them to just `K` and `V`
throughout (with a slight potential point of confusion around the cache
keys, but I think that `(K) => string` is a pretty self-explanatory
signature). In this PR, I cleaned up a couple that I was touching
anyway, but tried to avoid changing the load-bearing API names.

Definitely happy to follow up. Could also do that as pre-work, though
I’m not super inclined to do that at this point.
```ts
this._canceller.cancellable(({cancelled}) => {
  if (cancelled) {
    return;
  }
  const dirtyItems = this.dataToLoad.filter((datum) => {
```
I do not remember your plan, so I will ask here: what is our plan for the maximum number of runs we batch?

On reload, we expunge the cache (maybe we should only expunge already-loaded entries so we do not double-fetch) and invoke `_loadData`. If a user has thousands of runs, one batch request can ask for a thousand entries, which I don’t think will scale nicely (if you want some anecdote, please ask offline). Do we plan on breaking the request down into chunks even in this model?
Yeah, fair question. The data loader behavior shouldn’t impose a max
batch size, but I was considering adding a batch size of ~100 in the
`requestData` implementation that calls `/scalars_multirun`. This is
just based off some napkin math: per time series, 1000 points, with
JSON-encoded wall time (~18 bytes), step (~5 bytes), value (~18 bytes),
and overhead (~8 bytes), for about 48 bytes per datum and thus about
48 KB per time series. A response size of ~4.8 MB for a full batch of
100 seems reasonable, and capping the batch somewhere seems wise to
avoid unbounded server load.

> Do we plan on breaking the request down into chunks even in this
> model?

I think that it’s fine to call `requestData` with all the items and let
`requestData` decide if/how to break that up into smaller requests.
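To illustrate, a chunked `requestData` along these lines might look like the following sketch. The batch size matches the plan above, but the function name and the `/scalars_multirun` request and response shapes are assumptions for illustration:

```ts
// Sketch of a requestData implementation that chunks runs into bounded
// batches before POSTing to /scalars_multirun. The request/response
// shapes here are assumptions for illustration.
const BATCH_SIZE = 100;

function requestScalarData(
  tag: string,
  runs: string[],
  onLoad: (run: string, data: unknown) => void,
  onFinish: () => void
): void {
  // Split the run list into batches of at most BATCH_SIZE.
  const batches: string[][] = [];
  for (let i = 0; i < runs.length; i += BATCH_SIZE) {
    batches.push(runs.slice(i, i + BATCH_SIZE));
  }
  const requests = batches.map((batch) => {
    const body = new FormData();
    body.append('tag', tag);
    for (const run of batch) {
      body.append('runs', run);
    }
    return fetch('/scalars_multirun', {method: 'POST', body})
      .then((resp) => resp.json())
      .then((result: {[run: string]: unknown}) => {
        // Assumed response shape: a map from run name to its data.
        for (const run of batch) {
          onLoad(run, result[run]);
        }
      });
  });
  // Signal completion once every batch has settled.
  Promise.all(requests).then(
    () => onFinish(),
    () => onFinish()
  );
}
```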
That plan sounds good. Thanks for sharing it.
(Resolved comment thread on ...gins/custom_scalar/polymer3/tf_custom_scalar_dashboard/tf-custom-scalar-margin-chart-card.ts.)
wchargin
left a comment
Thanks for the quick review!
Response to your action item here: PTAL; if you like, I’m happy to take a stab at any of the refactorings.
wchargin-branch: dlb-batch-finegrained
wchargin-source: f734c80647350bbd68573a116fb3ff0fd7f62079
Summary:

As of this patch, a `tf-scalar-card` can make just one network request to fetch its data, instead of one request per time series (i.e., one request per run, since each scalar chart renders a single tag). This reduces network overhead, improves throughput due to higher request concurrency, and offers the opportunity for data providers to more efficiently request the data in batch.

This is implemented with a new POST route `/scalars_multirun`, since the list of run names may be long. The frontend is configured to batch requests to at most 64 runs at once, so the multiplexing is bounded.

This only affects the scalars plugin. Other scalar chart sources, like PR curves, custom scalars, and the hparams metrics views, are not affected.

Supersedes #3835, with the same idea and same general backend approach, but using the frontend APIs enabled by #4045.

Test Plan:

On the hparams demo with four charts showing, each fetching 50 runs, we now make only four requests as opposed to 200. On a Google-internal networked data provider, this improves end-to-end time (measured from “first spinner appears” to “last spinner disappears”) by about 50%, from 22±1 seconds to 11±1 seconds. (Before this patch, the network requests were being queued all the way to the end of the test period.) Changing the batch size to 4 and then running on a dataset with 14 runs shows that the requests are properly batched, including the last one with just 2 runs. Testing hparams, custom scalars, and PR curves shows that they continue to work, even when multiple time series are requested.

wchargin-branch: scalars-mux-runs
Summary:

The data loader behavior maintains a fine-grained cache of key-value
pairs. This is useful because when the set of requested keys expands,
only the new keys need to be fetched. But, until now, the behavior has
been hard-coded to fire a separate request for each key-value pair:
clients could have either fine-grained cache invalidation or efficient
batched requests, but not both at the same time. This patch enriches the
behavior to support both at once.

In a future change, the scalars dashboard will take advantage of this to
batch requests for multiple runs and a single tag. This will only
require changing the `requestData` function in `tf-scalar-card`.

In doing so, we need to shuffle around the API a bit. Instead of asking
clients to provide `getDataLoadUrl: (Item) => string` plus a separate
function `requestData: (url: string) => Promise<Data>` (where `Item`
and `Data` are the keys and values, respectively), clients now provide
a single function `requestData` that takes raw `Item`s (now plural),
performs the request(s), and returns the data. The function provides a
stream of key-value pairs, which is represented in callback style for
convenience. (We don’t want to drag Observable into this.)

The purpose of this approach, as opposed to a perhaps more natural
approach that simply adapts `getDataLoadUrl` to return some kind of
request struct with a callback to map a response into key-value pairs,
is to accommodate the variety of existing clients. The structures get
pretty wild: e.g., `tf-line-chart-data-loader` mixes in the behavior but
doesn’t actually provide the needed properties; they’re provided instead
by `tf-scalar-card`, but then `tf-hparams-session-group-details` further
overrides some of those properties of `tf-scalar-card` with an entirely
different backend. It’s a bit wordier for clients, but at least there
are fewer moving pieces to keep track of.

Test Plan:

The scalars, custom scalars, distributions, histograms, PR curves, and
hparams dashboards all work. The fine-grained invalidation on the
scalars dashboard works: e.g., set the tag filter to `mnist` and then to
`mnist|hparams`, and watch only the hparams demo data load; then, set it
to `hparams` and watch the MNIST charts disappear without any repaints
to the hparams demo charts. The post-load callback properly causes
scalar charts’ domains to adjust. The refresh button in the dashboard UI
properly invalidates and re-fetches data.

(Make sure to run with `TB_POLYMER3=1` until that’s the default.)

wchargin-branch: dlb-batch-finegrained