
Conversation

@wchargin (Contributor) commented Aug 18, 2020

Summary:
The data loader behavior maintains a fine-grained cache of key-value
pairs. This is useful because when the set of requested keys expands,
only the new keys need to be fetched. But, until now, the behavior has
been hard-coded to fire a separate request for each key-value pair.
Clients could have either fine-grained cache invalidation or efficient
batched requests, but not both at the same time. This patch enriches the
behavior to support both at once.

In a future change, the scalars dashboard will take advantage of this to
batch requests for multiple runs and a single tag. This will only
require changing the `requestData` function in `tf-scalar-card`.

In doing so, we need to shuffle around the API a bit. Instead of asking
clients to provide `getDataLoadUrl: (Item) => string` plus a separate
function `requestData: (url: string) => Promise<Data>` (where “`Item`”
and “`Data`” are the keys and values, respectively), clients now provide
a single function `requestData` that takes raw `Item`s (now plural),
performs the request(s), and returns the data. The function provides a
stream of key-value pairs, which is represented in callback style for
convenience. (We don’t want to drag Observable into this.)
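
In sketch form, the new contract is a single callback-style function
along these lines (illustrative TypeScript; the parameter names here are
not necessarily the behavior’s exact ones):

```ts
// Illustrative sketch of the callback-style contract described above.
type RequestData<Item, Data> = (
  items: Item[],
  // Called once per key-value pair as results arrive.
  onLoad: (item: Item, data: Data) => void,
  // Called once after every item has been handled.
  onFinish: () => void
) => void;
```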

The purpose of this approach, as opposed to a perhaps more natural
approach that simply adapts `getDataLoadUrl` to return some kind of
request struct with a callback to map a response into key-value pairs,
is to accommodate the variety of existing clients. The structures get
pretty wild: e.g., `tf-line-chart-data-loader` mixes in the behavior but
doesn’t actually provide the needed properties; they’re provided instead
by `tf-scalar-card`, but then `tf-hparams-session-group-details` further
overrides some of those properties of `tf-scalar-card` with an entirely
different backend. It’s a bit wordier for clients, but at least there
are fewer moving pieces to keep track of.

Test Plan:
The scalars, custom scalars, distributions, histograms, PR curves, and
hparams dashboards all work. The fine-grained invalidation on the
scalars dashboard works: e.g., set the tag filter to `mnist` and then to
`mnist|hparams`, and watch only the hparams demo data load; then, set it
to `hparams` and watch the MNIST charts disappear without any repaints
to the hparams demo charts. The post-load callback properly causes
scalar charts’ domains to adjust. The refresh button in the dashboard UI
properly invalidates and re-fetches data.

(Make sure to run with `TB_POLYMER3=1` until that’s the default.)

wchargin-branch: dlb-batch-finegrained

@wchargin (Contributor Author) commented:

If helpful, cf. #4050 for the actual multiplexing, which I’ve done some
testing of but haven’t properly documented yet, so it isn’t yet ready
for official review.

@stephanwlee (Contributor) left a comment:


99% looks good. I just want to see a reply on the condition for the
`dirtyItems` (should an in-flight item be requested again?).

```ts
// A function that takes a datum and returns a string URL for fetching
// data.
getDataLoadUrl!: (datum: Item) => string;
loadDataCallback!: (component: this, item: Item, data: Data) => void;
```
@stephanwlee (Contributor):


Do you think there is a benefit in “batching” this API, too? I.e., call
the callback once for all items.

@wchargin (Contributor Author):


Good question. I considered it, and thought that the `commitChanges`
batching that we already do would be sufficient (for scalars; it’s a
non-issue for histograms/distributions). But on further inspection I see
that the load-data callback actually calls `commitChanges` every time. I
feel like I’m missing something; what’s the point of the batching then?

I’m happy to consider adding batching here, but by default I’d probably
opt to defer it to another PR if we decide that it’s worth doing. Could
be convinced otherwise, though.

@stephanwlee (Contributor):


Ah, I think I made a mistake. I definitely meant to only commit the
changes on `onLoadFinish`. I will send a separate PR shortly.
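
Roughly, the shape of that fix (sketch only; the `pending` buffer and
the types are illustrative, not the behavior’s real fields):

```ts
// Sketch of the intended fix: buffer per-item updates and commit them
// once in onLoadFinish, instead of calling commitChanges per item.
class LoaderSketch<Item, Data> {
  private pending = new Map<Item, Data>();

  loadDataCallback(item: Item, data: Data): void {
    this.pending.set(item, data); // buffer only; no commit here
  }

  onLoadFinish(): void {
    this.commitChanges(); // single commit per load cycle
    this.pending.clear();
  }

  private commitChanges(): void {
    // In the real behavior, this flushes buffered chart updates.
  }
}
```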

@wchargin (Contributor Author):


Forgot to comment here: I did play around with this and everything still
works, and I see a significant improvement in paint time.

It’d be slightly easier for me if I could send that PR after these two
land, just so that I don’t have to futz with conflicts; is that okay
with you?

@stephanwlee (Contributor):


Awesome. Great that it improves the paint time, too.

Please feel free to work on it.

```ts
this._canceller.cancellable(({cancelled}) => {
  if (cancelled) {
    return;
  }
  const dirtyItems = this.dataToLoad.filter((datum) => {
```
@stephanwlee (Contributor):


Sorry about the pre-existing naming I created. Can we, in a follow-up,
rename these (or create an alias) to say “item” instead of “data”? I.e.,
we use both right now, with `dataToLoad`, `getDataLoadName`, and
`dataLoadState`, but “item” in `requestData`.

@wchargin (Contributor Author):


np :-) Yeah, I was planning to rename them to just `K` and `V`
throughout (with a slight potential point of confusion around the cache
keys, but I think that `(K) => string` is a pretty self-explanatory
signature). In this PR, I cleaned up a couple that I was touching
anyway, but tried to avoid changing the load-bearing API names.
Definitely happy to follow up. Could also do that as pre-work, though
I’m not super inclined to do that surgery at this point.
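
Sketched out, the renamed behavior might look something like this
(hypothetical; the exact names aren’t settled):

```ts
// Hypothetical sketch of the planned rename: generic keys K and values
// V, with fine-grained cache keys derived via a (K) => string function.
interface DataLoaderBehavior<K, V> {
  // Derives the string cache key used for fine-grained invalidation.
  getDataLoadName: (key: K) => string;
  // Invoked with each key-value pair as it loads.
  loadDataCallback: (item: K, data: V) => void;
}
```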

```ts
this._canceller.cancellable(({cancelled}) => {
  if (cancelled) {
    return;
  }
  const dirtyItems = this.dataToLoad.filter((datum) => {
```
@stephanwlee (Contributor):


I do not remember your plan, so I will ask here: what is our plan for
the maximum number of runs we batch?

On reload, we expunge the cache (maybe we should only expunge
already-loaded items so we do not double-fetch) and invoke `_loadData`.
If a user has thousands of runs, one batch request can ask for a
thousand entries, which I don’t think will scale nicely (if you want an
anecdote, please ask offline). Do we plan on breaking the request down
into chunks even in this model?

@wchargin (Contributor Author):


Yeah, fair question. The data loader behavior shouldn’t impose a max
batch size, but I was considering adding a batch size of ~100 in the
`requestData` implementation that calls `/scalars_multirun`. This is
just based on some napkin math: per time series, 1000 points, with
JSON-encoded wall time (~18 bytes), step (~5 bytes), value (~18 bytes),
and overhead (~8 bytes), for about 48 bytes per datum and thus about
48 KB per time series. That puts a 100-series response in the
single-digit megabytes, and capping it somewhere seems wise to avoid
unbounded server load.

> Do we plan on breaking the request down into chunks even in this
> model?

I think that it’s fine to call `requestData` with all the items and let
`requestData` decide if/how to break that up into smaller requests.
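
For instance, a `requestData` implementation could chunk the items on
its own, along these lines (the batch size and helper are illustrative,
not part of the behavior):

```ts
// Illustrative chunking inside a requestData implementation: the
// behavior hands over all items at once; the implementation splits them
// into bounded batches before issuing requests.
const BATCH_SIZE = 100; // assumed cap, per the napkin math above

function chunks<T>(items: T[], size: number): T[][] {
  const result: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    result.push(items.slice(i, i + size));
  }
  return result;
}

// e.g., one request per batch (fetchBatch is hypothetical):
// chunks(items, BATCH_SIZE).forEach((batch) => fetchBatch(batch));
```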

@stephanwlee (Contributor):


That plan sounds good. Thanks for sharing it.

@wchargin (Contributor Author) left a comment:


Thanks for the quick review!


@wchargin wchargin requested a review from stephanwlee August 19, 2020 03:53
@wchargin (Contributor Author) commented:

Response to your action item here:
#4045 (comment)

PTAL; if you like, I’m happy to take a stab at any of the refactorings
that I mentioned, but I haven’t started any yet.

wchargin-branch: dlb-batch-finegrained
wchargin-source: f734c80647350bbd68573a116fb3ff0fd7f62079

@wchargin wchargin added the theme:performance label (Performance, scalability, large data sizes, slowness, etc.) Aug 19, 2020
@wchargin wchargin merged commit 9d31562 into master Aug 19, 2020
@wchargin wchargin deleted the wchargin-dlb-batch-finegrained branch August 19, 2020 17:18
wchargin added a commit that referenced this pull request Aug 20, 2020
Summary:
As of this patch, a `tf-scalar-card` can make just one network request
to fetch its data, instead of one request per time series (i.e., one
request per run, since each scalar chart renders a single tag). This
reduces network overhead, improves throughput due to higher request
concurrency, and offers the opportunity for data providers to more
efficiently request the data in batch.

This is implemented with a new POST route `/scalars_multirun`, since the
list of run names may be long. The frontend is configured to batch
requests to at most 64 runs at once, so the multiplexing is bounded.
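
Schematically, the batched fetch looks something like this (the form
fields and exact URL are assumptions for illustration, not taken from
the actual handler):

```ts
// Schematic batched POST against the new route. The form fields
// ("tag", "runs") are assumptions; POST is used so that a long list of
// run names does not run into URL length limits.
async function fetchScalarsMultirun(
  tag: string,
  runs: string[]
): Promise<unknown> {
  const body = new FormData();
  body.append('tag', tag);
  for (const run of runs) {
    body.append('runs', run); // repeated field, one entry per run
  }
  // Relative route name; a real deployment prefixes the plugin path.
  const response = await fetch('scalars_multirun', {method: 'POST', body});
  return response.json();
}
```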

This only affects the scalars plugin. Other scalar chart sources, like
PR curves, custom scalars, and the hparams metrics views, are not
affected.

Supersedes #3835, with the same idea and same general backend approach,
but using the frontend APIs enabled by #4045.

Test Plan:
On the hparams demo with four charts showing, each fetching 50 runs, we
now make only four requests as opposed to 200. On a Google-internal
networked data provider, this improves end-to-end time (measured from
“first spinner appears” to “last spinner disappears”) by about 50%, from
22±1 seconds to 11±1 seconds. (Before this patch, the network requests
were being queued all the way to the end of the test period.)

Changing the batch size to 4 and then running on a dataset with 14 runs
shows that the requests are properly batched, including the last one
with just 2 runs.

Testing hparams, custom scalars, and PR curves shows that they continue
to work, even when multiple time series are requested.

wchargin-branch: scalars-mux-runs
