Conversation

Member

@davidsoergel davidsoergel commented Dec 3, 2019

This is one of those PRs where pulling on one thread unraveled the sweater. I could break it apart into a chain of 3 PRs if you want, but if you're OK with this as is, there'll be less friction with CI and whatnot.

The original goal was to teach the graph plugin to read from a data provider, mirroring the approach used for scalars. However it turned out that the tests for that use the logdir-based data provider (including multiplexer and accumulator stuff), so I implemented the new data provider blob API there too, largely mirroring code you've already seen internally. Some duplication results (which I can resolve after this is in).

Contributor

@wchargin wchargin left a comment


Everything looks good at a high level/design level; thanks! Just a bunch
of local questions.

    return graph
  raise ValueError('There is no graph in this EventAccumulator')

def SerializedGraph(self):
Contributor

FWIW, the non-plugin event_accumulator and event_multiplexer are
retained only for compatibility with certain internal clients; it’s
perfectly fine to add new functionality to plugin_event_multiplexer
without backporting it.

Member Author

OK thanks for the clarification. I was wondering about the duplication but didn't investigate. Will leave this alone for now, anyway.

Contributor

+1 to avoid changing the non-plugin event accumulator, since it's basically only maintained for legacy purposes, and we should avoid suggesting that it's still getting new features.

Member Author

OK, reverted this file and its test.

def run_metadata_route(self, request):
  """Given a tag and a run, return the session.run() metadata."""
  if self._data_provider:
    return None
Contributor

This will 500 if it’s ever hit:

>>> wrappers.Request.application(lambda req: None)({"REQUEST_METHOD": "GET"}, print)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/google/home/wchargin/virtualenv/tf-nightly-20191202-py3.7/lib/python3.7/site-packages/werkzeug/wrappers/base_request.py", line 240, in application
    return resp(*args[-2:])
TypeError: 'NoneType' object is not callable

What’s the right thing to do here? If this isn’t intended to be
reachable from the normal client-side app flows, maybe 400/404 instead?

Member Author

Hmm. Let's send a 404 for now, but it really looks like we should discuss plumbing the run metadata through the data provider. It looks like it might be a step-indexed blob, so I hope we can hook it up using the existing APIs. (That would also resolve the weirdness where is_active() checks for metadata in the bare-multiplexer case but not in the data-provider case.)
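For illustration, here is a minimal sketch of that fix as a bare WSGI callable (the real handler is wrapped with Werkzeug, and the function name is just taken from the snippet above): answer with an explicit 404 instead of returning None, which Werkzeug would otherwise surface as the 500 shown in the traceback.

```python
def run_metadata_route(environ, start_response):
    # Sketch only: when serving from a data provider, run metadata is not
    # available yet, so send an explicit 404 rather than returning None
    # (which Werkzeug turns into a TypeError and hence a 500).
    body = b"run metadata is not available via the data provider"
    start_response(
        "404 Not Found",
        [
            ("Content-Type", "text/plain; charset=utf-8"),
            ("Content-Length", str(len(body))),
        ],
    )
    return [body]
```

This can be exercised directly by calling it with a minimal environ dict and a fake `start_response`, without spinning up a server.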

Comment on lines 235 to 237
1) Tuple: (123, "graphs", "train", "graph_def", 2, 0)
2) JSON: "[123,"graphs","train","graph_def",2,0]"
3) base64: WzEyMywiZ3JhcGhzIiwidHJhaW4iLCJncmFwaF9kZWYiLDIsMF0
Contributor

Experiment IDs are like "123", not 123, no?

      1)  Tuple: ("123", "graphs", "train", "graph_def", 2, 0)
      2)   JSON: "["123","graphs","train","graph_def",2,0]"
      3) base64: WyIxMjMiLCJncmFwaHMiLCJ0cmFpbiIsImdyYXBoX2RlZiIsMiwwXQ

(Thanks for including this example—it’s quite helpful.)

Member Author

Yep; now using "some_id" to make it really clear that it's a string.
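The three-stage scheme discussed above can be sketched as follows (the function name is illustrative, not the one in the PR): serialize the identifying fields as compact JSON, then base64-encode with the URL-safe alphabet and strip the padding.

```python
import base64
import json


def encode_blob_key(experiment_id, plugin_name, run, tag, step, index):
    # Illustrative sketch of the scheme: tuple -> JSON -> base64.
    # Note that experiment_id is a string (e.g. "some_id"), not an integer.
    stringified = json.dumps(
        [experiment_id, plugin_name, run, tag, step, index],
        separators=(",", ":"),  # compact JSON, no spaces
    )
    encoded = base64.urlsafe_b64encode(stringified.encode("ascii"))
    return encoded.decode("ascii").rstrip("=")  # drop '=' padding
```

For example, `encode_blob_key("some_id", "graphs", "train", "graph_def", 2, 0)` yields a padding-free, URL-safe key that decodes back to the original six fields.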

@davidsoergel
Member Author

Thanks for the careful review!

Also: I resolved my confusion about the tag=None case (it's fine), and all tests pass locally now.

@davidsoergel davidsoergel requested a review from wchargin December 3, 2019 17:23
Contributor

@wchargin wchargin left a comment

CI still fast-failing; missed a commit, maybe?

Contributor

@wchargin wchargin left a comment

Great; thank you! :D

@davidsoergel davidsoergel merged commit 9d9b34b into master Dec 4, 2019
@wchargin
Contributor

wchargin commented Dec 5, 2019

This doesn’t actually appear to work for me. When I run

$ bazel run //tensorboard -- --generic_data=true --logdir ~/tensorboard_data/mnist

and navigate to the graphs dashboard, I get JS console errors:

Uncaught TypeError: Cannot destructure property 'run' of 'd' as it is null.
    at HTMLElement._load (tf-tensorboard.html.js:16576)
    at HTMLElement.<anonymous> (tf-tensorboard.html.js:16576)
    at tf-tensorboard.html.js:7036
    at MutationObserver.microtaskFlush (tf-tensorboard.html.js:1687)

The result from /data/plugin/graphs/info (formatted) used to be:

{
  "lr_1E-03,conv=1,fc=2": {
    "run": "lr_1E-03,conv=1,fc=2",
    "tags": {},
    "run_graph": true
  },
  "lr_1E-04,conv=2,fc=2": {
    "run": "lr_1E-04,conv=2,fc=2",
    "tags": {},
    "run_graph": true
  },
  "lr_1E-04,conv=1,fc=2": {
    "run": "lr_1E-04,conv=1,fc=2",
    "tags": {},
    "run_graph": true
  },
  "lr_1E-03,conv=2,fc=2": {
    "run": "lr_1E-03,conv=2,fc=2",
    "tags": {},
    "run_graph": true
  }
}

but is now:

{
  "lr_1E-03,conv=1,fc=2": {
    "run": "lr_1E-03,conv=1,fc=2",
    "tags": {},
    "run_graph": false,
    "op_graph": true
  },
  "lr_1E-04,conv=2,fc=2": {
    "run": "lr_1E-04,conv=2,fc=2",
    "tags": {},
    "run_graph": false,
    "op_graph": true
  },
  "lr_1E-04,conv=1,fc=2": {
    "run": "lr_1E-04,conv=1,fc=2",
    "tags": {},
    "run_graph": false,
    "op_graph": true
  },
  "lr_1E-03,conv=2,fc=2": {
    "run": "lr_1E-03,conv=2,fc=2",
    "tags": {},
    "run_graph": false,
    "op_graph": true
  }
}

The failure is new as of this commit, and also disappears if I run with
--generic_data=false or --generic_data=auto.

Could you please investigate? (Maybe consider rolling back in the
meantime, too.)

@wchargin
Contributor

wchargin commented Dec 5, 2019

…and in addition to the JS console errors, the graph display is blank.

wchargin added a commit that referenced this pull request Dec 5, 2019
Summary:
Experiment IDs must be passed to the data provider, but the multiplexer
provider doesn’t actually need them, so code that forgets to pass the ID
will be silently broken on other providers. This commit updates the
multiplexer provider to explicitly fail when the ID is omitted.

This would have caught bugs in earlier drafts of both #2981 and #2991,
which were written by different people, so clearly the mistake is easy
to make. :-)

Test Plan:
Running with `--generic_data true`, the scalars, histograms, and
distributions dashboards all still work. The graphs dashboard has the
same error as prior to this change (see comments on #2991). Unit tests
have been updated to always pass experiment IDs.

wchargin-branch: muxprovider-safety-eid
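The guard described in this summary could look roughly like the following sketch (class and method names are illustrative; the real provider methods take more parameters):

```python
def _validate_experiment_id(experiment_id):
    # Illustrative guard: the multiplexer-backed provider never actually
    # uses the experiment ID, so check it eagerly to catch callers that
    # omit it before they break against a provider that does need it.
    if not isinstance(experiment_id, str):
        raise ValueError(
            "experiment_id must be a string, but got: %r" % (experiment_id,)
        )


class MultiplexerDataProvider(object):
    def list_runs(self, experiment_id):
        _validate_experiment_id(experiment_id)  # fail fast, even if unused
        return []  # stub body for the sketch
```

With this in place, a call site that forgets the ID fails loudly on the multiplexer provider instead of only misbehaving on providers that need the ID.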
@davidsoergel
Member Author

Yikes. Fixing presently.

wchargin added a commit that referenced this pull request Dec 6, 2019
wchargin added a commit that referenced this pull request Mar 26, 2020
Summary:
Follow-up to #2991. Fixes #3434.

Test Plan:
Tests pass as written.

wchargin-branch: data-blob-sequence-tests
wchargin-source: fbd3302933cb0c50609df970edf137202723c769
wchargin added a commit that referenced this pull request Mar 26, 2020
bileschi pushed a commit to bileschi/tensorboard that referenced this pull request Apr 15, 2020
Summary:
Follow-up to tensorflow#2991. Fixes tensorflow#3434.

Test Plan:
Tests pass as written.

wchargin-branch: data-blob-sequence-tests
bileschi pushed a commit that referenced this pull request Apr 15, 2020