Conversation

@wchargin (Contributor) commented on Feb 5, 2019

Summary:
This function starts a new TensorBoard process with the given arguments,
or reuses an existing compatible process. It returns a `TensorboardInfo`
object describing how to reach the resulting TensorBoard process
(whether new or reused). See docs for more details.

Test Plan:
End-to-end tests included. These appear to be lightly flaky: I ran

    bazel test //tensorboard:manager_e2e_test --runs_per_test=100

six times on each of Python 2 and 3, and experienced three total
failures on Python 2 and zero on Python 3. On my machine, the test takes
14.7±0.9s to run on Python 2, and 17.9±1.0s to run on Python 3.

To test manually, run `bazel build //tensorboard`, then add that binary
to your path and head over to a Python REPL:

    $ export PATH="$(readlink -e ./bazel-bin/tensorboard):$PATH"
    $ python
    >>> from tensorboard import manager
    >>> r1 = manager.start(["--logdir", "~/tensorboard_data", "--port", "0"])
    >>> type(r1)
    <class 'tensorboard.manager.StartLaunched'>
    >>> r2 = manager.start(["--logdir", "~/tensorboard_data", "--port", "0"])
    >>> type(r2)
    <class 'tensorboard.manager.StartReused'>
    >>> r1.info == r2.info
    True
    >>> r1.info.port
    39081
    >>> import os
    >>> os.system("curl --silent localhost:39081 | tail -c 64")

    <tf-tensorboard use-hash brand="TensorBoard"></tf-tensorboard>
    0
    >>> manager.get_all() == [r1.info]
    True
    >>> os.kill(r1.info.pid, 15)
    >>> manager.get_all() == []
    True

wchargin-branch: manager-start

@wchargin (Contributor Author) commented on Feb 5, 2019

Will probably add some more tests tomorrow, for the “start new because
incompatible with existing instance” case and maybe also the
“stdout/stderr files could not be read” edge case and the behavior that
if there are multiple matching instances then the one with the lower
port is picked.

(edit: done.)

@wchargin force-pushed the wchargin-manager-start branch from c68c0e3 to 3345492 on February 5, 2019, 18:52
@wchargin force-pushed the wchargin-write-tensorboardinfo branch from 7f4afc4 to b22bca7 on February 5, 2019, 19:12
@wchargin force-pushed the wchargin-manager-start branch from 3345492 to 7258393 on February 5, 2019, 19:58
@stephanwlee (Contributor) left a comment:

Awesome tests!

I missed it in previous code reviews, but can the server go up without an info file (no permission, or something else went wrong)?

Also, can we add one more test for when the info file got deleted while the server is correctly up (and manager.start a second instance)?

name = "manager_e2e_test",
size = "large", # spawns subprocesses, sleeps, makes requests to localhost
timeout = "short", # about 15 seconds on my machine
flaky = True, # on Python 2, fails about 0.5% of the time
Contributor:

What is the cause? A bug capturing the context would be ideal, but a comment would suffice.

Contributor Author:

It’s just a timeout (TensorBoard doesn’t start within the expected
period of 10 seconds). Worth noting that I’ve only seen this happen
when running with factor-12 parallelism, so it’s plausible that the
machine really was under enough load and/or I/O thrashing to keep
TensorBoard from starting in time.

I’ll see if I can trigger it again and add a comment.

"@org_pythonhosted_six",
],
data = [
":tensorboard",
Contributor:

Is this required? Is it because manager does not declare a dependency on :tensorboard, or something?

Contributor Author:

Yes, it’s required. This isn’t a Python dependency; it’s a dependency on
the built tensorboard(1) binary, which we execute as a subprocess.

Contributor:

I was assuming that it was indirectly part of :manager's data. I don't see the test directly invoking the binary, so it seemed a bit weird.

Contributor Author:

:manager invokes whatever tensorboard(1) is on the system path.
:manager can’t depend on :tensorboard, because :tensorboard
depends on it! But the test can depend on :tensorboard and add that to
the system path so that :manager picks up the binary:

    # Add our Bazel-provided `tensorboard` to the system path.
    tensorboard_binary_dir = os.path.realpath("./tensorboard/")
    path_environ = {
        "PATH": os.pathsep.join((tensorboard_binary_dir, os.environ["PATH"])),
    }
    path_environ_patcher = mock.patch.dict(os.environ, path_environ)
    path_environ_patcher.start()

It doesn’t directly invoke it, but it does directly consume it (by
adding it to the path as listed above).
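
(Presumably the patch is undone again during teardown; a minimal one-line sketch, assuming the standard mock patcher API rather than quoting the actual test:)

    # Assumed cleanup mirroring the start() call above; not taken from
    # the actual test code.
    self.addCleanup(path_environ_patcher.stop)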

infos = get_all()
candidates = [info for info in infos if info.cache_key == cache_key]
for candidate in sorted(candidates, key=lambda x: x.port):
# TODO(@wchargin): Check here that the provided port is still live.
Contributor:

No action required: what do you mean here? The port can be occupied by another program. Do you mean checking whether TensorBoard is alive?

Contributor Author:

Yes, check whether TensorBoard is alive. If the TensorBoard process is
killed ungracefully (e.g., with SIGKILL or SIGQUIT), then it won’t get a
chance to clean up its info file, so the files in this directory may
be out of date. We can easily check whether the server is alive by just
sending an HTTP request to its /data/logdir.
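
For reference, a minimal sketch of such a liveness probe (the helper name and timeout are assumptions, not code from this PR; it only relies on the `port` field of `TensorboardInfo`):

    # Hypothetical liveness check, not the actual implementation of the
    # TODO above: ask the candidate TensorBoard for its logdir.
    try:
        from urllib.request import urlopen  # Python 3
    except ImportError:
        from urllib2 import urlopen  # Python 2

    def _is_alive(info):
        """Best-effort check that the TensorBoard in `info` still responds."""
        url = "http://localhost:%d/data/logdir" % info.port
        try:
            urlopen(url, timeout=2)
            return True
        except IOError:
            # Connection refused, timeout, etc.: treat the info file as stale.
            return False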


# Spy on subprocesses spawned so that we can kill them.
self.popens = []
class PopenSpy(subprocess.Popen):
Contributor:

I might be wrong, but isn't this a mock, not a spy?

Contributor Author:

Maybe? I don’t know. “Spy” seems more appropriate to me because I am
trying to spy on the behavior but I’m not actually mocking anything
out (we delegate to the real implementation, and actually do spawn
subprocesses). If it’d make you happy for this to say PopenMock
instead, I can change it.

Frankly, I have never carefully looked into the Official Terminology™ of
“mocks, stubs, and spies” or “libraries and frameworks” or similar
distinctions because I don’t see how they’re important or useful. :-)

# processes. Continue killing; fail the test later.
failed_kills.append(e)
for p in self.popens:
p.wait()
Contributor:

Is there a way to forcibly kill the process? Does this mean, in the worst case scenario, we will wait ~15s (except for the timeout test)?

Contributor Author:

The preceding for-loop will kill the processes. On Unix it sends
SIGKILL; on Windows it sends TerminateProcess. You can’t ignore,
block, or recover from those signals. I don’t see any way that these
kills could fail (and I never saw the failed_kills check actually
fail).

Reaping them with wait is just good hygiene; it frees up the child’s
slot in the process table.
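
For context, a rough sketch of that teardown pattern (the names follow the quoted test code, but this is not the exact implementation):

    # Kill every spawned TensorBoard, then reap it. `self.popens` is the
    # list populated by the PopenSpy shown above.
    failed_kills = []
    for p in self.popens:
        try:
            p.kill()  # SIGKILL on Unix, TerminateProcess on Windows
        except OSError as e:
            # Unexpected, but keep killing the rest; fail the test later.
            failed_kills.append(e)
    for p in self.popens:
        p.wait()  # reap the child to free its slot in the process table
    self.assertEqual(failed_kills, [])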

Contributor:

Ok, thanks for the explanation!

)
self.assertEqual(manager.get_all(), [])

def test_timeout(self):
Contributor:

What happens if timeout is negative? 🙃

Contributor Author:

Heh. As written, if timeout is negative, it’ll always time out; it won’t
even attempt to read the info files to see if the process launched.

This behavior seems fine enough to me, but if you prefer I’d be happy to
raise ValueError if the timeout is negative (and test for that).
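
For illustration, the guard would amount to something like this (hypothetical; it assumes `timeout` is a plain number of seconds, which may not match the actual signature):

    # Hypothetical validation, not present in the current code.
    if timeout < 0:
        raise ValueError("timeout must be non-negative, got: %r" % (timeout,))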

@stephanwlee (Contributor) Feb 6, 2019:

I was poking to see if we could add a test (without ValueError), but I'm not sure how interesting it is anymore.

@wchargin (Contributor Author) commented on Feb 6, 2019

> Awesome tests!
>
> I missed it in previous code reviews, but can the server go up
> without an info file (no permission, or something else went wrong)?

In principle, it could, yes. This is the same as timing out, and is
covered by the test_timeout test case.

> Also, can we add one more test for when the info file got deleted
> while the server is correctly up (and manager.start a second
> instance)?

Sure; that sounds like a good idea.

@wchargin (Contributor Author) commented on Feb 6, 2019

Test added.

The flaky test failure looks like

FAIL: test_simple_start (__main__.ManagerEndToEndTest)
test_simple_start (__main__.ManagerEndToEndTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/google/home/wchargin/virtualenv/tf-nightly-20190205-py2.7/local/lib/python2.7/site-packages/absl/third_party/unittest3_backport/case.py", line 37, in testPartExecutor
    yield
  File "/usr/local/google/home/wchargin/virtualenv/tf-nightly-20190205-py2.7/local/lib/python2.7/site-packages/absl/third_party/unittest3_backport/case.py", line 162, in run
    testMethod()
  File "/usr/local/google/home/wchargin/.bazel-output-base/sandbox/linux-sandbox/99/execroot/org_tensorflow_tensorboard/bazel-out/k8-fastbuild/bin/tensorboard/manager_e2e_test.runfiles/org_tensorflow_tensorboard/tensorboard/manager_e2e_test.py", line 150, in test_simple_start
    self.assertIsInstance(start_result, manager.StartLaunched)
AssertionError: StartTimedOut(pid=173) is not an instance of <class 'tensorboard.manager.StartLaunched'>

(but I’ve seen it occur in test cases other than test_simple_start).
Would you like me to add a comment like

    # On Python 2, this test fails about 0.5% of the time when run with
    # high parallelism; TensorBoard subprocesses time out instead of
    # launching successfully.
    flaky = True,

to the BUILD file? (edit: done.)

@wchargin force-pushed the wchargin-write-tensorboardinfo branch from 3bb60aa to e9d9d75 on February 6, 2019, 01:27
@wchargin force-pushed the wchargin-manager-start branch from cacc23d to 26cb6a2 on February 6, 2019, 01:27
@wchargin force-pushed the wchargin-write-tensorboardinfo branch from e9d9d75 to bb0f6c0 on February 6, 2019, 20:02
@wchargin force-pushed the wchargin-manager-start branch from 26cb6a2 to 9459339 on February 6, 2019, 20:02
@wchargin force-pushed the wchargin-write-tensorboardinfo branch from bb0f6c0 to 9b17fd7 on February 7, 2019, 01:29
@wchargin changed the base branch from wchargin-write-tensorboardinfo to master on February 7, 2019, 01:40
@wchargin force-pushed the wchargin-manager-start branch from 9372246 to 80831e1 on February 7, 2019, 01:40
@wchargin (Contributor Author) commented on Feb 7, 2019

Updated and rebased; PTAL.

@wchargin merged commit d1e1403 into master on Feb 7, 2019
@wchargin (Contributor Author) commented on Feb 7, 2019

Thanks for the reviews!

@wchargin deleted the wchargin-manager-start branch on February 7, 2019, 23:04