Conversation

@wchargin (Contributor) commented on Feb 5, 2019

Summary:
This function starts a new TensorBoard process with the given arguments,
or reuses an existing compatible process. It returns a `TensorboardInfo`
object describing how to reach the resulting TensorBoard process
(whether new or reused). See docs for more details.

Test Plan:
End-to-end tests included. These appear to be lightly flaky: I ran

    bazel test //tensorboard:manager_e2e_test --runs_per_test=100

six times on each of Python 2 and 3, and experienced three total
failures on Python 2 and zero on Python 3. On my machine, the test takes
14.7±0.9s to run on Python 2, and 17.9±1.0s to run on Python 3.

To test manually, run `bazel build //tensorboard`, then add that binary
to your path and head over to a Python REPL:

    $ export PATH="$(readlink -e ./bazel-bin/tensorboard):$PATH"
    $ python
    >>> from tensorboard import manager
    >>> r1 = manager.start(["--logdir", "~/tensorboard_data", "--port", "0"])
    >>> type(r1)
    <class 'tensorboard.manager.StartLaunched'>
    >>> r2 = manager.start(["--logdir", "~/tensorboard_data", "--port", "0"])
    >>> type(r2)
    <class 'tensorboard.manager.StartReused'>
    >>> r1.info == r2.info
    True
    >>> r1.info.port
    39081
    >>> import os
    >>> os.system("curl --silent localhost:39081 | tail -c 64")

    <tf-tensorboard use-hash brand="TensorBoard"></tf-tensorboard>
    0
    >>> manager.get_all() == [r1.info]
    True
    >>> os.kill(r1.info.pid, 15)
    >>> manager.get_all() == []
    True

wchargin-branch: manager-start

@wchargin (Contributor Author) commented on Feb 5, 2019

Will probably add some more tests tomorrow, for the “start new because
incompatible with existing instance” case and maybe also the
“stdout/stderr files could not be read” edge case and the behavior that
if there are multiple matching instances then the one with the lower
port is picked.

(edit: done.)

@wchargin force-pushed the wchargin-manager-start branch from c68c0e3 to 3345492 on February 5, 2019, 18:52
@wchargin force-pushed the wchargin-write-tensorboardinfo branch from 7f4afc4 to b22bca7 on February 5, 2019, 19:12
@wchargin force-pushed the wchargin-manager-start branch from 3345492 to 7258393 on February 5, 2019, 19:58
@stephanwlee (Contributor) left a comment:

Awesome tests!

I missed it in previous code reviews, but can the server go up without an info file (no permission, or something else went wrong)?

Also, can we add one more test for when the info file got deleted while the server is correctly up (and manager.start a second instance)?

name = "manager_e2e_test",
size = "large", # spawns subprocesses, sleeps, makes requests to localhost
timeout = "short", # about 15 seconds on my machine
flaky = True, # on Python 2, fails about 0.5% of the time
Contributor:

What is the cause? A bug capturing the context would be ideal, but a comment would suffice.

Contributor Author:

It’s just a timeout (TensorBoard doesn’t start within the expected
period of 10 seconds). Worth noting that I’ve only seen this happen
when running with factor-12 parallelism, so it’s plausible that the
machine really was under enough load and/or I/O thrashing to keep
TensorBoard from starting in time.

I’ll see if I can trigger it again and add a comment.

"@org_pythonhosted_six",
],
data = [
":tensorboard",
Contributor:

Is this required? Is it because manager does not declare a dependency on :tensorboard, or something?

Contributor Author:

Yes, it’s required. This isn’t a Python dependency; it’s a dependency on
the built tensorboard(1) binary, which we execute as a subprocess.

Contributor:

I was assuming that it was indirectly part of :manager's data. I don't see the test directly invoking the binary, so it seemed a bit weird.

Contributor Author:

:manager invokes whatever tensorboard(1) is on the system path.
:manager can’t depend on :tensorboard, because :tensorboard
depends on it! But the test can depend on :tensorboard and add that to
the system path so that :manager picks up the binary:

    # Add our Bazel-provided `tensorboard` to the system path.
    tensorboard_binary_dir = os.path.realpath("./tensorboard/")
    path_environ = {
        "PATH": os.pathsep.join((tensorboard_binary_dir, os.environ["PATH"])),
    }
    path_environ_patcher = mock.patch.dict(os.environ, path_environ)
    path_environ_patcher.start()

It doesn’t directly invoke it, but it does directly consume it (by
adding it to the path as listed above).
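
(Presumably the patch is undone again during teardown; a minimal one-line sketch, assuming the standard mock patcher API rather than quoting the actual test:)

    # Assumed cleanup mirroring the start() call above; not taken from
    # the actual test code.
    self.addCleanup(path_environ_patcher.stop)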

infos = get_all()
candidates = [info for info in infos if info.cache_key == cache_key]
for candidate in sorted(candidates, key=lambda x: x.port):
# TODO(@wchargin): Check here that the provided port is still live.
Contributor:

No action required: what do you mean here? The port can be occupied by another program. Do you mean checking whether TensorBoard is alive?

Contributor Author:

Yes, check whether TensorBoard is alive. If the TensorBoard process is
killed ungracefully (e.g., with SIGKILL or SIGQUIT), then it won’t get a
chance to clean up its info file, so the files in this directory may
be out of date. We can easily check whether the server is alive by just
sending an HTTP request to its /data/logdir.
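
For reference, a minimal sketch of such a liveness probe (the helper name and timeout are assumptions, not code from this PR; it only relies on the `port` field of `TensorboardInfo`):

    # Hypothetical liveness check, not the actual implementation of the
    # TODO above: ask the candidate TensorBoard for its logdir.
    try:
        from urllib.request import urlopen  # Python 3
    except ImportError:
        from urllib2 import urlopen  # Python 2

    def _is_alive(info):
        """Best-effort check that the TensorBoard in `info` still responds."""
        url = "http://localhost:%d/data/logdir" % info.port
        try:
            urlopen(url, timeout=2)
            return True
        except IOError:
            # Connection refused, timeout, etc.: treat the info file as stale.
            return False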


# Spy on subprocesses spawned so that we can kill them.
self.popens = []
class PopenSpy(subprocess.Popen):
Contributor:

I might be wrong, but isn't this a mock, not a spy?

Contributor Author:

Maybe? I don’t know. “Spy” seems more appropriate to me because I am
trying to spy on the behavior but I’m not actually mocking anything
out (we delegate to the real implementation, and actually do spawn
subprocesses). If it’d make you happy for this to say PopenMock
instead, I can change it.

Frankly, I have never carefully looked into the Official Terminology™ of
“mocks, stubs, and spies” or “libraries and frameworks” or similar
distinctions because I don’t see how they’re important or useful. :-)

# processes. Continue killing; fail the test later.
failed_kills.append(e)
for p in self.popens:
p.wait()
Contributor:

Is there a way to forcibly kill the process? Does this mean, in the worst case scenario, we will wait ~15s (except for the timeout test)?

Contributor Author:

The preceding for-loop will kill the processes. On Unix it sends
SIGKILL; on Windows it sends TerminateProcess. You can’t ignore,
block, or recover from those signals. I don’t see any way that these
kills could fail (and I never saw the failed_kills check actually
fail).

Reaping them with wait is just good hygiene; it frees up the child’s
slot in the process table.
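
For context, a rough sketch of that teardown pattern (the names follow the quoted test code, but this is not the exact implementation):

    # Kill every spawned TensorBoard, then reap it. `self.popens` is the
    # list populated by the PopenSpy shown above.
    failed_kills = []
    for p in self.popens:
        try:
            p.kill()  # SIGKILL on Unix, TerminateProcess on Windows
        except OSError as e:
            # Unexpected, but keep killing the rest; fail the test later.
            failed_kills.append(e)
    for p in self.popens:
        p.wait()  # reap the child to free its slot in the process table
    self.assertEqual(failed_kills, [])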

Contributor:

Ok, thanks for the explanation!

)
self.assertEqual(manager.get_all(), [])

def test_timeout(self):
Contributor:

What happens if timeout is negative? 🙃

Contributor Author:

Heh. As written, if timeout is negative, it’ll always time out; it won’t
even attempt to read the info files to see if the process launched.

This behavior seems fine enough to me, but if you prefer I’d be happy to
raise ValueError if the timeout is negative (and test for that).
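
For illustration, the guard would amount to something like this (hypothetical; it assumes `timeout` is a plain number of seconds, which may not match the actual signature):

    # Hypothetical validation, not present in the current code.
    if timeout < 0:
        raise ValueError("timeout must be non-negative, got: %r" % (timeout,))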

@stephanwlee (Contributor) Feb 6, 2019:

I was poking to see if we could add a test (without ValueError), but I'm not sure how interesting it is anymore.

@wchargin (Contributor Author) commented on Feb 6, 2019

> Awesome tests!
>
> I missed it in previous code reviews, but can the server go up
> without an info file (no permission, or something else went wrong)?

In principle, it could, yes. This is the same as timing out, and is
covered by the test_timeout test case.

> Also, can we add one more test for when the info file got deleted
> while the server is correctly up (and manager.start a second
> instance)?

Sure; that sounds like a good idea.

@wchargin (Contributor Author) commented on Feb 6, 2019

Test added.

The flaky test failure looks like

FAIL: test_simple_start (__main__.ManagerEndToEndTest)
test_simple_start (__main__.ManagerEndToEndTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/google/home/wchargin/virtualenv/tf-nightly-20190205-py2.7/local/lib/python2.7/site-packages/absl/third_party/unittest3_backport/case.py", line 37, in testPartExecutor
    yield
  File "/usr/local/google/home/wchargin/virtualenv/tf-nightly-20190205-py2.7/local/lib/python2.7/site-packages/absl/third_party/unittest3_backport/case.py", line 162, in run
    testMethod()
  File "/usr/local/google/home/wchargin/.bazel-output-base/sandbox/linux-sandbox/99/execroot/org_tensorflow_tensorboard/bazel-out/k8-fastbuild/bin/tensorboard/manager_e2e_test.runfiles/org_tensorflow_tensorboard/tensorboard/manager_e2e_test.py", line 150, in test_simple_start
    self.assertIsInstance(start_result, manager.StartLaunched)
AssertionError: StartTimedOut(pid=173) is not an instance of <class 'tensorboard.manager.StartLaunched'>

(but I’ve seen it occur in test cases other than test_simple_start).
Would you like me to add a comment like

    # On Python 2, this test fails about 0.5% of the time when run with
    # high parallelism; TensorBoard subprocesses time out instead of
    # launching successfully.
    flaky = True,

to the BUILD file? (edit: done.)

@wchargin force-pushed the wchargin-write-tensorboardinfo branch from 3bb60aa to e9d9d75 on February 6, 2019, 01:27
@wchargin force-pushed the wchargin-manager-start branch from cacc23d to 26cb6a2 on February 6, 2019, 01:27
@wchargin force-pushed the wchargin-write-tensorboardinfo branch from e9d9d75 to bb0f6c0 on February 6, 2019, 20:02
@wchargin force-pushed the wchargin-manager-start branch from 26cb6a2 to 9459339 on February 6, 2019, 20:02
@wchargin force-pushed the wchargin-write-tensorboardinfo branch from bb0f6c0 to 9b17fd7 on February 7, 2019, 01:29
@wchargin changed the base branch from wchargin-write-tensorboardinfo to master on February 7, 2019, 01:40
@wchargin force-pushed the wchargin-manager-start branch from 9372246 to 80831e1 on February 7, 2019, 01:40
@wchargin (Contributor Author) commented on Feb 7, 2019

Updated and rebased; PTAL.

@wchargin merged commit d1e1403 into master on Feb 7, 2019
@wchargin (Contributor Author) commented on Feb 7, 2019

Thanks for the reviews!

@wchargin deleted the wchargin-manager-start branch on February 7, 2019, 23:04