manager: implement start
#1809
Conversation
Will probably add some more tests tomorrow, for the “start new because …” cases. (edit: done.)
Awesome tests!
I missed it in previous code reviews, but can the server go up without an info file (no permission, or something went wrong)?
Also, can we add one more test for when the info file gets deleted while the server is correctly up (and manager.start the second instance)?
tensorboard/BUILD:

    name = "manager_e2e_test",
    size = "large",  # spawns subprocesses, sleeps, makes requests to localhost
    timeout = "short",  # about 15 seconds on my machine
    flaky = True,  # on Python 2, fails about 0.5% of the time
What is the cause? A bug capturing context would be ideal, but a comment would suffice.
It’s just a timeout (TensorBoard doesn’t start within the expected
period of 10 seconds). Worth noting that I’ve only seen this happen
when running with factor-12 parallelism, so it’s plausible that the
machine really was under enough load and/or I/O thrashing that it
couldn’t launch in time.
I’ll see if I can trigger it again and add a comment.
| "@org_pythonhosted_six", | ||
| ], | ||
| data = [ | ||
| ":tensorboard", |
Is this required? Is it because :manager does not declare a dependency on :tensorboard or something?
Yes, it’s required. This isn’t a Python dependency; it’s a dependency on
the built tensorboard(1) binary, which we execute as a subprocess.
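For concreteness, here's roughly how the test target might be declared. This is a sketch reconstructed from the BUILD snippets quoted in this thread; the srcs filename and the :manager dep are assumptions:

    py_test(
        name = "manager_e2e_test",
        size = "large",  # spawns subprocesses, sleeps, makes requests to localhost
        timeout = "short",
        srcs = ["manager_e2e_test.py"],  # assumed filename
        data = [":tensorboard"],  # the built binary, executed as a subprocess
        deps = [
            ":manager",  # assumed: the module under test
            "@org_pythonhosted_six",
        ],
        flaky = True,  # on Python 2, fails about 0.5% of the time
    )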
I was assuming that it was indirectly part of :manager's data. I don't see the test directly invoking the binary, so I thought it was a bit weird.
:manager invokes whatever tensorboard(1) is on the system path.
:manager can’t depend on :tensorboard, because :tensorboard
depends on it! But the test can depend on :tensorboard and add that to
the system path so that :manager picks up the binary:

    # Add our Bazel-provided `tensorboard` to the system path.
    tensorboard_binary_dir = os.path.realpath("./tensorboard/")
    path_environ = {
        "PATH": os.pathsep.join((tensorboard_binary_dir, os.environ["PATH"])),
    }
    path_environ_patcher = mock.patch.dict(os.environ, path_environ)
    path_environ_patcher.start()

It doesn’t directly invoke it, but it does directly consume it (by
adding it to the path as listed above).
    infos = get_all()
    candidates = [info for info in infos if info.cache_key == cache_key]
    for candidate in sorted(candidates, key=lambda x: x.port):
        # TODO(@wchargin): Check here that the provided port is still live.
No action required: what do you mean here? The port could be occupied by another program. Do you mean check whether TensorBoard is alive?
Yes, check whether TensorBoard is alive. If the TensorBoard process is
killed ungracefully (e.g., with SIGKILL or SIGQUIT), then it won’t get a
chance to clean up its info file, so the files in this directory may
be out of date. We can easily check whether the server is alive by
sending an HTTP request to its /data/logdir endpoint.
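A minimal sketch of such a liveness check (the helper name is hypothetical; it assumes the `port` field on the info object shown elsewhere in this PR):

    import socket

    from six.moves import urllib


    def _is_tensorboard_alive(info):
        # Hypothetical helper: probe the server described by `info` by
        # requesting its /data/logdir endpoint; any HTTP response counts
        # as alive, while a connection error suggests a stale info file.
        url = "http://localhost:%d/data/logdir" % info.port
        try:
            urllib.request.urlopen(url, timeout=1)
            return True
        except (urllib.error.URLError, socket.error):
            return False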
    # Spy on subprocesses spawned so that we can kill them.
    self.popens = []
    class PopenSpy(subprocess.Popen):
I might be wrong, but isn't this a mock, not a spy?
Maybe? I don’t know. “Spy” seems more appropriate to me because I am
trying to spy on the behavior but I’m not actually mocking anything
out (we delegate to the real implementation, and actually do spawn
subprocesses). If it’d make you happy for this to say PopenMock
instead, I can change it.
Frankly, I have never carefully looked into the Official Terminology™ of
“mocks, stubs, and spies” or “libraries and frameworks” or similar
distinctions because I don’t see how they’re important or useful. :-)
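For reference, here is a sketch of how such a spy can record every spawned subprocess while still delegating to the real Popen (reconstructed from the snippet above; the test-class name and the patching mechanics are assumptions):

    import subprocess
    import unittest

    import mock  # external `mock` package (or `unittest.mock` on Python 3)


    class ManagerEndToEndTest(unittest.TestCase):  # illustrative name
        def setUp(self):
            self.popens = []
            test = self  # captured by the closure below

            class PopenSpy(subprocess.Popen):
                # Real subprocesses are still spawned; we only record each
                # instance so that tearDown can kill and reap it.
                def __init__(self, *args, **kwargs):
                    super(PopenSpy, self).__init__(*args, **kwargs)
                    test.popens.append(self)

            popen_patcher = mock.patch.object(subprocess, "Popen", PopenSpy)
            popen_patcher.start()
            self.addCleanup(popen_patcher.stop)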
            # processes. Continue killing; fail the test later.
            failed_kills.append(e)
    for p in self.popens:
        p.wait()
Is there a way to forcibly kill the process? Does this mean, in the worst case scenario, we will wait ~15s (except for the timeout test)?
The preceding for-loop will kill the processes. On Unix it sends
SIGKILL; on Windows it calls TerminateProcess. You can’t ignore,
block, or recover from those. I don’t see any way that these
kills could fail (and I never saw the failed_kills check actually
fail).
Reaping them with wait is just good hygiene; it frees up the child’s
slot in the process table.
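Putting the two loops together, the teardown might look roughly like this (a sketch based on the snippets above; the exact structure around the try/except is assumed):

    def tearDown(self):
        failed_kills = []
        for p in self.popens:
            try:
                p.kill()  # SIGKILL on Unix; TerminateProcess on Windows
            except OSError as e:
                # Unexpected, but record the error and keep killing the
                # other processes; fail the test later.
                failed_kills.append(e)
        for p in self.popens:
            p.wait()  # reap: frees the child's slot in the process table
        self.assertEqual(failed_kills, [])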
Ok, thanks for the explanation!
    )
    self.assertEqual(manager.get_all(), [])


    def test_timeout(self):
What happens if timeout is negative? 🙃
Heh. As written, if timeout is negative, it’ll always time out; it won’t
even attempt to read the info files to see if the process launched.
This behavior seems fine enough to me, but if you prefer I’d be happy to
raise ValueError if the timeout is negative (and test for that).
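If we did want that guard, it might look something like this (hypothetical; the actual signature and default of manager.start may differ):

    import datetime


    def start(arguments, timeout=datetime.timedelta(seconds=60)):
        # Hypothetical validation discussed above; not part of the merged code.
        if timeout < datetime.timedelta(0):
            raise ValueError("timeout must be non-negative: %r" % (timeout,))
        ...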
I was poking to see if we can add a test (without ValueError), but I'm not sure how interesting it is anymore.

In principle, it could, yes. This is the same as timing out, and is …

Sure; that sounds like a good idea.
Test added. The flaky test failure looks like a timeout (but I’ve seen it occur in test cases other than the timeout test). Shall I add

    # On Python 2, this test fails about 0.5% of the time when run with
    # high parallelism; TensorBoard subprocesses time out instead of
    # launching successfully.
    flaky = True,

to the BUILD file? (edit: done.)
Summary:
This function starts a new TensorBoard process with the given arguments,
or reuses an existing compatible process. It returns a `TensorboardInfo`
object describing how to reach the resulting TensorBoard process
(whether new or reused). See docs for more details.
Test Plan:
End-to-end tests included. These appear to be lightly flaky: I ran
bazel test //tensorboard:manager_e2e_test --runs_per_test=100
six times on each of Python 2 and 3, and experienced three total
failures on Python 2 and zero on Python 3. On my machine, the test takes
14.7±0.9s to run on Python 2, and 17.9±1.0s to run on Python 3.
To test manually, run `bazel build //tensorboard`, then add that binary
to your path and head over to a Python REPL:
$ export PATH="$(readlink -e ./bazel-bin/tensorboard):$PATH"
$ python
>>> from tensorboard import manager
>>> r1 = manager.start(["--logdir", "~/tensorboard_data", "--port", "0"])
>>> type(r1)
<class 'tensorboard.manager.StartLaunched'>
>>> r2 = manager.start(["--logdir", "~/tensorboard_data", "--port", "0"])
>>> type(r2)
<class 'tensorboard.manager.StartReused'>
>>> r1.info == r2.info
True
>>> r1.info.port
39081
>>> import os
>>> os.system("curl --silent localhost:39081 | tail -c 64")
<tf-tensorboard use-hash brand="TensorBoard"></tf-tensorboard>
0
>>> manager.get_all() == [r1.info]
True
>>> os.kill(r1.info.pid, 15)
>>> manager.get_all() == []
True
wchargin-branch: manager-start
Updated and rebased; PTAL.

Thanks for the reviews!