server: when no port is specified, try to find one #1851

wchargin · 2019-02-15T00:01:16Z

Summary:
This patch teaches TensorBoard about --port=default (which is the new
default value for --port). If --port=default is specified,
TensorBoard will attempt to bind to port 6006, just as prior to this
patch. If this fails, TensorBoard will try to bind incrementally to
ports 6007, 6008, …, giving up after 10 attempts.

If an explicit port is specified, TensorBoard will not try to search for
other ports; it will fail (as before) if it cannot bind. This is the
only way in which --port=default and --port=6006 differ.

This patch also changes the behavior of one edge case, at least on
Linux. Prior to this patch, explicit port numbers were interpreted
modulo 65536 (TCP ports are u16s), so --port=71542 and --port=6006
had the same behavior. It is now an error on all platforms to specify a
port that does not fit in a u16.

Resolves #1848.

Test Plan:

$ bazel build //tensorboard

$ ./bazel-bin/tensorboard/tensorboard --logdir ./logs/ & sleep 2
[1] 223547
TensorBoard 1.13.0a0 at http://<hostname>:6006 (Press CTRL+C to quit)

$ ./bazel-bin/tensorboard/tensorboard --logdir ./logs/ & sleep 2
[2] 223591
TensorBoard 1.13.0a0 at http://<hostname>:6007 (Press CTRL+C to quit)

$ ./bazel-bin/tensorboard/tensorboard --logdir ./logs/ --port default & sleep 2
[3] 224292
TensorBoard 1.13.0a0 at http://<hostname>:6008 (Press CTRL+C to quit)

$ ./bazel-bin/tensorboard/tensorboard --logdir ./logs/ --port 6006 & sleep 2  # should fail
[4] 225773
E0214 15:35:08.939894 140033352959744 program.py:232] TensorBoard could not bind to port 6006, it was already in use
ERROR: TensorBoard could not bind to port 6006, it was already in use
[4]+  Exit 255                ./bazel-bin/tensorboard/tensorboard --logdir ./logs/ --port 6006

$ ./bazel-bin/tensorboard/tensorboard --logdir ./logs/ --port 0 & sleep 2
[4] 226298
TensorBoard 1.13.0a0 at http://<hostname>:33673 (Press CTRL+C to quit)

$ for i in {6009..6015}; do ./bazel-bin/tensorboard/tensorboard --logdir ./logs/ --port "$i" & done && sleep 2
[5] 226638
[6] 226639
[7] 226640
[8] 226641
[9] 226642
[10] 226643
[11] 226644
TensorBoard 1.13.0a0 at http://<hostname>:6011 (Press CTRL+C to quit)
TensorBoard 1.13.0a0 at http://<hostname>:6013 (Press CTRL+C to quit)
TensorBoard 1.13.0a0 at http://<hostname>:6015 (Press CTRL+C to quit)
TensorBoard 1.13.0a0 at http://<hostname>:6012 (Press CTRL+C to quit)
TensorBoard 1.13.0a0 at http://<hostname>:6014 (Press CTRL+C to quit)
TensorBoard 1.13.0a0 at http://<hostname>:6009 (Press CTRL+C to quit)
TensorBoard 1.13.0a0 at http://<hostname>:6010 (Press CTRL+C to quit)

$ ./bazel-bin/tensorboard/tensorboard --logdir ./logs/ & sleep 2
[12] 227173
E0214 15:35:33.812504 140362460620544 program.py:232] TensorBoard could not bind to any port around 6006 (tried 10 times)
ERROR: TensorBoard could not bind to any port around 6006 (tried 10 times)

$ jobs -p | xargs kill

wchargin-branch: port-search

Summary: This patch teaches TensorBoard about `--port=default` (which is the new default value for `--port`). If `--port=default` is specified, TensorBoard will attempt to bind to port 6006, just as prior to this patch. If this fails, TensorBoard will try to bind incrementally to ports 6007, 6008, …, giving up after 10 attempts. If an explicit port is specified, TensorBoard will not try to search for other ports; it will fail (as before) if it cannot bind. This is the only way in which `--port=default` and `--port=6006` differ. This patch also changes the behavior of one edge case, at least on Linux. Prior to this patch, explicit port numbers were interpreted modulo 65536 (TCP ports are u16s), so `--port=71542` and `--port=6006` had the same behavior. It is now an error on all platforms to specify a port that does not fit in a u16. Resolves #1848. Test Plan: $ bazel build //tensorboard $ ./bazel-bin/tensorboard/tensorboard --logdir ./logs/ & sleep 2 [1] 223547 TensorBoard 1.13.0a0 at http://<hostname>:6006 (Press CTRL+C to quit) $ ./bazel-bin/tensorboard/tensorboard --logdir ./logs/ & sleep 2 [2] 223591 TensorBoard 1.13.0a0 at http://<hostname>:6007 (Press CTRL+C to quit) $ ./bazel-bin/tensorboard/tensorboard --logdir ./logs/ --port default & sleep 2 [3] 224292 TensorBoard 1.13.0a0 at http://<hostname>:6008 (Press CTRL+C to quit) $ ./bazel-bin/tensorboard/tensorboard --logdir ./logs/ --port 6006 & sleep 2 # should fail [4] 225773 E0214 15:35:08.939894 140033352959744 program.py:232] TensorBoard could not bind to port 6006, it was already in use ERROR: TensorBoard could not bind to port 6006, it was already in use [4]+ Exit 255 ./bazel-bin/tensorboard/tensorboard --logdir ./logs/ --port 6006 $ ./bazel-bin/tensorboard/tensorboard --logdir ./logs/ --port 0 & sleep 2 [4] 226298 TensorBoard 1.13.0a0 at http://<hostname>:33673 (Press CTRL+C to quit) $ for i in {6009..6015}; do ./bazel-bin/tensorboard/tensorboard --logdir ./logs/ --port "$i" & done && sleep 2 [5] 226638 [6] 226639 [7] 226640 [8] 226641 [9] 226642 [10] 226643 [11] 226644 TensorBoard 1.13.0a0 at http://<hostname>:6011 (Press CTRL+C to quit) TensorBoard 1.13.0a0 at http://<hostname>:6013 (Press CTRL+C to quit) TensorBoard 1.13.0a0 at http://<hostname>:6015 (Press CTRL+C to quit) TensorBoard 1.13.0a0 at http://<hostname>:6012 (Press CTRL+C to quit) TensorBoard 1.13.0a0 at http://<hostname>:6014 (Press CTRL+C to quit) TensorBoard 1.13.0a0 at http://<hostname>:6009 (Press CTRL+C to quit) TensorBoard 1.13.0a0 at http://<hostname>:6010 (Press CTRL+C to quit) $ ./bazel-bin/tensorboard/tensorboard --logdir ./logs/ & sleep 2 [12] 227173 E0214 15:35:33.812504 140362460620544 program.py:232] TensorBoard could not bind to any port around 6006 (tried 10 times) ERROR: TensorBoard could not bind to any port around 6006 (tried 10 times) $ jobs -p | xargs kill wchargin-branch: port-search

wchargin · 2019-02-15T00:14:34Z

(Recommend reviewing with “Hide whitespace changes”/git show -w.)

tensorboard/program.py

manivaradarajan · 2019-02-15T05:52:43Z

tensorboard/program.py

+        super(WerkzeugServer, self).__init__(host, port, wsgi_app)
+        break
+      except socket.error as e:
+        if hasattr(errno, 'EACCES') and e.errno == errno.EACCES:


Any reason you're not using os.strerror()?
It's ok to add additional strings for specific context, but I've always seen the standard strerror() used as well.

This code is not new in this PR; it just got indented by two extra
spaces. You can ask GitHub to omit whitespace-only diffs under the “Diff
settings” button at the top of the page:

To the question: I didn’t write the original code, but using strerror
would be fine with me (not that this it should be changed in this PR).

manivaradarajan · 2019-02-15T06:00:38Z

tensorboard/program.py

+        super(WerkzeugServer, self).__init__(host, port, wsgi_app)
+        break
+      except socket.error as e:
+        if hasattr(errno, 'EACCES') and e.errno == errno.EACCES:


Why do you need to check both the attr and e.errno?

The examples I'm seeing just do

e.errno == errno.EACCES

The code in question is not new in this PR, but I agree with the
conservative hasattr: not all platforms have all error codes, and thus
the values available on the errno module are platform-dependent. (In
practice, both Linux and Windows support the four codes checked here.)
I probably would have written this as

if e.errno == getattr(errno, "EADDRINUSE", None):

because it’s simpler, faster, and avoids hasattr, but it’s also
fine as written.

manivaradarajan · 2019-02-15T06:04:26Z

tensorboard/program.py

-        else:
+              'TensorBoard could not bind to unavailable address %s' % host)
+        elif hasattr(errno, 'EAFNOSUPPORT') and e.errno == errno.EAFNOSUPPORT:
          raise TensorBoardServerException(


It's unfortunate Python2 doesn't have exception chaining. It would be ideal to capture the true cause and not swallow it. Any ideas on how to do that?

The code in question is not new in this PR, but: no, unfortunately.
Python exception objects don’t store the traceback at which they were
created or thrown. You can extract the traceback corresponding to the
current exception handler manually from sys.exc_info() and store it
on the exception, but the calling code would need to know how to extract
it and display it. I generally only do this when writing libraries that
operate generically over exceptions (e.g., LazyRace.get()).

In this particular case, I think that it’s probably fine. We’ve
completely characterized the contents of the original error already
(it is determined by its errno), so the only thing that we’re missing
is the traceback into the Werkzeug and Python standard library
internals—but the traceback past this point can be reasonably considered
an implementation detail of TensorBoard’s WerkzeugServer.

tensorboard/program.py

wchargin-branch: port-search

wchargin · 2019-02-15T19:50:44Z

@manivaradarajan is out of office; re-assigning review to @stephanwlee.

Reviewer is OOO, comments addressed, PR approved by team member.

nfelt · 2019-02-15T20:06:51Z

tensorboard/program.py

      # Without an explicit host, we default to serving on all interfaces,
      # and will attempt to serve both IPv4 and IPv6 traffic through one socket.
-      host = self._get_wildcard_address(flags.port)
+      host = self._get_wildcard_address(base_port)


Strictly speaking this should be done per-port right? Though I agree it would be exotic to have the different ports have different wildcard addresses.

That’s true. I can move this, I think.

nfelt · 2019-02-15T20:07:05Z

tensorboard/program.py

+          'TensorBoard cannot bind to port %d > %d' % (base_port, 0xFFFF)
+      )
+    max_attempts = 10 if should_scan else 1
+    base_port = min(base_port + max_attempts, 65536) - max_attempts


Can you use either 0xFFFF or 65536 consistently?

Sure: 0xFFFF and 0x10000 is probably cleaner.

nfelt · 2019-02-15T20:09:23Z

tensorboard/program.py

+
+    # base_port: what's the first port to which we should try to bind?
+    # should_scan: if that fails, shall we try additional ports?
+    (base_port, should_scan) = (


A multi-valued, multi-line ternary is kind of a complicated beast to read. Maybe consider just doing

should_scan = flags.port is None base_port = flags.port if flags.port is not None else core_plugin.DEFAULT_PORT

Agreed; that’s cleaner. Will patch.

nfelt · 2019-02-15T20:10:23Z

tensorboard/program.py

+    for (attempt_index, port) in (
+        enumerate(xrange(base_port, base_port + max_attempts))):
+      try:
+        # Yes, this invokes the super initializer potentially many


This is kind of scary... invoking super() multiple times seems like asking for trouble if the base class changes so this isn't safe. Maybe add a TODO to revisit this in a way that doesn't require that? (i.e. use a nested server class)

I agree that it’s scary—not that there’s anything inherently
problematic with invoking __init__ multiple times (it’s a normal
method: an initializer, not a constructor), but because so uncustomary
that it’s probably reasonable for the base classes to make changes under
the assumption that __init__ is only called once.

I will add the TODO (and probably should have done that originally).
Using a nested server is what I was thinking of, too; the change should
be mostly straightforward, though not necessarily trivial.

FWIW, I actually did check every supertype in the mro, so I think it’s
safe with current library versions, though of course I may have missed
something.

@nfelt

Summary: In response to [a helpful post-commit review from @nfelt][1]. Test Plan: Most of the changes are non-functional. The wildcard address resolution logic did change, but we don’t expect the behavior to. The test plan from #1851 suffices, plus explicitly binding `--host` to each of `localhost`, `"$(hostname)"`, `"$(hostname -s)"`, `127.0.0.1`, and `::` to verify that each works. [1]: #1851 (review) wchargin-branch: port-search-revisions

…#6560) ## Motivation for features / changes #6552 ## Technical description of changes The port scanning logic originally added in #1851 does not seem to work anymore. WerkzeugServer no longer raises an Error when attempting to bind to a port which is in used and instead handles the issue silently. The current Werkzeug version is 2.3.7 but I had 2.3.4 installed when I first verified this. I tried installing 2.2.0 and the issues was not present so I assume it was introduced in 2.3.0. https://pypi.org/project/Werkzeug ## Detailed steps to verify changes work correctly (as executed by you) 1) Start TensorBoard, note the port it is bound to (it should be 6006) 2) In a separate terminal start TensorBoard again. 3) TensorBoard should successfully bind to a different port (this time 6007)

wchargin added type:feature core:backend labels Feb 15, 2019

wchargin requested a review from manivaradarajan February 15, 2019 00:01

manivaradarajan previously requested changes Feb 15, 2019

View reviewed changes

review: centralize _auto_wildcard decl + init

e85757f

wchargin-branch: port-search

wchargin requested review from manivaradarajan and stephanwlee and removed request for manivaradarajan February 15, 2019 19:49

stephanwlee approved these changes Feb 15, 2019

View reviewed changes

wchargin merged commit 2681aed into master Feb 15, 2019

wchargin deleted the wchargin-port-search branch February 15, 2019 20:01

nfelt reviewed Feb 15, 2019

View reviewed changes

wchargin mentioned this pull request Feb 16, 2019

server: amend #1851’s port-crawling implementation #1858

Merged

rileyajones mentioned this pull request Aug 28, 2023

Bug Fix: Update the port scanning logic to properly skip ports in use #6560

Merged

server: when no port is specified, try to find one #1851

server: when no port is specified, try to find one #1851

Uh oh!

Conversation

wchargin commented Feb 15, 2019

Uh oh!

wchargin commented Feb 15, 2019

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

wchargin commented Feb 15, 2019

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants