Skip to content

Conversation

@wchargin
Copy link
Contributor

Summary:
This patch teaches TensorBoard about --port=default (which is the new
default value for --port). If --port=default is specified,
TensorBoard will attempt to bind to port 6006, just as prior to this
patch. If this fails, TensorBoard will try to bind incrementally to
ports 6007, 6008, …, giving up after 10 attempts.

If an explicit port is specified, TensorBoard will not try to search for
other ports; it will fail (as before) if it cannot bind. This is the
only way in which --port=default and --port=6006 differ.

This patch also changes the behavior of one edge case, at least on
Linux. Prior to this patch, explicit port numbers were interpreted
modulo 65536 (TCP ports are u16s), so --port=71542 and --port=6006
had the same behavior. It is now an error on all platforms to specify a
port that does not fit in a u16.

Resolves #1848.

Test Plan:

$ bazel build //tensorboard

$ ./bazel-bin/tensorboard/tensorboard --logdir ./logs/ & sleep 2
[1] 223547
TensorBoard 1.13.0a0 at http://<hostname>:6006 (Press CTRL+C to quit)

$ ./bazel-bin/tensorboard/tensorboard --logdir ./logs/ & sleep 2
[2] 223591
TensorBoard 1.13.0a0 at http://<hostname>:6007 (Press CTRL+C to quit)

$ ./bazel-bin/tensorboard/tensorboard --logdir ./logs/ --port default & sleep 2
[3] 224292
TensorBoard 1.13.0a0 at http://<hostname>:6008 (Press CTRL+C to quit)

$ ./bazel-bin/tensorboard/tensorboard --logdir ./logs/ --port 6006 & sleep 2  # should fail
[4] 225773
E0214 15:35:08.939894 140033352959744 program.py:232] TensorBoard could not bind to port 6006, it was already in use
ERROR: TensorBoard could not bind to port 6006, it was already in use
[4]+  Exit 255                ./bazel-bin/tensorboard/tensorboard --logdir ./logs/ --port 6006

$ ./bazel-bin/tensorboard/tensorboard --logdir ./logs/ --port 0 & sleep 2
[4] 226298
TensorBoard 1.13.0a0 at http://<hostname>:33673 (Press CTRL+C to quit)

$ for i in {6009..6015}; do ./bazel-bin/tensorboard/tensorboard --logdir ./logs/ --port "$i" & done && sleep 2
[5] 226638
[6] 226639
[7] 226640
[8] 226641
[9] 226642
[10] 226643
[11] 226644
TensorBoard 1.13.0a0 at http://<hostname>:6011 (Press CTRL+C to quit)
TensorBoard 1.13.0a0 at http://<hostname>:6013 (Press CTRL+C to quit)
TensorBoard 1.13.0a0 at http://<hostname>:6015 (Press CTRL+C to quit)
TensorBoard 1.13.0a0 at http://<hostname>:6012 (Press CTRL+C to quit)
TensorBoard 1.13.0a0 at http://<hostname>:6014 (Press CTRL+C to quit)
TensorBoard 1.13.0a0 at http://<hostname>:6009 (Press CTRL+C to quit)
TensorBoard 1.13.0a0 at http://<hostname>:6010 (Press CTRL+C to quit)

$ ./bazel-bin/tensorboard/tensorboard --logdir ./logs/ & sleep 2
[12] 227173
E0214 15:35:33.812504 140362460620544 program.py:232] TensorBoard could not bind to any port around 6006 (tried 10 times)
ERROR: TensorBoard could not bind to any port around 6006 (tried 10 times)

$ jobs -p | xargs kill

wchargin-branch: port-search

Summary:
This patch teaches TensorBoard about `--port=default` (which is the new
default value for `--port`). If `--port=default` is specified,
TensorBoard will attempt to bind to port 6006, just as prior to this
patch. If this fails, TensorBoard will try to bind incrementally to
ports 6007, 6008, …, giving up after 10 attempts.

If an explicit port is specified, TensorBoard will not try to search for
other ports; it will fail (as before) if it cannot bind. This is the
only way in which `--port=default` and `--port=6006` differ.

This patch also changes the behavior of one edge case, at least on
Linux. Prior to this patch, explicit port numbers were interpreted
modulo 65536 (TCP ports are u16s), so `--port=71542` and `--port=6006`
had the same behavior. It is now an error on all platforms to specify a
port that does not fit in a u16.

Resolves #1848.

Test Plan:

    $ bazel build //tensorboard

    $ ./bazel-bin/tensorboard/tensorboard --logdir ./logs/ & sleep 2
    [1] 223547
    TensorBoard 1.13.0a0 at http://<hostname>:6006 (Press CTRL+C to quit)

    $ ./bazel-bin/tensorboard/tensorboard --logdir ./logs/ & sleep 2
    [2] 223591
    TensorBoard 1.13.0a0 at http://<hostname>:6007 (Press CTRL+C to quit)

    $ ./bazel-bin/tensorboard/tensorboard --logdir ./logs/ --port default & sleep 2
    [3] 224292
    TensorBoard 1.13.0a0 at http://<hostname>:6008 (Press CTRL+C to quit)

    $ ./bazel-bin/tensorboard/tensorboard --logdir ./logs/ --port 6006 & sleep 2  # should fail
    [4] 225773
    E0214 15:35:08.939894 140033352959744 program.py:232] TensorBoard could not bind to port 6006, it was already in use
    ERROR: TensorBoard could not bind to port 6006, it was already in use
    [4]+  Exit 255                ./bazel-bin/tensorboard/tensorboard --logdir ./logs/ --port 6006

    $ ./bazel-bin/tensorboard/tensorboard --logdir ./logs/ --port 0 & sleep 2
    [4] 226298
    TensorBoard 1.13.0a0 at http://<hostname>:33673 (Press CTRL+C to quit)

    $ for i in {6009..6015}; do ./bazel-bin/tensorboard/tensorboard --logdir ./logs/ --port "$i" & done && sleep 2
    [5] 226638
    [6] 226639
    [7] 226640
    [8] 226641
    [9] 226642
    [10] 226643
    [11] 226644
    TensorBoard 1.13.0a0 at http://<hostname>:6011 (Press CTRL+C to quit)
    TensorBoard 1.13.0a0 at http://<hostname>:6013 (Press CTRL+C to quit)
    TensorBoard 1.13.0a0 at http://<hostname>:6015 (Press CTRL+C to quit)
    TensorBoard 1.13.0a0 at http://<hostname>:6012 (Press CTRL+C to quit)
    TensorBoard 1.13.0a0 at http://<hostname>:6014 (Press CTRL+C to quit)
    TensorBoard 1.13.0a0 at http://<hostname>:6009 (Press CTRL+C to quit)
    TensorBoard 1.13.0a0 at http://<hostname>:6010 (Press CTRL+C to quit)

    $ ./bazel-bin/tensorboard/tensorboard --logdir ./logs/ & sleep 2
    [12] 227173
    E0214 15:35:33.812504 140362460620544 program.py:232] TensorBoard could not bind to any port around 6006 (tried 10 times)
    ERROR: TensorBoard could not bind to any port around 6006 (tried 10 times)

    $ jobs -p | xargs kill

wchargin-branch: port-search
@wchargin
Copy link
Contributor Author

(Recommend reviewing with “Hide whitespace changes”/git show -w.)

super(WerkzeugServer, self).__init__(host, port, wsgi_app)
break
except socket.error as e:
if hasattr(errno, 'EACCES') and e.errno == errno.EACCES:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason you're not using os.strerror()?
It's ok to add additional strings for specific context, but I've always seen the standard strerror() used as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code is not new in this PR; it just got indented by two extra
spaces. You can ask GitHub to omit whitespace-only diffs under the “Diff
settings” button at the top of the page:

image

To the question: I didn’t write the original code, but using strerror
would be fine with me (not that this it should be changed in this PR).

super(WerkzeugServer, self).__init__(host, port, wsgi_app)
break
except socket.error as e:
if hasattr(errno, 'EACCES') and e.errno == errno.EACCES:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you need to check both the attr and e.errno?

The examples I'm seeing just do

e.errno == errno.EACCES

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code in question is not new in this PR, but I agree with the
conservative hasattr: not all platforms have all error codes, and thus
the values available on the errno module are platform-dependent. (In
practice, both Linux and Windows support the four codes checked here.)
I probably would have written this as

if e.errno == getattr(errno, "EADDRINUSE", None):

because it’s simpler, faster, and avoids hasattr, but it’s also
fine as written.

else:
'TensorBoard could not bind to unavailable address %s' % host)
elif hasattr(errno, 'EAFNOSUPPORT') and e.errno == errno.EAFNOSUPPORT:
raise TensorBoardServerException(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's unfortunate Python2 doesn't have exception chaining. It would be ideal to capture the true cause and not swallow it. Any ideas on how to do that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code in question is not new in this PR, but: no, unfortunately.
Python exception objects don’t store the traceback at which they were
created or thrown. You can extract the traceback corresponding to the
current exception handler manually from sys.exc_info() and store it
on the exception, but the calling code would need to know how to extract
it and display it. I generally only do this when writing libraries that
operate generically over exceptions (e.g., LazyRace.get()).

In this particular case, I think that it’s probably fine. We’ve
completely characterized the contents of the original error already
(it is determined by its errno), so the only thing that we’re missing
is the traceback into the Werkzeug and Python standard library
internals—but the traceback past this point can be reasonably considered
an implementation detail of TensorBoard’s WerkzeugServer.

@wchargin wchargin requested review from manivaradarajan and stephanwlee and removed request for manivaradarajan February 15, 2019 19:49
@wchargin
Copy link
Contributor Author

@manivaradarajan is out of office; re-assigning review to @stephanwlee.

@wchargin wchargin dismissed manivaradarajan’s stale review February 15, 2019 20:01

Reviewer is OOO, comments addressed, PR approved by team member.

@wchargin wchargin merged commit 2681aed into master Feb 15, 2019
@wchargin wchargin deleted the wchargin-port-search branch February 15, 2019 20:01
# Without an explicit host, we default to serving on all interfaces,
# and will attempt to serve both IPv4 and IPv6 traffic through one socket.
host = self._get_wildcard_address(flags.port)
host = self._get_wildcard_address(base_port)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Strictly speaking this should be done per-port right? Though I agree it would be exotic to have the different ports have different wildcard addresses.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That’s true. I can move this, I think.

'TensorBoard cannot bind to port %d > %d' % (base_port, 0xFFFF)
)
max_attempts = 10 if should_scan else 1
base_port = min(base_port + max_attempts, 65536) - max_attempts
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you use either 0xFFFF or 65536 consistently?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure: 0xFFFF and 0x10000 is probably cleaner.


# base_port: what's the first port to which we should try to bind?
# should_scan: if that fails, shall we try additional ports?
(base_port, should_scan) = (
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A multi-valued, multi-line ternary is kind of a complicated beast to read. Maybe consider just doing

should_scan = flags.port is None
base_port = flags.port if flags.port is not None else core_plugin.DEFAULT_PORT

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed; that’s cleaner. Will patch.

for (attempt_index, port) in (
enumerate(xrange(base_port, base_port + max_attempts))):
try:
# Yes, this invokes the super initializer potentially many
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is kind of scary... invoking super() multiple times seems like asking for trouble if the base class changes so this isn't safe. Maybe add a TODO to revisit this in a way that doesn't require that? (i.e. use a nested server class)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that it’s scary—not that there’s anything inherently
problematic with invoking __init__ multiple times (it’s a normal
method: an initializer, not a constructor), but because so uncustomary
that it’s probably reasonable for the base classes to make changes under
the assumption that __init__ is only called once.

I will add the TODO (and probably should have done that originally).
Using a nested server is what I was thinking of, too; the change should
be mostly straightforward, though not necessarily trivial.

FWIW, I actually did check every supertype in the mro, so I think it’s
safe with current library versions, though of course I may have missed
something.

wchargin added a commit that referenced this pull request Feb 16, 2019
Summary:
In response to [a helpful post-commit review from @nfelt][1].

Test Plan:
Most of the changes are non-functional. The wildcard address resolution
logic did change, but we don’t expect the behavior to. The test plan
from #1851 suffices, plus explicitly binding `--host` to each of
`localhost`, `"$(hostname)"`, `"$(hostname -s)"`, `127.0.0.1`, and `::`
to verify that each works.

[1]: #1851 (review)

wchargin-branch: port-search-revisions
rileyajones added a commit that referenced this pull request Aug 29, 2023
…#6560)

## Motivation for features / changes
#6552

## Technical description of changes
The port scanning logic originally added in #1851 does not seem to work
anymore. WerkzeugServer no longer raises an Error when attempting to
bind to a port which is in used and instead handles the issue silently.

The current Werkzeug version is 2.3.7 but I had 2.3.4 installed when I
first verified this. I tried installing 2.2.0 and the issues was not
present so I assume it was introduced in 2.3.0.
https://pypi.org/project/Werkzeug

## Detailed steps to verify changes work correctly (as executed by you)
1) Start TensorBoard, note the port it is bound to (it should be 6006)
2) In a separate terminal start TensorBoard again.
3) TensorBoard should successfully bind to a different port (this time
6007)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants