-
Notifications
You must be signed in to change notification settings - Fork 1.7k
server: when no port is specified, try to find one #1851
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Summary: This patch teaches TensorBoard about `--port=default` (which is the new default value for `--port`). If `--port=default` is specified, TensorBoard will attempt to bind to port 6006, just as prior to this patch. If this fails, TensorBoard will try to bind incrementally to ports 6007, 6008, …, giving up after 10 attempts. If an explicit port is specified, TensorBoard will not try to search for other ports; it will fail (as before) if it cannot bind. This is the only way in which `--port=default` and `--port=6006` differ. This patch also changes the behavior of one edge case, at least on Linux. Prior to this patch, explicit port numbers were interpreted modulo 65536 (TCP ports are u16s), so `--port=71542` and `--port=6006` had the same behavior. It is now an error on all platforms to specify a port that does not fit in a u16. Resolves #1848. Test Plan: $ bazel build //tensorboard $ ./bazel-bin/tensorboard/tensorboard --logdir ./logs/ & sleep 2 [1] 223547 TensorBoard 1.13.0a0 at http://<hostname>:6006 (Press CTRL+C to quit) $ ./bazel-bin/tensorboard/tensorboard --logdir ./logs/ & sleep 2 [2] 223591 TensorBoard 1.13.0a0 at http://<hostname>:6007 (Press CTRL+C to quit) $ ./bazel-bin/tensorboard/tensorboard --logdir ./logs/ --port default & sleep 2 [3] 224292 TensorBoard 1.13.0a0 at http://<hostname>:6008 (Press CTRL+C to quit) $ ./bazel-bin/tensorboard/tensorboard --logdir ./logs/ --port 6006 & sleep 2 # should fail [4] 225773 E0214 15:35:08.939894 140033352959744 program.py:232] TensorBoard could not bind to port 6006, it was already in use ERROR: TensorBoard could not bind to port 6006, it was already in use [4]+ Exit 255 ./bazel-bin/tensorboard/tensorboard --logdir ./logs/ --port 6006 $ ./bazel-bin/tensorboard/tensorboard --logdir ./logs/ --port 0 & sleep 2 [4] 226298 TensorBoard 1.13.0a0 at http://<hostname>:33673 (Press CTRL+C to quit) $ for i in {6009..6015}; do ./bazel-bin/tensorboard/tensorboard --logdir ./logs/ --port "$i" & done && sleep 2 [5] 226638 [6] 226639 [7] 226640 [8] 226641 [9] 226642 [10] 226643 [11] 226644 TensorBoard 1.13.0a0 at http://<hostname>:6011 (Press CTRL+C to quit) TensorBoard 1.13.0a0 at http://<hostname>:6013 (Press CTRL+C to quit) TensorBoard 1.13.0a0 at http://<hostname>:6015 (Press CTRL+C to quit) TensorBoard 1.13.0a0 at http://<hostname>:6012 (Press CTRL+C to quit) TensorBoard 1.13.0a0 at http://<hostname>:6014 (Press CTRL+C to quit) TensorBoard 1.13.0a0 at http://<hostname>:6009 (Press CTRL+C to quit) TensorBoard 1.13.0a0 at http://<hostname>:6010 (Press CTRL+C to quit) $ ./bazel-bin/tensorboard/tensorboard --logdir ./logs/ & sleep 2 [12] 227173 E0214 15:35:33.812504 140362460620544 program.py:232] TensorBoard could not bind to any port around 6006 (tried 10 times) ERROR: TensorBoard could not bind to any port around 6006 (tried 10 times) $ jobs -p | xargs kill wchargin-branch: port-search
|
(Recommend reviewing with “Hide whitespace changes”/ |
| super(WerkzeugServer, self).__init__(host, port, wsgi_app) | ||
| break | ||
| except socket.error as e: | ||
| if hasattr(errno, 'EACCES') and e.errno == errno.EACCES: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason you're not using os.strerror()?
It's ok to add additional strings for specific context, but I've always seen the standard strerror() used as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code is not new in this PR; it just got indented by two extra
spaces. You can ask GitHub to omit whitespace-only diffs under the “Diff
settings” button at the top of the page:
To the question: I didn’t write the original code, but using strerror
would be fine with me (not that this it should be changed in this PR).
| super(WerkzeugServer, self).__init__(host, port, wsgi_app) | ||
| break | ||
| except socket.error as e: | ||
| if hasattr(errno, 'EACCES') and e.errno == errno.EACCES: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you need to check both the attr and e.errno?
The examples I'm seeing just do
e.errno == errno.EACCES
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code in question is not new in this PR, but I agree with the
conservative hasattr: not all platforms have all error codes, and thus
the values available on the errno module are platform-dependent. (In
practice, both Linux and Windows support the four codes checked here.)
I probably would have written this as
if e.errno == getattr(errno, "EADDRINUSE", None):because it’s simpler, faster, and avoids hasattr, but it’s also
fine as written.
| else: | ||
| 'TensorBoard could not bind to unavailable address %s' % host) | ||
| elif hasattr(errno, 'EAFNOSUPPORT') and e.errno == errno.EAFNOSUPPORT: | ||
| raise TensorBoardServerException( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's unfortunate Python2 doesn't have exception chaining. It would be ideal to capture the true cause and not swallow it. Any ideas on how to do that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code in question is not new in this PR, but: no, unfortunately.
Python exception objects don’t store the traceback at which they were
created or thrown. You can extract the traceback corresponding to the
current exception handler manually from sys.exc_info() and store it
on the exception, but the calling code would need to know how to extract
it and display it. I generally only do this when writing libraries that
operate generically over exceptions (e.g., LazyRace.get()).
In this particular case, I think that it’s probably fine. We’ve
completely characterized the contents of the original error already
(it is determined by its errno), so the only thing that we’re missing
is the traceback into the Werkzeug and Python standard library
internals—but the traceback past this point can be reasonably considered
an implementation detail of TensorBoard’s WerkzeugServer.
wchargin-branch: port-search
|
@manivaradarajan is out of office; re-assigning review to @stephanwlee. |
Reviewer is OOO, comments addressed, PR approved by team member.
| # Without an explicit host, we default to serving on all interfaces, | ||
| # and will attempt to serve both IPv4 and IPv6 traffic through one socket. | ||
| host = self._get_wildcard_address(flags.port) | ||
| host = self._get_wildcard_address(base_port) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Strictly speaking this should be done per-port right? Though I agree it would be exotic to have the different ports have different wildcard addresses.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That’s true. I can move this, I think.
| 'TensorBoard cannot bind to port %d > %d' % (base_port, 0xFFFF) | ||
| ) | ||
| max_attempts = 10 if should_scan else 1 | ||
| base_port = min(base_port + max_attempts, 65536) - max_attempts |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you use either 0xFFFF or 65536 consistently?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure: 0xFFFF and 0x10000 is probably cleaner.
|
|
||
| # base_port: what's the first port to which we should try to bind? | ||
| # should_scan: if that fails, shall we try additional ports? | ||
| (base_port, should_scan) = ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A multi-valued, multi-line ternary is kind of a complicated beast to read. Maybe consider just doing
should_scan = flags.port is None
base_port = flags.port if flags.port is not None else core_plugin.DEFAULT_PORTThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed; that’s cleaner. Will patch.
| for (attempt_index, port) in ( | ||
| enumerate(xrange(base_port, base_port + max_attempts))): | ||
| try: | ||
| # Yes, this invokes the super initializer potentially many |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is kind of scary... invoking super() multiple times seems like asking for trouble if the base class changes so this isn't safe. Maybe add a TODO to revisit this in a way that doesn't require that? (i.e. use a nested server class)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that it’s scary—not that there’s anything inherently
problematic with invoking __init__ multiple times (it’s a normal
method: an initializer, not a constructor), but because so uncustomary
that it’s probably reasonable for the base classes to make changes under
the assumption that __init__ is only called once.
I will add the TODO (and probably should have done that originally).
Using a nested server is what I was thinking of, too; the change should
be mostly straightforward, though not necessarily trivial.
FWIW, I actually did check every supertype in the mro, so I think it’s
safe with current library versions, though of course I may have missed
something.
Summary: In response to [a helpful post-commit review from @nfelt][1]. Test Plan: Most of the changes are non-functional. The wildcard address resolution logic did change, but we don’t expect the behavior to. The test plan from #1851 suffices, plus explicitly binding `--host` to each of `localhost`, `"$(hostname)"`, `"$(hostname -s)"`, `127.0.0.1`, and `::` to verify that each works. [1]: #1851 (review) wchargin-branch: port-search-revisions
…#6560) ## Motivation for features / changes #6552 ## Technical description of changes The port scanning logic originally added in #1851 does not seem to work anymore. WerkzeugServer no longer raises an Error when attempting to bind to a port which is in used and instead handles the issue silently. The current Werkzeug version is 2.3.7 but I had 2.3.4 installed when I first verified this. I tried installing 2.2.0 and the issues was not present so I assume it was introduced in 2.3.0. https://pypi.org/project/Werkzeug ## Detailed steps to verify changes work correctly (as executed by you) 1) Start TensorBoard, note the port it is bound to (it should be 6006) 2) In a separate terminal start TensorBoard again. 3) TensorBoard should successfully bind to a different port (this time 6007)

Summary:
This patch teaches TensorBoard about
--port=default(which is the newdefault value for
--port). If--port=defaultis specified,TensorBoard will attempt to bind to port 6006, just as prior to this
patch. If this fails, TensorBoard will try to bind incrementally to
ports 6007, 6008, …, giving up after 10 attempts.
If an explicit port is specified, TensorBoard will not try to search for
other ports; it will fail (as before) if it cannot bind. This is the
only way in which
--port=defaultand--port=6006differ.This patch also changes the behavior of one edge case, at least on
Linux. Prior to this patch, explicit port numbers were interpreted
modulo 65536 (TCP ports are u16s), so
--port=71542and--port=6006had the same behavior. It is now an error on all platforms to specify a
port that does not fit in a u16.
Resolves #1848.
Test Plan:
wchargin-branch: port-search