Support HTTPS calls to clusters #114

Merged
merged 18 commits into from
Nov 2, 2023
Conversation

jlewitt1
Collaborator

@jlewitt1 jlewitt1 commented Sep 28, 2023

Adds option for starting up the Runhouse API server on the cluster with HTTPS. This makes it incredibly fast and easy to stand up a microservice with standard bearer token authentication (using a Runhouse token), allowing users to share Runhouse resources with collaborators, teams, customers, etc.

Highlights:

New server connection types:

  • ssh: Connects to the cluster via an SSH tunnel.
  • tls: Connects to the cluster via HTTPS (port 443) and enforces verification via TLS certificates. Only users with a valid cert will be able to make requests to the API server.
  • none: Does not use any port forwarding or enforce any authentication. Connects to the cluster via HTTP (port 80).
  • aws_ssm: Uses AWS Systems Manager to create an SSH tunnel to the cluster. Note: this is currently only relevant for SageMaker clusters.
  • paramiko: Uses Paramiko to create an SSH tunnel to the cluster. This is relevant if you are using a cluster which requires existing credentials (e.g. a password).

The API server will be started by default on port 32300.
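
The connection types and ports listed above can be summarized in a short sketch. The enum values mirror the names in this description; the `default_client_port` helper is hypothetical (not part of Runhouse), and the assumption that tunnel-based types target the server port directly is mine.

```python
from enum import Enum

# Ports named in the PR description.
DEFAULT_SERVER_PORT = 32300  # port the API server itself listens on
DEFAULT_HTTP_PORT = 80
DEFAULT_HTTPS_PORT = 443


class ServerConnectionType(str, Enum):
    SSH = "ssh"
    TLS = "tls"
    NONE = "none"
    AWS_SSM = "aws_ssm"
    PARAMIKO = "paramiko"


def default_client_port(conn_type: ServerConnectionType) -> int:
    """Hypothetical helper: the port a client would target for a connection type."""
    if conn_type is ServerConnectionType.TLS:
        return DEFAULT_HTTPS_PORT
    if conn_type is ServerConnectionType.NONE:
        return DEFAULT_HTTP_PORT
    # Assumption: tunnel-based types (ssh, aws_ssm, paramiko) forward to the server port.
    return DEFAULT_SERVER_PORT
```

Looking up a type by its string value, e.g. `ServerConnectionType("tls")`, raises `ValueError` for an unknown name, which gives early validation of user input.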

Validating resource access on the cluster:

  • To start, general auth is applied based on the Runhouse token provided in the request. When enabled, the requesting user must have access to the cluster (as saved in Den).
  • When executing a function, the user must also have access to the underlying function resource.

NGINX (optional):

The Runhouse API server (a FastAPI app) will by default run on a higher, non-privileged port (32300). Nginx will run in front of uvicorn and serve as a reverse proxy, forwarding requests from port 80 (default for HTTP) or port 443 (default for HTTPS) to the API server's port.
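
As a rough illustration of the reverse-proxy arrangement, the sketch below renders a minimal Nginx server block for the plain HTTP case. The function name and layout are hypothetical and this is not the configuration shipped in the PR; an HTTPS block on port 443 would also need certificate directives, which are omitted here.

```python
def render_nginx_conf(listen_port: int = 80, upstream_port: int = 32300) -> str:
    """Render a minimal nginx server block that proxies to the local API server."""
    return (
        "server {\n"
        f"    listen {listen_port};\n"
        "    location / {\n"
        # Forward every request to the API server on its non-privileged port.
        f"        proxy_pass http://127.0.0.1:{upstream_port};\n"
        "        proxy_set_header Host $host;\n"
        "    }\n"
        "}\n"
    )


if __name__ == "__main__":
    print(render_nginx_conf())
```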

TODOs:

  • Add lots more tests (testing certs, performance, server connection types, den auth, nginx)
  • Smooth over flow of setting up nginx on the cluster when relevant
  • Implement Paramiko server connection logic

@jlewitt1 jlewitt1 marked this pull request as draft September 28, 2023 21:59
@jlewitt1 jlewitt1 force-pushed the cluster-auth branch 7 times, most recently from 7c1f6c9 to c9f6780 Compare October 1, 2023 18:58
@jlewitt1 jlewitt1 changed the title [WIP] Support HTTPS calls to clusters Support HTTPS calls to clusters Oct 3, 2023
@jlewitt1 jlewitt1 marked this pull request as ready for review October 4, 2023 07:34
@jlewitt1 jlewitt1 force-pushed the cluster-auth branch 7 times, most recently from 324cb6d to 54bdea7 Compare October 10, 2023 16:37
@jlewitt1 jlewitt1 force-pushed the cluster-auth branch 2 times, most recently from d498520 to 99d7ef3 Compare October 12, 2023 08:26
===========================
By default, Runhouse collects metadata from provisioned clusters and data relating to performance and error monitoring.
This data will only be used by Runhouse to improve the product.
No Personal Identifiable Information (PII) is collected and we will not sell or buy data about you.
Contributor

I would remove this line, it's more complex than it seems.

Runhouse provides a couple of options to manage the connection to the Runhouse API server running on a cluster.


API Server Connection
Contributor

To me everything that follows this should be in the Cluster docs, and everything before we already cover in the data collection doc. What do you think of removing this file and just putting the info below into the Cluster docs?

Contributor

Oh I see, you added the auth into here. I don't think they're really the same, so I would revert it (but still remove the line above), and just add the new auth info into the Cluster docs.

class Cluster(Resource):
    RESOURCE_TYPE = "cluster"
    REQUEST_TIMEOUT = 5  # seconds
    DEFAULT_HOST = "127.0.0.1"
Contributor

Worth noting that if the connection type is tls or none (i.e. the calls are over HTTP alone and not a tunnel), the server will need to start with a host of 0.0.0.0. So maybe we should just make that the default, and only use localhost when the user starts with a connection type of ssh.
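
The host selection suggested in this comment could look roughly like the following. The helper name is hypothetical, and treating every non-`tls`/`none` type as tunnel-based is an assumption drawn from the connection types listed in the PR description.

```python
LOCALHOST = "127.0.0.1"
ALL_INTERFACES = "0.0.0.0"


def default_server_host(server_connection_type: str) -> str:
    """Hypothetical helper: choose the bind address for the API server."""
    if server_connection_type in ("tls", "none"):
        # Direct HTTP(S) calls arrive from outside the machine, so bind to all interfaces.
        return ALL_INTERFACES
    # Tunnel-based connections terminate on the cluster itself, so localhost is enough.
    return LOCALHOST
```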

class Cluster(Resource):
    RESOURCE_TYPE = "cluster"
    REQUEST_TIMEOUT = 5  # seconds
    DEFAULT_HOST = "127.0.0.1"
    DEFAULT_HTTP_PORT = 50052
Contributor

Should we take the opportunity to change this now?

runhouse/resources/hardware/cluster.py Outdated
@wraps(func)
async def wrapper(*args, **kwargs):
    request: Request = kwargs.get("request")
    is_https: bool = request.url.scheme == "https"
Contributor

Won't this mean the user can bypass auth if they just send a regular HTTP request?
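
One way to address this concern is to make the auth check independent of the request scheme. The sketch below is self-contained and is not the PR's code: `FakeRequest`, `Unauthorized`, `VALID_TOKENS`, and `DEN_AUTH_ENABLED` are hypothetical stand-ins for the framework request object and the real token lookup.

```python
import asyncio
from dataclasses import dataclass, field
from functools import wraps


@dataclass
class FakeRequest:
    """Stand-in for a web framework request object (assumption for this sketch)."""
    scheme: str
    headers: dict = field(default_factory=dict)


class Unauthorized(Exception):
    pass


VALID_TOKENS = {"good-token"}  # placeholder for a real token lookup
DEN_AUTH_ENABLED = True


def validate_user(func):
    @wraps(func)
    async def wrapper(*args, **kwargs):
        request = kwargs.get("request")
        if DEN_AUTH_ENABLED:
            # Check the token on every request; the URL scheme is never consulted,
            # so a plain HTTP request cannot skip the check.
            auth = request.headers.get("Authorization", "")
            scheme, _, token = auth.partition(" ")
            if scheme != "Bearer" or token not in VALID_TOKENS:
                raise Unauthorized("missing or invalid bearer token")
        return await func(*args, **kwargs)

    return wrapper


@validate_user
async def ping(request: FakeRequest) -> str:
    return "ok"
```

With this shape, `asyncio.run(ping(request=FakeRequest("http")))` raises `Unauthorized` just as an HTTPS request without a token would.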

runhouse/servers/http/http_server.py Outdated
Comment on lines 155 to 160
DEN_AUTH = False
memory_exporter = None

# NOTE: This is a temp in-mem cache, we will move this out into the object store for future permissions support
AUTH_CACHE = {}

Contributor

Don't love using the class methods this way... but I don't know why. I think globals may be a bit safer.

Comment on lines 225 to 233
@classmethod
def get(cls, key) -> dict:
    """Get resources associated with a particular user"""
    return cls.AUTH_CACHE.get(key, {})

@classmethod
def put(cls, key, value):
    """Update server cache with a user's resources and access type"""
    cls.AUTH_CACHE[key] = value
Contributor

Confusing for these to collide with getting and putting resources
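
Taking this comment together with the earlier suggestion about globals, one possible shape is a module-level cache with accessor names that cannot be confused with getting and putting resources. The function names below are hypothetical and only illustrate the idea.

```python
from typing import Dict

# Module-level, in-memory cache keyed by a user identifier.
_AUTH_CACHE: Dict[str, dict] = {}


def get_user_resources(key: str) -> dict:
    """Return the resources cached for a user, or an empty dict if none are cached."""
    return _AUTH_CACHE.get(key, {})


def set_user_resources(key: str, resources: dict) -> None:
    """Store a user's resources and access types in the cache."""
    _AUTH_CACHE[key] = resources
```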

@staticmethod
@app.get("/cert")
@validate_user
def get_cert(request: Request, message: Message):
Contributor

Wait... is it a problem to be sending the cert over HTTP? Also, is it a problem to be sending the user's token over HTTP before they have the cert?


Removing Collected Data
------------------------------------
If you would like us to remove your collected data, please contact us.
Contributor

should we provide a link for contact (e.g. an email address or a contact form)?

The below options can be specified with the ``server_connection_type`` parameter
when :ref:`initializing a cluster <Cluster Factory Method>`:

- ``ssh``: Connects to the cluster via port forwarding. The API server will be started with HTTP.
Contributor

Do we need to say "on port 80", given our conversation earlier today?


- ``ssh``: Connects to the cluster via port forwarding. The API server will be started with HTTP.
- ``tls``: Connects to the cluster via port forwarding and enforces verification via TLS certificates. The API server
will be started with HTTPS. Only users with a valid cert will be able to make requests to the API server.
Contributor

"... on port 443"
?

- ``tls``: Connects to the cluster via port forwarding and enforces verification via TLS certificates. The API server
will be started with HTTPS. Only users with a valid cert will be able to make requests to the API server.
- ``none``: Does not use any port forwarding or enforce any authentication. The API server will be started
with HTTP.
Contributor

ditto

to enable token authentication. Runhouse will handle adding the token to each subsequent request as an auth header with
format: :code:`{"Authorization": "Bearer <token>"}`

Enabling TLS and Den Auth for the API server makes it incredibly fast and easy to stand up a microservice with
Contributor

Should we explicitly mention "Runhouse Den Auth" for folks who are dropping in just for this section / Google searchability?
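
The auth header format quoted in the docs excerpt above (`{"Authorization": "Bearer <token>"}`) can be illustrated with a small sketch. Both helper names are hypothetical; they only show how such a header might be built on the client and parsed on the server.

```python
from typing import Dict, Optional


def auth_headers(token: str) -> Dict[str, str]:
    """Build the bearer auth header for a request."""
    return {"Authorization": f"Bearer {token}"}


def token_from_headers(headers: Dict[str, str]) -> Optional[str]:
    """Extract the bearer token, or return None if the header is missing or malformed."""
    scheme, _, token = headers.get("Authorization", "").partition(" ")
    return token if scheme == "Bearer" and token else None
```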

docs/index.rst Outdated
@@ -82,7 +82,7 @@ Table of Contents

 debugging_logging
 troubleshooting
-data_collection
+auth_and_data_collection
 Source Code <https://github.com/run-house/runhouse>
 REST API Guide <https://api.run.house/docs>
 Dashboard <https://www.run.house/dashboard>
Contributor

Runhouse Den Dashboard?

restart_ray,
screen,
create_logfile=True,
host=None,
Contributor

Do we need to add a "port" parameter here and elsewhere in this file, or would this be handled outside of this PR?


logger = logging.getLogger(__name__)


class ServerConnectionType(Enum):
"""Enum to manage the type of connection Runhouse will make with the Runhouse API server started on the cluster.
``ssh``: Use port forwarding to connect to the server.
Contributor

"The API server will be started with HTTP."
?

@@ -24,6 +28,15 @@ def cluster(
host (str or List[str], optional): Hostname, IP address, or list of IP addresses for the BYO cluster.
ssh_creds (dict, optional): Dictionary mapping SSH credentials.
Example: ``ssh_creds={'ssh_user': '...', 'ssh_private_key':'<path_to_key>'}``
server_port (bool, optional): Port to use for the server. (Default: ``50052``).
Contributor

flagging both lines for a potential change given our conversation earlier today.

@jlewitt1 jlewitt1 force-pushed the cluster-auth branch 2 times, most recently from cd875d6 to d9bf096 Compare October 18, 2023 08:34
# Conflicts:
#	runhouse/resources/hardware/cluster.py
#	runhouse/servers/http/http_client.py
#	runhouse/servers/http/http_utils.py
@jlewitt1 jlewitt1 force-pushed the cluster-auth branch 2 times, most recently from 68bc94d to 8fbe7c5 Compare October 24, 2023 08:57
…put and waiting for successful Uvicorn start line. Much better visibility when server fails to start properly!

- Print server logfile output to terminal on `runhouse restart`, even when screen is enabled.
- Fix https function and cluster fixtures
- Change check to not require auth or Ray status, just send a ping
- Fix async bug with HTTP server
@jlewitt1 jlewitt1 force-pushed the cluster-auth branch 2 times, most recently from 2a78242 to fc2cd96 Compare October 25, 2023 09:13
Comment on lines 191 to 192
- ``tls``: Connects to the cluster via HTTPS (by default on port :code:`443`) and enforces verification via TLS
certificates. Only users with a valid cert will be able to make requests to the API server.
Contributor

This is out of date. Should just say "Connects to the cluster via HTTPS (by default on port :code:`443`) using either a provided certificate, or creating a new self-signed certificate just for this cluster. You must open the needed ports in the firewall, such as via the open_ports argument in the OnDemandCluster, or manually in the compute itself or cloud console."

Comment on lines 193 to 194
- ``none``: Does not use any port forwarding or enforce any authentication. Connects to the cluster with HTTP by
default on port :code:`80`.
Contributor

Might be good to add: "This is useful when connecting to a cluster within a VPC, or creating a tunnel manually on the side with custom settings."

Comment on lines 199 to 200
- ``paramiko``: Uses `Paramiko <https://www.paramiko.org/>`_ to create an SSH tunnel to the cluster. This
is relevant if you are using a cluster which require existing credentials (e.g. a password).
Contributor

Should just say "using a cluster which requires a password to authenticate."

Comment on lines 205 to 206
The ``tls`` connection type is the most secure and is recommended for production use if you are not running inside
of a VPC.
Contributor

Suggested change
- The ``tls`` connection type is the most secure and is recommended for production use if you are not running inside
- of a VPC.
+ The ``tls`` connection type is the most secure and is recommended for production use if you are not running inside
+ of a VPC. However, be mindful that you must secure the cluster with authentication (see below) if you open it to the public internet.

Comment on lines 211 to 212
Runhouse allows you to authenticate users via their Runhouse token (generated when
:ref:`logging in <Login/Logout>`) and saved to local Runhouse configs in path: :code:`~/.rh/config.yaml`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- Runhouse allows you to authenticate users via their Runhouse token (generated when
- :ref:`logging in <Login/Logout>`) and saved to local Runhouse configs in path: :code:`~/.rh/config.yaml`.
+ If desired, Runhouse provides out-of-the-box authentication via users' Runhouse token (generated when
+ :ref:`logging in <Login/Logout>` and set locally at: :code:`~/.rh/config.yaml`). This is crucial if the cluster has ports open to the public internet, as would usually be the case when using the ``tls`` connection type. You may also set up your own authentication manually inside of your own code, but you should likely still enable Runhouse authentication to ensure that even your non-user-facing endpoints into the server are secured.

detail=f"Failed to validate Runhouse user: {load_resp_content(resp)}",
)

if use_den_auth or func_call:
Contributor

This should just be if func_call, if use_den_auth is false we return above

Contributor

And I don't see how we're validating auth into the cluster for the non-func_call case

Comment on lines 162 to 172
def _load_current_cluster(kwargs) -> Union[str, None]:
    current_cluster = _get_cluster_from(_current_cluster("config"))
    if current_cluster:
        return current_cluster.rns_address

    # If no cluster config saved yet on the cluster try getting the cluster uri from the message object
    # included in the request
    message: Message = kwargs.get("message")
    resource = json.loads(message.data)
    return resource.get("name")

Contributor

I don't think we need this anymore?

@@ -225,7 +372,8 @@ def lookup_env_for_name(name, check_rns=False):

 @staticmethod
 @app.post("/resource")
-def put_resource(message: Message):
+@validate_user
+def put_resource(request: Request, message: Message):
Contributor

Why are we adding Request objects here?

.save()
)

c.restart_server()
Contributor

Why do we restart the server here?


configs.set("token", "abcd123")

try:
Contributor

We need to try calling a bunch of other cluster methods with a bad token too, not just a function call

@jlewitt1 jlewitt1 force-pushed the cluster-auth branch 5 times, most recently from ad5842c to 261e1fb Compare November 2, 2023 09:14
…reate & share resources. update auth logic in object store
# Conflicts:
#	runhouse/resources/hardware/cluster.py
@jlewitt1 jlewitt1 merged commit f9adbcc into main Nov 2, 2023
5 checks passed
@jlewitt1 jlewitt1 deleted the cluster-auth branch November 4, 2023 18:02
3 participants