Run queries through ad-hoc SSH tunnels #4797

rauchy · 2020-04-12T14:28:58Z

What type of PR is this? (check all applicable)

Feature

Description

Redash requires network access to the data sources it needs to query, but sometimes these data sources aren't publicly available. A way around this is to use SSH tunnels, which can be set up manually outside Redash, but can also be set up automatically before a connection is made and closed automatically once they are no longer required.

This PR sets up ad-hoc SSH tunnels for data sources if they have an ssh_tunnel dict in their options, which must include the ssh_host and ssh_username of the bastion server to connect through.

In order to enable SSH tunnels, dynamic_settings.ssh_tunnel_auth() must be implemented to allow the Redash instance to authenticate with the bastion server(s).
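As a rough illustration of the shape described above, the sketch below shows a data source options dict carrying an ssh_tunnel entry, plus a stand-in for dynamic_settings.ssh_tunnel_auth(). Only the ssh_tunnel, ssh_host and ssh_username names come from the description. The hostnames, the ssh_pkey key, the key path and the build_tunnel_kwargs helper are assumptions for illustration, not the PR's actual code.

```python
# Hypothetical data source options; only the "ssh_tunnel" dict with
# "ssh_host" and "ssh_username" is taken from the PR description.
options = {
    "host": "db.internal.example.com",
    "port": 5432,
    "ssh_tunnel": {
        "ssh_host": "bastion.example.com",
        "ssh_username": "redash",
    },
}


def ssh_tunnel_auth():
    """Stand-in for dynamic_settings.ssh_tunnel_auth().

    Assumption: it returns keyword arguments describing how to
    authenticate with the bastion, here a path to a private key.
    """
    return {"ssh_pkey": "/path/to/private_key"}


def build_tunnel_kwargs(options, auth):
    """Hypothetical helper: merge the per-data-source tunnel details
    with the instance-wide authentication settings."""
    details = options.get("ssh_tunnel")
    if details is None:
        return None  # no tunnel requested for this data source
    missing = {"ssh_host", "ssh_username"} - details.keys()
    if missing:
        raise ValueError("ssh_tunnel is missing: " + ", ".join(sorted(missing)))
    return {**details, **auth}


tunnel_kwargs = build_tunnel_kwargs(options, ssh_tunnel_auth())
```

Keeping authentication in one instance-wide function means individual data sources only need to name the bastion, not carry credentials.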

Related Issues

#2013

arikfr · 2020-04-16T20:05:11Z

redash/query_runner/__init__.py

+    @property
+    def host(self):
+        if "host" in self.configuration:
+            return self.configuration["host"]
+        else:
+            raise NotImplementedError()
+
+    @host.setter
+    def host(self, host):
+        if "host" in self.configuration:
+            self.configuration["host"] = host
+        else:
+            raise NotImplementedError()
+
+    @property
+    def port(self):
+        if "port" in self.configuration:
+            return self.configuration["port"]
+        else:
+            raise NotImplementedError()
+
+    @port.setter
+    def port(self, port):
+        if "port" in self.configuration:
+            self.configuration["port"] = port
+        else:
+            raise NotImplementedError()


Let's add a comment explaining the purpose of these properties.

👍 8da215b
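One plausible purpose of these properties is to let the tunnelling code redirect a query runner to the local end of a tunnel without knowing how each runner stores its address. The sketch below is a simplified stand-in, not Redash's actual query runner base class. FakeRunner, point_at_tunnel and the local bind address are assumptions.

```python
class FakeRunner:
    """Simplified stand-in for a query runner that keeps host/port in its configuration."""

    def __init__(self, configuration):
        self.configuration = configuration

    @property
    def host(self):
        # Runners that do not store a "host" key would need to override this.
        if "host" in self.configuration:
            return self.configuration["host"]
        raise NotImplementedError()

    @host.setter
    def host(self, host):
        if "host" not in self.configuration:
            raise NotImplementedError()
        self.configuration["host"] = host

    @property
    def port(self):
        if "port" in self.configuration:
            return self.configuration["port"]
        raise NotImplementedError()

    @port.setter
    def port(self, port):
        if "port" not in self.configuration:
            raise NotImplementedError()
        self.configuration["port"] = port


def point_at_tunnel(runner, local_host, local_port):
    """Hypothetical helper: remember the real address, then redirect the
    runner to the local end of an already-open tunnel."""
    remote = (runner.host, runner.port)
    runner.host, runner.port = local_host, local_port
    return remote


runner = FakeRunner({"host": "db.internal.example.com", "port": 5432})
remote = point_at_tunnel(runner, "127.0.0.1", 50000)
```

After the call, remote holds the original address (the tunnel's target) and the runner connects to the local bind address instead.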


rauchy · 2020-04-29T21:50:05Z

@arikfr forceful schema refreshes can't be handled by gunicorn, so I offloaded them to RQ and polled for their results using the /jobs endpoint. This seems to work well. If you think the implementation is on track, I'll use the same method for connection testing.

arikfr · 2020-04-30T09:23:31Z

redash/tasks/general.py

+        return True
+
+
+@job("schemas", queue_class=Queue, at_front=True, timeout=30, ttl=90)


Unfortunately, for some data sources the timeout needs to be much longer. Let's start with 5 minutes.

arikfr · 2020-04-30T09:25:19Z

redash/serializers/__init__.py

+            "result": result,
            "query_result_id": query_result_id,


To avoid accumulating tech debt, maybe we can return query_result_id only when it's an execute_query task?

(only if it's a simple check)

If we want to avoid accumulating tech debt, we should probably always return result and ditch query_result_id, no?

But this will break scripts that use /jobs for polling and expect query_result_id... :| We could really use API versioning at this point.

I agree, that's why I went for doubling result and query_result_id, as the cheapest thing that won't break and won't be too awkward :(

arikfr · 2020-04-30T09:30:52Z

redash/handlers/data_sources.py

-        except NotSupported:
-            response["error"] = {
-                "code": 1,
-                "message": "Data source type does not support retrieving schema",
-            }
-        except Exception:
-            response["error"] = {"code": 2, "message": "Error retrieving schema."}


We need to keep reporting these error codes. For example, we might show different UI when it's not supported. And we don't want to throw random errors at the user but rather show "Error retrieving schema." (it might make sense to show some of the errors, but it's beyond the scope of this PR).

I thought so too, but then I realized we aren't really using the error codes defined in the service. The current implementation (returning [] on NotSupported from the backend) mimics the same behavior we have today - silent failing on NotSupported and a "Schema refresh failed." message when errors occur.

I know that we don't really use them anymore in code -- this is something that happened during the conversion to React, but we should someday. It's better to be explicit about this.
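To make the "explicit error codes" position concrete, here is a minimal sketch of translating schema failures into the two codes from the removed handler code above (1 for unsupported, 2 for any other failure). The NotSupported class below is a local stand-in, and get_schema_or_error and its fetch_schema argument are hypothetical names, not the code that landed in this PR.

```python
class NotSupported(Exception):
    """Stand-in for the exception raised when a data source cannot list its schema."""


def get_schema_or_error(fetch_schema):
    """Call fetch_schema() and translate failures into explicit error codes.

    Codes and messages mirror the removed handler code:
    1 = schema retrieval not supported, 2 = any other failure.
    """
    try:
        return {"schema": fetch_schema()}
    except NotSupported:
        return {
            "error": {
                "code": 1,
                "message": "Data source type does not support retrieving schema",
            }
        }
    except Exception:
        # Deliberately generic: the user sees a stable message, not a raw error.
        return {"error": {"code": 2, "message": "Error retrieving schema."}}
```

With this shape, a client can branch on the code (for example, hide the schema browser on code 1) instead of parsing messages.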


jezdez

@arikfr @rauchy I have a few detailed questions about the job based schema fetching as I think this adds some risks regarding performance and user experience.

Apologies for not adding these comments before merging. I thought raising them here rather than in a new ticket would give more context, but please let me know if you want me to open one anyway.

jezdez · 2020-05-13T18:59:54Z

redash/tasks/general.py

+        return True
+
+
+@job("schemas", queue_class=Queue, at_front=True, timeout=300, ttl=90)


@arikfr @rauchy I think this job adds a big risk of filling up Redis when a lot of people try to load a query or refresh the schema, since it stores the job result for the default 500 seconds and a data source schema can be quite big, depending on the number of tables.

Since the result is effectively used only once when loaded, I would suggest a short result TTL of 30 seconds, and making it configurable. WDYT?
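A minimal sketch of what a configurable value could look like, assuming it is read from an environment variable. The name REDASH_SCHEMAS_RESULT_TTL and the schemas_result_ttl helper are hypothetical. RQ's job decorator does accept a result_ttl argument, which is where such a value would be passed.

```python
import os


def schemas_result_ttl(environ=os.environ, default=30):
    """Return how long (in seconds) schema job results should stay in Redis.

    REDASH_SCHEMAS_RESULT_TTL is a hypothetical setting name used only
    for this sketch; the default follows the 30 seconds suggested above.
    """
    raw = environ.get("REDASH_SCHEMAS_RESULT_TTL")
    if raw is None:
        return default
    ttl = int(raw)
    if ttl < 0:
        raise ValueError("result TTL must not be negative")
    return ttl


# The value would then be handed to the job decorator, for example:
#   @job("schemas", ..., result_ttl=schemas_result_ttl())
```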

jezdez · 2020-05-13T19:06:38Z

client/app/pages/queries/hooks/useDataSourceSchema.js

 function getSchema(dataSource, refresh = undefined) {
  if (!dataSource) {
    return Promise.resolve([]);
  }

+  const fetchSchemaFromJob = (data) => {
+    return sleep(1000).then(() => {


This adds a noticeable delay to loading the schema browser, with no indication that the schema is loading. Could we add a "Loading... please wait" screen or something similar? Or reduce the sleep interval to something short enough to be imperceptible?

Also, in case of a worker failure that prevents the job from ever returning a good result (e.g. Redis key eviction), could this add an upper limit on the number of tries to load the data and then show a button to fetch the schema anew with a new job?
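The upper limit suggested here could be sketched roughly as below. It is written in Python rather than the client's JavaScript, purely to show the control flow. fetch_job is a hypothetical callable returning the job part of a /jobs response, and the numeric status codes (3 for finished, 4 for failed) are an assumption about that payload.

```python
import time

FINISHED, FAILED = 3, 4  # assumed status codes; adjust to the real /jobs payload


class SchemaJobError(Exception):
    """Raised when the job fails or does not finish within the allowed tries."""


def poll_schema_job(fetch_job, max_tries=20, interval=0.25, sleep=time.sleep):
    """Poll fetch_job() until the job finishes, giving up after max_tries.

    fetch_job is a hypothetical callable returning a dict with a "status"
    key and, once done, a "result" or "error" key.
    """
    for attempt in range(max_tries):
        job = fetch_job()
        if job["status"] == FINISHED:
            return job["result"]
        if job["status"] == FAILED:
            raise SchemaJobError(job.get("error") or "schema job failed")
        if attempt < max_tries - 1:
            sleep(interval)  # wait before the next poll, but not after the last one
    raise SchemaJobError("schema job did not finish after %d tries" % max_tries)
```

A caller that catches SchemaJobError can then offer a button to enqueue a fresh job instead of polling forever.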

jezdez · 2020-05-13T19:20:24Z

Oh, one other comment: Marina and I spent a lot of time on a custom schema drawer feature in our fork (that was also merged here a while ago but was backed out after feedback from Arik): mozilla@109eac8

Since this PR adds a background job to load the schema, I want to point out that our schema drawer feature has modified the refresh_schema job to store the schema in Redash's postgres, in order to:

- amend the schema with metadata such as description, column samples and example queries, and
- cache the response for the schema drawer API endpoint, with a Redis lock based refresh system to prevent the thundering herd problem of the refresh schema button.

The results are lower traffic to cost intensive data sources (less $$$), immediate responses to the user and the ability to provide more details about the individual data source tables and columns.
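For readers unfamiliar with the pattern, a lock-guarded refresh can be sketched in-process as follows. This is not the fork's implementation: SchemaCache is a hypothetical class, and a threading.Lock stands in for the Redis lock that would coordinate multiple workers.

```python
import threading


class SchemaCache:
    """In-process stand-in for a cached schema with a lock-guarded refresh."""

    def __init__(self, load_schema):
        self._load_schema = load_schema  # expensive call to the data source
        self._lock = threading.Lock()    # a Redis lock would play this role across workers
        self._schema = None

    def get(self):
        """Return the cached schema immediately (None if never loaded)."""
        return self._schema

    def refresh(self):
        """Reload the schema unless another refresh is already running.

        Returns True if this call performed the refresh, False if it was
        skipped because the lock was held.
        """
        if not self._lock.acquire(blocking=False):
            return False  # someone else is refreshing; avoid a duplicate load
        try:
            self._schema = self._load_schema()
            return True
        finally:
            self._lock.release()
```

Many simultaneous clicks on the refresh button then result in a single trip to the data source, while everyone else keeps reading the cached copy.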

If you'd like to get a quick demo again, please let me know, we're obviously eager to have alignment with you.

jezdez · 2020-05-29T17:05:46Z

@arikfr @rauchy We're about to start our rebase downstream to pull in these changes, and I was wondering if you had seen my comments above about the anti-patterns in this feature: loading the schema with a noticeable 1 second delay, and a risk of race conditions if an RQ worker doesn't process the get_schema job?

arikfr · 2020-05-31T10:15:40Z

We need to do a follow up on this:

- Make the API return the cached version immediately if it's available (this will address the 1 second delay).
- Add some progress indicator (NTH though, as we had the same issue in the past).

Adding a lock could be nice, but I'm not sure it's needed once we move the caching functionality to the API layer. Basically, at this point it returns to being the same behavior as before (in terms of number of restarts).

run queries through adhoc SSH tunnels

231f19b

rauchy requested a review from arikfr April 12, 2020 14:28

arikfr reviewed Apr 16, 2020


susodapop reviewed Apr 17, 2020


weekly-digest bot mentioned this pull request Apr 20, 2020

Weekly Digest (13 April, 2020 - 20 April, 2020) #4815

Closed

Omer Lachish added 2 commits April 20, 2020 22:53

reduce indent by losing try/else clause

93a9748

document host/port getters and setters

8da215b

weekly-digest bot mentioned this pull request Apr 27, 2020

Weekly Digest (20 April, 2020 - 27 April, 2020) #4839

Closed

restyled-io bot mentioned this pull request Apr 29, 2020

Restyle Run queries through ad-hoc SSH tunnels #4847

Merged

handle forceful schema refreshes in RQ and poll for their results using the /jobs endpoint

75a48b0

arikfr reviewed Apr 30, 2020


Omer Lachish and others added 4 commits April 30, 2020 20:43

set schema refresh timeout to 5 minutes

7f3b458

Restyled by prettier (#4847)

9029e6e

Co-authored-by: Restyled.io <commits@restyled.io>

Merge branch 'auto-ssh-tunnels' of github.com:getredash/redash into auto-ssh-tunnels

c38d826

send schema refresh errors as part of API response

bc69c03

weekly-digest bot mentioned this pull request May 4, 2020

Weekly Digest (27 April, 2020 - 4 May, 2020) #4863

Closed

arikfr reviewed May 11, 2020


Use correct get_schema call.

6a26d78

arikfr merged commit 9562718 into master May 11, 2020

arikfr deleted the auto-ssh-tunnels branch May 11, 2020 10:22

jezdez reviewed May 13, 2020


gabrieldutra mentioned this pull request Jun 1, 2020

Return cached data source schema when available #4934

Merged

1 task

pahaz mentioned this pull request Jun 16, 2020

Who uses the sshtunnel? pahaz/sshtunnel#181

Open

eradman mentioned this pull request Oct 13, 2023

Allow Query.options to be None #6519

Merged

3 tasks
