
Make SSH tunnel to query a remote redshift database in R #55

Open
fissehab opened this issue Dec 11, 2021 · 1 comment

I am using the Python code below to get data from a remote Redshift database, calling the function from R with the reticulate package. I want to convert the Python code to R so that I do not have to deal with setting up an Anaconda environment in the cloud when I push my code (a Shiny app) to shinyapps.io. I read about the ssh package and its capabilities, but I could not successfully convert my code with it.

def get_redshift_data(query):
  import pandas as pd
  import paramiko
  from sqlalchemy import create_engine
  from sqlalchemy.engine import url as u
  from sshtunnel import SSHTunnelForwarder

  # Private key used to authenticate against the SSH host
  mypkey = paramiko.RSAKey.from_private_key_file('nfw-linux-key.pem')
  sql_username = 'my_sql_username'
  sql_hostname = 'xxxx.us-east-1.redshift.amazonaws.com'
  sql_password = 'my_sql_password'
  sql_main_database = 'my_db'
  sql_port = 5439
  ssh_host = '127.0.0.1'
  ssh_user = 'ubuntu'
  ssh_port = 56714

  # Forward a local port through the SSH host to the Redshift endpoint,
  # then connect to the local end of the tunnel with SQLAlchemy.
  with SSHTunnelForwarder(
          (ssh_host, ssh_port),
          ssh_username=ssh_user,
          ssh_pkey=mypkey,
          remote_bind_address=(sql_hostname, sql_port)) as tunnel:
      conn_url = u.URL(drivername='postgresql+psycopg2', username=sql_username,
                       password=sql_password, host='127.0.0.1',
                       port=tunnel.local_bind_port, database=sql_main_database)
      conn = create_engine(conn_url)
      result = pd.read_sql(query, conn)
  return result

Then I am sourcing the Python function above to get data from the database.

library(reticulate)
source_python("get_redshift_data.py")
df = get_redshift_data("select top 100 * from my_db.my_table")

But I want to avoid the Python part and use R only.


FlorianSchwendinger commented Mar 9, 2022

You can try something like this:

start_tunnel_process <- function(user, server, port, stderr = nullfile()) {
    # Run the tunnel in a background R process, because ssh::ssh_tunnel()
    # blocks until the client disconnects.
    proc <- callr::r_bg(
        function(user, server, port) {
            ssh_host <- paste(user, server, sep = "@")
            # Reopen the tunnel every time a client disconnects;
            # target points at MongoDB on the remote machine, adjust as needed.
            while (TRUE) {
                ssh_session <- ssh::ssh_connect(ssh_host, keyfile = "~/.ssh/id_rsa", verbose = FALSE)
                ssh::ssh_tunnel(ssh_session, port = port, target = "localhost:27017")
                ssh::ssh_disconnect(ssh_session)
            }
        },
        args = list(user, server, port),
        stdout = nullfile(),
        stderr = stderr
    )
    proc
}
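
As a rough usage sketch (the user, server, and port below are placeholders, not values from this thread):

# Start the tunnel in the background; afterwards the target service is
# reachable on 127.0.0.1:<port> from the main R session.
proc <- start_tunnel_process(user = "ubuntu", server = "my-bastion-host", port = 27017)
# ... connect and run queries through 127.0.0.1:27017 ...
proc$kill()  # stop the background process when done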

I am currently evaluating this for my use case with mongolite.

url <- sprintf("mongodb://%s:%s@%s:%s", user, password, host, as.integer(port))
con <- mongo(collection, db = database, url = url)
con$find()
con$disconnect()
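
For the Redshift case from the original question, the same pattern should carry over with a Postgres driver instead of mongolite. A minimal, untested sketch, assuming the target inside start_tunnel_process() is changed to the Redshift endpoint on port 5439 and that DBI + RPostgres are used (neither package is mentioned above, and all host names and credentials are placeholders):

library(DBI)

# Assumes target = "xxxx.us-east-1.redshift.amazonaws.com:5439" inside the
# background tunnel function; the local port is forwarded to that endpoint.
proc <- start_tunnel_process(user = "ubuntu", server = "my-bastion-host", port = 5439)

con <- DBI::dbConnect(
    RPostgres::Postgres(),
    host     = "127.0.0.1",   # local end of the tunnel
    port     = 5439,
    dbname   = "my_db",
    user     = "my_sql_username",
    password = "my_sql_password"
)
df <- DBI::dbGetQuery(con, "select top 100 * from my_db.my_table")
DBI::dbDisconnect(con)
proc$kill()

The fault-tolerant retry described below applies to the first dbConnect() call as well.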

You need the while loop because the tunnel closes as soon as you disconnect; the loop then opens a new one. This also means a connection will fail if the time between two connections is too short for the background process to create the new tunnel. You can work around this by putting a fault-tolerant retry loop around your connect call.

for (i in seq_len(100)) {
    status <- try(con <- mongo(collection, db = database, url = url), silent = TRUE)
    if (!inherits(status, "try-error")) break
    Sys.sleep(0.2)
}

This works, but it is also kind of a hack, since the background R process runs at 100% CPU the whole time, which is too much.
