forked from sicarul/redshiftTools
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathREADME.Rmd
167 lines (112 loc) · 5.25 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
---
output:
md_document:
variant: markdown_github
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
![](http://www.r-pkg.org/badges/version-last-release/redshiftTools)
![](http://cranlogs.r-pkg.org/badges/redshiftTools)
# redshiftTools
This is an R Package meant to easen common operations with Amazon Redshift. The first motivation for this package was making it easier for bulk uploads, where the procedure for uploading data consists in generating various CSV files, uploading them to an S3 bucket and then calling a copy command on the server, this package helps with all those tasks in encapsulated functions.
## Installation
To install the latest CRAN version, you'll need to execute:
``` r
install.packages('redshiftTools')
```
If instead you want to install the latest github master version:
``` r
devtools::install_github("sicarul/redshiftTools")
```
## Drivers
This library supports two official ways of connecting to Amazon Redshift (Others may work, but untested):
### RPostgres
This Postgres library is great, and it works even with Amazon Redshift servers with SSL enabled. It previously didn't support transactions, but is now the recommended way to work with redshiftTools.
To use it, please configure like this:
``` r
devtools::install_github("r-dbi/RPostgres")
library(RPostgres)
con <- dbConnect(RPostgres::Postgres(), dbname="dbname",
host='my-redshift-url.amazon.com', port='5439',
user='myuser', password='mypassword',sslmode='require')
test=dbGetQuery(con, 'select 1')
```
### RJDBC
If you download the official redshift driver .jar, you can use it with this R library, it's not great in the sense that you can't use it with dplyr for example, since it doesn't implement all the standard DBI interfaces, but it works fine for uploading data.
To use it, please configure like this:
``` r
install.packages('RJDBC')
library(RJDBC)
# Save the driver into a directory
dir.create('~/.redshiftTools')
# - Check your AWS Dashboard to get the latest URL instead of this version -
download.file('http://s3.amazonaws.com/redshift-downloads/drivers/RedshiftJDBC41-1.1.9.1009.jar','~/.redshiftTools/redshift-driver.jar')
# Use Redshift driver
driver <- JDBC("com.amazon.redshift.jdbc41.Driver", "~/.redshiftTools/redshift-driver.jar", identifier.quote="`")
# Create connection, in production, you may want to move these variables to a .env file with library dotenv, or other methods.
dbname="dbname"
host='my-redshift-url.amazon.com'
port='5439'
user='myuser'
password='mypassword'
ssl='true'
url <- sprintf("jdbc:redshift://%s:%s/%s?tcpKeepAlive=true&ssl=%s&sslfactory=com.amazon.redshift.ssl.NonValidatingFactory", host, port, dbname, ssl)
conn <- dbConnect(driver, url, user, password)
```
## Usage
### Creating tables
For creating tables, there is a support function, `rs_create_statement`, which receives a data.frame and returns the query for creating the same table in Amazon Redshift.
``` r
n=1000
testdf = data.frame(
a=rep('a', n),
b=c(1:n),
c=rep(as.Date('2017-01-01'), n),
d=rep(as.POSIXct('2017-01-01 20:01:32'), n),
e=rep(as.POSIXlt('2017-01-01 20:01:32'), n),
f=rep(paste0(rep('a', 4000), collapse=''), n) )
cat(rs_create_statement(testdf, table_name='dm_great_table'))
```
This returns:
``` sql
CREATE TABLE dm_great_table (
a VARCHAR(8),
b int,
c date,
d timestamp,
e timestamp,
f VARCHAR(4096)
);
```
The cat is only done to view properly in console, it's not done directly in the function in case you need to pass the string to another function (Like a query runner)
### Uploading data
For uploading data, you'll have available now 2 functions: `rs_replace_table` and `rs_upsert_table`, both of these functions are called with almost the same parameters, except on upsert you can specify with which keys to search for matching rows.
For example, suppose we have a table to load with 2 integer columns, we could use the following code:
``` r
library("aws.s3")
library(RPostgres)
library(redshiftTools)
a=data.frame(a=seq(1,10000), b=seq(10000,1))
n=head(a,n=10)
n$b=n$a
nx=rbind(n, data.frame(a=seq(5:10), b=seq(10:5)))
con <- dbConnect(RPostgres::Postgres(), dbname="dbname",
host='my-redshift-url.amazon.com', port='5439',
user='myuser', password='mypassword',sslmode='require')
b=rs_replace_table(a, dbcon=con, table_name='mytable', bucket="mybucket", split_files=4)
c=rs_upsert_table(nx, dbcon=con, table_name = 'mytable', split_files=4, bucket="mybucket", keys=c('a'))
```
### Creating tables with data
A conjunction of `rs_create_statement` and `rs_replace_table` can be found in `rs_create_table`. You can create a table from scratch from R and upload the contents of the data frame, without needing to write SQL code at all.
``` r
library("aws.s3")
library(RPostgres)
library(redshiftTools)
a=data.frame(a=seq(1,10000), b=seq(10000,1))
con <- dbConnect(RPostgres::Postgres(), dbname="dbname",
host='my-redshift-url.amazon.com', port='5439',
user='myuser', password='mypassword',sslmode='require')
b=rs_create_table(a, dbcon=con, table_name='mytable', bucket="mybucket", split_files=4)
```