Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PostgreSQL is broken #45

Open
beanumber opened this issue Jun 10, 2016 · 11 comments
Open

PostgreSQL is broken #45

beanumber opened this issue Jun 10, 2016 · 11 comments

Comments

@beanumber
Copy link
Owner

One problem is that the dbWriteTable method from the RPostgreSQL package will not actually accept the HEADER flag (https://github.com/cran/RPostgreSQL/blob/master/R/PostgreSQLSupport.R#L570)
This breaks the current implementation.
Possible workarounds:

  • use the data.frame method -- but that will be slow and memory inefficient
  • don't write the header row in the CSV if you are using Postgres. Kludgy.
  • wait for @hadley et al to get the alternative RPostgres package working. This will probably be the long-term solution, but I don't know how long it will take.

I believe this problem applies to all etl packages that use the dbWriteTable character method.

@homerhanumat
Copy link

Do you know if package RPostgres is now up to the task? I'm planning to create an airlines DB and would like to use PostgreSQL if possible.

@beanumber
Copy link
Owner Author

@homerhanumat I haven't tried it in a long time, but I know that package is under active development. Can you try it out and report back?

@homerhanumat
Copy link

Will do. Thanks, by the way, for MDSR -- very impressive text.

@homerhanumat
Copy link

Might be a while before I get to try it. My remote Ubuntu-based server installs things fine but has only about 30G disk space to devote to the DB. My laptop is a Mac with plenty of space but RPostgres installation fails on Mac at the moment: r-dbi/RPostgres#155.

@nicholasjhorton
Copy link
Collaborator

nicholasjhorton commented Dec 8, 2017 via email

@homerhanumat
Copy link

2015 through the problematic 2016 -- sure, I can give ti a try.

@homerhanumat
Copy link

homerhanumat commented Dec 8, 2017

I tried this:

library(airlines)
db <- src_postgres(RPostgres::Postgres(), user = "homer",
                password = "XX",
                host = "localhost",
                dbname = "airlines",
                port = 5432)
airlines <- etl("airlines", db, dir = "~/dumps/airlines")
airlines %>%
  etl_init() %>%
  etl_update(years = 2015)

Got an error saying that it could not find the initialization script.

So I tried again with an explicit reference to the script in the package:

library(airlines)
db <- src_postgres(RPostgres::Postgres(), user = "homer",
                password = "XX",
                host = "localhost",
                dbname = "airlines",
                port = 5432)
script <- system.file("sql/init.postgres", package = "airlines")
airlines <- etl("airlines", db, dir = "~/dumps/airlines")
airlines %>%
  etl_init(script = script) %>%
  etl_update(years = 2015)

Got two errors, the second of which was a complaint that it could not parse the first commented line of the script.

I copied the script into my home directory, deleted the comment-lines, and tried again:

library(airlines)
db <- src_postgres(RPostgres::Postgres(), user = "homer",
                password = "XX",
                host = "localhost",
                dbname = "airlines",
                port = 5432)
script <- "/home/homer/init.postgres"
airlines <- etl("airlines", db, dir = "~/dumps/airlines")
airlines %>%
  etl_init(script = script) %>%
  etl_update(years = 2015)

Now I'm down to just the one error, which is:

Running SQL script at /home/homer/init.postgres
trying URL 'http://www.transtats.bts.gov/Download_Lookup.asp?Lookup=L_UNIQUE_CARRIERS'
Content type 'Application/X-Unknown' length 49465 bytes (48 KB)
==================================================
downloaded 48 KB

Parsed with column specification:
cols(
  Code = col_character(),
  Description = col_character()
)
trying URL 'https://raw.githubusercontent.com/jpatokal/openflights/master/data/airports.dat'
Content type 'text/plain; charset=utf-8' length 1068028 bytes (1.0 MB)
==================================================
downloaded 1.0 MB

Parsed with column specification:
cols(
  X1 = col_integer(),
  X2 = col_character(),
  X3 = col_character(),
  X4 = col_character(),
  X5 = col_character(),
  X6 = col_character(),
  X7 = col_double(),
  X8 = col_double(),
  X9 = col_integer(),
  X10 = col_double(),
  X11 = col_character(),
  X12 = col_character(),
  X13 = col_character(),
  X14 = col_character()
)
Warning: 322 parsing failures.
row # A tibble: 5 x 5 col     row   col           expected actual                                          file expected   <int> <chr>              <chr>  <chr>                                         <chr> actual 1   329    X2 delimiter or quote      C '/home/homer/dumps/airlines/raw/airports.dat' file 2   329    X2 delimiter or quote        '/home/homer/dumps/airlines/raw/airports.dat' row 3   666    X2 delimiter or quote      S '/home/homer/dumps/airlines/raw/airports.dat' col 4   666    X2 delimiter or quote        '/home/homer/dumps/airlines/raw/airports.dat' expected 5  1310    X2 delimiter or quote      M '/home/homer/dumps/airlines/raw/airports.dat'
... ................. ... ..................................................................................... ........ ..................................................................................... ...... ..................................................................................... .... ....... [... truncated]
Error in filter_impl(.data, quo) : found duplicated column name: NA
In addition: Warning message:
In rbind(names(probs), probs_f) :
  number of columns of result is not a multiple of vector length (arg 1)

Thoughts?

@homerhanumat
Copy link

The above error occurs in the extract phase and is unrelated to choice of database. Perhaps the structure of airports.dat has changed during the past year.

@homerhanumat
Copy link

Yes, some raw data sets have changed, some URLs have changed, and some dplyr underscore functions no longer work. I'll submit a pull request if I can get everything working.

@homerhanumat
Copy link

So I have updated the package (here)to account for changes in location and structure of source files and in dplyr as well. It now works again with the RMySQL driver. It fails using RPostgres, with an error that suggests that the RPostgres method for DBI::dbWriteTable() attempts to write the header row of a flights csv file into the flights table, just as with RPostgreSQL.

@homerhanumat
Copy link

homerhanumat commented Dec 12, 2017

Latest update to my fork of airlines permits creation and updating of a Postgres DB, loading the processed flights csv file to a PostgresDB with a direct sql COPY table FROM command instead of going through dbWriteTable(). Tested on 2015 and the first three month of 2016, with no parsing problems or other errors.

Monthly updates (both in MySQL and Postgres) conclude with a warning:

Missing column names filled in: 'X110' [110]

This occurs because the flight csv files have a spurious trailing comma on the first line, so no harm is done.

At this point one should not build the connection with RPostgres::Postgres(), as dplyr does not yet support it as a back-end.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants