-
Notifications
You must be signed in to change notification settings - Fork 347
Bug fixes and auto testing of Redshift data source #36
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
I'm going to be taking over this PR. As it stands now, I think that this PR is too big because it contains both bugfixes and general test coverage improvements. I'm going to see about splitting these into a series of smaller incremental PRs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is a new file with original code then it should have the DB copyright.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I see now that most of this suite is based on / copied from RedshiftSourceSuite. It would have been good to leave a comment explaining this (maybe instead of the uninformative comment that's currently at the top of this suite). It would have also been nice to factor out some of the code into common helper classes (e.g. not repeating expectedData in both suites).
Would have been good to compare / contrast this with the fix proposed by @koeninger in #32. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you're going to reformat this anyways then you might as well replace this with a .getOrElse.
|
This PR is really difficult to review / understand in its current state, so I'm going to open a new WIP PR where I try to gradually revert some of the unnecessary changes in order to figure out which fixes were actually necessary. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a more specific exception that we could expect?
|
Closing this in favor of #41. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: I don't like should matchers. Also, it would be nice to use something similar to Spark SQL's checkAnswer so that you get useful debugging output for failures.
Major changes:
(1) Configuring aws access key pairs: The previous implementation is not consistent on the aws access key pairs used in unloading and reading data. This PR fixes this bug.
Now the aws access key pair is configured in this priority order:
<1> If the 'tempdir' option encodes the aws access key pair. That key pair is used.
<2> Otherwise, we use the aws key pair specified in the hadoop configuration of the spark context.
(2) Automating the test. We set up a redshift server for this test, and encode the security information using Travis.
(3) Auto convert column names to lower cases as Redshift COPY only supports lowercase column names.
(4) Other miscellaneous fixes.