
Conversation

@kai-zeng

Major changes:
(1) Configuring AWS access key pairs: the previous implementation did not consistently use the same AWS access key pair for unloading and for reading data. This PR fixes that bug. The AWS access key pair is now resolved in this priority order:
<1> If the 'tempdir' option encodes an AWS access key pair, that key pair is used.
<2> Otherwise, the AWS key pair specified in the Hadoop configuration of the Spark context is used.

(2) Automating the tests: we set up a Redshift server for this test suite and encrypt the security information using Travis.

(3) Column names are automatically converted to lowercase, since Redshift COPY only supports lowercase column names.

(4) Other miscellaneous fixes.
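The priority order in (1) can be sketched as follows. This is an illustrative sketch, not the PR's actual code: the name `resolveCredentials` is hypothetical, and the Hadoop configuration is modeled as a plain `Map` to keep the example self-contained. The `fs.s3n.awsAccessKeyId` / `fs.s3n.awsSecretAccessKey` property names are the standard Hadoop s3n credential keys.

```scala
import java.net.URI

object CredentialResolution {
  // <1> Try to read "accessKey:secretKey" from the tempdir URI's user-info
  // part, e.g. "s3n://ACCESSKEY:SECRETKEY@bucket/path".
  def fromTempDir(tempDir: String): Option[(String, String)] = {
    Option(new URI(tempDir).getUserInfo)
      .map(_.split(":", 2))
      .collect { case Array(access, secret) => (access, secret) }
  }

  // <2> Otherwise fall back to the Hadoop configuration of the Spark context
  // (modeled here as a plain Map for a self-contained example).
  def resolveCredentials(
      tempDir: String,
      hadoopConf: Map[String, String]): Option[(String, String)] = {
    fromTempDir(tempDir).orElse {
      for {
        access <- hadoopConf.get("fs.s3n.awsAccessKeyId")
        secret <- hadoopConf.get("fs.s3n.awsSecretAccessKey")
      } yield (access, secret)
    }
  }
}
```

Because both the unload and the read path go through one resolution function, they can no longer disagree about which key pair is in effect.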

@kai-zeng
Author

cc @marmbrus @yhuai

@JoshRosen
Contributor

I'm going to be taking over this PR. As it stands now, I think that this PR is too big because it contains both bugfixes and general test coverage improvements. I'm going to see about splitting these into a series of smaller incremental PRs.

Contributor

If this is a new file with original code then it should have the DB copyright.

Contributor

Actually, I see now that most of this suite is based on / copied from RedshiftSourceSuite. It would have been good to leave a comment explaining this (maybe instead of the uninformative comment that's currently at the top of this suite). It would have also been nice to factor out some of the code into common helper classes (e.g. not repeating expectedData in both suites).

@JoshRosen
Contributor

(1) Configuring aws access key pairs: The previous implementation is not consistent on the aws access key pairs used in unloading and reading data. This PR fixes this bug.
Now the aws access key pair is configured in this priority order:
<1> If the 'tempdir' option encodes the aws access key pair. That key pair is used.
<2> Otherwise, we use the aws key pair specified in the hadoop configuration of the spark context.

Would have been good to compare / contrast this with the fix proposed by @koeninger in #32.

Contributor

If you're going to reformat this anyways then you might as well replace this with a .getOrElse.
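The reviewed code itself isn't shown in this thread, so here is a generic sketch of the refactoring being suggested: replacing an explicit pattern match on an `Option` with `.getOrElse`. The `TEMPDIR` variable and default path are illustrative only.

```scala
val configured: Option[String] = sys.env.get("TEMPDIR")

// Before: an explicit match on the Option.
val tempDirVerbose = configured match {
  case Some(dir) => dir
  case None => "/tmp/spark-redshift"
}

// After: the same logic, idiomatically, with .getOrElse.
val tempDir = configured.getOrElse("/tmp/spark-redshift")
```

Both forms produce the same value; the second is shorter and reads as "use the configured value, else the default."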

@JoshRosen
Contributor

This PR is really difficult to review / understand in its current state, so I'm going to open a new WIP PR where I try to gradually revert some of the unnecessary changes in order to figure out which fixes were actually necessary.

Contributor

Is there a more specific exception that we could expect?
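In ScalaTest, `intercept[T]` is the usual way to pin a test to a specific exception type rather than a blanket `Exception`. The mechanism can be reproduced in plain Scala so the example is self-contained; `interceptExpected` and `parsePort` are hypothetical names, not code from this PR.

```scala
import scala.reflect.ClassTag

// Throws the narrow NumberFormatException on bad input.
def parsePort(s: String): Int = s.toInt

// Succeeds only if `body` throws an exception of exactly the expected type
// (or a subtype); any other outcome fails the test.
def interceptExpected[T <: Throwable](body: => Any)(implicit ct: ClassTag[T]): T = {
  try {
    body
    throw new AssertionError(
      s"Expected ${ct.runtimeClass.getName} but no exception was thrown")
  } catch {
    case t: Throwable if ct.runtimeClass.isInstance(t) => t.asInstanceOf[T]
  }
}

// Expecting the specific NumberFormatException, not a blanket Exception:
val e = interceptExpected[NumberFormatException] { parsePort("not-a-port") }
```

Expecting the narrowest type means the test still fails loudly if the code starts throwing something unexpected for a different reason.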

@JoshRosen
Contributor

Closing this in favor of #41.

@JoshRosen JoshRosen closed this Aug 18, 2015
Contributor

Nit: I don't like should matchers. Also, it would be nice to use something similar to Spark SQL's checkAnswer so that you get useful debugging output for failures.
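Spark SQL's `checkAnswer` test helper compares expected and actual rows order-insensitively and, on mismatch, prints both sides. A minimal self-contained analogue (modeled on that helper, not its actual implementation, with rows represented as `Seq[Any]`) looks like this:

```scala
// Sketch of a checkAnswer-style assertion: order-insensitive comparison with
// a readable failure message, unlike a bare `should` matcher.
def checkAnswer(actual: Seq[Seq[Any]], expected: Seq[Seq[Any]]): Unit = {
  // Sort both sides by their string form so row order doesn't matter.
  val sortedActual = actual.sortBy(_.mkString(","))
  val sortedExpected = expected.sortBy(_.mkString(","))
  if (sortedActual != sortedExpected) {
    throw new AssertionError(
      s"""Answers did not match.
         |Expected: ${sortedExpected.mkString("; ")}
         |Actual:   ${sortedActual.mkString("; ")}""".stripMargin)
  }
}
```

The payoff is in the failure output: instead of "false was not true", a broken test shows exactly which rows differed.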

@JoshRosen JoshRosen deleted the redshift-test branch August 20, 2015 21:28

3 participants