
Conversation

@emlyn
Contributor

@emlyn emlyn commented Jul 31, 2015

Give the user the option to ignore a certain number of bad rows in the import. I've implemented it as a separate option, but it might be preferable to just have something like a freeform copyoptions that gets appended to the Redshift COPY query.

@codecov-io

Current coverage is 87.84%

Merging #35 into master will decrease coverage by 7.16% as of 7a5d7e7

@@            master     #35   diff @@
======================================
  Files           12      12       
  Stmts          501     502     +1
  Branches       122     122       
  Methods          0       0       
======================================
- Hit            476     441    -35
  Partial          0       0       
- Missed          25      61    +36

Review entire Coverage Diff as of 7a5d7e7


@JoshRosen
Contributor

Given that the data that we're loading is being generated by spark-redshift itself, what sorts of errors is this option guarding against?

@emlyn
Contributor Author

emlyn commented Aug 25, 2015

Errors I have seen myself include text fields being too long, and invalid UTF8 characters in strings, although there are probably other possibilities.
It's quite frustrating when a long process fails at the last step when, in many cases, the output would still be useful even with a few missing rows, so I think the ability to set this option would be valuable.
Actually, thinking about it, it would be really nice if this could read the error messages from STL_LOAD_ERRORS and log them when there are errors.

@JoshRosen
Contributor

Ah, gotcha. Hopefully we can address the "text fields too long" issue as part of the fix for #29.

I think that the invalid UTF8 character issue may have been fixed by some escaping / quoting fixes in one of my earlier bugfix patches. If you still observe this issue with the current master, though, it would be great if you could file a new issue with an error message so that I can debug.

Reading the error messages from stl_load_errors is a really good idea; this would have been super helpful to me while debugging. I've been doing it manually using SQLWorkbenchJ, but automating it would be very useful.
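
The manual check described here can be sketched as a query against Redshift's STL_LOAD_ERRORS system table. This is an illustrative query string only (column names are taken from the AWS STL_LOAD_ERRORS documentation), not the automation later added in #53:

```python
# Illustrative sketch: the kind of query one might run manually (e.g. in
# SQLWorkbenchJ) to inspect recent COPY load failures in Redshift.
LOAD_ERRORS_QUERY = """
SELECT starttime, filename, line_number, colname, err_code, err_reason
FROM stl_load_errors
ORDER BY starttime DESC
LIMIT 20
""".strip()

print(LOAD_ERRORS_QUERY)
```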

@JoshRosen
Contributor

@emlyn, I've added support for automatically querying STL_LOAD_ERRORS in #53

@jaley
Contributor

jaley commented Oct 14, 2015

Hi @JoshRosen,

We're currently using a fork with this patch applied, as we keep finding new reasons why it's useful. Here's a little more info about our use cases for context:

The data we're processing comes in from mobile clients, so it contains all kinds of invalid mess. Strings with bad chars, timestamps at +/-MAXINT, etc. All things that slip through the Avro type system easily enough, but don't rightly make sense in the Redshift schema. We add validation to our code to filter it out before attempting to write to Redshift, but new things pop up every so often as new data comes in.

We'd like to make sure that when new types of bad data come in, we're aware of it, but it's not a catastrophic failure for the whole ETL job. Being AWS-based, the most sensible way for us to do this is for our ETL app to emit CloudWatch metrics telling us how many rows were added to stl_load_errors in the save operation. We can then set up alarms to page us when we see non-zero values, but the good data will have all made it in, so our analysts can still operate while we fix it. That is, provided there weren't so many bad rows that they exceed MAXERRORS, which we might set to, say, 0.1% of the total table size.
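
The thresholding described above can be sketched with a small hypothetical helper (not part of spark-redshift) that turns an expected row count into a MAXERROR clause at a chosen fraction:

```python
# Hypothetical helper: compute a MAXERROR threshold as a fraction of the
# expected row count (default 0.1%), with a minimum floor so that small
# tables still tolerate at least one bad row.
def max_error_option(expected_rows, fraction=0.001, floor=1):
    threshold = max(floor, int(expected_rows * fraction))
    return f"MAXERROR {threshold}"

# For a million-row load at 0.1%:
print(max_error_option(1_000_000))  # -> MAXERROR 1000
```

The resulting string would then be passed through whatever mechanism exposes extra COPY options to Redshift.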

It'd therefore be pretty useful for us to expose the Redshift MAXERRORS parameter through spark-redshift. What can we do to get this patch mergeable?

@emlyn
Contributor Author

emlyn commented Oct 14, 2015

I think something like this could be more generally useful, to allow users to append arbitrary options to the end of the copy command.

@JoshRosen
Contributor

Hey @emlyn and @jaley,

I like the idea of adding a copyoptions-like flag for advanced users. The only possible snag that I can anticipate is a case where users are used to configuring something through copyoptions and then spark-redshift adds built-in support for configuring that same option, leading to conflicts. This probably isn't a huge problem in practice, though: the default COPY string that we build is unlikely to change significantly, so the chance of a spark-redshift upgrade breaking something is minimal unless the user happens to change other configurations at the same time. Also, this feature is intended more for advanced users to spare them the hassle of publishing their own version of the library, so it's less likely to be abused in hard-to-maintain ways.

What do you think about calling this extracopyoptions and documenting it in the README? If we do that, then I'm cool with merging this.

@JoshRosen
Contributor

One nice additional advantage of extracopyoptions: it would give users the ability to fix #87, too.

@jaley
Contributor

jaley commented Oct 15, 2015

I think that sounds good. It should also reduce the maintenance burden of keeping up with new parameters being added to Redshift's COPY functionality. Looking at the documentation as it stands, all the options people are likely to want to change seem to work fine at the end of the COPY statement. It's possible that some options won't work when simply appended at the end, but if the documentation is clear that this is what happens, I think it'll be a useful feature.
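
The append-at-the-end behaviour being discussed can be illustrated with a toy version of a COPY-statement builder. This is a sketch only, not spark-redshift's actual code; the real generated statement differs (the base clauses here are placeholders):

```python
# Toy sketch of appending user-supplied options to a generated COPY
# statement, mirroring the proposed extracopyoptions semantics: built-in
# clauses come first, the user's extra options go verbatim at the end.
def build_copy(table, s3_path, credentials, extra_copy_options=""):
    base = (f"COPY {table} FROM '{s3_path}' "
            f"CREDENTIALS '{credentials}' FORMAT AS CSV")
    return f"{base} {extra_copy_options}".rstrip()

print(build_copy("events", "s3://bucket/tmp/part-", "aws_creds",
                 "TRUNCATECOLUMNS MAXERROR 1000"))
```

Because the extra options are appended after everything the library emits, any option that Redshift only accepts in a particular position would not work this way, which is the caveat worth documenting.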

@JoshRosen
Contributor

Sounds good to me. @emlyn, if you go ahead and add the extracopyoptions and documentation as described above then I'll merge this today.

@emlyn emlyn changed the title from "Add maxerrors option" to "Add extracopyoptions" Oct 16, 2015
@emlyn
Contributor Author

emlyn commented Oct 16, 2015

@JoshRosen I've renamed it to extracopyoptions and added some documentation to the README, hopefully it's OK to merge now.

pass -> append

@JoshRosen
Contributor

Yep, looks good to me. I'm going to merge this now. Thanks!

@JoshRosen JoshRosen added this to the 0.5.2 milestone Oct 16, 2015
Whoops, just realized that this isn't going to be interpolated properly due to a missing s at the start of this string. Will hotfix now. Missed this because the integration tests didn't run for this third-party PR :(

@nhuray

nhuray commented Nov 10, 2015

@JoshRosen Hey, could you provide an example of how to use the extracopyoptions parameter? I tested it with:

    df.write
      .format("com.databricks.spark.redshift")
      .option("diststyle", "key")
      .option("extracopyoptions", "TRUNCATECOLUMNS")
      .mode("overwrite")
      .save()

but it didn't work.

Moreover, after taking a look at the code and the AWS documentation, I can't figure out how it could work ...

@JoshRosen
Contributor

Hey @nhuray,

Which version of spark-redshift are you using? This option only takes effect in version 0.5.2+.

@nhuray

nhuray commented Nov 11, 2015

Hi @JoshRosen, I'm using 0.5.2.
