
Conversation

@emlyn
Contributor

@emlyn emlyn commented Jul 31, 2015

Give the user the option to ignore a certain number of bad rows in the import. I've implemented it as a separate option, but it might be preferable to just have something like a freeform copyoptions that gets appended to the Redshift COPY query.

@codecov-io

Current coverage is 87.84%

Merging #35 into master will decrease coverage by 7.16% as of 7a5d7e7

@@            master     #35   diff @@
======================================
  Files           12      12       
  Stmts          501     502     +1
  Branches       122     122       
  Methods          0       0       
======================================
- Hit            476     441    -35
  Partial          0       0       
- Missed          25      61    +36

Review entire Coverage Diff as of 7a5d7e7


@JoshRosen
Contributor

Given that the data that we're loading is being generated by spark-redshift itself, what sorts of errors is this option guarding against?

@emlyn
Contributor Author

emlyn commented Aug 25, 2015

Errors I have seen myself include text fields being too long, and invalid UTF8 characters in strings, although there are probably other possibilities.
It's quite frustrating when a long process fails at the last step when, in many cases, the output would still be useful even with a few missing rows, so I think the ability to set this option would be valuable.
Actually, thinking about it, it would be really nice if this could read the error messages from STL_LOAD_ERRORS and log them when there are errors.

@JoshRosen
Contributor

Ah, gotcha. Hopefully we can address the "text fields too long" issue as part of the fix for #29.

I think that the invalid UTF8 character issue may have been fixed by some escaping / quoting fixes in one of my earlier bugfix patches. If you still observe this issue with the current master, though, it would be great if you could file a new issue with an error message so that I can debug.

Reading the error messages from stl_load_errors is a really good idea; this would have been super helpful to me while debugging. I've been doing it manually using SQLWorkbenchJ, but automating it would be very useful.
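
The manual check described here can be sketched as a query against Redshift's STL_LOAD_ERRORS system table. This is an illustrative query string only (column names are taken from the AWS STL_LOAD_ERRORS documentation), not the automation later added in #53:

```python
# Illustrative sketch: the kind of query one might run manually (e.g. in
# SQLWorkbenchJ) to inspect recent COPY load failures in Redshift.
LOAD_ERRORS_QUERY = """
SELECT starttime, filename, line_number, colname, err_code, err_reason
FROM stl_load_errors
ORDER BY starttime DESC
LIMIT 20
""".strip()

print(LOAD_ERRORS_QUERY)
```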

@JoshRosen
Contributor

@emlyn, I've added support for automatically querying STL_LOAD_ERRORS in #53

@jaley
Contributor

jaley commented Oct 14, 2015

Hi @JoshRosen,

We're currently using a fork with this patch applied, as we keep finding new reasons why it's useful. Here's a little more info about our use cases for context:

The data we're processing comes in from mobile clients, so it contains all kinds of invalid mess. Strings with bad chars, timestamps at +/-MAXINT, etc. All things that slip through the Avro type system easily enough, but don't rightly make sense in the Redshift schema. We add validation to our code to filter it out before attempting to write to Redshift, but new things pop up every so often as new data comes in.

We'd like to make sure that when new types of bad data come in, we're aware of it, but it's not a catastrophic failure for the whole ETL job. Being AWS-based, the most sensible way for us to do this is for our ETL app to emit CloudWatch metrics telling us how many rows were added to stl_load_errors in the save operation. We can then set up alarms to page us when we see non-zero values, but the good data will have all made it in, so our analysts can still operate while we fix it. That is, provided there weren't so many bad rows that they exceed MAXERRORS, which we might set to, say, 0.1% of the total table size.
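
The thresholding described above can be sketched with a small hypothetical helper (not part of spark-redshift) that turns an expected row count into a MAXERROR clause at a chosen fraction:

```python
# Hypothetical helper: compute a MAXERROR threshold as a fraction of the
# expected row count (default 0.1%), with a minimum floor so that small
# tables still tolerate at least one bad row.
def max_error_option(expected_rows, fraction=0.001, floor=1):
    threshold = max(floor, int(expected_rows * fraction))
    return f"MAXERROR {threshold}"

# For a million-row load at 0.1%:
print(max_error_option(1_000_000))  # -> MAXERROR 1000
```

The resulting string would then be passed through whatever mechanism exposes extra COPY options to Redshift.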

It'd therefore be pretty useful for us to expose the Redshift MAXERRORS parameter through spark-redshift. What can we do to get this patch mergeable?

@emlyn
Contributor Author

emlyn commented Oct 14, 2015

I think something like this could be more generally useful, to allow users to append arbitrary options to the end of the copy command.

@JoshRosen
Contributor

Hey @emlyn and @jaley,

I like the idea of adding a copyoptions-like flag for advanced users. The only possible snag that I can anticipate is a case where users are used to configuring something through copyoptions and then spark-redshift adds built-in support for configuring that same option, leading to conflicts. This probably isn't a huge problem in practice, though: the default COPY string that we build is unlikely to change significantly, so the chance of a spark-redshift upgrade breaking something is minimal unless the user happens to change other configurations at the same time. Also, this feature is intended more for advanced users to spare them the hassle of publishing their own version of the library, so it's less likely to be abused in hard-to-maintain ways.

What do you think about calling this extracopyoptions and documenting it in the README? If we do that, then I'm cool with merging this.

@JoshRosen
Contributor

One nice additional advantage of extracopyoptions: it would give users the ability to fix #87, too.

@jaley
Contributor

jaley commented Oct 15, 2015

I think that sounds good. It should also reduce the maintenance burden of keeping up with new parameters being added to Redshift's COPY functionality. Looking at the documentation as it stands, all the options people are likely to want to change seem to work fine at the end of the COPY statement. It's possible that some options won't work when simply appended at the end, but if the documentation is clear that this is what happens, I think it'll be a useful feature.
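
The append-at-the-end behaviour being discussed can be illustrated with a toy version of a COPY-statement builder. This is a sketch only, not spark-redshift's actual code; the real generated statement differs (the base clauses here are placeholders):

```python
# Toy sketch of appending user-supplied options to a generated COPY
# statement, mirroring the proposed extracopyoptions semantics: built-in
# clauses come first, the user's extra options go verbatim at the end.
def build_copy(table, s3_path, credentials, extra_copy_options=""):
    base = (f"COPY {table} FROM '{s3_path}' "
            f"CREDENTIALS '{credentials}' FORMAT AS CSV")
    return f"{base} {extra_copy_options}".rstrip()

print(build_copy("events", "s3://bucket/tmp/part-", "aws_creds",
                 "TRUNCATECOLUMNS MAXERROR 1000"))
```

Because the extra options are appended after everything the library emits, any option that Redshift only accepts in a particular position would not work this way, which is the caveat worth documenting.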

@JoshRosen
Contributor

Sounds good to me. @emlyn, if you go ahead and add the extracopyoptions and documentation as described above then I'll merge this today.

@emlyn emlyn changed the title from "Add maxerrors option" to "Add extracopyoptions" Oct 16, 2015
@emlyn
Contributor Author

emlyn commented Oct 16, 2015

@JoshRosen I've renamed it to extracopyoptions and added some documentation to the README, hopefully it's OK to merge now.

pass -> append

@JoshRosen
Contributor

Yep, looks good to me. I'm going to merge this now. Thanks!

@JoshRosen JoshRosen added this to the 0.5.2 milestone Oct 16, 2015
Whoops, just realized that this isn't going to be interpolated properly due to a missing s at the start of this string. Will hotfix now. Missed this because the integration tests didn't run for this third-party PR :(

@nhuray

nhuray commented Nov 10, 2015

@JoshRosen Hey, could you provide an example of how to use the extracopyoptions parameter? I tested it with:

    df.write
      .format("com.databricks.spark.redshift")
      .option("diststyle", "key")
      .option("extracopyoptions", "TRUNCATECOLUMNS")
      .mode("overwrite")
      .save()

but it didn't work.

Moreover, after taking a look at the code and the AWS documentation, I can't figure out how it could work ...

@JoshRosen
Contributor

Hey @nhuray,

Which version of spark-redshift are you using? This option only takes effect in version 0.5.2+.

@nhuray

nhuray commented Nov 11, 2015

Hi @JoshRosen, I'm using 0.5.2.
