
Conversation

@JoshRosen
Contributor

This patch allows users to specify a `maxlength` column metadata entry for string columns in order to control the width of `VARCHAR` columns in generated Redshift table schemas. This is necessary to support string columns wider than 256 characters, the default width used when generating schemas. The same setting can also be used as an optimization to achieve space savings in Redshift. For more background on the motivation for this feature, see #29.

See also: #53 to improve error reporting when LOAD fails.
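
A minimal sketch of the intended usage, assuming an existing DataFrame `df` with a string column `name` (the `maxlength` key is the one this patch reads; `MetadataBuilder` and `Column.as(alias, metadata)` are Spark's existing Scala APIs):

```scala
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.MetadataBuilder

// Attach a maxlength entry to the column's metadata so that the generated
// Redshift schema declares VARCHAR(1024) instead of the VARCHAR(256) default.
val metadata = new MetadataBuilder().putLong("maxlength", 1024).build()
val dfWithMaxLength = df.withColumn("name", col("name").as("name", metadata))
```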

@JoshRosen JoshRosen added this to the 0.5 milestone Aug 26, 2015
@JoshRosen
Contributor Author

Whoops, this should be 256.

Also, should this be configurable?

Contributor

Should we just use `TEXT` in this case, which essentially defers the default to Redshift?

@JoshRosen
Contributor Author

This is a minimum viable patch to illustrate the basic idea and to gather feedback. I'll add more thorough tests tomorrow.

@marmbrus, one question on naming: throughout spark-redshift we've used all-lowercase names for many of the library's configuration parameters. I think that you originally proposed a camel-cased name (maxLength). Given that you've also proposed using a similar column metadata option in Spark itself (apache/spark#8374), how do you think we should handle capitalization here? Are column metadata field names case-insensitive, in which case this point is moot?

Another question: might we ever want to generate `CHAR` columns instead of `VARCHAR` in the future? Would this approach be compatible with that? (I think so, but we should confirm.)

@Aerlinger, @jaley, @traviscrawford, and @eduardoramirez, you might also want to follow this PR to make sure that it addresses your respective use cases.

@codecov-io

Current coverage is 88.47%

Merging #54 into master will increase coverage by +0.54% as of b92e689

```diff
@@            master     #54   diff @@
======================================
  Files           10      10
  Stmts          373     373
  Branches        85      85
  Methods          0       0
======================================
+ Hit            328     330     +2
  Partial          0       0
+ Missed          45      43     -2
```

Review entire Coverage Diff as of b92e689

Powered by Codecov. Updated on successful CI builds.

@marmbrus
Contributor

I believe metadata is case-sensitive (which is unfortunate, since options are not). Lowercase looks good to me. We should perhaps check what the convention in MLlib is, since they are the heaviest users of this feature.
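
A quick illustration of that case sensitivity, using Spark's `MetadataBuilder` (the two key spellings below are just the candidate names under discussion):

```scala
import org.apache.spark.sql.types.MetadataBuilder

val m = new MetadataBuilder().putLong("maxLength", 1024).build()
m.contains("maxLength") // true
m.contains("maxlength") // false: metadata keys are matched case-sensitively
```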

@JoshRosen JoshRosen changed the title [WIP] Use maxlength metadata to configure VARCHAR column lengths Use maxlength metadata to configure VARCHAR column lengths Aug 26, 2015
@JoshRosen
Contributor Author

It turns out that the user-facing APIs for manipulating an existing column's metadata are somewhat incomplete in the Scala API and missing entirely from the Python, SQL, and R APIs. I don't think this blocks merging this patch, however; the right fix is to improve Spark's APIs for working with column metadata.

In follow-up PRs, we can consider whether to add configurations for changing the default size or for enabling truncation as a workaround for errors caused by limited column width.
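
For reference, a sketch of how schema generation can consume this metadata on the library side (illustrative only; `varcharWidth` is a hypothetical helper, not the patch's actual code):

```scala
import org.apache.spark.sql.types.{StringType, StructField}

// Pick a VARCHAR width for one field of the DataFrame schema: use the
// column's maxlength metadata when present, otherwise fall back to 256.
def varcharWidth(field: StructField): Long =
  if (field.dataType == StringType && field.metadata.contains("maxlength")) {
    field.metadata.getLong("maxlength")
  } else {
    256L
  }
```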

