
Conversation

@JoshRosen
Contributor

This patch allows users to specify a `maxlength` column metadata entry for string columns in order to control the width of `VARCHAR` columns in generated Redshift table schemas. This is necessary to support string columns wider than 256 characters, the default width used when generating schemas. The same setting can also be used as an optimization to achieve space savings in Redshift. For more background on the motivation for this feature, see #29.

See also: #53 to improve error reporting when LOAD fails.
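
A minimal sketch of the intended usage, assuming an existing DataFrame `df` with a string column `name` (the `maxlength` key is the one this patch reads; `MetadataBuilder` and `Column.as(alias, metadata)` are Spark's existing Scala APIs):

```scala
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.MetadataBuilder

// Attach a maxlength entry to the column's metadata so that the generated
// Redshift schema declares VARCHAR(1024) instead of the VARCHAR(256) default.
val metadata = new MetadataBuilder().putLong("maxlength", 1024).build()
val dfWithMaxLength = df.withColumn("name", col("name").as("name", metadata))
```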

@JoshRosen JoshRosen added this to the 0.5 milestone Aug 26, 2015
@JoshRosen
Contributor Author

Whoops, this should be 256.

Also, should this be configurable?

Contributor

Should we just use `TEXT` in this case, which essentially defers the default to Redshift?

@JoshRosen
Contributor Author

This is a minimum viable patch to illustrate the basic idea and to gather feedback. I'll add more thorough tests tomorrow.

@marmbrus, one question on naming: throughout spark-redshift we've used all-lowercase names for many of the library's configuration parameters. I think that you originally proposed a camel-cased name (maxLength). Given that you've also proposed using a similar column metadata option in Spark itself (apache/spark#8374), how do you think we should handle capitalization here? Are column metadata field names case-insensitive, in which case this point is moot?

Another question: might we ever want to generate `CHAR` columns instead of `VARCHAR` in the future? Would this approach be compatible with that? (I think so, but we should confirm.)

@Aerlinger, @jaley, @traviscrawford, and @eduardoramirez, you might also want to follow this PR to make sure that it addresses your respective use cases.

@codecov-io

Current coverage is 88.47%

Merging #54 into master will increase coverage by +0.54% as of b92e689

```diff
@@            master     #54   diff @@
======================================
  Files           10      10
  Stmts          373     373
  Branches        85      85
  Methods          0       0
======================================
+ Hit            328     330     +2
  Partial          0       0
+ Missed          45      43     -2
```

Review entire Coverage Diff as of b92e689

Powered by Codecov. Updated on successful CI builds.

@marmbrus
Contributor

I believe metadata is case-sensitive (which is unfortunate, since options are not). Lowercase looks good to me. We should perhaps check what the convention in MLlib is, since they are the heaviest users of this feature.
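
A quick illustration of that case sensitivity, using Spark's `MetadataBuilder` (the two key spellings below are just the candidate names under discussion):

```scala
import org.apache.spark.sql.types.MetadataBuilder

val m = new MetadataBuilder().putLong("maxLength", 1024).build()
m.contains("maxLength") // true
m.contains("maxlength") // false: metadata keys are matched case-sensitively
```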

@JoshRosen JoshRosen changed the title [WIP] Use maxlength metadata to configure VARCHAR column lengths Use maxlength metadata to configure VARCHAR column lengths Aug 26, 2015
@JoshRosen
Contributor Author

It turns out that the user-facing APIs for manipulating an existing column's metadata are somewhat incomplete in the Scala API and missing entirely from the Python, SQL, and R APIs. I don't think this blocks merging this patch, however; the right fix is to improve Spark's APIs for working with column metadata.

In follow-up PRs, we can consider whether to add configurations for changing the default size or for enabling truncation as a workaround for errors caused by limited column width.
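
For reference, a sketch of how schema generation can consume this metadata on the library side (illustrative only; `varcharWidth` is a hypothetical helper, not the patch's actual code):

```scala
import org.apache.spark.sql.types.{StringType, StructField}

// Pick a VARCHAR width for one field of the DataFrame schema: use the
// column's maxlength metadata when present, otherwise fall back to 256.
def varcharWidth(field: StructField): Long =
  if (field.dataType == StringType && field.metadata.contains("maxlength")) {
    field.metadata.getLong("maxlength")
  } else {
    256L
  }
```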

