-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-3966 [Java] JDBC Column Metadata in Arrow Field Metadata #3134
Conversation
…sultSet resultSet, JdbcToArrowConfig config)
…n connection, String query, JdbcToArrowConfig config)
…eive JdbcToArrowConfig objects.
Bring master up to date.
https://travis-ci.org/apache/arrow/jobs/465515849 - This appears to be a Travis / environment hiccup, and not a result of my change. Could someone restart this build? |
You can trigger a fresh build by |
@siddharthteotia @atuldambalkar can you review this? |
Week Ending 12/15
12/30/18 Pulldown
Arrow Master 1/29/2019
Preconditions.checkNotNull(calendar, "Calendar object can't be null"); | ||
|
||
jdbcToArrowVectors(rs, root, new JdbcToArrowConfig(new RootAllocator(0), calendar)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think, RootAllocator with argument 0, may not behave as expected. Have you tested this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a little confusing. The config's allocator is used in the JdbcToArrow
static member functions to construct a VectorSchemaRoot
. At this point, the VectorSchemaRoot
has already been constructed, so the allocator is an unused argument. That's why its allocation can be 0
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, got it. Mainly for the backeward compatibilit for JdbcToArrowUtils.java class.
break; | ||
} | ||
|
||
if (fieldType != null) { | ||
fields.add(new Field(columnName, fieldType, null)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the use of FieldType object instead of creating Field object in each case stmt. I am fine with this.
/** | ||
* For the given SQL query, execute and fetch the data from Relational DB and convert it to Arrow objects. | ||
* | ||
* @param connection Database connection to be used. This method will not close the passed connection object. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it might be a simpler contract to pass through the ResultSet object instead here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That version exists on line 224.
This version can certainly be removed, but I was trying to preserve the pair of contracts the original author supplied. The two contracts were:
- <Database + Query> + [Allocator | Calendar | Both | Neither]
- + [Allocator | Calendar | Both | Neither]
Since I'm adding a new contract of + , it made sense to me to also add <Database + Query> + .
@@ -89,7 +89,9 @@ public static VectorSchemaRoot sqlToArrow(Connection connection, String query, B | |||
Preconditions.checkArgument(query != null && query.length() > 0, "SQL query can not be null or empty"); | |||
Preconditions.checkNotNull(allocator, "Memory allocator object can not be null"); | |||
|
|||
return sqlToArrow(connection, query, allocator, Calendar.getInstance(TimeZone.getTimeZone("UTC"), Locale.ROOT)); | |||
JdbcToArrowConfig config = | |||
new JdbcToArrowConfig(allocator, Calendar.getInstance(TimeZone.getTimeZone("UTC"), Locale.ROOT), false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would a utility method for getting the UTC calendar make sense?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good call. Done.
assertNotNull(metadata); | ||
assertEquals(4, metadata.size()); | ||
|
||
assertEquals(rsmd.getCatalogName(i + 1), metadata.get(Constants.SQL_CATALOG_NAME_KEY)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
might be cleaner if you start the loop at 1 instead of doing the addition in 4 places.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, I agree. I had to use i - 1
when getting the field metadata from the VectorSchemaRoot
, but I agree that's 3 fewer lines of code with an offset. Updated.
Codecov Report
@@ Coverage Diff @@
## master #3134 +/- ##
==========================================
+ Coverage 87.82% 88.96% +1.14%
==========================================
Files 666 441 -225
Lines 82371 53471 -28900
Branches 1069 0 -1069
==========================================
- Hits 72341 47571 -24770
+ Misses 9919 5900 -4019
+ Partials 111 0 -111
Continue to review full report at Codecov.
|
Thanks for the code review! I believe I've responded to all of the feedback. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry for the late review..
some of them are minor style comments..i am ok if you ignore/address in another PR..
* @return This instance of the <code>JdbcToArrowConfig</code>, for chaining. | ||
* @exception NullPointerExeption if <code>calendar</code> is <code>null</code>. | ||
*/ | ||
public JdbcToArrowConfig setCalendar(Calendar calendar) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor style nit (please ignore if you disagree) i would prefer creating a builder, so that the POJO itself is immutable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
older comment..please ignore..
* | ||
* @return Whether this configuration is valid. | ||
*/ | ||
public boolean isValid() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
using a builder would ensure that this object is always valid..and we can validate inside the builder.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
older comment..please ignore..
* | ||
* @return <code>true</code> to include field metadata, <code>false</code> to exclude it. | ||
*/ | ||
public boolean getIncludeMetadata() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor nit (please ignore if you dont agree :))
should this be named "shouldIncludeMetadata"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point; will fix tonight.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
@@ -98,6 +137,6 @@ public JdbcToArrowConfigBuilder setCalendar(Calendar calendar) { | |||
* @throws NullPointerException if either the allocator or calendar was not set. | |||
*/ | |||
public JdbcToArrowConfig build() { | |||
return new JdbcToArrowConfig(allocator, calendar); | |||
return new JdbcToArrowConfig(allocator, calendar, includeMetadata); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(minor nit - again ignore if you dont agree)
can we default the include metadata in the builder so that only the clients that need it will override..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes sense. Will change tonight.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hang on, I'm confused - includeMetadata
is initialized to false
in the builder. This is the behavior you asked for, right? Only people who call setIncludeMetadata(true)
will have the metadata generated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I saw the ctors using the false flag in other places and was misled..this looks ok.
|
||
final Map<String, String> metadata; | ||
if (config.getIncludeMetadata()) { | ||
metadata = new HashMap<String, String>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can use the <> operator to avoid the mention of types in RHS again.. https://www.javaworld.com/article/2074080/core-java/core-java-jdk-7-the-diamond-operator.html
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point; will fix tonight.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
metadata.put(Constants.SQL_TYPE_KEY, rsmd.getColumnTypeName(i)); | ||
|
||
} else { | ||
metadata = null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(minor nit)
do this in the initialization itself?..avoids the else block..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can do that, but then I can't make the metadata variable final. I've worked on teams in the past where the best practice was to mark things final as often as possible; is that the same with Arrow?
Thanks for the feedback! I believe I've responded to all of it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1. Looks good. Thanks @mikepigott
@mikepigott Can you rebase? Then I would merge this. |
Sure thing; I'll merge master when I get home tonight. Thanks! |
Arrow Master Through 2/5
Merged. Thanks for the review! |
I noticed in the merge that one part of #3066 was lost in the shuffle, and I just added it back in. The calendar's time zone is used to create Timestamp types, even though it is not needed. I'm not sure why it was lost, but I just added it back. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, LGTM
Thanks! |
https://issues.apache.org/jira/browse/ARROW-3966 This change includes #3133, and supports a new configuration item called "Include Metadata." If true, metadata from the JDBC ResultSetMetaData object is pulled along to the Schema Field Metadata. For now, this includes: * Catalog Name * Table Name * Column Name * Column Type Name Author: Mike Pigott <mpigott@gmail.com> Author: Michael Pigott <mikepigott@users.noreply.github.com> Closes #3134 from mikepigott/jdbc-column-metadata and squashes the following commits: 02f2f34 <Mike Pigott> ARROW-3966: Picking up lost change to support null calendars. 7049c36 <Mike Pigott> Merge branch 'master' into jdbc-column-metadata e9a9b2b <Michael Pigott> Merge pull request #6 from apache/master 65741a9 <Mike Pigott> ARROW-3966: Code review feedback cc6cc88 <Mike Pigott> ARROW-3966: Using a 1:N loop instead of a 0:N-1 loop for fewer index offsets in code. cfb2ba6 <Mike Pigott> ARROW-3966: Using a helper method for building a UTC calendar with root locale. 2928513 <Mike Pigott> ARROW-3966: Moving the metadata flag assignment into the builder. 69022c2 <Mike Pigott> ARROW-3966: Fixing merge. 4a6de86 <Mike Pigott> Merge branch 'master' into jdbc-column-metadata 509a1cc <Michael Pigott> Merge pull request #5 from apache/master 789c8c8 <Michael Pigott> Merge pull request #4 from apache/master e5b19ee <Michael Pigott> Merge pull request #3 from apache/master 3b17c29 <Michael Pigott> Merge pull request #2 from apache/master d847ebc <Mike Pigott> Fixing file location 1ceac9e <Mike Pigott> Merge branch 'master' into jdbc-column-metadata 881c6c8 <Michael Pigott> Merge pull request #1 from apache/master 03091a8 <Mike Pigott> Unit tests for including result set metadata. 72d64cc <Mike Pigott> Affirming the field metadata is empty when the configuration excludes field metadata. 7b4527c <Mike Pigott> Test for the include-metadata flag in the configuration. 7e9ce37 <Mike Pigott> Merge branch 'jdbc-to-arrow-config' into jdbc-column-metadata bb3165b <Mike Pigott> Updating the function calls to use the JdbcToArrowConfig versions. a6fb1be <Mike Pigott> Fixing function call 5bfd6a2 <Mike Pigott> Merge branch 'jdbc-to-arrow-config' into jdbc-column-metadata 68c91e7 <Mike Pigott> Modifying the jdbcToArrowSchema and jdbcToArrowVectors methods to receive JdbcToArrowConfig objects. b5b0cb1 <Mike Pigott> Merge branch 'jdbc-to-arrow-config' into jdbc-column-metadata 8d6cf00 <Mike Pigott> Documentation for public static VectorSchemaRoot sqlToArrow(Connection connection, String query, JdbcToArrowConfig config) 4f1260c <Mike Pigott> Adding documentation for public static VectorSchemaRoot sqlToArrow(ResultSet resultSet, JdbcToArrowConfig config) e34a9e7 <Mike Pigott> Fixing formatting. fe097c8 <Mike Pigott> Merge branch 'jdbc-to-arrow-config' into jdbc-column-metadata df632e3 <Mike Pigott> Updating the SQL tests to include JdbcToArrowConfig versions. b270044 <Mike Pigott> Updated validaton & documentation, and unit tests for the new JdbcToArrowConfig. da77cbe <Mike Pigott> Creating a configuration class for the JDBC-to-Arrow converter. a78c770 <Mike Pigott> Updating Javadocs. 523387f <Mike Pigott> Updating the API to support an optional 'includeMetadata' field. 5af1b5b <Mike Pigott> Separating out the field-type creation from the field creation.
https://issues.apache.org/jira/browse/ARROW-4142 This adds support for reading JDBC arrays and converting them to Arrow ListVectors. JDBC does not provide a great way to get the array type; there is a ResultSet object for walking the array, but its ResultSetMetaData may not contain the right value type (H2, for example, returns a JDBC type of NULL). This is based on #3134, which includes ARROW-3965 and ARROW-3966. I found Arrow arrays to be very confusing, and I am not sure if I am using them correctly here. One thing I noticed was if I added a null array to a ListVector of VarCharVectors, the next value in the VarCharVector would be empty. I would appreciate any help on why! The ListVector unit tests weren't very helpful. For all other cases, this code seems to work. I look forward to your review! Author: Mike Pigott <mpigott@gmail.com> Closes #3294 from mikepigott/jdbc-array-field and squashes the following commits: 66376dd <Mike Pigott> Support for reading Array records to ListVector from JDBC
https://issues.apache.org/jira/browse/ARROW-4142 This adds support for reading JDBC arrays and converting them to Arrow ListVectors. JDBC does not provide a great way to get the array type; there is a ResultSet object for walking the array, but its ResultSetMetaData may not contain the right value type (H2, for example, returns a JDBC type of NULL). This is based on apache#3134, which includes ARROW-3965 and ARROW-3966. I found Arrow arrays to be very confusing, and I am not sure if I am using them correctly here. One thing I noticed was if I added a null array to a ListVector of VarCharVectors, the next value in the VarCharVector would be empty. I would appreciate any help on why! The ListVector unit tests weren't very helpful. For all other cases, this code seems to work. I look forward to your review! Author: Mike Pigott <mpigott@gmail.com> Closes apache#3294 from mikepigott/jdbc-array-field and squashes the following commits: 66376dd <Mike Pigott> Support for reading Array records to ListVector from JDBC
https://issues.apache.org/jira/browse/ARROW-3966
This change includes #3133, and supports a new configuration item called "Include Metadata." If true, metadata from the JDBC ResultSetMetaData object is pulled along to the Schema Field Metadata. For now, this includes: