Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-3966 [Java] JDBC Column Metadata in Arrow Field Metadata #3134

Closed
wants to merge 35 commits into from
Closed

ARROW-3966 [Java] JDBC Column Metadata in Arrow Field Metadata #3134

wants to merge 35 commits into from

Conversation

mikepigott
Copy link
Contributor

https://issues.apache.org/jira/browse/ARROW-3966

This change includes #3133, and supports a new configuration item called "Include Metadata." If true, metadata from the JDBC ResultSetMetaData object is pulled along to the Schema Field Metadata. For now, this includes:

  • Catalog Name
  • Table Name
  • Column Name
  • Column Type Name

Mike Pigott and others added 22 commits December 4, 2018 20:24
…sultSet resultSet, JdbcToArrowConfig config)
…n connection, String query, JdbcToArrowConfig config)
Bring master up to date.
@mikepigott
Copy link
Contributor Author

https://travis-ci.org/apache/arrow/jobs/465515849 - This appears to be a Travis / environment hiccup, and not a result of my change. Could someone restart this build?

@wesm
Copy link
Member

wesm commented Dec 9, 2018

You can trigger a fresh build by git commit --amend followed by git push -f. I'm restarting this build entry

@wesm
Copy link
Member

wesm commented Dec 14, 2018

@siddharthteotia @atuldambalkar can you review this?

Arrow Master 1/29/2019
Preconditions.checkNotNull(calendar, "Calendar object can't be null");

jdbcToArrowVectors(rs, root, new JdbcToArrowConfig(new RootAllocator(0), calendar));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think, RootAllocator with argument 0, may not behave as expected. Have you tested this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a little confusing. The config's allocator is used in the JdbcToArrow static member functions to construct a VectorSchemaRoot. At this point, the VectorSchemaRoot has already been constructed, so the allocator is an unused argument. That's why its allocation can be 0.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, got it. Mainly for the backeward compatibilit for JdbcToArrowUtils.java class.

break;
}

if (fieldType != null) {
fields.add(new Field(columnName, fieldType, null));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the use of FieldType object instead of creating Field object in each case stmt. I am fine with this.

/**
* For the given SQL query, execute and fetch the data from Relational DB and convert it to Arrow objects.
*
* @param connection Database connection to be used. This method will not close the passed connection object.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it might be a simpler contract to pass through the ResultSet object instead here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That version exists on line 224.

This version can certainly be removed, but I was trying to preserve the pair of contracts the original author supplied. The two contracts were:

  1. <Database + Query> + [Allocator | Calendar | Both | Neither]
  2. + [Allocator | Calendar | Both | Neither]

Since I'm adding a new contract of + , it made sense to me to also add <Database + Query> + .

@@ -89,7 +89,9 @@ public static VectorSchemaRoot sqlToArrow(Connection connection, String query, B
Preconditions.checkArgument(query != null && query.length() > 0, "SQL query can not be null or empty");
Preconditions.checkNotNull(allocator, "Memory allocator object can not be null");

return sqlToArrow(connection, query, allocator, Calendar.getInstance(TimeZone.getTimeZone("UTC"), Locale.ROOT));
JdbcToArrowConfig config =
new JdbcToArrowConfig(allocator, Calendar.getInstance(TimeZone.getTimeZone("UTC"), Locale.ROOT), false);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would a utility method for getting the UTC calendar make sense?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call. Done.

assertNotNull(metadata);
assertEquals(4, metadata.size());

assertEquals(rsmd.getCatalogName(i + 1), metadata.get(Constants.SQL_CATALOG_NAME_KEY));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

might be cleaner if you start the loop at 1 instead of doing the addition in 4 places.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, I agree. I had to use i - 1 when getting the field metadata from the VectorSchemaRoot, but I agree that's 3 fewer lines of code with an offset. Updated.

@codecov-io
Copy link

codecov-io commented Feb 3, 2019

Codecov Report

Merging #3134 into master will increase coverage by 1.14%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #3134      +/-   ##
==========================================
+ Coverage   87.82%   88.96%   +1.14%     
==========================================
  Files         666      441     -225     
  Lines       82371    53471   -28900     
  Branches     1069        0    -1069     
==========================================
- Hits        72341    47571   -24770     
+ Misses       9919     5900    -4019     
+ Partials      111        0     -111
Impacted Files Coverage Δ
cpp/src/arrow/csv/reader.h 0% <0%> (-100%) ⬇️
cpp/src/arrow/array/builder_union.h 0% <0%> (-100%) ⬇️
cpp/src/arrow/array/builder_union.cc 4.16% <0%> (-95.84%) ⬇️
cpp/src/arrow/csv/reader.cc 0.53% <0%> (-90.38%) ⬇️
cpp/src/arrow/adapters/orc/adapter.cc 0.26% <0%> (-73.04%) ⬇️
cpp/src/arrow/csv/options.h 66.66% <0%> (-33.34%) ⬇️
cpp/src/parquet/column_scanner.cc 59.09% <0%> (-22.73%) ⬇️
cpp/src/arrow/vendored/xxhash/xxhash.c 51.26% <0%> (-22.34%) ⬇️
cpp/src/arrow/memory_pool.cc 68.8% <0%> (-16.52%) ⬇️
cpp/src/arrow/io/interfaces.cc 40% <0%> (-15%) ⬇️
... and 273 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4004b72...02f2f34. Read the comment docs.

@mikepigott
Copy link
Contributor Author

Thanks for the code review! I believe I've responded to all of the feedback.

Copy link
Contributor

@praveenbingo praveenbingo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry for the late review..

some of them are minor style comments..i am ok if you ignore/address in another PR..

* @return This instance of the <code>JdbcToArrowConfig</code>, for chaining.
* @exception NullPointerExeption if <code>calendar</code> is <code>null</code>.
*/
public JdbcToArrowConfig setCalendar(Calendar calendar) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor style nit (please ignore if you disagree) i would prefer creating a builder, so that the POJO itself is immutable.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

older comment..please ignore..

*
* @return Whether this configuration is valid.
*/
public boolean isValid() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

using a builder would ensure that this object is always valid..and we can validate inside the builder.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

older comment..please ignore..

*
* @return <code>true</code> to include field metadata, <code>false</code> to exclude it.
*/
public boolean getIncludeMetadata() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor nit (please ignore if you dont agree :))

should this be named "shouldIncludeMetadata"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point; will fix tonight.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

@@ -98,6 +137,6 @@ public JdbcToArrowConfigBuilder setCalendar(Calendar calendar) {
* @throws NullPointerException if either the allocator or calendar was not set.
*/
public JdbcToArrowConfig build() {
return new JdbcToArrowConfig(allocator, calendar);
return new JdbcToArrowConfig(allocator, calendar, includeMetadata);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(minor nit - again ignore if you dont agree)
can we default the include metadata in the builder so that only the clients that need it will override..

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense. Will change tonight.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hang on, I'm confused - includeMetadata is initialized to false in the builder. This is the behavior you asked for, right? Only people who call setIncludeMetadata(true) will have the metadata generated.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I saw the ctors using the false flag in other places and was misled..this looks ok.


final Map<String, String> metadata;
if (config.getIncludeMetadata()) {
metadata = new HashMap<String, String>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can use the <> operator to avoid the mention of types in RHS again.. https://www.javaworld.com/article/2074080/core-java/core-java-jdk-7-the-diamond-operator.html

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point; will fix tonight.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

metadata.put(Constants.SQL_TYPE_KEY, rsmd.getColumnTypeName(i));

} else {
metadata = null;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(minor nit)
do this in the initialization itself?..avoids the else block..

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can do that, but then I can't make the metadata variable final. I've worked on teams in the past where the best practice was to mark things final as often as possible; is that the same with Arrow?

@mikepigott
Copy link
Contributor Author

Thanks for the feedback! I believe I've responded to all of it.

Copy link
Contributor

@praveenbingo praveenbingo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1. Looks good. Thanks @mikepigott

@wesm @xhochy - please help merging this..

@xhochy
Copy link
Member

xhochy commented Feb 5, 2019

@mikepigott Can you rebase? Then I would merge this.

@mikepigott
Copy link
Contributor Author

Sure thing; I'll merge master when I get home tonight. Thanks!

@mikepigott
Copy link
Contributor Author

Merged. Thanks for the review!

@mikepigott
Copy link
Contributor Author

I noticed in the merge that one part of #3066 was lost in the shuffle, and I just added it back in. The calendar's time zone is used to create Timestamp types, even though it is not needed. I'm not sure why it was lost, but I just added it back.

Copy link
Member

@xhochy xhochy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM

@xhochy xhochy closed this in 54182b2 Feb 6, 2019
@xhochy
Copy link
Member

xhochy commented Feb 6, 2019

Thanks!

xhochy pushed a commit that referenced this pull request Feb 8, 2019
https://issues.apache.org/jira/browse/ARROW-3966

This change includes #3133, and supports a new configuration item called "Include Metadata."  If true, metadata from the JDBC ResultSetMetaData object is pulled along to the Schema Field Metadata.  For now, this includes:
* Catalog Name
* Table Name
* Column Name
* Column Type Name

Author: Mike Pigott <mpigott@gmail.com>
Author: Michael Pigott <mikepigott@users.noreply.github.com>

Closes #3134 from mikepigott/jdbc-column-metadata and squashes the following commits:

02f2f34 <Mike Pigott> ARROW-3966: Picking up lost change to support null calendars.
7049c36 <Mike Pigott> Merge branch 'master' into jdbc-column-metadata
e9a9b2b <Michael Pigott> Merge pull request #6 from apache/master
65741a9 <Mike Pigott> ARROW-3966: Code review feedback
cc6cc88 <Mike Pigott> ARROW-3966: Using a 1:N loop instead of a 0:N-1 loop for fewer index offsets in code.
cfb2ba6 <Mike Pigott> ARROW-3966: Using a helper method for building a UTC calendar with root locale.
2928513 <Mike Pigott> ARROW-3966: Moving the metadata flag assignment into the builder.
69022c2 <Mike Pigott> ARROW-3966: Fixing merge.
4a6de86 <Mike Pigott> Merge branch 'master' into jdbc-column-metadata
509a1cc <Michael Pigott> Merge pull request #5 from apache/master
789c8c8 <Michael Pigott> Merge pull request #4 from apache/master
e5b19ee <Michael Pigott> Merge pull request #3 from apache/master
3b17c29 <Michael Pigott> Merge pull request #2 from apache/master
d847ebc <Mike Pigott> Fixing file location
1ceac9e <Mike Pigott> Merge branch 'master' into jdbc-column-metadata
881c6c8 <Michael Pigott> Merge pull request #1 from apache/master
03091a8 <Mike Pigott> Unit tests for including result set metadata.
72d64cc <Mike Pigott> Affirming the field metadata is empty when the configuration excludes field metadata.
7b4527c <Mike Pigott> Test for the include-metadata flag in the configuration.
7e9ce37 <Mike Pigott> Merge branch 'jdbc-to-arrow-config' into jdbc-column-metadata
bb3165b <Mike Pigott> Updating the function calls to use the JdbcToArrowConfig versions.
a6fb1be <Mike Pigott> Fixing function call
5bfd6a2 <Mike Pigott> Merge branch 'jdbc-to-arrow-config' into jdbc-column-metadata
68c91e7 <Mike Pigott> Modifying the jdbcToArrowSchema and jdbcToArrowVectors methods to receive JdbcToArrowConfig objects.
b5b0cb1 <Mike Pigott> Merge branch 'jdbc-to-arrow-config' into jdbc-column-metadata
8d6cf00 <Mike Pigott> Documentation for public static VectorSchemaRoot sqlToArrow(Connection connection, String query, JdbcToArrowConfig config)
4f1260c <Mike Pigott> Adding documentation for public static VectorSchemaRoot sqlToArrow(ResultSet resultSet, JdbcToArrowConfig config)
e34a9e7 <Mike Pigott> Fixing formatting.
fe097c8 <Mike Pigott> Merge branch 'jdbc-to-arrow-config' into jdbc-column-metadata
df632e3 <Mike Pigott> Updating the SQL tests to include JdbcToArrowConfig versions.
b270044 <Mike Pigott> Updated validaton & documentation, and unit tests for the new JdbcToArrowConfig.
da77cbe <Mike Pigott> Creating a configuration class for the JDBC-to-Arrow converter.
a78c770 <Mike Pigott> Updating Javadocs.
523387f <Mike Pigott> Updating the API to support an optional 'includeMetadata' field.
5af1b5b <Mike Pigott> Separating out the field-type creation from the field creation.
pravindra pushed a commit that referenced this pull request Feb 19, 2019
https://issues.apache.org/jira/browse/ARROW-4142

This adds support for reading JDBC arrays and converting them to Arrow ListVectors.  JDBC does not provide a great way to get the array type; there is a ResultSet object for walking the array, but its ResultSetMetaData may not contain the right value type (H2, for example, returns a JDBC type of NULL).

This is based on #3134, which includes ARROW-3965 and ARROW-3966.

I found Arrow arrays to be very confusing, and I am not sure if I am using them correctly here.  One thing I noticed was if I added a null array to a ListVector of VarCharVectors, the next value in the VarCharVector would be empty.  I would appreciate any help on why!  The ListVector unit tests weren't very helpful.

For all other cases, this code seems to work.  I look forward to your review!

Author: Mike Pigott <mpigott@gmail.com>

Closes #3294 from mikepigott/jdbc-array-field and squashes the following commits:

66376dd <Mike Pigott> Support for reading Array records to ListVector from JDBC
tanyaschlusser pushed a commit to tanyaschlusser/arrow that referenced this pull request Feb 21, 2019
https://issues.apache.org/jira/browse/ARROW-4142

This adds support for reading JDBC arrays and converting them to Arrow ListVectors.  JDBC does not provide a great way to get the array type; there is a ResultSet object for walking the array, but its ResultSetMetaData may not contain the right value type (H2, for example, returns a JDBC type of NULL).

This is based on apache#3134, which includes ARROW-3965 and ARROW-3966.

I found Arrow arrays to be very confusing, and I am not sure if I am using them correctly here.  One thing I noticed was if I added a null array to a ListVector of VarCharVectors, the next value in the VarCharVector would be empty.  I would appreciate any help on why!  The ListVector unit tests weren't very helpful.

For all other cases, this code seems to work.  I look forward to your review!

Author: Mike Pigott <mpigott@gmail.com>

Closes apache#3294 from mikepigott/jdbc-array-field and squashes the following commits:

66376dd <Mike Pigott> Support for reading Array records to ListVector from JDBC
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants