-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-4142: [Java] JDBC Array -> Arrow ListVector #3294
Conversation
Codecov Report
@@ Coverage Diff @@
## master #3294 +/- ##
==========================================
+ Coverage 87.79% 89.03% +1.24%
==========================================
Files 689 463 -226
Lines 84209 55085 -29124
Branches 1081 0 -1081
==========================================
- Hits 73929 49045 -24884
+ Misses 10165 6040 -4125
+ Partials 115 0 -115
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm new to the java codebase so don't take my questions/suggestions as gospel.
|
||
/** | ||
* Builds a <code>JdbcFieldInfo</code> using only the {@link java.sql.Types} type. Do not use this constructor | ||
* if the field type is {@link java.sql.Types#DECIMAL} or {@link java.sql.Types#NUMERIC}; the precision and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe put a check for this in the constructor for this condition, in case someone doesn't read the javadoc?
|
||
/** | ||
* Builds a <code>JdbcFieldInfo</code> from the {@link java.sql.Types} type, precision, and scale. | ||
* Use this constructor for {@link java.sql.Types#DECIMAL} and {@link java.sql.Types#NUMERIC} types. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe warn if this is used with other types?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As in, log a warning? I don't think this project logs much of anything.
public JdbcFieldInfo(ResultSetMetaData rsmd, int column) throws SQLException { | ||
Preconditions.checkNotNull(rsmd, "ResultSetMetaData cannot be null."); | ||
Preconditions.checkArgument(column > 0, "ResultSetMetaData columns have indices starting at 1."); | ||
Preconditions.checkArgument(column <= rsmd.getColumnCount(), "The index must be within the number of columns"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for these two (probably more for the second one, it might pay to use the templating feature of Preconditions.checkArgument to pass through the bad values
throws SQLException, IOException { | ||
Preconditions.checkNotNull(connection, "JDBC connection object can not be null"); | ||
Preconditions.checkArgument(query != null && query.length() > 0, "SQL query can not be null or empty"); | ||
Preconditions.checkNotNull(config, "The configuration cannot be null"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style consistency. it looks like you combined null and a validity check but here you split them for config
*/ | ||
public static VectorSchemaRoot sqlToArrow(ResultSet resultSet, JdbcToArrowConfig config) | ||
throws SQLException, IOException { | ||
Preconditions.checkNotNull(resultSet, "JDBC ResultSet object can not be null"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm new to the java code base, but given all the nullness checks it might be nice to follow a convention things are not null by default and use "@nullable" annotations when they can be null (might be worth discussing on the mailing list)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm just trying to follow the style in the existing code. I think stylistic changes like these should be a different PR ...
break; | ||
final Map<String, String> metadata; | ||
if (config.getIncludeMetadata()) { | ||
metadata = new HashMap<String, String>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason to use a Map here vs POJO?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One advantage of a POJO might be you can move this logic to the constructor of that object
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see it is getting passed through to the existing schema.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Exactly. Also, this seems to be a merge conflict that was resolved the wrong way; the metadata logic was implemented in another PR.
|
||
final ArrowType arrowType = getArrowTypeForJdbcField(new JdbcFieldInfo(rsmd, i), config.getCalendar()); | ||
if (arrowType != null) { | ||
final FieldType fieldType = new FieldType(true, arrowType, null, metadata); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please add a /* parameter name */ for null
JdbcToArrowConfig config) | ||
throws SQLException, IOException { | ||
|
||
final JdbcFieldInfo fieldInfo = getJdbcFieldInfoForArraySubType(resultSet.getMetaData(), arrayIndex, config); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would it make sense to move this to be a method on config?
|
||
final JdbcFieldInfo fieldInfo = getJdbcFieldInfoForArraySubType(resultSet.getMetaData(), arrayIndex, config); | ||
if (fieldInfo == null) { | ||
throw new IllegalStateException("Column " + arrayIndex + " is an array of unknown type."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this seems more like config wasn't setup correctly so illegal argument exception might be better.
|
||
while (rs.next()) { | ||
// The second column contains the actual data. | ||
jdbcToFieldVector(rs, 2, fieldInfo.getJdbcType(), valueCount + arrayRowCount, fieldVector, config); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you make 2, a named constant?
* @return The corresponding <code>ArrowType</code>. | ||
* @throws NullPointerException if either <code>fieldInfo</code> or <code>calendar</code> are <code>null</code>. | ||
*/ | ||
public static ArrowType getArrowTypeForJdbcField(JdbcFieldInfo fieldInfo, Calendar calendar) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you could potentially move this to a method on fieldInfo
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's an idea. This logic, in some form, has always been in this file, which is why I left it there.
break; | ||
default: | ||
// no-op, shouldn't get here | ||
arrowType = null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would it make sense to throw an exception instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or at least log?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wasn't trying to stray too far from what the original author was doing, since his work did pass code review ...
The diff is a little confusing, but the original logic is on the old logic is on the old (red) line 249.
} | ||
rowCount++; | ||
} | ||
root.setRowCount(rowCount); | ||
} | ||
|
||
private static void jdbcToFieldVector( | ||
ResultSet rs, | ||
int i, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a name indicating this is the column index might be a little clearer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair point.
/* Uses the configuration to determine what the array sub-type JdbcFieldInfo is. | ||
* If no sub-type can be found, returns null. | ||
*/ | ||
private static JdbcFieldInfo getJdbcFieldInfoForArraySubType( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
move to JdbcToArrowConfig?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was trying to avoid making JdbcToArrowConfig too "smart." This logic can certainly be moved there, but I wanted the configuration to be a POJO that the JdbcToArrow logic pulled from, rather than define parts of the JdbcToArrow behavior.
Thanks for the feedback! I'll cycle back to this once ARROW-3966 (#3134) and ARROW-3923 (#3066) are approved and merged. For now, do you know enough about Arrow ListVectors to know why the Varchar ListVector doesn't quite work? |
java/adapter/jdbc/src/main/java/org/apache/arrow/adapter/jdbc/JdbcToArrowUtils.java
Show resolved
Hide resolved
I tried to document my issue in the PR description, but it may be easiest to see with a test. I'm in my phone so I can't look it up, but one of the unit tests I wrote builds a VarChar ListVector. It passes now because I add the null value last. If you change the test to add the null earlier in the test, it will fail. The value after the null will not be found. I'm not sure why. |
I just merged with the latest master, and got all of the tests to pass. (I haven't responded to any of the feedback on this PR yet, but I will this week.) My last commit modified the unit test to show the error I'm getting when using a VarChar ListVector with a
All of these are written to the ListVector successfully, but only the first two elements come back out again. "5" comes back out as an empty string. I'm not sure why.
Thanks for your help! |
@mikepigott Thanks for the test case, there seems to be a bug in the way ListVector handles the value counts for its child vector. Will discuss with @pravindra and get back on this. |
@mikepigott Can you rebase this on master or do you need help with that? |
Hi @xhochy - I can merge master into this branch if you need; rebasing always gets me into trouble. Should I expect my unit test to work now? |
As we sadly rebase master on release tags, we also need to rebase this on master. |
Rebased. @mikepigott please avoid |
Sure thing. Sorry for the trouble. |
@xhochy this branch looks right to me, compare with your PR |
Ugh. I'm going to force push the branch again |
@mikepigott I just edited the branch again. Please |
OK, I have your changes now. |
Cool sorry for the cross-fire =) |
No worries! I'll try to respond to the code review feedback this weekend. |
I think I've either responded to or implemented all of the code-review feedback. Please let me know if I missed anything! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for addressing my feedback. I left a few more comments (most of the m can likely be done as part of follow-up CLs).
I assume the failing unit tests are still due to the issue @praveenbingo brought up, don't know if there has been an update there.
* @param scale The field's numeric scale. | ||
*/ | ||
public JdbcFieldInfo(int jdbcType, int precision, int scale) { | ||
this.jdbcType = jdbcType; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might want to add a similar check to the one above in here (only reversed). Or comment why the check isn't there (maybe it makes reflection or something else easier?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's no real reason not to use this constructor for non-numeric & non-decimal types; the precision & scale are just ignored. No harm done.
* @param calendar The calendar to use when constructing Timestamp fields and reading time-based results. | ||
*/ | ||
public JdbcToArrowConfig(BaseAllocator allocator, Calendar calendar) { | ||
Preconditions.checkNotNull(allocator, "Memory allocator cannot be null"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SGTM.
* @param index The {@link java.sql.ResultSetMetaData} column name of an {@link java.sql.Types#ARRAY} type. | ||
* @return The {@link JdbcFieldInfo} for that array's sub-type, or <code>null</code> if not defined. | ||
*/ | ||
public JdbcFieldInfo getArraySubTypeByColumnName(String name) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably not for this review. I think having a clear convention around nulls is important (and my preference but I'm sure there are others) is to assume things are non-null by default and mark things that can be null explcitly. I'll go through the mailing list archive to see if this was discussed before and if not post to the mailing list.
* @return The {@link JdbcFieldInfo} for that array's sub-type, or <code>null</code> if not defined. | ||
*/ | ||
public JdbcFieldInfo getArraySubTypeByColumnName(String name) { | ||
if (arraySubTypesByColumnName == null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is probably a style thing also, so not relevant for this review, but in the past I've had a preference for validating up front (again I'm not familiar with conventions in this code base though).
@@ -30,6 +31,8 @@ | |||
private Calendar calendar; | |||
private BaseAllocator allocator; | |||
private boolean includeMetadata; | |||
private Map<Integer, JdbcFieldInfo> arraySubTypesByColumnIndex; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
probably not worth changing for this review, but given this would be a relatively dense map, it might be slightly more performant to have this as an JdbcFieldInfo[]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Depends on the number of fields returned by the query. If a query returns 10 fields and the ninth is an array, the map wouldn't be very dense at all. I'd prefer to leave this as-is until proven otherwise?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fine with leaving as is. You are right not quite dense, but still a few number of elements in general vs boxing/unboxing cost for integers on every call (like I said probably not a performance issue at all.
} | ||
} | ||
|
||
private void testIntegerList(ListVector listVector, int rowCount, Integer[][] expectedValues) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why isn't rowCount determined by expectedValues.length?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then it would look like:
int rowCount = (expectedValues == null) ? 0 : expectedValues.length;
I don't think that's clearer.
} | ||
|
||
private void testFloatList(ListVector listVector, int rowCount, Float[][] expectedValues) { | ||
Float4Vector vector = (Float4Vector) listVector.getDataVector(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there seems to be a lot of redundant code in these test methods is it possible to refactor to use a lambda to avoid the duplication?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The redundancy is due to the different vectors having different types. I'm not at all familiar with Java lambdas, but from what I'm reading in https://medium.freecodecamp.org/learn-these-4-things-and-working-with-lambda-expressions-b0ab36e0fffc it looks like I can't use it to hide the types?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No they can't really hide types, but if the function took two Object values you could have a specific one that does the casts internally for the listVector and expectedValues.
No need to implement this just a thought.
public void testJdbcToArrowWithNulls() throws Exception { | ||
int rowCount = 3; | ||
Integer[][] intArrays = new Integer[rowCount][]; | ||
intArrays[0] = null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in addition to nulls it might be worth testing a non-null empty list as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added.
int offset = offsetBuffer.getInt((row + 1) * ListVector.OFFSET_WIDTH); | ||
|
||
if (expectedValues[row] == null) { | ||
assertEquals(0, offset - prevOffset); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i might be missing it but shouldn't there be a test against the validity bitmap as well for each row?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
vector.get(i) throws an exception if trying to get an invalid item (which is determined by checking the validity bitmap). So I think it's covered.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking on listVector not vector, is that case still covered?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey, nice catch, I wasn't indicating values as NULL when I should have been. I fixed that.
java/adapter/jdbc/src/main/java/org/apache/arrow/adapter/jdbc/JdbcToArrowConfig.java
Show resolved
Hide resolved
@mikepigott @emkornfield |
Thanks @praveenbingo! @emkornfield - I'll try to respond to your feedback by Friday. |
@emkornfield: I believe I've responded to and/or implemented all of your feedback. That said, I don't think I follow your comment #3294 (comment) ... where did you want to see it replicated? Thanks, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On using length instead of passing rowCount (#3294 (comment)) I didn't think you ever passed in nulls so the code would need to change.
Overall, other then the issues I responded to, I don't think I have other comments. So its up to a committer to take a pass. Thanks for bearing with me.
java/adapter/jdbc/src/main/java/org/apache/arrow/adapter/jdbc/JdbcToArrowConfig.java
Show resolved
Hide resolved
@@ -30,6 +31,8 @@ | |||
private Calendar calendar; | |||
private BaseAllocator allocator; | |||
private boolean includeMetadata; | |||
private Map<Integer, JdbcFieldInfo> arraySubTypesByColumnIndex; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fine with leaving as is. You are right not quite dense, but still a few number of elements in general vs boxing/unboxing cost for integers on every call (like I said probably not a performance issue at all.
} catch (Exception e) { | ||
e.printStackTrace(); | ||
throw e; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, might be worth leaving a comment reflecting that. I wonder if it is a configuration parameter (I would think at least in an IDE it would be possible to get the full strack trace without this).
} | ||
|
||
private void testFloatList(ListVector listVector, int rowCount, Float[][] expectedValues) { | ||
Float4Vector vector = (Float4Vector) listVector.getDataVector(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No they can't really hide types, but if the function took two Object values you could have a specific one that does the casts internally for the listVector and expectedValues.
No need to implement this just a thought.
public void testJdbcToArrowWithNulls() throws Exception { | ||
int rowCount = 4; | ||
|
||
Integer[][] intArrays = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
missed this before but why the boxes type?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can confirm when I get home, but I believe the JDBC function to build a JDBC Array requires an Object[]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can confirm when I get home, but I believe the JDBC function to build a JDBC Array requires an Object[]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Confirmed.
Array intArray = conn.createArrayOf("INT", integerArray);
- the createArrayOf
field requires an Object[]
as its second parameter.
int offset = offsetBuffer.getInt((row + 1) * ListVector.OFFSET_WIDTH); | ||
|
||
if (expectedValues[row] == null) { | ||
assertEquals(0, offset - prevOffset); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking on listVector not vector, is that case still covered?
Sorry, I won't be able to dive into your comments tonight. I'll try again tomorrow. Have a good night! |
OK! I think my last push covers the last of the feedback. |
@praveenbingo - I saw #3625 was merged, so I rebased master and retried the failing unit test from before. It passes now! Thank you so much for fixing that. I think I've responded to all of the feedback on this PR? |
@mikepigott - Sure no problem :) Something seems off in the PR..did you merge from master by any chance? PR seems to include a lot of existing commits.. |
@praveenbingo - I rebased from apache/arrow:master using the instructions in https://thoughtbot.com/blog/keeping-a-github-fork-updated ... I'm sorry if I did that wrong. But yes, that's where the extra commits came from. |
@mikepigott you probably had not fetched from the apache remote. Would it be OK if I fixed your branch? |
There's a merge commit here " |
Absolutely - sorry for the trouble! |
After I performed the rebase, git told me I needed to merge from ironponyexpress/arrow:jdbc-array-field because the branches had diverged. I listened to it. |
That's not right. The next step is to force push your remote branch |
Oh. Sorry. |
Change-Id: Id6788671f5b30ed2f33f4317d00b5e6dc92ee889
Fixed. Please make sure to
|
Done. Sorry again. |
No worries, git workflows are always an exercise in blood, sweat, and tears :) |
Looks like the build is passed, who can have a final look and sign off on this PR? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mikepigott - thanks for pushing this through :)
@pravindra @wesm - could you please help merge.
https://issues.apache.org/jira/browse/ARROW-4142 This adds support for reading JDBC arrays and converting them to Arrow ListVectors. JDBC does not provide a great way to get the array type; there is a ResultSet object for walking the array, but its ResultSetMetaData may not contain the right value type (H2, for example, returns a JDBC type of NULL). This is based on apache#3134, which includes ARROW-3965 and ARROW-3966. I found Arrow arrays to be very confusing, and I am not sure if I am using them correctly here. One thing I noticed was if I added a null array to a ListVector of VarCharVectors, the next value in the VarCharVector would be empty. I would appreciate any help on why! The ListVector unit tests weren't very helpful. For all other cases, this code seems to work. I look forward to your review! Author: Mike Pigott <mpigott@gmail.com> Closes apache#3294 from mikepigott/jdbc-array-field and squashes the following commits: 66376dd <Mike Pigott> Support for reading Array records to ListVector from JDBC
https://issues.apache.org/jira/browse/ARROW-4142
This adds support for reading JDBC arrays and converting them to Arrow ListVectors. JDBC does not provide a great way to get the array type; there is a ResultSet object for walking the array, but its ResultSetMetaData may not contain the right value type (H2, for example, returns a JDBC type of NULL).
This is based on #3134, which includes ARROW-3965 and ARROW-3966.
I found Arrow arrays to be very confusing, and I am not sure if I am using them correctly here. One thing I noticed was if I added a null array to a ListVector of VarCharVectors, the next value in the VarCharVector would be empty. I would appreciate any help on why! The ListVector unit tests weren't very helpful.
For all other cases, this code seems to work. I look forward to your review!