Skip to content

Conversation

@herval
Copy link

@herval herval commented Jun 22, 2017

What is this PR for?

One thing we tracked down is that if you issue a large query on a very large table, it will simply try to load all results (and then cap them on Zeppelin's side), which seems suboptimal (and will freeze the server). Setting this on the JDBC level seems to solve the problem.

What type of PR is it?

Bug Fix

Todos

  • Tests

How should this be tested?

  • Create or use a table with a very large number of rows. In our tests, I simply created a:
createdb zeppelin_test

psql zeppelin_test

create table too_many_rows(n int)

And added 5m rows to it.

Making a paragraph like this will hang without setting a limit:

%zeppelin_test

select * from too_many_rows

Questions:

  • Does the licenses files need update?
    No
  • Is there breaking changes for older versions?
    No
  • Does this needs documentation?
    No

@herval herval force-pushed the hfreire/limit-row-count branch from f574f84 to d3d6eba Compare June 22, 2017 03:30
@tinkoff-dwh
Copy link
Contributor

user will not know that the table has more rows and the result is not complete. Maybe set fetchSize?

@herval herval force-pushed the hfreire/limit-row-count branch from d3d6eba to 1061a66 Compare June 22, 2017 17:46
@herval
Copy link
Author

herval commented Jun 22, 2017

@tinkoff-dwh you're right, picked the wrong method - using setFetchSize now

@herval
Copy link
Author

herval commented Jun 22, 2017

I'm also not exactly sure how to write a test for this, as the JDBCInterpreterTest doesn't use anything in the line of mocks (and the outside behavior - limiting the number of returned rows - is already guaranteed by some filtering code. Any ideas @tinkoff-dwh ?

@tinkoff-dwh
Copy link
Contributor

it would be better getMaxResult() + 1

@tinkoff-dwh
Copy link
Contributor

I think it's impossible to test.

@herval
Copy link
Author

herval commented Jun 22, 2017

Why +1?

@tinkoff-dwh
Copy link
Contributor

yep, getMaxResult is correct. i thought the condition <=
while (displayRowCount < getMaxResult() && resultSet.next())
and last next() will request the next part (fetchSize)

@herval
Copy link
Author

herval commented Jun 22, 2017

It seems apache/zeppelin master (which is what this branch is based on) is failing the build - should I rebase it to 0.7 instead, or wait for master to be fixed? Please let me know what should I do next here (this is my first contribution to the project, pardon the ignorance)

@tinkoff-dwh
Copy link
Contributor

https://travis-ci.org/herval/zeppelin/builds/245889847
try restart 3 and 5 job

@herval
Copy link
Author

herval commented Jun 23, 2017

@tinkoff-dwh build #3 is failing in both my master & zeppelin master: https://travis-ci.org/apache/zeppelin - restarting didn't make it pass: https://travis-ci.org/apache/zeppelin/jobs/246113686

@tinkoff-dwh
Copy link
Contributor

tinkoff-dwh commented Jun 24, 2017

failed test which load dependencies, it is possible too long or problems with the network. try restart again

@herval herval force-pushed the hfreire/limit-row-count branch from 1061a66 to f751be8 Compare June 26, 2017 17:49
for (int i = 0; i < sqlArray.size(); i++) {
String sqlToExecute = sqlArray.get(i);
statement = connection.createStatement();
statement.setFetchSize(getMaxResult());
Copy link
Contributor

@zjffdu zjffdu Jun 27, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If getMaxResult() is used in statement directly, then it might not be necessary to use it in method getResults

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zjffdu
? it is count for size of fetch, not limit

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, my misunderstanding of this API. Please ignore it.

@tinkoff-dwh
Copy link
Contributor

LGTM

@herval
Copy link
Author

herval commented Jun 28, 2017

Build's passing: https://travis-ci.org/herval/zeppelin/builds/247193502

Let me know if this is mergeable :-) @tinkoff-dwh

@tinkoff-dwh
Copy link
Contributor

@herval
reopen pull request. and wait another reviewers or committers)

Copy link
Member

@khalidhuseynov khalidhuseynov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for improvement. LGTM and CI is passing, will be merging into master if no more discussion

Copy link
Member

@felixcheung felixcheung left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should there be something to indicate the result is truncated?

@herval
Copy link
Author

herval commented Jun 29, 2017

hmm.. It actually seems that setFetchSize does not limit the query execution (so it takes forever & hangs Zeppelin)

Figuring out if there's a way to show results are truncated + use setMaxRows

@herval
Copy link
Author

herval commented Jun 29, 2017

screenshot 2017-06-29 12 37 41

@tinkoff-dwh
Copy link
Contributor

we already have a check to limit. setMaxRows duplicate it

@herval
Copy link
Author

herval commented Jun 29, 2017

@tinkoff-dwh it's not duplicate. The bug here is that if you don't do statement.setMaxRows, some drivers (eg postgres) will try to load everything when you issue a select *. Using setFetchSize won't affect that (I tested it). So the solution I found is to set the limit to (fetch size + 1). There's a function that will prune the results after they're back from JDBC (getResults) - I'm guessing we could remove that, if all it did was checking that limit, but don't think it's worth pursuing at this time.

TLDR - the current limit check is only applied after you get results. If you query a big enough table, you'll never get the results (and will kill Zeppelin in the process)

@felixcheung
Copy link
Member

got it. I agree it's generally better to restrict the job at the source
perhaps you can use the existing settings but apply it at the JDBC instead? it would be easier to switch existing user over the new behavior this way - actually, let's see what other thinks about this approach.

@herval
Copy link
Author

herval commented Jun 30, 2017 via email

@herval
Copy link
Author

herval commented Jul 4, 2017

any additional thoughts on this? Is it mergeable? @khalidhuseynov @tinkoff-dwh @felixcheung

@herval herval force-pushed the hfreire/limit-row-count branch from 4beb6d8 to 4f66469 Compare July 10, 2017 16:58
@herval
Copy link
Author

herval commented Jul 10, 2017

Rebased to latest master - please advise

dabaitu pushed a commit to dabaitu/zeppelin that referenced this pull request Jul 10, 2017
Summary: As per the PR discussion on apache#2428

Differential Revision: https://phabricator.twitter.biz/D65563
@herval
Copy link
Author

herval commented Jul 14, 2017

ping - please advise

@astroshim
Copy link
Contributor

LGTM~
I think this PR can be merged.


// fetch n+1 rows in order to indicate there's more rows available (for large selects)
statement.setFetchSize(getMaxResult());
statement.setMaxRows(getMaxResult() + 1);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so hopefully getMaxResult() won't return Integer.MAX_VALUE? because +1 will overflow?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless someone specifies a fetch size of two billion rows, this should be fine (the UI would break and Zeppelin would run OOM and the world would melt w/ that amount of rows anyway, so I wouldn't particularly worry about that scenario? :-))

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

haha, yes

@felixcheung
Copy link
Member

merging if no more comment

@asfgit asfgit closed this in 87705a9 Jul 17, 2017
@herval herval deleted the hfreire/limit-row-count branch July 17, 2017 21:43
prabhjyotsingh pushed a commit to prabhjyotsingh/zeppelin that referenced this pull request Oct 23, 2017
### What is this PR for?

One thing we tracked down is that if you issue a large query on a very large table, it will simply try to load all results (and then cap them on Zeppelin's side), which seems suboptimal (and will freeze the server). Setting this on the JDBC level seems to solve the problem.

### What type of PR is it?
Bug Fix

### Todos
* [x] Tests

### How should this be tested?

- Create or use a table with a very large number of rows. In our tests, I simply created a:

```
createdb zeppelin_test

psql zeppelin_test

create table too_many_rows(n int)
```

And added 5m rows to it.

Making a paragraph like this will hang without setting a limit:
```
%zeppelin_test

select * from too_many_rows
```

### Questions:
* Does the licenses files need update?
No
* Is there breaking changes for older versions?
No
* Does this needs documentation?
No

Author: Herval Freire <hfreire@twitter.com>

Closes apache#2428 from herval/hfreire/limit-row-count and squashes the following commits:

4f66469 [Herval Freire] display truncation message
b538c44 [Herval Freire] limiting results from jdbc
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants