
Implement cassandra-journal.max-result-size via Paging #70

Closed
chbatey opened this issue Jul 28, 2015 · 3 comments
chbatey commented Jul 28, 2015

At the moment a LIMIT query is used. We could instead use Cassandra's paging (fetch size), introduced in C* 2.0:

http://datastax.github.io/java-driver/2.1.7/features/paging/

This is used in the C* Spark Connector when bringing a large number of rows into Spark from C*.

It would remove one of the cases in the RowIterator. Since we're already doing synchronous queries, this would be almost transparent (a call to next() would just block). You can also query the paging state to see how many rows can be taken without blocking, or get a future that completes when more rows are available.
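To illustrate the behaviour described above, here is a minimal, self-contained sketch (not the real java-driver API; the class and method names are invented for illustration) of an iterator that transparently fetches the next page when the current one is exhausted, so next() just blocks while a page is being pulled:

```java
import java.util.ArrayDeque;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.IntFunction;

// Stand-in for driver-side paging: rows arrive in pages, and hasNext()/next()
// transparently (synchronously) fetch the next page when the current one is used up.
class PagedRowIterator implements Iterator<Integer> {
    private final IntFunction<List<Integer>> fetchPage; // pageIndex -> rows; empty list = no more pages
    private final ArrayDeque<Integer> current = new ArrayDeque<>();
    private int pageIndex = 0;
    private boolean exhausted = false;

    PagedRowIterator(IntFunction<List<Integer>> fetchPage) {
        this.fetchPage = fetchPage;
    }

    // Analogous to the driver's "how many rows can be taken without blocking":
    // rows consumable before another synchronous page fetch is needed.
    int availableWithoutFetching() {
        return current.size();
    }

    @Override
    public boolean hasNext() {
        while (current.isEmpty() && !exhausted) {
            List<Integer> page = fetchPage.apply(pageIndex++); // this is where next() would block
            if (page.isEmpty()) exhausted = true;
            else current.addAll(page);
        }
        return !current.isEmpty();
    }

    @Override
    public Integer next() {
        if (!hasNext()) throw new NoSuchElementException();
        return current.poll();
    }
}

public class PagingDemo {
    public static void main(String[] args) {
        int totalRows = 10, fetchSize = 4;
        // Fake "server": serves rows 0..9 in pages of fetchSize.
        PagedRowIterator it = new PagedRowIterator(page -> {
            int from = page * fetchSize;
            int to = Math.min(from + fetchSize, totalRows);
            return java.util.stream.IntStream.range(from, to).boxed()
                    .collect(java.util.stream.Collectors.toList());
        });
        int count = 0;
        while (it.hasNext()) { it.next(); count++; }
        System.out.println(count); // prints 10: all rows, fetched in pages of 4
    }
}
```

The caller never issues a follow-up query itself, which is what would let us drop that case from the RowIterator.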

WDYT?

krasserm commented Aug 1, 2015

After reading the docs again, +1 for using the fetch size on the driver (and simplifying the RowIterator).

The current implementation came from a misunderstanding on my side: I thought the server always limits the size of a result set (using a default value if none is given), so that clients need to query for more results once they've iterated over a previous query. I thought that LIMIT was independent of the driver's fetch size, but they seem to be the same. I somehow missed in the docs that the server doesn't limit the size of a result set at all. Is my understanding correct?


chbatey commented Aug 1, 2015

You were right that LIMIT is independent of fetch size and that the server by default will return all the rows.

LIMIT is mainly used for manual paging (like we are doing now) or for getting the top N results, and it means you need to issue another query based on the last row of the previous one.

Prior to 2.0, if you didn't specify a LIMIT (paging didn't exist yet), you would get an OutOfMemoryError in either the coordinator or your app, as C* just brought back all the rows. I deleted so much code when paging was added.

I believe the driver sets a default fetch size, and without a LIMIT all the rows will come back (and if you read them all into memory you of course still risk an OOM).

To add to the confusion, cqlsh adds a LIMIT of 1000 to all your queries, though this may have been changed to use the fetch size, as I see cqlsh now has a less-like interface for large results. I need to confirm this.

TL;DR: we can remove LIMIT and just rely on the fetch size.
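For contrast, the manual LIMIT-based paging described above (reissue the query starting after the last row of the previous page) can be sketched without the driver. The query shape and names here are invented for illustration; a real implementation would run a CQL SELECT with a LIMIT against C*:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of manual LIMIT-based paging, the pattern being replaced: fetch up to
// `limit` rows, then reissue the query starting after the last row seen.
public class ManualLimitPaging {
    // Stand-in for something like "SELECT ... WHERE seqNr > ? LIMIT ?" against
    // a table holding sequence numbers 1..totalRows (purely illustrative).
    static List<Long> queryPage(long lastSeen, int limit, long totalRows) {
        List<Long> rows = new ArrayList<>();
        for (long seqNr = lastSeen + 1; seqNr <= totalRows && rows.size() < limit; seqNr++)
            rows.add(seqNr);
        return rows;
    }

    static long countAllRows(int limit, long totalRows) {
        long count = 0, lastSeen = 0;
        while (true) {
            List<Long> page = queryPage(lastSeen, limit, totalRows);
            if (page.isEmpty()) break;            // previous page was the last one
            count += page.size();
            lastSeen = page.get(page.size() - 1); // restart after the last row seen
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countAllRows(100, 250)); // prints 250, fetched in 3 pages
    }
}
```

With the driver's fetch size doing the paging, this whole reissue-from-last-row loop disappears from application code.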


krasserm commented Aug 1, 2015

OK, sounds good.

@krasserm krasserm modified the milestone: 0.4 Sep 1, 2015
krasserm added a commit that referenced this issue Sep 3, 2015
[#70] Use cassandra paging to implement max result size
@chbatey chbatey closed this as completed Sep 3, 2015
jypma pushed a commit to jypma/akka-persistence-cassandra that referenced this issue Sep 10, 2015