
Maximum query array size of client.executeBatch #68

Open

webcc opened this issue Mar 8, 2014 · 7 comments

@webcc

webcc commented Mar 8, 2014

Hi,

We would like to know the maximum query array size that can be passed to client.executeBatch(). We think it would be a good idea to document it, because we are running into problems with sizes larger than about 4,000.

We are working around this for the moment by splicing the array into smaller chunks.
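
For reference, a minimal sketch of what that splicing workaround might look like, assuming the executeBatch(queries, consistency, options, callback) signature quoted later in this thread; the chunk size and the helper name executeBatchInChunks are arbitrary illustrations:

// Hypothetical helper: split the queries array into chunks and send one batch per chunk.
// client and consistency are assumed to be set up already; note that splice mutates the array.
function executeBatchInChunks(queries, chunkSize, callback) {
  if (queries.length === 0) {
    return callback();
  }
  // Remove the first chunkSize items and send them as one batch
  var chunk = queries.splice(0, chunkSize);
  client.executeBatch(chunk, consistency, {atomic: false}, function (err) {
    if (err) {
      return callback(err);
    }
    executeBatchInChunks(queries, chunkSize, callback);
  });
}

executeBatchInChunks(allQueries, 5000, function (err) {
  if (err) {
    console.error(err);
  }
});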

@jorgebay
Owner

As far as I know, there is no limit on batch size at the protocol level or in CQL.

Are the INSERT / UPDATE queries for the same partition key?
If not, I don't think it is a good idea to use atomic batches across a large number of partitions...

Another consideration: to batch a large number of queries, you have to build all of those queries and parameters in memory, and then send all of that data over the wire "serially"...

@webcc
Author

webcc commented Mar 12, 2014

There seems to be a limit. To give you an idea of the issue, we are sending around 150,000 INSERTs for a given primary key. That generates an exception in the FrameWriter. If we splice the array of queries into chunks of 5,000 items or fewer, the problem disappears.

Could you tell us what the options in queryFlag do? In particular, we would like to know what the pageSize property does. It seems to influence query performance when we set it in the driver configuration.

And by the way, many thanks for this excellent piece of software.

@jorgebay
Owner

Thanks!

queryFlag does not affect the batch in any way. pageSize is only used by Cassandra for SELECT queries and is ignored for everything else.

I still think it is not a good idea to batch such a large number of queries: if each query takes on average around 50 bytes (depending on the size of the query and its parameters), 150,000 queries add up to more than 7 MB of data in memory, which is then transferred over the wire.
Is there a reason to do such large operations?
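
As a rough back-of-the-envelope check of that figure (the 50-byte average is the assumption stated above):

// Back-of-the-envelope estimate, assuming ~50 bytes per query plus parameters
var bytesPerQuery = 50;
var queryCount = 150000;
var totalBytes = bytesPerQuery * queryCount;            // 7,500,000 bytes
console.log((totalBytes / (1024 * 1024)).toFixed(1));   // ~7.2 (MB), i.e. "more than 7 MB"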

Also, if possible, use non-atomic batches (atomic batches have a performance impact):

client.executeBatch(queries, consistency, {atomic: false}, callback);

If you are getting an error from the FrameWriter, please post it.

@darthcav

Hi Jorge,

Could you briefly explain what the option {atomic: false} (versus {atomic: true}) actually does?

@jorgebay
Owner

It's atomic in database terms: if any part of the batch succeeds, all of it does.

More info: Atomic batches in Cassandra
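
For illustration, the two variants side by side (a sketch using the executeBatch signature shown earlier in this thread; the semantics are as described above):

// Atomic batch: if any part of the batch succeeds, all of it does.
client.executeBatch(queries, consistency, {atomic: true}, callback);

// Non-atomic batch: no all-or-nothing guarantee, but avoids the performance
// impact of atomic batches mentioned earlier in the thread.
client.executeBatch(queries, consistency, {atomic: false}, callback);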

@dsimmons

I'm seeing the same thing. It took me a while to track it down because only a few of my INSERTs were failing with the exception TypeError: value is out of bounds. At first I thought it was due to incorrect type coercion of very long IDs (Twitter IDs, 64-bit ints that I'm storing as strings).

The problem stems from the following code:

FrameWriter.prototype.writeShort = function(num) {
  var buf = new Buffer(2);
  buf.writeUInt16BE(num, 0);
  this.buffers.push(buf);
};

The parameter num in one of my failure cases, for example, is 197136. I looked up writeUInt16BE in the Node docs, and some simple math tells me that 197136 is well outside the 2^16 possible values.
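
A minimal reproduction of that bounds check, outside the driver (assuming Node's Buffer API; older Node versions throw TypeError: value is out of bounds, newer ones throw a RangeError):

var buf = new Buffer(2);        // Buffer.alloc(2) on current Node versions
buf.writeUInt16BE(65535, 0);    // fine: 65535 is the largest value that fits in 16 bits
buf.writeUInt16BE(197136, 0);   // throws: 197136 does not fit in an unsigned 16-bit integer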

Now, with no knowledge of the underlying Cassandra wire protocol, my question is: is it possible to step up this value to perhaps 2^32? I realize that batches of this size are probably recommended against, but for these particular transactions, I need them to be that big to remain atomic. This particular insert is around 12MB uncompressed as JSON.

@adam-roth

@jorgebay - Wouldn't it be more descriptive to say that if any part of the batch fails, the entire batch fails? I suppose the two are equivalent, but typically "what happens when something fails?" is the main concern.
