Compatibility using Microsoft.Azure.CosmosDB.BulkExecutor with Microsoft.Azure.Documents.Client #605
Comments
We are working on adding batch support into the SDK directly to avoid these issues in the future. Take a look at #584.
This PR adds bulk stream support.
@j82w Thank you sir! Has it been merged yet? Is there any documentation that we can refer to?
It has not been merged yet. Right now the plan is to do a preview release with batch and bulk. I don't have a time frame for when it will be available right now. This issue will get updated once it is available.
@j82w Hi, has the batch operation functionality been merged into the v3 Cosmos client library yet?
@kevinding0218 Batch is available in the 3.2.0-preview nuget. It's in preview, so we are open to feedback. Please try it out.
@j82w, thank you so much for your reply! We're excited to start playing with it now!
@j82w, sorry to bother you, but in the 3.2.0-preview branch we were having some trouble finding where the bulk executor API is located. We looked through the PRs and found #585, but it seems the bulk executor code was removed from that branch... Is there any sample or API code that we can refer to?
@kevinding0218 I think there is some confusion.
@j82w, thank you for your clarification! I see: we're looking for operations like bulk insert or bulk update (close to the bulk executor library in v2), so I guess it might be part of your No. 2 feature, right? If so, please feel free to let me know, and we'll wait until next week to see if we can use it.
Here is a sample of how the batch API looks: https://github.com/Azure/azure-cosmos-dotnet-v3/blob/master/Microsoft.Azure.Cosmos/src/Resource/Container/Container.cs#L1107
@kevinding0218 what is your scenario for needing bulk support? Is there anything preventing you from using the Batch API?
@kevinding0218 If you are doing bulk insert or bulk update on a known Partition Key, then the Batch API should be very similar.
@j82w, our scenario is to insert/update a batch of objects in the collection in a single operation. Yes, we did find the Batch API in the master branch; however, the 3.2.0-preview branch seems to have removed the Batch API, so we're not sure if this feature will continue in the future or be rewritten for some reason...
@j82w Thank you for sharing the sample. We've gone through the sample code and had a couple of points of confusion; we're not sure if the example shown is what we're looking for, i.e. performing bulk insert/update/upsert for a subset of items.
The CreateItemStream, ReplaceItemStream, and UpsertItemStream methods each appear to take the stream of a single item. If these three methods can deal with a subset of items, how would we define the streamPayload?
@j82w, hello, not sure if you've had a chance to look at our question regarding the bulk API for inserting/updating a batch of items with Cosmos DB in the v3 SDK. Would you happen to know if there is any code sample that we could refer to? Thank you very much!
@j82w, we've noticed that in ItemManagement there was a comment like "For insert operations where you are creating many items we recommend using a Stored Procedure and pass batches of new items to this sproc." and there is a "BulkImport.js" sample that uses a stored procedure. Is there any C# API that performs similar functionality? And considering the performance impact, which one would be better for creating/updating many items, the stored procedure or the C# bulk API?
@abhijitpai or @ealsur can you answer the question?
Code samples for the Batch API are an item being tracked in #685. Since the API is still in preview, there is no documentation out yet, but the code has samples: https://github.com/Azure/azure-cosmos-dotnet-v3/blob/master/Microsoft.Azure.Cosmos/src/Resource/Container/Container.cs#L1106 Regarding BulkImport.js, I believe that was a sample scenario on how to deal with multiple item creations before the introduction of Batch. That sample will be replaced when the new samples for Batch come. It is still a valid scenario, but the Batch API would be better.
@kevinding0218 Is your goal to atomically insert/update a set of documents in a partition key? Or is it to get high throughput / low overall latency to ingest a large amount of data into Cosmos DB? The former use case is met by the Batch API which is in 3.2-preview, and the latter by the upcoming bulk functionality. Assuming the former, the stream methods within the batch API are just mechanisms to provide your JSON documents in a serialized manner as opposed to typed objects. Each of the Create/Replace/UpsertItemStream method calls takes a stream which, when read, provides the UTF-8 serialized version of one document. The methods are documented here. Upon executing the batch via ExecuteAsync, the batch of operations is executed in a performant manner.
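To make the stream mechanics concrete, here is a minimal sketch using the preview names mentioned in this thread (CreateBatch, CreateItemStream, UpsertItemStream, ExecuteAsync); the ToStream helper and the document/container variables are illustrative, not part of the SDK:

```csharp
using System.IO;
using System.Text;
using Microsoft.Azure.Cosmos;
using Newtonsoft.Json;

// Illustrative helper: wrap one document as a UTF-8 JSON stream.
static Stream ToStream<T>(T item) =>
    new MemoryStream(Encoding.UTF8.GetBytes(JsonConvert.SerializeObject(item)));

// Each *ItemStream call carries exactly one serialized document; ExecuteAsync then
// runs all queued operations as a single batch against the given partition key.
var batch = container.CreateBatch(new PartitionKey("pk-value"))
    .CreateItemStream(ToStream(newDoc))
    .UpsertItemStream(ToStream(existingDoc));

using (var response = await batch.ExecuteAsync())
{
    // response.IsSuccessStatusCode indicates whether the whole batch committed.
}
```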
Hello @ealsur and @abhijitpai, thank you both for your replies! Actually, we've gone over the code sample in #685 and the method descriptions of the batch API in the Container example before, but it doesn't seem to match what we need. Our goal is to atomically insert/update a set of documents (could be 1,000 items as a batch size) that might not all share the same partition key, but we can split the group into sub-groups with the same partition key and then work at the sub-group level, which shouldn't be a problem. So let's assume our use case is the former one, covered by the Batch API in 3.2-preview. However, as per my previous comment and @abhijitpai's comment, here is why we don't think it meets our goal:
That's where we got confused: we thought the Bulk API should be used to deal with a subset of documents (could be 1,000 items as a batch size) at one time, not one single document at a time. In addition, we're unable to find an example of how to define the streamPayload as a subset of documents, and according to @abhijitpai's comment that it is "a stream which when read provides the UTF-8 serialized version of one document", we don't think this API is what we're looking for in order to meet our requirement. Please feel free to correct me if I am wrong... Here is our pseudo code for what we're expecting:
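Roughly, a hypothetical sketch of the shape we mean (UpsertItemsAsync and MyDocument are illustrative names, not real SDK members):

```csharp
// Hypothetical (not an actual SDK method): one call that takes a whole subset of
// items sharing a partition key and persists them in a single operation.
List<MyDocument> subset = GetNextSubset(); // e.g. up to 1,000 items with the same partition key
await container.UpsertItemsAsync(subset, new PartitionKey(partitionKeyValue));
```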
And this is NOT what we're expecting:
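Sketched with the existing per-item v3 calls (MyDocument is again an illustrative type):

```csharp
// One round trip per document: this is the pattern we are trying to avoid.
foreach (MyDocument doc in subset)
{
    await container.UpsertItemAsync(doc, new PartitionKey(partitionKeyValue));
}
```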
Let's say you have 1000 items, and each item has a partition key value.
You can group your operations per partition key value and do a batch on that.
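A minimal sketch of that grouping, assuming the preview batch names used in this thread (CreateBatch, UpsertItem, ExecuteAsync) and an illustrative Pk property holding each item's partition key value:

```csharp
using System.Linq;
using Microsoft.Azure.Cosmos;

// Group the 1000 items by their partition key value, then run one batch per group.
foreach (var group in items.GroupBy(i => i.Pk))
{
    var batch = container.CreateBatch(new PartitionKey(group.Key));
    foreach (var item in group)
    {
        batch.UpsertItem(item);
    }

    using (var response = await batch.ExecuteAsync())
    {
        if (!response.IsSuccessStatusCode)
        {
            // Each batch is all-or-nothing; handle the failed group here.
        }
    }
}
```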
@ealsur Thank you a ton for the code sample, that makes a lot of sense now and we'll try it out!
Please keep us informed when cross-partition-key bulk execution is available in a preview version.
@SteffenMangold see #741 to follow up
Hello @ealsur / @abhijitpai, we've tried the batch API and it works pretty well! As we dig more into it and try to process a large data set, we've encountered exceptions like "This batch request cannot be executed as it is larger than the allowed limit. Please reduce the number of operations in the batch and try again." I've attached a screenshot of our log output. A few things we've noticed here:
Thank you very much for your help!
Kevin, what you want is not the Batch API but the bulk stream support being introduced in #741, which is the incorporation of the bulk executor library functionality into our Cosmos SDK. Using that will ensure you don't need to handle request splitting, and it also avoids the atomicity that the Batch API gives, which you do not need.
@abhijitpai Should the exception @kevinding0218 reported be a CosmosException?
@abhijitpai, sure thing, let me try that out as well!
@ealsur Also, we did some more testing on how to handle the current batch request limitation, since the bulk stream support has not come out yet. We're a bit confused: we thought the max request data size is 2 MB as discussed, so given 2,500 items where each item is 1 KB, we batched them into two subsets (1,500 and 1,000). However, when we tried to batch 1,500 items, the exception still showed up every time. We even tried batching them into 800 per subset, but it still failed. Please feel free to see our attached screenshot. Finally, when we changed our batch size to 100, it worked... You can see our total RU spend for upserting the 2,500 items there. It doesn't seem that the max limit is 2 MB; it could be much smaller.
@kevinding0218 Right now, the expectation is that the full batch call from CreateBatch onwards needs to be within the retry, as we empty the batch once ExecuteAsync is run on it so that you can use the Batch object to add and run more operations; we can look at changing this behavior if it is not intuitive. With respect to the request limits, there are two limits on batch requests: one is the size (max 2 MB), and the other is the operation count (max 100).
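A sketch of staying under those limits by chunking a same-partition-key set into at most 100 operations per batch (variable names are illustrative; the retry placement follows the description above):

```csharp
using System.Linq;
using Microsoft.Azure.Cosmos;

const int maxOperationsPerBatch = 100; // operation-count limit described above

for (int offset = 0; offset < sameKeyItems.Count; offset += maxOperationsPerBatch)
{
    // Rebuild from CreateBatch onwards on every attempt, since the batch is emptied after ExecuteAsync.
    var batch = container.CreateBatch(new PartitionKey(pkValue));
    foreach (var item in sameKeyItems.Skip(offset).Take(maxOperationsPerBatch))
    {
        batch.UpsertItem(item);
    }

    using (var response = await batch.ExecuteAsync())
    {
        if (!response.IsSuccessStatusCode)
        {
            // Retry this chunk (starting again from CreateBatch) or surface the error.
        }
    }
}
```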
@abhijitpai Thank you for making the request limits clearer! I guess for now we'll have to cap the operation count at 100; we're looking forward to the new bulk stream support. Please feel free to let us know once it's released, thank you!
@kevinding0218 Please do not use this issue to discuss other topics. Yes, the SDK retries on throttles (429) automatically, and this can be customized based on that attribute in the configuration.
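For reference, a minimal sketch of customizing those throttle retries on the v3 client (the property names are from CosmosClientOptions; the values are only examples):

```csharp
using System;
using Microsoft.Azure.Cosmos;

var client = new CosmosClient(endpoint, authKey, new CosmosClientOptions
{
    // Automatic retries on 429 (throttling); both values are illustrative.
    MaxRetryAttemptsOnRateLimitedRequests = 9,
    MaxRetryWaitTimeOnRateLimitedRequests = TimeSpan.FromSeconds(30)
});
```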
Hello, just curious if PR #741 has been fully tested and merged into any recent stable branch? Currently, for the Batch API, we're still restricted to a batch size of 100, but we're looking forward to any progress made now or in the future. Thank you very much for your help!
Batch and Bulk are still in preview; they are both merged in master, but only available on the preview packages: https://www.nuget.org/packages/Microsoft.Azure.Cosmos/3.2.0-preview2
@ealsur Thank you, Matias, I will test it out!
@ealsur, hello, it looks like the batch size limit is still 100? 100 is really too narrow for processing our data: on the other side, the CosmosTrigger also fires when 100 items are updated, so for the 5,000 items that we're processing, it splits into 50 batches and our Cosmos-triggered Function is also triggered 50 times... Is there any plan to increase the limit?
@abhijitpai can probably answer Batch related questions and that max. Bulk (non-transactional) does not have a limit: https://github.com/Azure/azure-cosmos-dotnet-v3/tree/master/Microsoft.Azure.Cosmos.Samples/Usage/BulkSupport, but if you need transactional support, you will need to use Batch.
@ealsur Thank you for your reply! We did notice the "bulk" (non-transactional) mode in the sample code; however, that seems to run parallel tasks where each task performs one insert/update action. So, considering the performance aspect of bulk vs. batch, which one would be better to use for a large amount of data (~5,000 docs)? @abhijitpai, is there any drawback to allowing a larger batch size for batch operations? The previous BulkExecutor in v2 did not seem to have this limitation.
@kevinding0218 my apologies, we don't have official documentation yet because we are still in preview, but Bulk works through the AllowBulkExecution flag on CosmosClientOptions. When that flag is set to true, the SDK groups the concurrent point operations you issue into fewer backend requests. Regarding your question about the size: the Bulk Executor V2 did not have that limit because what it did was take your 5000 operations and internally split them into smaller groups, and those were the ones that executed. Batch is more straightforward, since the goal is to be a single all-or-nothing transaction.
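A minimal sketch of that mode, assuming an illustrative MyItem type with a Pk partition key property:

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

var client = new CosmosClient(endpoint, authKey,
    new CosmosClientOptions { AllowBulkExecution = true });
var container = client.GetContainer("db", "container");

// Issue the operations concurrently; with AllowBulkExecution the SDK groups the
// pending point operations into fewer backend requests per partition.
var tasks = new List<Task>();
foreach (MyItem item in items)
{
    tasks.Add(container.UpsertItemAsync(item, new PartitionKey(item.Pk)));
}
await Task.WhenAll(tasks);
```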
Hi @ealsur, should I have one instance of DocumentClient and share it between all BulkExecutor instances in a multi-threaded application, or have one instance of DocumentClient per BulkExecutor? Thanks in advance!
@IvanMijailovic Yes, you certainly can share one instance of DocumentClient across multiple BulkExecutors.
@ealsur, thank you for your input! We've tried the bulk operation by setting the "AllowBulkExecution" flag to true to insert 200 items. It looks like it groups the 200 items into tasks of 5 items per group, because in our CosmosDBTrigger Functions we always receive/trigger on 5 items at a time, so for the 200 items in total we received/triggered exactly 40 times. Is this bulk group size of 5 some kind of default value? Could we increase the number here? However, our initial idea is to process as many documents as possible in one operation (either bulk or batch); the batch maximum of 100, or bulk grouping by 5 as of now, might not fit our process, as we normally have to process thousands of items at a time, and splitting them into small chunks might create more latency over the entire process.
@kevinding0218 The grouping is based on partitions, just like the Bulk Executor. Are you creating all 200 Tasks in a list and each Task is a CreateItem call?
@ealsur You're right. We're actually just following the sample code by creating 200 tasks in a list, where each is an UpsertItemAsync call. All of our 200 items have one single common partition key, just like the sample code. If the grouping is based on partitions, our 200 items should be considered as only one group, correct?
Yes. How do you know that the Bulk is doing batches of 5? Based on what you see on the Change Feed Triggers? Bulk is not transactional, so it's hard to correlate your Change Feed triggers with the backend requests coming from Bulk. Are you setting the AllowBulkExecution flag in the CosmosClientOptions before creating the client or switching it after? If you are capturing the SDK logs, it should show the Bulk traces.
@ealsur We're using a CosmosDBTrigger Function here, so we set the AllowBulkExecution flag in the options before creating the CosmosClient in Startup.cs. You're right that we're monitoring the input count from the Change Feed trigger, where we saw the incoming documents arrive 5 at a time, so we assumed the bulk operation was grouping into sets of 5 as well. Based on your explanation, this seems more related to the Change Feed on the Azure Function side. Is there any way to increase the Change Feed batch size there (as far as I remember, it is set automatically for the CosmosDBTrigger Function) so we can catch as many input documents as possible from the bulk operation?
@ealsur Should we create a DocumentBulkExecutor instance using a builder for each delete/import/update request and close it when the request is done? Or create one instance of DocumentBulkExecutor when the application starts and close it on application shutdown? Thanks!
Treat the BulkExecutor instance as a singleton, as per https://docs.microsoft.com/en-us/azure/cosmos-db/bulk-executor-dot-net#performance-tips
@ealsur Do you suggest having two instances of DocumentClient, using one for CRUD collection operations and another for the initialization of BulkExecutor and bulk operations on a collection? Or using one instance of DocumentClient for both?
You can certainly reuse the same instance. DocumentClient is used to execute the Bulk operations; they are normal service requests, so the DocumentClient instance you use to do normal CRUD will do.
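A minimal sketch of that reuse with the v2 library (endpoint, key, and names are placeholders; the calls follow the Microsoft.Azure.CosmosDB.BulkExecutor surface):

```csharp
using System;
using Microsoft.Azure.CosmosDB.BulkExecutor;
using Microsoft.Azure.Documents;
using Microsoft.Azure.Documents.Client;

// One shared DocumentClient for both normal CRUD and bulk operations.
var client = new DocumentClient(new Uri(endpoint), authKey,
    new ConnectionPolicy { ConnectionMode = ConnectionMode.Direct, ConnectionProtocol = Protocol.Tcp });

DocumentCollection collection = (await client.ReadDocumentCollectionAsync(
    UriFactory.CreateDocumentCollectionUri("db", "collection"))).Resource;

// Initialize once and keep the executor as a singleton for the application's lifetime.
IBulkExecutor bulkExecutor = new BulkExecutor(client, collection);
await bulkExecutor.InitializeAsync();

// Bulk import goes through the same DocumentClient; enableUpsert turns inserts into upserts.
var importResponse = await bulkExecutor.BulkImportAsync(documents, enableUpsert: true);
```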
Hi @ealsur, I have a problem. We had BulkExecutor in our implementation, and then I wanted to upgrade to the new SDK. However, I noticed the time to bulk-upsert with BulkExecutor (1000 documents to one partition key) is less than with the new SDK and the worker approach you define in your blog. Thanks.
The new SDK replaces BulkExecutor as you can see here: https://docs.microsoft.com/en-gb/azure/cosmos-db/tutorial-sql-api-dotnet-bulk-import Is your BulkExecutor implementation way faster than your implementation with the new SDK?
@AntonioJDios - which one of these is your real use case: atomically committing a set of documents within a partition key, or ingesting a large amount of data into Cosmos DB with high throughput / low latency?
@alexmartinezm yes, with BulkExecutor upserting 1300 documents takes around 40 seconds, while with the new SDK it takes around 2 minutes. @abhijitpai The scenario is the following: 1300 documents all go to just one partition key, because the front end then fetches the complete partition key directly (with pagination). So we tried to optimize the query for the front end, and we don't mind the writes being slower than the reads. However, the new SDK is much slower than the previous one. Do you have any idea why?
@AntonioJDios what's the upsert document size?
@AntonioJDios Could you open an Issue with the code snippet of how you are inserting the documents when it's taking 2 minutes? This thread is about a different issue. Closing because the original ask has been solved long ago.
We're having some trouble and a lot of confusion about choosing the correct version of BulkExecutor with the .NET Cosmos SDK. Currently we're building two class library projects using different versions of the Cosmos DB SDK:
The first class library project uses .NET Core with Microsoft.Azure.Cosmos 3.0.0, which is the current one; we use this for reading an item from a Cosmos DB collection.
The second class library project uses .NET Framework 4.6.1/4.7.1 with Microsoft.Azure.Documents.Client, which is the v2 package; we use this for bulk insert or bulk update of item collections with Cosmos DB.
However, when our v3 project adds a reference to the v2 project, we're not able to initialize the connection to Cosmos DB using Microsoft.Azure.Documents.Client. I wonder if it's an issue with different versions of the Cosmos DB SDK being used here? Could we use a combination of v2 and v3 in the same solution?