
Error Occurred in Bulk Operation: Partition key error. Either the partitions have split or an operation has an unsupported partitionKey type #20162

Open
developer-ankursingh opened this issue Feb 1, 2022 · 12 comments
Labels: bug, Client, Cosmos, customer-reported, needs-team-attention, Service Attention

Comments


developer-ankursingh commented Feb 1, 2022

Azure dependencies in use:
"@azure/cosmos": "^3.15.1"
"@azure/identity": "^2.0.1"

Describe the bug
We have been using the Azure Cosmos library together with the Azure Identity library (RBAC) to perform bulk delete operations. Until two days ago this worked fine and we were able to delete multiple records from Cosmos DB, but now we get the error above every time we try to delete even a single record. The container we are deleting from currently holds around 2.3 million records. We have verified multiple times that we pass correct data to the bulk delete API (unique id, partition key, and operation type Delete), yet we still hit the error. The same bulk delete code works fine against other containers that hold far fewer records.
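
For reference, a minimal sketch of how the client and the bulk delete call are wired up on our side (the endpoint, database, container, and item values below are placeholders, not our real ones):

```js
// Minimal sketch; endpoint, database, container, and item values are placeholders.
const { CosmosClient } = require("@azure/cosmos");
const { DefaultAzureCredential } = require("@azure/identity");

async function bulkDeleteSample() {
  // RBAC via Azure AD credentials instead of account keys.
  const client = new CosmosClient({
    endpoint: "https://<account>.documents.azure.com:443/",
    aadCredentials: new DefaultAzureCredential(),
  });
  const container = client.database("<database-id>").container("<container-id>");

  // Each operation carries the item id, its partition key value, and the operation type.
  const operations = [
    { operationType: "Delete", id: "<item-id>", partitionKey: "<partition-key-value>" },
  ];

  // On the affected container this call fails with "Partition key error. Either the
  // partitions have split or an operation has an unsupported partitionKey type".
  return container.items.bulk(operations, { continueOnError: true });
}
```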

To Reproduce
Steps to reproduce the behavior:
1.

Expected behavior
We should be able to perform deletions successfully no matter how many records the container holds.

Screenshots
Added screenshot to help explain our problem.
Error-screenshot

```js
// Snippet: `i`, `data`, `operations`, `throttledCount`, and `createdContainer`
// are initialized earlier; we walk `data` from the front, batching deletes in
// groups of 100.
while (i) {
    var pos = data.length - i;
    var superItem = data[pos];
    operations.push({
        operationType: "Delete",
        id: superItem.id,
        partitionKey: superItem.partitionKey
    });

    // Flush the batch every 100 items, and on the final item.
    if (i % 100 === 0 || i === 1) {
        try {
            var response = await createdContainer.items.bulk(operations, { continueOnError: true });
            // If any operation was throttled (429), resend the whole batch until none are.
            throttledCount = response.filter((r) => r.statusCode === 429).length;
            while (throttledCount !== 0) {
                response = await createdContainer.items.bulk(operations, { continueOnError: true });
                throttledCount = response.filter((r) => r.statusCode === 429).length;
            }
        } catch (error) {
            console.log("Error Occurred in Bulk Operation" + error);
        }

        operations = [];
        console.log('Batch of up to 100 records deleted');
    }
    i--;
}
```
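
The retry above resends the whole batch whenever any operation comes back throttled. A possible refinement (a sketch only, not our production code, and it assumes the response array lines up index-for-index with the input operations) would be to retry just the throttled operations with a small back-off:

```js
// Sketch only: retry just the 429s, assuming responses align with the input order.
async function bulkDeleteWithRetry(container, operations, maxAttempts = 5) {
  let pending = operations;
  for (let attempt = 0; attempt < maxAttempts && pending.length > 0; attempt++) {
    const responses = await container.items.bulk(pending, { continueOnError: true });
    // Keep only the operations whose responses were throttled and try them again.
    pending = pending.filter((_, index) => responses[index].statusCode === 429);
    if (pending.length > 0) {
      // Simple fixed back-off before the next attempt.
      await new Promise((resolve) => setTimeout(resolve, 1000));
    }
  }
  return pending; // operations still throttled after maxAttempts
}
```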

Additional context
As a validation check, we tried deleting records from other containers in our DEV and SIT environments, and those records were deleted successfully. We analyzed the issue and could not find much online apart from the links below.

#18682

"Partition key error. Either the partitions have split or an operation has an unsupported partitionKey type"

From those links we can see the code in the Azure Cosmos Node.js library, and the exception is thrown by the library itself based on an internal min/max range check for the partition key. There is not much more we can do on our side to debug this, so we are raising this issue for further help.
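
As a stop-gap, the only thing we can think of on our side is to fall back to per-item deletes when bulk fails with this error, roughly like the sketch below (point deletes do not go through the bulk helper that raises this exception, at the cost of one request per item):

```js
// Hypothetical fallback, not part of our current code: if bulk fails with the
// partition key error, delete the affected items one at a time instead.
async function deleteWithFallback(container, operations) {
  try {
    return await container.items.bulk(operations, { continueOnError: true });
  } catch (error) {
    if (!String(error).includes("Partition key error")) {
      throw error; // unrelated failure, surface it
    }
    // Point deletes: one request per item, routed by the supplied partition key.
    for (const op of operations) {
      await container.item(op.id, op.partitionKey).delete();
    }
  }
}
```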

@ghost ghost added needs-triage Workflow: This is a new issue that needs to be triaged to the appropriate team. customer-reported Issues that are reported by GitHub users external to the Azure organization. question The issue doesn't require a change to the product in order to be resolved. Most issues start as that labels Feb 1, 2022
developer-ankursingh (Author) commented:

FYI, we are able to perform the bulk update operation successfully, but the bulk delete operation throws the error for the same container.
I have attached debug logs & screenshots for both Update & Delete operations for your reference.

Update_API_Azure_Cosmos_NodeJS
Bulk_Delete_API_Azure_Cosmos_NodeJS_part_1
Bulk_Delete_API_Azure_Cosmos_NodeJS_part_2

@ramya-rao-a ramya-rao-a added Client This issue points to a problem in the data-plane of the library. Cosmos labels Feb 4, 2022
@ghost ghost removed the needs-triage Workflow: This is a new issue that needs to be triaged to the appropriate team. label Feb 4, 2022
developer-ankursingh (Author) commented:

Attached logs as well:
bulk_delete_error_logs.txt

kushagraThapar (Member) commented:

@developer-ankursingh - we are investigating this issue. @sajeetharan, can you please look into this?

@sajeetharan sajeetharan moved this from Todo to In Progress in @azure/cosmos Project Feb 8, 2022
developer-ankursingh (Author) commented:

Thanks for the update @sajeetharan. We can also have a session with your team to demo the issue if possible. Please let us know, since it is impacting our project development.


sajeetharan commented Feb 9, 2022

@developer-ankursingh From the image it looks like this happens in the staging environment, right? We checked the backend logs and can see the 410 errors on this account on 02/02/22 as well; we just wanted to confirm whether this happens only for deletion. We'll schedule a call once we have analyzed the code and checked with the backend team.


developer-ankursingh commented Feb 9, 2022

@sajeetharan It is happening in SIT & STAGING but not in DEV. To confirm: this issue pops up only for the DELETE operation, and only for the SIT & STAGING containers; the same code works fine for other containers in all environments. It would be really helpful if we could have a call to walk you through the code.


sajeetharan commented Feb 11, 2022

@developer-ankursingh Sorry for the delayed response. We were able to identify your requests in the backend, and it looks like the issue is in the SDK, related to the cache that maintains the partition key ranges. We will work on a fix soon.

developer-ankursingh (Author) commented:

@sajeetharan Thank you for the update! Let us know when you push out the fix so that we can validate and confirm.

@jay-most jay-most moved this from In Progress to Ready to start in @azure/cosmos Project Mar 9, 2022
@jay-most jay-most moved this from Ready to start to Spikes in @azure/cosmos Project Mar 16, 2022
@jay-most jay-most assigned JericHunter and unassigned sajeetharan Mar 16, 2022
@jay-most jay-most moved this from Spikes to Ready to start in @azure/cosmos Project Mar 16, 2022
@jay-most jay-most moved this from Ready to start to Business requirements in @azure/cosmos Project Mar 16, 2022

sajeetharan commented Mar 21, 2022

Suggestions to resolve the issue:

  1. The SDK reads the entire partition key range map on every bulk operation. It is never cached, which is a big problem for large collections with many partitions: loading the ranges can easily become more expensive than the bulk operation itself. The fix is to cache the ranges on the container object so they are only loaded once.

  2. The SDK explicitly throws an error on bulk operations during a partition split. It needs to handle the 410 gracefully by retrying the failed operations after invalidating the cached partition map, but that depends on fixing (1) first.

An even better option would be to fix this in the gateway: when the backend bubbles up a 410 to the gateway during a bulk operation, the gateway could refresh its own internal partition map and reroute to the correct partition. A sketch of the client-side approach is below.
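
For illustration, a rough sketch of what (1) and (2) could look like on the client side. `loadRanges` and `runBulk` below stand in for the SDK's internal range lookup and batch execution; they are not real @azure/cosmos APIs.

```js
// Sketch of (1): cache the partition key ranges per container so they are read
// from the service only once. `loadRanges` is a placeholder, not a real SDK call.
function createRangeCache(loadRanges) {
  const cache = new Map(); // container URL -> partition key ranges

  return {
    async get(container) {
      if (!cache.has(container.url)) {
        cache.set(container.url, await loadRanges(container));
      }
      return cache.get(container.url);
    },
    invalidate(container) {
      cache.delete(container.url); // drop the stale map after a split
    },
  };
}

// Sketch of (2): on a 410 (partition split), invalidate the cache and retry the
// operations against the refreshed ranges instead of surfacing the error.
async function bulkWithSplitRetry(container, operations, rangeCache, runBulk) {
  try {
    return await runBulk(container, operations, await rangeCache.get(container));
  } catch (error) {
    if (error.code !== 410) throw error;
    rangeCache.invalidate(container);
    return runBulk(container, operations, await rangeCache.get(container));
  }
}
```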

@jay-most @JericHunter

@jay-most jay-most assigned jay-most and unassigned JericHunter Jun 3, 2022

jay-most commented Jun 7, 2022

Thanks @sajeetharan! @developer-ankursingh reopen if needed.

@jay-most jay-most closed this as completed Jun 7, 2022
Repository owner moved this from Business requirements to Done in @azure/cosmos Project Jun 7, 2022
@sajeetharan sajeetharan reopened this Dec 6, 2022
@xirzec xirzec added bug This issue requires a change to an existing behavior in the product in order to be resolved. Service Attention Workflow: This issue is responsible by Azure service team. and removed question The issue doesn't require a change to the product in order to be resolved. Most issues start as that labels Mar 31, 2023
@ghost ghost added the needs-team-attention Workflow: This issue needs attention from Azure service team or SDK team label Mar 31, 2023

ghost commented Mar 31, 2023

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @bkolant-MSFT, @sajeetharan, @pjohari-ms.


@sajeetharan sajeetharan assigned v1k1 and topshot99 and unassigned jay-most Apr 4, 2023
@topshot99 topshot99 moved this to In Progress in @azure/cosmos Project Apr 25, 2023
@sajeetharan sajeetharan added this to the 2024-01 milestone Sep 26, 2023
@sajeetharan sajeetharan moved this from In Progress to Ready to start in @azure/cosmos Project Sep 26, 2023
github-actions bot commented Mar 13, 2024

Hi @developer-ankursingh, we deeply appreciate your input into this project. Regrettably, this issue has remained inactive for over 2 years, leading us to the decision to close it. We've implemented this policy to maintain the relevance of our issue queue and facilitate easier navigation for new contributors. If you still believe this topic requires attention, please feel free to create a new issue, referencing this one. Thank you for your understanding and ongoing support.

@github-actions github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) Mar 13, 2024
@github-actions github-actions bot locked and limited conversation to collaborators Mar 13, 2024
@sajeetharan sajeetharan reopened this Dec 17, 2024
@sajeetharan sajeetharan moved this to Ready to start in @azure/cosmos Project Dec 17, 2024
@sajeetharan sajeetharan modified the milestones: 2024-01, 4.0.0 GA Dec 17, 2024