Two share requests running in parallel can override each other #941
Comments
Hi @zsaltys, thanks for the issue. It is an important task and we will look into it. We can work together on the design in this issue.
Issue: Share processes executed in parallel on the SAME dataset override each other and cause sharing failures.
Design Alternatives
Other very-creative solutions:
[Update]: CDK constructs found for ECS-SQS: https://docs.aws.amazon.com/cdk/api/v2/python/aws_cdk.aws_ecs_patterns/QueueProcessingFargateService.html
I will dive deep into the most promising ideas. Let me know if you want to assess any other idea or if you can think of other viable alternatives.
A concurrency constraint on the ECS task group (one group per dataset) would be perfect, if only it existed... aws/containers-roadmap#232
We could also address this by introducing a locking mechanism for the resources.
Create a dedicated table in the RDS database to store information about locks. This table could have columns such as dataset_id and is_locked.
Implement a function to release the lock for a specific dataset.
Before processing a share request for a specific dataset, we attempt to acquire the lock. If the lock is successfully acquired, proceed with processing. If not, wait or handle accordingly.
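The steps above can be sketched roughly as follows. This is only an illustration, not the data.all implementation: it uses an in-memory SQLite database as a stand-in for the RDS database, the table name `dataset_lock` and the helper names (`create_lock_table`, `register_dataset`, `acquire_lock`, `release_lock`) are hypothetical, and `INSERT OR IGNORE` is SQLite syntax. The point it shows is that acquiring the lock should be a single conditional UPDATE, so the check and the write cannot be separated.

```python
import sqlite3


def create_lock_table(conn):
    # One row per dataset; is_locked is 0 (free) or 1 (held).
    conn.execute(
        "CREATE TABLE IF NOT EXISTS dataset_lock ("
        "dataset_id TEXT PRIMARY KEY, "
        "is_locked INTEGER NOT NULL DEFAULT 0)"
    )
    conn.commit()


def register_dataset(conn, dataset_id):
    # Make sure a lock row exists for the dataset (SQLite-specific syntax).
    conn.execute(
        "INSERT OR IGNORE INTO dataset_lock (dataset_id, is_locked) VALUES (?, 0)",
        (dataset_id,),
    )
    conn.commit()


def acquire_lock(conn, dataset_id):
    # Check-and-set in one statement: the UPDATE only matches a free lock,
    # so at most one caller can flip is_locked from 0 to 1.
    cur = conn.execute(
        "UPDATE dataset_lock SET is_locked = 1 "
        "WHERE dataset_id = ? AND is_locked = 0",
        (dataset_id,),
    )
    conn.commit()
    return cur.rowcount == 1


def release_lock(conn, dataset_id):
    # Mark the lock as free again once the share has been processed.
    conn.execute(
        "UPDATE dataset_lock SET is_locked = 0 WHERE dataset_id = ?",
        (dataset_id,),
    )
    conn.commit()
```

A share processor would call `acquire_lock` before touching the dataset and `release_lock` in a `finally` block afterwards, so the lock is freed even if the share fails.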
We can have this RDS table in the infra accounts. Solutions for handling tasks when the lock is already held:
Pros:
Cons
Pros:
Cons:
Let me know what you think of using this technique.
I'm in favor of the "Locking in RDS + check/wait in ECS" solution. Even though it's not generic and would only work for tasks running in ECS, it's also the simplest, and we know that this problem only exists for datasets at the moment, where we run manual SDK calls. I don't think it's a problem that ECS tasks can wait: the only concern there is cost, and this problem is such an edge case that the impact of the waiting is very small.

The "Locking in RDS + re-queue" solution is more flexible: there's no waiting and it would work for non-ECS tasks too. But it's also more complex, it needs new permissions, and there's the worry of what happens if a message cannot be re-queued, for example during an outage.

For the first solution we should just make sure that two tasks cannot both believe they hold the lock at the same time. I propose that we have a separate lock table with a field like resource_uri, which can be used for other resources, not just dataset URIs, if there are any in the future. This table can be reused by the Lambda approach if we decide to switch to it later on. You can also add a field like version to the locks table and use it to make sure two tasks cannot get the lock at the same time.
Agree with @zsaltys about using the "Locking in RDS + check/wait in ECS" solution. Although, I think we can use the same

For either "Locking in RDS + check/wait in ECS" or "Locking in RDS + re-queue", we will have to make sure that the reads (SELECTs) and the UPDATEs are row-locked at the Postgres level for the duration of the transaction, so that we don't end up with the same issue, in which two shares work on the same dataset in parallel. Ref - https://www.postgresql.org/docs/current/sql-select.html#SQL-FOR-UPDATE-SHARE
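A rough sketch of what row-locking the read could look like, assuming a psycopg2-style DB-API connection to Postgres. The function name `try_acquire_lock` and the column names `dataset_id`, `is_locked` and `acquired_by` are hypothetical; only `SELECT ... FOR UPDATE` itself comes from the Postgres docs linked above. The row lock is held until the transaction commits or rolls back, so a second task running the same SELECT blocks until the first one is done and then sees the updated `is_locked` value.

```python
def try_acquire_lock(conn, dataset_id, share_uri):
    """Try to take the lock for dataset_id inside one transaction.

    conn is assumed to be a psycopg2-style connection (not in autocommit mode).
    """
    with conn.cursor() as cur:
        # FOR UPDATE row-locks the matching row until commit/rollback,
        # so no other transaction can read-then-update it in between.
        cur.execute(
            "SELECT is_locked FROM dataset_lock WHERE dataset_id = %s FOR UPDATE",
            (dataset_id,),
        )
        row = cur.fetchone()
        if row is None or row[0]:
            # No lock row, or the lock is already held: end the transaction.
            conn.rollback()
            return False
        cur.execute(
            "UPDATE dataset_lock SET is_locked = TRUE, acquired_by = %s "
            "WHERE dataset_id = %s",
            (share_uri, dataset_id),
        )
    conn.commit()
    return True
```

If waiting on the row lock is not desired, Postgres also accepts `FOR UPDATE NOWAIT` and `FOR UPDATE SKIP LOCKED`, which fail or skip instead of blocking.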
### Feature or Bugfix
- Feature

### Detail
Share requests running in parallel override one another.

### Relates
- #941

### Testing
- 2 simultaneous shares got processed successfully with locking mechanism.
- Here is what the dataset_lock DB looks like and the values in it.
<img width="1386" alt="Screenshot 2024-02-14 at 5 12 54 PM" src="https://github.com/data-dot-all/dataall/assets/26413731/9e36044f-4f3d-4371-a6f2-7dee399da7a7">
- acquiredBy column show the share which last acquired the lock for the particular dataset

### Security
Please answer the questions below briefly where applicable, or write `N/A`. Based on [OWASP 10](https://owasp.org/Top10/en/).

- Does this PR introduce or modify any input fields or queries - this includes fetching data from storage outside the application (e.g. a database, an S3 bucket)?
  - Is the input sanitized?
  - What precautions are you taking before deserializing the data you consume?
  - Is injection prevented by parametrizing queries?
  - Have you ensured no `eval` or similar functions are used?
- Does this PR introduce any functionality or component that requires authorization?
  - How have you ensured it respects the existing AuthN/AuthZ mechanisms?
  - Are you logging failed auth attempts?
- Are you using or adding any cryptographic features?
  - Do you use a standard proven implementations?
  - Are the used keys controlled by the customer? Where are they stored?
- Are you introducing any new policies/roles/users?
  - Have you used the least-privilege principle? How?

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
---------
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Noah Paige <69586985+noah-paige@users.noreply.github.com>
Co-authored-by: dlpzx <71252798+dlpzx@users.noreply.github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: jaidisido <jaidisido@gmail.com>
Co-authored-by: dlpzx <dlpzx@amazon.com>
Co-authored-by: mourya-33 <134511711+mourya-33@users.noreply.github.com>
Co-authored-by: nikpodsh <124577300+nikpodsh@users.noreply.github.com>
Co-authored-by: MK <manjula_kasturi@hotmail.com>
Co-authored-by: Manjula <manjula.kasturi@gmail.com>
Co-authored-by: Zilvinas Saltys <zilvinas.saltys@gmail.com>
Co-authored-by: Zilvinas Saltys <zilvinas.saltys@yahooinc.com>
Co-authored-by: Daniel Lorch <98748454+lorchda@users.noreply.github.com>
Co-authored-by: Anushka Singh <anushka.singh@yahooinc.com>
Co-authored-by: Tejas Rajopadhye <71188245+TejasRGitHub@users.noreply.github.com>
Co-authored-by: trajopadhye <tejas.rajopadhye@yahooinc.com>
Describe the bug
We had a situation where a dataset owner was approving a lot of share requests very quickly. In CloudTrail we see that the pivotRole put a new bucket policy on the same bucket twice at the exact same time. Both requests ran in parallel, each figured out how the policy needed to be updated and tried to update it, and in the end one request overrode the other.
From my understanding, share requests run as ECS tasks. There is probably no easy way to ensure that only one such task runs at a time for a given dataset. Therefore we probably need to introduce locking into the sharing mechanism, so that a task has to wait until the others are done before it reads the existing bucket policy and makes changes.
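To make the failure mode concrete, here is a small simulation of the read-modify-write race described above. It does not call AWS: the bucket policy is a plain dict and the names (`add_share_statement`, `simulate_parallel`, `simulate_serialized`, the `share-a`/`share-b` Sids) are made up for illustration.

```python
import copy


def add_share_statement(policy, share_id):
    # Compute the new policy from whatever version of the policy was read.
    updated = copy.deepcopy(policy)
    updated["Statement"].append({"Sid": share_id})
    return updated


def simulate_parallel():
    bucket_policy = {"Statement": []}
    # Both tasks read the policy before either one has written.
    read_a = copy.deepcopy(bucket_policy)
    read_b = copy.deepcopy(bucket_policy)
    bucket_policy = add_share_statement(read_a, "share-a")  # first put
    bucket_policy = add_share_statement(read_b, "share-b")  # second put wins
    return [s["Sid"] for s in bucket_policy["Statement"]]


def simulate_serialized():
    bucket_policy = {"Statement": []}
    # With a lock, each task reads only after the previous one has written.
    for share_id in ("share-a", "share-b"):
        current = copy.deepcopy(bucket_policy)
        bucket_policy = add_share_statement(current, share_id)
    return [s["Sid"] for s in bucket_policy["Statement"]]


print(simulate_parallel())    # ['share-b']
print(simulate_serialized())  # ['share-a', 'share-b']
```

In the parallel case the statement for `share-a` is silently lost, which matches the symptom of a granted share whose permissions are missing from the bucket policy.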
How to Reproduce
It is not trivial to reproduce, but we've seen this happen multiple times already. With S3 bucket sharing, the result is that the access request is granted and the share works, but the bucket policy is missing permissions.
Expected behavior
Share requests running in parallel should never override one another.
Your project
No response
Screenshots
No response
OS
N/A
Python version
N/A
AWS data.all version
2.2
Additional context
No response