This repository has been archived by the owner on Jun 27, 2023. It is now read-only.

Fix race condition on childer. #124

Closed
ajnavarro wants to merge 1 commit

Conversation

ajnavarro (Member)

  • Use only one slice for shards and links.
  • Protect slice with an RWMutex to avoid race conditions (see the sketch below).
  • Remove all direct slice access from outside childer.
  • Add concurrency test.

Signed-off-by: Antonio Navarro Perez <antnavper@gmail.com>

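For illustration, here is a minimal sketch of the layout this PR describes: one slice holding either a cached child shard or its link, guarded by a sync.RWMutex. The names below are hypothetical and simplified, not the actual childer code in this repository.

```go
// Hypothetical sketch, not the real go-unixfs childer implementation.
package hamt

import "sync"

type Shard struct{} // placeholder for the real shard type
type Link struct{}  // placeholder for the real link type

type childOrLink struct {
	child *Shard // nil if only the link is known
	link  *Link  // nil once the child has been loaded
}

type childer struct {
	mu       sync.RWMutex
	children []*childOrLink // single slice for both shards and links
}

// get returns both pointers under a single read lock so callers never
// observe a half-updated entry.
func (c *childer) get(i int) (*Shard, *Link) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	e := c.children[i]
	if e == nil {
		return nil, nil
	}
	return e.child, e.link
}

// set replaces an entry under the write lock.
func (c *childer) set(i int, s *Shard, l *Link) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.children[i] = &childOrLink{child: s, link: l}
}
```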
ajnavarro requested a review from schomatis on July 25, 2022 at 13:55

welcome bot commented Jul 25, 2022

Thank you for submitting this PR!
A maintainer will be here shortly to review it.
We are super grateful, but we are also overloaded! Help us by making sure that:

  • The context for this PR is clear, with relevant discussion, decisions
    and stakeholders linked/mentioned.

  • Your contribution itself is clear (code comments, self-review for the
    rest) and in its best form. Follow the code contribution guidelines if
    they apply.

Getting other community members to do a review would be a great help too on complex PRs (you can ask in the chats/forums). If you are unsure about something, just leave us a comment.
Next steps:

  • A maintainer will triage and assign priority to this PR, commenting on
    any missing things and potentially assigning a reviewer for high
    priority items.

  • The PR gets reviewed, discussed, and approved as needed.

  • The PR is merged by maintainers when it has been approved and comments addressed.

We currently aim to provide initial feedback/triaging within two business days. Please keep an eye on any labelling actions, as these will indicate priorities and status of your contribution.
We are very grateful for your contribution!

schomatis (Contributor) left a comment


LGTM, a few notes:

  • We are conflating in the same commit/PR the actual fix (the lock) with the refactor into a new childOrLink structure (which isn't minor in line count nor in its impact on the patch).
  • I'm fine adding a lock provided we're confident that:
    • it doesn't affect performance
    • this was actually causing the panic in the parallel walk
  • I don't see what is being improved with childOrLink in terms of having both be nil. It certainly doesn't make things worse, and it improves upon the old code by making a stronger coupling between child and link (having them in the same slice), so I'm fine landing this. But we still have two pointers to reference and explicitly check for nil (compare with an API that exposes callbacks to process either the child or the link and actually enforces that only one will run; see the sketch below for what that could look like).
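For reference, a rough sketch of the callback-style API mentioned in the last point, reusing the hypothetical childer types from the sketch in the PR description. This is not part of the PR, just an illustration of the alternative:

```go
// Hypothetical callback-style accessor: the childer decides which callback
// runs, so callers never have to nil-check two pointers themselves.
func (c *childer) each(i int, onChild func(*Shard), onLink func(*Link)) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	e := c.children[i]
	switch {
	case e == nil:
		return
	case e.child != nil:
		onChild(e.child) // exactly one of the two callbacks runs
	case e.link != nil:
		onLink(e.link)
	}
}
```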

ajnavarro commented Jul 25, 2022

@schomatis childOrLink is needed to avoid race conditions when reading. Before, you would call getChild and, after checking that it was nil, call getLink; other goroutines could change the value between the two calls, so you might get nil again. Getting the shard and the link at the same time removes that problem from the code.
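To illustrate the race described above, here is a hedged sketch using the same hypothetical childer types from the PR description (not the real API): each old-style getter is individually safe, but a writer can change the entry between the two calls, so the caller can still see nil from both.

```go
// Old pattern: separate getters, each taking the lock on its own.
func (c *childer) getChild(i int) *Shard {
	c.mu.RLock()
	defer c.mu.RUnlock()
	if e := c.children[i]; e != nil {
		return e.child
	}
	return nil
}

func (c *childer) getLink(i int) *Link {
	c.mu.RLock()
	defer c.mu.RUnlock()
	if e := c.children[i]; e != nil {
		return e.link
	}
	return nil
}

// Check-then-act caller: racy because the entry can be replaced between
// getChild and getLink. The combined get (see the earlier sketch) returns a
// consistent pair under one lock acquisition instead.
func resolve(c *childer, i int) (*Shard, *Link) {
	if child := c.getChild(i); child != nil {
		return child, nil
	}
	// Another goroutine may have replaced the entry here, so this can still
	// return nil even though the entry was set a moment ago.
	return nil, c.getLink(i)
}
```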

it doesn't affect performance

Locking for sure affects performance, but the behavior will be correct.

this was actually causing the panic in the parallel walk

I added a minimal test to reproduce the error before starting to fix the bug. It was the easiest way to reproduce it. If you have a better idea for a test that reproduces the problem, just let me know!
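As a rough idea of what such a test can look like (again using the hypothetical childer sketch from the PR description, not the actual test added in this PR), readers and writers hammer the same slice while the race detector watches:

```go
package hamt

import (
	"sync"
	"testing"
)

// Hypothetical shape only; run with: go test -race
func TestChilderParallelAccess(t *testing.T) {
	c := &childer{children: make([]*childOrLink, 8)}

	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			for j := 0; j < 1000; j++ {
				c.set(i, &Shard{}, nil) // concurrent writer
				c.get(i)                // concurrent reader on the same slice
			}
		}(i)
	}
	wg.Wait()
}
```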

schomatis (Contributor)

childOrLink is needed to avoid race conditions when reading. Before, you would call getChild and, after checking that it was nil, call getLink; other goroutines could change the value between the two calls, so you might get nil again. Getting the shard and the link at the same time removes that problem from the code.

The mutex does that, not the consolidation of child and link (which is still a good thing for understanding the code better). As mentioned before, I get the feeling we're conflating different issues here.

schomatis (Contributor)

I added a minimal test to reproduce the error before starting to fix the bug.

That's good, but the test introduces the idea that we use shards in parallel, and I'm not seeing where that happens in the code.

ajnavarro (Member, Author)

Added more context here: ipfs/kubo#9063 (comment)

BigLep commented Sep 1, 2022

Closing because we think that when we attack ipfs/kubo#9063 in the future, we'll add locking at the MFS level.

BigLep closed this Sep 1, 2022