Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TASK: Prevent multiple catchup runs #4751

Open
wants to merge 9 commits into
base: 9.0
Choose a base branch
from

Conversation

kitsunet
Copy link
Member

Will use caches to try avoiding to even start async catchups if for that projection one is still running.
Also adds a simple single slot queue with incremental backup to ensure a requested catchup definitely happens.

Copy link
Member

@bwaidelich bwaidelich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great already, thanks a lot for your efforts!
I find it a bit hard to understand the queueing logic (even though we came up with it together *g)
Maybe that can be formulated a bit more explicitly such that it is easier to find potential issues.
I added some first suggestions inline

@kitsunet kitsunet marked this pull request as ready for review November 15, 2023 17:52
Copy link
Member

@bwaidelich bwaidelich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks so much cleaner and easier to comprehend, thank you!
Two more comments that I wasn't aware before..
Let me know if you could use a rubber duck :)

Copy link
Member

@bwaidelich bwaidelich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kitsunet I thought about this one again realized that CATCHUPTRIGGER_ENABLE_SYNCHRONOUS_OPTION is a bad naming choice since this is not so much about sync vs async but more about whether we expect multiple parallel threads or not.
For example: I would like to be able to simply switch the catch up trigger to the synchronous one for individual CRs for performance reasons – but they would still have to be queued in order to prevent concurrent processes from running into the "Failed to acquire checkpoint" exception.

I have a, somewhat radical, suggestion:
What if we replaced AsynchronousCatchUpRunnerState by some central authority that can be used to queue catchups independently from their implementation?

final readonly class CatchUpQueue {

    public function __construct(
        private ContentRepositoryId $contentRepositoryId,
        private FrontendInterface $catchUpLock,
    ) {}

    public function queueCatchUp(ProjectionInterface $projection, \Closure $catchUpTrigger): void {
        // here goes your logic from SubprocessProjectionCatchUpTrigger and  AsynchronousCatchUpRunnerState
    }

    public function releaseCatchUpLock(string $projectionClassName): void {
        // to be called from the part that invokes ContentRepository::catchUp()
    }
}

I think that the change is smaller than it might first appear (since you already solved all of the logical issues) but it would allow us to use this for all implementations (for testing context we could consider replacing the cache with a NullBackend).

E.g. in SubprocessProjectionCatchUpTrigger::triggerCatchUp():

$catchUpQueue = $this->catchUpQueueFactory->build($contentRepositoryId);
foreach ($projections as $projection) {
    $catchUpQueue->queueCatchUp($projection, $this->startCatchUp(...));
}

Sorry for the forth and back, but I think we're getting there and I'm happy to help or take over!

@bwaidelich
Copy link
Member

The example code above is not correct yet because it would block catchups in the foreach, so maybe it has to be adjusted to:

final readonly class CatchUpQueue {
     // ...
    public function queueCatchUp(array $projections, \Closure $catchUpTrigger): void {
        // keep looping over $projections until there are no more running remaining
    }
}

@kitsunet
Copy link
Member Author

Oh, that's a very cool idea, and I think this all makes sense, I'll see how far I get tomorrow and then we can see if you want to take over but at least I get your idea and know this part of the code already so it makes sense for me to go on.

@kitsunet
Copy link
Member Author

kitsunet commented Nov 26, 2023

This is a though nut, I thought it would be nicer to wrap the debouncer - as I named it now - around the ProjectionCatchUpTriggerInterface also implementing the interface, this is nice as it just needs slight adjustments in \Neos\ContentRepositoryRegistry\ContentRepositoryRegistry::buildProjectionCatchUpTrigger, this seemed cleanest, but regardless of how we do it, we somehow need to "unlock" after a catchup is finished even in async. The clean way IMHO would be to use a CatchUpHook that triggers on onAfterCatchUp BUT that gets no arguments so I neither know which repository nor which projection was caught up :sadpanda: (I figured the repository is known to the factory creating the hook, so that is fine)
I guess I will extend the interface but hat will make this a breaking change if someone implemented the interface...

@kitsunet
Copy link
Member Author

kitsunet commented Nov 26, 2023

No, this also seems like a bad idea, because the hooks are per projection and I can't know from the outside which projection has said hook active. It feels unwise to have a global debouncer but a per projection (and repository) unlock of it (via the hook).

SignalSlot might be an option but then it becomes messy to get hold of the debouncer to unlock the state.

@kitsunet
Copy link
Member Author

I guess I would also like a rubber duck here :)

@bwaidelich
Copy link
Member

bwaidelich commented Nov 27, 2023

Using the CatchUpHook seems like a nice – although I did not fully understand why this is needed if we have a custom CLI handler to trigger the catch up.

I guess I will extend the interface but hat will make this a breaking change if someone implemented the interface...

I think that's fine. It would only break for users that actually created a custom hook implementation – and it's really easy to fix

SignalSlot might be an option

I would like the implementation to be independent from Flow as much as possible because something like this will be needed in other places, too..

I guess I would also like a rubber duck here :)

I'm up for it – tomorrow?

kitsunet and others added 3 commits November 28, 2023 20:32
Will use caches to try avoiding to even start async
catchups if for that projection one is still running.
Also adds a simple single slot queue with incremental
backup to ensure a requested catchup definitely happens.
…UpTrigger/SubprocessProjectionCatchUpTrigger.php

Co-authored-by: Bastian Waidelich <b.waidelich@wwwision.de>
@kitsunet kitsunet force-pushed the task/prevent-duplicate-catchup-runs branch 2 times, most recently from 060e8ac to 242b5f6 Compare November 29, 2023 09:48
@kitsunet kitsunet force-pushed the task/prevent-duplicate-catchup-runs branch from 242b5f6 to 387f409 Compare November 29, 2023 10:19
Copy link
Member

@bwaidelich bwaidelich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kitsunet This looks awesome already, thanks for holding on!!

Beware: Most if not all of my inline comments are rather nitpicky – that's because I couldn't find any bigger issues ;)

Except for a two potential ones:

1. Global Queue concept

With the CR registry we tried to avoid our "global service" thinking, but now all CRs would share the same CatchUpDeduplicationQueue, conceptually, i.e. sharing the same cache backend (my point: they could internally, but now they have to)
That might not be an issue in practice, but if we were to add a factory like

class CatchUpDeduplicationQueueFactory {

   // ...

   public function build(ContentRepositoryId $contentRepositoryId): CatchUpDeduplicationQueue
   {
        // ...
    }
}

And with that, maybe, maybe it makes sense to take the one extra step and do introduce an interface for it..

To be honest: While writing it down, I'm not sure what issue there could be.. Maybe we can just postpone this part

2. Tests

Annoying one, but since this is a crucial part for the system to work, I would suggest we add some tests.
We could at least add a functional test, creating two instances and see how they interact?
But the best would be a test that covers this with multiple threads. I made some great experience with paratestphp/paratest (we use that for the consistency tests for neos/eventstore-doctrineadapter for example).

I'm happy to help with this, and we can – of course – add those in a separate PR as well (to get this one in asap)

{
$queuedProjections = $this->triggerCatchUpAndReturnQueued($projections);
$attempts = 0;
/** @phpstan-ignore-next-line */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just out of curiosity: What was PHPStan complaining about?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the new Projections::emptyis in an internal class if you have suggestions I am happy, we could make Projections not internal or I could not use it, but it seems very convenient.

@bwaidelich
Copy link
Member

A few more potential issues/considerations (in addition to the potential race condition mentioned above):

  • Currently the queue is responsible for multiple subscribers which makes the code a bit hard to follow. One Queue per subscriber might simplify the implementation a lot (and provide more flexibility for future features like projection priorities etc)
  • ContentRepository::catchUpProjection() currently ignores the lock (by definition because it is meant as the low-level trigger). That could be error prone if people would use this instead of the coordinated catchUp. Maybe we can turn things around to only allow for queueing the catch up via CR API and invoke the actual catchup via some internal API
  • The lock is currently not released upon exceptions AFAIS

@kitsunet
Copy link
Member Author

kitsunet commented Dec 4, 2023

* Currently the queue is responsible for multiple subscribers which makes the code a bit hard to follow. One Queue per subscriber might simplify the implementation a lot (and provide more flexibility for future features like projection priorities etc)

I should have tried this as I thought the same, but I think that we might end up in a concurrent lock situation if we do not tackle the whole package at once (as then you end up in multiple while loops waiting for catchups to finish)

* `ContentRepository::catchUpProjection()` currently ignores the lock (by definition because it is meant as the low-level trigger). That could be error prone if people would use this instead of the coordinated catchUp. Maybe we can turn things around to only allow for queueing the catch up via CR API and invoke the actual catchup via some internal API

We could check if the lock is acquired in there and otherwise throw an exception (or just do nothing) ?

* The lock is currently not released upon exceptions AFAIS

Right that needs to be fixed for sure!

Using symfony/lock ensures atomic locking, which
should really prevent duplication even under load.
@kitsunet
Copy link
Member Author

kitsunet commented Dec 6, 2023

Step one, replacing cache with lock, the overall logic is the same still.

@kitsunet
Copy link
Member Author

kitsunet commented Dec 20, 2023

We now have two separate lock steps, a "run" lock around the catchUp of a specific projection, preventing that same projection catchUp to be run multiple times in parallel, and the deduplication "queue" lock that will check if a catchUp is already running and create a queue lock to prevent multiple processes to wait for another catchup run and all having to potentially spawn background processes. This second queue lock is not atomic as it depends on the "run" lock. This should be fine, it's there to prevent runaway spawning of parallel subrequests in busy installation and for that it should work fine with it's randomized back off.

/**
* This is not a regular unit test, it won't run with the rest of the testsuite.
* the following two commands after another would run the parallel tests and then the validation of results:
* requires "brianium/paratest": "^6.11"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i see because the latest para test latest requires phpunit 10 but flow blocks that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Blocked
Development

Successfully merging this pull request may close these issues.

3 participants