Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement Retry Mechanism and Concurrency Control for Jobs #21

Merged
merged 15 commits into from
Apr 25, 2024

Conversation

falvarez1
Copy link
Member

Overview

This pull request introduces two major enhancements to NCronJob:

  1. Retry Mechanism: To improve the robustness of job executions, especially in cases of transient failures.
  2. Supports Concurrency Attribute: To provide fine-grained control over the concurrency behavior of individual jobs.

Details

  1. Retry Mechanism
  • Purpose: Increases reliability by automatically retrying jobs that fail due to transient issues.
  • Implementation: Implemented using the Polly library with an exponential backoff strategy, allowing recoverable failures to be retried.
  • Configuration: Customizable per job, enabling specific settings for retry counts and backoff intervals.
  1. Supports Concurrency Attribute
  • Purpose: Adds control over job concurrency to prevent issues from multiple simultaneous instances of the same job type.
  • Implementation: New SupportsConcurrency attribute introduced to mark job classes that can run concurrently.
  • Default Behavior: Jobs are assumed to be non-concurrent by default. Scheduler checks this attribute before starting a new instance.
  • Impact: Prevents simultaneous execution of non-concurrent jobs, reducing potential conflicts or resource issues.

Justification

  • Retry Mechanism: Enhances resilience against common transient failures in distributed environments.
  • Concurrency Control: Ensures job execution integrity by preventing duplicate runs of critical or resource-intensive tasks or tasks that work on shared resources.

A new `TestCancellationJob` class has been added to simulate a long-running job and handle job cancellation.
Everything was made to support async structures to better support error handling and cancellation support.
Added preliminary parallel execution support.
@linkdotnet
Copy link
Member

linkdotnet commented Apr 17, 2024

Thanks for that PR! Really good job. Before going on a code base, I wanted to address some of the features. I very much like the features you provided but my experience pushes me into a different direction on sensible defaults than you took. So the following is just about that - not about the features itself, they are very welcome.

Retry Mechanism

In your current implementation retry is "always-on" without an option to opt-out. In general I would argue making this feature opt-in is the more sensible choice. That is how many of the .NET objects work and what I would expect as well.

Supports Concurrency Attribute

That is an interesting take, because my gut feeling would also go the opposite direction. Make every job concurrent by default and opt-out of this feature. And here I lean towards the same argumentation as above, it feels more natural (to me).
Many jobs, like generating reports or even sending e-mails (given not the identical receiver and message) should work in a concurrent setup just fine.

@falvarez1
Copy link
Member Author

In your current implementation retry is "always-on" without an option to opt-out. In general I would argue making this feature opt-in is the more sensible choice. That is how many of the .NET objects work and what I would expect as well.

I agree with you about the defaults, I did not integrate configuration bindings as this was more of an initial concept. However most orchestration systems default to retries because they assume the system is deterministic and Jobs are idempotent. Since neither of those are ensured here we can opt-in instead.

That is an interesting take, because my gut feeling would also go the opposite direction. Make every job concurrent by default and opt-out of this feature. And here I lean towards the same argumentation as above, it feels more natural (to me).
Many jobs, like generating reports or even sending e-mails (given not the identical receiver and message) should work in a concurrent setup just fine.

I understand the natural concurrency for tasks like generating reports or sending e-mails, but this is more about risk-mitigation. The default non-concurrent setting is a safer approach in a non-deterministic environment where shared resources or external states are involved. This reduces the risks of race conditions and unintended side effects. Ensuring that concurrency is only enabled after a developer explicitly verifies it helps prevent complex debugging issues, making the system more robust and easier to maintain. Funny enough the reason I added this to this PR was actually because I hit debugging issues while testing the cancellation support and having many of the same type Jobs running in parallel.

@linkdotnet
Copy link
Member

I understand the natural concurrency for tasks like generating reports or sending e-mails, but this is more about risk-mitigation. The default non-concurrent setting is a safer approach in a non-deterministic environment where shared resources or external states are involved. This reduces the risks of race conditions and unintended side effects. Ensuring that concurrency is only enabled after a developer explicitly verifies it helps prevent complex debugging issues, making the system more robust and easier to maintain. Funny enough the reason I added this to this PR was actually because I hit debugging issues while testing the cancellation support and having many of the same type Jobs running in parallel.

That is a fair point. Let's go with your approach and if we see that the usage is way different, we can still flip it in the next major version.

@falvarez1
Copy link
Member Author

That is a fair point. Let's go with your approach and if we see that the usage is way different, we can still flip it in the next major version.

If you prefer I can add a global optIn optOut approach. For example:

services.AddNCronJob(builder =>
{
    // Use this if you assume most jobs are safe to run concurrently and exceptions are explicitly handled
    builder.OptInConcurrency();  
});

services.AddNCronJob(builder =>
{
    // Use this if you assume most jobs should not run concurrently and only specific jobs are safe to handle it
    builder.OptOutConcurrency(); 
});

or

services.AddNCronJob(builder =>
{
    ...
}).OptOutConcurrency();

But this would have a minor breaking change since AddNCronJob would not return an IServiceCollection. But do you really want to give people the option to chain service registrations off of AddNCronJob?

Then decorate the Job according to how you set your global:

[SupportsConcurrency(true)]
public class ConcurrentJob : IJob
{
    // Implementation allowing concurrency
}

[SupportsConcurrency(false)]
public class NonConcurrentJob : IJob
{
    // Implementation disallowing concurrency
}

So if you use OptInConcurrency then all Jobs are assumed to support concurrency and you would have to explicitly add [SupportsConcurrency(false)] as a way to switch that off just for that Job, and vice-versa.

@linkdotnet
Copy link
Member

I do prefer the attribute (without any boolean flag) over the other given approaches.
That gives granularity and a much better API experience.

We can think of configuring this via the configuration builder:

Services.AddNCronJob(o => 
{
 o.AddJob<MySampleJob>().WithCronExpression("...").AllowMultipleInstances();
});

@linkdotnet linkdotnet mentioned this pull request Apr 17, 2024
4 tasks
@falvarez1
Copy link
Member Author

falvarez1 commented Apr 17, 2024

We can think of configuring this via the configuration builder:

Services.AddNCronJob(o => 
{
 o.AddJob<MySampleJob>().WithCronExpression("...").AllowMultipleInstances();
});

@linkdotnet The challenge with that approach is that the dev can easily do the following (because you're not making the concurrency support inherently a part of the Job):

builder.Services.AddNCronJob(o =>
{
    o.AddJob<MySampleJob>().WithCronExpression("*/10 * * * *").AllowMultipleInstances();
    o.AddJob<MySampleJob>().WithCronExpression("*/25 * * * *");
});

It's the same MySampleJob Job type but one describes support for concurrency while the other does not. I think this leads to confusion and having us explain which one takes precedence. Or just asking devs not to do that, which is not an ideal API experience.

@linkdotnet
Copy link
Member

@linkdotnet The challenge with that approach is that the dev can easily do the following (because you're not making the concurrency support inherently a part of the Job):

builder.Services.AddNCronJob(o =>
{
    o.AddJob<MySampleJob>().WithCronExpression("*/10 * * * *").AllowMultipleInstances();
    o.AddJob<MySampleJob>().WithCronExpression("*/25 * * * *");
});

It's the same MySampleJob Job type but one describes support for concurrency while the other does not. I think this leads to confusion and having us explain which one takes precedence. Or just asking devs not to do that, which is not an ideal API experience.

That is fair. Given that we already settled on the attribute, I'd propose utilizing this to control the concurrency level:

[SupportsConcurrency(MaxDegreeOfParallelism = 3)]
public class MyJob : IJob

That would also remove the need for the ConcurrencyConfig as this can be checked once at the beginning of the scheduler.

@falvarez1
Copy link
Member Author

falvarez1 commented Apr 18, 2024

[SupportsConcurrency(MaxDegreeOfParallelism = 3)]
public class MyJob : IJob

That would also remove the need for the ConcurrencyConfig as this can be checked once at the beginning of the scheduler.

I think you're mixing the global concurrency with the Job specific concurrency (basically it's 1 job with many triggers). Your suggestion of putting it in the attribute means it is specific only to that job. However the way I had coded the ConcurrencyConfig.MaxDegreeOfParallelism was to manage the concurrency of the Scheduler within the application, not the Job.

Either way I think properly supporting concurrency for both individual Jobs and globally can be a little challenging, but I think I have something you can test

@linkdotnet
Copy link
Member

I think you're mixing the global concurrency with the Job specific concurrency

Indeed I did - thanks for the clarification.

@linkdotnet
Copy link
Member

linkdotnet commented Apr 19, 2024

All in all, this PR goes in a really good direction - I really appreciate your work and discussion, @falvarez1 !

My open ToDo's for now:

  • The retry policy might be configurable as well and not set by default. To be consistent it could be an attribute just like SupportConcurrency is. So you can annotate a job with RetryPolicy(RetryCount = 3)
  • Some cleanup in the sample because it should show only the bare minimum without confusing someone that is new to the library
  • Some unit tests for the new behavior (which I can also do once it is merged).
  • Updating the README.md to showcase the new capabilities.

Also thanks for keeping the API stable so we don't have to do a major version bump. If you have time you can also extend the CHANGELOG.md. This file is used for creating the release notes. If not, I can pick this up as well.

@falvarez1
Copy link
Member Author

  • The retry policy might be configurable as well and not set by default. To be consistent it could be an attribute just like SupportConcurrency is. So you can annotate a job with RetryPolicy(RetryCount = 3)

Glad you said that because I already had that in mind to submit once this was approved ;-) I'll include it in this one instead.

  • Some cleanup in the sample because it should show only the bare minimum without confusing someone that is new to the library

I agree. I don't know the best way to isolate each use case as a separate example without literally creating a separate sample project. I'll clean it up as best I can but open to suggestions.

  • Some unit tests for the new behavior (which I can also do once it is merged).

Certainly, just waiting to finalize the design first. Don't want to keep changing them while it's in flux.

  • Updating the README.md to showcase the new capabilities.

I'll give it shot (both README and CHANGELOG).

Also thanks for keeping the API stable so we don't have to do a major version bump.

No worries. I do have other ideas but they may require changes to the API definition. Can discuss separately.

@linkdotnet
Copy link
Member

I agree. I don't know the best way to isolate each use case as a separate example without literally creating a separate sample project. I'll clean it up as best I can but open to suggestions.

I do agree - it might be tricky. I do think we need a separate documentation page.
It is almost impossible to showcase all the things the library can do, but keep it very simple and minimalistic.

Certainly, just waiting to finalize the design first. Don't want to keep changing them while it's in flux.

That makes absolutely sense - quite frankly I do think we are almost there.

No worries. I do have other ideas but they may require changes to the API definition. Can discuss separately.

Keep it coming ;) - anyway thanks again for the hard work!

@falvarez1 falvarez1 force-pushed the retry-feature branch 5 times, most recently from f7a5315 to a49372b Compare April 24, 2024 00:23
falvarez1 and others added 7 commits April 23, 2024 20:36
…nal changes to job execution, retry logic, and concurrency management within the system:

- **Job and Retry Logic Refactoring:**
  - Moved `retryHandler` from `CronScheduler` to `JobExecutor` to centralize execution logic.
  - Consolidated all retry policy-related code into the new `RetryPolicies` folder.
  - Added `RetryPolicy` attributes with two strategies: exponential backoff and fixed interval.

- **Concurrency Enhancements:**
  - Introduced `SupportsConcurrency` attribute with a `MaxDegreeOfParallelism` parameter to finely control job execution.
  - Replaced `ConcurrencyConfig` class with `ConcurrencySettings`, incorporating a `MaxDegreeOfParallelism` property.
  - Updated `CronScheduler` and `NCronJobOptionBuilder` to utilize `ConcurrencySettings`.
  - Added preliminary support for priority-based queuing of jobs to improve execution efficiency.

- **Testing and Samples:**
  - Added new jobs in `NCronJobSample` project to showcase concurrency capabilities.
  - Updated sample project configuration, including removal of machine-specific settings and adding `ConcurrencySettings` to `.editorconfig`.
  - Enhanced `TestCancellationJob` to support a `MaxDegreeOfParallelism` of 10 and added multiple job instances for testing.

- **Project Structure and Cleanup:**
  - Moved job definition files into a dedicated `Jobs` folder.
  - Removed obsolete pragma directives and updated project settings to reflect new structure and focus.

- **Additional Features and Fixes:**
  - Added auto-detection of precision in `WithCronExpression` when the second parameter is omitted.
  - Added default values to `SupportsConcurrencyAttribute` to simplify job configuration.
  - Implemented thread safety in `CronScheduler.cs` to ensure reliable operation under concurrent access.
  - CronRegistry.RunInstantJob now has a void return type, aligning with expected behavior.
…tionalities and associated tests. Key changes include:

- **Retry Attributes**: Minor updates to retry attributes to improve reliability.
- **Retry Logic Cleanup**: Cleaned up the retry code and added two new tests to ensure functionality.
- **Time Provider Fix**: Resolved issues in TimeProvider that affected tests, ensuring accurate time simulation.
- **Retry Issue Resolution**: Addressed specific bugs in the retry mechanisms that affected process stability.
- **Job Execution Context**: Changed JobExecutionContext to be an instance of Job Run rather than the Job itself, clarifying execution context.
- **Thread Safety in JobOptionBuilder**: Enhanced thread safety and added a test to verify concurrency handling in JobOptionBuilder.
- **CronExpression Tests**: Fixed previously broken tests related to CronExpressions and added additional tests to cover new cases.
- **optimize scheduler: use precomputed values for cron occurrence
…CronJob-Dev#24)

* Use GetRequiredService instead of GetService and added tests

* Removed redundant test

* Added entry in changelog

* Remove empty line in NCronJobIntegrationTests.cs

* Update CHANGELOG.md

Co-authored-by: Steven Giesel <stgiesel35@gmail.com>

---------

Co-authored-by: Steven Giesel <stgiesel35@gmail.com>
…astructure:

- **Test Fixes and Enhancements:**
  - Fixed all existing tests to ensure they are passing and accurately reflecting intended behavior.
  - Introduced a new version of `WaitForJobsOrTimeout` that allows time advancement for each run, facilitating more precise control over test timing and behavior.
@falvarez1 falvarez1 requested a review from linkdotnet April 24, 2024 01:48
CHANGELOG.md Outdated Show resolved Hide resolved
README.md Show resolved Hide resolved
@linkdotnet
Copy link
Member

linkdotnet commented Apr 25, 2024

There are small ordering things inside the README.md I'd like to move around (Migration not in between features for example) - but nothing that blocks merging.

If there are no open points from your side, I'd probably Squash and merge that thing.

I'd directly release a preview version for folks to test.

@falvarez1
Copy link
Member Author

I'd directly release a preview version for folks to test.

I agree, I think this needs a little more user testing. I also want to note that some other features I plan to contribute build upon this work, so figuring out a clear path forward would be beneficial for everyone. I suggest creating a develop branch that serves as a staging area for preview features before they're merged into main. This would allow any new PRs that build upon the preview features to be targeted against the develop branch, keeping the main branch stable while allowing more streamlined development of new features. What do you think about this approach?

There are small ordering things inside the README.md I'd like to move around (Migration not in between features for example) - but nothing that blocks merging.

Would it also help to put a ToC at the top that links to the sections?

@linkdotnet
Copy link
Member

I'd directly release a preview version for folks to test.

I agree, I think this needs a little more user testing. I also want to note that some other features I plan to contribute build upon this work, so figuring out a clear path forward would be beneficial for everyone. I suggest creating a develop branch that serves as a staging area for preview features before they're merged into main. This would allow any new PRs that build upon the preview features to be targeted against the develop branch, keeping the main branch stable while allowing more streamlined development of new features. What do you think about this approach?

Basically, main reflects that nature. I am okay with "half-baked" (not in the sense that they are unfinished but maybe need some polishing) features on main and stabilizing them over time until we have a release. That said I don't see the need of having a separate develop (or similar) branch.

Would it also help to put a ToC at the top that links to the sections?

That would be one thing to do - I plan on having a bit more detailed documentation that gets pushed to the GitHub Page (or buying a proper domain name in the long term - if there is an interest in the library).

@linkdotnet
Copy link
Member

@falvarez1 I would merge the changes into main - or do you have any open points?
Big thanks to you for the awesome work - really appreciated, that you took the time and effort!

@falvarez1
Copy link
Member Author

No other points, I'm good with the merge, thanks.

@linkdotnet linkdotnet merged commit c520711 into NCronJob-Dev:main Apr 25, 2024
3 checks passed
@falvarez1 falvarez1 deleted the retry-feature branch May 19, 2024 17:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Configuration of degree of parallelism
3 participants