Resource optimized placement strategy #8815

Merged
merged 53 commits into from
Jan 16, 2024


Contributor

@ledjon-behluli ledjon-behluli commented Jan 11, 2024

This PR adds support for a resource-optimized placement strategy.

ResourceOptimizedPlacement is a placement strategy which attempts to optimize resource distribution across the cluster.

It assigns weights to runtime statistics to prioritize different resources and calculates a normalized score for each silo.
The silo with the lowest score is chosen for placing the activation. Normalization ensures that each property contributes proportionally to the overall score. Users can adjust the weights based on their specific requirements and priorities for load balancing.
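The scoring idea above can be sketched roughly as follows. This is only an illustration, not the code in this PR: `SiloLoad`, `ScoreSketch`, and the two-metric weighting are hypothetical names, and each metric is assumed to already be a fraction between 0 and 1 where higher means more loaded.

```csharp
using System;
using System.Linq;

// Hypothetical per-silo snapshot: each value is assumed to be a 0..1 load fraction.
public readonly record struct SiloLoad(string Name, float Cpu, float Memory);

public static class ScoreSketch
{
    // Weights are relative to each other; dividing by their sum normalizes them,
    // so (50, 50) and (1, 1) produce the same score.
    public static float Score(SiloLoad silo, int cpuWeight, int memoryWeight)
    {
        int totalWeight = cpuWeight + memoryWeight;
        if (totalWeight == 0)
        {
            return 0f; // all weights zero: every silo gets the same score
        }

        return (cpuWeight * silo.Cpu + memoryWeight * silo.Memory) / totalWeight;
    }

    // The silo with the lowest score wins.
    public static SiloLoad PickLowest(SiloLoad[] silos, int cpuWeight, int memoryWeight)
        => silos.MinBy(s => Score(s, cpuWeight, memoryWeight));
}
```

With equal weights a CPU-heavy silo loses to a lightly loaded one; shifting all the weight to memory can flip the choice.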

In addition to normalization, an online adaptive algorithm provides a smoothing effect (it filters out high-frequency components) and avoids rapid signal drops by transforming them into a polynomial-like decay process. This helps avoid resource saturation on the silos, especially on newly joined silos.
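To illustrate only the smoothing part, here is a plain single-mode scalar Kalman filter. It is a simplified stand-in and not the filter used in this PR: it does not reproduce the dual-mode behavior or the slow decay on signal drops, and the class name and default noise values are assumptions.

```csharp
// Simplified stand-in: a textbook one-dimensional Kalman filter.
// Names and default noise values are illustrative assumptions.
public sealed class ScalarKalmanFilter
{
    private readonly float _processNoise;      // Q: how fast the true value is expected to drift
    private readonly float _measurementNoise;  // R: how noisy each measurement is assumed to be
    private float _estimate;
    private float _errorCovariance = 1f;
    private bool _initialized;

    public ScalarKalmanFilter(float processNoise = 0.01f, float measurementNoise = 1f)
    {
        _processNoise = processNoise;
        _measurementNoise = measurementNoise;
    }

    public float Filter(float measurement)
    {
        if (!_initialized)
        {
            // Seed the estimate with the first measurement.
            _estimate = measurement;
            _initialized = true;
            return _estimate;
        }

        // Predict: the state model is constant, so only the uncertainty grows.
        float priorCovariance = _errorCovariance + _processNoise;

        // Update: blend the prediction and the measurement using the Kalman gain.
        float gain = priorCovariance / (priorCovariance + _measurementNoise);
        _estimate += gain * (measurement - _estimate);
        _errorCovariance = (1f - gain) * priorCovariance;
        return _estimate;
    }
}
```

Because the gain stays strictly between 0 and 1, a sudden drop in the measured value moves the estimate only part of the way, which is the smoothing effect described above.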

Silos that the load-shedding mechanism classifies as overloaded are not considered as candidates for new placements.

When the local silo's score is within the preference margin of a remote silo's score, then the local silo is picked as the target.

Since more than one silo can have the exact same score, we pick one of them randomly so that we don't continuously pick the first one out of the short-listed ones.
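The three selection rules above could be combined along these lines. This is a sketch under assumptions, not the director code from this PR: `Candidate`, `PickSketch`, and the interpretation of the margin (local score minus best score is at most the margin) are all hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical candidate shape; the real director works on its own types.
public readonly record struct Candidate(string Silo, float Score, bool IsOverloaded, bool IsLocal);

public static class PickSketch
{
    public static Candidate? Pick(IReadOnlyList<Candidate> silos, float localPreferenceMargin, Random rng)
    {
        // 1. Overloaded silos are never candidates.
        List<Candidate> eligible = silos.Where(s => !s.IsOverloaded).ToList();
        if (eligible.Count == 0)
        {
            return null;
        }

        float best = eligible.Min(s => s.Score);

        // 2. Prefer the local silo when it is close enough to the best score
        //    (assumed meaning of "within the preference margin").
        foreach (Candidate s in eligible)
        {
            if (s.IsLocal && s.Score - best <= localPreferenceMargin)
            {
                return s;
            }
        }

        // 3. Break exact ties randomly instead of always taking the first one.
        List<Candidate> ties = eligible.Where(s => s.Score == best).ToList();
        return ties[rng.Next(ties.Count)];
    }
}
```

Passing the `Random` instance in keeps the tie-break testable; a margin of 0 disables the local preference unless the local silo already has the best score.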

This strategy is 'static' because it makes the best possible placement given the current view of the whole cluster. That view may change dramatically even if no new placement is requested, because of the business logic in the user's code.
A 'dynamic' resource optimization could attempt to rebalance the silos, but that is out of scope for this PR because it:

  1. Is more complicated.
  2. Should be part of a live-migration strategy, not a placement one.

@ledjon-behluli
Contributor Author

ledjon-behluli commented Jan 12, 2024

@ReubenBond Here are some benchmarks for the moded filter (passing processNoiseCovariance as an argument vs. keeping it as a field). Speed is pretty much the same (1000 measurements), but there is a drop of ~50% in allocs. Albeit it's only ~60 bytes and it's alloc'd only once upon instantiation, so we save ~6 KB for 100 silos 😂.

| Method         | Mean     | Error    | StdDev   | Median   | Ratio | RatioSD | Allocated | Alloc Ratio |
|--------------- |---------:|---------:|---------:|---------:|------:|--------:|----------:|------------:|
| OriginalFilter | 18.02 μs | 0.784 μs | 2.311 μs | 17.14 μs |  1.00 |    0.00 |     120 B |        1.00 |
| ModedFilter    | 14.87 μs | 0.310 μs | 0.910 μs | 14.69 μs |  0.84 |    0.10 |      56 B |        0.47 |

@ReubenBond
Member

If we end up only using DualModeKalmanFilter<float>, then perhaps we should specialize it. Looking at the JIT output, it's quite a bit shorter, with fewer calls:

@ledjon-behluli
Contributor Author

> If we end up only using DualModeKalmanFilter<float>, then perhaps we should specialize it. Looking at the JIT output, it's quite a bit shorter, with fewer calls:

Wouldn't dynamic PGO take care of de-virtualizing them once it "warms up"?

@ReubenBond
Member

ReubenBond commented Jan 12, 2024

> Wouldn't dynamic PGO take care of de-virtualizing them once it "warms up"

I don't know, but we can check. Do you have that benchmark code somewhere?

@ledjon-behluli
Contributor Author

ledjon-behluli commented Jan 12, 2024

> > Wouldn't dynamic PGO take care of de-virtualizing them once it "warms up"
>
> I don't know, but we can check. Do you have that benchmark code somewhere?

Yeah here:

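using System;
using System.Linq;
using BenchmarkDotNet.Attributes;

// DualModeKalmanFilter<T> and DualModeKalmanFilter_Moded<T> are the filter types
// being compared in this thread; neither is defined in this snippet.
[SimpleJob, MemoryDiagnoser]
public class KalmanFilterBenchmarks
{
    private double[] _measurements = Array.Empty<double>();

    // 1000 random measurements in the range [0.1, 99.9).
    [GlobalSetup]
    public void Setup() => _measurements =
        Enumerable.Range(0, 1000).Select(_ => Random.Shared.NextDouble() * 99.8 + 0.1).ToArray();

    [Benchmark(Baseline = true)]
    public double OriginalFilter()
    {
        // The filter is created inside the benchmark, so its allocation is measured too.
        DualModeKalmanFilter<double> originalFilter = new();
        double result = 0;
        foreach (double measurement in _measurements)
        {
            result = originalFilter.Filter(measurement);
        }
        return result; // returned so the loop is not optimized away
    }

    [Benchmark]
    public double ModedFilter()
    {
        DualModeKalmanFilter_Moded<double> modedFilter = new();
        double result = 0;
        foreach (double measurement in _measurements)
        {
            result = modedFilter.Filter(measurement);
        }
        return result;
    }
}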

@ReubenBond
Member

| Method     | Mean     | Error     | StdDev    | Ratio | Code Size | Allocated | Alloc Ratio |
|----------- |---------:|----------:|----------:|------:|----------:|----------:|------------:|
| NonGeneric | 7.512 us | 0.0093 us | 0.0082 us |  1.00 |     460 B |      40 B |        1.00 |
| Generic    | 9.770 us | 0.0224 us | 0.0209 us |  1.30 |     376 B |      40 B |        1.00 |

Generic is 30% slower than non-generic on my machine. The alloc is the filter instance itself. Unsure about the code size, but my guess would be that non-generic inlines more.

@ledjon-behluli
Contributor Author

@ReubenBond non-generic is a bit faster

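| Method                 | Mean     | Error    | StdDev   | Median   | Ratio | RatioSD | Allocated | Alloc Ratio |
|----------------------- |---------:|---------:|---------:|---------:|------:|--------:|----------:|------------:|
| OriginalFilter         | 19.48 μs | 1.130 μs | 3.331 μs | 19.24 μs |  1.00 |    0.00 |     104 B |        1.00 |
| ModedFilter            | 15.53 μs | 0.418 μs | 1.231 μs | 15.47 μs |  0.82 |    0.16 |      40 B |        0.38 |
| ModedFilter_NonGeneric | 11.39 μs | 0.327 μs | 0.889 μs | 11.11 μs |  0.60 |    0.10 |      40 B |        0.38 |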

@ReubenBond
Member

I think we should go with non-generic for now

@ledjon-behluli
Contributor Author

Posted at the same time 😂

@ledjon-behluli
Contributor Author

> I think we should go with non-generic for now

Agree. Even for 256GB of RAM, that's 2.56e+11 bytes, which is still well within float's range of 3.4e+38.
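A quick way to sanity-check the range argument is sketched below. The class and member names are made up for illustration. It also shows that float stores byte counts of this size only approximately (roughly 7 significant decimal digits), which is a property of the type and not something measured in this PR.

```csharp
public static class FloatRangeSketch
{
    // 256 GB expressed in bytes, using decimal gigabytes as in the comment above.
    public const long Ram256GB = 256_000_000_000L;

    // True when the value is below float.MaxValue (about 3.4e+38).
    public static bool WithinFloatRange(long bytes) => (double)bytes < float.MaxValue;

    // True when two byte counts round to the same float value.
    public static bool SameFloat(long a, long b) => (float)a == (float)b;
}
```

`WithinFloatRange(Ram256GB)` holds with many orders of magnitude to spare, while `SameFloat(Ram256GB, Ram256GB + 1)` is also true: neighboring byte counts collapse to the same float, which is harmless when the values only feed a normalized score.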

…ke it easier for the users to understand + add comments to explain that weights are relative to each other + modified the director to take into account potential totalWeight = 0 + removed config exception throwing if sum = 0; as the score will be 0 but due to the jitter it will act as it were RandomPlacement
@ledjon-behluli
Contributor Author

@ReubenBond I've made a few small fixes, added some comments, and switched the options to take int instead of float to make them more natural for end users.

The weights don't strictly need a hard upper limit (currently 100) because of normalization, but I believe it's better to place a boundary for the sake of sanity. This is debatable ofc!

Other than the above "issue", I don't see any further things we need to do, please let me know if you have something else in mind, otherwise this LGTM and is ready for merging.

@ledjon-behluli
Contributor Author

Update:

  1. I've changed ResourceStatistics to contain only non-nullable elements, for 2 reasons:
  • It simplifies CalculateScore by removing the null checks. The logic remains the same and correct, since CDM-KF was treating nulls as 0s either way.
  • Removing the nullable types from this struct brings its size down from 56 bytes -> 32 bytes.

One would expect 'float?' to have a size of 5 = 4 (float) + 1 (hasValue), but the alignment of the type is the size of its largest field.
For 'float?' -> 4 (float) + 1 (hasValue) + 3 [padding to reach largest field i.e. 'float' (4 bytes)] = 8 (total)
For 'long?' -> 8 (long) + 1 (hasValue) + 7 [padding to reach largest field i.e. 'long' (8 bytes)] = 16 (total)
Total (nullable): 8 (float?) + 8 (float?) + 16 (long?) + 16 (long?) + 1 (bool) + 7 (padding) = 56 bytes
Total (non-nullable): 4 (float) + 4 (float) + 8 (long) + 8 (long) + 1 (bool) + 7 (padding) = 32 bytes

  2. Applied packing to the struct to shave off an extra 7 bytes, bringing it down to 25 bytes.

This will help increase the number of ValueTuple<int, ResourceStatistics> (inside MakePick) that can be stack allocated from 64 -> 128 (at 4KB stack limit), therefore covering clusters with up to 128 silos before switching to ArrayPool.
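The size arithmetic above can be checked with `Unsafe.SizeOf<T>()`. The three structs below are hypothetical stand-ins that only mirror the field types listed above (two floats, two longs, one bool); they are not the actual ResourceStatistics definition, and `LayoutSketch` is an illustrative helper.

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

// Stand-in with nullable fields (56 bytes per the arithmetic above).
public struct NullableStats
{
    public float? Cpu; public float? Memory;
    public long? Available; public long? Maximum;
    public bool Overloaded;
}

// Stand-in with plain fields: 4 + 4 + 8 + 8 + 1 + 7 bytes of padding.
public struct PlainStats
{
    public float Cpu; public float Memory;
    public long Available; public long Maximum;
    public bool Overloaded;
}

// Pack = 1 removes the trailing padding: 4 + 4 + 8 + 8 + 1 bytes.
[StructLayout(LayoutKind.Sequential, Pack = 1)]
public struct PackedStats
{
    public float Cpu; public float Memory;
    public long Available; public long Maximum;
    public bool Overloaded;
}

public static class LayoutSketch
{
    public const int StackBudgetBytes = 4 * 1024; // the 4KB limit mentioned above

    public static int SizeOf<T>() => Unsafe.SizeOf<T>();

    // How many ValueTuple<int, T> entries fit into the stack budget.
    public static int TuplesInBudget<T>() => StackBudgetBytes / Unsafe.SizeOf<ValueTuple<int, T>>();
}
```

Printing `LayoutSketch.SizeOf<NullableStats>()`, `SizeOf<PlainStats>()`, and `SizeOf<PackedStats>()` should line up with the 56, 32, and 25 byte figures, and `TuplesInBudget<T>()` gives the matching element counts for the runtime it is executed on.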

@ledjon-behluli ledjon-behluli changed the title Static resource optimized placement strategy Resource optimized placement strategy Jan 15, 2024
@ReubenBond ReubenBond merged commit 2e7714c into dotnet:main Jan 16, 2024
19 checks passed
@github-actions github-actions bot locked and limited conversation to collaborators Feb 16, 2024