Skip to content

Latest commit

 

History

History
73 lines (64 loc) · 4.79 KB

2.5. Service-level Agreements (SLA).md

File metadata and controls

73 lines (64 loc) · 4.79 KB

Service-level Agreements (SLA)

  • Formal documents to define the performance standards that apply to Azure.
  • Specify also what happens if a service or product fails to perform to a governing SLA's specification.
  • There are SLAs for individual Azure products and services.
  • ❗ Azure does not provide SLAs for most services under the Free or Shared tiers
    • e.g. Azure Advisor
  • Three key characteristics of SLAs for Azure products and services:
    1. Performance Targets
      • Specific to each Azure product and service.
      • E.g. uptime guarantees or connectivity rates
    2. Uptime and Connectivity Guarantees
      • 📝 Monthly Uptime % = (Maximum Available Minutes-Downtime) / Maximum Available Minutes X 100
      • 📝 Range from 99.9% ("three nines") to 99.999% ("five nines") for any paid tier service.
        • In other words minimum SLA for all non-free Azure services are 99.9%
      • E.g. Azure Cosmos DB (Database) service SLA offers 99.999 percent uptime
        • meaning it allows for about 5 minutes of total downtime per year.
        • also includes low-latency commitments of less than 10 milliseconds on DB read + write operations.
    3. 📝 Service credits
      • Given to paying Azure customers if uptime percentage is lower than given in SLA.
      • Describe how Microsoft will respond if an Azure product or service fails to perform to its governing SLAs specification.
      • E.g. customers may have a discount applied to their Azure bill, as compensation for an under-performing Azure product or service.
  • Read more: SLA Summary for Azure Services

Composite SLA

  • Result of combining SLAs across different service offerings.
  • 📝 Calculating downtime
    • E.g. web app (99.95% SLA from Azure) writes to SQL database (99.99% SLA from Azure)
      • Composite SLA = 99.95 percent × 99.99 percent = 99.94 percent
        • = 0.9995 * 0.9999 = 0.9994
      • Means combined probability of failure is higher than the individual SLA values
  • You can improve the composite SLA by creating independent fallback paths.
    • E.g. if the SQL Database is unavailable, you can put transactions into a queue for processing at a later time.
      • Web app (99.95%) writes to either SQL Database (99.99%) or queue (99.9%)
      • Application is still available even if it can't connect to the database.
        • ❗But it fails if both the database and the queue fail simultaneously.
      • If the expected percentage of time for a simultaneous failure is 0.0001 × 0.001
        • the composite SLA for this combined path of a database or queue would be:
          • 1.0 − (0.0001 × 0.001) = 99.99999 percent
      • If we add the queue to our web app, the total composite SLA is:
      • 99.95 percent × 99.99999 percent = ~99.95 percent
      • Improves SLA but application logic gets more complicated
        • You are paying more to add the queue support and there may be data-consistency issues you'll have to deal with due to retry behavior.

Application SLA

  • By creating your own SLAs, you can set performance targets to suit your specific Azure application.
  • 💡 >= four 9's (99.99%) SLA performance targets =>
    • manual intervention from failures may not be enough (difficult to be quick enough)
    • should have self-diagnosing & self-healing solutions.

Resiliency

  • Resiliency is the ability of a system to recover from failures and continue to function.
  • High availability and disaster recovery are two crucial components of resiliency
    • 📝 Disaster recovery: When Godzilla destroys your data center, you do have alternative locations to keep providing your service and protocols/means for the other location to know how to keep delivering the service.
  • Failure Mode Analysis (FMA)
    • Goal:
      • Identify possible points of failure.
      • Define how the application will respond to those failures.
  • Read more: Designing resilient applications for Azure

High availability

  • 📝 Availability is often given as percentage uptime
  • Refers to the time that a system is functional and working.
  • Most providers prefer to maximize the availability of their Azure solutions by minimizing downtime.
    • ❗ As you increase availability, you also increase the cost and complexity of your solution.
    • As your solution grows in complexity, you will have more services depending on each other.
      • You might overlook possible failure points in your solution if you have any interdependent services.
      • 💡E.g. a workload that requires 99.99 percent uptime shouldn't depend upon a service with a 99.9 percent SLA.
  • Read more: Availability choices for Azure compute