This article provides best practices for maintaining the reliability and security of your Azure cloud estate. Reliability ensures your cloud services remain operational with minimal downtime. Security safeguards the confidentiality, integrity, and availability of your resources. Both reliability and security are critical for successful cloud operations.
Manage reliability
Reliability management involves using redundancy, replication, and defined recovery strategies to minimize downtime and protect your business. Table 1 provides an example of workload priority mapped to reliability requirements (uptime SLO, maximum downtime, architecture redundancy, load balancing, and data replication and backups) that align with service-level objectives (SLOs).
Table 1. Example of workload priority and reliability requirements.
| Priority | Business impact | Minimum uptime SLO | Max downtime per month | Architecture redundancy | Load balancing | Data replication and backups | Example scenario |
|---|---|---|---|---|---|---|---|
| High (mission-critical) | Immediate and severe effects on company reputation or revenue. | 99.99% | 4.32 minutes | Multi-region and multiple availability zones in each region | Active-active | Synchronous, cross-region data replication and backups for recovery | |
Reliability responsibilities vary by deployment model. Identify your management responsibilities for infrastructure (IaaS), platform (PaaS), software (SaaS), and on-premises deployments: the more of the stack you control, the more reliability tasks you own.
Clearly defined reliability requirements establish your uptime targets, recovery times, and data loss tolerance. Follow these steps to define reliability requirements:
Prioritize workloads. Assign high, medium (default), or low priorities to workloads based on business criticality and financial investment levels. Regularly review priorities to maintain alignment with business goals.
Assign an uptime service level objective (SLO) to all workloads. Establish uptime targets according to workload priority. Higher-priority workloads require stricter uptime goals. Your SLO influences your architecture, data management strategies, recovery processes, and costs.
Identify service level indicators (SLIs). Use SLIs to measure uptime performance against your SLO. Examples include service health monitoring and error rates.
Assign a recovery time objective (RTO) to all workloads. The RTO defines the maximum acceptable downtime per failure for your workload. Your RTO must be short enough that the total downtime from all failures stays within your annual downtime allowance. For example, an uptime SLO of 99.99% allows approximately 52 minutes of downtime per year (4.32 minutes per month). Follow these steps:
Estimate the number of failures. Estimate how often you think each workload might fail per year. For workloads with operational history, use your SLIs. For new workloads, perform a failure mode analysis to get an accurate estimate.
Estimate the RTO. Divide your annual allowable downtime by the estimated number of failures. If you estimate four failures per year, your RTO must be 13 minutes or less (52 minutes / 4 failures = 13-minute RTO). The sketch after this list shows the full calculation.
Test your recovery time. Track the average time it takes to recover during failover tests and live failures. The time it takes you to recover from a failure must be less than your RTO. If your business continuity solution takes hours to restore service but your RTO is measured in minutes, you need a faster recovery option or a revised RTO.
Define recovery point objectives (RPO) for all workloads. Determine how much data loss your business can tolerate. This objective influences how frequently you replicate and back up your data.
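The following Python sketch walks through this arithmetic end to end. It's a minimal illustration; the SLO, the failure estimate, and the measured recovery time are assumed values, not figures prescribed by Azure. (A 365-day year gives roughly 52.6 minutes of annual downtime, slightly above the rounded 52-minute figure used above.)

```python
"""Worked example: derive a downtime budget and per-failure RTO from an uptime SLO.

The SLO value, failure estimate, and measured recovery time are assumptions for
illustration; substitute your own workload data.
"""

MINUTES_PER_YEAR = 365 * 24 * 60

def annual_downtime_budget(slo_percent: float) -> float:
    """Minutes of downtime per year allowed by an uptime SLO (for example, 99.99)."""
    return MINUTES_PER_YEAR * (1 - slo_percent / 100)

def rto_per_failure(slo_percent: float, estimated_failures_per_year: int) -> float:
    """Maximum acceptable downtime per failure, given the yearly failure estimate."""
    return annual_downtime_budget(slo_percent) / estimated_failures_per_year

if __name__ == "__main__":
    slo = 99.99                      # assumed uptime SLO for a high-priority workload
    failures_per_year = 4            # assumed failure estimate from SLIs or failure mode analysis
    measured_recovery_minutes = 10   # assumed average recovery time from failover tests

    budget = annual_downtime_budget(slo)
    rto = rto_per_failure(slo, failures_per_year)

    print(f"Annual downtime budget: {budget:.1f} minutes")    # ~52.6 minutes
    print(f"RTO per failure:        {rto:.1f} minutes")       # ~13.1 minutes
    print(f"Recovery meets RTO:     {measured_recovery_minutes <= rto}")
```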
Data reliability involves data replication (replicas) and backups (point-in-time copies) to maintain availability and consistency. See Table 2 for examples of workload priority aligned with data reliability targets.
Table 2. Workload priority with example data reliability configurations.
| Workload priority | Uptime SLO | Data replication | Data backups | Example scenario |
|---|---|---|---|---|
| High | 99.99% | Synchronous data replication across regions and across availability zones | High-frequency, cross-region backups. Frequency should support RTO and RPO. | |
Manage data backups. Backups are for disaster recovery (service failure), data recovery (deletion or corruption), and incident response (security). Backups must support your RTO and RPO requirements for each workload. Choose backup solutions that align with your RTO and RPO goals. Prefer Azure’s built-in solutions, such as Azure Cosmos DB and Azure SQL Database native backups. For other cases, including on-premises data, use Azure Backup. For more information, see Backup. For a simple check of backup frequency and restore time against RPO and RTO, see the sketch after this list.
Design workload data reliability. For workload data reliability design, see the Well-Architected Framework Data partitioning guide and Azure service guides (start with the Reliability section).
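As an illustration of aligning a backup configuration with recovery objectives, this sketch checks an assumed backup interval and restore time against hypothetical RPO and RTO targets. All values are examples, not recommendations:

```python
"""Check a backup plan against recovery objectives (all values are assumed examples)."""

from dataclasses import dataclass

@dataclass
class BackupPlan:
    backup_interval_minutes: int   # how often a recovery point is created
    restore_time_minutes: int      # measured time to restore from the latest backup

def find_gaps(plan: BackupPlan, rpo_minutes: int, rto_minutes: int) -> list[str]:
    """Return a list of gaps; an empty list means the plan supports the objectives."""
    gaps = []
    if plan.backup_interval_minutes > rpo_minutes:
        gaps.append(
            f"Backup interval ({plan.backup_interval_minutes} min) exceeds RPO ({rpo_minutes} min): "
            "more data could be lost than the business tolerates."
        )
    if plan.restore_time_minutes > rto_minutes:
        gaps.append(
            f"Restore time ({plan.restore_time_minutes} min) exceeds RTO ({rto_minutes} min): "
            "recovery is slower than the downtime budget allows."
        )
    return gaps

if __name__ == "__main__":
    plan = BackupPlan(backup_interval_minutes=60, restore_time_minutes=20)  # assumed plan
    for gap in find_gaps(plan, rpo_minutes=15, rto_minutes=13):             # assumed objectives
        print(gap)
```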
Managing the reliability of your cloud resources often requires architecture redundancy (duplicate service instances) and an effective load-balancing strategy. See Table 3 for examples of architecture redundancy aligned with workload priority.
Table 3. Workload priority and architecture redundancy examples.

| Workload priority | Architecture redundancy | Load balancing |
|---|---|---|
| High | Multi-region and multiple availability zones in each region | Active-active |
Your approach must implement architecture redundancy to meet the reliability requirements of your workloads. Follow these steps:
Estimate the uptime of your architectures. For each workload, calculate the composite SLA. Only include services that could cause the workload to fail (critical path). Follow these steps:
Gather the Microsoft uptime SLAs for every service on the critical path of your workload.
If you have no independent critical paths, calculate the single-region composite SLA by multiplying the uptime percentages of each relevant service. If you have independent critical paths, move to step 3 before calculating.
When two Azure services provide independent critical paths, apply the independent critical paths formula to those services.
For multi-region applications, input the single-region composite SLA (N) into the multi-region uptime formula.
Compare your calculated uptime with your uptime SLO. Adjust service tiers or architecture redundancy if necessary.
| Use case | Formula | Variables | Example | Explanation |
|---|---|---|---|---|
| Single-region uptime estimate | N = S1 × S2 × S3 × … × Sn | N: Composite SLA of Azure services on a single-region critical path. S: SLA uptime percentage of each Azure service. n: Total number of Azure services on the critical path. | N = 99.99% (app) × 99.95% (database) × 99.9% (cache) | Simple workload with app (99.99%), database (99.95%), and cache (99.9%) in a single critical path. |
| Independent critical paths uptime estimate | S = 1 - ((1 - S1) × (1 - S2)) | S: Combined uptime of two services on independent critical paths. S1, S2: SLA uptime percentage of each service. | S = 1 - ((1 - 99.95%) × (1 - 99.9%)) | Two independent critical paths. Either database (99.95%) or cache (99.9%) can fail without downtime. |
| Multi-region uptime estimate | M = 1 - (1 - N)^R | M: Multi-region uptime estimate. N: Single-region composite SLA. R: Number of regions used. | If N = 99.95% and R = 2, then M = 1 - (1 - 99.95%)^2 = 99.999975% | Workload deployed in two regions. |
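The following Python sketch applies the three formulas above to the example figures from the table. The service SLAs and region count are illustrative values only:

```python
"""Composite SLA estimates for the formulas above (service SLAs and region count are example values)."""

def single_region_sla(service_slas: list[float]) -> float:
    """N = S1 × S2 × … × Sn for services on a single critical path (values as fractions, e.g. 0.9999)."""
    n = 1.0
    for sla in service_slas:
        n *= sla
    return n

def independent_paths_sla(sla_a: float, sla_b: float) -> float:
    """Combined uptime when either of two independent paths can serve requests."""
    return 1 - (1 - sla_a) * (1 - sla_b)

def multi_region_sla(single_region: float, regions: int) -> float:
    """M = 1 - (1 - N)^R for the same workload deployed in R regions."""
    return 1 - (1 - single_region) ** regions

if __name__ == "__main__":
    # Example: app (99.99%), database (99.95%), and cache (99.9%) on one critical path.
    n = single_region_sla([0.9999, 0.9995, 0.999])
    print(f"Single-region composite SLA: {n:.4%}")          # ~99.8401%

    # Example: database or cache can fail independently without downtime.
    redundant = independent_paths_sla(0.9995, 0.999)
    print(f"Independent paths estimate:  {redundant:.6%}")  # ~99.99995%

    # Example: deploy the single-region design (99.95%) into two regions.
    m = multi_region_sla(0.9995, regions=2)
    print(f"Two-region uptime estimate:  {m:.6%}")          # ~99.999975%
```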
Adjust service tiers. Before modifying architectures, evaluate whether different Azure service tiers (SKUs) can meet your reliability requirements. Different tiers of the same Azure service can have different uptime SLAs, as with Azure Managed Disks.
Add architecture redundancy. If your current uptime estimate falls short of your SLO, increase redundancy:
Use multiple availability zones. Configure your workloads to span multiple availability zones. The uptime benefit can be difficult to estimate because only a subset of services publish uptime SLAs that account for availability zones. Where such SLAs exist, use them in your uptime estimates. The following table lists some examples.
| Azure service type | Azure services with Availability Zone SLAs |
|---|---|
| Compute Platform | App Service, Azure Kubernetes Service, Virtual Machines |
| Datastore | Azure Service Bus, Azure Storage Accounts, Azure Cache for Redis, Azure Files Premium Tier |
| Database | Azure Cosmos DB, Azure SQL Database, Azure Database for MySQL, Azure Database for PostgreSQL, Azure Managed Instance for Apache Cassandra |
| Load Balancer | Application Gateway |
| Security | Azure Firewall |
Use multiple regions. Multiple regions are often necessary to meet uptime SLOs. Use global load balancers (Azure Front Door or Traffic Manager) for traffic distribution. Multi-region architectures require careful data consistency management.
Manage architecture redundancy. Decide how to use redundancy: as part of daily operations (active) or only in disaster recovery scenarios (passive). For examples, see Table 3.
Load balance across availability zones. Use all availability zones actively. Many Azure PaaS services manage load balancing across availability zones automatically. IaaS workloads must use an internal load balancer to load balance across availability zones.
Load balance across regions. Determine whether multi-region workloads should run active-active or active-passive based on reliability needs.
Manage service configurations. Consistently apply configurations across redundant instances of Azure resources so that the resources behave in the same way. Use infrastructure as code to maintain consistency. For more information, see Duplicate resource configuration. A generic drift-check sketch follows this list.
Design workload reliability. For workload reliability design, see the reliability guidance in the Well-Architected Framework.
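To illustrate the kind of consistency check that infrastructure as code supports, the following generic sketch compares the settings of two redundant instances and reports drift. The configuration fields and values are hypothetical; in practice, source them from your IaC templates or an export of deployed resources.

```python
"""Generic configuration drift check between redundant resource instances.

The configuration dictionaries are hypothetical examples; in practice, source them
from your infrastructure-as-code definitions or an export of the deployed resources.
"""

def find_drift(primary: dict, secondary: dict) -> dict[str, tuple]:
    """Return settings whose values differ or exist on only one instance."""
    drift = {}
    for key in sorted(primary.keys() | secondary.keys()):
        a, b = primary.get(key), secondary.get(key)
        if a != b:
            drift[key] = (a, b)
    return drift

if __name__ == "__main__":
    primary_region = {"sku": "P1v3", "instance_count": 3, "tls_min_version": "1.2"}    # assumed
    secondary_region = {"sku": "P1v3", "instance_count": 2, "tls_min_version": "1.2"}  # assumed

    for setting, (primary_value, secondary_value) in find_drift(primary_region, secondary_region).items():
        print(f"Drift in '{setting}': primary={primary_value}, secondary={secondary_value}")
```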
Recovering from a failure requires a clear strategy to restore services quickly and minimize disruption to maintain user satisfaction. Follow these steps:
Prepare for failures. Create separate recovery procedures for workloads based on high, medium, and low priorities. Data reliability, code and runtime reliability, and cloud resource reliability are the foundation of preparing for failure. Select other recovery tools to assist with business continuity preparation. For example, use Azure Site Recovery for on-premises and virtual-machine based server workloads.
Test and document recovery plan. Regularly test your failover and failback processes to confirm your workloads meet recovery time objectives (RTO) and recovery point objectives (RPO). Clearly document each step of the recovery plan for easy reference during incidents. Verify that recovery tools, such as Azure Site Recovery, consistently meet your specified RTO.
Detect failures. Adopt a proactive approach to identifying outages quickly, even if this method increases false positives. Prioritize customer experience by minimizing downtime and maintaining user trust.
Monitor failures. Monitor workloads to detect outages within one minute. Use Azure Service Health and Azure Resource Health, and configure Azure Monitor alerts to notify relevant teams. Integrate these alerts with Azure DevOps or IT Service Management (ITSM) tools.
Collect service level indicators (SLIs). Track performance by defining and gathering metrics that serve as SLIs. Ensure your teams use these metrics to measure workload performance against your service level objectives (SLOs). A simple SLI calculation appears in the sketch after this list.
Respond to failures. Align your recovery response to the workload priority. Implement failover procedures to reroute requests to redundant infrastructure and data replicas immediately. Once systems stabilize, resolve the root cause, synchronize data, and execute failback procedures. For more information, see Failover and failback.
Analyze failures. Identify the root cause of each issue and address it. Document any lessons learned and make the necessary changes.
Manage workload failures. For workload disaster recovery, see the Well-Architected Framework's disaster recovery guide and Azure service guides (start with the Reliability section).
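As an illustration of the SLI step above, the following sketch computes an availability SLI from request counts and compares it against an SLO and its error budget. The request counts and SLO are assumed values:

```python
"""Availability SLI and error budget calculation (request counts and SLO are assumed examples)."""

def availability_sli(successful_requests: int, total_requests: int) -> float:
    """Fraction of requests served successfully over the measurement window."""
    return successful_requests / total_requests if total_requests else 1.0

if __name__ == "__main__":
    slo = 0.9999                 # assumed availability SLO (99.99%)
    total = 2_000_000            # assumed total requests in the measurement window
    successful = 1_999_500       # assumed successful requests

    sli = availability_sli(successful, total)
    allowed_failures = (1 - slo) * total
    actual_failures = total - successful

    print(f"SLI:                   {sli:.4%}")
    print(f"SLO:                   {slo:.2%}")
    print(f"Error budget consumed: {actual_failures / allowed_failures:.0%}")
    print(f"Meets SLO:             {sli >= slo}")
```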
Use an iterative security process to identify and mitigate threats in your cloud environment.
Manage security operations
Manage your security controls to detect threats to your cloud estate. Follow these steps:
Standardize security tooling. Use standardized tools to detect threats, fix vulnerabilities, investigate issues, secure data, harden resources, and enforce compliance at scale. Refer to Azure security tools.
Baseline your environment. Document the normal state of your cloud estate. Monitor security and document network traffic patterns and user behaviors. Use Azure security baselines and Azure service guides to develop baseline configurations for services. This baseline makes it easier to detect anomalies and potential security weaknesses; the sketch after these steps illustrates a baseline-based anomaly check.
Apply security controls. Implement security measures, such as access controls, encryption, and multifactor authentication, to strengthen the environment and reduce the probability of compromise. For more information, see Manage security.
Assign security responsibilities. Designate responsibility for security monitoring across your cloud environment. Regular monitoring and comparisons to the baseline enable quick identification of incidents, such as unauthorized access or unusual data transfers. Regular updates and audits keep your security baseline effective against evolving threats.
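To show how a documented baseline supports anomaly detection, the following generic sketch flags observations that deviate sharply from baseline measurements. The metric (daily sign-ins) and the three-standard-deviation threshold are illustrative assumptions, not an Azure feature:

```python
"""Flag observations that deviate sharply from a documented baseline.

The metric (daily sign-ins) and the three-standard-deviation threshold are illustrative
assumptions; apply the same idea to network traffic or other baselined signals.
"""

from statistics import mean, stdev

def is_anomalous(baseline: list[float], observation: float, threshold_sigmas: float = 3.0) -> bool:
    """Return True when the observation falls outside the baseline's normal range."""
    mu, sigma = mean(baseline), stdev(baseline)
    if sigma == 0:
        return observation != mu
    return abs(observation - mu) > threshold_sigmas * sigma

if __name__ == "__main__":
    baseline_signins = [42, 38, 45, 40, 44, 39, 41]   # assumed daily sign-ins during normal operations
    todays_signins = 130                              # assumed observation to evaluate

    if is_anomalous(baseline_signins, todays_signins):
        print("Investigate: sign-in volume deviates from the documented baseline.")
    else:
        print("Within the normal range.")
```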
Adopt a process and tools to recover from security incidents, such as ransomware, denial of service, or threat actor intrusion. Follow these steps:
Prepare for incidents. Develop an incident response plan that clearly defines roles for investigation, mitigation, and communication. Regularly test the effectiveness of your plan. Evaluate and implement vulnerability management tools, threat detection systems, and infrastructure monitoring solutions. Reduce your attack surface through infrastructure hardening and create workload-specific recovery strategies. See Incident response overview and Incident response playbooks.
Respond to incidents. Immediately activate your incident response plan upon detecting an incident. Quickly start investigation and mitigation procedures. Activate your disaster recovery plan to restore affected systems, and clearly communicate incident details to your team.
Analyze security incidents. After each incident, review threat intelligence and update your incident response plan based on lessons learned and insights from public resources, such as the MITRE ATT&CK knowledge base. Evaluate the effectiveness of your vulnerability management and detection tools and refine strategies based on post-incident analysis.