99.9% Uptime & How to Achieve It With Microsoft Azure Hosting

Every minute your website is running, it is generating potential revenue, which makes downtime one of the top reasons behind lost profits. A one-hour outage cost Amazon an estimated $34 million. Meta's outage in 2024 cost nearly $100 million in revenue. And a 20-minute crash during Singles' Day cost Alibaba billions. Fortunately, modern-day cloud solutions make preventing downtime, or rather, ensuring uptime, not only possible, but more affordable and approachable than ever.

When choosing a cloud provider, uptime is one of the most critical metrics for ensuring your applications and services are always accessible to users. We here at Liquid prefer Microsoft Azure, a leading cloud service provider, which advertises a 99.9% uptime guarantee for many of its services. Reach out to us for help in achieving maximum uptime and minimizing downtime, and/or read on to explore what this guarantee really means, why it doesn’t promise 100% availability, and how to achieve it (no, it's not automatic!) for your application(s).

What Does 99.9% Uptime Mean?

Azure’s 99.9% uptime guarantee promises that services will be available to users at least 99.9% of the time within a given month. This level of availability is part of Azure's Service Level Agreement (SLA) for several core services, including virtual machines, storage, and databases.

To translate this, 99.9% uptime means the service can be down for about:

43.8 minutes per month or
8.77 hours per year

This calculation can help developers and organizations set realistic expectations about service reliability and better understand the services' downtime tolerance.
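If you want to run the same arithmetic for other SLA tiers, here is a minimal Python sketch. The function name `downtime_budget` is ours, and we assume an average year of 365.25 days split into 12 equal months, which is how the figures above work out.

```python
# Illustrative sketch: convert an uptime percentage into a downtime budget.
# Assumption: an average year of 365.25 days, divided into 12 equal months.
HOURS_PER_YEAR = 365.25 * 24


def downtime_budget(uptime_percent: float) -> dict:
    """Return the allowed downtime for a given uptime percentage."""
    down_fraction = 1 - uptime_percent / 100
    hours_per_year = HOURS_PER_YEAR * down_fraction
    minutes_per_month = hours_per_year * 60 / 12
    return {
        "minutes_per_month": round(minutes_per_month, 1),
        "hours_per_year": round(hours_per_year, 2),
    }


if __name__ == "__main__":
    for tier in (99.0, 99.9, 99.95, 99.99):
        print(tier, downtime_budget(tier))
```

Each extra "nine" cuts the downtime budget by a factor of ten, which is why the jump from 99.9% to 99.99% matters so much more than it looks on paper.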

Why Isn’t It 100%?

A common misconception in this era is to treat "99.9%" as a glass half empty. In reality, no service provider can offer 100% uptime, for several reasons:

Maintenance and Updates: Cloud providers like Azure must regularly maintain and update their infrastructure. Some of this maintenance, though scheduled for off-peak hours, can lead to short periods of downtime.
Hardware Failures: Despite best efforts, occasional hardware failures are inevitable in any data center. Azure uses redundant systems to minimize the impact, but certain failures can affect availability temporarily.
Software Bugs and Patching: Occasionally, bugs or vulnerabilities arise that require immediate attention. Even with automated testing and thorough QA processes, some issues only appear in production and require patching that may cause brief downtimes.
Network Issues: Connectivity can be interrupted by factors beyond Azure’s control, such as Internet Service Provider (ISP) problems or large-scale network incidents.

These are all natural byproducts of running extensive cloud infrastructure. Azure minimizes these factors using redundancy, fault tolerance, and rapid response teams, but offering 100% uptime is virtually impossible in real-world conditions. Does this mean downtime is inevitable? Absolutely not! The 99.9% figure is Azure's (refreshingly) transparent way of setting realistic expectations and, more importantly, of showing just how good its guarantee actually is.

How to Achieve Maximum Availability in Azure

Azure offers a 99.9% uptime SLA, but it only applies when you follow Microsoft's recommended architecture and use the related services. Everyone has a budget, and each service adds another layer of protection against a different cause of downtime. Here are some best practices, working from the inside out, to help paint a picture of the scale at which uptime can be protected:

Leverage Azure Site Recovery and Backup Solutions

Always back up your data. How many times have you heard this before? Azure's Site Recovery and Backup services offer disaster recovery solutions that help keep your application operational even during major incidents. Site Recovery replicates applications and workloads to a secondary location, allowing for quick failover if primary services go down. These services can be set up so seamlessly that creating and restoring backups takes only a few clicks.

Monitor with Azure Monitor and Application Insights

Applications may log critical information about errors and bad transactions, but who is checking those logs? Azure Monitor and Application Insights allow you to set up automatic alerts, track metrics, and even predict potential issues by proactively monitoring not only your application's hardware, but also its custom data output (such as logs and transactions). This real-time insight can help you respond quickly and prevent issues from escalating into downtime events.

Use Load Balancers and Auto-Scaling

The traffic your application receives can fluctuate heavily. Have a big day you're promoting? A sale? A product launch? Your application may have been provisioned the resources it needs to handle your average daily traffic, but a big day can see those numbers multiply. Azure offers automatic scaling and load balancing to compensate for these scenarios.

For load balancing, Azure's Load Balancer service is a simple, cost-effective solution for distributing traffic based on IP and port. Alternatively, Azure's Application Gateway service is a more feature-rich, Layer 7 load balancer designed specifically for web applications. It offers advanced traffic routing, SSL offloading, and built-in Web Application Firewall capabilities (described below) for enhanced security and traffic management.

All in all, each of these tools helps your application handle traffic spikes by dynamically adjusting resources to match demand. By doing so, they reduce the risk of an outage caused by resource saturation, helping you stay within your uptime SLA even during high-demand periods.

Protect Against Malicious Attacks With a Web Application Firewall (WAF)

How many times have you heard the dreaded word "DDoS" before? DDoS stands for "Distributed Denial of Service", an attack in which a network of compromised machines, or "bots", floods a website with traffic in a very short period of time. A Web Application Firewall, or WAF, helps protect against DDoS attacks by analyzing request patterns and rate limiting suspected attackers. It also protects against a host of other common threats that can be identified by inspecting each incoming web request to your application. Think of a WAF as the doorway to your website, which also serves as the first line of defense. By identifying attempted attacks at the door, it can block them before your application spends a single nanosecond on them.
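To make "rate limiting" a little more concrete, here is a small Python sketch of one common approach, a sliding-window limiter keyed by client IP. This is purely illustrative and is not how Azure WAF is implemented internally; the class name, the limit of 3 requests, and the 10-second window are all hypothetical values we picked for the example.

```python
from collections import defaultdict, deque


class SlidingWindowRateLimiter:
    """Allow at most `max_requests` per client within `window_seconds`."""

    def __init__(self, max_requests: int, window_seconds: float):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        # Maps each client IP to the timestamps of its recent allowed requests.
        self._hits = defaultdict(deque)

    def allow(self, client_ip: str, now: float) -> bool:
        hits = self._hits[client_ip]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # Over the limit: block before the app sees it.
        hits.append(now)
        return True


if __name__ == "__main__":
    limiter = SlidingWindowRateLimiter(max_requests=3, window_seconds=10.0)
    # Hypothetical request times (in seconds) from a single client.
    for t in (0.0, 1.0, 2.0, 3.0, 10.0):
        print(t, limiter.allow("203.0.113.7", now=t))
```

The timestamp is passed in explicitly so the behavior is easy to reason about; a real deployment would rely on the managed WAF rules rather than hand-rolled code like this.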

The best part is that the Azure WAF seamlessly integrates with the Azure Application Gateway service, as well as Azure Front Door, which acts as a global load balancer and can route traffic to different regions. Speaking of which...

Use Availability Zones and Geographical Redundancy to Overcome Unavoidable Risks

Availability Zones (AZs) are physically separate data centers within a region. Each has its own dedicated infrastructure, from power to cooling to networking. By deploying your resources across multiple Availability Zones, your application can stay up even when one zone is hit by something as common as a hardware failure.

On a similar note, for critical applications, especially those that offer national or multi-national services, consider deploying resources across multiple Azure regions via geo-redundancy. This not only provides protection against regional outages but also enables disaster recovery capabilities. By setting up Active-Active or Active-Passive configurations, you can ensure that if an entire region experiences downtime, your application remains available in another.
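Why does redundancy help so much? A rough back-of-the-envelope model: if each copy of your application is available some fraction of the time, and failures are independent (a simplifying assumption that real outages do not always honor), then the application is only down when every copy is down at once. The Python sketch below illustrates that; the function name and the 99.9% input are ours, not Azure-published figures for any specific configuration.

```python
def redundant_availability(single_availability: float, copies: int) -> float:
    """Availability of `copies` redundant deployments.

    Assumes independent failures: the system is down only when
    every copy is down at the same time.
    """
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return 1 - (1 - single_availability) ** copies


if __name__ == "__main__":
    # Hypothetical: each zone or region is available 99.9% of the time.
    for n in (1, 2, 3):
        print(n, f"{redundant_availability(0.999, n):.7f}")
```

Under this model, a second copy turns a 1-in-1,000 chance of being down into roughly 1-in-1,000,000. Real-world gains are smaller because failures can be correlated and failover itself takes time, but the direction of the effect is the reason zones and regions are worth the cost.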

Planning for Downtime with SLAs in Mind

Microsoft Azure’s SLA covers only the infrastructure and core services it manages; ensuring high availability for your applications on this infrastructure requires careful planning. Here's how to approach it:

Understand Service SLAs: Each Azure service has its own SLA, so know the SLA tiers for any services you're using and design your architecture accordingly.
Account for Business Requirements: Assess how much downtime is tolerable for your business and, based on that, make use of Azure’s various high-availability features.
Consider Premium Services: Azure offers higher SLA guarantees for certain premium services or configurations, which may be beneficial for mission-critical applications.
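One way to put these three points together is to estimate a composite availability for your architecture and compare it with what your business can tolerate. When a request has to pass through several services in a row, the overall availability is roughly the product of the individual ones, so the chain is always a little weaker than its weakest link. The Python sketch below shows the idea; the service names and SLA values are placeholders we made up, so substitute the figures from the actual SLA documents for the services you use.

```python
import math


def composite_availability(slas: dict) -> float:
    """Availability of services that must all be up (a serial chain).

    Assumes independent failures, so the individual values multiply.
    """
    return math.prod(slas.values())


def meets_target(slas: dict, target: float) -> bool:
    """Check the composite figure against a business availability target."""
    return composite_availability(slas) >= target


if __name__ == "__main__":
    # Placeholder values for illustration only, not real Azure SLA figures.
    stack = {"gateway": 0.9995, "web_app": 0.9995, "database": 0.9999}
    print(f"composite: {composite_availability(stack):.6f}")
    print("meets 99.9% target:", meets_target(stack, 0.999))
```

If the composite figure falls short of your target, that is the signal to add redundancy to the weakest service or to look at a higher SLA tier for it.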

The Conclusion

Remember those devastating, downright scary numbers we opened this post with? Surprise, surprise: those came from companies that do have downtime prevention services in place, and those losses occurred within their equivalent of that 0.1% of downtime. Throughout this post you should have come to recognize that uptime is all about limiting the maximum potential loss due to downtime. These companies may have reported what was lost, but think about what was saved in the 99.9% they achieved thanks to their usage of availability technologies like Microsoft Azure.

The beauty of Microsoft Azure's services is that they make high availability more accessible than ever. As we covered in this post, each service works hand in hand with the others to provide a robust, highly available, and performant platform to host your application. Each is a layer that, depending on your business needs and budget, offers protection in a specific area, and by combining them you can reach that golden bar known as "99.9% uptime".

If you’re interested in learning more about how Liquid can help you achieve 99.9% uptime for your application, or the many other technology services we offer, hit us up!