High Availability >99.99%

Feb 20, 2021

For the past 6 years, for better or worse, I have been an engineering leader responsible for the most critical uptime and service availability at a few hyper-growth companies. The tech stacks ranged from private data centers to AWS and GCP. It is not an easy job. But I have led teams to great results, often above 99.99% availability and in some quarters above 99.999%, and I learned a few important lessons that I am glad to share with you in this article.

(I learned a lot from my network and LinkedIn. Time to contribute back a bit.)

If the architecture is not right, address that first

If the product is a big monolithic app running on one box, it is important to recognize that there is very little room to achieve high availability. It cannot scale out, and it can only scale up.

There are a lot of optimization tricks to play, such as rate limiting, concurrency, local queuing, limiting the DB connection pool size, optimizing logging, and tuning the JVM, Hibernate, disk, MySQL, hardware … However, it is important to recognize that, no matter how much effort is put in, this is not a scalable architecture for high availability, at least not by the state of the art of our industry.

Cloud does not necessarily make things easier

While a Cloud like AWS is much more stable than in its early days, and we rarely hear of a big outage from it now, there are still a lot of pitfalls when building a SaaS solution in a Cloud.

Here are a few real world examples that I experienced.

Aurora DB replication failed
REDIS cluster was overwhelmed by too many connections
Not enough capacity in a region for a new EC2 instance type, so new instances failed to launch under the auto scaling policy.
CPU quota limit hit for a region

In summary, it is important to understand that Cloud does not come with a 100% guarantee. We should treat it as part of the architecture, and design and prepare for potential failures.

See https://aws.amazon.com/compute/sla/. 99% availability means 432 minutes, or more than 7 hours, of downtime in a 30-day month, or about 15 minutes of downtime per day. AWS will refund 10% of the cost to you. But will customers who rely on your product and service 24/7 be okay with it? Probably not.
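The downtime arithmetic is easy to check with a few lines of Python. This is a minimal sketch: the function name `downtime_minutes` and the 30-day month are my own assumptions for illustration, not part of any SLA.

```python
def downtime_minutes(availability_pct: float, period_days: float = 30) -> float:
    """Allowed downtime, in minutes, for a given availability over a period."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


# Compare the downtime budget of a few common availability targets.
for pct in (99.0, 99.9, 99.99, 99.999):
    print(f"{pct}% -> {downtime_minutes(pct):.2f} minutes per 30 days")
```

At 99.99% the budget shrinks to roughly 4.3 minutes per 30 days, which is why a single slow manual recovery can consume a whole quarter's allowance.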

MicroServices expose more failure points

The architecture looks perfect. Logic is decoupled into many MicroServices. Scalable and highly available queue systems and data stores like Kafka, S3, and Druid are in place.

In reality, the system is complex in a different way than the monolithic alternative.

Each service is 99.99% available. The availability of the whole system combined is likely < 99.99%.
Ownership of these services is often distributed across different teams. Who will own the overall uptime?
APM tools give insight into how these systems are connected, but not into where they may fail or how the whole system will behave when they do.
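The first point can be made concrete with a simplified model. The sketch below assumes a request must pass through every service in a chain and that the services fail independently. Both are assumptions for illustration, and real systems with retries, redundancy, or correlated failures will differ. The function name `serial_availability` is hypothetical.

```python
def serial_availability(per_service: float, n_services: int) -> float:
    """Availability of a chain in which every service must be up.

    Assumes independent failures, so the probabilities multiply.
    """
    return per_service ** n_services


# How the combined availability drops as the chain gets longer.
for n in (1, 5, 10, 50):
    print(f"{n:>3} services at 99.99% each -> {serial_availability(0.9999, n):.4%}")
```

Under these assumptions, ten services at 99.99% each already combine to roughly 99.9%, a full "nine" lower than any single one of them.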

Human Errors

Due to the system complexity, human errors happen now and then. A few examples:

Ran a test script while mistakenly connected to the production environment and deleted the account database.
Rolled out incompatible changes that forced a restart of dependent services, causing downtime.
Migration to a new AWS service hit failures and had to be rolled back.
Facing service degradation, took the wrong on-call action and made things much worse.

Path to high availability > 99.99%

There is no one-size-fits-all solution. I offer a few thoughts based on my experience.

It is achievable

It may sound very scary or impractical to set an uptime goal > 99.99%, especially for a complex system composed of dozens if not hundreds of services, under heavy traffic for both write and read, running fully in Cloud, used by thousands or millions of customers.

My experience proves it is achievable.

Start with architecture review

Not all services are equally important. Start with the critical paths of tier 1 services that are most important to customers. Typically, there is a data write path and a data read/query path. For write paths, the #1 priority is to prevent loss of data once it has been accepted. For query paths, latency should stay within a range that customers find satisfactory.

Analyze service dependencies, types of failures, the recovery path and recovery speed, and the impact of degradation on customers. Create a prioritized list of work items.

A few examples:

The Kafka persistence layer fails. Is there a standby cluster? If yes, how quickly can the failover finish? Meanwhile, is there data loss?
Will high traffic of one customer take down the data ingestion for all other customers?
If the account DB is down, will data ingestion stop? Will UI continue to work?
If one customer sends a super big query, scanning 4 years’ data, will all customers’ queries hit timeout?
If REDIS cache layer fails, will all services fail catastrophically?
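For the question about one customer's traffic taking down ingestion for everyone else, one common mitigation is a per-customer rate limit in front of the shared path. The sketch below is a simple token bucket per customer. The class name, parameters, and limits are hypothetical, it is not thread-safe, and it is meant only to illustrate the idea, not to describe what any particular system used.

```python
import time
from collections import defaultdict


class PerCustomerRateLimiter:
    """One token bucket per customer, so a noisy tenant only throttles itself."""

    def __init__(self, rate_per_sec: float, burst: float, clock=time.monotonic):
        self.rate = rate_per_sec      # tokens added per second, per customer
        self.burst = burst            # maximum tokens a bucket can hold
        self.clock = clock            # injectable clock, useful for tests
        self._tokens = defaultdict(lambda: burst)  # new customers start full
        self._last_seen = {}

    def allow(self, customer_id: str, cost: float = 1.0) -> bool:
        now = self.clock()
        last = self._last_seen.get(customer_id, now)
        self._last_seen[customer_id] = now
        # Refill according to elapsed time, capped at the burst size.
        tokens = min(self.burst, self._tokens[customer_id] + (now - last) * self.rate)
        if tokens >= cost:
            self._tokens[customer_id] = tokens - cost
            return True
        self._tokens[customer_id] = tokens
        return False


limiter = PerCustomerRateLimiter(rate_per_sec=100, burst=200)
if not limiter.allow("customer-42"):
    print("reject or queue this request")  # e.g. respond with HTTP 429
```

Because each customer has its own bucket, exhausting one bucket has no effect on requests from other customers.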

Do it pragmatically and consider ROI

At company A, the change that boosted quarterly uptime from 98.1% to >99.99% took one engineer only 2 weeks to implement.

Here are the details:

When data arrives at the ingestion API, the apiKey is validated before the data is written into Kafka. ApiKeys are stored in PostgreSQL and cached locally on each ingestion server. When the cache periodically expires, the server queries the account database to retrieve the valid apiKeys for the given account.

When the account DB is down, the apiKeys cached on the ingestion servers soon expire but cannot be refreshed. The ingestion service then begins to reject incoming data from the customers with those apiKeys.

The fix consists of two parts. First, if the account DB is offline, keep the apiKeys in cache even after their expiration. Second, write a cron job to back up apiKeys to S3 files periodically. Why is this needed? The data ingestion service is auto scaling. While the first part helps the existing servers, it won’t work for newly added servers, as those start with an empty cache that must be populated with valid apiKeys from somewhere. The S3 file serves that purpose: new servers can fall back to reading the latest apiKeys from S3 when the DB is offline.
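The two parts of the fix can be sketched roughly as follows. This is an illustrative reconstruction, not the code that ran at Company A: the class `ApiKeyCache` and the two loader callables are hypothetical, with `load_from_db` standing in for the PostgreSQL query and `load_from_snapshot` for reading the S3 backup file.

```python
import time
from typing import Callable, Optional, Set


class ApiKeyCache:
    """Local apiKey cache that keeps working while the account DB is down."""

    def __init__(
        self,
        load_from_db: Callable[[], Set[str]],        # hypothetical: query PostgreSQL
        load_from_snapshot: Callable[[], Set[str]],  # hypothetical: read the S3 backup
        ttl_seconds: float = 300.0,
        clock: Callable[[], float] = time.monotonic,
    ):
        self._load_from_db = load_from_db
        self._load_from_snapshot = load_from_snapshot
        self._ttl = ttl_seconds
        self._clock = clock
        self._keys: Optional[Set[str]] = None  # None means "never loaded"
        self._loaded_at = 0.0

    def _refresh(self) -> None:
        try:
            self._keys = self._load_from_db()
        except Exception:  # in real code, catch the DB driver's specific errors
            if self._keys is None:
                # Part two: a newly launched server has nothing cached,
                # so it falls back to the periodic snapshot.
                self._keys = self._load_from_snapshot()
            # Part one: otherwise keep serving the expired entries.
        # The DB is retried only after another TTL has elapsed.
        self._loaded_at = self._clock()

    def is_valid(self, api_key: str) -> bool:
        expired = self._clock() - self._loaded_at >= self._ttl
        if self._keys is None or expired:
            self._refresh()
        return api_key in self._keys
```

The trade-off in this sketch is that a key revoked during a DB outage stays valid until the DB returns. For an ingestion path whose first priority is not rejecting customer data, that is usually the safer failure mode.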

Once we implemented the above at Company A, uptime of data ingestion jumped to 99.999% in the next quarter. Later, when the team migrated from self-hosted PostgreSQL to AWS Aurora DB, Aurora DB failed twice due to ALTER TABLE replication failures. The data ingestion breezed through the two events without any issue.

Test your system proactively

Rather than being confident in theory, test it for real.

Stress the system with 5 to 10 times the regular peak traffic.
Intentionally take down some services and data stores, and verify the system’s behavior against assumptions.

Quite often you will find that some service is not configured for auto scaling, or that a new bottleneck, never identified before, surfaces.

Put these into the priority list and iterate again.

Standardize the process of rollout of major changes

We are all human and we all make mistakes. That is why it is critical to put a good process in place whenever a change is going to be made to the main paths of availability. It can be a group review, thorough testing, an incremental rollout, a rollback plan, or another good practice.

Failure mitigation and fast recovery

No matter how hard we try, some failure will happen eventually, so it is important to prepare for it. Have tools that help mitigate the situation quickly, such as scaling up the service, rolling back a deployment, or blocking traffic from a malicious IP during a DDoS attack.

Prepare the team

Document the failure scenarios well: the corresponding actions in order, the communication protocol, who owns what, the approval process, and more.

Run fire drills with the team.

Do not be overconfident

We often use cloud services or open source services. My advice is to allocate time to study them thoroughly before they become part of your system: service limits, how they scale, how they persist data, how queries are distributed, any rate limiting, security, monitoring and alerting, and best practices.

When we onboard a new service where the code is not written by us, the natural assumption is that it should just work. In reality, it often does not.

One example: we recently used an open source analytics engine. It is designed for scalability and reliability. However, under load, its dependency ZooKeeper crashed a few times after filling up the disk, halting the whole system. Fortunately, this was detected in test and staging and never impacted production.
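A failure like this is cheap to watch for. The sketch below uses Python's standard `shutil.disk_usage` to flag a volume that is filling up. The function names, the 80% threshold, and the path are assumptions for illustration, and a real setup would more likely rely on the alerting of an existing monitoring system.

```python
import shutil


def disk_usage_fraction(path: str) -> float:
    """Fraction of the volume containing `path` that is in use (0.0 to 1.0)."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:  # guard against unusual virtual file systems
        return 0.0
    return usage.used / usage.total


def disk_is_filling_up(path: str, warn_at: float = 0.8) -> bool:
    """True when usage is at or above the warning threshold."""
    return disk_usage_fraction(path) >= warn_at


# Hypothetical usage: check the volume that holds a service's data directory.
if disk_is_filling_up(".", warn_at=0.8):
    print("WARNING: disk usage above 80%, investigate before it reaches 100%")
```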

Conclusion

High availability is critical to providing world-class SaaS to your customers 24/7. Despite many challenges, it is achievable through architecture analysis, engineering work and a few good practices.

I hope you enjoyed reading this article. I would love to hear your experiences and thoughts.