High Availability >99.99%

High Availability Services

For the past six years, for better or worse, I have been in engineering leadership roles owning the most critical uptime and service availability at a few hyper-growth companies. The tech stacks have ranged from private data centers to AWS and GCP. It is not an easy job. But I have led teams to great results, often above 99.99% availability, with some quarterly results above 99.999%, and I have learned a few important lessons that I am glad to share with you in this article.

(I learned a lot from my network and LinkedIn. Time to contribute back a bit.)

If the architecture is not right, address that first

There are plenty of optimization tricks to play: rate limiting, concurrency tuning, local queuing, DB connection pool size limits, optimizing logging, tuning the JVM, Hibernate, disks, MySQL, hardware, and so on. However, it is important to recognize that this is not a scalable way to achieve high availability, at least not at the state of the art of our industry, no matter how much effort you put in. If the architecture is wrong, tuning only buys you time.
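To make "tactical" concrete, here is a minimal sketch of one such knob, a token-bucket rate limiter. It is illustrative only; the class and parameter names are mine, not from any system mentioned in this article.

```python
import time
import threading

class TokenBucket:
    """Minimal token-bucket rate limiter: a tactical knob, not an architecture fix."""

    def __init__(self, rate_per_sec: float, burst: int):
        self.rate = rate_per_sec          # tokens added per second
        self.capacity = burst             # maximum burst size
        self.tokens = float(burst)
        self.updated = time.monotonic()
        self.lock = threading.Lock()

    def allow(self) -> bool:
        """Return True if one request may proceed, False if it should be throttled."""
        with self.lock:
            now = time.monotonic()
            # Refill tokens based on elapsed time, capped at the bucket capacity.
            self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
            self.updated = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False

# Hypothetical usage: throttle to 100 requests/sec with bursts of up to 200.
limiter = TokenBucket(rate_per_sec=100, burst=200)
if not limiter.allow():
    pass  # e.g., return HTTP 429 to the caller
```

A limiter like this absorbs traffic spikes, but it does nothing about a single point of failure baked into the architecture itself.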

Cloud does not necessarily make things easier

Here are a few real-world examples that I have experienced.

  • Aurora DB replication failed.
  • A Redis cluster was overwhelmed by too many connections.
  • There was not enough capacity in a region for a new EC2 instance type, so new instances failed to launch per the auto scaling policy.
  • The CPU quota limit was hit for a region.

In summary, it is important to understand that the cloud does not come with a 100% guarantee. We should treat it as part of the architecture, and design and prepare for potential failures.

See https://aws.amazon.com/compute/sla/. 99% availability means 432 minutes, or more than 7 hours, of downtime in a month, roughly 14 minutes per day. AWS will refund 10% of the cost to you. But will customers who rely on your product and service 24/7 be okay with that? Probably not.
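To make the arithmetic concrete, here is a small back-of-the-envelope sketch (the function is my own, purely illustrative) that converts an availability percentage into a downtime budget:

```python
def downtime_budget(availability_pct: float, days: int = 30):
    """Convert an availability percentage into allowed downtime for a 30-day month."""
    minutes_in_period = days * 24 * 60
    down_minutes = minutes_in_period * (1 - availability_pct / 100)
    return down_minutes, down_minutes / days  # per month, per day

for pct in (99.0, 99.9, 99.99, 99.999):
    per_month, per_day = downtime_budget(pct)
    print(f"{pct}%: {per_month:.2f} min/month, {per_day:.2f} min/day")

# 99.0%: 432.00 min/month, 14.40 min/day
# 99.9%: 43.20 min/month, 1.44 min/day
# 99.99%: 4.32 min/month, 0.14 min/day
# 99.999%: 0.43 min/month, 0.01 min/day
```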

Microservices expose more failure points

In reality, a microservice system is complex in a different way than the monolithic alternative.

  • Even if each service is 99.99% available, the combined availability of the whole system is likely below 99.99% (see the quick calculation after this list).
  • Ownership of these services is often distributed across different teams. Who will own the overall uptime?
  • APM tools give insight into how these services are connected, but not into where they may fail or how the whole system will behave when those failures happen.
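A quick back-of-the-envelope calculation shows why, assuming for simplicity that a request touches every service in series and failures are independent:

```python
# If a request path crosses N services in series, each 99.99% available,
# end-to-end availability is roughly the product of the individual figures.
per_service = 0.9999
for n in (5, 20, 50):
    print(f"{n} services: {per_service ** n:.4%}")
# 5 services: 99.9500%
# 20 services: 99.8002%
# 50 services: 99.5012%
```

In other words, the more services sit on the critical path, the stronger each one, or its fallback, has to be.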

Human Errors

  • Ran a test script while mistakenly connected to the production environment and deleted the account database (a simple guard sketch follows this list).
  • Rolled out incompatible changes that forced a restart of dependent services, causing downtime.
  • Migrated to a new AWS service, hit failures, and was forced to roll back.
  • Facing service degradation, took the wrong on-call action and made things much worse.
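As one illustration of the first point, a destructive script can refuse to run unless the target environment is explicitly confirmed. This is a hypothetical sketch; the environment variable and check below are placeholders, not the tooling we actually used.

```python
import os
import sys

# Hypothetical guard for a destructive test script: refuse to run against
# anything that looks like production.
ALLOWED_ENVS = {"dev", "test", "staging"}

def assert_safe_target(db_host: str) -> None:
    env = os.environ.get("APP_ENV", "unknown")
    if env not in ALLOWED_ENVS or "prod" in db_host.lower():
        sys.exit(f"Refusing to run: env={env!r}, db_host={db_host!r} looks like production.")

assert_safe_target(os.environ.get("DB_HOST", ""))
# ... destructive test logic runs only past this point ...
```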

Path to high availability > 99.99%

It is achievable

My experience proves it is achievable.

Start with an architecture review

Analyze service dependencies, types of failures, recovery paths, recovery speed, and the impact of degradation on customers. Create a prioritized list of work items.

A few examples:

  • The Kafka persistence layer fails. Is there a standby cluster? If yes, how quickly can failover finish? Is there data loss in the meantime?
  • Will high traffic from one customer take down data ingestion for all other customers?
  • If the account DB is down, will data ingestion stop? Will the UI continue to work?
  • If one customer sends a very big query, scanning 4 years of data, will all customers' queries hit timeouts?
  • If the Redis cache layer fails, will all services fail catastrophically?

Do it pragmatically and consider ROI

Take the account DB scenario from the list above as an example. Here are the details:

When data arrives at the ingestion API, the apiKey is validated before the data is written into Kafka. ApiKeys are stored in PostgreSQL and cached locally in each ingestion server. When the cache expires periodically, the server reaches out to the account database to retrieve the valid apiKey for the given account.

When the account DB is down, the cached apiKeys in the ingestion servers will soon expire but cannot be refreshed. The ingestion service then begins to reject incoming data from the customers with those apiKeys.

The fix consists of two parts. First, if the account DB is offline, keep each apiKey in the cache even after its expiration. Second, write a cron job to back up the apiKeys to S3 files periodically. Why is this needed? The data ingestion service auto scales. While the first part helps the existing servers, it does not help newly added servers, which start with an empty cache and need to populate valid apiKeys from somewhere. The S3 backup covers that case: new servers can fall back to reading the latest apiKeys from S3 when the DB is offline.
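Here is a minimal sketch of the two-part fix, assuming a Python ingestion service. The table, column, and bucket names, and the use of psycopg2 and boto3, are illustrative assumptions, not the actual implementation.

```python
import json
import time
import boto3              # assumed AWS SDK for the S3 fallback
import psycopg2           # assumed PostgreSQL driver; the real stack may differ

# Hypothetical bucket/key written by the backup cron job below.
S3_BUCKET, S3_KEY = "example-apikey-backup", "apikeys/latest.json"

class ApiKeyCache:
    """Local apiKey cache that keeps serving stale entries when the account DB is down."""

    def __init__(self, dsn: str, ttl_seconds: int = 300):
        self.dsn = dsn
        self.ttl = ttl_seconds
        self.keys = {}                    # apiKey -> last refresh timestamp
        self.s3 = boto3.client("s3")

    def _refresh_from_db(self) -> None:
        conn = psycopg2.connect(self.dsn)
        try:
            with conn.cursor() as cur:
                cur.execute("SELECT api_key FROM account_api_keys WHERE active")
                now = time.time()
                self.keys = {row[0]: now for row in cur.fetchall()}
        finally:
            conn.close()

    def _refresh_from_s3(self) -> None:
        # Fallback for freshly launched (auto-scaled) servers with an empty cache:
        # read the periodic backup written by the cron job.
        body = self.s3.get_object(Bucket=S3_BUCKET, Key=S3_KEY)["Body"].read()
        now = time.time()
        self.keys = {k: now for k in json.loads(body)}

    def is_valid(self, api_key: str) -> bool:
        expired = not self.keys or all(time.time() - t > self.ttl for t in self.keys.values())
        if expired:
            try:
                self._refresh_from_db()
            except Exception:
                # Part 1: DB is unreachable, so keep serving the stale cache.
                if not self.keys:
                    # Part 2: brand-new server with nothing cached falls back to S3.
                    self._refresh_from_s3()
        return api_key in self.keys

def backup_apikeys_to_s3(dsn: str) -> None:
    """Cron job (part 2 of the fix): periodically dump active apiKeys to S3."""
    conn = psycopg2.connect(dsn)
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT api_key FROM account_api_keys WHERE active")
            keys = [row[0] for row in cur.fetchall()]
    finally:
        conn.close()
    boto3.client("s3").put_object(
        Bucket=S3_BUCKET, Key=S3_KEY, Body=json.dumps(keys).encode()
    )
```

The stale-cache fallback covers long-lived servers, and the cron backup plus the S3 read path covers freshly auto-scaled ones.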

Once we implemented the above at Company A, the uptime of data ingestion jumped to 99.999% in the next quarter. Later, when the team migrated from self-hosted PostgreSQL to AWS Aurora DB, Aurora failed twice due to ALTER TABLE replication failures. Data ingestion breezed through both events without any issue.

Test your system proactively

  • Stress the system with 5 to 10 times the regular peak traffic (a minimal load-generation sketch follows this list).
  • Intentionally take down some services and data stores, and verify the system's behavior against your assumptions.
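As a minimal illustration of the first point, a small script can replay synthetic traffic at a multiple of measured peak. The endpoint, payload, and numbers below are placeholders, and a dedicated load tool (k6, Locust, Gatling) works just as well.

```python
import concurrent.futures
import time
import requests  # assumed HTTP client for this sketch

TARGET = "https://ingest.example.com/v1/events"   # placeholder endpoint
PEAK_RPS = 1000                                   # measured regular peak (placeholder)
MULTIPLIER = 5                                    # stress at 5x peak

def send_one(i: int) -> int:
    """Send one synthetic event; return the HTTP status, or -1 on a connection error."""
    try:
        resp = requests.post(TARGET, json={"event": "synthetic", "seq": i}, timeout=5)
        return resp.status_code
    except requests.RequestException:
        return -1

def run_for(seconds: int) -> None:
    target_rps = PEAK_RPS * MULTIPLIER
    with concurrent.futures.ThreadPoolExecutor(max_workers=200) as pool:
        start, sent, futures = time.time(), 0, []
        while time.time() - start < seconds:
            # Pace submissions so the offered load tracks the target rate.
            expected = int((time.time() - start) * target_rps)
            while sent < expected:
                futures.append(pool.submit(send_one, sent))
                sent += 1
            time.sleep(0.01)
        codes = [f.result() for f in futures]
    errors = sum(1 for c in codes if c != 200)
    print(f"sent={len(codes)} errors={errors} error_rate={errors / max(len(codes), 1):.2%}")

# run_for(60)  # stress for a minute; watch error rate, latency, and autoscaling behavior
```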

It is quite common to find that some service is not configured for auto scaling, or that a new bottleneck, never identified before, surfaces.

Put these into the priority list and iterate again.

Standardize the rollout process for major changes

Failure mitigation and fast recovery

Prepare the team

Run fire drills with the team.

Do not be overconfident

When we onboard a new service whose code was not written by us, the natural assumption is that it should just work. In reality, it often does not.

One example: we recently adopted an open source analytics engine. It is designed for scalability and reliability. However, under load, its ZooKeeper dependency crashed a few times after filling up the disk, halting the whole system. Fortunately, this was detected in test and staging and never impacted production.

Conclusion

I hope you enjoyed reading this article, and I would love to hear your experiences and thoughts.

VP Engineering