availability

csc 510-001, (1877)
fall 2024, software engineering
Tim Menzies, timm@ieee.org, com sci, nc state


Availability

When you’re first starting out with deploying applications and making them accessible, it’s a good idea to start small. Many people begin by using their own personal computer as a server to host the application. With a sufficiently powerful computer, you can host a small application and even expose an API for limited use without much difficulty. However, if the computer hosting the application is broken, stolen, or otherwise taken offline, your application will no longer be accessible. At a hobby level, or even for small research tasks, such risks are tolerated. As customers or users become dependent on your application, service outages are tolerated less and less.

Availability is a measure of how often a call sent to your service will be successfully responded to, calculated by dividing the number of successful API calls from a customer by the total number of API calls they made.
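The definition above is a simple ratio. A minimal sketch, assuming you already have the two counts for one customer (the function name and the sample numbers are made up for illustration):

```python
def availability(successful_calls: int, total_calls: int) -> float:
    """Fraction of a customer's API calls that were answered successfully."""
    if total_calls == 0:
        # No traffic means nothing failed; treating this as 100% is a
        # convention chosen for this sketch, not a universal rule.
        return 1.0
    return successful_calls / total_calls


# Hypothetical week of traffic: 1,000,000 calls, 1,200 of which failed.
print(f"{availability(998_800, 1_000_000):.4%}")
```

In practice the counts would be taken over a fixed window (a week, a month) so the number can be compared against whatever period the customer contract uses.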

Suppose you form a company around your project from this course, and you’ve attracted a customer. They are willing to pay a weekly fee for a maximum rate of 100 API calls per minute, but have little tolerance for service outages. If you are confident you can keep your availability above their expectations, you would sign a Service Level Agreement (SLA) with them. The agreement guarantees that the customer will pay you for a contracted amount of time (e.g., 1 year) at a cost of $X per time period (e.g., weekly). However, if the percentage of successful API calls falls below your availability guarantee, the customer is entitled to a refund. This can be either a monetary refund or a number of free API calls covering the difference between the availability guarantee and the true availability of the service.
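The "free API calls" refund can be read in more than one way. One plausible reading, sketched here as an assumption rather than a standard contract term, is that you owe the customer as many free calls as you fell short of the guaranteed number of successes:

```python
def sla_credit_calls(guaranteed: float, successful: int, total: int) -> int:
    """Free API calls owed under one reading of the refund rule.

    Assumption: the credit is the gap between the successes the guarantee
    promised (guaranteed * total) and the successes actually delivered.
    """
    promised = guaranteed * total
    # round() absorbs tiny floating-point error in the multiplication.
    return max(0, round(promised - successful))


# Hypothetical period: 99.9% guaranteed, 100,000 calls, 99,500 succeeded.
print(sla_credit_calls(0.999, 99_500, 100_000))  # 400 calls owed
# Guarantee met, so nothing is owed.
print(sla_credit_calls(0.999, 99_950, 100_000))  # 0
```

A real SLA would spell out the measurement window and the exact credit formula; the point here is only that the refund is computed from the same counts used to measure availability.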

Question: For your CSC 510 project:

Are you able to calculate your service availability?
- What would you have to change about your project in order to start calculating your availability?
  - Do you log errors?
  - Do you define error types?
  - What are the API calls / customer interactions that need to be tracked?
  - Do you have a database set up to consume the availability data?
  - How would you extract the results from your availability database in order to do analysis?
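As a starting point for these questions, here is a minimal sketch of logging every API call to a database and querying availability from it. The table name, column names, and customer name are hypothetical, and the sketch assumes that only 5xx responses count as failures:

```python
import sqlite3

# In-memory database for illustration; a real service would use a persistent store.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE api_calls ("
    " customer TEXT, endpoint TEXT, status INTEGER, called_at TEXT)"
)


def log_call(customer: str, endpoint: str, status: int, called_at: str) -> None:
    """Record one API call and the HTTP status that was returned."""
    conn.execute(
        "INSERT INTO api_calls VALUES (?, ?, ?, ?)",
        (customer, endpoint, status, called_at),
    )


def availability_for(customer: str) -> float:
    """Successful calls divided by total calls for one customer."""
    total, failed = conn.execute(
        "SELECT COUNT(*),"
        " COALESCE(SUM(CASE WHEN status >= 500 THEN 1 ELSE 0 END), 0)"
        " FROM api_calls WHERE customer = ?",
        (customer,),
    ).fetchone()
    return 1.0 if total == 0 else (total - failed) / total


log_call("acme", "/predict", 200, "2024-09-01T10:00:00")
log_call("acme", "/predict", 500, "2024-09-01T10:00:01")  # server fault: counts as failed
log_call("acme", "/predict", 400, "2024-09-01T10:00:02")  # customer error: not a failure
log_call("acme", "/predict", 200, "2024-09-01T10:00:03")
print(availability_for("acme"))  # 0.75
```

Keeping the raw status code and timestamp, rather than just a success flag, leaves room to change the definition of "failed" later and to slice availability by endpoint or time window.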

Depending on the industry your software is targeting, customers will have wildly different thresholds they are willing to agree to. Cloud providers such as Amazon or Microsoft guarantee a minimum 99.9% success rate on API calls to request virtual machines and other such cloud operations, otherwise known as “three nines” availability. The chart below shows how much downtime is allowed based on your availability; for example, three nines leaves room for roughly 8.8 hours of downtime per year.
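The downtime budget follows directly from the availability figure: it is the unavailable fraction multiplied by the length of the period. A rough sketch, assuming a 365-day year and treating availability as the fraction of time the service is up (the function name is illustrative):

```python
from datetime import timedelta


def allowed_downtime(availability: float, period: timedelta) -> timedelta:
    """Downtime budget for a period at a given availability (0.0 to 1.0)."""
    return period * (1.0 - availability)


year = timedelta(days=365)
for target in (0.9, 0.99, 0.999, 0.9999, 0.99999):
    budget = allowed_downtime(target, year)
    hours = budget.total_seconds() / 3600
    print(f"{target:.3%} -> {hours:8.2f} hours of downtime per year")
```

Each additional nine cuts the budget by a factor of ten, which is why the later nines require structural changes rather than incremental fixes.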

Improving Availability

Using your personal computer as a server to host a web-facing application results in low service availability. How do we improve the availability of our service?

Improving Hardware: The simplest and often most effective way to quickly improve availability is to deploy your service on dedicated server hardware. Modern server hardware is designed to provide high bandwidth and is much less prone to random faults than a personal computer. Furthermore, deploying your service to a server maintained by a cloud provider like AWS, Azure, or Google Cloud means the server is backed by the provider’s SLA guarantees, typically around 99.9% uptime.

Improving Software: A majority of failed API calls arise from faults within the deployed software, rather than from hardware. Bugs and other unexpected behavior often result in an error like HTTP 500 Internal Server Error being sent back to the requester. When calculating the availability of a service, only internal server errors are counted as failed API calls, not customer errors where the correct behavior is to return an error like HTTP 400 Bad Request. By fixing internal server errors outright or, more often than not, by correctly categorizing errors that were really customer errors, the availability of the service will increase dramatically.
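To make the distinction concrete, here is a small sketch that counts only 5xx responses as failures. The helper names and the sample status codes are illustrative:

```python
def is_server_failure(status: int) -> bool:
    """Only 5xx responses count against availability; 4xx are customer errors."""
    return 500 <= status <= 599


def availability_from_statuses(statuses: list[int]) -> float:
    """Share of responses that were not internal server errors."""
    if not statuses:
        return 1.0  # convention for this sketch: no traffic, no failures
    failures = sum(is_server_failure(s) for s in statuses)
    return (len(statuses) - failures) / len(statuses)


# A malformed request wrongly answered with 500 drags availability down...
print(availability_from_statuses([200, 200, 400, 500]))  # 0.75
# ...while correctly answering it with 400 does not.
print(availability_from_statuses([200, 200, 400, 400]))  # 1.0
```

Recategorizing only helps when the error really was the customer’s fault; relabeling genuine server faults as 4xx would make the metric meaningless.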

Software and hardware improvements have their limits. Once you’ve fixed all the bugs in your software and deployed onto highly reliable, technician-maintained servers, you can expect service availability of around 99%.

Question: What required downtime does your CSC510 project have in order to keep it functioning? Can we make the required downtime NOT required via clever engineering?

The main blocker for improving your availability will be required downtime, like software and firmware updates. Imagine you find a large bug in your software, and it needs immediate fixing. If your service is deployed on a single server, you’ll be forced to make the service unavailable in order to deploy a newer copy of the code. The same is true for firmware updates. To achieve higher availability, then, we need multiple copies of our software running in parallel, with traffic routed only to server instances where the software is running healthily. This is known as ensuring the redundancy of your service.

Redundancy

How is redundancy achieved, then? A useful tool for deploying multiple instances of a service, proxying requests to healthy servers, and updating services with zero downtime is Kubernetes. Kubernetes (otherwise known as K8s) lets developers provide a manifest describing how they’d like their service deployed and which Docker repository the containerized service lives in. K8s interprets the manifest, deploys your service to the hardware you specify, and creates as many parallel copies on that hardware as the manifest requests. K8s supports no-downtime updates by taking down one copy of your service at a time and redirecting traffic to the healthy copies while the downed copy has the new code deployed to it. Once the new copy is back online, traffic begins to flow back to it. This process is repeated until all copies of your service are updated to the latest version. By leveraging K8s to perform no-downtime code updates, your service can reach around 99.9% availability.
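The one-copy-at-a-time update described above can be illustrated with a toy simulation. This is not the Kubernetes API, just a sketch of the idea with made-up class and function names:

```python
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    version: str
    healthy: bool = True


def route(replicas: list[Replica]) -> list[Replica]:
    """Traffic is only sent to replicas that are currently healthy."""
    return [r for r in replicas if r.healthy]


def rolling_update(replicas: list[Replica], new_version: str) -> int:
    """Update one replica at a time.

    Returns the smallest number of healthy replicas seen during the update.
    """
    min_healthy = len(route(replicas))
    for replica in replicas:
        replica.healthy = False           # taken out of rotation
        min_healthy = min(min_healthy, len(route(replicas)))
        replica.version = new_version     # new code deployed to the downed copy
        replica.healthy = True            # back in rotation
    return min_healthy


replicas = [Replica(f"copy-{i}", "v1") for i in range(4)]
print(rolling_update(replicas, "v2"))  # 3: at least three copies served traffic throughout
print({r.version for r in replicas})   # {'v2'}
```

With a single copy the same loop would report zero healthy replicas during the update, which is exactly the forced downtime that redundancy is meant to remove.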

To further increase availability, you can use K8s to manage deployments across server instances, with the K8s control layer running on every server so that K8s can always execute commands as long as a single copy of your service is running. Additionally, by writing your service as many small individual applications, you can leverage K8s to direct traffic between the applications and gain more redundancy than a monolithic application would allow. With all of these optimizations (and more context-specific optimizations), your service will begin to achieve close to 99.99% availability.

To achieve 99.999% availability, you must deploy your service on different electrical grids and be able to route traffic to the healthy servers across large distances. This protects against a random electrical outage causing a complete service outage, but if implemented incorrectly it can cause the execution time of your operations to increase dramatically.

Here is an example: suppose your service is split into a front end and a back end. The front end handles all quick operations, and longer operations that require waiting are handled by the back end. To communicate from front end to back end, the front end places any operation that would take longer than a certain threshold into a queue, and the back end constantly polls the queue for operations to work on. If your queue is not implemented correctly, you could have back ends in different datacenters attempting to pull jobs from the other datacenter’s queue, which would incur a waiting time while the data is sent from one datacenter to the other.
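One way to avoid the cross-datacenter polling problem is to give each datacenter its own queue and make remote pulls an explicit opt-in. The sketch below uses hypothetical names and in-process queues standing in for a real message broker:

```python
from collections import deque


class JobQueues:
    """One queue per datacenter; back ends prefer their local queue."""

    def __init__(self, datacenters: list[str]) -> None:
        self._queues = {dc: deque() for dc in datacenters}

    def enqueue(self, datacenter: str, job: str) -> None:
        """Front end places a long-running job on its own datacenter's queue."""
        self._queues[datacenter].append(job)

    def poll(self, datacenter: str, allow_remote: bool = False):
        """Return (job, source_datacenter), or None if there is no work.

        Remote queues are only consulted when allow_remote is True, because
        pulling a job from another datacenter adds transfer latency.
        """
        local = self._queues[datacenter]
        if local:
            return local.popleft(), datacenter
        if allow_remote:
            for source, queue in self._queues.items():
                if queue:
                    return queue.popleft(), source
        return None


queues = JobQueues(["east", "west"])
queues.enqueue("east", "resize-image")
print(queues.poll("west"))                     # None: west does not reach across by default
print(queues.poll("west", allow_remote=True))  # ('resize-image', 'east')
```

Allowing remote pulls only as a fallback keeps the common path fast while still letting a healthy datacenter pick up work when the other one cannot.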

Question: For your CSC 510 project:

Is your application monolithic?
Is your application split into separate microservices?
- If so, how are they connected? What channels do they use to pass traffic from one to the other?

Question: Your service can no longer afford to improve its hardware, but has achieved 90% availability purely through software and hardware improvements, with no redundancy. Forced downtime for updates accounts for 25% of your downtime, 25% is caused by hardware malfunctions, 25% is caused by deployment errors, and the last 25% is caused by erroneous responses from your service to requesters (i.e., bugs).
Discuss or calculate the following:

Does ensuring redundancy of your service reduce the rate at which bugs cause erroneous responses to be sent to requesters?
Does ensuring redundancy of your service reduce the negative effect of random hardware malfunctions?
If you ensured redundancy by deploying 4 copies of your service with no-downtime updating, what would your new availability be?
- After eliminating updating downtime, which of the remaining three errors should you focus your engineering efforts on?