Developers (and sometimes even project managers) often treat a website as a “product”: once it is live in production, they stop thinking about it and move on to the next project. In fact, websites are more like services. Keeping them working as expected in the long run is a full-time job, and it is where the real challenge begins.
Keeping the site running usually falls to the IT department, although in recent years it has become more of a combined effort (this is the topic of a later chapter, Delivery).
Operations can be divided into three parts: Reliability, Availability, and Delivery:
- Reliability – the system will work correctly even in the case of data corruption or component failure
- Availability – the system can be used by many users at the same time
- Delivery (serviceability/maintainability) – the system can easily be repaired and improved
In this chapter, we will learn about Reliability.
Reliability is the ability of the system to work correctly even in the case of data corruption or component failure. It has become increasingly important over the last decade. For example, when Google had a 5-minute outage in 2013, it reportedly lost about half a million dollars, and global internet traffic reportedly dropped by 40%.
Based on 2015 disaster recovery statistics, one hour of downtime can cost as much as $8,000 for small companies, $74,000 for mid-size organizations, and $700,000 for large enterprises.
High-reliability sites usually have to be designed from the ground up by selecting the right architecture and employing a strict quality assurance regime. Once online, they have to be monitored constantly to ensure they stay reliable.
How do you know if you have a reliable service?
The first option is to wait until customers start to complain. But a much better alternative is to try to be proactive by testing your site before it goes into production.
Quality Assurance (QA) is the process of making sure the site functions as expected by performing tests. QA aims to find out whether the delivered system matches the required business functionality, usually before it goes live.
So many things can go wrong: a single bug in millions of lines of code, unexpected or corrupted data, or even server and service failures. Strong QA is therefore a major part of any site, and QA engineers are now in demand as much as developers themselves.
Tests can be manual or automated, and they can use a wide variety of techniques, such as unit testing, integration testing, system testing, acceptance testing, and even continuous testing.
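As a rough illustration of the automated end of that spectrum, here is a minimal unit test sketch using Python's standard `unittest` module. The `apply_discount` function and its rules are hypothetical, invented only to give the tests something to check; they are not taken from any real site.

```python
import unittest


def apply_discount(price: float, percent: float) -> float:
    """Hypothetical business rule: reduce a price by a percentage."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return round(price * (1 - percent / 100), 2)


class ApplyDiscountTest(unittest.TestCase):
    def test_regular_discount(self):
        # Expected business behavior: 25% off 200 is 150.
        self.assertEqual(apply_discount(200.0, 25), 150.0)

    def test_rejects_corrupted_input(self):
        # Unexpected data should fail loudly instead of producing a bad price.
        with self.assertRaises(ValueError):
            apply_discount(200.0, 150)


# Run the test case programmatically and keep the result object.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(ApplyDiscountTest)
result = unittest.TextTestRunner().run(suite)
```

In a real project, tests like these would normally live in their own files and run automatically on every change, which is roughly what continuous testing refers to.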
QA is a critical part of any successful reliable site.
High availability design
High Availability (HA) design is a way to keep the site's uptime as high as possible.
Site uptime is the percentage of time in a year during which the site is accessible. A decade ago, 99% was considered the industry standard, but today 99.95% is expected. This leaves only about 22 minutes per month for maintenance jobs, site upgrades, or any technical issue (such as network failure, hardware failure, or configuration problems). And as back-end infrastructure keeps getting more complicated (each site requires multiple servers and network components to work), keeping everything working smoothly is a very big challenge.
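The downtime budget behind those percentages is simple arithmetic. The sketch below assumes a 30-day month and a 365-day year; the function name is made up for this example.

```python
MINUTES_PER_DAY = 24 * 60


def allowed_downtime_minutes(uptime_percent: float, days: float) -> float:
    """Minutes of downtime permitted over `days` at the given uptime target."""
    return days * MINUTES_PER_DAY * (1 - uptime_percent / 100)


for target in (99.0, 99.95):
    per_month = allowed_downtime_minutes(target, days=30)
    per_year = allowed_downtime_minutes(target, days=365)
    print(f"{target}% uptime: {per_month:.1f} min/month, {per_year / 60:.1f} h/year")

# 99.0% uptime: 432.0 min/month, 87.6 h/year
# 99.95% uptime: 21.6 min/month, 4.4 h/year
```

Moving from 99% to 99.95% shrinks the monthly budget from more than seven hours to under 22 minutes.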
High availability design rests on two principles:
- Elimination of single points of failure – adding redundancy to the system so that the failure of a single component does not mean the failure of the entire system.
- Reliable crossover – in redundant systems, the crossover point itself tends to become a single point of failure, so reliable systems must provide for reliable crossover.
Another important aspect of High Availability is detecting failures as they occur. If the two principles above are observed, early detection lets the system recover quickly, so that a user may never see the failure.
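To make redundancy and crossover a little more concrete, here is a minimal sketch in which a request is tried against redundant replicas in order. The replicas are plain Python functions standing in for real servers, and every name here is hypothetical. In production the crossover is usually handled by a load balancer or similar infrastructure rather than by application code like this.

```python
from typing import Callable, Sequence


class AllReplicasFailed(Exception):
    """Raised when no redundant replica could serve the request."""


def call_with_failover(replicas: Sequence[Callable[[], str]]) -> str:
    """Try each redundant replica in order and return the first success."""
    errors = []
    for replica in replicas:
        try:
            return replica()
        except Exception as exc:  # a failed replica triggers the crossover
            errors.append(exc)
    raise AllReplicasFailed(f"{len(errors)} replicas failed")


# Hypothetical replicas: the primary is down, the secondary is healthy.
def primary() -> str:
    raise ConnectionError("primary is unreachable")


def secondary() -> str:
    return "response from secondary"


print(call_with_failover([primary, secondary]))  # response from secondary
```

The caller never sees the primary's failure, which is the user-facing effect described above. Note that `call_with_failover` itself is now the crossover point, so it has to be at least as reliable as the components it protects.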
Since new bugs and problems usually occur on a daily basis for most sites, we need to actively monitor every issue that arises, using tools for network monitoring, website monitoring, and log analytics such as Splunk or the newer Elastic. To be truly effective, monitoring also needs to be designed into the system from the beginning (and not bolted on after it is finished).
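Monitoring tools typically probe a site at regular intervals and raise an alert when the probes keep failing. The sketch below shows one common alerting rule: fire only after several consecutive failures. It is a simplified illustration, not how Splunk or Elastic work internally; the function name, the threshold of three, and the probe history are all assumptions.

```python
from typing import Iterable, List


def find_alerts(probe_results: Iterable[bool], failure_threshold: int = 3) -> List[int]:
    """Return the indexes of the probes at which an alert should fire.

    An alert fires once when `failure_threshold` consecutive probes fail,
    so a single dropped request does not page anyone.
    """
    alerts = []
    consecutive_failures = 0
    for index, healthy in enumerate(probe_results):
        if healthy:
            consecutive_failures = 0
            continue
        consecutive_failures += 1
        if consecutive_failures == failure_threshold:
            alerts.append(index)
    return alerts


# Hypothetical probe history: True = site answered, False = probe failed.
history = [True, False, True, False, False, False, False, True]
print(find_alerts(history))  # [5]
```

The isolated failure at index 1 is ignored, while the run of failures starting at index 3 triggers exactly one alert when it reaches the threshold.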
Disaster Recovery (DR) is the digital equivalent of emergency planning. It starts with the assumption that the entire system has failed completely, due to a natural or human-induced disaster (or just a critical human mistake). A DR plan describes a set of policies, tools, and procedures for bringing the system back online as soon as possible.
Creating a DR solution usually takes a lot of time and effort. Since every system is different, DR plans need to be tailored to each system, which requires a deep understanding of all its moving parts. A backup of every component of the system then needs to be kept up to date at all times. Sometimes those backup systems are required to replace the main system in a time of crisis, which makes keeping them in sync even more critical.
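One small piece of keeping backups in sync is checking that a backup copy actually matches the primary data. The sketch below compares SHA-256 checksums of two files. The file names are hypothetical and the data is created in a temporary directory just for the demonstration. Real DR tooling does far more than this, for example replication, retention, and restore testing.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large backups do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_is_in_sync(primary: Path, backup: Path) -> bool:
    """True only if the backup exists and is byte-for-byte identical."""
    return backup.exists() and sha256_of(primary) == sha256_of(backup)


with tempfile.TemporaryDirectory() as workdir:
    primary = Path(workdir, "orders.db")      # hypothetical primary data file
    backup = Path(workdir, "orders.db.bak")   # hypothetical backup copy
    primary.write_bytes(b"order-1\norder-2\n")
    backup.write_bytes(b"order-1\n")          # the backup is missing an order
    stale = backup_is_in_sync(primary, backup)
    backup.write_bytes(primary.read_bytes())  # refresh the backup
    fresh = backup_is_in_sync(primary, backup)

print(stale, fresh)  # False True
```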
Many companies offer disaster recovery solutions, such as Microsoft Azure Site Recovery, Zerto, Zetta, VMware Business Continuity, CloudBerry, and Acronis. But whatever the chosen solution, disaster recovery always requires a lot of time, effort, and money.
Fault tolerance design
Fault Tolerance (FT) design means that if a part of the system fails, the entire system does not shut down completely; instead, it continues to function in a limited capacity. This is different from High Availability: HA describes how to avoid component failure (by eliminating single points of failure). Fault tolerance starts with the assumption that our HA design has failed and one component (with all of its redundancy) has stopped working, and asks: can other parts of the system continue to function properly? Can the system still be used, even with limited functionality?
For example, if we have a site with both a members area and a public area, and we deploy membership code with a fatal flaw that prevents users from logging in, can they still use the public area? Can our site be used in “read-only” mode? Another example of Fault Tolerance can be found at Netflix, where a disruption in notifications does not affect video streaming.
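The members-area example can be sketched in a few lines. The code below is a simplified illustration with hypothetical function names: the membership component always fails, standing in for the faulty deployment, while the page renderer catches that failure and falls back to a read-only public page instead of failing entirely.

```python
from typing import Optional


class MembershipServiceDown(Exception):
    """Raised when the (hypothetical) membership component is unavailable."""


def load_member_profile(user_id: int) -> dict:
    # Stand-in for the faulty membership code: it always fails here.
    raise MembershipServiceDown("login is currently broken")


def load_public_articles() -> list:
    # Stand-in for the public area, which does not depend on membership.
    return ["Welcome", "Pricing", "Blog"]


def render_home_page(user_id: Optional[int]) -> dict:
    """Always serve the public area; add member data only if it is available."""
    page = {"articles": load_public_articles(), "profile": None, "read_only": False}
    if user_id is not None:
        try:
            page["profile"] = load_member_profile(user_id)
        except MembershipServiceDown:
            page["read_only"] = True  # degrade instead of failing the whole page
    return page


print(render_home_page(user_id=42))
```

The key design choice is that the failure is contained at the boundary of the failing component, so the rest of the page never has to know that membership is down.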
Achieving FT usually requires an HA design combined with a microservices architecture.
Fault Tolerance completes the Reliability picture:
- Quality Assurance – making sure the system functions 100% before shipping to production
- High Availability design & Monitoring – the system automatically tries to keep things running at 100%
- Fault tolerance design – the system can still work at less than 100%
- Disaster recovery – what happens when the system fails completely (requires manual intervention)
In the next chapter, we will continue to learn about the next aspect of operations: Availability.
Next part: Part 11 – Availability