The more highly available a system is, the less downtime you have to deal with.
Consider what each level of availability allows:
Five nines (99.999% availability) means about 5 minutes of downtime a year
Four nines (99.99% availability) means about 53 minutes of downtime a year
Three nines (99.9% availability) means just under 9 hours of downtime a year
Two nines (99% availability) means roughly 3.65 days of downtime a year
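The figures above follow from simple arithmetic: the allowed downtime is the unavailable fraction multiplied by the length of a year. Here is a minimal sketch of that calculation, assuming a 365-day year; the function name `downtime_minutes_per_year` is only illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the yearly downtime (in minutes) allowed at a given availability."""
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * MINUTES_PER_YEAR


if __name__ == "__main__":
    # Print the downtime budget for five, four, three, and two nines.
    for pct in (99.999, 99.99, 99.9, 99.0):
        minutes = downtime_minutes_per_year(pct)
        print(f"{pct}% -> {minutes:.1f} minutes/year ({minutes / 60:.2f} hours)")
```

The same function can be used to turn any availability target you are considering into a concrete yearly budget before you commit to it.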
So why are we concerned about downtime? The more downtime you have, the fewer customers you attract, since most systems and applications are expected to be accessible 24/7, round the clock.
It might seem that these budgets leave ample downtime to work with, but believe me: if you don't design the system well, with the right practices and processes in place, even two nines of availability is hard to achieve. You have to work both hard and smart to manage it effectively.
Some tips for using the available downtime effectively for your systems or platforms:
- Eliminate single points of failure, especially when dealing with third-party API endpoints or dependent systems that are outside your control, by using suitable retry frameworks and circuit breaker patterns
- Set up observability and monitoring, with clear alerts and notifications in place
- Have efficient CI/CD pipelines with the right checks at each stage of your configured environments, to stop bad code (faulty code, including code with performance issues) from being promoted
- Have well-defined strategies for deploying and propagating code changes, such as:
  - Blue-green deployments (two identical production environments, one blue and one green, so traffic can be switched from one to the other)
  - Canary deployments (routing a small share of traffic to the new changes and slowly increasing it)
- Encourage A/B testing in your product feature releases when you are unsure about some of the features; promote canary releases as well
- Lastly, maintain a diligent process of imagining every way your system can fail, using scenarios drawn from your proactive resiliency testing, so that you have the right tools to troubleshoot, tune, and respond to unexpected hardware or network failures
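To make the first tip more concrete, here is a minimal sketch of a retry helper combined with a circuit breaker around a dependency you do not control. It is an illustration under assumptions, not a production framework: the names `CircuitBreaker`, `CircuitOpenError`, and `call_with_retry`, and the default thresholds and delays, are all hypothetical choices made for this example.

```python
import random
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and calls are rejected without being tried."""


class CircuitBreaker:
    """Stops calling a dependency after repeated consecutive failures."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # failures before opening
        self.reset_timeout = reset_timeout          # seconds to stay open
        self.clock = clock                          # injectable for testing
        self.failures = 0
        self.opened_at = None                       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("dependency marked unhealthy; failing fast")
            # Timeout elapsed: fall through and allow one trial call (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()       # open (or re-open) the circuit
            raise
        # Success: close the circuit and reset the failure count.
        self.failures = 0
        self.opened_at = None
        return result


def call_with_retry(func, attempts=3, base_delay=0.2, sleep=time.sleep):
    """Retry func with exponential backoff and jitter; never retry an open circuit."""
    for attempt in range(attempts):
        try:
            return func()
        except CircuitOpenError:
            raise                                   # the breaker has already given up
        except Exception:
            if attempt == attempts - 1:
                raise                               # out of attempts
            sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.5))
```

A call to a third-party endpoint could then be wrapped as `call_with_retry(lambda: breaker.call(fetch_quote))`, where `breaker` is a shared `CircuitBreaker` instance and `fetch_quote` stands in for whatever function talks to the external system. The retry absorbs brief glitches, while the breaker keeps a prolonged outage in the dependency from tying up your own system.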