We all want to design for success, and we usually have a vision of how to get there, yet we give very little thought to ‘Design for Failure’. As an organization that deals with customers at all levels, ezDI thinks it’s wise to design for failure.
You may be wondering: why design for failure?
Well, being a pessimist can save you from catastrophe, especially when you’re dealing with sensitive data and designing architectures on the cloud. So be prepared and assume things will fail. In other words, design, implement and deploy for automated recovery from failure.
Being a pessimist means assuming the worst from every angle. Assume that your hardware will fail, that outages will occur, that some disaster will strike your application, that you will be overloaded by a spike in requests per second, or that your application software will fail for some reason.
Not a pleasant thought, I agree. But by being a pessimist, you end up thinking about recovery strategies and building scalable mechanisms to handle failure at the inception and design stage itself. The result is a dynamic system with a fault-tolerant architecture that is optimized for the cloud.
Wondering how to go about it?
You must build mechanisms to handle failure. Just like designing for hardware failures, you absolutely must also design for software failures.
- For starters, what will happen if something in your system fails?
- How will you recognize that failure?
- How will you replace or rebuild the component that failed?
- How will that failure affect your system?
Once you have pondered these questions, make a mind map or a table of your system’s strengths and weaknesses. This will help you understand your failure modes better and decide what action to take.
For example, here are some strategies that can help in the event of a failure:
- Have a coherent backup and restore strategy for your data and automate it
- Build process threads that resume on reboot
- Allow the state of the system to re-sync by reloading messages from queues
- Keep preconfigured and pre-optimized virtual images to support the two previous strategies (resuming on reboot and re-syncing from queues) on launch/boot
- Avoid in-memory sessions or stateful user context; move that state to data stores
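The last strategy in the list can be illustrated with a minimal Python sketch. It assumes a hypothetical `SessionStore` class and `add_to_cart` handler (neither comes from ezDI's systems), and it uses SQLite from the standard library purely as a stand-in for a shared external data store. The point is only that the handler keeps nothing in process memory between requests.

```python
import json
import sqlite3


class SessionStore:
    """Keeps user session state in a data store instead of process memory.

    SQLite is only a local stand-in here (an assumption for illustration);
    a real deployment would use a store shared by all instances.
    """

    def __init__(self, path):
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS sessions (id TEXT PRIMARY KEY, data TEXT)"
        )
        self._db.commit()

    def load(self, session_id):
        """Return the stored session dict, or an empty dict if none exists."""
        row = self._db.execute(
            "SELECT data FROM sessions WHERE id = ?", (session_id,)
        ).fetchone()
        return json.loads(row[0]) if row else {}

    def save(self, session_id, data):
        """Persist the session so any process can pick it up later."""
        self._db.execute(
            "INSERT OR REPLACE INTO sessions (id, data) VALUES (?, ?)",
            (session_id, json.dumps(data)),
        )
        self._db.commit()

    def close(self):
        self._db.close()


def add_to_cart(store, session_id, item):
    """Handle one request: read state, change it, write it back."""
    session = store.load(session_id)
    session.setdefault("cart", []).append(item)
    store.save(session_id, session)
    return session
```

Because every request reads and writes through the store, a process that is killed and relaunched (or a different instance altogether) sees the same session as the one it replaced.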
A good cloud architecture should be impervious to reboots and relaunches. For instance, a controller architecture built on a combination of Amazon SQS and Auto Scaling is very resilient to various types of failure.
For example, if the instance on which the controller thread was running dies, a replacement can be brought up and resume the previous state as if nothing had happened. This can be accomplished by creating a preconfigured Amazon Machine Image which, when launched, dequeues all the messages from the Amazon SQS queue and reads their states from Amazon ElastiCache on boot.
Designing with the pessimistic assumption that something in your system will fail prepares you for the day it actually does. It will also help you design operations-friendly applications that proactively measure and balance load dynamically. You might also be able to deal with the variance in network and disk performance that exists due to the multi-tenant nature of the cloud.
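The resume-after-crash behaviour described above can be sketched in Python without any AWS dependency. Everything named here is hypothetical: `InMemoryQueue` is a tiny stand-in for a durable queue with receive/delete semantics, a plain dict stands in for the external state store, and `run_controller` is an illustrative controller loop, not the actual implementation.

```python
class InMemoryQueue:
    """Tiny stand-in for a durable queue with receive/delete semantics."""

    def __init__(self, bodies):
        self._messages = {i: body for i, body in enumerate(bodies)}
        self._in_flight = set()

    def receive(self):
        """Return (receipt, body) for one visible message, or None."""
        for receipt, body in self._messages.items():
            if receipt not in self._in_flight:
                self._in_flight.add(receipt)
                return receipt, body
        return None

    def delete(self, receipt):
        """Acknowledge a message so it is never delivered again."""
        self._messages.pop(receipt, None)
        self._in_flight.discard(receipt)

    def expire_visibility(self):
        """Simulate a visibility timeout: unacknowledged messages reappear."""
        self._in_flight.clear()


def run_controller(queue, state, handle):
    """Drain the queue, recording per-job state before acknowledging.

    `state` is any dict-like store that outlives the controller process.
    Returns the number of messages acknowledged in this run.
    """
    acknowledged = 0
    while True:
        received = queue.receive()
        if received is None:
            break
        receipt, job_id = received
        if state.get(job_id) != "done":  # skip work finished before a crash
            handle(job_id)
            state[job_id] = "done"
        queue.delete(receipt)
        acknowledged += 1
    return acknowledged
```

A message is deleted only after its job has been handled and recorded, so a controller that dies mid-job leaves the message in the queue. A freshly launched controller calling `run_controller` with the same queue and state store then picks up where the old one stopped, and jobs already marked done are not repeated.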
AWS Tactics and Best Practices:
- Elastic Load Balancer:
It automatically distributes incoming application traffic across multiple Amazon EC2 instances. It enables you to achieve fault tolerance in your applications, seamlessly providing the load balancing capacity needed to route application traffic.
- Elastic IPs:
An Elastic IP is a static IP address that is dynamically re-mappable. You can quickly remap it and fail over to another set of servers so that your traffic is routed to the new servers. It works great when you want to upgrade from old to new versions or in case of hardware failures.
- Availability Zones:
Availability Zones are conceptually like logical data centers. By deploying your architecture to multiple Availability Zones, you can ensure high availability. Utilize Amazon RDS Multi-AZ deployment functionality to automatically replicate database updates across multiple Availability Zones.
- Amazon Machine Image:
Maintain an AMI so that you can restore and clone environments very easily in a different Availability Zone. Maintain multiple database replicas across Availability Zones and set up hot replication.
- Amazon CloudWatch:
Utilize Amazon CloudWatch to get more visibility and take appropriate actions in case of hardware failure or performance degradation. Set up an Auto Scaling group to maintain a fixed fleet size so that it replaces unhealthy Amazon EC2 instances with new ones.
- Amazon EBS:
Utilize Amazon EBS and set up cron jobs so that incremental snapshots are automatically uploaded to Amazon S3 and data is persisted independent of your instances.
- Amazon RDS:
Utilize Amazon RDS and set the retention period for backups so that it performs automated backups.
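Several of the tactics above map onto single API calls. The Python sketch below shows four of them as thin wrappers, assuming the caller passes in boto3 clients (for example `boto3.client("ec2")`). The wrapper names, the default of seven retention days, and all identifiers passed in are hypothetical choices for illustration; only the underlying boto3 operations (`associate_address`, `create_snapshot`, `modify_db_instance`, `create_auto_scaling_group`) are real.

```python
def fail_over_elastic_ip(ec2, allocation_id, standby_instance_id):
    """Remap an Elastic IP to a standby instance so traffic follows it."""
    return ec2.associate_address(
        AllocationId=allocation_id,
        InstanceId=standby_instance_id,
        AllowReassociation=True,  # permit taking the address from the old instance
    )


def snapshot_volume(ec2, volume_id, description="scheduled snapshot"):
    """Request a point-in-time EBS snapshot; meant to be run from a cron job."""
    return ec2.create_snapshot(VolumeId=volume_id, Description=description)


def enable_rds_backups(rds, db_instance_id, retention_days=7):
    """Turn on automated backups and Multi-AZ for an existing RDS instance."""
    return rds.modify_db_instance(
        DBInstanceIdentifier=db_instance_id,
        BackupRetentionPeriod=retention_days,  # 0 would disable automated backups
        MultiAZ=True,
        ApplyImmediately=True,
    )


def create_fixed_fleet(autoscaling, name, launch_template, zones, size):
    """Create an Auto Scaling group pinned to a fixed number of instances."""
    return autoscaling.create_auto_scaling_group(
        AutoScalingGroupName=name,
        LaunchTemplate={"LaunchTemplateName": launch_template, "Version": "$Latest"},
        MinSize=size,  # min == max == desired keeps the fleet size constant
        MaxSize=size,
        DesiredCapacity=size,
        AvailabilityZones=zones,
    )
```

Taking the client as a parameter keeps the wrappers free of credentials and region settings, and it lets them be exercised against a fake client before they ever touch a real account.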
As a healthcare IT solution and service provider, ezDI believes in designing for failure because it has been one of our biggest keys to success. Our systems have been designed on the cloud since inception, which makes them extremely safe and easy to work with. We have been following and implementing this strategy because we know our clients’ data is of extreme value, and we understand that one failure on our end can affect millions of patient lives.
For more information, visit our website: www.ezdi.com/security-compliance/.