Disaster Recovery Content on InfoQ
-
Summary of Chaos Community Day v4.0: Resilience, Observability, and Gamedays
Earlier in the year, the fourth edition of “Chaos Community Day” was held at Work-Bench in New York City. Key takeaways from the day included: the topic of chaos engineering draws heavily from other domains, which software engineers can also learn from; understanding systems, and communicating and exchanging the related mental models, is vital for establishing resilience.
-
Building Production-Ready Applications: Michael Kehoe Shares Lessons Learned from LinkedIn
At QCon San Francisco, Michael Kehoe presented “Building Production-Ready Applications”. Drawing on his experience with site reliability engineering (SRE), he introduced the tenets of “production-readiness” that all engineers across the organisation should focus on: stability and reliability; scalability and performance; fault tolerance and disaster recovery; monitoring; and documentation.
-
Why the World Needs More Resilient Systems: Tammy Butow Discusses Chaos Engineering at QCon London
At QCon London, Tammy Butow explained why the world needs more resilient systems, and how this can be achieved with the practice of chaos engineering. She outlined three primary prerequisites for chaos engineering -- high severity “SEV” incident management, monitoring, and measuring the impact -- and presented a series of guidelines, tools and practices.
-
Microsoft Introduces Azure Availability Zones, Completes MAREA Transatlantic Connection
In a recent blog post, Microsoft announced the expansion of High Availability (HA) and resiliency options for customers. The update comes in the form of Azure Availability Zones, which increase the availability of certain Azure services within a specific region by providing complete redundancy and isolation of the infrastructure. Azure Availability Zones include a financially-backed SLA of 99.99%.
-
Public Preview of Azure IaaS Disaster Recovery Announced
In a recent announcement, Microsoft released details about its public preview for Infrastructure-as-a-Service (IaaS) disaster recovery using Azure Site Recovery (ASR). Using the ASR service, organizations can protect IaaS workloads in one Azure region and have them replicated to a different Azure region within a geographical cluster.
-
GitLab.com Postmortem Digs into Root Causes of 18 Hour Outage
GitLab's postmortem into the root cause of their 18-hour site outage is a detailed look at how the incident began, how it got worse before it got better, and how they plan to learn from the mistakes and improve the service.
-
BitBucket Introduces Disaster Recovery and Merge Strategies
Recently released BitBucket Server and BitBucket Data Center 4.9 bring the possibility of defining a strategy for disaster recovery, setting a preferred merge strategy, and more.
-
Too Big To Fail: Lessons Learnt from Google and HealthCare.gov
At QCon New York 2015, Nori Heikkinen shared stories of failure and lessons learnt during her time working as a site reliability engineer (SRE) at Google and HealthCare.gov. The discussion of managing large-scale outages included recommendations for preparation, response, analysis and prevention.
-
CenturyLink Acquires DataGardens to Offer DR as a Service
CenturyLink, one of the largest telecommunications and cloud providers, has announced the acquisition of the Canada-based disaster recovery software company DataGardens.