Site Reliability Engineering Content on InfoQ

News

DevOps

Google Meet’s Scaling Challenges during COVID-19

Google described the challenges of scaling Google Meet as usage surged during the COVID-19 pandemic. Google's SRE team tackled the increased traffic, which began earlier this year, using its existing incident management framework with some modifications.

Hrishikesh Barua
on Aug 16, 2020
Cloud

New Report Shows "Overwhelming" Cloud Usage

The new Cloud Adoption in 2020 report from O'Reilly Media paints a picture of "overwhelming" usage of cloud computing. The survey results also revealed growing adoption of Site Reliability Engineering, high but flattening usage of microservices, and limited interest in serverless computing.

Richard Seroter
on Jun 15, 2020
DevOps

Exploring Costs of Coordination During Outages with Laura Maguire at QCon London

At QCon London, Laura Maguire talked about the high cognitive cost of coordination during outages. Maguire found that coordination during anomaly response is difficult, that existing models can undermine speedy resolution, and that strategies for controlling the cost of coordination adapt to the type of incident. Tooling also adds its own coordination costs.

Christian Melendez
on Mar 13, 2020
DevOps

How Twitter Improves Resource Usage with a Deterministic Load Balancing Algorithm

Twitter recently explained why its RPC framework, Finagle, implements client-side load balancing with a deterministic aperture algorithm for its microservices architecture. After running several experiments, Twitter confirmed that the deterministic approach distributes requests more evenly, drastically reduces connection counts, and even requires less infrastructure.

Christian Melendez
on Jan 31, 2020
Culture & Methods

Scaling Infrastructure as Code at Challenger Bank N26

To launch its banking platform globally in the US, Brazil, and beyond, the challenger bank N26 introduced a new layer for configuring regions in its architecture, where product development teams can add application needs. At FlowCon France, Kat Liu presented why and how they introduced this layer, the benefits it brings, and what they learned.

Ben Linders
on Jan 02, 2020
Culture & Methods

The Importance of Fun in the Workplace

Things at work that make us smile or laugh can improve team cohesion, productivity and organisational performance. Fun can’t be forced, but it can be fostered, said Holly Cummins at FlowCon France 2019, where she spoke about the importance of fun in the workplace.

Ben Linders
on Dec 19, 2019
DevOps

How Did Things Go Right? Learning More from Incidents at Netflix: Ryan Kitchens at QCon New York

At QCon New York, Ryan Kitchens presented “How Did Things Go Right? Learning More from Incidents”. Key takeaways from the talk included: recovery is better than prevention; an incident occurs when there is a “perfect storm” of events -- there is no root cause; “stop reporting on the nines”, as user happiness is more important; and there is value in learning how things go right.

Daniel Bryant
on Jul 05, 2019
Development

The Evolution of Full Cycle Developers at Netflix: Greg Burrell at QCon SF

At QCon San Francisco, Greg Burrell talked about the journey towards “full cycle developers” within the Netflix edge engineering team. Following the principle of “operate what you build”, developers within this team chose to take on more operational responsibility for their services, and were facilitated by comprehensive tooling, training and management support.

Daniel Bryant
on Jan 06, 2019
Development

GitHub Incident Analysis Shows How to Improve Service Reliability

On October 21, 2018, GitHub users experienced degraded service for 24 hours after an incident caused by routine maintenance work. During that time, GitHub displayed outdated and inconsistent information, and webhooks and other internal services were unavailable. GitHub's post-incident report shows where things failed and suggests how to improve site reliability.

Sergio De Simone
on Nov 01, 2018
DevOps

Google Explains Why Others Are Doing SRE Wrong

Stephen Thorne, customer reliability engineer at Google, recently spoke at the DevOps Enterprise Summit London on what Site Reliability Engineering (SRE) is and why many organizations are failing to understand its basic premises and benefits.

Manuel Pais
on Jul 01, 2018
DevOps

Full Cycle Developers at Netflix: from Mindsets to Self-Service Tooling

The Netflix Tech Blog has shared the story of the “Edge Engineering” team’s journey of experimenting with approaches to building and operating services, which has culminated in “Full Cycle Developers”. This approach is showing promise at Netflix, where developers are responsible for certain operational aspects of service delivery and are supported by a range of self-service tooling.

Daniel Bryant
on Jun 17, 2018
DevOps

From Darwin to DevOps: John Willis and Gene Kim Talk about Life after The Phoenix Project

IT Revolution recently published an audiobook with nearly eight hours of conversation between Gene Kim and John Willis: Beyond the Phoenix Project – the Origins and Evolution of DevOps.

Helen Beal
on May 23, 2018
Architecture & Design

Microservices and Site Reliability Engineering

A recent article discusses how the complexities introduced by microservices initially seem at odds with Site Reliability Engineering (SRE), and how companies such as Google are addressing this so that development groups can continue to embrace microservices while they and their SRE teams have the tools and understanding needed to make the two work well together.

Mark Little
on Apr 29, 2018
DevOps

What It Means to Be a Site Reliability Engineer According to a Survey from Catchpoint

Site Reliability Engineering sits at the intersection of software engineering and IT operations. The approach was created at Google in 2003 and described in detail in its 2016 book, Site Reliability Engineering: How Google Runs Production Systems. Digital experience intelligence provider Catchpoint surveyed 416 Site Reliability Engineers (SREs) to understand what it means to be an SRE.

Helen Beal
on Apr 13, 2018
DevOps

How DevOps Principles Are Being Applied to Networking

Practices from the DevOps world are being adopted for managing networking services. Vendor hardware, configuration tools and deployment modes have made programmable configuration and automation of network devices and functions easier.

Hrishikesh Barua
on Jan 22, 2018
