
Q&A with Jeff Smith on His DevOpsDays NZ Keynote on DevOps Transformations

Jeff Smith, manager of production operations for Centro, a Chicago-based organisation which provides a platform for digital marketing, will deliver a keynote at DevOpsDays NZ this November in Wellington, New Zealand, with a talk titled Moving from Ops to DevOps: Centro's Journey to the Promiseland. Smith also spoke at DevOpsDays Indianapolis in July about the misalignment which can arise in an organisation, simply due to the differing motivational lenses through which professional silos examine the same subject matter.

InfoQ caught up with Smith to discuss Centro’s journey and compare it with Grubhub’s DevOps transformation, which he spoke about in 2016 at DevOpsDays Minneapolis.

InfoQ: You talked at DevOpsDays Indianapolis this July about the misalignment which can arise from the differently biased "lenses" through which various parts of an organisation look at the same problem. Can you tell us a little more about this?

Jeff Smith: Sure. It’s this problem where we don’t really see the problem from the same perspectives. It reminds me a lot of the Blind Men And The Elephant parable from India. A group of blind men who have never encountered an elephant before are feeling the elephant and describing it to the other blind men. But because each of them is touching a different part of the elephant, their perspectives don’t always overlap. It’s the same problem in organizations. We’re grossly misaligned in both our goals and our incentives, which results in wasted work and wasted effort.

Establishing the context in which each of us is working helps us to achieve each of our goals more efficiently. If my goal is to build the most robust infrastructure possible and your goal is to contain costs, we’re going to have an impasse if we don’t communicate. Through dialogue, we can either both meet our goals or at least present the conflict to leadership in order to strategically decide which one is more important.

InfoQ: How did Centro deal with this misalignment in understanding?

Smith: Like most organizations, it’s a constant struggle. The key thing is to always, always, always ask for context around the problem being solved. Another thing is to be astute and make sure you’re getting the actual problem and not just a request for feedback on someone’s proposed solution. This is referred to as the XY Problem. If someone asks you for feedback on their proposed solution rather than on their actual problem, you will end up with a sub-par solution.

An example of this is something most people are probably familiar with: someone asking for production access to a system. The truth is they don’t need production access, they just need to be able to execute task X. The solution they’ve come up with, however, involves them SSH’ing to a box and running a command. But if we rephrase and say "I need to be able to run this script in production without involving other people", it opens the range of possibilities up. Maybe we automate it and turn it into a Jenkins job. Maybe we schedule it to run on a regular basis. Simply granting production access may solve your problem, but it can create a host of audit problems for another team.

InfoQ: Your talk for DevOpsDays NZ will be about Centro’s journey from Ops to DevOps. What do these two terms mean to you?

Smith: OPS is basically the function of what my team does. I like to think of it as the technical version of business operations. Operations try to manage the process of bringing our code to the customer. Production systems are part of that, but there are many pre-production systems that are a piece of that as well and it all falls under the purview of OPS in my view. This includes build servers, testing environments, code repository management etc. DevOps is about the approach we take to doing that. DevOps is not a team, it’s not a third silo, it’s a method of collaborating between Dev and Ops. That collaboration usually involves some cultural changes.

OPS has to give the developers a bit more control of the environment, as well as a bit more visibility. Development needs to take on a large share of the responsibility for managing and operating their code once it hits production. The world of "only OPS touches production" and "development ends once it’s off my laptop" is gone. We need a tighter method of working together and that’s what DevOps is in my view.

InfoQ: How was this journey affected by Centro’s own particular starting state and context?

Smith: Centro was kind of struggling with what DevOps actually meant for the org. When I was originally interviewing for the position, the title was "DevOps Manager", which for me exposed some of the core issues with how they’d been trying to manage their transition. When I started we had a lot of Iron Fisted OPS sorts of policies. But there were a few key players in the org who saw the benefits of a different style of working and craved more ownership.

One developer, in particular, had an amazing amount of metrics around the systems he was responsible for. He had a real interest in knowing what was happening in the systems he was responsible for. Every company has a few of those people and the key is to find them and unleash them! Seek their input on the problems and the solutions the team is facing. Find all the allies you can and leverage them.

InfoQ: What types of challenges did Centro and its teams have to overcome during this transformation?

Smith: The biggest challenge remains instilling a sense of ownership for the code that gets produced and deployed to production. Because we operate a monolith, there are a fair number of developers that commit to the same codebase. This can produce the bystander effect, where all of the involvement serves as a sort of diffusion of responsibility. How does someone know it was their change that broke things? Maybe it was some other change!

The best way we’ve found to instil ownership is to make sure problems are assigned to specific people. If it gets assigned to general areas, what ends up happening is everyone assumes someone else is looking at it. But when an issue is owned by a specific person, it’s on them to either solve the problem or generate enough evidence to point it at the real issue owner and transfer it to them. Explicit ownership of problems is key.

InfoQ: How has the culture at Centro changed since its DevOps transformation?

Smith: People want to know why a thing is being done. That’s probably the biggest change I’ve witnessed. The act of asking "why?" shows that there is a level of engagement that goes beyond just getting a request off their plate so they can move on to the next thing. When people don’t understand something, they ask good probing questions in order to understand it. One simple question about a failed job run might end with an in-depth discussion of how Write Ahead Log replication in the database works. People just naturally want to learn more!

InfoQ: What measures do you take to avoid reverting to practice-based silos?

Smith: Discipline on how we approach problems. A knee-jerk reaction when something goes wrong is to add another layer of approval or another layer of supervision, but it doesn’t solve any of the actual issues that led to the situation you’re in. We also really enjoy doing post-mortems, but we focus on the human side of the problem rather than the order of events. What were people thinking when they made a particular decision? Why did that seem like a rational choice? What can we learn about the incident to make sure we haven’t slipped into an old way of doing things? We have to talk about things that go wrong in a much deeper sense than the way we typically talk about failure in retributive cultures.

InfoQ: How have non-technical partners responded to Centro’s journey to a DevOps culture?

Smith: Non-technical resources honestly don’t interface with us at the level where they see the cultural differences, but they see the change in capabilities. When a person can have their own mini-environment to test out a specific feature, it’s powerful. And when they can get it in minutes as opposed to days, it speaks to the types of changes and the speed at which we intend to move.

InfoQ: You previously spoke at DevOpsDays Minneapolis about Grubhub’s DevOps transformation. How did these two experiences compare?

Smith: The experiences are largely different because Grubhub was a company born in technology. It was always a technology company. Centro feels more like a company within a company. Centro existed for a long time as a services company, but as we add our SaaS offering we’re experiencing growing pains that differ from those of a company that was always a tech company. Mindsets are different, and views imported from previous lives seem much more entrenched than they were at Grubhub. I wouldn’t say that one was particularly better than the other, but there were different challenges for sure. Another big difference was the size of the technical organizations.

At Grubhub, we had an OPS Engineer for every stream team. The OPS Engineer was embedded in the teams, attending stand-ups and IPM sessions. My team at Centro is much smaller and we just don’t have the resources to do that, so we need to find ways to offer expertise in more of a pull vs push method. People have to come to us to ask our opinion on something, so we have to constantly find ways to incentivize that behaviour. It mainly boils down to relationship building and creating a culture where people feel comfortable asking questions.

InfoQ: Were there any particular lessons which stand out across both journeys?

Smith: People are always looking for a better way to do things. It’s not desire that stops transformations from happening, but fear. Fear of failure, fear of mistakes, fear of change. Ignoring fear is not a winning solution. Fear is like any other emotion, it needs to be acknowledged and accepted as real before it can be managed and dealt with.

InfoQ: What approaches have proven themselves to you in dealing with this general presence of fear?

Smith: The first thing you can do as a leader is to show vulnerability. You can’t be afraid to admit mistakes because your team and others around you will follow your example. Admit when you don’t know something in order to show that you don’t have the answers and that an inquisitive nature is not only allowed, but encouraged! Be curious about what others are doing in your environment and give them a chance to teach you something. Fear is rooted in a lack of trust, so before you can dispel fear, you need to build up trust.

Ultimately something will happen that is an incident or an outage. When that happens, look beyond the who did what when. Look deeper into the real root of the problem, not just that Frank restarted the nodes during the day. Why did Frank restart the nodes? What signals led into his decision making? What signals were missing that would have informed Frank that this was a bad decision? How do you bring all of these things to light through training, through radiating information so that the next time this situation arises, we’ve addressed all the parts of the system that contribute to someone making an error? Once you demonstrate this willingness to indict not just the person, but all of the little factors that led up to the event, you generate an open, honest environment.

One last important thing: there’s a right way and a wrong way to fail. If you deploy some new code that’s showing signs of instability and we then need to roll it back with another deploy, that’s one way of failing. The preferred way is if we can feature flag code. If we can turn it on and off without deploying, but through configuration, that’s the way I want my organization to fail. Celebrate those times! When you have a developer that deploys code, monitors the telemetry, sees the problem and turns the feature off, that’s still a win. You’re never going to eliminate failure. But you can eliminate failing poorly.
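
As an illustration of the feature-flag idea Smith describes, here is a minimal Python sketch. It is not Centro's implementation; the `FeatureFlags` class, the `new_pricing` flag, the JSON file layout and the pricing logic are all assumptions made for the example. The flag file is re-read on every check, so editing the configuration switches the code path without a deploy.

```python
import json
import os
import tempfile


class FeatureFlags:
    """Hypothetical flag store: on/off switches kept in a JSON config file."""

    def __init__(self, path):
        self.path = path

    def is_enabled(self, name):
        # Re-read the file on every check so a config edit takes effect
        # without a deploy or restart. A missing or unreadable file
        # means "off", which is the safe default.
        try:
            with open(self.path) as fh:
                flags = json.load(fh)
        except (OSError, ValueError):
            return False
        return isinstance(flags, dict) and flags.get(name) is True


def write_flags(path, values):
    """Stand-in for whatever tool or person edits the configuration."""
    with open(path, "w") as fh:
        json.dump(values, fh)


def checkout_total(prices, flags):
    # "new_pricing" is an assumed flag name guarding the new code path.
    if flags.is_enabled("new_pricing"):
        return round(sum(prices) * 0.9, 2)  # new behaviour behind the flag
    return round(sum(prices), 2)  # old, known-good behaviour


def demo():
    path = os.path.join(tempfile.mkdtemp(), "flags.json")
    flags = FeatureFlags(path)
    write_flags(path, {"new_pricing": True})   # feature turned on
    print(checkout_total([10.0, 20.0], flags))
    write_flags(path, {"new_pricing": False})  # telemetry looks bad: turn it off
    print(checkout_total([10.0, 20.0], flags))


if __name__ == "__main__":
    demo()
```

Reading the file on each call keeps the sketch simple; a real system would more likely cache the flags briefly or use a dedicated flag service, but the point is the same: the rollback is a configuration change, not another deploy.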

InfoQ: Which of the DevOps Topology Patterns best describe the model of DevOps adopted by Centro?

Smith: We’re currently a Type 1 DevOps organization with the goal of moving to a hybrid of Type 2 and Type 3. I’d love to live in a world where the OPS team is simply providing a platform for developers in an Infrastructure as a Service kind of model, with the sprint teams moving to a "you build it, you run it" kind of mode. There are a few hurdles here.

  1. We need to get to a place where we’ve codified the typical patterns that our applications have in terms of infrastructure, in order to reliably provide those services in an automated fashion. Sometimes your design can get too specific, so when someone needs to adopt a new service, the infrastructure template you’ve created isn’t compatible.
  2. We need to continue to level up developers on their understanding of the Operations point of view. It’s less about technical knowledge and more about thinking from the perspective of resiliency, failure tolerance and focusing on the critical path of your application. When you go to the Amazon home page, there are a lot of separate services involved. But the only thing you actually need on that page is search and shopping cart functionality. Everything else can be short-circuited in the event of a failure in order to deliver those key features. Getting developers to constantly think in that mindset is an opportunity to get them thinking more in the Operations POV.
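
The critical-path thinking in the second point can be sketched in a few lines of Python. This is only an illustration under assumed names: `render_home_page` and the section names are hypothetical, and each section is modelled as a plain callable rather than a real service call. Critical sections must succeed; optional ones are short-circuited when they fail.

```python
def render_home_page(critical, optional):
    """Assemble a page from section loaders.

    critical: dict of name -> callable; any failure propagates, because
              the page is useless without these sections.
    optional: dict of name -> callable; a failure is short-circuited and
              the section is simply left out (recorded as None).
    """
    page = {name: load() for name, load in critical.items()}
    degraded = []
    for name, load in optional.items():
        try:
            page[name] = load()
        except Exception:
            # Non-critical section is down: skip it and keep serving.
            page[name] = None
            degraded.append(name)
    return page, degraded


def broken_recommendations():
    # Stand-in for a downstream service that is currently failing.
    raise RuntimeError("recommendations service unavailable")


if __name__ == "__main__":
    page, degraded = render_home_page(
        critical={"search": lambda: "search box", "cart": lambda: "cart widget"},
        optional={"recommendations": broken_recommendations},
    )
    print(page)
    print(degraded)
```

The list of degraded sections is returned alongside the page so that the failure can still be surfaced in telemetry even though the customer never sees it.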

You can catch Smith’s keynote at DevOpsDays NZ in Wellington, New Zealand, on 5th and 6th November.
