Lessons Learned from the CrowdStrike Incident: InfoQ Dev Summit Munich 2024 Preview

In this podcast episode, speakers from the InfoQ Dev Summit Munich 2024 discuss the recent CrowdStrike incident, which triggered widespread outages and highlighted vulnerabilities in cloud infrastructure. The panel shares personal experiences and emphasizes the importance of resilience in IT systems, the implications of cloud dependency, and the lessons learned about risk management and automation in organizations.

Key Takeaways

  • The CrowdStrike incident is a reminder that even well-established companies can experience significant outages, highlighting the necessity of robust processes to manage risks and ensure critical paths remain operational.
  • As organizations increasingly adopt cloud solutions, transparency about third-party infrastructure becomes increasingly important. Companies must evaluate vendors’ technologies and configurations to mitigate the risks of hidden vulnerabilities.
  • Having a well-defined rollback strategy is crucial, and organizations should implement feature flags and design documentation to allow seamless rollbacks.
  • While human mistakes often contribute to failures, the underlying systems and culture also play a significant role. Organizations should encourage a culture of continuous learning and risk assessment.
  • Investing in resilience must be carefully balanced with budgetary constraints, and organizations need to allocate resources effectively to enhance system reliability.

InfoQ Dev Summit Munich 2024 Overview [00:05]

Renato Losio: Hello everyone, and welcome to the InfoQ Podcast. I'm Renato Losio, the chair of the upcoming InfoQ Dev Summit in Munich. It's a new event, and if you're looking to stay ahead on critical topics like generative AI, security, modern web applications, and more, we've got you covered. If you want to learn not just from our three amazing panelists today but from over 20 senior software practitioners with diverse experiences, backgrounds, and countries, join us in Munich! There will be plenty of time to connect with the speakers and your fellow practitioners. Last but not least, the end of September is Oktoberfest in Munich, so if you want to add a few days to enjoy the event, there's that opportunity as well.

The InfoQ Dev Summit is a two-day event brought to you by the team behind InfoQ and QCon Software Development Conferences. You may already be familiar with QCon London and QCon San Francisco, but there are also two new events. The inaugural InfoQ Dev Summit in Boston took place in June, and the next one is Dev Summit Munich at the end of September (September 26th and 27th). I look forward to connecting with you in Munich this September. I hope to see you there!

Introduction to the Panel [01:20]

Renato Losio: Today, I have the pleasure of sitting down with a panel of three amazing speakers from different backgrounds and companies, all of whom will be speaking at the upcoming InfoQ Dev Summit in Munich. They are Danielle, Santosh, and Mykhailo. In just a moment, I’ll let the panelists introduce themselves, giving each one a minute to share who they are and what they do. Then we’ll launch into a conversation covering some of the critical technical decisions that senior software developers face today and that we’ll dig deeper into in Munich next month.

Specifically, we’re going to talk about resilience, automation, and security for your cloud and non-cloud deployments. Let’s go ahead and introduce our speakers. Danielle, do you want to go first?

Danielle Sudai: I can go first. Very nice to meet you. My name is Danielle Sudai. I've been the Security Operations Manager at Deliveroo for the past three years. In addition to managing talented teams and focusing on defending against real-time threats, I've also been working in DevSecOps for the last decade across various firms. My focus is on developing the detection and triage of the alerts we receive for real-time threats, based on our organizational vulnerability management.

Renato Losio: Thank you so much, Danielle. Santosh, do you want to go next?

Santosh Yadav: Nice to be here. My name is Santosh Yadav, and I live near Hamburg. I work as a Senior Software Engineer at Celonis, which is based in Munich, so I'll get a chance to visit my employer after a long time. I mostly work on our monorepo; my team is responsible for managing it, and it powers Studio, our main product at Celonis. We coordinate with nearly 40 teams contributing to the monorepo, making sure it stays healthy, follows best practices, and adopts new tools via POCs to make those teams' lives easier, with a focus on developer experience.

Apart from that, I'm part of multiple community programs and have contributed to open-source projects. I'm part of the Google Developer Expert (GDE) program for Angular, the GitHub Stars program, and the Nx Champions program. You can find me at multiple conferences speaking about Nx and Angular.

Renato Losio: Thank you so much, Santosh. And looking forward to meeting you in person in Munich. Mykhailo, you're the last one.

Mykhailo Brodskyi: Hello everyone, my name is Mykhailo Brodskyi. I'm a Principal Software Architect based in Munich, working in the FinTech industry. Currently, I'm focused on defining the architecture strategy for our platform and am responsible for multiple business-critical projects. One of these projects is our cloud platform migration. I'm also a member of the Cyber and Risk Committees, where we're responsible for enhancing the security of our platform and mitigating potential vulnerabilities across all our platforms and related applications.

Renato Losio: Thank you so much. We're now going to talk a bit about resilience, automation, and security.

The CrowdStrike Incident [04:44]

Of course, this is a bit of a special summer, so let's start with a topic to trigger some discussion, one that we're probably all familiar with, whether as travelers, IT experts, security experts, or automation experts. Many of us had the experience last month (well, hopefully not all of us) of being stuck in an airport in front of a blue screen, thinking that something had gone wrong. For one of the few times, a large-scale IT outage made headlines on major news sites. That's the day you realize something big went wrong. My first reaction was, "Ha-ha, that can only happen to them". My second reaction was, "Oh, maybe that could happen to me".

I don't want to dig into what happened with CrowdStrike or what the implications were; I just want to use that example as a trigger for our discussion about resilience and risk—not necessarily about a global IT meltdown, but maybe just at the company level. How do we handle that? How do we manage automation and those processes? The very first question I have for you is: Were you surprised? What surprised you most about what happened? What do you think is the critical lesson we should learn to avoid saying, "That’s never going to happen to me"?

Santosh Yadav: First, I was surprised that something so big could happen: a company with billions in revenue causing a global outage like this. I remember I was traveling that day, not at the airport, thankfully, but by train in Germany with Deutsche Bahn, which is never on time, so of course it was delayed. Even away from the airports, I saw travelers struggling at Frankfurt Station because they were trying to buy food and their credit cards weren't working.

People don't realize that a small mistake can cause something so significant. For someone traveling internationally, not having a working credit card while trying to buy food is a big issue. I was really surprised that this happened. But I think the lesson for everyone is that no system is completely safe. We all make mistakes, but it's everyone's responsibility to ensure that the critical paths work. These are critical elements that can impact millions of systems and people. So that was my takeaway: Make sure your critical paths are taken care of.

Renato Losio: Thank you, Santosh. Danielle?

Danielle Sudai: To be honest, before moving into DevSecOps, I worked in IT security. The shift to that role and the move to the cloud made me wonder what we are going to do with agent-based SaaS solutions when they impact infrastructure. When we say we are hybrid and moving to the cloud, it's important to remember that these are other firms' servers, which are not visible to us. Whenever there is an impact, it also affects our cloud vendors to some extent. What will happen if something agent-based impacts the actual infrastructure? What happens when the on-prem and cloud approaches clash?

I was slightly amazed—I don’t want to speak about the specific company outage because things like that can happen and there is no blame. However, I was surprised by the scope. The company itself is trusted and serves many vendors worldwide, and I think the impact was what surprised me. People may not have been fully conscious of zero trust, leading them to spread agents across all their endpoints or systems, which eventually contributed to the issue. They did this for a good reason, to protect the endpoint and ensure it signals back to their management framework, allowing them to understand what’s happening there.

It’s a bit of a chicken-and-egg situation. But yes, I was definitely surprised by the impact, especially in the U.S. That was crazy. I think it opened another area that isn’t classic security but more of an infrastructure issue, which is massively impactful. All of a sudden, we have to face that reality. It reveals a lot of things that our current platforms cannot solve. We must remember there is an on-prem component behind it.

Renato Losio: That's an interesting point you made, Danielle. It’s a different security issue in the sense that it’s not a classic denial of service attack. It’s not someone who intentionally caused a denial of service. But in the end, there was a denial of service for the end users, for anyone impacted by the outage. Mykhailo, since you work in a strictly regulated financial industry, do you see anything different? Do you think there are any lessons to be learned? Or were you really surprised, or did you feel like, "Oh yes, I’m surprised that didn’t happen before"?

Mykhailo Brodskyi: Unfortunately, I was also surprised. Payment platforms were affected as well. For me, it wasn’t a shock, but it was a big surprise. In the FinTech industry, we have a huge set of regulations—internal, external, and even some specific to banking. Yet all these regulations didn’t help to avoid this situation. A big question is, how can we improve this? How can we prevent this situation in the future? I’ve had multiple conversations with different departments and applications involved in various payment platforms, and I can say that the deployment and testing approaches should be improved dramatically. That's what I see.

The Importance of Alternative Solutions and Rollback [11:08]

Renato Losio: Thank you. That's one point I'd like to come back to later: what we can do to improve. But going back to the basics, one thing I noticed, as a non-expert, was two clear trends in the very first few days. Apart from blaming the company and saying, "It's not going to happen to me", there were thoughts like, "How can they deploy that way, while we're the experts who don't make those mistakes?" That's a very risky attitude. I've always found it not just unfriendly but also short-sighted: I wouldn't want to be in that position myself one day, so I'd rather learn from them than blame anyone.

On a practical level, two things happened that day. First, people tried to find different solutions. You could see people in the airport checking in with a piece of paper. On the other hand, people were trying to roll back, but they had no real rollback because we were in a scenario where rolling back was much harder than deploying. I was wondering about those two aspects: what we should do in our deployments to ensure we have a rollback plan. If I'm stuck in that moment, should I think more about a rollback plan, or should I focus on finding another way to bring back what Santosh defined earlier as our critical path and feature? I don’t know, Danielle, if you have any thoughts.

Danielle Sudai: I think it's really interesting. It's an industry where we all talk and share experiences. I agree with you; there are a lot of people who place blame. But we should focus on what to do in this situation. This isn't the moment to say, "Now we're cutting the contract, and we are looking for other options". We have to deal with it tactically. We need to communicate with our endpoints and ensure that, regardless of the scenario, we are mitigating the issue and have repair items in place.

I want to take a step back and highlight zero trust as well. Mainly with third parties and fourth parties, there is a lot of room for technical issues, SLA issues, and threats based on the tools they use and the infrastructure they rely on. Renato, you're a cloud architect; many solutions are built on cloud infrastructure, and it's not always visible how those cloud environments are configured. A vendor can suddenly tell us, "We actually use Java libraries, and now you're affected by Log4j". You didn't choose that; it's that fragile.

That's why I would expect to see a more detailed understanding of the infrastructure they use before onboarding a solution. We need to bring the right technical people to ask the right questions and work closely with solution engineers to explain how our environment looks, what we need from them, and what our concerns are. We should validate that there are no unrealistic promises from vendors that don't reflect the reality of what's happening in their infrastructure or the regulated frameworks they adhere to. We need to be very transparent and persistent until we find the answers we're looking for.

That's something I love about Deliveroo: we always think about what can happen because we have taken on a SaaS solution. It's not just about us using them to manage threats; it's also about how we manage them if they experience a threat. This is something we've experienced over the past three years. Six months after I joined, Log4j became an issue, and suddenly there were a lot of SaaS solutions leveraging Java libraries.

Renato Losio: That's a very good point you made about going down the chain of suppliers. It can be quite challenging. Sometimes, you don't even realize where they are based. You might end up deploying your stuff on one cloud provider in a specific region, while deciding to monitor using a third-party solution that you don't even know is deployed in the very same data center you're monitoring. You might hope that when it goes down, it won’t take both your service and your monitoring solution down at the same time—while you’re still paying for it. But usually, you don’t have that visibility. You need to dig a bit; it’s not like all the answers are readily available. You need to go and find them yourself.

The Costs of Resilience and Human Errors [15:47]

Maybe Mykhailo can provide some feedback from his perspective. Danielle mentioned, "Oh yes, it's different with the cloud; you might have everything in the cloud, and you need to trust them as well". At the same time, I was thinking that as a cloud architect, sometimes it's easy to do more. If my data is on S3 or the equivalent from another provider, someone might suddenly tell me, "Oh no, why do you have it in just one region? What if that region goes down? Don't you want a multi-region setup?" Yes, we can do multi-region. In fact, it's just one click away; it's not that complicated. Can we implement a multi-cloud solution? Sure, we can do that too.

But how much money do you want to throw at this problem? That is often the real challenge—the balance. How do you decide where to stop? I can make my data super reliable, but the problem is that those petabytes of data I might be managing will become incredibly expensive for the company. I might be throwing money at the wrong problem. Looking back at the CrowdStrike incident, Mykhailo, where should I invest? Where should I focus on making things safer? Is just throwing money at the problem the solution?

Mykhailo Brodskyi: Interesting question. In this case, we are talking about the environment, how we test our application, and the application in general. From my experience, I can see that when we have a very critical application to deploy, we sometimes come up with a runbook with detailed steps that we need to execute in a specific sequence. In case something goes wrong, we have a go/no-go decision point for when we roll everything back.

But here we are talking about the environment and how we can test all these different components, and it's not directly related to the application. That’s why the approach we take to implement new updates to particular elements of the environment needs to be rethought. In some cases, we do not have a straightforward and documented rule on how we should do it, which is very important.

Regarding the data, we usually define and try to cluster the data in different ways. For some critical data, we choose not to store it in specific parts of the application, which is why it's completely separated.

Renato Losio: Thank you, Mykhailo. Santosh, I don’t want you to reveal anything about your session at the summit, but since the title is "How to Ship Updates of 40+ Apps Every Week", I was wondering how you can deal with that if you find yourself in a CrowdStrike scenario.

Santosh Yadav: Very good question. I've seen that nearly all the senior engineers we have today have broken production at least once in their lives, and done it pretty badly. I have done that as well. But when something like this happens, everyone becomes a pro; they start sharing how it might have been avoided by doing this or that. The reality is not like that. I think most of these issues come back to company culture.

For example, at Celonis, the first thing we do is write a design doc. The most important part of the design doc is our release strategy. We outline how we are going to do the release and how we are supposed to roll back if something goes wrong. We always recommend using feature flags when something is very critical and could impact the system. This way, we can roll back without impacting the end users; they aren't even aware of what happened. If something goes wrong, we just roll back. That's the first step, and that's how we ship our products. As you said, we ship 40-plus apps every week, and that's the first goal. For anything critical, we don't even start testing the code until we know the plan and the release strategy. Of course, there are small features that can go without a design document. But if something is critical, like what happened in the case of CrowdStrike, we don't go ahead without having a design doc in place.
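To make the feature-flag idea Santosh describes concrete, here is a minimal sketch of how a risky code path might be guarded behind a flag so that a rollback becomes a configuration change rather than a redeploy. The flag provider, flag name, and render functions are hypothetical placeholders, not Celonis's actual implementation.

```typescript
// Minimal feature-flag guard. The provider interface, flag name, and the
// two render functions are illustrative placeholders only.
type FlagProvider = {
  isEnabled(flag: string): Promise<boolean>;
};

async function renderCheckout(flags: FlagProvider): Promise<string> {
  // Ship the new code path dark and enable it gradually via the flag.
  if (await flags.isEnabled("new-checkout-flow")) {
    return renderNewCheckout();
  }
  // Rolling back is a configuration change, not a redeploy: turning the
  // flag off immediately routes every user back to the proven path.
  return renderLegacyCheckout();
}

function renderNewCheckout(): string {
  return "new checkout UI";
}

function renderLegacyCheckout(): string {
  return "legacy checkout UI";
}
```

In this setup, disabling "new-checkout-flow" in the flag service is all it takes to revert users to the legacy path, which is why end users need not even notice that something went wrong.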

Renato Losio: Thank you, Santosh. Danielle, I was thinking, as Santosh just mentioned culture and the importance of culture in the team. Even when I don’t think about software, when I read the news in newspapers or online, there’s often something bad happening, and it’s frequently attributed to human error. My first reaction when I read about a significant problem is that it wasn’t solely a human error. Yes, there probably was a human mistake, but something was missing. Take a train crash, for example. Maybe the train didn’t stop, but we can’t rely on it stopping just because someone is pressing a button and then blame the driver. There should be multiple safety nets.

When I think about software, I assume culture plays a role, but how do we protect against a developer's mistakes? I’m not saying a developer wants to do something wrong; sometimes, there’s a bit of complacency or laziness. If I’ve been working on the same project for many years, how do I avoid feeling so safe that I might unintentionally overlook important checks? I’m not suggesting anything is done intentionally wrong, but how do you maintain the right culture?

Danielle Sudai: It's interesting because Santosh highlighted the release strategy and the importance of assessing what might go wrong and documenting it. I'll add that, particularly in security, we treat human mistakes as a given. We're not blaming anyone; it's something we calculate as part of our overall risk, and ultimately it comes down to education. There comes a point in your career when you need to accept that maybe you did something wrong in your configuration, documentation, pull requests, pull request reviews, or Terraform deployment, and that it might have a big impact. As you said, how many production engineers have broken production? Many. That also relates to my session at InfoQ, where we will discuss the security aspects impacted by misconfigurations in cloud environments.

An Overview of InfoQ Dev Summit Munich Talks [22:19]

Renato Losio: That's a good time for each of you to do your elevator pitch for Munich. All three of you are going to Munich and will be doing a session. There are fellow practitioners listening to the podcast: why should they join you in Munich? What are they going to learn from your session? So please, Danielle, go ahead. You're first.

Danielle Sudai: My session is aimed at experienced cloud security practitioners, but it's also for developers, compliance people, and security specialists. It mainly discusses misconfiguration in cloud infrastructure: how do we gain visibility into it? In one of my former roles, I co-engineered a CSPM (cloud security posture management) solution, and it's simpler than people think.

It gives professionals, both engineers and security specialists, a very good view of how their cloud infrastructure actually looks. Are we making sure our assets are encrypted? Is our network zoning properly set up? Have we opened a port in a security group that allows inbound traffic from anywhere? How easy is it to identify that misconfiguration?

As of today, we are heavily invested in CSPM and Synapse solutions. Before we can do all of that, we really need to understand our environment, how it can be misconfigured, where our crown jewels are, and how they can be impacted. It's for all professional levels, and I'm excited.
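To make the kind of check Danielle lists concrete, here is a minimal sketch of one such misconfiguration scan: flagging security groups that accept inbound traffic from anywhere. It is an illustrative example using the AWS SDK, not Deliveroo's tooling or a complete CSPM implementation, and it omits pagination and IPv6 ranges for brevity.

```typescript
// One CSPM-style check: flag EC2 security groups that allow inbound
// traffic from 0.0.0.0/0. Real posture-management tools run hundreds of
// such checks across accounts and regions.
import { EC2Client, DescribeSecurityGroupsCommand } from "@aws-sdk/client-ec2";

async function findOpenSecurityGroups(region: string): Promise<string[]> {
  const ec2 = new EC2Client({ region });
  const { SecurityGroups = [] } = await ec2.send(
    new DescribeSecurityGroupsCommand({})
  );

  return SecurityGroups
    .filter((sg) =>
      (sg.IpPermissions ?? []).some((rule) =>
        (rule.IpRanges ?? []).some((range) => range.CidrIp === "0.0.0.0/0")
      )
    )
    .map((sg) => `${sg.GroupId} (${sg.GroupName})`);
}

// Example usage: list world-open security groups in one region.
findOpenSecurityGroups("eu-west-1").then((open) =>
  console.log("Security groups open to the internet:", open)
);
```

A posture-management product essentially runs checks like this continuously, correlates the findings with asset criticality (the "crown jewels"), and surfaces them for triage.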

Renato Losio: Thank you so much, Danielle. Santosh, you've already mentioned the title of your session, but please go ahead and tell us more.

Santosh Yadav: As I mentioned, we ship more than 40 apps every week. I'm excited to share how we do that, as it’s very critical when you have to work with multiple teams. Forty-plus apps means forty-plus teams to work with. How can you communicate well? Whenever you're bringing in a new tool, how can you educate, promote best practices, and create better architectures? All those aspects will be part of my talk. I'm really excited to do this because I've been thinking about preparing this talk for a long time. I couldn’t be more excited because it’s happening in the city where our headquarters are based. So, yes, I'm really excited to be there!

Renato Losio: Thank you so much, Santosh. Mykhailo, I believe you are going to talk about supply chain security. Please go ahead!

Mykhailo Brodskyi: My topic is going to be around security, focusing on a comprehensive approach to software supply chain security. Creating and developing good solutions in a highly regulated FinTech environment is super challenging. I'm going to elaborate on the regulations and how to navigate them, because it's not enough to understand the technical aspects of an implementation; it's also important to understand the environment. That's why I will go deep into standards and regulations. Then I'm going to show some use cases drawn from real examples of how we try to solve these challenges. People who are not from the FinTech industry can take this knowledge and apply it to their own business domain. So it's going to be an interesting talk about the FinTech industry and our technical architecture: I will show some examples and use cases and explain how we can make architectural decisions faster and more soundly.

Renato Losio: Thank you so much. I'll give my final pitch: these three talks sound amazing, and there will be 20 other incredible talks. I hope to see you next month in Munich on September 26th and 27th. Thanks again, Mykhailo, Danielle, and Santosh, for being with us today. Goodbye!
