In this podcast, Shane Hastie, Lead Editor for Culture & Methods, spoke to Erez Kaminski about the challenges and importance of developing regulated software for safety-critical systems, emphasizing the need for validated DevOps and AI integration in industries like healthcare and automotive, while highlighting the balance between innovation, safety, and regulatory compliance.
Key Takeaways
- Today it is necessary to develop regulated software at the speed of regular software, particularly in safety-critical industries like healthcare and automotive.
- Software validation and risk management are essential to ensure that safety-critical systems operate reliably and consistently.
- There is a need for "validated DevOps", integrating automation and quality assurance within development processes to enhance efficiency while maintaining safety standards.
- Outdated methodologies and tools in the development of safety-critical software pose a significant challenge, and there is a need for modernization.
- Society's expectation for regulatory oversight in safety-critical industries requires balancing innovation with necessary safety measures.
Transcript
Shane Hastie: Good day, folks. This is Shane Hastie for the InfoQ Engineering Culture Podcast. Today, I'm sitting down with Erez Kaminski. Erez, welcome. Thanks for taking the time to talk to us.
Erez Kaminski: Thanks, Shane. Excited to be here.
Shane Hastie: My normal first question on these sessions is who's Erez?
Introductions [00:47]
Erez Kaminski: Today, I'm the founder and CEO of a company called Ketryx. I help people develop regulated software at the speed of regular software, but historically I used to be a physicist and then a developer. I started my career working on control methods and simulations for fusion reactors and fusion reactions. I guess my specialty is writing, I wouldn't say high-performance code, but complex code. Someone else came in and made it more performant.
And then I went to work in industry, for a company called Wolfram Research that develops Mathematica, Wolfram Alpha, and the Wolfram Cloud. I worked very closely with the founder and CEO of that company on a bunch of different technologies there from, really, the start of the AI revolution in 2012. I joined a little bit later but caught the big wave of that, and helped build or guide certain parts of Wolfram Alpha. I was very fortunate to work on a big team that developed the largest expert rule-based engine that's ever been built. It's a monumental feat of engineering that was built by hundreds of people under the guidance of Stephen Wolfram and a great staff there.
And then after a few years there, I was hired by a large multinational pharmaceutical company called Amgen, the largest independent biotechnology company, to be the head of AI for their medical device group, and started getting into the design and development of safety-critical systems. I left Amgen to get some higher education here in Cambridge at MIT, and then started this company while at MIT, basically trying to combine my passion for developer tooling, developers, and the developer ecosystem with a need I saw, which is to make it possible to develop regulated software at the same speed as regular software.
Other than that, I live in Boston, come from a healthcare family, and love coding, love mathematics, love applied mathematics specifically. I think a big part of my career has been to help apply the revolution of applied mathematics that is so advanced in physics to more common domains, because the math we use in the ivory towers of physics is so far ahead of what we see in day-to-day products that it's pointless to keep developing it if no one else is catching up to it. I kind of want to be on the other side of the curve, helping apply that stuff through software.
Shane Hastie: Helping apply advanced applied mathematics through software in the realm of safety critical systems. What does that look like?
Applying advanced mathematics to software for safety-critical systems [03:22]
Erez Kaminski: It's a good question. We hear that quite often, like, "Does any safety-critical system use any advanced algorithm?" It starts with thinking about what it means to develop something that's safety-critical. First of all, I think for most people who have never built something that can injure or kill another person, it's hard to imagine the journey development teams and engineering teams go through in order to do that, because I've been in enough software development teams and workshops and conferences where people say, "But it's not like this is going into a pacemaker. Let's ignore all this complexity, the runtime, and all kinds of reliability aspects of it in order to just ship faster".
Safety-critical development proves consistent, reliable, and maintainable functionality [04:02]
In safety-critical development, that's not how it looks. You need to have a lot of control and a lot of reliability in the way things are executed. That leads to a lot of complexity in how you can actually do it, especially the more featureful the software is, the more decisions it makes, and the more powerful the algorithms are. Because fundamentally, safety-critical product development is about proving that something can do the thing you claim it does, consistently and reliably, in a maintainable fashion.
I have family members who are patients with class III (in the US) medical devices implanted in them that help them. For example, I have a family member who has a cochlear implant, which is a device that replaces the human sense of sound. It's been around since the seventies. Not as widely used as one would hope, but when you think of that technology, it uses quite a few complicated algorithms to make it all work. And then it's a question: how do you put this in a person who might live for another 70 or 80 or 90 years after implantation? And how do you make sure it's safe, reliable, and serviceable so it can continuously be maintained? Even though all the engineers who worked on it are long gone from the company, the company and society can still help these patients and not leave them behind.
When you think of what that means for algorithms, it becomes very complicated. In most applied mathematics situations, people actually try to ignore that aspect: the more complicated the software is, the more issues it can have, the more mistakes it can have, and then how do you mitigate that? Generally speaking, that's done through something called software validation and software risk management. Validation is the act of providing objective evidence that a system conforms to its intended use, which means write down all the things you claim that it does, or you think that it does, and then have evidence that shows that it does all those things.
That used to be quite simple. When we made tiny little widgets that did one or two things on an embedded piece of software, it didn't do many things, so it was easier to prove that it works. Now, with connected medical devices, connected cars, SaaS systems, AI systems, what do they do? What does Facebook do? Facebook's use case is vast. There are so many different features and functionalities in it. And on the other hand, the risk management aspect is understanding the hazards that emerge from the different things the software does and the different ways the software is built, in order to make sure it's both doing its intended use and safe.
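To make that definition concrete, here is a minimal sketch in Python of the bookkeeping validation implies: every intended-use claim must trace to objective evidence, and any claim without evidence means the system is not validated. The names and data shapes (Claim, REQ-1, and so on) are illustrative assumptions, not any particular tool's format.

```python
# Minimal sketch of validation bookkeeping: claims about intended use
# must each be backed by objective evidence (test runs, recordings, reports).
from dataclasses import dataclass, field

@dataclass
class Claim:
    claim_id: str                                  # e.g. "REQ-1"
    description: str                               # what the product claims to do
    evidence: list = field(default_factory=list)   # objective evidence for this claim

def unvalidated(claims: list) -> list:
    """Return every claim that has no objective evidence behind it."""
    return [c for c in claims if not c.evidence]

claims = [
    Claim("REQ-1", "Pump delivers the programmed insulin dose", ["TEST-101 passed, v2.3"]),
    Claim("REQ-2", "Device syncs readings to the clinician portal", []),
]

for claim in unvalidated(claims):
    print(f"{claim.claim_id} has no evidence: validation is incomplete")
```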
One example I always give is: I can have two types of cars. One car has its seats made of leather. The other car has the exact same use case. It's for driving. It does everything the same, except I chose to make the seats out of uranium. As a result of that uranium, there's a lot of risk, and maybe I'd rather not make the seats out of uranium. Those are the types of decisions people need to think about in safety-critical scenarios: is the cost of the part, as well as its safety profile, appropriate for what we're doing with it?
That involves a lot of planning and critical thinking, which progressively over the last 50 years has become impossible to manage. We've taken it so far that now just developing these types of systems is extremely slow, because managing their life cycle and the documented evidence required by their life cycle is extremely slow. Basically, there needs to be a way to do that much, much faster, because there's going to be a lot more safety-critical software around society. There already is, and if you read the news, the frequency of critical systems failing is just increasing over time.
Every year, there are more and more catastrophic newsworthy events and I don't think that's going to change unless we change the way we build products that are used in those scenarios.
Shane Hastie: But then we face the pressure of we want software faster. What you're talking about is a process that would put the brakes on.
Processes designed to ensure safety slow development [08:18]
Erez Kaminski: Yes. Again, I'm talking about industries that have already put the brakes on, to make pacemaker software, to make everything from pacemakers to toothbrushes. A toothbrush is a class I medical device in the US. It does take more time, and I just think that the amount of time, the amount of braking they've put on it, doesn't make sense anymore.
To me, that comes from two places. One is that the methodologies used are very old school, and two is that the tools we have developed to reduce the complexity of software development, tools like task management systems, DevOps tools, observability tools, cloud management tools, traceability tools, all the AI/ML ops, data ops, and so on, were never really built with the controls needed for safety-critical systems. It's even worse: not only are teams using antiquated processes, they also can't use the modern toolset. I'm in the business of easing off the brakes a little bit, while making sure we still have all the quality, reliability, and safety checks that patients and society expect us to have.
Shane Hastie: Society expects us to have. How does society give us its voice?
Challenges getting the balance right with regulation [09:30]
Erez Kaminski: There's a reason that in almost every country in the world, there is a regulator for, for example, medical device products, pharmaceutical products, automotive. Through that, society is giving us a voice, saying, "Hey, we want to make sure that someone's showing us that it's safe". I think we expect that, if you talk to people for long enough... I mean, a lot of people say, "Listen, all this stuff is crazy to do. There's so much work. It'll increase the cost of the device so much. Why should people do this?" That's, I'd say, the classic West Coast developer mantra: CI/CD, release every minute if you could, which I'm actually a firm believer in.
For the right use case, you should release as much as possible, and I want people to release safety-critical systems as much as possible. But every time you talk to a person like that and you say, "Yes, but that thing that's untested, what if it's going into a pacemaker in your child's chest? Are you sure you still don't want to test it 100%, or as much as you can?", everybody says, "Well, in that case, no, no, no, no, no. I don't want automated by-the-minute releases. I want some checks here. I want some confirmation of the validation, that it does what you claim it does".
I think that's how society... We both know it. We all kind of have a sense that things that can injure or kill people should be made a little bit differently. Also, through voting, we have asked the government to regulate things. Now, not all regulation is good and not all regulation is bad. Some of it is appropriate, and in many cases there's significant regulatory overreach that inhibits innovation and even inhibits safety, but I don't think that's true as a rule.
I think there's a lot of great regulation out there. Something I often tell people is: what does it mean to be regulated? To be regulated, or to be developing a regulated product, means you're building a product that matters, and it matters so much that society has decided there need to be some rules about developing that type of product for that use case. I think society does give us a lot of feedback. There's a reason that every country in the world has some agency that monitors the development of medical products.
Shane Hastie: I know from our conversation before we started that you've been involved in setting standards and writing standards. What does that entail?
Contributing to standards [11:45]
Erez Kaminski: The standards-writing world is a complicated universe. I'm a developer. I never thought I'd find myself in my career sitting on standards committees, deliberating different aspects of change management and impact analysis and risk analysis. It happened more out of a need. I felt that someone needed to join the fight and help build standards that are leaner and more developer-first, but that also incorporate all the safety features one needs to have.
I work with a wonderful gentleman who led software development and software systems engineering for the US FDA for about a quarter of a century and wrote many of the global standards for software. He and I thought together, "What can we do to have a bigger impact and help ensure both that standards are not becoming too burdensome and that they're aligned with an approach focused on safety?"
As a professional, you can join a bunch of different committees, whether that's ISO, the international body, or bodies in the United States. For medical devices, there's AAMI, the Association for the Advancement of Medical Instrumentation. For pharmaceuticals, there's ISPE, the International Society for Pharmaceutical Engineering. There are many different societies in many different regions. They publish standards, and those standards are then used by companies and organizations to show conformance. The assumption is that if you follow this checklist, you won't forget something. That's what standards are really about: making sure you did the obvious things, whether you have the experience or not.
And then standards writing is an interesting experience. It's very different from software development. You need a lot of organization and committees, and you need to have a lot of patience. As a mathematician, I don't think we're famous for our patience. We love to go pretty fast and talk fast. What ends up happening is you join one of these committees and you sit and deliberate and develop guidances or technical reports or standards that can help others develop safe things.
One of the standards I worked on is for risk management for machine learning in medical devices. That was pretty interesting. Another one I worked on is for medical device clouds, trying to think of how we can help people comply with the regulation and with safety requirements, but still do it in a way that makes sense and is developer-first. That's kind of been my path: patience, and the people building these things.
Shane Hastie: You made the point that more and more of our lives, of our world, is being automated, so more and more of the devices and things that we work with on a day-to-day basis, certainly in the medical field, but automotive and everything around us, are embedded with software now. What does validation really look like in that space?
Validation in complex interconnected systems [14:42]
Erez Kaminski: I can tell you the way it looks right now and the way I'd like it to look. Today, it looks like massive teams. In some cases, there are more people working on validation and compliance generation than people working on developing the product, sometimes many, many more. Some systems, especially cloud-connected medical devices, cars, and pharmaceutical factory equipment, which is what I'm more familiar with, require thousands or tens of thousands of pages of evidence in documented PDFs that prove the system meets its list of use cases.
Sometimes the use cases can go on for hundreds of pages, and then there's evidence of process steps that say you've done certain things. For example, your staff is well-trained to do this type of work. You have gone through a rigorous process of analysis. You have performed design verification, which is the activity of making sure that the software you're about to build or have built does, in fact, perform the intended use, the requirements, and the use cases.
And then after that, basically you run all the tests associated with all those features, requirements, specifications to show that all these things happen. And then you have traceability from each thing you claim about the software or the product. I claim that I can connect to the internet and I can click this button. Okay, is there a screen recording or an example of someone running a test that does that thing for this version? Of course, there is flexibility, because at the end of the day, the person making the finished product is in charge of ensuring it is safe. Sometimes you can say, "Hey, sure, I'm on version 200, but I didn't really modify that particular button so I don't need to retest everything". You take responsibility for that, but fundamentally it's that act of basically showing that a system can reliably and safely meet its intended use through objective evidence.
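That selective-retesting decision is essentially change impact analysis. As a rough sketch, assuming a hypothetical mapping from components to the requirements they implement and from requirements to tests (all names here are illustrative), the set of tests to rerun for a new version falls out of the set of changed components:

```python
# Illustrative change impact analysis: only requirements touched by a change
# need re-verification; the rest can be justified as unaffected, and that
# justification is what gets documented.

# Assumed mapping: which requirements each software component implements.
COMPONENT_TO_REQS = {
    "bluetooth_stack": ["REQ-1"],
    "ui_button_connect": ["REQ-2"],
    "dose_calculator": ["REQ-3"],
}

# Assumed mapping: which tests verify each requirement.
REQ_TO_TESTS = {
    "REQ-1": ["TEST-101"],
    "REQ-2": ["TEST-201", "TEST-202"],
    "REQ-3": ["TEST-301"],
}

def tests_to_rerun(changed_components: set) -> set:
    """Collect every test tracing to a requirement touched by the change set."""
    tests = set()
    for component in changed_components:
        for req in COMPONENT_TO_REQS.get(component, []):
            tests.update(REQ_TO_TESTS[req])
    return tests

# Between v199 and v200 only the dose calculator changed, so only its test reruns.
print(tests_to_rerun({"dose_calculator"}))  # {'TEST-301'}
```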
But the first time you hear that someone is producing 17,000 or 20,000 pages to ship a software system, it really makes you think about how much time is being spent on that compared to building better software. Today, the equation doesn't make sense.
But in the future, when you introduce DevOps, and what I call validated DevOps, a way to execute development operations safely and reliably through computer automation and computerized quality assurance agents that sit within your infrastructure and make sure you're doing all the right things, I think producing that amount of evidence will become not very burdensome and will just be part of doing the work. That evidence is needed for an audit, so someone else can confirm what you did, and it's needed for maintainability, so someone can look at that pacemaker in 20 years and say, "I understand what's going on here and why it's failing or not failing, and what other pacemakers might fail now because of this failure".
I actually think that 10 or 20 years from now, validation is going to be part of most B2B mission-critical, not just safety-critical, applications, because it is a way to prove that something works. If you talk to CrowdStrike or Southwest or many other companies, I'm not sure the cost of testing their software and documenting it properly wouldn't have been worth it, given the challenges they ran into.
Shane Hastie: Let's dig into this validated DevOps. How is that going to change our current DevOps pipeline? What are we adding to it?
What is Validated DevOps? [18:12]
Erez Kaminski: Yes. Validated DevOps is an approach to make sure that validated systems are developed in a controlled manner, but still in a fast manner, through traditional DevOps methodologies. It basically allows you to connect different IT systems in a way that checks that activities done in one IT system, which are a prerequisite for an activity done in another IT system, have actually happened and make sense, and then connects it all in a robust manner.
For example, you might say that for this piece of software, all these tests have to pass, traceability from a use case to a test needs to exist, and any modification between the two versions needs to be assessed to make sure you didn't modify more than what you claim. All of that is integrated deep into the CI/CD pipelines to prevent you from making mistakes, which gives teams a lot more freedom to just work much, much faster.
I think the way we're doing it, the way we see clients doing it, the way we see developers in the startup ecosystem using this approach, is saying, "Let's put some guardrails in place, and an automated, verifiable, computerized quality assurance agent in the center of this that makes sure I'm doing everything right. That actually gives me a lot more freedom to work the way I want to work, in a true CI/CD fashion, because the things I have to do, the computer will force me to do".
As a result, the computer will also generate all the evidence I need to show that I'm compliant, and I just need to review it, or review parts of it, and I can spend my time focusing on making a better product and releasing faster.
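A minimal sketch of what such a guardrail could look like as a CI step, under assumed file formats (a trace.json mapping use cases to tests, and a results.json recording per-version outcomes, both hypothetical): the script exits nonzero, blocking the release, if any use case lacks a passing test traced to this exact version.

```python
#!/usr/bin/env python3
# Hypothetical "validated DevOps" CI gate: block the release unless every
# use case traces to a test and that test passed for this exact version.
import json
import sys

def gate(trace_path: str, results_path: str, version: str) -> int:
    with open(trace_path) as f:
        trace = json.load(f)     # e.g. {"UC-1": ["TEST-101"], "UC-2": []}
    with open(results_path) as f:
        results = json.load(f)   # e.g. {"TEST-101": {"version": "2.3.0", "passed": true}}

    failures = []
    for use_case, tests in trace.items():
        if not tests:
            failures.append(f"{use_case}: no test traced to this use case")
        for test_id in tests:
            run = results.get(test_id)
            if run is None or run["version"] != version or not run["passed"]:
                failures.append(f"{use_case}: {test_id} has no passing run for {version}")

    for failure in failures:
        print(f"RELEASE BLOCKED: {failure}", file=sys.stderr)
    return 1 if failures else 0  # nonzero exit code fails the pipeline stage

if __name__ == "__main__":
    sys.exit(gate("trace.json", "results.json", sys.argv[1]))
```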
Shane Hastie: And still being safe.
Erez Kaminski: And still being safe, because you're doing all the controls. When I first came to the industry, Shane, I could not believe how people were working. It didn't make sense to me, but the more you do it, the more you read the regulations... It's not really the regulations that are problematic. It's the way people have decided to comply with them. Most of it is rooted in a methodology from the seventies and eighties, before significant automation existed. Most of the systems people use to develop safety-critical software also come from the nineties and eighties and even seventies. And that's really the issue. I think most of the regulations about quality are actually not that burdensome if you do them right.
Shane Hastie: You mentioned SaaS and architectural layers, and you use the example of the pacemaker all the time. I don't want a cloud-connected pacemaker, but actually I might want a cloud-connected pacemaker to provide my doctor with real-time notifications of what's happening in my heart.
Erez Kaminski: Exactly.
Shane Hastie: How does that work?
Choices around architectures and safety [21:02]
Erez Kaminski: That's a complicated question, right? The pacemaker is kind of an easy example because it's so scary, but let's look at something that's also a little scary, but where it's so obvious why you would need connectivity: an infusion pump. Like a pacemaker, there are just so many different parts to an infusion pump. Let's say I'm a patient who needs insulin. I want my doctor to be able to access and understand my dosing on a daily basis. I want to have access to my testing results from the hospital, and to mix those with my dosing and with other parameters, like a continuous glucose monitor I'm wearing. I want to be able to share that with my caregiver or with other family members. That creates a lot of complexity. One obvious issue is the many different attack vectors into the pacemaker or the infusion pump that have now emerged and didn't exist earlier.
Another is the fact that you're connecting to so many different systems, using so many different open source libraries, so many different versions. You can have all kinds of configuration management challenges, because maybe you're using an older pacemaker or infusion pump and the SaaS system has been updated and something falls out of sync, and you're using a different protocol to connect between them, and then suddenly you can't connect to the cloud at a very critical time. That's one challenge, on the reliability side.
Another is: how do you break down the architecture to make sure that the safety-critical parts, the things that actually manage the injection of insulin, for example, or in the pacemaker's case, that burst of electricity to the heart, are not actually controlled by the cloud, and that, in fact, a bad actor or faulty software can't produce a wrong event? That's really, really complicated. The FDA, and also the EU and many different regulators, now require a significant amount of evidence that you understand how you're architecting the system and that, from a reliability and security standpoint, there's no possible way, or no reasonable way, for things to go wrong.
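As a toy illustration of that architectural barrier (all names and limits here are hypothetical, not any real device's firmware): the actuation path accepts therapy commands only from the local control loop, so even a fully compromised cloud channel cannot trigger a dose.

```python
# Toy sketch of cloud/actuation separation: the cloud link is telemetry-only,
# and therapy commands are accepted only from the local, verified control loop.
from enum import Enum

class Channel(Enum):
    LOCAL_CONTROL_LOOP = "local"   # safety-critical, runs on the device itself
    CLOUD = "cloud"                # connectivity layer: monitoring and sync only

ALLOWED_THERAPY_CHANNELS = {Channel.LOCAL_CONTROL_LOOP}

def handle_command(channel: Channel, command: str, units: float = 0.0) -> str:
    """Reject any therapy-affecting command arriving over the cloud channel,
    whether the sender is legitimate or a bad actor."""
    if command == "deliver_dose":
        if channel not in ALLOWED_THERAPY_CHANNELS:
            return "REJECTED: dose commands cannot originate from the cloud"
        if not 0 < units <= 10.0:  # hypothetical hard safety limit
            return "REJECTED: dose outside safety limits"
        return f"DELIVERED: {units} units"
    if command == "read_telemetry":
        return "OK: telemetry is readable from any channel"
    return "REJECTED: unknown command"

print(handle_command(Channel.CLOUD, "deliver_dose", 2.0))               # rejected
print(handle_command(Channel.LOCAL_CONTROL_LOOP, "deliver_dose", 2.0))  # delivered
```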
What's amazing is that people in some companies have gotten so good at it that there's even evidence that it really works. There was a big story last year where one of the libraries a large medical device company was using in an infusion pump had a vulnerability. Vulnerabilities come up all the time, outside the control of the manufacturer. Because they did do the right thing, and they did architect the system right, and they did put proper barriers in place, it was actually not possible for a bad actor to control the machine from the internet. You could if you were right next to me and connected through Bluetooth, but that's a very different scenario: someone next to me with Bluetooth versus any person in the world with a computer terminal.
But it's very, very hard, and it's time-consuming, and it's frustrating, and it requires a lot of subject matter expertise. And the interoperability aspect of modern medical devices, which is what we're talking about, and our expectations as consumers, are a real problem for manufacturers that have never thought of doing that.
Shane Hastie: It expands how we have to think about the products we're working with.
The paradigm shift needed in thinking about safety first [24:14]
Erez Kaminski: Yes, it's exactly what I told you. For people who don't continuously develop safety-critical products, it's hard to think like this. I have a great developer, an amazing guy, who worked on language design for a bunch of different languages. One of them is called ReScript; he's part of the ReScript Association. Before he came to work here, he spent his whole career in SaaS and cloud. He's a well-known Austrian cloud developer.
He used to say, "As a developer, I'm always taught to only increase functionality". That's the purpose of cloud development. Only add more features. No one ever said you should think about how to reduce features to make sure it's safe and you're doing only the things you need to do. He said, "It's such a different paradigm of thinking about software development". I think we'll see, over the next 10 to 15 years, how that is going to be an ever-growing need in software development.
If you think of all the things that people are excited about now: deep tech, fusion reactors, autonomous vehicles, autonomous drones, all this AI for pharmaceutical development and medical devices, autonomous surgical robots... We work with clients who have semi-autonomous surgical robots. People are going to get really involved in not just creating more and more features, but creating the right features and creating them in a safe way. I think that's going to create a big fracture in the developer community, because that's not how most developers think about it.
That's what my team and I are trying to do: just make it a little easier, so we don't have so many people coming into healthcare or other regulated industries and leaving after six months to a year because they're not willing to sit around doing documents. They want to work.
Shane Hastie: We can't end without talking about AI. Where is AI taking us and what are the risks and opportunities in this space?
Risks and opportunities of AI in safety critical products [26:04]
Erez Kaminski: Yes. I've spent most of my adult life thinking about AI-enabled systems and, now, in the second part of it, thinking about safety-critical AI systems. As I said earlier, I used to help develop AI systems, future AI systems, in one of the largest pharmaceutical companies in the world, deployed to tens or hundreds of millions of patients, at a scale where I just didn't understand how you would act, knowing that the next release of an AI system, which is a statistical system, is going to go out and impact tens of millions of patients that have real problems in their lives, that have cancer, that have severe illnesses. It made me think a lot about how people develop AI right now and what its limits are.
I see a future where most commerce and most B2B interactions, as well as a lot of our personal lives, are going to be dominated by automation. Some of that is going to be routine automation that is deterministic in nature, a lot of it is going to be traditional machine learning, and a lot of it is going to be generative machine learning. I think we will need to think long and hard about which use cases we allow or not, but the vast majority of it, I believe, will need to be validated, because how else would we know that they're all interacting and working together correctly? We need to figure out a way to, if not validate generative AI systems, because I think it's actually quite hard to validate them given how many use cases they can perform, then at least perform proper risk management for them, which companies are starting to do.
I see the day when a huge amount of software is going to be developed with validated DevOps, and the agents themselves are going to check each other and make sure they're doing the right things. At least in the US, the FDA has now approved around 1,000 medical devices that have machine learning in them. There's this really famous graph of approvals that just keeps growing every year. I'm lucky to work with a bunch of those companies and understand how they're actually making the sausage internally. I think folks are trying to figure out how to do it the right way, in a way that's appropriate for the risk they're taking and the risk patients are taking.
But I'm also seeing devices that are truly revolutionary, that are changing how people work. I can tell you about one of them. There's a great company called HeartFlow that develops a very complicated AI system that saves the lives of tens and hundreds of thousands of people every year. About a quarter of a million patients use it. And what it does is: you go to the hospital... It's an amazing AI use case. You have a potential blockage or plaque build-up in your heart. The doctor measures that and says, "You need to go get an interventional radiologist or an interventional cardiologist to put a lead down your arteries and then check the flow rate of your heart".
Now, we all know what that means. That means a delay of two, three months. In the meantime, you could have an event and something could happen. This is people's lives. They're not asking you to get a measurement of your heart because everything is okay. With their device, you can go the same day, get a CT scan in a generic CT machine, and be provided with results the same day or the next day that are equivalent to that interventional lead insertion, which I think is just unbelievable.
There are people whose lives are saved every day by advanced AI systems using this technology. Those are the things that excite me, along with making sure they work well, because, similar to our conversation about applied mathematics, I think most of the software that we're seeing in the consumer and cloud web domain is just not really hitting the industries that need it the most. It's impossible for us to train physicians and other subject matter experts across a variety of industries fast enough for the growing population. What we need to do is augment those people's abilities and reduce the cognitive load of doing their work, in a way that allows them to be more productive and easier to train.
I go to surgeries often. I really love seeing people do medicine. We live in Boston, and Boston is the global hub for all kinds of biomedicine and traditional medical device development, and you sometimes see people perform new surgeries that are physically difficult to do. You need to hold your hand in a position that is just uncomfortable to hold, and not everybody has the shoulder strength, for example, to hold that position, but the treatment is revolutionary. I saw one treatment that would prevent people from needing radiation treatment for a certain type of cancer if they come in once every two years to a doctor who cleans something off a certain internal surface of the body with a laser.
But when you see the physicians who invented it in their lab, and see how they do it, you're saying, "Wow, physically this is really challenging to do. How can we make it easier to learn, make it easier to do from a biomechanical perspective, and even introduce some automation through AI that makes it more automatable?" Today, you have cataract machines that remove certain parts of your eye in a more automated way. That's basically allowing a lot of clinics that can't find that doctor, in rural areas or in mainstream urban cities, to do more of that. And I think a lot of that is going to come to pass.
I think so far in medicine we have a pretty good reputation for this. In automotive, I don't think the reputation of autonomous vehicles is great yet, and we're just going to see this emerge more and more over time. Society is heading in a very clear direction. Automation is going to be dominant in our lives, and we need to figure out how to make that automation safe and reliable for our children's safety.
Shane Hastie: That's a really good point for us to wrap things up on. Erez, some really great insights and interesting points there. If people want to continue the conversation, where do they find you?
Erez Kaminski: They can find me on Twitter and on LinkedIn.
Shane Hastie: Thank you so much.
Erez Kaminski: Thank you, Shane.