Transcript
Pilyankevich: My name is Eugene. I've been doing different kinds of security since the mid-90s. Most of my career, I've been defending other people's data and sensitive assets. My lifelong interest is why other people fail at doing that, because you only get better by analyzing and learning from the mistakes of others. Yours, you will make anyway.
Design Stack Agnostic, Modern, Secure Architectures
At some point, I realized that we typically treat security attributes as if they are in contradiction to everything else we do. We want to have it secure, but we run it on outdated Windows. We want to make it secure, but we can't integrate it with our convenient infrastructure-as-code tooling. For many engineers and decision makers, being stack agnostic in your choices, being modern in your implementations, and being secure sounds like a CAP theorem: you can have only two of the three. In fact, it's not. We will spend time on learning the mental framework that makes it easier. First of all, the stack agnostic point. When you are thinking about architecture and risks, you are not thinking about tools. You're thinking about processes in the real world. Once you work out the architectural parts, it's easy to pick the tools that work for your stack. We want architectures that enable modern tools and that address modern, relevant, valid risks. Once you start looking at risks, and stop looking at tools, it's easier. The definition of what is secure, and what we actually strive to achieve with our efforts, is tricky. For simplicity, I will be marking all three of these with this simple visual coding.
Outline
We'll talk about how to set goals for your security designs, what to strive for, because when you don't know where you're going, you're never going to get there. We will understand the necessary thinking steps, rather than the Googling-and-choosing steps. We will understand how, when our thinking meets the limitations of the real world, not to struggle, but sometimes even to benefit from that. We'll set the motivation, why we're doing this. We'll talk about the building blocks. We'll talk about how it gets real.
Why We Need Security Architecture
Why do we even need security architecture, whatever that is? I'll start with a story. The first time I was allowed to make large decisions about security was a banking product, which stored a lot of sensitive data and which was exposed to a lot of not too nice guys. We were not an easy target. It was almost a dozen years ago. We already had good ISO certification. We had a crazy banking compliance rating. We had annual audits and pen tests. In that age, we thought we were pretty much ahead of the crowd. Yet we had crazy customer fraud. We had people automating our web interface, turning it into APIs, circumventing our security controls, stealing our data and such.
Perfect User Fraud Prevention Solution
The first set of engineering efforts was catch and run. We did something, they adapted. Then at some point, we escalated that to my direct manager, one of the owners of the company. He said, "You're burning a lot of money for free. Why don't you warn account abusers that their contracts will be void? Regardless of whether it is the API or the web interface, we'll charge them per request." The company changed course in one day. The fraud dropped to almost unnoticeable levels. The engineer's decision would be, let me write more code, let me bring more tools. That challenges the attacker. The manager's decision would be, let's look at the risks, let's adapt to the risks.
Prevent Injections on the Public Front
The same system then goes even more public. Due to changes in local legislation, we have to serve requests from unauthenticated users, from the general public. We store credit history records, and people have to be able to verify whether we have something on them or not. It was the early days in my country. What did we do? The obvious stuff. We've got to sanitize everything. The system was in PHP, so there was a lot of sanitization in place. We're going to write tricky mod_security configurations, and things like that. Still, the attackers, now through the open public frontend, would be able to get some data, or even seize someone's account. There were a lot of problems. At this point, I was able to get intervention from the system's architect, who said, "Validation isn't the problem. We didn't have the 4-layer domain validation in place, but we will now." That dictated prepared statements. That dictated the model of materialized views in the database. The injections from the public interface were gone. The security engineer's decision would be, do more of the stuff I know. The architect's decision would be, remove the cause that creates the trouble.
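The difference between sanitizing input and removing the cause can be shown in a few lines. This is a minimal Python/sqlite3 sketch (the original system was PHP, so the language and the table are assumptions for illustration): the concatenated query is injectable no matter how clever the filtering, while the prepared statement never parses user input as SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE credit_records (person_id TEXT, status TEXT)")
conn.execute("INSERT INTO credit_records VALUES ('ID-1001', 'clean')")

def lookup_unsafe(person_id):
    # String concatenation: attacker-controlled input becomes SQL.
    return conn.execute(
        "SELECT status FROM credit_records WHERE person_id = '%s'" % person_id
    ).fetchall()

def lookup_safe(person_id):
    # Prepared statement: the driver binds the value; it is never parsed as SQL.
    return conn.execute(
        "SELECT status FROM credit_records WHERE person_id = ?", (person_id,)
    ).fetchall()

payload = "x' OR '1'='1"
# The injected predicate dumps every row through the unsafe path...
assert lookup_unsafe(payload) == [("clean",)]
# ...while the prepared statement treats it as a literal value (no match).
assert lookup_safe(payload) == []
```

The architectural point is that the safe path needs no per-field sanitization rules at all; the class of problem is gone, not filtered.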
Why Do Large Companies Struggle With This?
We were small in a highly regulated market, and most of the people who could make a decision sat in one room. It was very easy. My interest for decades is how big, well-funded, nicely coordinated, quite compliant companies still fail: Equifax, Heartland Payment Systems, JP Morgan. Come to think of it, large banking is covered with compliance. RSA Security, whose founders were at the beginning of the cryptographic revolution. Google, Juniper, people who are highly competent in what they do. How come they end up failing as well? When I talk to my colleagues at larger companies, the typical excuse is, a big company is hard. Big infrastructure is really hard to manage, lots of processes. Nobody knew that this was even possible. How are we expected to protect from that? Someone even quoted Fight Club: "You're going to get hacked anyway." Exactly. Nobody, in the end, had the capability to say, "This is a systematic problem, we need to do something about that."
Why? There are three directions we can go here. Humans are unpredictable; not that you can do anything about that. Technology is broken; we are gradually improving that, but it is a given. However, poor design decisions at any given moment are something that you can improve. Why are design decisions actually poor? Because most of the time we are thinking of how to get the security goal done, how to get rid of this concern. We are addressing the symptom, not the cause. Why? Because for most of us, even for some of us in the security industry, security controls and security efforts themselves have negative business value. They don't bring new features. They don't improve your product. They don't make your users happier. They are frequently hard to understand, full of crazy lingo, built with tools that seem made by aliens. It is confusing and contradictory. It creates a lot of conflicts in your system designs. Unless, of course, you're employed in the information security business; then it's really messy, impossible to grok, and totally confusing. Why? The issue is, when you're doing something, it works or it doesn't. It meets certain criteria or it doesn't. In security, you never know if it's secure or not, until it's broken. Then it's definitely not secure.
4 Types of Knowing
In military thinking, there is a simple framework of four types of knowing. There are things you know that you know. There are things you know that you don't know, but you can come to know them, by reading, listening, talking to somebody, and so on. There are unknown knowns: things which you definitely are going to know someday, for example, the cause of somebody's death before it happens; however, you don't know them now. There are unknown unknowns: threats which you don't know, which you don't even know exist. Once they come, you don't know which form they will take. Unfortunately, with security, it gets worse. On top of these four vectors, we've got a set of reactions. We can stay confused. We can stay in doubt about what we do. We can stay afraid of compliance punishments, security risks, your boss being angry at you for overspending your budget. We can be risk averse. We can just pretend that nothing is wrong.
I have almost exhausted the list of whys. Let's get closer to the causes. There are too many things we have to think about, and they are poorly systematized within the security context. Once we are facing something risky, our evolution tells us it is fight or flight. You don't really turn on most of your cognitive abilities at their best. Why? Because you don't have a simple framework for that. People make mistakes under pressure all the time. One of my favorite stories is why some of the people in high-profile banking, who should understand risks really well, get caught in adultery. Because cognitive skills are domain specific. Something you do in your office, within a certain narrow range of activities, doesn't translate to something else you do. The same thing with security: you can be a brilliant engineer, yet some of these skills simply shut down when you face uncertainty and risk.
Remember the Giants
For the remaining part, we will be trying to give mental models for your brain, to simplify the picture and make decisions easier, because otherwise you're going to be like the hero of our story. We had four sets of decisions: manager, security engineer, software engineer, and system architect. These fellows could talk to each other easily. The questions are typical. What is bad for us? How do we prevent that bad thing? What can we do under our narrow set of circumstances and technical capabilities? What is the right way to do this systematically, so that it fits in without constraining anything else? What did Google learn from Operation Aurora? They drew architectural conclusions about access control. Instead of just tightening up the policy and buying some new tools, they redesigned the access control architecture for their employees. When you are analyzing the root causes of security failures, you realize you quite quickly get to decisions in the architecture.
Security Architecture 101
Why are we talking about architecture in the security context? Because we want systems that are simple, understandable, and implementable under pressure. We don't want a security policy which only three people in your company have read fully. We want a simple set of guiding principles, which basically do two things: prevent the damage, and manage the risks in a way that doesn't burn your budget and doesn't constrain the growth of your business. What is security architecture? It is a combination of security decisions, which are focused on the actual risks your system faces: not the imaginary ones, not just what the OWASP Top 10 recommends. The actual risks. You've got to figure them out in a chosen manner, which means you should still have some choice in "I'm going to do it this way or that way," because otherwise, this is not your architecture, this is the prescribed way. All while maintaining what is important for your business at a sound level. How do you do that?
How to Design the Security Architecture
There are three parts to this equation. One is understanding, managing, and minding the risks during your decisions. Second is understanding and managing the attack surface. This is where you and your sensitive assets meet the world. Then, in the end, balance trade-offs, because the first two bring quite a complex set of requirements. Before we do these three things, any security effort is like this door. It is not actually insecure. It protects itself nicely, but it is only tied to these concrete posts on the sides of it.
Understanding Risks
I think many of you have already spent significant effort in your businesses to ensure that your product is resilient when you are scaling it up. That you can control hardware and resource consumption. That you can protect against failure with redundancy, and so on. For a well-educated engineer, security architectural efforts are actually the same: you are designing against a certain set of risks. There are two schools of thought in terms of designing against risks. This is not how it happens in reality; it is just a beautiful metaphor that's frequently used in our industry. First, there is the NASA way of fighting risks, where you are designing against the laws of nature and randomness. You can model most of the randomness in certain ways, so you can engineer your system to be resilient against it. Then there is the U.S. Navy way. It is full of unknown unknowns, and it is full of active adversaries. The asteroid is not looking to hit the rocket, it just flies by, whereas your enemy is intentionally looking to impose damage on you. Most security efforts fall closer to the second school, whereas many scalability and redundancy efforts fall to the NASA way. However, where you sit on this spectrum is your choice, which should be educated by what you're defending against.
What Risks Should Be
Risk management is an infinite effort, and even going a few circles around this loop is better than doing nothing. For engineering decision making, there are two facets of it which are important. You should be able to measure risks in numbers, to compare and decide; otherwise, it's just theory and hand-waving. You should be able, upon this decision, to manage them adequately with your infrastructure. The typical cycle starts with defining what we accept. Are we accepting that, for example, somebody uncovers a new way to capture the radiation from your processor and record the states of every server in the world? Or do we want to protect against that as well? Then identification, naming those risks; assessment, how well your posture against these risks matches their severity; and treatment, choosing what you do. You can mitigate. You can insure. You can ignore. Many things. Then accept that the level of risk you've ended up with is appropriate for you, and monitor. It's a long way. The easy way is: identify, measure, find adequate treatment.
Risk Management
There are two questions which don't require huge formal frameworks. What is the most important thing to protect? What is the relationship of priority between things: should we spend more time on this or on that? There are five approaches that enable you to decide this quantitatively. The first two are quite simple; the rest, not so much. In a simple way, it's explained like this: the risk is proportional to two things, the probability of something bad happening, and the probable damage if that bad thing happens. Measuring this allows you to prioritize. If we are talking about the probability of some event, and your system processes millions of events an hour, probabilities start to look different for you.
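That prioritization rule, risk proportional to probability times probable damage, is trivial to operationalize. Here is a minimal sketch; the risk names and numbers are invented for illustration, and in practice they would come from your own incident data and impact estimates:

```python
# Risk ~ probability of the bad event x probable damage if it happens.
# All figures below are hypothetical, purely to show the arithmetic.
risks = {
    "credential stuffing on public login": (0.30, 50_000),
    "SQL injection on public frontend":    (0.05, 500_000),
    "insider copies customer database":    (0.01, 2_000_000),
}

def prioritize(risks):
    # Score each risk and sort highest first; the order, not the absolute
    # number, is what drives where you spend your limited effort.
    scored = {name: p * damage for name, (p, damage) in risks.items()}
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)

for name, score in prioritize(risks):
    print(f"{score:>10.0f}  {name}")
```

Note how the rarest event (the insider) outranks the most frequent one (credential stuffing) once damage is in the equation; that is exactly the comparison gut feeling tends to get wrong.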
Understanding the Attack Surface
You've got your sensitive assets and you've got bad people outside. The attack surface is the line that separates one from the other. It is any point in your product or infrastructure which will enable attackers to impose a certain type of damage. The attack surface is actually helpful for you, because it allows you to prioritize what you should spend time protecting, and what you can simply ignore in the context of a certain risk or a certain asset. The typical attacker is looking for an asset, for gain, directly or indirectly. Whereas most engineers, in the past or sometimes in the present, think in servers. We are protecting systems. The attackers think in chains of servers which enable them to get to a certain goal. We think in lists of things we should do to call it secure. The problem is, attackers optimize for results. Our activities are not prioritized by risk, and are not optimized for the best posture. That's unfair, because the attacker has to do only a few things right, while we have to do most things right. It's easier to do that when we think systematically about risk and attack surface.
Managing Attack Surface
There are five things you can do about the attack surface. First of all, you should know it. Your infrastructure monitoring, your asset management, whichever way it goes in your company, you need to know: what is the attack surface on the sensitive data or sensitive processes you run? Before imposing security controls, you can actually minimize it. In many places, you don't need those sensitive assets. Then, where you do need them, you get to think about controlling the attack surface. This is where access control, encryption, and everything else come into play. Then, unless you monitor what you control, you don't control anything. So once you've minimized the surface and set out appropriate security controls, you're going to monitor them. The final thing is, it is all nice in theory until you get to practice: drills. When you hire pen testers, this is not some people on the other side of the planet doing random stuff. They are your drill in understanding what your attack surface is, and how well you control and monitor it.
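The five steps, know, minimize, control, monitor, drill, can be mocked up as a tiny inventory exercise. Everything below is a hypothetical illustration, not a real asset-management tool; the endpoints are invented:

```python
# Hypothetical inventory: (endpoint, does it touch sensitive assets?).
inventory = [
    ("public marketing site", False),
    ("customer API /records", True),
    ("internal admin panel",  True),
    ("legacy FTP export",     True),
]

# Step 1: know the surface -- everything that touches sensitive assets.
surface = [name for name, sensitive in inventory if sensitive]

# Step 2: minimize -- suppose review shows the legacy export never
# needed the sensitive data at all, so it is cut from the surface.
no_longer_needed = {"legacy FTP export"}
surface = [name for name in surface if name not in no_longer_needed]

# Steps 3-4: what remains gets controls (access control, encryption)
# and monitoring; step 5 is the pen-test drill against exactly this list.
print(surface)
```

The point of the exercise is that the list you control, monitor, and drill against is the output of steps 1 and 2, not everything you happen to run.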
Balancing Trade-offs
We get closer to how this theory looks in the real world. Balancing trade-offs between security and everything else is as easy as spinning a tabletop spinner. If you apply the right type of force, it's easy. If you don't, it's a struggle. The typical trade-offs in security choices are well known. It's expensive. It constrains the usability of my product. This type of authentication doesn't look beautiful in my mobile application. What do you do about that? The more we as a security industry develop new security controls and new ways of interacting with people, the more we realize that security controls which are not usable are not secure, because they get ignored and circumvented. Security controls which are too expensive are bought just to be there; they are never fully integrated, because you ran out of budget. Security controls which impair maintainability are security preventers, because if you prevent normal system administration and operation of your infrastructure, DevOps, whatever, it is less manageable. In the end, flexibility. Sometimes people fear that by doing this, "I will end up becoming less flexible in my development," which is partly true, but partly you have many choices made for you already. If it's too expensive, you will not be able to afford to maintain it. If it constrains usability, the security control will be circumvented. If it constrains maintainability, it lowers the security of your system. You don't need such a security control. The only thing you sacrifice is a little bit of flexibility. This is not something versus something. This is pick your battles. Pick what you care for, and accept the rest, because perfection isn't attainable. You've just got to watch that whichever solutions you choose are acceptable in terms of their risk impact, and acceptable in terms of how they alter your quality attributes, your non-functional requirements.
Designing For Security
When I was younger, and when I was designing and building and sometimes breaking systems, within the boundaries of the law, of course, I was always like, "This theory is nice, but I'm an engineer and the practice is what makes the difference." However, my goal for the next section is to illustrate that all the engineering is nice, but it is only by carefully thinking about what we should do, and how, that we make the difference. The real world doesn't exist in nice boxes and lines. It doesn't exist in sweet functional specifications. It exists with its limitations, and it doesn't care whether they are beautiful for us or not. My growing into a manager and a decision maker within security was accompanied by this quote all the time. I would tell other people, "In theory, it sounds good, but in practice it isn't." I come from the ground. I built systems. I broke systems. Then, eventually, I learned that this is true. However, when you look at everything through the lens of certain mental frameworks, your decisions land in the right place.
The Attack Surface Is Always Too Big
The real attack surface of your business, product, and infrastructure is always huge. You always end up having more to protect than resources to protect it with. You've got too many technologies which are not covered properly by the security tools you buy or integrate. The only things that are not extremely big are your budget and the amount of normal human hours you've got, because you've got a life as well. One of our customers is a national power grid, the people who coordinate transmission of commands, and the actual electricity, between generation and consumption. This is quite a critical business, for certain reasons. These folks are quite afraid of getting the control signals hijacked, getting the monitoring signals hijacked. There are good reasons for that. Historically, these people lived off the assumption that we live in a separate physical world: our operational technology infrastructure doesn't get connected to our IT infrastructure, so we're good. Then again, most of the power grids around the world still operate equipment that is older than me. It's not exactly easy to plug an Ethernet cable in there. You might be really lucky if you end up having a couple of contact pairs to extract something to [inaudible 00:28:20]. Getting closer to the context. This large, nationwide power grid is compliant to old legacy standards. Yet small parts of it, privately owned, should comply to European regulations, because they are part of the European balancing agreement. Both of them have to monitor hundreds of devices which emit hundreds of telemetry signals.
What do you think? How different are they, with a totally different set of demands? How different are they in their choices of tools and choices of trade-offs? They're not. In the end, they are much more constrained by what they run on, and much less constrained by what they try to comply to. What happens in the real world is, we had to find the sweet spot of compromise, not within compliance requirements and procedures, but within the technical standards which enable power equipment to say: this is what I'm consuming, this is what I'm producing, this is the mode I'm working in. The sweet intersection of compatibility turned out to be the data, not anything else. Sometimes you have to step down until you find the place where this compromise exists.
The other one is quite recent. A customer of ours inquired, "I've got an issue. My auditors have requested me to increase the coverage of my security monitoring system. It should cover more systems, for whatever reason. Yet I've got only three people. I can't even pass some of this logging data to my SIEM system. What do I do?" Instead of going and selling some of our stuff that could help in this case, at least secure aggregation of the data, we said, "The easiest way to have more hands, and to define what you actually need to cover with your security monitoring, is to revise the risk model. See whether what you're monitoring actually fits your risk model. Should you even monitor it in the first place?" Three simple rules on the firewall ended up eliminating half a gigabyte of logs a day. Three simple rules just removed a part of the attack surface which had stopped being a priority target for security monitoring. Managing the attack surface allows you to prioritize. After you have narrowed it, you still need to prioritize within it. You still need to choose the things which you can control, the things which you can only monitor, and the things which you have to accept.
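The kind of pre-filtering that dropped half a gigabyte of logs a day can be sketched as a risk-model check in front of the SIEM pipeline. The sources and events below are invented for illustration; the customer's actual fix worked one layer down, as firewall rules that stopped the out-of-scope traffic from ever generating logs:

```python
# Hypothetical risk model: only these event sources are in scope,
# because only they sit on the prioritized attack surface.
in_scope_sources = {"auth-gateway", "payments-api", "admin-panel"}

def ship_to_siem(event):
    # Drop events from sources the risk model says are not a priority
    # target; less volume means three people can actually watch the rest.
    return event["source"] in in_scope_sources

events = [
    {"source": "auth-gateway", "msg": "login failed"},
    {"source": "cdn-edge",     "msg": "cache miss"},
    {"source": "payments-api", "msg": "card declined"},
]
kept = [e for e in events if ship_to_siem(e)]
print(len(kept), "of", len(events), "events survive the risk-model filter")
```

The decision of what goes into `in_scope_sources` is the risk-model revision; the code itself is trivial once that decision is made.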
Is It Secure? Trust Levels
Once you hit the tools level with your decisions, you get into trouble, because what you mentally expect is: what is the best way to secure this? The best way doesn't exist, because the definition of secure is hazy. What is secure? Ultimate security is when your server is turned off, hidden in the basement, and covered with half a kilometer of concrete, so that no physical being can steal it. The definition of provably secure, which is very popular in cryptography, is: we've got security proofs; we trace our assumptions down to some core assumption behind the whole thing. Another speaker in the Security Track will talk about how quantum computing challenges some of the core assumptions of cryptography, and which new assumptions we need to make to build new provably secure systems. For 99% of the things we do, the goal is raising the bar from "it's insecure," which is easy to know, to "probably, it's more secure." The remaining 1%, for the smart and the tricky people, is that aside from preventing attacks, or preventing a certain percentage of attacks and damage, you can control the attacks. You can direct the intruders in a certain direction. Still, even after we remove the issues with trust, even after we remove the issues with having too much to do, we still have requirements which conflict with each other. We see conflicts when we don't understand the root causes. We see conflicts when each problem has to have a solution: we have 50 problems, 50 solutions, and they will definitely conflict with each other. However, if the 50 problems are a function of one or two root causes, addressing those root causes will make the 50 problems disappear, and the conflicts disappear.
Example: Optimizing SIEM Coverage
This is the same security monitoring coverage problem we worked on for a day before realizing that the risk profile was wrong. We assessed, for a day, the possibility of bringing in some of our tools which provide encryption and data masking, and are pluggable in the way the customer wanted. We found a very tricky requirement. Whatever the logging system is, it should prevent data leakage through the audit logs. How do you even do that? As it turns out, it is a typical organizational scar: previously, some auditors had found personal data in the audit logs, not sanitized properly, and said, "You shouldn't be doing that." The organization tries hard not to do that, without understanding the nature of the issue.
We dove deeper. We've got the PCI logging requirements, which are quite detailed about what the coverage of your logging should be, in terms of the actual content, the technical content, the meaningful content. Then you've got the GDPR requirements, which say that some of the data, and profiling, require consent. In case your publicly-facing system processes certain transactions, and you don't do anything else, how do you get this consent? In some cases, legal is the answer; in some cases it's not. Because in this very case, you realize that proper extracts from the audit log end up enabling profiling of customer activity. Remove the personally identifiable data: fine, it's still profilable. This is where it gets to the customer's decision: I need to sanitize some of the data, to tokenize personally identifiable data, to encrypt everything in one place, store it elsewhere, provide it on demand to the security monitoring system, and so on. We should protect the data in certain ways. However, what is the appropriate amount of effort here? No good answer. That's why we ascended back to the risk model, and realized that half of the data doesn't even have to go through the system. Otherwise, having brought encryption into it all, how do we manage the keys? How do we manage the key life cycle between various systems? You've got an infinite rabbit hole of questions to ask yourself, and never get a result.
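The tokenization step the customer settled on can be sketched in a few lines. This is a minimal illustration, not the actual deployed tool: a deterministic keyed token replaces the identifier, so the SIEM can still correlate events per customer, but cannot recover the identity without the key. The field names and the key are assumptions, and the hardcoded key is exactly the shortcut that opens the key-lifecycle rabbit hole the talk warns about:

```python
import hashlib
import hmac

# Demo-only key. In practice this lives in a KMS and gets rotated --
# which is where the key-management questions begin.
TOKEN_KEY = b"demo-only-secret"

def tokenize(value):
    # Deterministic keyed token: same input -> same token, so events
    # stay correlatable, but the raw identifier never reaches the logs.
    return hmac.new(TOKEN_KEY, value.encode(), hashlib.sha256).hexdigest()[:16]

def sanitize(event):
    pii_fields = {"name", "account"}  # assumed PII schema for this sketch
    return {k: (tokenize(v) if k in pii_fields else v)
            for k, v in event.items()}

event = {"name": "Jane Doe", "account": "UA-12345", "action": "transfer"}
clean = sanitize(event)
assert clean["action"] == "transfer"   # technical content preserved
assert clean["name"] != "Jane Doe"     # PII replaced by a token
assert sanitize(event) == clean        # deterministic -> correlatable
```

Note the trade-off the transcript points at: even tokenized, the event stream still supports profiling, which is why the real answer was to revisit the risk model rather than to encrypt harder.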
Good Architecture Is a Framework
Good security architecture is as much a framework for how to choose what to do as a framework for how not to do things, and how to recognize the things you don't have to do. Before doing the three steps of understanding the risks, understanding and managing the attack surface, and understanding the ways you can balance trade-offs for your system, it's too early to think about the tools. It's too early to compare their attributes. Tools increase the attack surface, because logs in a security monitoring system can leak the data. They increase operational burden and complexity, the human time which could have been spent on doing what is actually appropriate for your risk profile. This amazing picture outlines a lot of enterprise security tools; it didn't even fit the 16:9 proportion. A lot of amazing security tools the market has to offer. All of them are extremely efficient and bring a lot of joy when you use them and play with them, when you understand, in the first place, why you're holding them in your hands.
Good architecture is a decision framework. You figure out when you will decide to do something and when you will decide not to do something; on which level of your abilities, the technical, the architectural, the business risk management part, you will address certain risks. It's a design guide, because just by asking yourself all these questions, you get enough small decisions to inform many of the daily decisions your developers will make, or you yourself will make when facing pressure, when facing uncertainty. Your life will be much easier. When you are focused on risks instead of technology, you're not actually that prone to "it doesn't work for me." You have a lot of choices, because you're actually thinking about how to mitigate risks rather than how to make this tool do this security function.
Example: IAM + SSO + Zero Trust On Top Of Legacy AD/LDAP System with a Dozen Applications You Can't Update
The last example comes out of my tricky experience of having to integrate proper identity management and single sign-on, which turned out to be a byproduct of a zero trust approach, within a large legacy infrastructure which still operated Windows Server 2003, with an extremely old Active Directory. Big servers running in the basement, powering a huge organization, and a dozen applications which rely only on LDAP to authenticate. Suddenly, these people are looking to re-engineer everything, yet they don't have the money. They are looking to do proper centralization of login and temporary tokens, yet they can't, because their servers can't even run all the appropriate Microsoft tooling. It runs the old Active Directory, and it almost dies because the organization has a lot of users. What do we do? We ended up going down to the same level as with the power grids: to the data. We ended up building a very small service which authenticates the user, gives them a temporary token, and writes that temporary token into the Active Directory password field. Whenever the user, through any of the dozen applications, has to show their credentials, there is a temporary token there. Whenever the user logs in to any of those, we recognize an active, valid session, just by the fact that they presented a token which we temporarily rolled into Active Directory. There is an issue with that. We suddenly see that this service shouldn't really be accepting passwords in the first place, because it's new code: problem number one. It involves changing clients which are not too trustworthy. And we don't want replay attacks, where a simple authentication exchange would end up being repeated by an attacker to generate a new temporary token.
Each day there are more solutions; at that point, there were not so many. There are interactive cryptographic protocols which enable you to confirm that the other party has the secret without revealing it. We added one to the mix, and got simple bare bones: a service of a few hundred lines of code, which doesn't leak credentials, avoids replay attacks, and avoids overloading the Active Directory server. It doesn't change the load pattern of the Active Directory server, and it makes the Windows admins happy. There is a nice bonus to that. Once we were done, we realized that since we're the first point where the user shows up, the customer can actually firewall the sensitive systems away from the users. Only when the user shows up at the authentication gateway, the same authentication gateway can instruct the firewalls to enable that IP address. This way, they didn't have to rely on VPNs for distant offices, and things like that, because that stuff gets broken and imposes cruft to manage.
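The shape of that gateway can be sketched as follows. This is a deliberately simplified stand-in: a fresh-nonce HMAC challenge-response plays the role of the interactive protocol the talk alludes to (the real thing would be a proper PAKE or zero-knowledge password proof), and writing the minted token into the directory is elided. All names and parameters are assumptions:

```python
import hashlib
import hmac
import os
import secrets

# The server stores a derived verifier, never the raw password.
def derive(password, salt):
    return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)

salt = os.urandom(16)               # in reality, per-user and sent to clients
verifier = derive("correct horse", salt)

def issue_challenge():
    # A fresh random nonce per attempt is what defeats replay.
    return os.urandom(32)

def client_response(password, challenge):
    # The client proves knowledge of the password by keying an HMAC
    # over the challenge; the password itself never crosses the wire.
    return hmac.new(derive(password, salt), challenge, hashlib.sha256).digest()

def server_check(challenge, response):
    expected = hmac.new(verifier, challenge, hashlib.sha256).digest()
    if hmac.compare_digest(expected, response):
        # On success, mint the temporary token that would be rolled
        # into the directory's password field for the legacy LDAP apps.
        return secrets.token_urlsafe(24)
    return None

chal = issue_challenge()
token = server_check(chal, client_response("correct horse", chal))
assert token is not None
# Replaying a captured response against a new challenge fails.
old_resp = client_response("correct horse", chal)
assert server_check(issue_challenge(), old_resp) is None
```

Because every login lands at this one service first, it is also the natural point from which to drive the firewall rules mentioned as the bonus.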
Recap
Having looked at the different directions your thoughts might take in practice based on this framework, I still would like you to take away not the shiny examples of "this dude integrated weird stuff into weird stuff and it worked." That's not the goal of the story. What I would like you to take away is the thought that if you're facing new design requirements, and coming up with a new security regimen for your company or your business, this combination of security decisions that focus on actual risks, and that focus on not breaking your quality attributes, is the number one thing you should have. It will simplify your choices. In the beginning, we set out a number of goals for ourselves. We wanted our systems to be secure: all three efforts increase security and make it more relevant. We wanted our systems to be modern: balancing trade-offs and understanding the actual attack surface allows you to use modern tools. We wanted them to be stack agnostic: proper risk management doesn't care about your operating system. Proper risk management cares about where the problems are, and what the solutions are. Attack surface management is all about what is present within the risk profile, not which connection protocol it runs.
You want to design against risks. You want to have laser focus on the battles you can win. You want to put effort into removing the conflicts. This way, risk management turns out to be mostly business decisions, whatever the attack flavor. The attack surface turns out to be technical and architectural decisions and trade-offs; that's an architect's pain anyway. There are various ways to improve security, various directions your mind can take. One is to improve risk management, improve the business way of coping with risks. Another is to add tools and controls. Secure architecture is about combining those two into systematic risk treatment, rather than making a single choice: it means choosing to sacrifice neither risk management nor technical elegance.