
Trends in InfoSec: Data Minimization, Autoclassification, and Ethical AI


Summary

Rachael Greaves provides a summary of the requirements for data lifecycle management, the technology approaches, and the risks, and includes a Data Minimization Best Practice Checklist.

Bio

Rachael Greaves is a CIP, CISA, CISM, CDPSE, and is certified in project, change, and records management. With a cultural anthropology and linguistics background, Rachael brings ethical, global, and sustainable practices to the sector. She is Australia’s Most Outstanding Woman in IT Security and RegTech Female Entrepreneur of the Year 2022, and is listed on the Women in Fintech Powerlist.

About the conference

Software is changing the world. QCon London empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Greaves: I'm Rachael. I'm the CEO and co-founder of Castlepoint. That's my company. I'm a certified information professional, certified auditor, security manager, privacy engineer, project change and records manager as well. I'm the regulatory lead of the company. I designed our solution. I did that after many years auditing large government and corporates and identifying a gap in technology to help address the tension that we see between minimizing data risk and maximizing data value.

These are our five parts that we'll go through. I'm going to intersperse the talk with some real-life stories of data minimization gone wrong and gone right. I'll provide a checklist at the end, which will cover the summary of what we talked about and some key takeaways.

Data Minimization as a Cornerstone of Security

Let's start with setting the scene. First, what is data minimization? Data minimization is an approach to reduce the impact of data breaches and spills. Since the beginning of cybersecurity, we've been focusing mostly on reducing the likelihood of a breach. We use training, firewalls, encryption, and lots of other techniques to reduce the likelihood that a bad actor, or really any unauthorized party, will be able to read our sensitive information. Risk is a combination of likelihood and impact. A risk might have a low likelihood but a catastrophic impact if realized. In that case, it's not a low risk.
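To make that concrete, here is a minimal likelihood-times-impact sketch; the scale values and band thresholds are illustrative assumptions, not anything from the talk:

```python
# Illustrative risk scoring: risk combines likelihood AND impact,
# so low likelihood alone never makes a risk "low".
LIKELIHOOD = {"rare": 1, "unlikely": 2, "possible": 3, "likely": 4}
IMPACT = {"minor": 1, "moderate": 2, "major": 3, "catastrophic": 4}

def risk_score(likelihood: str, impact: str) -> int:
    return LIKELIHOOD[likelihood] * IMPACT[impact]

def risk_band(score: int) -> str:
    if score >= 8:
        return "high"
    if score >= 4:
        return "medium"
    return "low"

# A rare but catastrophic breach is still not a low risk.
print(risk_band(risk_score("rare", "catastrophic")))  # medium
```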

While it is essential to keep focusing on likelihood using defense in depth techniques, we know we can't reduce likelihood to zero. There's always a zero-day exploit, always a trusted insider, a potential misconfiguration, an advanced persistent threat. There can never be zero likelihood of a breach or of a spill. There can be catastrophic impact for national security as well as for civilians. I'll bring you to our first story, and this illustrates just how harmful a data breach can actually be. It's one that compromised the national security of the United States for two generations.

The Office of Personnel Management (OPM) data breach was discovered in 2015, by which time the hackers had been in the network for a year or more. A state-sponsored APT exfiltrated over 22 million records containing the personal data of current, former, and prospective federal government and military employees, including highly sensitive information about psych evals, drug use, biometrics, health conditions, debts, and more. That included about 5 million sets of fingerprints. The breach obviously made those people easier to identify, target, and exploit from that date until, realistically, the end of their lives, and not just them, but their children and their associates.

As well as undermining U.S. and allied national security, that attack would have significantly uplifted the offensive capability of that foreign government. It's believed that it forced the withdrawal of affected covert agents from the field, as they could now be identified by their fingerprints, and exploited because the adversary knew where their children went to school. In a national security context, breach of sensitive PII doesn't just harm us, it helps our enemies in the field, and it also feeds their massive AI capability, which harms us commercially as well. This wasn't just a one-off.

I, myself, in my previous life as a security auditor, have found protected identity information, that is, covert ops vetting files, on a civilian's home drive, because the scanners would not scan directly to the secure system when they were digitizing files. I found a complete copy of a national security vetting database stored on a public share because it was the simplest way to send it to the backup team every day. Our technology design and configuration decisions make a really huge impact on the risk our organization is exposed to. Insecure software management is not just putting in hardcoded credentials or missing patches in production.

Let's wrap up the intro. Data minimization is a security and privacy principle that requires organizations to limit the amount of sensitive information they hold, knowing the data in their systems could be breached or spilled at any time. When we have a spill, we want that spill to be as small as it can be. To achieve data minimization, we need to understand all our information holdings. We need to know what has risk and what has value. Importantly, what rules apply to that information, including retention rules, secrecy provisions, and other regulatory obligations. Why is this important to engineers and developers? Because you don't have to just shift left cyber risk. You have to shift left privacy and you have to shift left records governance as well.

What are the obligations? Firstly, privacy. Data minimization is a key GDPR privacy principle, and it's also a requirement under the Data Protection Act, and the NCSC bulk personal data guidance. It requires that you ensure that the personal data you hold is limited to what's necessary, that is, you do not hold more than you need for the stated purpose. In practice, your organization must only collect data sufficient to the purpose, and must periodically review and delete data that it no longer needs for that purpose. If you don't delete the data when you no longer require it, individuals also have the right to make you delete it under right to erasure, or right to be forgotten. The data minimization principle is closely linked to the storage limitation principle, which is more specific when it comes to retention, with the following rules. Let's talk about these.

Firstly, don't over-retain. That means you must have a timely and comprehensive retention and destruction function. Destruction can be permanent deletion or effective anonymization. Secondly, apply records management. This means you need explainable and traceable methods for determining your retention and destruction policy, and a way to enforce the right to be forgotten. That means finding every mention of a person, reviewing that content to see who or what else is contained inside it, then, with or without handing it over to them if they have requested that, destroying it irrecoverably, and being able to demonstrate that you've done so. Finally, you need to have and follow a policy, and that means you need the information and records team involved in security and privacy. They become key stakeholders of and participants in your software development lifecycle, your product roadmap, and your own governance and oversight.
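As a rough sketch of that erasure loop (the document store, its search and destroy calls, and the reviewer hook are all hypothetical, not any real product's API), the workflow looks something like this:

```python
# Hypothetical right-to-erasure workflow: find every mention, have a
# human review each record, destroy irrecoverably, and keep evidence.
from datetime import datetime, timezone

def erase_subject(store, subject_name, reviewer):
    audit_log = []
    # 1. Find every record that mentions the data subject.
    for record in store.search(subject_name):
        # 2. Human review: the record may also be about other people,
        #    or carry a retention obligation that blocks destruction.
        decision = reviewer(record)
        if decision == "destroy":
            store.destroy_irrecoverably(record.id)
        # 3. Keep evidence so the erasure can be demonstrated later.
        audit_log.append({
            "record_id": record.id,
            "decision": decision,
            "at": datetime.now(timezone.utc).isoformat(),
        })
    return audit_log
```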

Let's consider another story. This one is about a breach affecting private citizens, although not just private citizens. Again, in 2015, a hacker, or hackers, compromised the user data of the Ashley Madison site. The site was designed specifically to facilitate extramarital affairs. The company had retained 60 gigabytes of data about users, including data that those users had requested be deleted and had actually paid to have deleted. Avid Life Media, who ran the site, had asserted that they destroyed the data, which was not true. Multiple suicides were linked to this breach. Individuals were publicly humiliated and extorted. The spill contained military and government email addresses, and addresses registered in Saudi Arabia, where adultery is punishable by death.

As if that wasn't already problematic enough, many people had been signed up to the site as a prank, and they were also caught up in the aftermath. A lot of those people had already been effectively extorted by Avid Life Media, who were allegedly collecting about $1.7 million USD a year in those "please delete" fees. Any breach of user data would have caused harm in this case, but breach of data that should have been deleted made the impact even more disastrous.

I can't really relate to the Ashley Madison one, and most of you probably can't either. Let's consider an example a bit closer to home, for me at least. In 2018, foreign state actors breached the Australian National University network. They stole 19 years' worth of staff and student records, including mine. Those were exfiltrated most likely to build profiles on each of us for later use, as we're promoted through the ranks of government and corporates, or found cyber companies. There are other potential targets: international students from the country suspected of the breach, their memberships, associations, and activities, the National Security College lecturers, and leading-edge researchers who may travel to or reside in that country.

All these high-value targets are now exposed, along with any sensitive or restricted data about them that should be kept secret under law. I graduated more than 15 years ago. My record should have been disposed of after 7 years. Sixty percent of us affected by that spill shouldn't have been in it in the first place. Our records weren't destroyed because destruction was not a priority, and technically, it was time consuming and complex to execute. You do have an obligation to preserve records as well as to destroy them. Getting the balance right can be very challenging, but it is essential for all organizations.

Legislation governing retention of data, such as the Public Records Act and other statutory instruments, is designed to protect the national interest and the interests of your stakeholders. Keeping records for the legal minimum period helps make sure customers and staff can access justice if they need to take legal action against your organization. Retaining records also ensures you can be audited and held accountable. Record keeping is not just for the benefit of the state and your stakeholders, it's for your organization's benefit. Knowing how long data will be adding value to the business helps make sure you don't dispose of it precipitously. After all, you also need sufficient records to protect yourself in the event of a problem. You need to exploit your data, which is one of our most expensive assets to create, to be more competitive, more efficient, and more effective.

Let's have a look at what can happen when we destroy information too soon. In the late 1940s, the British government encouraged Caribbean immigrants to come to the UK to help rebuild post-war. These individuals were granted citizenship as members of the colonies and didn't need any paperwork to immigrate. The law changed in the '70s, but the Windrush Generation, named after the first ship to bring in new residents, were allowed to remain undocumented. In 2012, the hostile environment immigration policy was introduced, requiring landlords and employers to start checking for documentation, which led, in this case, specifically to elderly Black immigrants being targeted. The only evidence of their legal right to remain was the landing card slips from the ships with their arrival dates.

The Home Office destroyed thousands of these despite staff warnings that this would make it harder for Windrush members to stay in the country. They were destroyed because the Home Office was moving buildings and did not want to bring or digitize the files. The organization justified destroying them, ironically, as a security measure because they contained personal data, even though that data was all kept on paper in a locked basement. The affected individuals lost their jobs and their homes as they could not prove leave to remain. People lost access to healthcare, pensions, and benefits. Many were deported or refused reentry after overseas travel.

The state wrongfully detained or deported 164 people, and more of them voluntarily left after being here for decades. The government was found to have acted wrongly and was liable, with compensation payable. When we consider privacy risk alone and not the continuing value and importance of records in a social and regulatory context, we risk making pretty catastrophic missteps. Destruction is permanent, and dying in a foreign country dispossessed is also permanent. When we design software to minimize personal data holdings, we need to do so with enough regulatory and social context that we don't also deny people their rights.

Finally, as a development function, you also have a positive obligation to make life difficult for bad actors. The Network and Information Systems Directive applies to energy, transport, water, healthcare, and relevant digital services providers. Critical national infrastructure is also subject to certain obligations. The finance and nuclear sectors, if you happen to work in those, have their own obligations as well. While those more formal frameworks focus on critical and essential services, the NCSC's core guidance is intended to apply broadly to all companies and organizations. It states that you should decommission any information that's no longer used or that can't be linked to a business need.

Even if you're not technically in the scope of the NIS Directive or the supporting instruments, the directors of your company have defined responsibilities under the Companies Act, and they must act in the company's best interest to promote its success, and that includes minimizing detrimental impacts on your stakeholders and protecting the company's reputation. Data minimization, even if not specifically mentioned, is key to both of those outcomes. The reason why is that data minimization reduces bad actor success. For an attack to be considered successful by a bad actor, there must be return on investment.

The most successful cybercrime attacks generate revenue either from the affected party paying a ransom for return of high-value encrypted data or from the sale or exploitation of sensitive stolen data. Many jurisdictions are considering introducing laws that will prevent companies from paying those ransoms, so hackers are likely to turn their focus even more to stealing and reselling sensitive information over time.

In summary, privacy laws skew towards destroying data. Records laws skew towards preserving it. National security laws tilt the balance back towards disposal. Treading the line is complex, but it's very important to your organizational success. The systems that you design and build must be able to balance the tension between data risk and data value. There are even more reasons to do it.

First, deterrence. Data minimization obviously reduces the potential amount of harm when data is spilled, but it also helps discourage further attempts on the data. Ensuring sensitive data doesn't stay in the network any longer than it must by law means that spills are smaller and therefore less monetizable. When a hacker does not find commercial quantities of sensitive information, they'll be less likely to attack again. Second, response and recovery. This benefit comes almost by accident, but not quite. To do data minimization, you must first be able to identify sensitive data at scale and classify it. That means registering every data item, reading it, extracting its terms and topics, and then classifying it against security and privacy and, vitally, records retention rules.
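A skeleton of that pipeline might look like the following sketch, where the term extraction is deliberately crude and the rule predicates are stand-ins for real security, privacy, and retention schedules:

```python
# Sketch: register every item, read it, extract terms/topics, then
# classify against security, privacy, and retention rules.
import re
from dataclasses import dataclass, field

@dataclass
class Item:
    location: str
    text: str
    topics: set = field(default_factory=set)
    classifications: list = field(default_factory=list)

RULES = [
    # (rule name, predicate over extracted topics) -- illustrative only.
    ("privacy:PII", lambda topics: "passport" in topics or "fingerprint" in topics),
    ("security:official-sensitive", lambda topics: "vetting" in topics),
    ("retention:destroy-after-7y", lambda topics: "student-record" in topics),
]

def classify(item: Item) -> Item:
    item.topics = set(re.findall(r"[a-z][a-z-]*", item.text.lower()))  # crude term extraction
    item.classifications = [name for name, matches in RULES if matches(item.topics)]
    return item

doc = classify(Item("smb://share/old/scan01.txt", "vetting file with fingerprint card"))
print(doc.classifications)  # ['privacy:PII', 'security:official-sensitive']
```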

One of the most important things we can do, obviously, in the event of a successful attack, is to inform the affected parties, and that helps them take the necessary steps to protect themselves. You have to know your own data at a high degree of fidelity to be able to do that quickly. That high-fidelity data knowledge, captured ahead of a breach, means that we can know exactly who is in a spill and what data points were compromised for those people and companies. We can know what compromised data is subject to secrecy provisions under law, or will incur other penalties or require other specific actions or advisories, and that helps us respond quickly, defensively, and responsibly, which has financial and regulatory benefits as well as reputational and commercial ones.
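As a sketch of what that pre-built knowledge enables (the index shape, with people mentioned and rule classifications per item, is an assumption), a breach-response query can be as simple as:

```python
# Sketch: query a pre-built, classified index to report who is in a
# spill and which obligations attach to what was taken.
def spill_report(index: list[dict], compromised_locations: set[str]) -> dict:
    affected_people: set[str] = set()
    obligations: set[str] = set()
    for item in index:
        if item["location"] in compromised_locations:
            affected_people.update(item["people_mentioned"])
            obligations.update(item["classifications"])
    return {"affected_people": sorted(affected_people),
            "obligations": sorted(obligations)}

index = [
    {"location": "mail://hr/1", "people_mentioned": {"J. Doe"},
     "classifications": {"privacy:PII", "secrecy:health"}},
    {"location": "smb://finance/q3.xlsx", "people_mentioned": set(),
     "classifications": {"retention:7y"}},
]
print(spill_report(index, {"mail://hr/1"}))
```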

Let's talk about another story. What happens when we don't, or we can't, react and respond quickly? Uber found this out in 2018 when it had to pay a more than £100-million settlement in legal action over its purported breach coverup. In 2016, hackers had exfiltrated 57 million names, email addresses, and phone numbers, and around 600,000 drivers' license numbers, from a private area in GitHub. They asked Uber for a ransom of £75,000 to delete the data, which Uber paid, but Uber didn't tell anyone until a year later, in November 2017. In fact, the CEO didn't even find out until late 2017, and subsequently fired two of the people who'd led the incident response.

Failure to identify and notify the affected individuals cost those executives their jobs, and cost Uber £100 million and more in reputational damage. Another tangential benefit of data minimization is in your insurability and transfer of risk. Cyber insurance providers tend to be fairly opaque about how they assess cyber risk, how they subsequently decide to cover you, and how much they charge. A 2019 study found that in the security questionnaires that potential policyholders need to complete, the sections capturing types of sensitive data held and protections applied to it tend to be significantly more extensive than other sections.

The study found that organizations holding large amounts of identifying financial or health information, for example, are considered high hazard and so usually attract a higher premium. The study found that when it comes to assessing your information and data management, the most common question in this category was whether a data retention and destruction policy existed. Minimizing the amount of PII you have specifically by managing its retention lifecycle is a key step in making sure you can get good cyber insurance for the best rate, and can help make sure it actually pays out in the event of a breach.

Then, finally, minimizing data can help with organizational effectiveness and improved outcomes. In many cases, data minimization enables organizational maximization. The National Common Intelligence Application database held around 400 pieces of missed information about the man who would become the Manchester Arena terrorist bomber, killing 22 people and injuring more than 800, most of them children. The system was built by merging existing system data together into one platform, but the database was unstable and very hard to search because of mass duplications of data that made identifying crucial intelligence almost impossible.

System owners had decided not to try to deduplicate the data until the end of the migration, which would be four years later. MI5 and police officers warned supervisors that the system was not fit for purpose, but it was rolled out anyway. By the time of the bombing, up to 20% of the content in the system was duplicated. The system, bloated with redundant data, was so slow to search that they were still finding those 400 bits of missed information long after the bombing had happened. If that data had been minimized, its efficacy would have been maximized. The inquiry into the attack found that the bombing would have been prevented if the missed information had been findable originally.

Data Minimization in the Information Lifecycle

Part two, let's talk about the lifecycle of data, from data creation or capture, all the way through to eventual disposal, and everything in between. We can start by summarizing the three key elements of data minimization. Firstly, minimize the amount of data you're collecting in the first place. This doesn't just mean not collecting extraneous personal details, it also means making sure you aren't keeping duplicates, caches, offline copies on devices, or excessive backups. It means not asking for the same data twice as part of two separate functions or two separate software applications. A big part of minimizing collection comes down to processes and governance, which you don't have as much control over in the engineering function.

However, one key thing developers can provide which can help drive change in governance is the data inventory, particularly to identify sensitive and high-value data duplication and proliferation. The next element of data minimization is to minimize the number of people with access, their privileges, and the duration of their access. Again, that usually comes down to governance and process, but engineers can also take steps to drive change here. Seeing who is doing what to data, and being alerted to actions on sensitive and high-value data, helps to catch and kill malign, or just risky, behavior and privilege creep. The last element of data minimization is end-of-life management.

As we've discussed already, there's a lot more to disposal than just hard drive degaussing. There's a big process and governance role to play in managing records, policies, and retention rules. Technology professionals can influence and reform retention management by quantifying what data we have and what it's about. That has to go further than just knowing that it contains PII or knowing that it's owned by the finance team. Retention obligations aren't based on what something is or who made it, they're based on what it's about. We keep a receipt for asbestos remediation services much longer than we keep a receipt for stationery supplies, for example. That means we need to go well beyond metadata to understand our records. Our systems need to fully index the content and classify information against appropriate schedules based on its value as well as on its risk.
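The receipt example can be sketched as content-based retention; the topic list and the periods below are illustrative, not real schedule values:

```python
# Two documents of the same *type* (receipts) get different retention
# periods because of what they are *about*. Periods are illustrative.
RETENTION_BY_TOPIC = {
    "asbestos": 40 * 365,   # hazardous-materials records: decades
    "stationery": 7 * 365,  # routine financial records: ~7 years
}
DEFAULT_DAYS = 7 * 365

def retention_days(content: str) -> int:
    words = content.lower().split()
    # The longest applicable obligation wins when several topics apply.
    return max(
        (days for topic, days in RETENTION_BY_TOPIC.items() if topic in words),
        default=DEFAULT_DAYS,
    )

print(retention_days("receipt for asbestos remediation services"))  # 14600
print(retention_days("receipt for stationery supplies"))            # 2555
```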

Technology Approaches to Data Minimization

Let's get into part three. How can we meet these obligations with the systems that we build? Firstly, autoclassification. This is just the use of computers to determine what something is about, without that computer needing to be told by a person. Not just what it is, but what and whom it discusses, and what that means for organizational risk and reuse purposes. For us, we pioneered something called Regulation as Code for classifying data in an ethical, efficient way. We disrupted the first wave of automated data classification, which had relied on user intervention to determine what content was about. Our AI determines how to classify for risk and value and regulatory obligations without needing to be told, because it matches the record contents directly to the applicable rules and frameworks and policies. Here's our first good news story after those bad ones.

A large university identified that they'd been the target of a breach by a foreign state actor, the same one we've already met, actually. They needed to urgently understand what their exposure was, so the system rapidly indexed and classified the compromised data that had been spilled. It applied secrecy provisions from acts and regulations specific to that client, to determine the civil and criminal penalties they'd be facing for that breach, and a records authority to identify which of the spilled data was overdue for disposal, which also obviously goes to liability, as we saw with the ANU. The rules were applied automatically and identified the full scope of the breach, as well as the legal implications, the privacy implications, and harm to procedural integrity, and that was all achieved within 24 hours.

After a breach on Friday, they could report the quantified impacts to their board on Sunday. Breach response is just one benefit of accurate autoclassification. A second one comes back to making best use of the data that we hold. For even the most sensitive topics and use cases, it's possible to achieve real gains using autoclassification, still with explainability and still with security. In this story, the AI system was implemented to find key terms related to abuse and misconduct across multiple child protection systems in response to a government inquiry into a child predator who'd been operating in that state's healthcare system for 19 years. He was uncovered through a podcast, actually, and it led to this formal inquiry.

The autoclassification system was able to do that with 100% accuracy, and it returned 60,000 previously unidentified results, exposing multiple potential new cases requiring investigation and referral. The power of fast and accurate and importantly explainable autoclassification is to address both sides of our data tension problem: minimizing risk and maximizing value.

The second kind of tech is automated decision making. This is the process of speeding up outcomes by taking some of the human intervention out of the equation. It may or may not use AI. With something as risky as data governance, it's very important not to take people completely out of the loop. We use ADM in a decision-support model: it does the work of collating and presenting evidence in a traceable way, and a human then uses that evidence to make an informed decision. In our next story, a water utility used the ADM platform for proactive PII discovery across multiple systems, including network drives, M365, Salesforce, HiAffinity, Pega, and Livelink.

This was done in a manage-in-place model, without moving or duplicating or modifying that content and without modifying or disrupting the source systems. The system identified the PII and other sensitive data specific to that client, their controversial topics and issues, and automatically calculated the legal retention periods of those records. They were able to then use this evidence to begin to defensively dispose of the highest risk data that was found, systematically reducing their exposure.

Risks of AI and ADM for Data Minimization

Part four, the risks of using AI and ADM. There are some good outcomes that come from AI and ADM, but it's important to reflect on what can go wrong with these techs. They're not always a panacea for the earlier tales of woe that we discussed. They are essential technologies for managing risk and value at scale, but there's some key risks inherent in them. Firstly, AI can be used maliciously on purpose. Secondly, AI can accidentally apply unwanted bias. It can inadvertently create erroneous outputs or hallucinate. It can breach people's privacy and security just because of the large amounts of data that are required to use it. If there's an issue with an outcome for AI or ADM, there might be no clear line of accountability.

There are lots of examples of these things happening. Robodebt was a method of automated debt assessment and recovery which was intended to replace a manual process. It compared welfare payment records in Australia with average income data, and calculated overpayments and issued debt notices, but the algorithm was wrong. That was known within months of it going live, but it persisted. The system issued almost half a million false or incorrect debts. It caused significant harm and distress, and was subject to multiple inquiries, including a Royal Commission.

The scheme was officially scrapped in 2020, with a $1.8 billion AUD settlement ordered, including repayment of debts that had been paid in, wiping of outstanding debts, and legal costs. Closer to here, the Netherlands had the SyRI system, an algorithm that likewise flagged potential welfare fraud to investigators. That also saw backlash, and it was shut down referencing Article 8 of the European Convention on Human Rights, which is the right to privacy. In the '90s, a very similar thing had already happened here in the UK, with the Horizon scandal, where about 700 subpostmasters were accused by the automated system of theft and wrongfully prosecuted.

Many lost their homes, their life savings, and were even imprisoned. Now significant action, as you're all aware, is being taken against the leadership. The CEO had her CBE revoked by the king and had to resign from her non-executive board seats and her duties as an associate minister. Not only is the post office leadership now being held accountable 25 years later, so is the management of Fujitsu who built the system. Both these examples are not even sophisticated AI, they're just what can go wrong when we rely on machines to inform decisions, and those decisions are not transparent, explainable, or importantly, able to be contested.

That's led to obligations. Because of the likelihood and the impact of these types of harms from using these technologies, obligations are mounting for ethical AI. We've had best practices for AI for a long time, including those from UNESCO, the OECD, and the G20, and these standards were designed to protect the most vulnerable in our communities from bias, disadvantage, and harm caused by adverse outcomes. They essentially require any decisions arrived at using AI or ADM, whether AI is involved or not, to be explainable and transparent so that they can be challenged if they're unfair.

In the UK, currently, a few regulations and frameworks apply or are coming into force. The Data Ethics Framework is a planned mandatory transparency obligation on all public sector organizations using algorithms that have an impact on significant decisions affecting individuals. The Data Protection Act already allows automated decisions to be challenged by affected stakeholders. The Understanding AI Ethics and Safety FAST Track Principles are a best practice intended to ensure fairness, accountability, sustainability, and transparency. Essentially, AI systems should be fully answerable and auditable.

Organizations should be able to explain to affected stakeholders how and why a model performed the way it did in a specific context. There's currently no ethical AI law, but there's a proposed artificial intelligence regulation bill that's just received a second reading in the House of Lords. The EU's AI Act regulates data quality, transparency, human oversight, and accountability, based on the risk classification of the system. GDPR Article 22 also states that algorithm-based decisions which produce legal effects may not be based solely on automated processing of data. There's also a supporting convention for the protection of individuals with regard to automatic processing of personal data, which defines rights for persons affected by AI and ADM. We have an AI explainability hub on our website, if you want to dive deeper into ethical AI and your obligations from a data minimization perspective.

Data Minimization Best Practice Checklist for Cyber Professionals

Let's move to the checklist for the final part of our session. As developers, you've inevitably, over the last few years, been focusing more on this concept of shifting left. Security needs to be an inherent part of software quality, and that means that you need to be security experts in a way, and not just in how to launch OWASP ZAP and run it. You need to understand the reasons for data minimization security practices, and design software outcomes that achieve the business case for security in the most effective, efficient, and sustainable way.

First, you need to understand data minimization in your organizational context. Initially, that means the threat environment. Understand the threats facing you and facing the data that you manage in your systems. Threat actors range from foreign states, to cybercriminals, who make up the bulk of successful attackers, through to hacktivists and disgruntled employees. Understanding the types of bad actor and what benefit they'd get from taking your data helps make a clear case for which data to focus on minimizing first. The NCSC and NPSA websites are a good place to start for this.

Initially, you need to review threat assessments, external and internal, and map them to your data types and your assets. Next is governance models. Governance includes leadership, which is your executive and your board; strategy, which is your governance framework and goals; and oversight, which is going to be the data governance program in your organization. You need to make sure that product is a key part of this model. To effect data minimization, you have to work closely with the parts of the organization responsible for the data obligations. Establish formal engagement with your governance teams on data.

Next, we need a basic understanding of our obligations. Firstly, privacy. Your data governance model, once you're involved in it, will become both an upstream and a downstream dependency on your software and engineering practice. You need to have a basic understanding of privacy law, particularly as it pertains to secure code development. Research your obligations, not just for your jurisdiction, but also for the type of data you hold and what types of people it's about. UK GDPR has a principle specifically for data protection by design and default, also known as privacy-by-design coding principles. Apply privacy-by-design in every sprint.

From a records management point of view, in most places, records governance has traditionally been limited to only core record keeping systems, with retention and disposal treated as an afterthought in most line-of-business solutions. You need to build information lifecycle management by design into your software. The systems I've told you about that were over-retaining data were not the formal records management systems of those organizations. Most of the organization does not use the formal records management system. Build records registration, classification, sentencing, and disposal into all your systems that hold sensitive and high-value data.

Then, finally, whether you're a national security agency or not, you have an obligation to protect sensitive and high-value information from adversaries. The APT we discussed before has amassed probably the world's largest database on its adversary, the USA, not just by stealing the OPM data that I told you about, but by stealing and combining the Anthem health insurance, Equifax credit monitoring, and Marriott Hotels datasets, among others. Foreign state actors don't just target national security agencies. If you have data about people, or places, or intellectual property, they want it too. You need to apply defense-grade national security coding practices, like OWASP's, when you develop your software, whoever you are. If you're at this conference, you need it.

Now for making the business case. The average per-record cost of a data breach was $165 USD in 2023, and that can really show at scale. The average cost of a mega breach of 50 million or 60 million records in 2023 was $332 million USD. A lot of those costs come from the effort it takes to identify and then contact affected individuals. Detection and escalation represent the largest share of breach costs, at an average total cost, just for that activity, of $1.58 million per breach. The less data you have, the faster it is to detect unauthorized activity on that data, the simpler it is to determine who is affected by the breach and against what obligations, the cheaper it will be when you experience one, and the cheaper it will be to get insurance for one. You can calculate the likely cost of a breach based on the number of records you hold using those established metrics. Price your own database in terms of the impact.
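For instance, using the quoted $165 per record as a linear first approximation (the mega-breach figure above shows real costs don't scale linearly, and the record count below is invented):

```python
# Pricing your own database with the 2023 per-record breach cost metric.
COST_PER_RECORD_USD = 165

def breach_cost(records_held: int) -> int:
    return records_held * COST_PER_RECORD_USD

# e.g. a system still holding 2 million records that are overdue for disposal:
print(f"${breach_cost(2_000_000):,}")  # $330,000,000
```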

Here's another quick story about some of the tangential benefits. A government health department reported significant time and cost savings, up to 98%, for eDiscovery activities. They compared the system with a human actor on classification and found it was 100% accurate compared to 25% accuracy for the human. They found cost savings of about 97.5% per annum for legal discovery, 97% per record for identifying redundant items, duplicates that are taking up space for you, and 99.8% per event for reporting and auditing. The same autoclassification capability that you use to reduce risk can also realize material financial and operational benefits across teams. Being able to understand and articulate how your data minimization development can positively affect stakeholders across the business is really vital for achieving executive support for investment and innovation. Map those potential cross-functional benefits for your planned data minimization investment or development, and quantify them wherever you can.

Next, how to implement. The only way to know how much data you've collected about individuals or high-risk topics and where you're keeping it is to audit your entire environment. Data is not just kept where it's supposed to be kept. It's common that multiple systems or processes are duplicating the same information. It's common that data is being shared post-collection in ways it was never planned to be. Comprehensive data inventories used to be impossible to achieve, but now with effective AI, they're actually very straightforward.
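As a toy illustration of what such an inventory captures (filesystem only; real coverage needs connectors for mail, databases, chat, and so on, and every field here is an assumption), consider:

```python
# Sketch of an automated asset inventory walking a filesystem, capturing
# location, attributes, and content, per the talk's checklist.
import hashlib
from pathlib import Path

def build_inventory(root: str) -> list[dict]:
    inventory = []
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        data = path.read_bytes()
        inventory.append({
            "location": str(path),
            "size_bytes": len(data),
            "modified": path.stat().st_mtime,
            # A content hash makes duplicate copies of the same data
            # visible, the first step in spotting proliferation.
            "sha256": hashlib.sha256(data).hexdigest(),
        })
    return inventory

items = build_inventory(".")
dupes = len(items) - len({i["sha256"] for i in items})
print(f"{len(items)} items, {dupes} duplicate copies")
```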

Once you have one, you can identify data flows and duplication, without which you can't implement effective data minimization. Develop an automated asset inventory of every document, email, database row, web page, and chat message in every system on every platform, including its location, its attributes, and, importantly, its content. In terms of minimizing access, the only way to know who's accessing your sensitive and high-value data is to monitor it. You can design systems to have minimal access at the outset, but privilege creep happens when staff move teams or roles. Logging who's accessing content helps detect early signs of breaches, and also helps plan for and enact credential revocation that might otherwise be missed. Implement monitoring of user activity on sensitive and high-value data, at least.

Ensure that downloads, shares, deletions, and downgraded security markings, for example, are captured for later review. It's hard to notice aberrant access just from looking. In our own environment, because we run our system over our other systems, we capture and report thousands of data interactions every day, probably tens of thousands, but it only alerts us to the important ones. Make sure that the systems you develop can proactively alert on data activity, specifically activity that might lead to data proliferation, precipitous destruction, or unauthorized access.
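A sketch of that alerting filter, with an assumed event shape and an illustrative action list:

```python
# Flag only events suggesting proliferation, precipitous destruction,
# or unauthorized access, rather than surfacing every interaction.
RISKY_ACTIONS = {"download", "share", "delete", "downgrade_marking"}

def should_alert(event: dict) -> bool:
    return event["action"] in RISKY_ACTIONS and event.get("sensitivity") == "high"

events = [
    {"user": "alice", "action": "read", "sensitivity": "high"},
    {"user": "bob", "action": "download", "sensitivity": "high"},
    {"user": "carol", "action": "delete", "sensitivity": "low"},
]
for e in events:
    if should_alert(e):
        print(f"ALERT: {e['user']} performed {e['action']} on sensitive data")
# Only bob's action is surfaced; routine reads are still captured
# for later review, just not alerted on.
```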

Next up, records management. It's not one and done. Information changes over time: its content and its context change, and users interact with it. Every time that happens, both the applicable retention rules and the disposal trigger date are likely to change, so your records management capability needs to detect and adapt to these changes and, in the end, alert as soon as a record can be disposed of under law. Ensure your records management capability is dynamic, and that it manages records for their whole lifecycle.
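One way to keep disposal dates dynamic is to derive them on every evaluation rather than storing a one-off decision; the rule table and trigger events below are invented:

```python
# Sketch: the disposal date is always derived from the current rule and
# the current trigger event, so it adapts when either changes.
from datetime import date, timedelta

RULES = {
    "student-record": ("last_enrolment", 7 * 365),
    "vetting-file":   ("clearance_expiry", 10 * 365),
}

def disposal_due(record_class: str, triggers: dict[str, date]) -> date:
    trigger_name, retain_days = RULES[record_class]
    # Re-evaluated on every change: a new trigger event pushes the date out.
    return triggers[trigger_name] + timedelta(days=retain_days)

due = disposal_due("student-record", {"last_enrolment": date(2008, 12, 1)})
print(due, "overdue" if due < date.today() else "retain")
```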

Finally, autoclassification and ADM. Autoclass only works when it's full coverage. All of your data can have risk, and all of it can have value. Autoclassification needs to run across structured and unstructured data, on-prem and cloud, across any type of system, including legacy and bespoke systems. Plan for rollout of data minimization capability to new planned systems, and plan for retrofit to existing systems. Prioritize on a risk basis. It also needs to have no barriers to adoption.

Autoclass that relies on user tagging and labeling, naming conventions, file plans, rules engines, or machine learning that has to be trained and supervised for every rule, can't succeed at scale. As well as avoiding impacts on users and governance teams, autoclassification software has to avoid network performance impact and impacts on source systems that create technical debt or cascading technology risk, such as using agents, or connectors, or requiring customizations. Finally, there needs to be no impact on source data.

Duplicating or moving data into a data lake or a second system actually ends up increasing the threat surface and halving the discoverability. Data should be autoclassified in place in its secure systems without being modified in the process. Implement data minimization capability in a way that does not introduce tight coupling between technologies or affect the source data. Design and develop an interface model rather than an integration model. Finally, ethics. Without AI and ADM, there's just no way we can attempt to manage this information properly. The volume, variety, and velocity of content is too high for humans to govern.

AI is really essential for information protection and exploitation, but the AI has to be ethical. Decisions about what we share, what we protect, what we preserve, and what we destroy can have serious detrimental outcomes for individuals if we get them wrong, and that makes them high risk from an ethical AI regulatory point of view. The AI you use can't be closed box, like neural networks, supervised ML, or LLMs; it has to be clear box. The Assessment List for Trustworthy AI is a wizard-driven tool created by the European Commission that will give you a picture of how your systems and processes would comply. Review the ethics of your planned AI and ADM before you deploy it, using tools like the ALTAI self-assessment, and ensure sufficient human oversight at all stages of development and use.

Then, lastly, what we establish we have to maintain. It's not enough to design systems to support a data minimization model. You have to ensure that they continue to be effective in motion. Continually grooming the environment to remove data that carries risk but no longer has value helps to minimize the success of foreign state or sophisticated criminal operations, who 100% have the capability to breach you if they decide to. Implement a capability to proactively destroy or de-identify personal and other attractive data on a daily basis, if possible, as soon as it comes due for disposal. Don't let privacy be the blunt axe by which you determine the lifespan of data.
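A sketch of that daily grooming job, keeping SME sign-off in the loop (the inventory shape and field names are assumptions carried over from the earlier sketches):

```python
# Daily grooming: destroy or de-identify anything whose legal retention
# has expired, and keep evidence of the action.
from datetime import date

def daily_grooming(inventory, today=None):
    today = today or date.today()
    evidence = []
    for item in inventory:
        # SME sign-off keeps a human on the final decision to destroy.
        if item["disposal_due"] <= today and item["sme_approved"]:
            item["status"] = "destroyed"  # or effectively anonymized
            evidence.append({"location": item["location"],
                             "disposed_on": today.isoformat()})
    return evidence

inventory = [
    {"location": "db://crm/rows/1441", "disposal_due": date(2023, 1, 1),
     "sme_approved": True, "status": "live"},
    {"location": "smb://share/report.docx", "disposal_due": date(2031, 6, 30),
     "sme_approved": True, "status": "live"},
]
print(daily_grooming(inventory))  # only the overdue CRM row is destroyed
```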

While there is privacy legislation requiring you to destroy personal data as soon as possible, the exception within the allowed uses is where it has continuing value to the business, and that value can be archival or social or just operational. Make sure that all types of retention policy are considered and overlaid so that records aren't deleted prematurely by triggers in the system as you build it. Ensure that the appropriate SMEs are involved in the final decision to destroy.

Finally, make sure your systems, as designed and built, can help the organization respond if you're breached, not hinder it. Plan for a breach. Game out how you would be able, from the backend or the UI, to report quickly to the executive on who was in your spill, what exactly was taken, and what laws were broken in the process. Don't forget to tell the CEO, if you happen to work for Uber. Following this checklist can help you implement effective data minimization. You have an obligation to do it, and it's easier than you think. It can be done without detrimental impacts. It's just the right thing to do for your stakeholders and your community.

 


 

Recorded at:

Dec 05, 2024
