Transcript
Ransbury: My name is Alyssa Ransbury. I'm going to talk about how we're protecting user data via extensions on metadata management tooling at Square. I'm currently a security engineer on the data security engineering team at Square. I work on libraries and services that enable strong data security practices across the data lifecycle. Prior to my current position, I spent almost two years on our sister team, privacy engineering. There, I worked on laying the groundwork for Square's current data inventory tooling. I also spent a lot of time thinking about compliance and how metadata could be used to help guide development of preventative data security tooling.
Outline
In the first part, I'll start broadly with an overview of metadata management tooling. In part two, I'll talk a little bit about how we use this type of tooling at Square. Then I'll transition into talking about some very specific work my team and I have done to use metadata to prevent data leaks via printed protocol buffers.
Metadata
To get started, let's go back to the basics. When we talk about metadata, what do we mean? In a basic, obvious sense, it's data about data. There's nuance and complexity in how we can mentally compartmentalize this information. When we talk about data, we can mean a single item, an aggregate of many items, or an entire database. At each of these three levels, we can expect the data to have three main features. Content: what the data contains or is about. Context: the who, what, when, where, and how associated with the data's creation. Structure: a formal set of associations within or among data. I've also seen this broken down into three other categories. Structural: the formatting and storage of the data. Descriptive: the description, usage, and context of the data. Relationship: the linkages or lineage of the data.
Why Care About Metadata?
Why should we care about metadata? First, over the last few years, we've experienced an increasing need for data governance to help manage regulatory and compliance requirements, while still enabling teams to use data safely, what some call data enablement. Laws like GDPR and CCPA have given companies the opportunity to review their data holistically from a data privacy and protection perspective. Metadata can be used to guide whether internal data can be used or shared. Rather than blanket denying or allowing certain data use patterns, companies can use metadata to allow specific actions in regards to specific data, while disallowing or blocking everything else. Second, metadata can provide increased business value for a company. Extra contextual information can help data teams choose the most high quality data, leading to more trustworthy and accurate analytics. Last, metadata can help companies mitigate risk. If we know who the data is about, how it's secured, and what kind of data it is, we can insert additional controls at various stages of the data lifecycle to ensure that it is handled properly.
Metadata Management Tooling
We know what metadata is, and we know why it's useful. Now let's talk about tooling built specifically to help with metadata management. At a high level, these tools do a couple of things. They oversee collected data across its lifecycle and help us track associated metadata. They also provide a common interface for people to interact with metadata, collaborate, and effectively manage their work. They link automated metadata and human added metadata effectively.
Common Capabilities
More specifically, metadata management tools often provide some or all of the capabilities listed on this slide. I'll run through what each of these means quickly. Even if you don't choose a metadata management tool that provides each of these things, it's possible to mix and match. First on this list is data inventory. Data inventory is the act of ingesting and translating data so that we can understand and store answers to questions like who, what, where, when, and how. Data enrichment is the act of making the data more meaningful. This could mean writing code to automate gaining a deeper understanding of the data, like checking actual data values, and adding a PII tag if it's a phone number in plaintext, so that the data is subject to more rigorous privacy controls. Data lineage is the act of understanding the origin of data, making it easier to identify downstream effects of changes. For example, data lineage would help us understand which tables were derived from some parent table. If we have metadata stored about the parent table, we can make educated assumptions about what's in the child tables.
Active metadata management is when we augment the metadata we have with human knowledge, and do things like use metadata to drive automation with machine learning. User experience can be a really important element of metadata management tooling. Storing metadata is only useful if it is usable. The best metadata management tools offer an efficient way for people in various roles across an organization to interact with and use metadata to work faster and more effectively. Business semantics are variations in terminology across teams. A good metadata management tool provides an easy way to link data that is referred to slightly differently across the company. Business rules should also be made easily visible and accessible. Rules should be tied to actual data so that it's clear what pieces of data the rules do and do not apply to. Depending on use case, and consistent with our internal and external policies governing privacy, it's sometimes also useful to exchange metadata with third party tools. Last, a good metadata management tool will provide support for security and privacy by making it easy to visualize and manage rules and policies for specific data.
Ecosystem Overview
Over the last few years, we have seen metadata management solutions come to market with increasing maturity. Today, there are a multitude of paid, open source, and in-house examples. I've included some logos in each of these three categories. This is not a full list by any means. The in-house logos I'm including here are for companies who have written about their strategies for metadata management.
Initial Drivers
That was the overview. Now let's jump into how we handle metadata management at Square. I'm on the data security engineering team. This talk skews towards security rather than business intelligence or data analytics. I'm going to speak only to how my team uses tooling to understand and manage metadata. We had a couple of initial drivers that pushed our security team in this direction. When I first started at Square, we relied mostly on manual work by individual teams to understand our data, but we had a lot of data and it was only growing. We wanted to be able to scale and automate insights into what data we stored, collected, and processed. This would allow us to not only continue to meet legal requirements from laws like GDPR and CCPA, but also aim for broader goals related to data privacy at Square. We ended up forking Amundsen, an open source project originally developed by Lyft, and became one of its early users. We added a lot of custom functionality to support metadata that could power privacy and data security initiatives. We also added some functionality that we ended up contributing back to Amundsen. In particular, we added support for using AWS Neptune as a backing store. We introduced three new metadata types through our Amundsen fork, with the express purpose of affording greater privacy protection for our data. We introduced PII semantic type, data subject type, and data storage security.
PII Semantic Type
PII semantic type is a set value describing the general contents of data in a column with the focus on discovering data that fits into one of three buckets. Sensitive by itself, could be sensitive when taken together with other data, or it links to sensitive data like an internal identification token. Our goal is not to categorize every possible type of data. With this metadata, we only wanted to categorize potentially sensitive data and bucket everything else as not PII. This information would allow us to better understand individual data for our specific privacy purposes. We developed a list of possible PII semantic types based on Square's data policy, plus conversations with our legal team and product teams. We developed values that were specific enough to be useful, but broad enough that we didn't end up confusing ourselves and our teammates with hundreds of different options. If two different pieces of data should be handled the same way, for example, a cell phone and a home phone, then they should be the same PII semantic type, phone number.
Data Storage Security
The next metadata type we introduced was data storage security. Examples of this could be plaintext, encrypted, or adapted. This piece of metadata specifically refers to ways the data has been manipulated to improve its security. If we had a column that held the last four digits of some sensitive information, the PII semantic type would describe the type of sensitive data. Since we were only talking about part of the original data, the data storage security value would be truncated.
Data Subject Type
Data subject type is the third metadata type we introduced. It describes the type of people the data could be about. We worked with product teams across Square to define data subject types that were specific enough to be useful and broad enough to be easy to use and apply. If we take the example from the last slide, some truncated sensitive data, we'd also want to understand what type of user this data refers to. If we had a product called Test product, the data subject type would be test product user. Just to clarify here, we expect that some columns can have multiple PII semantic types and data subject types. The column in this example may actually hold data for two different products, which would be stored as two separate relationships between the column and two separate data subject types.
Example
In this example, two columns hold data that is the same PII semantic type for the same type of user, but one is stored in plaintext, and one is encrypted. With this setup, we can now write a simple query to uncover which data we deem to be less protected based on our risk model, and make quick changes as necessary. We can also write queries that tell us which datastores contain the most sensitive data. These queries are preventative. They help us increase awareness, reduce surprises, and allow us to provide greater security around these hotspots.
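As a rough illustration of the kind of query described here, the following is a minimal sketch in Python rather than the actual graph query used at Square. The `ColumnMetadata` record, the value names such as `PHONE_NUMBER` and `PLAINTEXT`, and the sample rows are all assumptions made for the example. It also flattens each column to one PII semantic type and one data subject type, whereas the talk describes these as separate relationships.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnMetadata:
    """Hypothetical flattened view of one column's privacy metadata."""
    table: str
    column: str
    pii_semantic_type: str      # e.g. "PHONE_NUMBER" or "NOT_PII"
    data_subject_type: str      # e.g. "TEST_PRODUCT_USER"
    data_storage_security: str  # e.g. "PLAINTEXT", "ENCRYPTED", "TRUNCATED"


def less_protected(columns):
    """Return columns that hold sensitive data but are stored in plaintext."""
    return [
        c for c in columns
        if c.pii_semantic_type != "NOT_PII"
        and c.data_storage_security == "PLAINTEXT"
    ]


def sensitive_columns_per_table(columns):
    """Count sensitive columns per table to surface possible hotspots."""
    return Counter(c.table for c in columns if c.pii_semantic_type != "NOT_PII")


# Made-up sample data: same semantic type and subject, different storage security.
COLUMNS = [
    ColumnMetadata("accounts", "phone", "PHONE_NUMBER", "TEST_PRODUCT_USER", "PLAINTEXT"),
    ColumnMetadata("accounts_v2", "phone", "PHONE_NUMBER", "TEST_PRODUCT_USER", "ENCRYPTED"),
    ColumnMetadata("accounts", "created_at", "NOT_PII", "TEST_PRODUCT_USER", "PLAINTEXT"),
]

for col in less_protected(COLUMNS):
    print(f"{col.table}.{col.column} is {col.pii_semantic_type} stored as plaintext")
```

What counts as "less protected" depends on the risk model; the plaintext-only rule here is just the simplest stand-in.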
Flagging Possibly Sensitive Data Locations
Aside from the additional metadata types, we also tried to find ways to automate data inventory where possible. We have a lot of data, and we still face the tall challenge of keeping the information up to date and ensuring that information given to us by data owners remains correct over time. New data sources also get created all the time. Reducing the burden on engineers and analysts at Square who own a data source to properly annotate their tables when they're created can lead to less work for them and better metadata for us. Mistakes happen. We take privacy quality control very seriously, which means a lot of double checking. While a data owner might tell us that something is not sensitive, if the name of the column is user phone number, it might make sense to take another look. Checking the first few rows of a randomly named column could also reveal sensitive data that someone missed.
Just to give a really rough example. Let's say we have a table column and it has the data storage security type of encrypted. I'm a data user and I know that this recently changed. I know that we're actually storing this information truncated instead of encrypted. To reduce manual work and with future scale in mind, we introduced a concept of being able to flag a column if metadata is missing or incorrect. Flags are meant to highlight data storage locations that we've judged as likely to hold sensitive data, but that are possibly missing the correct metadata. They can be created both manually by humans and through an automatic process.
Flags
What exactly is a flag? A flag is a piece of metadata linked to some specific data. It contains information about the metadata type it relates to, like PII semantic type or data storage security; the possible value, which might be tokenized, if we use our example from the last slide; and a human-readable description. We also store information about the reason the item was flagged, the person who eventually reviewed the flag, and the result, which would be true or false depending on whether the reviewer decided the flag was correct or not. This allows us to perform checks to ensure that flags were resolved appropriately by authorized reviewers. This also allows us to run queries on our database to understand the accuracy of our flagging system over time.
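To make the shape of a flag concrete, here is a minimal sketch of a record with the fields listed above, plus one way an accuracy query could be expressed. The class name, field names, and the `flag_accuracy` helper are assumptions for illustration, not Square's schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Flag:
    """Hypothetical flag record; field names are illustrative only."""
    data_location: str              # e.g. "db.table.column"
    metadata_type: str              # e.g. "PII_SEMANTIC_TYPE"
    possible_value: str             # the value the flag proposes
    description: str                # human-readable description
    reason: str                     # why the item was flagged
    reviewer: Optional[str] = None  # who eventually reviewed the flag
    result: Optional[bool] = None   # True = flag correct, False = incorrect, None = unreviewed


def flag_accuracy(flags):
    """Fraction of reviewed flags that reviewers confirmed as correct."""
    reviewed = [f for f in flags if f.result is not None]
    if not reviewed:
        return None  # nothing reviewed yet, so accuracy is undefined
    return sum(1 for f in reviewed if f.result) / len(reviewed)


flags = [
    Flag("db.users.phone", "PII_SEMANTIC_TYPE", "PHONE_NUMBER",
         "Column name matches a phone keyword", "keyword_match", "alice", True),
    Flag("db.users.email_status", "PII_SEMANTIC_TYPE", "EMAIL_ADDRESS",
         "Column name partially matches an email keyword", "keyword_match", "bob", False),
    Flag("db.users.notes", "DATA_STORAGE_SECURITY", "PLAINTEXT",
         "Sampled values look unencrypted", "value_sample"),
]
print(flag_accuracy(flags))
```

Storing the reviewer next to the result is what makes both the audit check and the accuracy query possible from the same record.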
Automatic Checks
We worked with product teams to understand basic heuristics people looked for when they applied our new metadata types to some data, and came up with automated checks. After receiving the first set of metadata from our data sources, in this case, table schema information, we run it through a first level of scanning, looking for low-hanging fruit. We ask questions like, is the column name an exact or a partial keyword match for a value we expect? We keep a list of keywords associated with various PII semantic types. In this system, a column named first name is an exact match for one of our person name keywords, while a column named account first name or name would return a partial keyword match. Does the column type match what we expect? A column named email status triggers a partial keyword match for email address. However, since its type is Boolean, and we only expect a string for this PII semantic type, we don't flag it. Is the column name literal PII? There are some cases where data gets ingested from a spreadsheet, and it's technically possible in a very edge case that the headers are actually the first row of data.
Is the table name an exact or a partial keyword match for a value we expect? A table called some sensitive data type sends a good signal that the data in the table might be sensitive, even if the columns aren't named in a way we expect. We also have a job that samples values from specific columns we're interested in, and sends the actual data to the Google DLP API. If this API tells us that the values are likely sensitive, we flag the column. If not, we tag the column with the date it was last checked against this API, so that we can wait before checking the same column again. We'll still flag the column based on the answers to the questions we asked earlier. This check gives us additional signal into whether something is sensitive, whether or not we answered yes to an earlier question. If some things match up exactly, we actually skip flagging altogether and set some additional metadata instead.
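The column-name and column-type questions above can be sketched roughly as follows. This is a simplified illustration: the keyword lists, the expected-type table, and the rule that a partial match means "a keyword appears inside the column name" are all assumptions, and the real heuristics are likely richer.

```python
# Hypothetical keyword lists per PII semantic type.
KEYWORDS = {
    "PERSON_NAME": {"first_name", "last_name"},
    "EMAIL_ADDRESS": {"email"},
}

# Hypothetical column types we expect for each PII semantic type.
EXPECTED_TYPES = {
    "PERSON_NAME": {"string"},
    "EMAIL_ADDRESS": {"string"},
}


def keyword_match(column_name, column_type):
    """Return (pii_semantic_type, "exact" | "partial"), or None if nothing matches.

    A match is ignored when the column type is not one we expect for that
    PII semantic type, e.g. a Boolean column named email_status.
    """
    normalized = column_name.lower()
    best = None
    for pii_type, keywords in KEYWORDS.items():
        if column_type.lower() not in EXPECTED_TYPES[pii_type]:
            continue  # type mismatch: do not flag
        if normalized in keywords:
            return (pii_type, "exact")
        if best is None and any(k in normalized for k in keywords):
            best = (pii_type, "partial")
    return best


print(keyword_match("first_name", "string"))
print(keyword_match("account_first_name", "string"))
print(keyword_match("email_status", "boolean"))
```

The same function could be pointed at table names for the table-level check; sampling values and sending them to an external inspection API is a separate step that this sketch does not attempt.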
Example - User Flow
In this example, a column was flagged for two different things. A data owner reviewed these flags and accepted that the data storage security is plaintext, but denied that the data is a phone number. After a data owner reviews a flag, we set the metadata appropriately in the background. In this case, we set the accurate data storage security value for this column. The reviewed flags contain a history of actions taken to verify metadata by individual Square employees.
Mitigating Risk - Protocol Buffers/Protos
In this next section, I'll talk about how we applied flagging and those new metadata types to address a different kind of data altogether. If you don't work regularly with protocol buffers, or protos, I'll give you a quick overview. Protocol buffers are a language-neutral, platform-neutral mechanism for serializing structured data. They're like XML but with schemas. In structured data like JSON, YAML, and XML without schemas, data can take any structure and you're on your own to make sure a specific instance matches what you expect. Here's an example proto message called Person. It contains three typed fields. If you use this proto schema for a message, you could use a compiler to generate a language-specific implementation to serialize, deserialize, and represent the message. For example, a Java class. Before my time at Square, someone added a special annotation to protocol buffers so that engineers could annotate when a field contained something that should be redacted when printed or logged. Square has maintained a fork of the proto compiler for over five years to ensure that we actually honor this special annotation. Our version of the compiler modifies the generated proto message code to respect the annotation. Any time a proto is printed, the fields with a redacted annotation are obfuscated. This comes into play, in particular, when protos are printed for logging purposes.
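The effect of a redaction annotation on printing can be mimicked in plain Python. This is only an analogy and does not use protocol buffers or Square's compiler fork: the `Person` fields, the `redacted` metadata key, and the `[REDACTED]` placeholder are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field, fields


@dataclass(repr=False)
class Person:
    """Stand-in for a generated message class with three typed fields."""
    # The "redacted" metadata key plays the role of the redaction annotation.
    name: str = field(metadata={"redacted": True})
    id: int = field(metadata={"redacted": False})
    email: str = field(metadata={"redacted": True})

    def __repr__(self):
        # Obfuscate annotated fields whenever the object is printed or logged.
        parts = []
        for f in fields(self):
            if f.metadata.get("redacted"):
                value = "[REDACTED]"
            else:
                value = repr(getattr(self, f.name))
            parts.append(f"{f.name}={value}")
        return f"Person({', '.join(parts)})"


person = Person(name="Ada", id=7, email="ada@example.com")
print(person)  # Person(name=[REDACTED], id=7, email=[REDACTED])
```

The field values stay intact in memory; only the printed representation changes, which is why the approach targets logging specifically.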
Adding Redaction Annotations
We wired up our metadata management tooling to ingest proto definitions the same way we ingested other types of data, and ran the definitions through the same flagging logic I described before. The one change we made was that if a proto message field already had a redaction annotation, we automatically resolved flags for the data storage security type. We knew based on the annotation that the type would be redacted. With a fully flagged set of proto schemas, we now had the ability to guess when a field might be missing a redaction annotation. If a field was flagged as being anything other than not PII, and had either an unknown or plaintext data storage security, we could guess that it was missing a redaction annotation. At this point, our flagging logic also hadn't been seriously tested or challenged. We were adding flags to different data locations, but we didn't have an effective way of tweaking our flagging logic other than to keep an eye on how flags were getting resolved. Since this was a new feature, we didn't get a lot of feedback.
Our mission here was twofold. We needed a way to add redaction annotations to protos defined in codebases across Square without requiring too much effort from product teams. Relatedly, we did not want to burn anyone out with false alarms from faulty flagging logic. We needed to make sure our flags had a very low false positive rate. The result was a strong customer-first approach to design and rollout. We started with five fields that our logic told us were probably missing an annotation. I handcrafted PRs for each field and manually followed up with code owners to interview them about a potential future automated process. We had a test that checked the false positive rate for a set of data with each change to our flagging logic, and we continued to drive the false positive rate down with each PR. Once our false positive rate was low enough, and it felt like PRs were useful, we wrote code to automatically generate PRs on offending repos.
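The rule for guessing a missing redaction annotation can be written down as a small predicate. The sketch below is an assumption-laden illustration: the `ProtoField` record and the string values `NOT_PII`, `UNKNOWN`, `PLAINTEXT`, and `REDACTED` are hypothetical names, not the values stored in Square's tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProtoField:
    """Hypothetical metadata record for one proto message field."""
    message: str
    name: str
    pii_semantic_type: str       # flagged type, or "NOT_PII"
    data_storage_security: str   # "UNKNOWN", "PLAINTEXT", "REDACTED", ...
    has_redaction_annotation: bool


def resolved_storage_security(f):
    """An existing redaction annotation settles the storage security value."""
    return "REDACTED" if f.has_redaction_annotation else f.data_storage_security


def likely_missing_redaction(f):
    """Flagged as some kind of PII, and storage security is unknown or plaintext."""
    return (
        f.pii_semantic_type != "NOT_PII"
        and resolved_storage_security(f) in {"UNKNOWN", "PLAINTEXT"}
    )


FIELDS = [
    ProtoField("Person", "email", "EMAIL_ADDRESS", "UNKNOWN", False),
    ProtoField("Person", "name", "PERSON_NAME", "UNKNOWN", True),
    ProtoField("Person", "id", "NOT_PII", "PLAINTEXT", False),
]

candidates = [f"{f.message}.{f.name}" for f in FIELDS if likely_missing_redaction(f)]
print(candidates)  # ['Person.email']
```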
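A test of the kind mentioned here could look something like the sketch below. It assumes a small hand-labeled dataset and uses the standard definition of false positive rate, FP / (FP + TN); the talk does not say which definition or threshold was used, so both the formula choice and the `MAX_FALSE_POSITIVE_RATE` value are assumptions.

```python
def false_positive_rate(predictions, labels):
    """FP / (FP + TN): the share of truly non-sensitive fields that were flagged.

    predictions[i] is True if the flagging logic flagged field i;
    labels[i] is True if a human confirmed field i is actually sensitive.
    """
    false_positives = sum(1 for p, l in zip(predictions, labels) if p and not l)
    true_negatives = sum(1 for p, l in zip(predictions, labels) if not p and not l)
    negatives = false_positives + true_negatives
    return false_positives / negatives if negatives else 0.0


# Hypothetical threshold; the real bar would come from experience with reviewers.
MAX_FALSE_POSITIVE_RATE = 0.4

predictions = [True, True, False, False]
labels = [True, False, False, False]

rate = false_positive_rate(predictions, labels)
print(f"false positive rate: {rate:.2f}")
if rate > MAX_FALSE_POSITIVE_RATE:
    raise SystemExit("flagging logic change raises the false positive rate too much")
```

Running a check like this on every change to the flagging logic turns "did we get noisier?" into a regression test rather than something discovered through annoyed reviewers.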
Automated PR Flow
We created a job that would query our metadata using the criteria I mentioned earlier. It would then check the database to see if we already had tried to fix that proto field in Git. If there was no existing branch, the job would create a branch, update the appropriate files, make a commit, and open a pull request. If there was an existing branch and the branch was closed or deleted, the job would update the PR status in our database, and note that data owners had decided the flag was incorrect. Data owners also often left useful feedback comments for us on the fields that they thought were flagged incorrectly. If the PR was still open, the job would comment a reminder if no one had taken an action in some amount of time. The job was also smart. If the contents of our database had changed since the PR was opened, it would update the existing PR to include additional code changes. We parsed and updated proto files in Python using an EBNF grammar.
There were some challenges with this. For example, it was usually not enough to just add the redaction annotation. To prevent red CI builds, we had to make sure helper files were updated correctly, and that the file had the correct imports to support the redaction annotation. On top of this, we ran into challenges in making sure that we could properly parse and update fields that already had annotations or were multi-line. We also had to work to support both proto2 and proto3 files. We stress tested this parsing code on all protos at Square and adjusted our grammar until we had full coverage. We also eventually expanded into adding redaction annotations on protos defined in language-specific files like Golang. We started automating four to five PRs per week. We didn't generate more PRs until we had closed out the ones that were open. For each of these PRs, we followed up manually with teams and made incremental improvements to the design each week. Improvements included updates to the PR descriptions, changes to the flagging logic, and tweaking the query we had for finding fields missing redaction.
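The branching described above can be summarized as a small decision function. This sketch only models the decisions, not the Git or code-hosting calls; the `PullRequestRecord` fields, the status strings, and the seven-day reminder default are assumptions, since the talk only says "some amount of time".

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet, Optional


class Action(Enum):
    OPEN_PR = "create branch, commit changes, open pull request"
    MARK_REJECTED = "record that owners judged the flag incorrect"
    UPDATE_PR = "push additional changes to the existing pull request"
    REMIND = "comment a reminder on the pull request"
    WAIT = "nothing to do right now"


@dataclass(frozen=True)
class PullRequestRecord:
    """Hypothetical row tracking an earlier attempt to fix a proto file."""
    status: str                       # "open", "closed", or "merged"
    days_since_activity: int
    proposed_fields: FrozenSet[str]   # fields the PR currently annotates


def next_action(record: Optional[PullRequestRecord],
                wanted_fields: FrozenSet[str],
                reminder_after_days: int = 7) -> Action:
    if record is None:
        return Action.OPEN_PR        # no earlier attempt exists
    if record.status == "closed":
        return Action.MARK_REJECTED  # closed without merging
    if record.status == "open":
        if wanted_fields != record.proposed_fields:
            return Action.UPDATE_PR  # metadata changed since the PR was opened
        if record.days_since_activity >= reminder_after_days:
            return Action.REMIND
    return Action.WAIT


wanted = frozenset({"Person.email"})
stale = PullRequestRecord("open", 10, wanted)
print(next_action(None, wanted).name, next_action(stale, wanted).name)
```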
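To show why existing annotations and imports complicate the edit, here is a deliberately simplified sketch that uses a regular expression instead of an EBNF grammar. It only handles single-line field declarations, which is exactly the limitation the talk says a real grammar was needed to overcome. The option name `(redacted)` and the import path `redaction.proto` are hypothetical placeholders, not Square's actual annotation.

```python
import re

ANNOTATION = "(redacted) = true"            # hypothetical custom option
IMPORT_LINE = 'import "redaction.proto";'   # hypothetical import providing it

# Matches single-line proto2/proto3 fields such as `optional string email = 3;`
# or `string phone = 2 [deprecated = true];`. Multi-line fields are not handled.
FIELD_RE = re.compile(
    r"^(?P<head>\s*(?:(?:optional|required|repeated)\s+)?[\w.]+\s+(?P<name>\w+)\s*=\s*\d+)"
    r"\s*(?:\[(?P<options>[^\]]*)\])?\s*;"
)


def annotate_field(line, field_name):
    """Add the redaction option to one field line, keeping any existing options."""
    m = FIELD_RE.match(line)
    if not m or m.group("name") != field_name:
        return line
    options = m.group("options")
    if options is not None and "redacted" in options:
        return line  # already annotated
    if options and options.strip():
        merged = f"{options.strip()}, {ANNOTATION}"
    else:
        merged = ANNOTATION
    return f"{m.group('head')} [{merged}];" + line[m.end():]


def ensure_import(source):
    """Insert the import after the syntax line (or at the top) if it is missing."""
    if IMPORT_LINE in source:
        return source
    lines = source.split("\n")
    index = next((i + 1 for i, l in enumerate(lines) if l.startswith("syntax")), 0)
    lines.insert(index, IMPORT_LINE)
    return "\n".join(lines)


print(annotate_field("  optional string email = 3;", "email"))
print(annotate_field("string phone = 2 [deprecated = true];", "phone"))
```

A regex like this breaks down quickly on comments inside option lists, nested messages, or fields split across lines, which is a reasonable argument for the grammar-based parser described in the talk.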
Wrap-Up
Our approach to mitigating potential risk in protocol buffer code gave us time to improve our logic and feel more confident in our sensitivity checking code. In the future, we have the opportunity to extend these checks to other data, anything that can be checked for sensitivity and updated in some way to handle sensitive data more effectively. This could mean JSON objects, GraphQL APIs, or YAML files. We are continuing to build tooling to make this possible.
Questions and Answers
Anand: Another question I had was about data ownership. I've, in the past, seen the case where there are datasets, and then there are teams and individuals in those teams, and not every dataset is owned by somebody. Sometimes people leave, teams fold, and that responsibility somehow doesn't get carried over. Has that been a problem?
Ransbury: That was definitely a problem, especially when we were first rolling out the solution that I was describing with the protocol buffers. We have a sister team at Square who has been working on solving this problem of how do we as a security org make sure we always know who's responsible for pieces of code, pieces of data. That has been an ongoing project, somewhat manual of trying to make sure that we have all code accounted for. If it's not accounted for, it always has to roll up to someone. It's not a perfect solution right now.
Anand: Typically, every company has a head of product. That head of product creates a roadmap, and then every team has to execute on this roadmap in their own areas. Then you have a security team that's trying to influence everyone to make sure they're complying with your needs but it's not necessarily stated in their OKRs. How do you handle this?
Ransbury: This is also part of my presentation, I was talking about how slowly we were rolling things out. Part of that was because, yes, it's on nobody's roadmap, or it wasn't at the time when we were rolling this out. What does it mean for us to actually make progress while not annoying people, and in fact, making people feel like they have successfully helped you solve the security issue without putting them out basically? I think that's been our answer so far is just to automate as much as we can. We literally had our own code make the PR for them, we opened it for them. All they had to do was just merge the PR, or say, looks good, or say, this doesn't look good, we're going to close this PR. I think that was really helpful. We always followed up with people if people were upset or concerned. Especially in the beginning, people were like, what are you doing? We tried to just have as much communication as we possibly could.
Anand: You mentioned a bit about your forking of Amundsen. Can you talk a little bit about that, about the system you have and how it differs, and how do you see it evolving over time?
Ransbury: When we first forked Amundsen, it was maybe two years ago now. At that time, the project was just not as mature. It was definitely hard. It started off immature. Compared to what it is today, it was really different. At that time, coming from a security team, where we didn't really have any frontend people, and knowing that we wanted to have a UI component, the fact that Amundsen came and shipped with a UI was really important to us. We forked Amundsen because we are not a data team, or we didn't start as a data team. We wanted to be able to ingest information via jobs that we could run programmatically and write a bunch of code, and we wanted to really get into the weeds there. We didn't necessarily want to set things up the way that Amundsen had already. The default was that you were using ETLs basically.
What I was talking about in my presentation with the different metadata types, in our version of Amundsen, we surfaced all of that information. We also surfaced in our UI, things like, who has been accessing this data? We also brought in different data sources like Snowflake, which wasn't supported at the time when we were working on Amundsen. Overall, it's been a really positive experience. I think that probably everyone has the same experience that data inventory is hard. We're trying as much as we can to still make progress and do things with our inventory, even though it's still ongoing because we have so many data sources.
Anand: Was your team the one that brought in the first metadata catalog to Square?
Ransbury: Yes, we were.
Anand: What does that mean for your team? Did it become a data team and a security team?
Ransbury: What it's meant is more in the last year, we've been working with the data infrastructure teams to share the load. We were the first ones to do a proof of concept with metadata management tooling.
Anand: What are some of the other challenges that you currently face? Maybe not at the time of thinking about this talk, but maybe things you didn't have time to add to the talk that you'd like people to know.
Ransbury: The solutions that I'm presenting to you are not some crazy, like ML backed magic solution. The things that I'm presenting are pretty much, here are things you can do tomorrow. Here's the script you can write. I think that sometimes we can lose sight of these easy fixes. We all have a lot of data. We don't need every solution to be a magic solution. Sometimes that's ok. I think when I've been doing this work, I've had to continue to come back to, how do we just get full coverage? What is the baseline? We don't need to jump ahead of ourselves.
Anand: Is every dataset under your management? Do you have a coverage goal?
Ransbury: Yes, we do. Our coverage goal is basically making sure that we know about all the data, because the reality is Square is huge. We acquire people all the time. What does it mean to have these acquisitions to onboard them to the data inventory? What's the timeline? That's all still being ironed out.
Anand: Your coverage goal. For example, someone said, do you review every single schema change for every dataset that Square owns? Then you've mentioned, I think, for the ones that you own, you do?
Is your coverage goal for just the datasets you own? Let's say it's 100. You said you check every day if there are any compliance or changes, or things that need to be fixed. Let's say overall, there are 1000 datasets, is your goal to cover the 1000? How do you think about coverage? Ignoring acquisitions, which is yet another challenge on its own.
Ransbury: Just for scope and understanding our scale right now, we are currently looking at hundreds of thousands of tables. When I say we look at them daily, this is not manual. This is our flagging logic runs, basically, and checks to make sure things are what we expect. Eventually, we would like to get full coverage so that we can understand all the data that exists and how it's protected. Then we can do additional things based on that. Right now, we've started by saying, let's at least get full coverage of this one type of database. We've got Snowflake covered. We've got MySQL covered. We have the on-prem MySQL. The way that things work at Square is that we let teams choose whatever backing store they want, and so that makes security harder, because then we have to be like, you're using Dynamo? You're using whatever random database. That is where we start getting into the weeds of how do we know what's there. Make sure we're understanding what exists in that database.
Anand: That is very challenging, and it's mostly social outreach within the org. You ping them and say, "I'd like to just learn what you're doing, what tables you use, what databases." Then they ask, "Who are you?" You say, "I'm in the security org." They say, "I don't want to talk to someone in the security org."
Ransbury: We do have this concept of a security partner at Square, and they're almost like their own mini-CISOs. Each security partner is responsible for a different business unit, and they really help us understand like, what data exists and where it exists. Help us make that timeline of making sure we can integrate with their data sources, basically.
Anand: That's good to know. It's always a challenge. The other question was you were saying, what other things you think about. You said, over the weekend, you thought, in this world, you have to scan and then pattern match, and then write tagging rules or flagging rules, and then follow up. There's this follow-up case, which is, your data is not clean. Now it's up to you to figure out how to fix it. How much of your time do you spend on these four, or these different aspects? I think one is, once you write the automated job, and things start getting flagged, there's that last mile of bringing it into compliance. Can you speak to that?
Ransbury: We actually have a team of project managers within security. They help us do that last mile follow-up. In general, we try to make it as automated as possible, for example, making those PRs for them. You just have to click the button, the bot will literally message you and be like, "Hello, can you please check this?" If someone's still not responding, it's not owned, whatever, those project managers help us follow up. That sometimes looks like then going to the highest level of the hierarchy and saying, you need to help us convince people on your team. I don't have to do as much of that.
Anand: I saw one question here about the original agreement text version, but I didn't totally grok this topic. Could you speak to it a bit more?
Ransbury: Do you track the agreement text/version that was used, when a given piece of data was originally stored by the user?
At Square, we have a separate service that tracks compliance. That means that if I go to the Square website, or I use a Square product and I agree to some terms of service, it does track the exact version of the text of the agreement and whatever. That is carried and follows your user. Remember how I talked about this idea of a data subject. You as a data subject now have this agreement you've agreed to, then we can follow your data like that.
Anand: How do you protect PII data from rogue employees, so internal attacks?
Ransbury: This is something that we have to think about. Because there's definitely an issue if I'm a rogue employee, and I go on and I say, this isn't PII, but it is. That goes into what I was talking about, where we still check those things. We're still doing random checks, because, not only could somebody be rogue, but they could have just made a mistake. We want to make sure that we're encompassing that. Also, things like these decisions are public. We store in our database, we know who told us that this is not PII, and we know the time and we know what you were doing. You can only make those decisions if it's data that your team owns, which is scoped pretty specifically. I couldn't just go to some random team and be like, I'm going to say this isn't PII. No, that's not even possible. We have a trail of that, basically.
Anand: For example, let's say my team has read access to a column. That means that I have the permissions in a team that has read access to that PII column. What I do with it, there's no way to know. If I'm rogue, I already have permissions. There's nothing beyond that that you can do.
Ransbury: Then it becomes something for our detection and response teams to deal with.
Anand: You do have those teams that deal with it?
Ransbury: Yes. Those are the people who are managing your corporate machine and who are looking at what data you're looking at.
Anand: I worked at PayPal, and they definitely know if you engage a USB drive, and copy from a cloud storage, they know that. They also check all the keystrokes. Everybody's keystrokes are captured and checked against patterns. There are a lot of things checked for this case, especially in, I think, anything having to do with payments.
Ransbury: We're continuing to find ways that we can be passing data to these detection teams so that they can have more information, and respond appropriately.
Anand: How do you know about all the existing datasets? You mentioned the mini-CISOs. These mini-CISOs are outside of your team, but they talk to the project manager somehow and try and give datasets and stuff?
Ransbury: One part is there's these security partners, and they're within our org, but they're working with the higher level of every department. The reality is we are a payments company, and we have a lot of compliance things that pertain to us. We have data security governance teams who that is their whole job, is making sure they know where the data lives. I think a lot of what we're solving for is like, how do we make this less manual? Of course, they already because of compliance reasons had to know what all the data sources are, and what data we expect to be stored in those data sources.
Anand: That's their job. You're really the proper engineering team that helps make this whole thing more automated. It's easier to stay within compliance.
What's the next big challenge for you and your team, something on the horizon?
Ransbury: These protos, I think, open up a lot of opportunities for us. We've had a couple teams here moving to GraphQL, and so what does it mean to make sure those API schemas are secured properly? We can use some of the same flagging logic there. Where securing the protos also comes up a lot is in logs, so people might log a full proto. We are doing additional checks and research into what we can do to continue to make logging more secure, not just by flagging and adding those redaction annotations, but also by addressing logs at different levels of the log lifecycle.