An Introduction to Differential Privacy

Key Takeaways

  • Differential privacy can be achieved by adding randomised “noise” to an aggregate query result to protect individual entries without significantly changing the result.
  • Differentially private algorithms guarantee that the attacker can learn virtually nothing more about an individual than they would learn if that person’s record were absent from the dataset.
  • One of the simplest algorithms is the Laplace mechanism, which can post-process results of aggregate queries.
  • Both Apple and Google are making use of differential privacy techniques in iOS and Chrome respectively. Differentially private algorithms have also been implemented in privacy-preserving analytics products, such as those developed by Privitar.
  • Differentially private algorithms are still an active field of research.

Differential privacy leapt from research papers to tech news headlines last year when, in the WWDC keynote, Apple VP of Engineering Craig Federighi announced Apple’s use of the concept to protect user privacy in iOS.

It was the latest instance of a general trend: users and engineers awakening to the importance of privacy protection in software. High-profile privacy violations such as Uber’s “God mode” have demonstrated in stark terms the ease with which employees of a company can misuse sensitive data gathered from their customers.  

The amount of sensitive data that is being digitally recorded is rapidly increasing. People now rely on digital services for more of their payments, transportation, navigation, shopping, and health than ever. This new data collection creates ever increasing ways to violate privacy.

But it also creates exciting opportunities—to improve transportation networks, to reduce crime, to cure disease—if made available to the right data scientists and researchers. There is a natural tension between protecting the privacy of individuals in the dataset and enabling analytics that could lead to a better world.

Differentially private algorithms are a promising technical solution that could ease this tension, allowing analysts to perform benign aggregate analysis while guaranteeing meaningful protection of each individual’s privacy.

This developing field of technology is worth considering in any system that seeks to analyze sensitive data. Although the differential privacy guarantee was conceived of only ten years ago, it has been successful in academia and industry. Researchers are rapidly inventing and improving differentially private algorithms, some of which have been adopted in both Apple’s iOS and Google’s Chrome.

This article discusses the historical factors leading to differential privacy in its current form, along with a definition of differential privacy and example differentially private algorithms. It then looks at some recent high-profile implementations of differentially private algorithms by Google, Apple, and others.

Background

Differentially private algorithms are the latest in a decades-old field of technologies for privacy-preserving data analysis. Two earlier concepts directly influenced differential privacy:

  1. Minimum query set size
  2. Dalenius’s statistical disclosure definition.

We’ll explain these first as they provide helpful background for differential privacy.

Minimum query set size

The first concept is a minimum query set size, which—like differentially private algorithms—aims to ensure the safety of aggregate queries. Aggregate queries are those where the returned value is calculated across a subset of records in the dataset, such as counts, averages, or sums. It may be helpful to think of aggregate queries as SQL queries that begin with “SELECT SUM”, “SELECT COUNT”, or “SELECT AVG”. Other types of aggregate queries include contingency tables and histograms.

A minimum query set size is a constraint that seeks to ensure that aggregate queries cannot leak information about individuals. Given some configured threshold number T, it ensures that every aggregate query is conducted on a set of at least T records. A minimum query set size would block aggregate queries that targeted fewer than T individuals. For instance, if T=2, it would block the following:

“SELECT AVG(salary) WHERE name = ‘Troy Brown’;”.

because this query would conduct an average over one record (we assume there is only one Troy Brown).
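
In code, enforcing this constraint is simple. The following is a hypothetical sketch of the rule (not any real query engine’s API): an aggregate query is only executed if the number of records it would aggregate over meets the threshold T.

// Hypothetical minimum-query-set-size check: block aggregate queries that
// would cover fewer than T records.
boolean isAggregateQueryAllowed(long matchingRecordCount, long minimumQuerySetSizeT) {
    return matchingRecordCount >= minimumQuerySetSizeT;
}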

Using minimum query set sizes prevents certain attacks, but does not come with a privacy guarantee and, in practice, can be circumvented by skilled attackers. For instance, the attacker could accomplish the above attack with:

“SELECT SUM(salary);”

“SELECT SUM(salary) WHERE name != ‘Troy Brown’;”

The difference between these two results is Troy Brown’s salary, even though each query covers far more than T records.

Or even, if we know Troy Brown’s age (45) and position (WR) uniquely identify him:

“SELECT SUM(salary) WHERE position = ‘WR’;”

“SELECT SUM(salary) WHERE position = ‘WR’ AND age != 45;”

Again, subtracting the second result from the first reveals his salary.

Such attacks are called tracker attacks, and they cannot be stopped by a minimum query set size constraint. Because of these attacks, minimum query set sizes were deemed an inadequate defense for protecting query systems (see Denning’s work). Something better, with a guarantee, was needed to ensure privacy.

Dalenius’s statistical disclosure definition

In 1977, statistician Tore Dalenius proposed a strict definition of data privacy: that the attacker should learn nothing about an individual that they didn’t know before using the sensitive dataset. Although this guarantee failed (and we will see why), it is important in understanding why differential privacy is constructed the way it is.

Dalenius’s definition failed because, in 2006, computer scientist Cynthia Dwork proved that this guarantee was impossible to give—in other words, any access to sensitive data would violate this definition of privacy. The problem she found was that certain types of background information could always lead to a new conclusion about an individual. Her proof is illustrated in the following anecdote: I know that Alice is two inches taller than the average Lithuanian woman. Then I interact with a dataset of Lithuanian women and compute the average height, which I didn’t know before. I now know Alice’s height exactly, even though she was not in the dataset. It is impossible to account for all types of background information that might lead to a new conclusion about an individual from use of a dataset.

Dwork, after proving the above, proposed a new definition: differential privacy.

What is differential privacy?

Differential privacy guarantees the following: that the attacker can learn virtually nothing more about an individual than they would learn if that person’s record were absent from the dataset. While weaker than Dalenius’s definition of privacy, the guarantee is strong enough because it aligns with real-world incentives—individuals have no incentive not to participate in a dataset, because the analysts of that dataset will draw the same conclusions about them whether or not they are included. Because their sensitive personal information has almost no effect on the outputs of the system, users can be assured that the organization handling their data is not violating their privacy.

Analysts learning “virtually nothing more about an individual” means that they are restricted to an insignificantly small change in belief about any individual. (Here and below, “change” refers to the change between using a dataset and using the same dataset minus any one person’s record.) The extent of this change is controlled by a parameter known as ϵ, which sets the bound on the change in probability of any outcome. A low value of ϵ, such as 0.1, means that very little can change in the beliefs about any individual.  A high value of ϵ, such as 50, means that beliefs can change substantially more. The formal definition is as follows.

An algorithm A is ϵ-differentially private if and only if:

Pr[A(D) = x] ≤ e^ϵ * Pr[A(D’) = x]

for all x and for all pairs of datasets D, D’ where D’ is D with any one record—i.e. one person’s data—missing. The symbol e refers to the mathematical constant (approximately 2.718). Note that this definition only makes sense for randomized algorithms. An algorithm that gives deterministic output is not a candidate for differential privacy.
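
To put numbers on the bound, the factor e^ϵ is the maximum amount by which the probability of any given output can differ between D and D’. The snippet below is purely illustrative, evaluating that factor for the two example values of ϵ mentioned above.

// e^epsilon bounds the ratio between Pr[A(D) = x] and Pr[A(D') = x] for any output x.
double maxProbabilityRatio(double epsilon) {
    return Math.exp(epsilon);
}

// maxProbabilityRatio(0.1) ≈ 1.105  -> output probabilities can shift by at most ~10%
// maxProbabilityRatio(50)  ≈ 5.2e21 -> effectively no constraint at all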

The primary appeal of the differential privacy guarantee is its limitation on the amount that any analyst can learn about an individual. Additionally, it has the following useful properties:

  • Composability: if two queries are answered with differential privacy guarantees of level ϵ1 and ϵ2, the pair of queries is covered by a guarantee of level (ϵ1 + ϵ2). Recall that a higher value of ϵ means a weaker guarantee.
  • Strength against arbitrary background information: the guarantee does not rely in any way on what background information the attacker knows. This property is one of the principal reasons that differential privacy is stronger than an earlier privacy guarantee, k-anonymity.
  • Security under post-processing: there are no restrictions on what can be done with a differentially private result – no matter what it is combined with or how it is transformed, it is still differentially private.

How is this guarantee achieved in software? Differentially private algorithms are randomized algorithms that add noise at key points within the algorithm. One of the simplest algorithms is the Laplace mechanism, which can post-process results of aggregate queries (e.g., counts, sums, and means) to make them differentially private. Below is example Java code for the Laplace mechanism specific to count queries:

import org.apache.commons.math3.distribution.LaplaceDistribution;

double laplaceMechanismCount(long realCountResult, double epsilon) {
    // Step 1: Laplace distribution centered at 0 with scale 1/epsilon
    LaplaceDistribution ld = new LaplaceDistribution(0, 1 / epsilon);
    // Step 2: draw one random sample of noise from the distribution
    double noise = ld.sample();
    // Step 3: perturb the true count by adding the noise
    return realCountResult + noise;
}

The key parts of this function are:

  1. Instantiate a Laplace probability distribution (see Figure 1) centered at 0 and scaled by 1/ϵ. We use the Apache Commons implementation, “LaplaceDistribution”, which is constructed with two arguments: the mean of the distribution, and the scale of the distribution. Note that lower epsilon (more privacy) leads to a larger scale and thus a wider distribution and more noise.
  2. Draw one random sample from this distribution. This sample() function takes a random number between 0 and 1 and applies the Laplace distribution’s inverse cumulative distribution function (CDF) to this number. This process results in a random number such that its likelihood of being any specific value matches the distribution. As an alternative way to think about it, if this sample function were invoked a million times to get a million samples, the shape of the histogram of these samples would closely match the shape of the Laplace distribution.
  3. Perturb the real value by adding the random value from step 2.
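
To make step 2 concrete, here is a minimal hand-rolled version of that sampling step for a Laplace distribution centered at 0. It simply applies the distribution’s inverse CDF to a uniform random number; this is an illustrative sketch of the idea, not the library’s actual implementation.

import java.util.Random;

// Inverse-CDF sampling from Laplace(0, scale): feed a uniform random number
// through the closed-form inverse cumulative distribution function.
double sampleLaplace(double scale, Random rng) {
    double u = rng.nextDouble() - 0.5;  // uniform in [-0.5, 0.5)
    return -scale * Math.signum(u) * Math.log(1 - 2 * Math.abs(u));
}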

Let’s consider why this algorithm is differentially private by taking the viewpoint of an attacker named Eve. Say the data set is mental health data, and Eve has conceived of a tracker attack (see above) that will reveal whether her target, Bob, receives counseling for alcoholism or not. If the query’s result is 48, Eve knows that Bob does receive counseling; if it’s 47, Eve knows the opposite.

Whether the answer is 47 or 48, the Laplace mechanism will return a noisy result somewhere around 47 or 48. It may return 49, 46, or even, with smaller probability, 44 or 51 (see Figure 2 for a histogram). In practice, it is impossible for Eve to be very sure about whether the true answer was 47 or 48 and, thus, her beliefs about whether Bob is in counseling for alcoholism or not will not meaningfully change.

Figure 1: The Laplace distribution centered at 0 with scale of 1. Pictured is the probability density function (PDF) of the distribution—the y-axis is the relative likelihood that the variable will take the value on the x-axis.

Figure 2: The likely outcomes of the count query for the two scenarios, when the real answer is 47 and when it is 48. A small number of outputs would not be enough to distinguish which distribution they came from. Epsilon is set to 0.67.
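
A rough way to reproduce the intuition behind Figure 2 is to simulate both scenarios. The sketch below (assuming the laplaceMechanismCount method above is in scope, with ϵ = 0.67 as in the figure) draws many noisy outputs for a true count of 47 and checks how often they look more like the “48” scenario; the heavy overlap between the two distributions is exactly why a single output tells Eve very little.

// Simulate the noisy count when the true answer is 47 and count how often the
// output lands closer to 48 than to 47.
void simulateTwoScenarios() {
    double epsilon = 0.67;
    int trials = 100000;
    int closerTo48 = 0;
    for (int i = 0; i < trials; i++) {
        double noisy = laplaceMechanismCount(47, epsilon);
        if (Math.abs(noisy - 48) < Math.abs(noisy - 47)) {
            closerTo48++;  // this output would mislead Eve toward "48"
        }
    }
    System.out.println("Outputs closer to 48 when the truth is 47: "
            + (100.0 * closerTo48 / trials) + "%");
}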

You may have observed by this point that Eve could cut through the noise by repeating the query many times and seeing whether the answers cluster around 47 or 48. To prevent this tactic, differentially private systems must have a “privacy budget”: a cap on the sum of the ϵ’s of all the queries answered. This cap works because of differential privacy’s composability property described above. Eve may ask a few relatively low-noise queries or many hundreds of high-noise queries, but either way, she will not be able to confidently establish whether the true answer is 47 or 48.
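
A privacy budget can be enforced with simple bookkeeping, thanks to the composability property described earlier. The sketch below is hypothetical (the class and method names are not from any particular product): it accumulates the ϵ spent on answered queries and refuses further queries once the cap would be exceeded.

// Hypothetical privacy-budget tracker: composability means the total privacy
// cost is at most the sum of the epsilons of the answered queries.
class PrivacyBudget {
    private final double epsilonCap;
    private double epsilonSpent = 0.0;

    PrivacyBudget(double epsilonCap) {
        this.epsilonCap = epsilonCap;
    }

    // Record the query's epsilon and allow it, or refuse if the cap would be exceeded.
    synchronized boolean tryConsume(double queryEpsilon) {
        if (epsilonSpent + queryEpsilon > epsilonCap) {
            return false;  // budget exhausted: the query must not be answered
        }
        epsilonSpent += queryEpsilon;
        return true;
    }
}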

Lastly, note that the Laplace mechanism for counts is merely one simple differentially private algorithm. The Laplace mechanism can be extended to work for sums and other aggregate queries. Furthermore, there are fundamentally different algorithms that have been proven to satisfy the differential privacy guarantee. A few worth exploring are the Private Multiplicative Weights algorithm, the Multiplicative Weights Exponential Mechanism, and DualQuery.

Differential privacy in action

In June 2016, Apple announced that it would begin to use differentially private algorithms to collect behavioral statistics from iPhones. This announcement, besides causing a huge spike in differential privacy interest, showed that differential privacy can help major organizations get new value from data that they previously did not touch due to privacy concerns.

Although Apple has so far made few details public, the algorithm used in the iPhone seems similar to Google’s RAPPOR project. Google implemented a feature in Chrome that collects behavioral statistics from Chrome browsers through a differentially private randomized response algorithm.

In randomized response, random noise is added to statistics before they are submitted to the collector. For example, if the real statistic is 0, the browser will, with some probability, replace the 0 with a randomly selected 0 or 1 before reporting it. Each user has a large degree of deniability about the value their software reports, because it could be a random value. But collectively, the signal stands out over the random noise, and the organization collecting the statistics (i.e. Google or Apple) can accurately observe trends.
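
The core idea of randomized response can be sketched in a few lines of Java. This is a simplified illustration of the general technique, not the actual RAPPOR or Apple algorithm: each client reports its true bit with probability p and a random value otherwise, and the collector corrects for the known noise rate when estimating the overall proportion.

import java.util.Random;

// Simplified randomized response: report the true bit with probability p,
// otherwise report a random 0 or 1, giving each user plausible deniability.
int randomizedResponse(int trueBit, double p, Random rng) {
    if (rng.nextDouble() < p) {
        return trueBit;                 // report the real value
    }
    return rng.nextBoolean() ? 1 : 0;   // report a random value
}

// The collector can still estimate the true proportion of 1s, because the
// expected reported rate is p * trueRate + (1 - p) * 0.5.
double estimateTrueRate(double observedRateOfOnes, double p) {
    return (observedRateOfOnes - (1 - p) * 0.5) / p;
}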

Interestingly, neither Google nor Apple, to our knowledge, has revealed the value of ϵ that goes with their differential privacy guarantee. We need this value to understand the protection offered by the guarantee. If they use a high enough value of ϵ, analysts can still learn sensitive facts about users with high confidence. A low value of ϵ is required for meaningful privacy protection.

Differentially private algorithms have also been implemented in privacy-preserving analytics products, such as those developed by my employer Privitar. These products allow companies that work with valuable, sensitive data to incorporate differentially private algorithms into their data architecture, providing privacy guarantees to their users while still performing meaningful analysis on the data.

Looking ahead

Both the engineering and research communities have paths forward with differential privacy. For engineers, the task is to become educated on differential privacy and ensure that it is used where appropriate for user privacy protection. For researchers, it is to find more and better differentially private algorithms, improving the toolset with which we can enable privacy-preserving data analytics.

We all stand to gain from the establishment of privacy guarantees and the successes of data analytics. For both reasons, we look forward to more organizations embracing differential privacy.  

About the Author

Charlie Cabot is a senior data scientist at Privitar, a data privacy startup that builds high-performance software for data anonymization, including perturbation and generalization algorithms and differentially private mechanisms, to facilitate safe use of sensitive datasets. Charlie focuses on provable privacy guarantees and the statistical impact of anonymization on analytics and data science. Previously, working in cyber security, Charlie engineered machine learning-driven approaches to malware detection and modeled cyber attacks on computer networks.
