PipelineDP Brings Google’s Differential-Privacy Library to Python

Google and OpenMined have released PipelineDP, a new open-source library that allows researchers and developers to apply differentially private aggregations to large datasets using batch-processing systems.

The project is a collaboration between OpenMined and the Google Anonymization team, aimed at co-creating production-level tools for differential privacy. OpenMined is a non-profit community that focuses on researching and building secure, privacy-preserving open-source software. It previously developed and published PyDP, a Python differential-privacy library built on top of Google’s open-source differential privacy libraries.

Differential privacy is a data-science practice that adds artificial noise to user-generated data so that analyses can still produce high-quality results without exposing personally identifiable information. Many large tech companies have used it to perform scientific research and generate meaningful reports without violating individual privacy. Recent applications include Google’s COVID-19 Community Mobility Reports and the COVID-19 exposure notifications on iPhone.

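
To make the noise-adding idea concrete, here is a minimal sketch of a differentially private count using the Laplace mechanism. It is an illustration only and not PipelineDP or PyDP code: the function names `laplace_noise` and `dp_count` are hypothetical, the sketch assumes each person contributes at most one record, and naive floating-point sampling like this is not considered safe for production use.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    # If X and Y are i.i.d. Exponential with mean `scale`, X - Y is Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_count(records, epsilon: float, rng: random.Random) -> float:
    """Return an epsilon-DP count, assuming one record per person (an assumption)."""
    # Adding or removing one person changes the true count by at most 1.
    sensitivity = 1.0
    return len(records) + laplace_noise(sensitivity / epsilon, rng)
```

A smaller `epsilon` means a larger noise scale and stronger privacy, at the cost of a less accurate count.
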
As consumers become more careful about sharing data and regulators step up privacy requirements, Google and OpenMined felt it was important to make differential privacy more accessible and usable. Miguel Guevara, a product manager at the Google Privacy and Data Protection Office, explained why enabling more developers to use differential privacy is important:

We felt a moral responsibility to share these technological advances with the wider community, we had also heard from a lot of developers that they wanted to try out some of these algorithms using Python, which is why we decided to open-source this library, in the hope that developers will try it out and create new and exciting cases with differential privacy.

PipelineDP offers a high-level end-to-end solution that manages the differential-privacy complexities under the hood while still ensuring that the result is differentially private, whereas its predecessor PyDP provides a relatively low-level Python API and requires domain expertise and additional configuration.

PipelineDP architecture

Source: https://pipelinedp.io/overview/

To make differentially private data processing more accessible to non-experts, PipelineDP encapsulates differential-privacy complexities, such as protecting outliers and rare categories, generating safe noise, and privacy budget accounting, behind an API that is familiar to Spark or Beam developers. Standard computations such as count, sum, and average are supported natively. Other aggregation types can be built by extending the standard APIs.
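
The kind of work hidden behind such an API can be sketched in plain Python. The following is not PipelineDP's API or implementation; it is a simplified illustration with a hypothetical function name and parameters. It assumes the set of output partitions is known in advance (which sidesteps the private selection of rare categories), uses a single `epsilon` instead of a budget accountant, and uses naive noise sampling.

```python
import random
from collections import defaultdict


def dp_count_per_partition(rows, epsilon, max_partitions, max_per_partition,
                           public_partitions, rng):
    """rows: iterable of (user_id, partition_key). Returns {partition: noisy count}."""
    # Group each user's contributions by partition, ignoring unknown partitions.
    per_user = defaultdict(lambda: defaultdict(int))
    for user_id, partition in rows:
        if partition in public_partitions:
            per_user[user_id][partition] += 1

    # Contribution bounding: limit how many partitions a user can affect,
    # and how much they can add to each one (this is what protects outliers).
    counts = {p: 0 for p in public_partitions}
    for partitions in per_user.values():
        k = min(len(partitions), max_partitions)
        for p in rng.sample(sorted(partitions), k):
            counts[p] += min(partitions[p], max_per_partition)

    # Laplace noise calibrated to the L1 sensitivity implied by the bounds.
    scale = (max_partitions * max_per_partition) / epsilon
    return {
        p: c + rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
        for p, c in counts.items()
    }
```

Tighter bounds mean less noise but more dropped contributions, which is the trade-off a library has to manage on the user's behalf.
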

Data quality often degrades when differential privacy is applied. To address this, PipelineDP ships with a utility analysis toolkit that lets users analyze the impact on their own input data and tune parameters accordingly.
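
As a rough idea of what such parameter tuning involves, the sketch below compares candidate caps on per-user contributions for a noisy total count. It does not use PipelineDP's utility analysis toolkit; the function name and the two reported quantities are assumptions chosen for illustration, and the noise figure assumes Laplace noise with scale `cap / epsilon`.

```python
import math
from collections import Counter


def clipping_vs_noise(user_ids, epsilon, candidate_caps):
    """For each cap, return (cap, records dropped by clipping, noise std dev)."""
    per_user = Counter(user_ids)  # number of records contributed by each user
    report = []
    for cap in candidate_caps:
        # Records lost because a user exceeds the cap (a source of bias).
        dropped = sum(max(0, n - cap) for n in per_user.values())
        # Standard deviation of Laplace noise with scale cap / epsilon.
        noise_std = math.sqrt(2) * cap / epsilon
        report.append((cap, dropped, noise_std))
    return report
```

A low cap drops more data but needs less noise, while a high cap keeps more data at the cost of a noisier result; looking at both columns side by side helps in picking a reasonable value.
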

Compared with other open-source differential-privacy libraries, such as Facebook’s Opacus and Google’s TensorFlow Privacy, PipelineDP offers greater flexibility: it has no vendor lock-in and interoperates well with other systems.

Note that PipelineDP is still experimental and subject to change. At the moment, the project developers do not recommend using it in production systems, as it has not yet been thoroughly tested. The Google and OpenMined teams are looking to add more functionality and improve reliability in the near future.

The PipelineDP library is available in OpenMined's GitHub repo, which also includes more examples to try out.
