DoorDash recently described how it proactively embeds privacy into its products. The post explains the importance of Privacy Engineering, an often overlooked software architecture practice, and gives an example: geomasking users' address data to better protect their privacy.
Alex Dougherty, software engineer at DoorDash, explains the motivation for implementing Privacy Engineering:
To facilitate deliveries, users must give us some personal information, including [...] names, addresses, and phone numbers [...]. This information is needed for Dashers to know where and to whom to deliver an order. Because this information can be used to re-identify an individual, it could be used by a bad actor to cause harm, including identity theft and doxxing.
That's why we want to ensure that this personal data is redacted (erased or obscured) from our platform within a reasonable period of time after a delivery is completed. That way, even if a bad actor gains unauthorised access to our database, personal data will no longer be there, preventing it from being misused.
Asynchronous jobs trigger data redaction in DoorDash's distributed system. When a user's data is eligible for redaction, the job pushes a message to a Kafka topic, signalling to purge the data associated with this user. Services that hold a copy of the user's data listen to this topic and redact the data upon request.
Asynchronous process to execute redaction using Kafka messages (Source)
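The publish-and-subscribe flow described above can be sketched in a few lines of Python. This is a minimal illustration, not DoorDash's implementation: `InMemoryTopic` is a hypothetical in-process stand-in for a Kafka topic, and `RedactionRequest`, `AddressService`, and `run_redaction_job` are names invented for this example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass(frozen=True)
class RedactionRequest:
    """Message signalling that a user's personal data should be purged."""
    user_id: str


class InMemoryTopic:
    """Hypothetical stand-in for a Kafka topic: every subscriber sees every message."""

    def __init__(self) -> None:
        self._subscribers: List[Callable[[RedactionRequest], None]] = []

    def subscribe(self, handler: Callable[[RedactionRequest], None]) -> None:
        self._subscribers.append(handler)

    def publish(self, message: RedactionRequest) -> None:
        for handler in self._subscribers:
            handler(message)


class AddressService:
    """Hypothetical service that holds its own copy of user addresses."""

    def __init__(self, topic: InMemoryTopic) -> None:
        self.addresses: Dict[str, str] = {}
        topic.subscribe(self.on_redaction_request)

    def on_redaction_request(self, message: RedactionRequest) -> None:
        # Redact only if this service actually holds data for the user.
        if message.user_id in self.addresses:
            self.addresses[message.user_id] = "[REDACTED]"


def run_redaction_job(topic: InMemoryTopic, eligible_user_ids: Iterable[str]) -> None:
    """Job body: publish one redaction request per eligible user."""
    for user_id in eligible_user_ids:
        topic.publish(RedactionRequest(user_id))
```

In a real deployment the topic would be a Kafka topic and each service would consume it independently, so the job does not need to know which services hold copies of the data.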
An example of user data redaction is address geomasking. Dougherty notes that there is a false dichotomy between protecting sensitive personal information and leveraging the data for analytical purposes.
He further explains that instead of completely removing address data, DoorDash employs Gaussian perturbation to displace users' locations. This process prevents bad actors from re-identifying the users while allowing the business to perform the required analytics and business optimisations.
Geocoding an address and randomly displacing its coordinates (Source)
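As a rough illustration of Gaussian perturbation, the sketch below displaces a geocoded point by independent, normally distributed offsets to the north and east. The function name `geomask`, the `sigma_m` parameter, and the small-offset conversion from meters to degrees are assumptions made for this example; the source does not describe DoorDash's exact procedure.

```python
import math
import random
from typing import Optional, Tuple

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius, used for the approximation


def geomask(lat: float, lon: float, sigma_m: float,
            rng: Optional[random.Random] = None) -> Tuple[float, float]:
    """Return (lat, lon) displaced by Gaussian noise with std. dev. sigma_m meters."""
    if rng is None:
        rng = random.Random()
    # Independent Gaussian offsets along the north and east axes, in meters.
    d_north = rng.gauss(0.0, sigma_m)
    d_east = rng.gauss(0.0, sigma_m)
    # Convert meter offsets to degrees; valid for small offsets away from the poles.
    d_lat = math.degrees(d_north / EARTH_RADIUS_M)
    d_lon = math.degrees(d_east / (EARTH_RADIUS_M * math.cos(math.radians(lat))))
    return lat + d_lat, lon + d_lon
```

Passing a seeded `random.Random` makes the displacement reproducible for testing; in production the noise should not be reproducible, since anyone who could regenerate the offsets could undo the masking.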
DoorDash uses spatial k-anonymity to evaluate the effectiveness of the geomasking process:
Spatial k-anonymity produces a value "K", which measures the number of potential locations that could be identified as the "true location" of the user after geomasking. With this value, the probability of a bad actor selecting the true location is 1/K. The larger the value of K, the more effective the geomasking is at protecting a user's actual location.
An example of spatial k-anonymity (Source)
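One way to estimate K, assumed here for illustration and not confirmed by the source as DoorDash's method, is to count the candidate addresses that lie within a circle centered on the masked point whose radius equals the displacement distance. The sketch uses planar coordinates in meters, and the name `spatial_k` is hypothetical.

```python
import math
from typing import Iterable, Tuple

Point = Tuple[float, float]  # planar (x, y) coordinates in meters


def spatial_k(masked: Point, true_location: Point, candidates: Iterable[Point]) -> int:
    """Count candidates at least as close to the masked point as the true location.

    `candidates` should include the true location itself, so K is at least 1.
    """
    radius = math.dist(masked, true_location)
    return sum(1 for c in candidates if math.dist(masked, c) <= radius)


# Usage: the true address plus two neighbours fall inside the circle, so K = 3
# and an attacker guessing uniformly picks the true address with probability 1/3.
k = spatial_k(
    masked=(3.0, 4.0),
    true_location=(0.0, 0.0),
    candidates=[(0.0, 0.0), (4.0, 5.0), (2.0, 2.0), (20.0, 20.0), (3.0, 10.0)],
)
```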
Dougherty states that the population density of a user's location can affect how well geomasking works. In urban areas, many other users are likely to order from addresses near the actual user, which reduces the chance of re-identifying a redacted user. In remote areas, however, displacing the coordinates by the same amount might not be enough to de-identify a user.
DoorDash aims for a K value between 5 and 20. Given this target and an approximation of the population density in an area, it can determine the standard deviation to use when displacing a user's address so that the redaction is considered successful.
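The relationship between K, density, and the standard deviation can be sketched under some simplifying assumptions that are not taken from the source: candidate addresses are spread uniformly with density ρ per square kilometer, the noise is an isotropic two-dimensional Gaussian with per-axis standard deviation σ, and K is the number of candidates within a circle whose radius equals the displacement distance. The squared displacement then has expected value 2σ², so the expected K is about 2πρσ², which gives σ = sqrt(K / (2πρ)). The function name `sigma_for_target_k` is hypothetical.

```python
import math


def sigma_for_target_k(target_k: float, density_per_km2: float) -> float:
    """Return the per-axis standard deviation in meters for an expected K.

    Assumes uniform candidate density and isotropic Gaussian displacement,
    so that E[K] = 2 * pi * density * sigma**2 (with sigma in kilometers).
    """
    if target_k <= 0 or density_per_km2 <= 0:
        raise ValueError("target_k and density_per_km2 must be positive")
    sigma_km = math.sqrt(target_k / (2.0 * math.pi * density_per_km2))
    return sigma_km * 1000.0


# Usage: the same target K needs a much larger displacement in a sparse area.
urban_sigma_m = sigma_for_target_k(target_k=10, density_per_km2=1000)  # about 40 m
rural_sigma_m = sigma_for_target_k(target_k=10, density_per_km2=10)    # about 400 m
```

Under these assumptions, a hundredfold drop in density requires a tenfold increase in the standard deviation to keep the same expected K.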