Privacy-First Re-Architecture

Summary

Nimisha Asthagiri discusses what it is like: an alternative architecture and ecosystem, where industry-wide decentralized data ownership is the prime directive.

Bio

Nimisha Asthagiri is a Principal Consultant at Thoughtworks. Prior, she was Chief Architect and Senior Director of Engineering at edX, driving intentional architecture for the next generation of large-scale online learning. Her past accomplishments include leading the security of a peer-to-peer group communications platform at Groove Networks.

About the conference

QCon Plus is a virtual conference for senior software engineers and architects that covers the trends, best practices, and solutions leveraged by the world's most innovative software organizations.

Transcript

Asthagiri: Put yourself in the shoes of a regular internet user. Let's call her Alicia. As a typical user, Alicia spends about 7 hours per day online. She benefits from the services provided by the 200 million websites active today. As a typical user, Alicia is inundated with managing passwords for each site. She has about 100 passwords, and is locked out of about 10 accounts every month. Even if she manages to create another account, satisfying yet another set of password requirements, there's no telling whether she'll have access to her own data when she wants it. Eighty-two percent of sites had at least two outages the past three years, and on average, at least four hours unplanned downtime. Sometimes the downtime was as long as two weeks. Or if the site is knocked down, it's possible the site is gone. While there are over 1 billion websites today, technically, less than 200 million are actually active. Think about Alicia's photos that she has now lost forever. Maybe the site is up, but Alicia, she can't share her own data across applications. She has to reenter and submit her medical history to multiple doctors. Poor Alicia, she may not even know when her own data is breached. Her data is lost in the crowd of 22 billion data records breached within the year. Does it sound familiar? Does it really have to be this way? I say no.

I'm Nimisha Asthagiri. A principal consultant at Thoughtworks. I'm here to tell you that a privacy-first architecture is more than possible by decentralizing data ownership. You see, what if we design it such that Alicia herself is at the center, not the applications and not the technologies, not the providing organizations? As before, Alicia chooses her computers and hardware. What if Alicia makes her own decisions on where her data is stored? For example, she may store data locally on a device or on external storage that she controls. Or yes, she may even backup her data to a cloud-based storage provider, especially for unreliable devices. Now that the data is under Alicia's control, what if she freely chooses which application service providers to use? In this Alicia centric world, she makes her own decision of which particular features to use from each application in the competitive landscape. In fact, maybe she uses the best in breed app for searching through her photos, and a different app for editing her photos. Plus, since the data is in her hands, she controls access to the data.

What if, while the storage providers store the data, they don't have access to the underlying clear text of the data? What if, while the application services have access to the data, they don't control access to the data; Alicia does? She can easily change secrets or change access since the data is in her control. Alicia is so empowered that she can even control how her data is used for machine learning, and other data science applications. She can choose the cohort with whom to train a cohort-personalized machine learning model. Finally, what if Alicia chooses her own identity provider? Not an identity provider that an application service chooses for her, not an identity provider that her device chooses for her, but rather, she chooses an identity provider that she herself trusts. One that vouches for who she is and her credentials. One she trusts will protect her identity and her PII, or her personally identifiable information. Then, once we're in this world, when Alicia's identity changes in any way, by name, by gender, by credential, she interacts with only her own identity provider to make those changes. She doesn't need to update her identity everywhere else, in all applications, on all devices, storage spaces. Have you or someone you know needed to change information of your identity? It's painful, isn't it? Our identities are spread throughout the internet. Since applications are decoupled from data backends, then when Alicia wants to switch application providers, she takes her own data with her. That's right. There's no need for data portability across applications when the data remains grounded with Alicia.

Groove (Peer-to-Peer Application)

For those of you who are thinking, this is great, but how is this even possible? It's just not where we are today. I'm here to tell you that we've done it. In fact, in the late '90s, I was part of a team that built a peer-to-peer application called Groove. In retro style here in black and white on the left, you see a faint imagery of a past Groove UI. We had virtual spaces where groups of individuals could come together and have discussions, share files, messages. Mind you, there were no centralized servers in the picture. In fact, if you wanted to authenticate other people, you could share Groove fingerprints out of band. Fingerprint as in a hash of a Groove identity's public key. Or you could use the TOFU style of authentication, the trust on first use, that is as you build relationships online, you began to alias people in and choose personal nicknames to associate with those online Groove identities. Even the public sector, they found Groove to be incredibly powerful. It was used in the Iraqi war to assess humanitarian needs of people affected by the war. Since no central servers were required, Groove functioned in unreliable network situations. It was also used in negotiations between Sri Lanka and the Tamil Tiger rebels. Neither side trusted the other to run a server. They couldn't even cross certify each other because of intense political sensitivities, so they used the peer-to-peer aspects of Groove instead.

Let's diagram Groove's architecture using the modeling elements from before. Now here we depict users in the center, and each user is a peer to another. Each user runs a Groove application on their own device, with their own storage of choice. They control their own Groove identity. Local access to their data is protected with encryption at rest in their storage. Then, shared data and messages are automatically encrypted over the wire. When a peer communicates with another peer, pairwise keys are automatically generated using Diffie-Hellman protocols. When group membership changes, shared group keys are automatically generated, distributed, and rekeyed. There's full end-to-end encryption over the wire. In the end, the data is in clear text only on the devices of the users. We did leverage servers in certain situations, but the servers were only additive and ephemeral, not core to the architecture. We used storage in the cloud called relay servers, but they were only dumb routers. They were storing data temporarily while someone was offline. The data was still end-to-end encrypted, so these storage providers didn't have access to the underlying data. Plus, they were configurable, that is, you could specify in your own profile which relay server I should use if I want to send you a message while you're unreachable. Similarly, we supported identity providers, if you chose to leverage an authority to additionally vouch for your identity. Your primary identity, though, based on your public key, still remained in your control. The identity provider was used only as an additional managed credential.
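
As a rough illustration of two ideas from this part of the talk, a fingerprint as a hash of a public key and pairwise keys derived with Diffie-Hellman, here is a minimal Python sketch. It is not Groove's actual implementation: the group parameters `P` and `G`, the function names, and the choice of SHA-256 are all assumptions made for the example, and the toy prime is far too small for real use.

```python
import hashlib
import secrets

# Toy Diffie-Hellman group, for illustration only (assumption: these are not
# Groove's real parameters). Production systems use vetted groups or curves.
P = 2**127 - 1  # a Mersenne prime, far too small for real security
G = 3


def generate_keypair():
    """Return (private, public), where public = G^private mod P."""
    private = secrets.randbelow(P - 3) + 2  # uniform in [2, P - 2]
    return private, pow(G, private, P)


def fingerprint(public_key: int) -> str:
    """Short hash of a public key that two people can compare out of band."""
    return hashlib.sha256(public_key.to_bytes(16, "big")).hexdigest()[:16]


def pairwise_key(my_private: int, their_public: int) -> bytes:
    """Both peers derive the same symmetric key without ever sending it."""
    shared = pow(their_public, my_private, P)
    return hashlib.sha256(shared.to_bytes(16, "big")).digest()


alice_private, alice_public = generate_keypair()
bob_private, bob_public = generate_keypair()

# Each side combines its own private key with the other side's public key.
assert pairwise_key(alice_private, bob_public) == pairwise_key(bob_private, alice_public)
```

Comparing `fingerprint(alice_public)` over a phone call or in person is the out-of-band check; a relay server that only ever sees ciphertext encrypted under the pairwise key cannot read the message.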

Groove as we had built it, it's not in use today. As with many product failures, there may not be a single reason but a confluence of conditions. For Groove specifically, it may be the way we positioned it in the market, or because it ran only on Microsoft devices, or the fact that it required an 18 megabyte download at the time when the first cloud-based SaaS companies were just emerging. Also, many in the corporate world, they may have distrusted peer-to-peer since P2P origins were then associated with Napster. Napster is a notorious P2P music sharing platform that was eventually shut down due to legal suffocation. What could be the reason? Perhaps Groove was just ahead of its time, before internet users had the chance to feel the pains that we do today, which brings us back to where we are.

Centralized Silos

Today, we have centralized silos created by software, hardware, and cloud providers. Each provider decides, controls, and serves its own siloed package of storage, applications, identity providers, access control, and machine learning, and data science all in their own silo. This pushes Alicia far out of the center today. She's almost an afterthought. It's even worse. If we take a peek inside one of these centralized silos put together by a typical organization today, you'll see a depiction that looks more like the right-hand side. These essentials, they're all tightly coupled together. Let's hear what Rich Hickey has to say, he is the author of Clojure, the programming language. He tells us, this is where complexity comes from, complecting. He says, it's very simple, just don't do it. Don't interleave, don't entwine, don't braid. It's best to just avoid it in the first place. When you build a system that's complected, it becomes very hard to expand on it and innovate further. Instead, find the individual components. Those can then be composed and placed together to build robust software. Let's take Rich's design principles beyond the single programming language like Clojure, and even beyond software application design. Let's apply it to enterprise architecture, and yes, even industry wide architecture.

First, how do we know which individual pieces to compose? For that, let's take a look at what Steve Jobs and Pablo Picasso have to offer. In the mid-1940s, Picasso spent 12 hours a day for 4 months, creating a series of lithographic prints, including one of a bull. You can see how he starts with the bull prints that are artfully shaded and realistically rendered. In the end, he boils it down to its true essence of just 12 lines. He has a U suggesting the horns, a short line for the tail, and a smattering of black and white dots for the skin. In the end, he simplifies the bull to its essence. Steve Jobs says it takes a lot of hard work to make something simple, to truly understand the underlying challenges, and then come up with the elegant solutions. That's what Picasso had done. That's how Steve Jobs designed. In fact, when Apple trains its new hires, it includes a presentation of Picasso's bull prints. This is why we see Apple create simple designs of its mouse, its Apple TV remote, and its iMac.

Decoupling and Decentralizing Identity

Bringing this back to our topic of today, let's take these design principles of discovering the essence and detangling our systems. Let's tease apart the essentials of the software and services that we offer today. Once we do so, we can recompose and recreate with Alicia as a new center. Let's start by pulling out and decoupling identity. Once decoupled and removed from the centralized silo, we recompose it with decentralization, and Alicia centricity in mind. Alicia decides how to identify herself online. It could be via email, telephone number, even government or institution issued IDs. She can reuse her chosen identity across platforms and applications. Here's an example UI where Alicia controls the binding of her own selected identity with a specific application. She minimizes what identity information to share, and perhaps even leveraging progressive proof of information.

Decoupling and Decentralizing Application Service

That was decoupling and decentralizing identity, now let's look at doing the same for application services. Let's first decouple application services from that tightly coupled silo, and then we compose in a decentralized way. The application, they're now so decoupled from the underlying data storage. Alicia now decides which application she uses and when, and to what extent. This means if Alicia's photo application chooses just to go out of business, her photos, they still remain with her. She can change to another photo app and not worry about losing any data. She more easily experiments, shops around, and tries different photo apps seamlessly, all the while her photos remain with her. She decides for herself which messaging app she uses. Her social circle, they can choose other apps if they want. The decentralized messaging apps, they interoperate with standard communication protocols. At the end of the day, it's Alicia and her own experiences and choices. The applications serve her, not the other way around.

As another example, let's take calendar apps. Today, our single day is broken up across multiple calendars on Google, Apple, Microsoft, and other provider services. If the calendar app were truly decentralized, it wouldn't matter what app other people use. Instead, the calendar apps, they use data standards to interoperate. Alicia's app, it uses a standard to access her own day. Her colleague's calendar app accesses Alicia's calendar using the same standard. In the end, Alicia and her contacts have freedom to choose their own individual calendar apps.

Decoupling and Decentralizing Data

That was about decoupling and decentralizing application services. Now it's time for data. We see that data has multiple facets. We can separate data storage from data access from learning. Once decoupled, we recompose decentralized. Alicia controls the storage. Alicia controls the access, who, which users, which service providers. What, the granularity of access: fine grained, or coarse grained. When, Alicia expires access when she wants to. She can turn it on and off at her will. Alicia, she controls learning, how her own data is used for personalization via machine learning. Here are some sample UIs for Alicia's control, for data storage, backups, data longevity. For an application's access, she has time limited and fine-grained control. Similarly, for access to her data by other users as well. When machine learning is decentralized, Alicia chooses whether and which collaborative model to leverage. She may also derive her own local generated model from a provider's model.
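
The who, what, and when of access described here can be sketched as an owner-side ledger of grants. This is a minimal illustration under stated assumptions: `Grant`, `AccessLedger`, the glob-style resource patterns, and the app and file names are all hypothetical, not part of any product or standard mentioned in the talk.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from fnmatch import fnmatchcase


@dataclass
class Grant:
    grantee: str           # who: a user or a service provider
    pattern: str           # what: "photos/*" is coarse, "photos/cat.jpg" is fine
    expires_at: datetime   # when: access lapses automatically
    revoked: bool = False  # the owner can switch access off at will


class AccessLedger:
    """Hypothetical owner-side record of who may read what, and until when."""

    def __init__(self):
        self.grants = []

    def grant(self, grantee, pattern, ttl, now):
        new_grant = Grant(grantee, pattern, now + ttl)
        self.grants.append(new_grant)
        return new_grant

    def is_allowed(self, grantee, resource, now):
        return any(
            g.grantee == grantee
            and not g.revoked
            and now < g.expires_at
            and fnmatchcase(resource, g.pattern)
            for g in self.grants
        )


t0 = datetime(2024, 1, 1, tzinfo=timezone.utc)
ledger = AccessLedger()
editor = ledger.grant("photo-editor-app", "photos/2023/*", timedelta(hours=1), now=t0)

assert ledger.is_allowed("photo-editor-app", "photos/2023/beach.jpg", t0)
assert not ledger.is_allowed("photo-editor-app", "medical/history.pdf", t0)
assert not ledger.is_allowed("photo-editor-app", "photos/2023/beach.jpg", t0 + timedelta(hours=2))

editor.revoked = True  # Alicia turns access off before it expires
assert not ledger.is_allowed("photo-editor-app", "photos/2023/beach.jpg", t0)
```

Passing `now` explicitly keeps the check deterministic; a real system would read a trusted clock and would enforce the decision where the data lives, not in the requesting application.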

Coming Expectations - Privacy Requirements

You see, we walked through what happens when identity, application, and data essentials are decoupled, then decentralized. Here's a reminder of what such an architecture looks like in a decentralized world where Alicia is in the center. Whether you agree or not, these are coming expectations. Privacy by design are seven principles adopted in 2010 by the International Assembly of Privacy Commissioners and data protection authorities. A few notable ones include privacy by default, privacy embedded into the design and not bolted on as an afterthought, or respecting privacy with the interests of the user at the center. OWASP, in addition to their security list, they have started publishing the top 10 privacy risks since 2014, with their own ideas for countermeasures. Examples, instead of your site requiring users to consent on everything, have users consent for each purpose, like consent specifically for keeping profiling for ads, distinct from consent for the use of the website. Better yet, minimize what data you're collecting in the first place. Ensure that your site sufficiently deletes personal data after it's needed, or the user requests to delete it. You do so proactively with data retention, archival, and deletion policies. Also, you can ensure that users, they always have access, and can change and delete their own data when they want to.

While OWASP suggests providing users with an email form to make their request for this, with decentralized architecture the user's data is in their control the whole time. Finally, if we don't hear privacy demands from the collective voices of consumers, nor the voices of international assemblies and nonprofit communities like these, then we're sure to feel the slap on the wrist from upcoming global legal regulations. They're going to come with enforced sanctions and fines. New global privacy laws are emerging even beyond GDPR, most notably from the EU, Brazil, Canada, and California. For example, the EU's Digital Markets Act, or DMA, has provisions for ensuring data interoperability, data portability, and data access, especially for users of big tech platforms. As another example, the EU's Data Governance Act, or DGA, provides a legal framework for common data sharing spaces within Europe to increase data sharing across organizations, especially in industries such as finance, health, and the environment. We can try to bolt on support of these requirements, but privacy by design, architecturally, will truly enable the ideals and essence of these policies and principles.

Real-World Efforts Today

You may ask, what real-world efforts exist today, so we may practically support these requirements? Let's look at each of the essentials individually. I'll give you a few examples of emerging technologies in each category. For identity, we have W3C's Decentralized Identifiers, or DIDs, which Thoughtworks is implementing as a proof of concept for the Singapore government. With DIDs, there's no need for a centralized authority or identity management. A DID is a persistent, globally unique identifier, following a URI scheme. A DID refers to a DID subject, which can be a user and, in our case, Alicia. The DID resolves to a DID document that describes the DID subject. The resolution of the DID uses a verifiable data registry that maps to the method in the DID's URI. Then the DID's corresponding document is controlled by what's called a DID controller. You see, there's no centralized controller, there's no centralized resolver in this process. Plus, to limit exposure of a user's data, a DID controller can use zero-knowledge proofs for selective and progressive disclosure of the user's information. Finally, I've also listed OpenID here as a secondary alternative if you really can't make the case for DIDs today in your org. OpenID is an open standard for federated identity. Let's just take a moment to reflect on the impact of all of this for Alicia. If all sites invested in just this one essential of identity and decentralizing it, Alicia's password management goes down to near or absolute zero.
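
To make the DID vocabulary concrete, here is a small Python sketch of splitting a DID into its method and method-specific identifier, then resolving it to a DID document through a registry chosen by method. The regular expression follows the `did:<method>:<method-specific-id>` shape of the W3C syntax. The `REGISTRIES` dictionary is a hypothetical in-memory stand-in for a verifiable data registry, and the sample document contains only two illustrative fields.

```python
import re

# did:<method>:<method-specific-id>; the id may contain ":" but not end with it.
DID_PATTERN = re.compile(
    r"^did:(?P<method>[a-z0-9]+):"
    r"(?P<id>(?:[A-Za-z0-9._:-]|%[0-9A-Fa-f]{2})*(?:[A-Za-z0-9._-]|%[0-9A-Fa-f]{2}))$"
)


def parse_did(did: str):
    """Split a DID into (method, method_specific_id) or raise ValueError."""
    match = DID_PATTERN.match(did)
    if match is None:
        raise ValueError(f"not a valid DID: {did!r}")
    return match.group("method"), match.group("id")


# Hypothetical stand-in for verifiable data registries, one per DID method.
REGISTRIES = {
    "example": {
        "123456789abcdefghi": {
            "id": "did:example:123456789abcdefghi",
            "controller": "did:example:123456789abcdefghi",
        }
    }
}


def resolve(did: str) -> dict:
    """Look up the DID document via the registry that matches the DID's method."""
    method, method_id = parse_did(did)
    registry = REGISTRIES.get(method)
    if registry is None:
        raise LookupError(f"no resolver for DID method {method!r}")
    return registry[method_id]


document = resolve("did:example:123456789abcdefghi")
assert document["controller"] == "did:example:123456789abcdefghi"
```

The point of the dispatch on `method` is that no single resolver or controller sits in the middle: each method defines its own registry, and the subject's controller decides what the document says.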

Now let's move on to application services as a next essential. Foremost, you have SOLID and PODs as new standards actively being created by Tim Berners-Lee, the inventor of the web, with his team at Inrupt, starting in 2018. PODs, or Personal Online Datastores, are a proposed standard for storing personal data in user or organization specific stores. SOLID, or Social Linked Data, is their standard for communicating across PODs and cross-linking data. This is a web-centric, decentralized architecture using HTTP, WebID, and Web ACL. Secondly, for decentralized applications to interoperate, we need data standards as a common language. Schema.org is a place to find what exists already, and if something's not there, you can contribute new ones. There are also industry-specific standards such as FHIR for healthcare, or TM Forum for telecom, and IDMP for the identification of medicinal products. Another exciting emerging technology in this area is a research initiative published in 2019 called Local-first, by Ink & Switch in Germany. Local-first apps keep their data in local storage on each device, and the data is synchronized across all the users' devices. Yes, like Groove, the network is optional.

Local-first leverages CRDTs, or Conflict-free Replicated Data Types. What a CRDT does is it allows decentralized apps to merge data without conflicts. It provides strong eventual consistency guarantees, where over time all nodes eventually converge without data loss. Essentially, in async and decentralized scenarios, it fundamentally reframes race conditions. To get a very brief idea of how, first think about what model it may use to notify other nodes when a change happens. A few options for synchronization models could be: send the full new state of an object to other nodes; specify what operation to run on the object so other nodes run that same operation; or transfer only a diff, or delta, of the new state from the previous state. Regardless of the synchronization model, to provide strong eventual consistency, the object's underlying data structure uses different types of tools. For example, if all JSON-based changes can be reduced to commutative operations, where A plus B is the same as B plus A, then it doesn't matter in what order a node receives changes; it'll all be additive. Another tool is to use a convergence heuristic, such as last writer wins. To support that, you need to timestamp each change. If neither of those works for a data type or object, then you could use a version vector or a vector clock to keep track of where other nodes are at, by field or by event. Such techniques have to make use of UUIDs to identify each field, and maybe also make use of data versioning. As you can see, this design is very different from server-based designs that rely on a single centralized server to mitigate merge conflicts.
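
Two of the tools mentioned here, commutative merges and last writer wins, can be sketched in a few lines of Python. This is a minimal state-based illustration and not the data structures any particular local-first library uses: `GCounter` is the textbook grow-only counter, and `LWWRegister` breaks timestamp ties by node ID, which is an assumption added so that merges stay deterministic.

```python
class GCounter:
    """Grow-only counter: each node increments only its own slot."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.counts = {}

    def increment(self, amount=1):
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + amount

    def merge(self, other):
        # Per-node max is commutative, associative, and idempotent, so the
        # order and repetition of merges does not change the result.
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)

    def value(self):
        return sum(self.counts.values())


class LWWRegister:
    """Last-writer-wins register; ties on timestamp are broken by node ID."""

    def __init__(self):
        self.value = None
        self.stamp = (0, "")  # (timestamp, node_id); assumes timestamps > 0

    def set(self, value, timestamp, node_id):
        self._apply(value, (timestamp, node_id))

    def merge(self, other):
        self._apply(other.value, other.stamp)

    def _apply(self, value, stamp):
        if stamp > self.stamp:
            self.value, self.stamp = value, stamp


phone, laptop = GCounter("phone"), GCounter("laptop")
phone.increment(2)
laptop.increment(3)
phone.merge(laptop)
laptop.merge(phone)
phone.merge(laptop)  # receiving the same state twice is harmless
assert phone.value() == laptop.value() == 5

name_on_phone, name_on_laptop = LWWRegister(), LWWRegister()
name_on_phone.set("Alicia", timestamp=10, node_id="phone")
name_on_laptop.set("Alicia A.", timestamp=12, node_id="laptop")
name_on_phone.merge(name_on_laptop)
name_on_laptop.merge(name_on_phone)
assert name_on_phone.value == name_on_laptop.value == "Alicia A."
```

Last writer wins converges but silently discards the older write, which is why richer data types fall back on version vectors or per-field identifiers, as the talk notes.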

For storage, the local-only movement supports efforts to decouple and decentralize storage. By keeping data local only, organizations, they can avoid liability issues and even government subpoena of data. Of course, we mentioned peer-to-peer technologies earlier when we were describing Groove. Now for access, let's look at what these technologies can do in terms of decoupling access from other essentials, so that it results in separation of powers. For example, the open source Signal Protocol, which is used by multiple messaging apps today, including Signal, WhatsApp, and Google, it provides end-to-end encryption of the messages. Access to the underlying message is not shared with the application provider, or any cloud storage provider. Then, you have technologies such as Vault, 1Password, and HSM, and they separate the storage of secrets from the storage of data. Finally, OPA, or Open Policy Agent, is an open source tool that enables you to separate your access policies from the rest of your application logic and data. Its architecture is a great example of Picasso based decomplecting of essentials, resulting in a powerful and flexible foundation. In fact, you can see how easily our own decomposition of essentials maps well with OPA's own primitives.

For learning, I was pleasantly surprised to learn about the many emerging technologies in decentralized learning from my colleague in Berlin, Katharine Jarmul. She can tell you detail about federated learning, differential privacy, secure multi-party computation, secure aggregation, TinyML. Many of us, including my prior self, assume we need big data for deep learning and analysis to provide personalized user experiences and business analytics. However, this list of technologies, they prove us wrong. As you can see, there's a plethora of emerging technologies just waiting for you to leverage, contribute to, and advance.

Practical Steps Forward

Many of these technologies, they're in the innovative and early adopter stages. If the maturity of these are at all a hindrance for you and for your own organization, then here are some actual practical steps you can take even today. For identity, think about whether your org still requires users to create passwords, specifically for your org. If so, can you influence your company to support federated identity and single sign-on instead? There are multiple third-party identity providers now in the market such as Okta, and ForgeRock, and Auth0, who support OpenID Connect. Once we've leveraged an IdP, identity provider, then integrate them with preexisting identity providers, such as social login or SAML integration that may be already existing with a user's employer. It would be even better if you can propose doing a proof of concept or pilot with a standard decentralized identity technology such as DIDs, and digital wallets. For application services, could your organization shift to leveraging existing interoperable standards in your industry? Previously, I mentioned a few industry-specific standards such as FHIR. If standards aren't available for your industry, pick the business case for being an industry leader, and publishing standards into a framework such as schema.org. For storage, have you looked at where your users' personal data is propagated, sent, stored? How are we keeping track of this? Could you, in fact, minimize replicating users' data into more places than it really needs to be.

To explain what I mean, let's say you have multiple frontends and backends in a microservice and micro-frontend architecture. Backends may be using cloud storage technologies, while frontends may be leveraging browser or device local storage. Rather than having all of your users' profile information propagated throughout all of your backend services and their databases, minimize replicating the user's data to only where it needs to be, perhaps only to the specific micro-frontend that displays a user's profile. Even then, perhaps the profile information is in the JSON Web Token provided by your organization's own identity provider, and so it doesn't even need to be stored on the frontend separately. Then, when your organization matures to using an external identity provider or a decentralized identity, the users personal identifiers can be ephemeral and not even stored within your organization's own application storages.

Coming back to our list, where we were looking at practical steps for decentralized storage, make the business case for supporting data portability and making user data exportable. Doing so now provides your organization optionality, before the upcoming data privacy laws give you little time. For access, check assumptions that you have today. Can you build in measures today to automatically expire access, or remove data after a period of time? Or can you allow more fine-grained access to data, so it's not all or nothing? Finally, move away from entangling your application and access. You can use well-regarded policy-as-code technologies like OPA, Open Policy Agent, that are already available. OPA aligns with the decoupled architecture that was put forth by the XACML standard, which is the Extensible Access Control Markup Language. Using OPA's model architecture in your organization is a great practical step to decouple policy enforcement from policy decision, from policy administration, and from policy information. You can see how this decoupling maps very well to our own decentralized modeling essentials. I hope that gives you a good list of practical ideas that you can start even today, that will be directionally aligned to the promises of this decentralized architecture. If we collaborate on getting to this, starting today, we are on the path of an industry-wide rearchitecture.
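
The four-way split named here (policy administration, information, decision, and enforcement) can be shown in plain Python. This sketch does not use OPA or its Rego language; every name in it (`POLICIES`, `SHARING`, `decide`, `read_photo`) is hypothetical and exists only to show where each responsibility sits.

```python
# Policy administration: rules live in their own registry, apart from app code.
POLICIES = {
    "photos.read": lambda inp: inp["user"] == inp["owner"]
    or inp["user"] in inp["shared_with"],
}

# Policy information: attributes that are looked up at decision time.
SHARING = {
    "photos/beach.jpg": {"owner": "alicia", "shared_with": {"ben"}},
}


def policy_information(resource: str) -> dict:
    return SHARING[resource]


def decide(policy_name: str, user: str, resource: str) -> bool:
    """Policy decision point: evaluates a rule, knows nothing about the app."""
    policy_input = {"user": user, **policy_information(resource)}
    return POLICIES[policy_name](policy_input)


def read_photo(user: str, resource: str) -> str:
    """Policy enforcement point: the app asks for a decision and obeys it."""
    if not decide("photos.read", user, resource):
        raise PermissionError(f"{user} may not read {resource}")
    return f"<bytes of {resource}>"


assert read_photo("alicia", "photos/beach.jpg") == "<bytes of photos/beach.jpg>"
assert decide("photos.read", "ben", "photos/beach.jpg")
assert not decide("photos.read", "mallory", "photos/beach.jpg")
```

Because `read_photo` never inspects ownership itself, the sharing rule can change, or move into a dedicated policy engine, without touching application code.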

Human-Centric Architecture

Then, fast forward a few decades. Now, Alicia is at the center. She controls her own data and the choices surrounding it. Our industry is reset with the centralization pendulum shifted to a new and better balance. We have a user centric foundation to meet the underlying needs of Alicia's communities and spheres. Alicia's memberships and belongings are rooted in her own self, her own growth of her own identity. It's not chosen by others and broken into disparate pieces throughout the internet. This pivotal recentering transforms data ownership for whole organizations, cities, and states. Users, communities, organizations, nations, we have expanded it to the global landscape where now we're in a place where human-centric architecture is at the heart. Technology serves Alicia, not the other way around.

Questions and Answers

Knecht: I think you mentioned a lot of different options for here are the different ways that folks could start to move in a user-centered architecture direction. Just to tie it back to the practical security track, and getting very opinionated actionable steps here, if you had to choose one of those things, what's the most important step that we can take or technology that we should embrace to enable a user-centered architecture?

Asthagiri: I do think the most important long run is to tackle that decentralized application essential, where we do start really decoupling where data lives and the ownership of that, and really get the data in the hands of the users when they need it, and to that extent. For the short term, I think what could be most important, is really tackling the identity dimension. With decentralized identity, removing a lot of the security issues that we get even from password management, humans being the weakest point in managing passwords, but it's just not user friendly in so many ways. I think that just tackling that, when there's already so many technical advances in terms of single sign-on providers, whether it's Okta, or Auth0, or ForgeRock, or whatever you may choose to use as your technology, but getting us to move towards a single sign-on experience for our users and then eventually decentralized identity.

Knecht: One of the foundations of DDD is that the same concept like identity, for example, means different things, depending on the subdomain using it. How do you handle avoiding a huge blob as identity containing all information necessary for services that might be different and are presented differently from one service to another without it growing out of proportion? Especially if identity is handled and stored outside of services from different companies, if we want one change to update them all?

Asthagiri: Domain-driven design, yes, it does speak about polysemes, and basically each bounded context having its own semantic for a particular term. You use here, as an example, the term identity. Identity is a cross-cutting concern within organizations also. It's a cross-cutting architectural concern. Therefore, typically, even within an organization that's providing an application service, you would have a single identity provider in order to scale the organization, because you don't want each of your microservices developing its own identity management solution, and so forth. I think what you're pointing out is more about the data about an identity. You're perfectly right. The way that I'm thinking about it is that you do need a hybrid architecture. When we're shifting from one side of the pendulum to the other, we need to find a balance where we're doing both decentralized and centralized and using each of them where it makes sense. When you have data standards that you want to develop, so in this particular case, if the users have a separate decentralized identity provider for themselves, that would be the system of record. That would be the place where a lot of their identity data is. However, in the application organizations, there will be a relationship with the user. For instance, in terms of, let's say, my purchase of a particular product, or my history with that organization and application, some of those things, I don't necessarily want to store in a separate place; however, you still want to be mindful of privacy. I think there is a hybrid approach where some non-PII, though still user-related, information can still be stored in the application. We still want to be very mindful about thinking about privacy-first design. The data that is still being stored in the application layer can be retained only as long as it is needed, and then you expire that data. All of those techniques, we want to start being able to exercise those muscles. Always, whenever we're thinking, ok, data, where should it be stored? Is it on the user side, is it on the application side, for how long? All those things we would want to tease apart as we look through it.

Knecht: I think this is a very ambitious future that you're proposing here with decentralized design and so many different domains in places. I'm curious, your thoughts on maybe what are the biggest challenges, and what are some contributing factors to why we have the currently prevalent non-user centric architectures that we have? What are the ways that we can break those things down?

Asthagiri: I was pleasantly surprised in terms of when I was doing research for this talk on things that I had found where there were already a lot of emerging technologies in this area. Then I was even more pleasantly surprised after my talk, some of them, people were telling me, have you heard about "The Intention Economy," by Doc Searls. People were telling me about this Internet Identity workshop, which is an un-conference where people have been meeting for a couple of years to talk about this, and all the different angles to it. Then there's user driven services, which is trying to tackle the user experience aspect of this. I do think that this is not something that could just be addressed, technically. It's an architecture talk on this. It's not going to move the needle, just by itself. It is a holistic movement where the business incentive, the legal compliances, the user experience and helping users understand what's happening today. Right now, even when we talk about privacy to certain users, they're complacent. They think that, ok, this just has to be that way. Or they don't even understand the level and extent to which their data is being shared. With surveillance movements happening in our own country and elsewhere, where, if I want to be able to go to a mother's home for whatever reason, people would be able to find out. My geolocation is available in so many different ways today, and through so many different apps. There's an educational component for the users as well. I think just holistically addressing each of these different aspects, and then being able to bring some of this forth. I do think that that Internet Identity workshop is a great one too, where if people would like to learn a little bit more, they do have proceedings from past workshops, where a lot of thought leaders in the space come together and talk through it.

 

Recorded at:

Oct 19, 2023
