GDPR for Operations

Key Takeaways

  • Build strong identity concepts into your infrastructure from the beginning - consider the identities of servers and containers as well as people.
  • Use role based access control, and the principle of least privilege to limit access to resources only to the entities that require it.
  • Give consideration to detailed, tamper-proof audit logs and think about your retention policies in light of GDPR's requirements.
  • Stop copying data from production into development environments without first masking personal data.
  • Use tools from trusted vendors to monitor and respond to security events on your network.

What's GDPR and why should I care?

The GDPR (General Data Protection Regulation) is intended to help EU citizens take greater control over their personal data. It applies to any organisation (inside or outside the EU) that holds data on individuals in the EU, so the scope is pretty broad, and this has caused a fair amount of panic for some organisations.

Part of the reason for this is that the regulation introduces sanctions of up to €20M or 4% of annual worldwide turnover, whichever is higher. This is a more significant financial penalty than can be levied under the Data Protection Directive (which GDPR replaces), and has therefore forced a more serious conversation about compliance and accountability into the boardroom.

As ever though, the real hard work of implementing GDPR goes on elsewhere in an organisation. As practitioners of systems administration, ops, DevOps or SRE, what do we need to be worrying about?

Reading the Rules

If you're going to be bound by a regulation, I usually recommend taking the time to read its text, so that you get a feel for the language being used and are better positioned to discuss details with auditors or advisors. However, don't expect to find useful technical answers in the GDPR itself. It is a legislative instrument designed for lawyers, not a security how-to for engineers. Readers who've worked with the very prescriptive PCI-DSS checklists may be disappointed to find that there is no equivalent here.

There's both good and bad news here. The good news is that many of the provisions of GDPR are not dissimilar from the provisions of the Data Protection Directive that it replaces. The bad news is that many organisations aren't really doing a good job of supporting those either.

Fundamentally, most compliance regimes are about data security, and their content can basically be boiled down to the two main concerns of "don't let the data out" and "don't let the bad guys in". GDPR expands on these basics, requiring that individuals be given the right to know more about how you're using their data; to request copies of data about them; and to request that you delete data about them. These have obvious implications from a technology point of view.

In particular, Article 25 refers to "data protection by design and by default". What design considerations should we be building into our systems, then?

Identity and Access Control

From a systems perspective, good security posture requires strong management of identity. Every individual interacting with a system should have their own unique identity, managed centrally in some kind of directory solution. This might be a service you run yourself like Active Directory or LDAP, or it may be a SaaS solution provided by a company like Okta. This identity service will manage credentials such as passwords and any security tokens used for additional authentication factors. By providing a SAML or OAuth2 IdP (identity provider) backed by this service, you can federate this identity into any number of separate systems, allowing users to log in with a single set of credentials.

Whilst this sort of user federation is most commonly used with external SaaS services, it can also be used with internal web systems. Using Traefik or similar software (such as Google's IAP) it's possible to put an identity-aware proxy in front of existing web apps, so that users must log into the proxy with their single sign-on credentials in order to access the app itself.

Within Amazon Web Services, both IAM (their Identity and Access Management service) and Cognito can make use of federated sources of identity, allowing users to log into those services with their single sign-on credentials.

This doesn't stop with web interfaces either: Hashicorp Vault provides a security API service which can make use of an external IdP as a source of identity for users. Once identified, users can ask Vault to vend signed SSH certificates which will allow shell access to remote systems; or to provide temporary access credentials to databases and other services.

With a good identity story, the next thing to think about is Role Based Access Control (RBAC). Managing permissions on a user-by-user basis is tedious, scales badly, and is difficult to reason about from an audit perspective. In an RBAC model, we create roles, and assign permissions to those roles. A role might be "database administrator" or "customer service agent" for example. Each of these roles will have very different permissions - the database administrator might have shell access to database clusters, but no access to the customer service web front end. By assigning roles to users, we can easily see who has which groups of permissions.
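The RBAC model described above can be sketched in a few lines of Python. This is a minimal illustration, not a real authorisation system: the role names come from the examples in the text, while the permission strings (`db:shell`, `crm:read` and so on) and the `User` class are hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical permission names attached to each role, for illustration only.
ROLE_PERMISSIONS = {
    "database_administrator": {"db:shell", "db:backup"},
    "customer_service_agent": {"crm:read", "crm:update"},
}


@dataclass
class User:
    name: str
    roles: set = field(default_factory=set)


def permissions_for(user):
    """Return the union of permissions granted by every role the user holds."""
    perms = set()
    for role in user.roles:
        perms |= ROLE_PERMISSIONS.get(role, set())
    return perms


def is_allowed(user, permission):
    """Deny by default: allow only if one of the user's roles grants it."""
    return permission in permissions_for(user)


alice = User("alice", {"database_administrator"})
bob = User("bob", {"customer_service_agent"})
```

Here `is_allowed(alice, "db:shell")` is true while `is_allowed(alice, "crm:read")` is not, and answering "who can do what" for an auditor becomes a matter of reading one table of roles.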

Identity isn't just about people - physical servers, virtual machines, containers and applications can all have identities too. Cloud vendors such as AWS, and orchestration platforms like Kubernetes all have strong concepts of identity for their component parts. In AWS, EC2 instances can gain access to their "instance identity document", as well as a set of signatures which can be used to verify the authenticity of this data. The identity document can be used to prove instance identity to Hashicorp Vault, which can then securely provide secrets to that instance based on the role it has been assigned. A similar workflow is available to establish the identity of a Kubernetes-scheduled container. With strong identity principles like these, it should never be necessary to place secrets such as database credentials or API keys directly into application config - these can be managed centrally instead.

We can now assign roles both to people and to parts of our system. Each of these roles should be given the absolute smallest set of permissions required to do their job. This principle of least privilege makes it easy to demonstrate to a compliance auditor which people or systems may access which data objects.

Logging and Auditing

Once we've established which people or systems can access which items of data, we need to record such access in order to be able to show to an auditor that our access controls are working correctly. We also need to be able to demonstrate that logs can't be tampered with after they've been written to.

In AWS, we can enable CloudTrail logging, which will log all AWS API calls made against our resources. These logs should be written and encrypted into S3 storage owned by a dedicated secure logging account. Access to this logging account should be strictly controlled, and the policies on this bucket should ensure it is not possible to modify or delete logs once written.
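As a rough sketch of what such a bucket policy might look like, the Python below builds a policy document that denies object deletion to every principal. The bucket name and statement ID are hypothetical, and this is only one piece of the puzzle: it assumes versioning is enabled on the bucket so that an overwrite creates a new version rather than replacing the original, and a real setup would also need to restrict who may change the policy itself.

```python
import json

LOG_BUCKET = "example-audit-logs"  # hypothetical bucket name


def deny_delete_policy(bucket):
    """Build an S3 bucket policy document that denies deletion of any object."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyLogDeletion",  # hypothetical statement ID
                "Effect": "Deny",
                "Principal": "*",
                "Action": ["s3:DeleteObject", "s3:DeleteObjectVersion"],
                "Resource": f"arn:aws:s3:::{bucket}/*",
            }
        ],
    }


# JSON text suitable for attaching to the bucket as its policy.
policy_json = json.dumps(deny_delete_policy(LOG_BUCKET), indent=2)
```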

Other system and application logs should be aggregated in a similar manner, shipped straight off host servers into secure, tamper-proof storage. The log shipper you use here should be configured to copy log data verbatim, so that you can demonstrate to an auditor that the stored logs have not been modified from their original form.
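One way to demonstrate that stored logs have not been altered is to chain a hash through them, so that changing any line invalidates every digest after it. This is an illustrative technique rather than something a particular log shipper provides; the function names below are hypothetical.

```python
import hashlib

GENESIS = "0" * 64  # starting value for the chain


def chain_logs(lines):
    """Pair each line with a SHA-256 digest covering it and the previous digest."""
    prev = GENESIS
    entries = []
    for line in lines:
        digest = hashlib.sha256((prev + line).encode("utf-8")).hexdigest()
        entries.append((line, digest))
        prev = digest
    return entries


def verify_chain(entries):
    """Recompute the chain and report whether every stored digest still matches."""
    prev = GENESIS
    for line, digest in entries:
        expected = hashlib.sha256((prev + line).encode("utf-8")).hexdigest()
        if digest != expected:
            return False
        prev = digest
    return True
```

Someone able to rewrite the whole file could recompute the chain, so the most recent digest would need to be recorded somewhere they cannot reach, such as the separate logging account.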

If you’re also using tools such as Elastic’s ELK stack to view and search log data, there are various reasons why you might want to modify log data as you ship it. In that case, use a second log shipper configuration for this less secure copy of the logs.

It's possible that logs will contain personal data as defined by GDPR terms, and as such these should be expired and deleted on an appropriate timeline. What that timeline looks like will depend on your specific workload and on any other compliance obligations you carry. Article 17 of GDPR covers "right to erasure", where a data subject can request that you delete any personal data you hold on them. This will be easier the less data you hold by default.
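A retention policy ultimately comes down to comparing the age of each log object with a cutoff. The sketch below assumes you can list log objects as `(name, written_at)` pairs; the function name and the 90-day figure in the usage note are illustrative, not a recommendation.

```python
from datetime import datetime, timedelta, timezone


def expired(objects, retention_days, now):
    """Return the names of log objects written before the retention cutoff."""
    cutoff = now - timedelta(days=retention_days)
    return [name for name, written_at in objects if written_at < cutoff]


now = datetime(2018, 6, 1, tzinfo=timezone.utc)
objects = [
    ("app-2018-01-01.log", datetime(2018, 1, 1, tzinfo=timezone.utc)),
    ("app-2018-05-25.log", datetime(2018, 5, 25, tzinfo=timezone.utc)),
]
to_delete = expired(objects, 90, now)
```

With a 90-day retention period, only the January file falls before the cutoff and is selected for deletion.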

Article 15 covers "right of access by the data subject" wherein someone can ask you to provide all the data you hold on them. You probably have a good idea of what personal data exists in your primary data stores, and how that data is linked with relations, but it might be less obvious which of those items of data could end up in your logs. This probably means that you’d need to be able to search for a particular user's log entries under a right of access request. In this case, structured (rather than free text) log data is likely to be useful, and search tooling such as AWS Athena might come in handy.
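To show why structured logs help here, the sketch below filters JSON log lines by a `user_id` field. The field name is an assumption about how events are logged; with free-text logs the same request would mean guessing at patterns instead.

```python
import json


def entries_for_subject(log_lines, user_id):
    """Yield parsed JSON log events whose `user_id` field matches the subject."""
    for line in log_lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip free-text lines that are not valid JSON
        if isinstance(event, dict) and event.get("user_id") == user_id:
            yield event


lines = [
    '{"user_id": "u1", "action": "login"}',
    "free text line with no structure",
    '{"user_id": "u2", "action": "login"}',
    '{"user_id": "u1", "action": "export"}',
]
subject_events = list(entries_for_subject(lines, "u1"))
```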

To make complying easier, you might want to insist that software developers take steps with their logging frameworks to remove personal data from log events if they're not necessary. Bear in mind that under GDPR rules, device identifiers, IP addresses, postcodes, and so forth could be considered personal data since they could be used to single out an individual, so consider those too.
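As an example of what a logging framework can do, Python's standard `logging` module lets you attach a filter that rewrites each record before it is emitted. The sketch below masks email addresses and IPv4 addresses; the two regular expressions are deliberately simple and will not catch every form of personal data, and the `RedactingFilter` name is hypothetical.

```python
import logging
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def redact(text):
    """Replace email addresses and IPv4 addresses with placeholders."""
    text = EMAIL_RE.sub("[email]", text)
    return IPV4_RE.sub("[ip]", text)


class RedactingFilter(logging.Filter):
    """Redact the fully formatted message of every record that passes through."""

    def filter(self, record):
        record.msg = redact(record.getMessage())
        record.args = ()  # message is already formatted, so drop the arguments
        return True
```

Attaching it with `logger.addFilter(RedactingFilter())` applies the redaction to everything logged through that logger.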

Backups

It's very likely that we'll have personal data in our backups. GDPR may therefore impact our retention policies. Under the right to erasure, a data subject can ask to have data about them removed. If we only delete that subject's data from our production systems, we'll still have copies in our backups.

You'll want to ask a friendly lawyer about this, but from my own research into the subject, it looks like it should be reasonable to remove data from production databases, and inform the data subject that whilst their data will still exist in backups, these will be aged out in 30 days (or whatever your retention policy specifies).

In the event that you need to restore from backups, you'd need to erase that data subject's data again, and so erased subjects would need tracking, at least for the length of your retention policy.
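Tracking erased subjects could look something like the sketch below: a ledger of erased subject identifiers that is re-applied after a restore and pruned once no backup can still contain the data. The `ErasureLedger` class, the `subject_id` column and the 30-day figure are all assumptions for illustration.

```python
from datetime import date, timedelta

RETENTION = timedelta(days=30)  # assumed backup retention period


class ErasureLedger:
    """Remember which subjects were erased, for as long as backups may hold them."""

    def __init__(self):
        self._erased = {}  # subject_id -> date the erasure was carried out

    def record(self, subject_id, erased_on):
        self._erased[subject_id] = erased_on

    def prune(self, today):
        """Forget erasures older than the retention period."""
        self._erased = {
            sid: day for sid, day in self._erased.items() if today - day <= RETENTION
        }

    def reapply(self, restored_rows):
        """Drop restored rows that belong to an erased subject."""
        return [row for row in restored_rows if row["subject_id"] not in self._erased]
```

The ledger should hold only an opaque identifier for each subject, otherwise it becomes another store of the data you were asked to delete.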

Creating a "backup administrator" role, preventing access to backups by anyone else, and limiting the number of individuals who have that role will help reduce the number of individuals who have access to erased data during the backup retention period, which seems like a reasonable measure to take.

Dev / Test Data Sets

Some companies are accustomed to being able to restore copies of production data into staging or development systems in order to facilitate testing. There may be an argument to allow this in staging environments, assuming access to those is limited in the same way as production access. However, allowing all your developers access to your full data set is definitely a no-no with GDPR.

There exist commercial solutions (such as Data Masker from RedGate Software) to take a data set and mask sensitive data as part of an ETL operation into another database. I've also seen organisations attempt to build these themselves.
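For a sense of what a home-built masking step involves, the sketch below replaces sensitive columns with a keyed hash (HMAC-SHA256), so the same input always maps to the same token and joins between tables still work. This is not how any particular commercial product works; the column names and key are hypothetical, and pseudonymised data of this kind can still count as personal data under GDPR, so treat it as a starting point for discussion with your advisors.

```python
import hashlib
import hmac


def pseudonymise(value, key):
    """Return a stable 16-character token derived from the value and a secret key."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]


def mask_row(row, key, sensitive=("name", "email")):
    """Copy a row, replacing the sensitive columns with pseudonymous tokens."""
    masked = dict(row)
    for column in sensitive:
        if masked.get(column) is not None:
            masked[column] = pseudonymise(str(masked[column]), key)
    return masked


key = b"example-key-kept-out-of-dev"  # hypothetical secret
row = {"id": 1, "name": "Ada Lovelace", "email": "ada@example.com", "plan": "pro"}
masked = mask_row(row, key)
```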

It may also be sufficient to generate dummy data sets for use in development environments, and tools exist that facilitate this generation. You'll need to ensure your generated data is of a realistic size and cardinality, otherwise your dev systems will perform very differently.
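A very small generator along these lines might look like the following. It uses only the standard library, takes the row count and the number of distinct postcodes as parameters so that size and cardinality can be tuned to resemble production, and is seeded so runs are repeatable. The field names and value formats are made up for illustration.

```python
import random


def generate_customers(count, distinct_postcodes, seed=0):
    """Generate `count` fake customers drawn from a fixed pool of postcodes."""
    rng = random.Random(seed)
    # Obviously fake postcodes; the pool size controls the column's cardinality.
    postcodes = [f"ZZ{n} {rng.randint(1, 9)}ZZ" for n in range(distinct_postcodes)]
    customers = []
    for i in range(count):
        customers.append(
            {
                "id": i,
                "email": f"user{i}@example.test",
                "postcode": rng.choice(postcodes),
                "orders": rng.randint(0, 50),
            }
        )
    return customers


customers = generate_customers(1000, 20, seed=1)
```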

Which of these approaches is the most appropriate will be very workload dependent. Close collaboration with development teams will be very important here.

Monitoring and Alerting

Articles 33 and 34 require that in the event of a data breach, you notify affected data subjects, along with the supervisory authority - there will be one of these for each member state in the EU.

Obviously, this is only possible if you know you've been breached, and so monitoring, security scanning and alerting will all play their part here. Generally I'm a fan of open source solutions, but in the case of security monitoring, turning to vendors is the smart option since they have whole teams of people working to keep their solutions up to date with details of the latest threats.

Web application firewalls (WAF) can help mitigate common modes of attack on web applications and APIs, watching requests for the fingerprints of these sorts of attacks and blocking them at their source. For example, a WAF might scan the content of every HTTP request, applying a list of regular expressions that match known SQL injection attacks. In the event that a pattern matches, the request is blocked instead of being forwarded to the application cluster behind it.
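The pattern-matching idea can be shown in miniature. The three expressions below are simplified examples written for this sketch; real WAF rule sets are far larger and more carefully tuned, and patterns this crude would produce false positives.

```python
import re

# Simplified, illustrative patterns; not taken from any real WAF rule set.
SQLI_PATTERNS = [
    re.compile(r"(?i)\bunion\b.+\bselect\b"),
    re.compile(r"(?i)\bor\b\s+1\s*=\s*1"),
    re.compile(r"(?i);\s*drop\s+table\b"),
]


def should_block(request_body):
    """Return True if any known-bad pattern appears in the request content."""
    return any(pattern.search(request_body) for pattern in SQLI_PATTERNS)
```

A proxy built on this would forward the request only when `should_block` returns false.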

Scanning outbound network traffic can help identify data breaches - smart modern tools like Darktrace use machine learning to build a model of what normal looks like, in order to look for anomalies, as well as pattern matching on typically "personal" data such as credit cards, postcodes and email addresses. Seeing too much of such data leave your network can raise a red flag for further investigation.
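The pattern-matching half of this can be sketched simply; the machine learning half cannot, and the code below makes no attempt at it. The sketch counts email addresses and card-like numbers in an outbound payload, using the Luhn checksum to discard digit runs that are not plausible card numbers. The threshold is an arbitrary illustrative value.

```python
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# 13 to 19 digits, optionally separated by single spaces or hyphens.
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_valid(number):
    """Check a digit string (separators ignored) against the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    total = 0
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:  # double every second digit from the right
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def count_personal_data(payload):
    """Count email addresses and Luhn-valid card-like numbers in a payload."""
    cards = sum(1 for m in CARD_RE.finditer(payload) if luhn_valid(m.group()))
    emails = len(EMAIL_RE.findall(payload))
    return {"cards": cards, "emails": emails}


def looks_suspicious(payload, limit=100):
    """Flag a payload carrying more personal-looking items than the limit."""
    counts = count_personal_data(payload)
    return counts["cards"] + counts["emails"] > limit
```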

Inside the network, intrusion detection tooling can help identify when your systems have been accessed by bad actors, either by scanning network traffic or by watching log data. AlertLogic and ThreatStack both have offerings in this space, and AWS GuardDuty offers some of these features too.

Other Good Hygiene

There are a couple of other security practices that you should be employing here too.

You should encrypt everything in transit and at rest. There's really no reason not to do it these days - and cloud vendors even provide primitives to make this easier for you. It's more straightforward to design this into an infrastructure from the beginning than to retrofit it later, so make sure it's a design consideration up-front.

Network design is still important. Protect your perimeter, and use host firewalls and security groups inside the network to limit access to your systems. Separate your management tools from your other systems, and keep each environment distinct from each other to prevent leakage of data between them. In some cases it may be appropriate to segregate different classes of data into distinct databases, in order to further limit access to them on the network.

It should go without saying, but you need to ensure you have a regular software patching regime - and not just for your infrastructure components. Keep on top of newer versions of your application dependencies too - and employ tools like Snyk in your CI pipeline to get alerts on dependency vulnerabilities before your code makes it into production. High profile incidents such as the Equifax data breach are often the result of insecure libraries still being in use.

Security is as much a development concern as an operational one. Tools like OWASP Zap and Gauntlt can help look for security problems in application code before they go live and cause trouble.

Consider what external services you make use of, and what data you pass to them. If you're using SaaS logging providers, for example, be aware that you may be passing personal data outside of your network, and that they also then have obligations to your data subjects.

Conclusion

GDPR is an unavoidable fact of life for anyone working with data about EU citizens. Taking care of this personal data is an organisation-wide responsibility, but in the operations part of the business we can provide a lot of supporting tools to help deal with the multiple facets of this problem.

GDPR doesn’t substantially extend the provisions of the Data Protection Directive, and so a lot of what I’ve described here is good practice that you should already be following. However, the penalties for not complying with GDPR are much higher. It’s time to stop looking at security as an obligation, and start making customers’ data privacy a reality.

About the Author

Jon Topper is CTO at The Scale Factory, a UK-based DevOps and cloud infrastructure consultancy. He’s worked with Linux web hosting infrastructure since 1999, originally for ISP Designer Servers, and then for mobile technology startup Trutap. Since 2009, Jon and The Scale Factory have worked to design, build, scale, and operate infrastructure for a range of workload types. Jon has a particular interest in systems architecture, and in building high performing teams. He’s a regular speaker on DevOps and cloud infrastructure topics, both in the UK and internationally.
