Zalando’s STUPS: Creating an Audit-Compliant PaaS on Top of AWS

At the microXchg conference 2016, Zalando talked about their journey to creating an audit-compliant Platform as a Service (PaaS) for multiple autonomous teams that runs on top of Amazon Web Services (AWS). InfoQ recently sat down with Tobias Sarnowski, a Cloud Architect at Zalando, and discussed further technical detail about the platform.

Key lessons learned included: autonomous teams, and a supporting technological platform (PaaS), are essential to scale development as an engineering organisation grows; auditing of actions taken on the PaaS is vital for both legal and diagnostic reasons; and ensuring the technological vision (and corresponding ‘rules of play’) is shared and understood by everybody in the company helps drive consistent technological decision making.

InfoQ: Hi Tobias! Thanks for taking the time to talk to InfoQ today. Could you introduce yourself and the Zalando STUPS platform please?

Sarnowski: Sure. I’m Tobias Sarnowski, a Cloud Architect at Zalando. Zalando is an online fashion platform based in Berlin, with a few offices across Europe. I’ve been at Zalando since 2012.  

STUPS is a set of tools and components that provides a convenient and audit-compliant Platform-as-a-Service (PaaS) for multiple autonomous teams on top of Amazon Web Services (AWS). It’s basically an add-on for AWS that provides some opinionated abstraction over the IaaS. Among other things, it covers metadata services for deployed applications, deployment tooling based on Docker, and secret distribution for OAuth 2.0 credentials.

My team and I publicly released STUPS in spring 2015, not long after Zalando’s tech department adopted Radical Agility. Radical Agility is based on Daniel Pink’s book “Drive”—it’s an approach to running a tech organization that promotes the concepts of autonomy, mastery and purpose, along with trust. STUPS was, in some ways, a precondition for ‘Radical Agility’ to happen; we needed something like it, to enable our engineering teams to really start working autonomously.  

InfoQ: What were the motivations for building your own platform, in comparison with leveraging something that is already available (e.g. Cloud Foundry, Kubernetes, Cisco Mantl)?

Sarnowski: Just to clarify, STUPS is not a standalone platform, but a light add-on on top of AWS that allows us to leverage all AWS products to their full extent. We don’t want to restrict people in what they can do. We started with the goal to give our employees full access to native AWS. We only had to restrict some parts that we couldn’t track for our continuous internal and external audits. And, thankfully, we succeeded! Our employees get nearly administrative access to their AWS accounts with almost no restrictions.

InfoQ: Why did you build STUPS on AWS, and not, say, GCP (or make STUPS platform-independent)?

Sarnowski: AWS was a safe bet in terms of cloud capabilities and maturity. Its strengths and quirks are mostly well understood. The diversity of tools gave us a quickstart without requiring us to invent much ourselves.

At the time of our evaluation of AWS, GCP was still ramping up its features. We see big potential in GCP, and will have a closer look at it.

STUPS cannot be platform-independent as it’s more of an add-on to AWS, and nothing on its own. It helps us to use the AWS services in a Zalando-compliant way.

InfoQ: Some people may argue that you have 're-invented the wheel' with some of the tooling, e.g. mint/HashiCorp Vault, fullstop/Gilt Cave, Senza/Ansible. Could you share the team's thinking behind creating these tools?

Sarnowski: The STUPS tools are not about reinventing the wheel per se, but about more fully leveraging what AWS offers. For example, mint takes fuller advantage of AWS IAM and S3 to distribute our secrets. This is a very lightweight and extremely resilient way to work with AWS and S3, and doesn’t require us to maintain a complex infrastructure. Senza is just a light wrapper around AWS CloudFormation—again, leveraging the full power of AWS products instead of reinventing them.

Taupage, our AMI, as well as Senza provide an opinionated way of doing deployments. Taupage itself is our way of orchestrating Docker containers on a large scale. Docker serves as our deployment artifact format, but every container gets its own server.

The AWS ecosystem is primarily focused on the one app per server model, so we knew from the beginning that we had to use this to automatically get all the features that, for example, Auto Scaling Groups and Elastic Load Balancers provide. Senza itself promotes the idea of immutable deployments. For every application version, we create new CloudFormation stacks, and we don’t do any in-place updates of our software. This way of deploying provides a very deterministic way for us to do our blue/green deployments.
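
As an illustration of the immutable deployment idea described above, here is a minimal Python sketch. It is not taken from Senza and makes no AWS calls: the `Deployment` class, its method names, and the percentage-based traffic weights are all hypothetical, chosen only to show the pattern of one new stack per version followed by a traffic switch.

```python
from dataclasses import dataclass, field


@dataclass
class Deployment:
    """Hypothetical in-memory model: one immutable stack per application version."""

    app: str
    # Maps stack name -> share of traffic in percent (an assumed representation).
    stacks: dict = field(default_factory=dict)

    def create_stack(self, version: str) -> str:
        # Every version gets a brand-new stack; existing stacks are never modified.
        name = f"{self.app}-{version}"
        if name in self.stacks:
            raise ValueError(f"stack {name} already exists; deploy a new version instead")
        self.stacks[name] = 0  # a new stack starts without traffic
        return name

    def switch_traffic(self, name: str) -> None:
        # Blue/green switch: route all traffic to one stack and none to the others.
        if name not in self.stacks:
            raise KeyError(name)
        for stack in self.stacks:
            self.stacks[stack] = 100 if stack == name else 0

    def delete_stack(self, name: str) -> None:
        # A stack is removed only once it no longer serves traffic.
        if self.stacks[name] != 0:
            raise ValueError(f"stack {name} still receives traffic")
        del self.stacks[name]
```

In this model a rollback is just another `switch_traffic` call back to the previous stack, which is one reason the approach is deterministic: the old version is still running, unchanged, until it is explicitly deleted.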

InfoQ: How long has STUPS been in development, and how many production services do you run on it? Is the plan to completely migrate all of the Zalando applications to STUPS?

Sarnowski: We started building STUPS in late 2014 and made it available for production usage to every Zalando team in May 2015. All of our microservices and stateful services (PostgreSQL, Cassandra, Spark and Kafka) on AWS also run on top of STUPS.

And yes, the current plan is to build everything on top of STUPS.

InfoQ: On the STUPS website you mention that "all team members have equal rights" and "teams are autonomous and can choose technologies as they think fit". How does this work in practice, both at the technical and organisational level?

Sarnowski: When we announced Radical Agility, we also adopted a set of five “Rules of Play” that act as technical and organisational guidelines for teams. On the technical side, these guidelines require all our services to provide a well-defined REST API (defined with OpenAPI 2.0, formerly Swagger 2.0). We’re also leveraging all the tools that enable the Internet to scale: DNS for our service discovery, and OAuth 2.0 for access control, for example. This makes our overall architecture independent of individual technology decisions.

For scaling our OAuth 2.0 infrastructure, we also just recently published our “Plan B” provider, which scales with our services. We are using OAuth 2.0 to its extremes, because every service call has to be authenticated and authorized. This puts a lot of pressure on the OAuth infrastructure, and the common OAuth providers weren’t capable of handling the load with appropriate response times. Each millisecond wasted on OAuth easily adds up, because of all the cascading microservice calls.

With our Plan B approach, we are leveraging JWTs to be able to verify tokens in a decentralized way, so that we don’t have a huge single point of failure for token verification. It also allows us to deploy our tokeninfo endpoints locally to the applications (on the same server), removing network latency issues and spreading load equally everywhere.
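
To make the idea of local token verification concrete, the following Python sketch checks a JWT's signature and expiry without contacting any central server. It is an illustration only, not Plan B's code: the function names are hypothetical, and it uses the HS256 (shared-secret HMAC) algorithm purely so the example runs on the standard library. The interview does not say which algorithm Plan B uses; a decentralized setup would more plausibly use an asymmetric signature so that verifiers hold only a public key.

```python
import base64
import hashlib
import hmac
import json
import time


def _b64url_encode(raw: bytes) -> str:
    # JWTs use URL-safe base64 without padding.
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    # Restore the padding that the JWT encoding strips.
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def issue_token(claims: dict, key: bytes) -> str:
    """Create a signed token (stands in for the central token issuer)."""
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = ".".join(
        _b64url_encode(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, claims)
    )
    signature = hmac.new(key, signing_input.encode("ascii"), hashlib.sha256).digest()
    return f"{signing_input}.{_b64url_encode(signature)}"


def verify_token(token: str, key: bytes, now=None) -> dict:
    """Verify a token locally and return its claims; raise ValueError if invalid."""
    try:
        header_b64, payload_b64, signature_b64 = token.split(".")
        header = json.loads(_b64url_decode(header_b64))
        signature = _b64url_decode(signature_b64)
    except ValueError as exc:
        raise ValueError("malformed token") from exc
    if not isinstance(header, dict) or header.get("alg") != "HS256":
        raise ValueError("unsupported algorithm")

    # Recompute the signature over header and payload; compare in constant time.
    signing_input = f"{header_b64}.{payload_b64}".encode("ascii")
    expected = hmac.new(key, signing_input, hashlib.sha256).digest()
    if not hmac.compare_digest(expected, signature):
        raise ValueError("bad signature")

    claims = json.loads(_b64url_decode(payload_b64))
    exp = claims.get("exp")
    current = time.time() if now is None else now
    if exp is not None and current >= exp:
        raise ValueError("token expired")
    return claims
```

Because `verify_token` needs only the key and the token itself, it can run in a process next to each application, which is the property that removes the network round trip for every service call.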

In terms of inter-team communication, we are in the process of adopting the Objectives and Key Results (OKR) pattern. Even at that level, teams are autonomous. Communication is the key value we focus on, so that teams cooperate instead of only working on their own things.

InfoQ: Several people, such as Randy Shoup, have argued that eventually independent teams creating and maintaining microservices will need to implement some kind of economic charge-back model to ensure that other teams don't consume resources recklessly. Have you looked into this?

Sarnowski: In fact, we have. We haven’t worked long enough with the microservices model yet to know exactly what we will need in the future, but internal accounting for API usage is very likely at some point. For now, each team acts as an independent SaaS provider; this will probably lead to billing capabilities as well.

InfoQ: The notion of compliance and audit appears to be a first-class citizen in STUPS. Can you explain why please?

Sarnowski: On the one hand, we wanted to maximize team autonomy by granting raw AWS access. However, we’re also a publicly traded tech company that has to abide by financial regulations, as well as laws related to handling and protecting data: PCI, SOX and German laws, for example. Auditors need to know that we’re complying with the regulations.

Bridging this gap was one of the biggest challenges. The system we came up with integrates smoothly with our development workflows and provides us near-realtime notifications in case of violations. This helps the developers to immediately recognize those violations and react to them. We are in constant exchange with our teams to improve our tooling and workflows so that they are not slowed down by the regulations.

InfoQ: What is the best way for people interested to contribute to STUPS? Are you looking for any particular help?

Sarnowski: Firstly, we’d love it if more people tried using STUPS. We’re currently making STUPS more easily adaptable and more accessible for external companies, mainly by focusing on customizable enterprise integration. Discovering and reporting problems with the documentation also helps, as it allows others to operate on STUPS too. Please get in touch with us if you need help with STUPS or if you find problems.

InfoQ: Thanks for talking to InfoQ today. Would you like to share anything else with our readers?

Sarnowski: I’d like to share that STUPS illustrates some of the pretty unique and large-scale technical challenges that our engineering team gets to work on every day. Zalando’s made a huge shift in the last year to transform from “online shop” to “fashion platform,” which from a technical perspective means building and scaling an infrastructure that supports lots of diverse business initiatives. Aside from releasing a lot of work as open source, we keep an active tech blog and give a lot of talks at conferences. It’s an exciting time for the tech team.

Additional information on the STUPS platform can be found on the platform's website, and the associated code can be found in Zalando's STUPS GitHub account.
