
Q&A with Galo Navarro on Building an Effective Platform Team

Key Takeaways

  • Navarro argues that an effective platform team needs to find the balance between setting organizational standards and permitting product team autonomy.
  • Standards should "effectively encode and disseminate existing organizational knowledge".
  • Providing the platform team with a clear mandate, goals, and autonomy helps them "introduce tools that generate small impacts to a wide surface".
  • Setting a measurable metric as the platform team's North Star helps ensure that the team's output meets the needs of downstream teams.

In his recent article, Galo Navarro discussed what he learned as a Principal Software Engineer at Adevinta while building a platform, and the team behind it, to support over 1,500 developers. He notes that companies that reach a certain size tend to create one or more teams to care for their technical infrastructure.

He calls out that the key struggle within these teams is to find the correct balance between setting standards and allowing for individual team autonomy. Navarro states that "opinionated Platform teams risk coming across as patronizing, ivory-tower-dwelling jerks that impose capricious choices on other engineering teams."

InfoQ recently sat down with Navarro to discuss approaches to structuring a platform team so that it can best support the organization.

InfoQ: You indicate that for a platform team to make a meaningful impact, the organization must have standards. Could you elaborate on what you mean by that and what sort of standards you feel need to be present?

Galo Navarro: Sure. The point of creating a platform team is to leverage economies of scale. Organizations create that type of team because they expect to amortize the costs of building tools and infrastructure over the population of engineers that will use them.

Imagine a Platform team that is asked to provide an RPC solution for service-to-service communication, but some of their product teams want to use JSON, others Thrift, others Protobuf, others their own custom serialization format. The Platform team is in a tough spot. If they support every choice, they multiply costs (development and maintenance); this is what I mean by spreading too thin. If instead they support only one or two, they reduce their surface of impact. Either way, this team will have a lower ROI for the company. On the contrary, if the organization is able to settle on one standard, the Platform team can focus on building one solution that immediately spreads to every engineering team, and can then move on to solving other problems. The team has a higher ROI, and more chances of success.

Standards are inseparable from economies of scale. There is no Internet without a standard networking protocol. No global logistics without a standard shipping container. An organization that creates a Platform team is looking for economies of scale, and must be ready to define and evolve standards if it wants to set this team up for success.

For your second question, the types of standards that make the most sense to me are those that a) address roadblocks and friction points that slow down product teams, and that b) are specific to the organization. Standards that:

  • Effectively encode and disseminate existing organizational knowledge. I stress "existing" because I think you don't want to introduce innovations through standards, but rather consolidate choices that have been tested and hardened by experience, and are trusted enough to be applied across the organization. These are best practices, successful patterns, scar tissue, etc. Good examples here can be build systems, runtimes, etc.

  • Function-like APIs that both decouple and articulate organizational units, allowing teams to make productive assumptions about each other, while remaining independent. An example could be everything around service-to-service communication (e.g. JSON over HTTP, gRPC, Thrift), tailored to org-specific needs (e.g. authentication/authorisation, security standards, instrumentation, quality assurance...)
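
To make this concrete, here is a minimal sketch of the kind of convention such a standard might encode: a thin, org-wide service-to-service HTTP client that bakes in authentication and instrumentation so that product teams never re-decide those concerns. All names, headers, and URLs below are illustrative assumptions, not a real implementation.

    # Illustrative sketch only: a thin org-standard client for service-to-service
    # calls. The hostname pattern, header, and metric name are hypothetical.
    import time
    import urllib.request

    def call_service(service: str, path: str, token: str, timeout: float = 2.0) -> bytes:
        """Call another internal service using the org-wide conventions."""
        url = f"https://{service}.internal.example.com{path}"
        request = urllib.request.Request(url)
        # Org-wide choice: every service-to-service call carries a bearer token.
        request.add_header("Authorization", f"Bearer {token}")

        started = time.monotonic()
        try:
            with urllib.request.urlopen(request, timeout=timeout) as response:
                return response.read()
        finally:
            # Org-wide choice: every call emits the same latency metric, so
            # dashboards and alerts work identically for all teams.
            elapsed_ms = (time.monotonic() - started) * 1000
            print(f"rpc.client.duration service={service} path={path} ms={elapsed_ms:.1f}")

The value here is not the handful of lines of code, but that the authentication and instrumentation decisions are made once and reused by every team.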

InfoQ: Are there warning signs that a platform team is not achieving the correct balance between autonomy and standardization?

Navarro: Where the correct balance lies varies greatly from organization to organization. Size, maturity, industry... those and other factors influence where they want to place the dial.

The main warning sign I'd watch for is the absence of a deliberate choice of what "right" means for them. A well-articulated strategy gives both Platform and Product teams clarity and direction so they can focus on moving the right dials. When they know what the organization wishes to optimize for, they measure and act accordingly. Without that, efforts degenerate into feuds between misaligned factions (e.g. product vs. infra, mobile vs. backend) that are seldom productive.

Given a solid strategy, I think mature organizations can enjoy both autonomy and standards because their teams are able to (dis)agree and commit productively. Less mature organizations can get there after a period with closer guidance from leadership.

InfoQ: You mention that your PaaS is aiming to become a pierceable abstraction. Could you share an example of what that looks like within the platform?

Navarro: I took that term from Will Larson, who has a great article on the topic.

One example is our engineering metrics system (Ledger), which I mention in the article. Ledger crunches metrics like deploy frequency, code review time, etc. by consuming from an event bus that collects everything that happens in our ecosystem (e.g. builds, deploys, pull requests, etc.). Our PaaS includes default tools for those functions (e.g. GitHub, Travis, Spinnaker, our Kubernetes cluster, etc.).

Any team that uses the standard toolset gets their metrics generated out of the box. But some other teams want or must use different ones (e.g. Jenkins or Bitbucket instead of Travis, or their own Kubernetes clusters). We designed Ledger to accommodate these cases without compromising the strategy of an opinionated PaaS. An enabling factor for this was defining canonical event schemas for "a build", "a deploy", "a change", and "a change in trunk" (a simplified sketch follows the list below):

  • Events coming from the standard systems (e.g. Travis, Spinnaker, GitHub, etc.) get automatically normalized by us to a canonical event type.
  • Teams are given a simple tool that they can integrate in their own systems (e.g. in Jenkins or Bitbucket) and publish canonical events into our event bus.
  • Ledger computes engineering metrics based on those canonical events only. This means that even though Ledger is part of our PaaS, it remains agnostic to the actual source of events.
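
To illustrate the canonical-event idea, here is a simplified sketch of a canonical "build" event and two normalizers. The field names and payload shapes are assumptions made for the example, not Ledger's actual schema or the real Travis/Jenkins payload formats.

    # Illustrative sketch: CI-specific payloads are normalized into one canonical
    # schema, and the metrics system only ever sees that schema.
    from dataclasses import dataclass
    from datetime import datetime, timezone

    @dataclass
    class BuildEvent:
        """Canonical "a build happened" event, whatever CI system produced it."""
        repo: str
        commit: str
        status: str            # "success" or "failure"
        finished_at: datetime
        source: str            # e.g. "travis", "jenkins"; kept for debugging only

    def normalize_travis(payload: dict) -> BuildEvent:
        """Map a simplified Travis-style payload to the canonical event."""
        return BuildEvent(
            repo=payload["repository"]["slug"],
            commit=payload["commit"],
            status="success" if payload["state"] == "passed" else "failure",
            finished_at=datetime.fromtimestamp(payload["finished_at"], tz=timezone.utc),
            source="travis",
        )

    def normalize_jenkins(payload: dict) -> BuildEvent:
        """Teams running Jenkins publish the same canonical event themselves."""
        return BuildEvent(
            repo=payload["job_name"],
            commit=payload["git_commit"],
            status="success" if payload["result"] == "SUCCESS" else "failure",
            finished_at=datetime.fromtimestamp(payload["timestamp"], tz=timezone.utc),
            source="jenkins",
        )

Because the metrics pipeline only consumes the canonical type, the default toolchain and team-specific toolchains end up feeding the exact same computations.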

We believe this makes a good compromise:

  • Product teams get the final choice on which tools they use, in exchange for managing more complexity.
  • They still benefit from parts of our tooling that make sense for them, without committing to the full bundle.
  • We diversify the ways in which we can generate impact, mitigating the negative consequences of divergence.

InfoQ: In your article you indicate that "One of the strategies we use is to spot where we can introduce tools that generate small impacts to a wide surface." How is the platform team empowered and enabled to spot these situations more easily?

Navarro: By giving them clear goals and autonomy to achieve them. We're seldom told "build this tool", but rather "power up product teams", and it's expected that we'll walk up and down the organization to understand what challenges product teams have and which are worth solving. This empowerment also takes the form of enriching our teams with roles other than infrastructure engineers: product management, UX, etc.

On our side, this is yet another reason why we put a lot of focus on instrumenting the engineering ecosystem. These insights let us analyze pain points, potential impact, etc. before we actually do any work.

InfoQ: You share that your North Star is "successful deployments per week". Are there other key indicators that you track to ensure you are delivering the features and services that your users want?

Navarro: Yes, quite a few. We created a metrics tree that has the North Star as its root: we decompose it into several factors and build the lower levels of the tree from them. Successful deployments per week can be decomposed like this:

   successful deploys per week = number of active repos * avg number of deploys per repo per week * success rate

So those three metrics are in fact the next level in the tree, with their corresponding subtrees.
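
For illustration, with made-up numbers (assumptions for the example, not Adevinta's actual figures) the decomposition works out like this:

    # Illustrative only: invented numbers to show how the North Star decomposes.
    active_repos = 400             # repos that deployed at least once this week
    avg_deploys_per_repo = 3.0     # average deploys per active repo per week
    success_rate = 0.9             # fraction of deploys that succeed

    successful_deploys_per_week = active_repos * avg_deploys_per_repo * success_rate
    print(successful_deploys_per_week)  # 1080.0

Improving any one factor (more repos onboarded, more frequent deploys, fewer failed deploys) moves the North Star, which is what makes the decomposition useful for setting goals.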

The first one, "number of active repos", is a measure of adoption. We decomposed this subtree into other metrics that influence adoption of our PaaS. Examples are build times (fast is good), rate of successful onboardings (high is good), and sessions in our UIs (high is good).

The "avg number of deploys per week" tracks efficiency. What influences this metric?  Duration of code reviews (short is good), size of pull requests (small is good), time to deploy to PRO (short is good), etc.

Success rate is the least mature subtree because the definition of a "failed deploy" varies across teams (some do an actual rollback, others a roll-forward, etc.). We're focusing less here for the time being, but our intuition is that it will be influenced by things like code coverage, deployment frequency, release size, etc.
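
Written out as nested data, the tree described above looks roughly like this; the groupings follow the answer, while the representation itself is just an illustrative sketch:

    # A sketch of the metrics tree described above, written as nested data.
    metrics_tree = {
        "successful deploys per week": {                   # North Star (root)
            "number of active repos": [                    # adoption
                "build times (fast is good)",
                "rate of successful onboardings (high is good)",
                "sessions in our UIs (high is good)",
            ],
            "avg number of deploys per repo per week": [   # efficiency
                "duration of code reviews (short is good)",
                "size of pull requests (small is good)",
                "time to deploy to PRO (short is good)",
            ],
            "success rate": [                              # least mature subtree
                "code coverage",
                "deployment frequency",
                "release size",
            ],
        },
    }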

The metrics mentioned here are actually tracked by our teams, and regularly used to define OKRs.

InfoQ: What advice do you have for a team just starting out on building a PaaS for their organization?

Navarro: First, have a clear strategy (this is to a large degree given, not chosen) and situational awareness, both internal (how do other teams feel about you? do you have political capital?) and external (what does the industry look like in your scope?).

Second, get very close to the teams that you support: embed in their daily work, share their processes and tools, understand their roadmaps and goals, etc. You're there to make *them* successful. Try to spot areas of intervention where you can maximize impact for as many teams as possible, in the least possible time (by this I mean weeks, not quarters). If there are no clear ones, focus on solving the needs of a team with a critical project. Give them credit for the success, and they will start acting as your ambassadors. Then export the (now successful, proven) solutions you built there to other teams. In other words, earn trust by solving real problems. Then use trust as leverage to steer the organization.

About the Interviewee

Galo Navarro is a software engineer with a background in backend development, distributed systems, and continuous delivery. He recently joined New Relic as a Lead Software Engineer. Prior to that he built a PaaS for 1,500 engineers at Adevinta, virtual network software at Midokura, and some of the services that powered social networks like Tuenti and Last.fm.
