
Highlighting Silicon Valley Strategies for Improving Engineering Velocity, Efficiency, and Quality


Summary

David Mercurio shares personal insights and experiences about cultural practices that one can apply to help improve the effectiveness of an engineering organization.

Bio

David Mercurio is a Software Engineer at Stripe working on Financial Infrastructure, focused on the system that runs the clearing process for credit card transactions. Previously, he worked as a Software Engineer at Snapchat on the Memories product, and at Facebook on Personalized Videos, Developer Platform, and Infrastructure.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Mercurio: This talk is about how cultural strategies can be used to improve engineering velocity, efficiency, and quality - which is a bit of a long title, and I didn't realize that until I tried to put it all on one slide. I'm currently a software engineer at Stripe - if you're not familiar with that, I'll talk a little bit about it later - and I work on payments infrastructure. In the past, I've worked at Snapchat, mostly on the Memories backend, and I spent almost five years working at Facebook on various projects, including infrastructure, the platform team, and personalized videos, which you may have seen before.

My goal for this talk is basically to give you an engineer's perspective on some common themes that these companies, and other companies in Silicon Valley, do well - cultural practices they tend to follow that influence the quality of their product at the end of the day. As an engineer, this will be a little more technically focused, but I think that's a good thing. I think culture is more than just free lunches and ping pong tables. At Stripe we have no ping pong tables. This is about how we can actually improve our engineering through culture.

Efficiency Is Leverage

As an engineer, I write code for most of the day and I'm not used to talking to groups this large, so bear with me. The talk is divided into three sections. For each section, I have one of Stripe's core values posted, and this one is called Efficiency Is Leverage. Basically, this is the holy grail of engineering. We're all strapped for time and we'd all like to be able to get a lot done with just a little. I'm going to go through examples of projects at each of these companies where I thought they took advantage of this very well, having embedded into their culture a way of thinking about and approaching problems.

The first is at Facebook. If you rewind back to around 2010, which is when I joined Facebook full-time, they were experiencing a few repeating patterns of bugs - privacy bugs where some content from one user would unintentionally be displayed to another user, which makes that user very unhappy. If you think of Facebook as a social graph, you are a node in this graph and the edges connect things in the graph, and both the nodes and the edges have types that mean something. You may be a user type. I'm in the center there and I'm connected to my friend, who is also a user. The edge that connects us represents the friendship. My friend goes on a trip and he uploads a photo. Now, in order to perform a privacy check, the question is: should I be allowed to see this photo? Basically, should we allow for this connection in the graph?

At this time, Facebook already had a way to do efficient data fetching, where they basically had an automatic way to combine fetches to minimize the number of round trips between your web server and your storage tier. This worked quite well, but the problem they were encountering was that every object in this graph can define its own privacy rules - it's basically like an ACL for every single object. You might choose to upload a photo that all your friends can see, but then maybe this is a photo of you at the beach and you don't want your boss to see it because you were supposed to be at work. All your friends can see it except for your boss, so you could choose to hide it from them.

This makes checking privacy a little bit complicated. The way that it used to work was that you would basically define some sort of privacy check for your node type. The photo type would have some set of privacy rules. You would load a bunch of data, you do this in an efficient manner, but the data loading is not aware of privacy. It just loads all of the data and then the data gets returned to the webserver. Now, you must filter any data that fails these checks, which I think at the time were called canSee checks.
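
To make that flow concrete, here is a toy sketch of the older pattern - all names and data are invented for illustration. The fetch layer knows nothing about privacy, so every call site has to remember the filtering step:

```python
# Toy sketch of the old "load, then filter" pattern; names and data are invented.
PHOTOS = {
    "alice": [
        {"id": 1, "owner": "alice", "hidden_from": {"boss"}},
        {"id": 2, "owner": "alice", "hidden_from": set()},
    ],
}

def load_photos_for_user(owner):
    # Efficient, batched, but completely privacy-unaware (steps 1 and 2).
    return PHOTOS.get(owner, [])

def can_see(viewer, photo):
    # Stand-in for the per-type "canSee" check mentioned above.
    return viewer == photo["owner"] or viewer not in photo["hidden_from"]

def photos_page(viewer, owner):
    photos = load_photos_for_user(owner)
    # Step 3: the filtering step that is easy to forget - and that's the bug.
    return [p for p in photos if can_see(viewer, p)]

print(photos_page("boss", "alice"))   # only photo 2 comes back
```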

The problem here is, if you forget to do step three, or if the code is arranged in such a way that a change you made in one section has unintended consequences in another section, you release it, it's a bug, it makes the news, and everybody's angry. This happened a few times, and I think it would be natural for a company to say, "This is happening a lot. Everybody go look at your code and make sure there are no privacy errors in it," which is not really going to work, and it doesn't do anything to prevent future problems like this.

Facebook decided to centralize the effort and fix this with a single solution that could solve the problem for everybody. They basically consolidated all the privacy checking logic by making the social graph privacy-aware. The implementation of this was one of my favorite pieces of code that I've seen. Basically, every node type will define a privacy policy, which is like a class. This class gets auto-generated when you create a new node type and it builds the template for you that you fill out. You define a list of privacy rules that will exist for this privacy policy, and you can think of them as being executed in order, one after another. At each point, you can decide to allow, deny, or just skip on to the next one.

Now, when we're loading data, we pass the viewer along everywhere we load data. We are basically coupling data fetching and the viewer - they always happen together; you always load data in the context of a specific viewer. If the viewer can't see that object, it never comes back, so there's no way of leaking the information. This was very effective and, I believe, pretty much completely solved the issue they kept running into. Every time you load data, you pass the viewer around everywhere. You do not construct this viewer object - it's given to you in the request and then you pass it along everywhere else you need it. It's threaded through the entire code base.

This isn't real code, but this is what it would look like - I believe it captures the idea. You have some class which defines a list of privacy rules. The names of these rules are supposed to be very descriptive, so the first one is allow if the viewer is the owner: if this is your object, you should just always be allowed to see it. Then they go through the list - you might deny if the viewer is blocked or if they're in the hidden list, like I mentioned earlier with your boss, or there's some default at the end. Maybe if you're friends with this person, then you can see it as long as you weren't previously excluded. Then there's some final rule, which is just a catch-all. You probably just want to deny in that particular case.
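
Here is a rough sketch of that shape in Python - not Facebook's actual code or language, and with all class, rule, and field names invented for illustration:

```python
# Rough, invented sketch of a viewer-aware privacy policy. Rule, class, and
# field names are illustrative only - this is not Facebook's actual code.
from dataclasses import dataclass, field
from enum import Enum


class Verdict(Enum):
    ALLOW = "allow"
    DENY = "deny"
    SKIP = "skip"          # no opinion; fall through to the next rule


@dataclass
class Viewer:
    user_id: str
    friend_ids: set = field(default_factory=set)
    blocked_by: set = field(default_factory=set)   # owners who have blocked this viewer


class PhotoPrivacyPolicy:
    """Rules run in order; the first ALLOW or DENY wins."""

    def rules(self):
        return [
            self.allow_if_viewer_is_owner,
            self.deny_if_viewer_is_blocked,
            self.deny_if_viewer_is_in_hidden_list,
            self.allow_if_viewer_is_friend_of_owner,
            self.deny_by_default,                    # final catch-all
        ]

    def allow_if_viewer_is_owner(self, viewer, photo):
        return Verdict.ALLOW if photo["owner_id"] == viewer.user_id else Verdict.SKIP

    def deny_if_viewer_is_blocked(self, viewer, photo):
        return Verdict.DENY if photo["owner_id"] in viewer.blocked_by else Verdict.SKIP

    def deny_if_viewer_is_in_hidden_list(self, viewer, photo):
        return Verdict.DENY if viewer.user_id in photo["hidden_from"] else Verdict.SKIP

    def allow_if_viewer_is_friend_of_owner(self, viewer, photo):
        return Verdict.ALLOW if photo["owner_id"] in viewer.friend_ids else Verdict.SKIP

    def deny_by_default(self, viewer, photo):
        return Verdict.DENY


def can_see(viewer, photo, policy):
    for rule in policy.rules():
        verdict = rule(viewer, photo)
        if verdict is not Verdict.SKIP:
            return verdict is Verdict.ALLOW
    return False


photo = {"owner_id": "friend", "hidden_from": {"boss"}}
print(can_see(Viewer("me", friend_ids={"friend"}), photo, PhotoPrivacyPolicy()))    # True
print(can_see(Viewer("boss", friend_ids={"friend"}), photo, PhotoPrivacyPolicy()))  # False
```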

The beauty of this is that it's very easy to unit test, and there were some cool advantages to centralizing everyone on a single solution, because now, given some data, you can determine that certain rules may be costly to execute but don't actually affect the decision very often. When you are actually performing these privacy checks, you could potentially leave those rules until the end - you're rearranging their execution. As long as you do that in a way that's logically equivalent, you could possibly make these privacy checks more performant. This is one small improvement to the piece of code that executes these rules, but it has vast consequences for improvements all over Facebook.
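
As one purely illustrative example of a rearrangement that stays logically equivalent - my own sketch, not how Facebook actually implements it - consecutive rules that can only deny or skip commute with each other, so within such a run the cheap checks can safely be evaluated first:

```python
# My own toy illustration of a safe reordering, not Facebook's implementation.
# Consecutive rules that can only DENY or SKIP commute with each other, so
# within such a run the cheapest can be evaluated first without changing the
# decision for any input. Costs would come from production measurements.
from dataclasses import dataclass


@dataclass
class RuleInfo:
    name: str
    verdicts: frozenset      # verdicts the rule can return besides SKIP
    avg_cost_ms: float


def reorder_for_cost(rules):
    reordered, deny_run = [], []

    def flush():
        # Order within a deny-only run doesn't affect the outcome, so put the
        # cheap checks first and hope they short-circuit the expensive ones.
        reordered.extend(sorted(deny_run, key=lambda r: r.avg_cost_ms))
        deny_run.clear()

    for rule in rules:
        if rule.verdicts == frozenset({"DENY"}):
            deny_run.append(rule)
        else:
            flush()
            reordered.append(rule)
    flush()
    return reordered


rules = [
    RuleInfo("allow_if_viewer_is_owner", frozenset({"ALLOW"}), 0.1),
    RuleInfo("deny_if_viewer_is_blocked", frozenset({"DENY"}), 5.0),   # e.g. needs an extra lookup
    RuleInfo("deny_if_viewer_is_in_hidden_list", frozenset({"DENY"}), 0.2),
    RuleInfo("allow_if_viewer_is_friend_of_owner", frozenset({"ALLOW"}), 1.0),
    RuleInfo("deny_by_default", frozenset({"DENY"}), 0.0),
]
print([r.name for r in reorder_for_cost(rules)])
# The two adjacent deny-only rules swap so the cheap hidden-list check runs
# first; everything else stays where it was.
```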

The other beauty of this was that, even though it was created probably five years before I went to use it for the personalized video project, it fit that model very well. I worked on a team that was auto-generating videos using content that you or your friends share. You might have seen these appear in your News Feed. How we want this to work is, let's say those three photos are going to appear in the video; we only want to allow someone to see that video if they can also see those three photos. Otherwise, we're leaking some private information. The way that we model this is simply to connect all of these in the social graph with an edge to the video object. These are all nodes in the social graph that are now connected. Then, we write some new privacy rule that says something like: allow a viewer to see the video if they can see all the content in the video. We recursively do the privacy checks on each of those pieces of content, and this is done in an efficient manner. Then, that determines whether or not you can actually see the video.
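
A sketch of that video rule, reusing the same viewer-scoped-fetch idea - `SocialGraph` here is a minimal invented stand-in, not the real graph API, and a real version would batch these checks efficiently:

```python
# Invented stand-in for a viewer-scoped graph; the real thing would batch
# these checks rather than using in-memory dicts.
from typing import Optional


class SocialGraph:
    def __init__(self, contains_edges, visibility):
        self._contains = contains_edges    # video_id -> [content node ids]
        self._visibility = visibility      # (viewer_id, node_id) -> bool

    def contents_of(self, video_id):
        return self._contains.get(video_id, [])

    def fetch(self, viewer_id, node_id) -> Optional[str]:
        # Viewer-scoped fetch: anything the viewer may not see never comes back.
        return node_id if self._visibility.get((viewer_id, node_id)) else None


def allow_if_viewer_can_see_all_contents(viewer_id, video_id, graph):
    # The video is visible only if every connected photo/clip is visible.
    return all(
        graph.fetch(viewer_id, content_id) is not None
        for content_id in graph.contents_of(video_id)
    )


graph = SocialGraph(
    contains_edges={"video1": ["photo1", "photo2", "photo3"]},
    visibility={("me", "photo1"): True, ("me", "photo2"): True, ("me", "photo3"): False},
)
print(allow_if_viewer_can_see_all_contents("me", "video1", graph))   # False
```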

The framework was built in such a way that it allowed a product team like mine to execute a lot more quickly than if we had to solve this problem on our own. It was general enough to accommodate a product that wouldn't exist for another five years. I just want to bring this back: this is not just a technical solution. I think the key thing that's important here is actually the way of approaching these problems. I believe that this was baked deeply into Facebook's culture and still is.

Snapchat Example

The next example comes from Snapchat; here's just a little bit of context. Snapchat went from zero users to a couple of hundred million daily users in the course of just a couple of years - pretty rapid growth. These users tend to consume rich media - videos, photos, AR filters - very expensive stuff that is very strenuous on infrastructure. As you all know, it's difficult to hire enough engineers, and it's difficult to do it that quickly. Snapchat does what most companies do these days and leverages public cloud services. This was working well. I was on the Memories team here - Memories is how you save your snaps for later. The way this essentially worked was, you send an initial request for metadata, and the response tells you how to lay out your UI, which of these are videos versus images, and also includes a bunch of URLs that are signed, which I'll go into in a sec.

Basically, the URLs define where you should download the thumbnails and the media and all that stuff, and possibly also provide some URLs for uploading. When you create content, you want to know where you should upload it. In general, you will probably upload and download directly to your storage tier - something like Google Cloud Storage or Amazon S3. You could also proxy it through the web server, and there are certain times you would do each of those two things, depending on your requirements. This works well.

The problem is that at this scale, this can get very expensive but, for now, this was how it worked. We use signed URLs here, which are sent back from the server, and they look something like that. Basically, this provides a little bit of additional security: if you're downloading, there'll be an expiration so that if someone managed to get this URL later, they wouldn't be able to access the content. When you're uploading, you can enforce certain things like the size of the content and the hash of the content, to make sure you're actually getting what you intended to get. There are some other headers that you might provide.
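
To illustrate the idea only - real providers like S3 and GCS each have their own signature formats, so this is just a toy HMAC scheme with invented parameter names:

```python
# Toy HMAC-based signing sketch; real providers (S3, GCS) each have their own
# signature formats, and SECRET_KEY plus the parameter names are invented.
import hashlib
import hmac
import time
from urllib.parse import urlencode

SECRET_KEY = b"server-side-secret"   # never shipped to the client


def sign_url(base_url, object_path, ttl_seconds=300, required_headers=None):
    required_headers = required_headers or {}
    expires = int(time.time()) + ttl_seconds
    # The signature covers the path, the expiry, and any required headers
    # (like content length or hash), so none of them can be tampered with.
    canonical = "\n".join(
        [object_path, str(expires)]
        + [f"{k.lower()}:{v}" for k, v in sorted(required_headers.items())]
    )
    signature = hmac.new(SECRET_KEY, canonical.encode(), hashlib.sha256).hexdigest()
    query = urlencode({"expires": expires, "signature": signature})
    return f"{base_url}{object_path}?{query}"


url = sign_url(
    "https://storage.example.com", "/bucket/media/abc123",
    required_headers={"Content-MD5": "1B2M2Y8AsgTpgAmY7PhCfg=="},
)
print(url)
```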

The naive solution would be to have the client hard-code these headers and pass them along when it's uploading, and that'll work for your first pass but, as I mentioned, this can get pretty expensive. Snapchat's solution to that was to use multiple providers and split their usage between them to help prevent their costs from going up. If you are bought in too much to one provider, the cost tends to go up because you lose your leverage.

It would look something like this. You now want your client to be dumb - you don't want it to actually know where it's going, because if you want to add a third provider here, you don't want to have to update your client. You want this client to be future-proof. There are a few very subtle things that you need to do here. One is that now, maybe you have your web server instruct the client on which headers it should use. This way, if there are specific headers that apply to one service but not another, you can pass them along and the client doesn't have to care what it's doing. It just blindly applies the headers that it's passed and uploads to the URL it's told to upload to. The reason you need to do this is that when you have a signed URL, the signature actually includes the headers in it as well. They have to match in order for the upload to work.
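
The metadata response might carry something like the shape below - all field and header names here are invented for illustration; the only point is that the URL and the exact headers to send both come from the server, so the client stays dumb:

```python
# Hypothetical shape of an upload instruction returned by the metadata service.
upload_instruction = {
    "media_id": "abc123",
    "method": "PUT",
    # Signed URL pointing at whichever provider the server chose.
    "upload_url": "https://storage.example.com/bucket/abc123?expires=1570000000&signature=...",
    # These must exactly match what was covered by the signature server-side,
    # so the client just applies them blindly.
    "headers": {
        "Content-Type": "video/mp4",
        "Content-MD5": "1B2M2Y8AsgTpgAmY7PhCfg==",
        "X-Example-Content-Length-Range": "0,52428800",   # provider-specific placeholder
    },
}
```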

You should abstract all of this away, both from the client and from server-side developers who are working on code that's running on the web servers. They should never have to care what the underlying service is. They just treat this like some storage service. Behind the scenes, there's some decision making going on that decides which service should be used, depending on whatever criteria it cares about. Things that you might care about for downloading, for example, are where is the object stored? Maybe the object lives in both service providers, maybe it lives only in one, and that decision is made for you.

Also, other factors that are more product-like: do you want to stream this video, or is it small enough that you should just download the whole thing? What is your connection quality like? Maybe you want videos at different resolutions - if you have a poor connection, you want the lower-resolution one. Similarly, when you're uploading, you have some decisions to make. Do you want this upload to be continuable? That's useful for large videos - for example, you can pause the upload and continue it again later, which is useful if your users only want to upload on Wi-Fi and not on cell data.

The user location matters too. These service providers tend to have their POPs in different places, so depending on the location of the user, you might prefer one service provider over another for performance reasons. Also, the cool advantage here is that your uptime can actually be better than any of those services' uptimes, because if one of them is down, you just automatically switch over to the other one and your users experience no downtime at all. This way of consolidating on a single solution allows you to make improvements in that abstraction that apply to every product team that might want to build on it.
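
A minimal sketch of what that abstraction layer could look like - `BlobStore`, `BlobProvider`, and the selection logic are all invented names; a real version would also weigh user region, object size, cost, and provider-specific features:

```python
# Sketch of the storage abstraction described above; all names are invented.
from abc import ABC, abstractmethod


class BlobProvider(ABC):
    @abstractmethod
    def healthy(self) -> bool: ...

    @abstractmethod
    def signed_download_url(self, object_path: str) -> str: ...


class BlobStore:
    def __init__(self, providers, locations):
        self.providers = providers      # name -> BlobProvider
        self.locations = locations      # object_path -> preferred provider names

    def download_url(self, object_path):
        names = self.locations.get(object_path, list(self.providers))
        # Prefer the first healthy provider that holds the object; if it is
        # down, fail over transparently so callers and users never notice.
        for name in names:
            provider = self.providers[name]
            if provider.healthy():
                return provider.signed_download_url(object_path)
        raise RuntimeError(f"no healthy provider holds {object_path}")


class FakeProvider(BlobProvider):
    def __init__(self, name, up=True):
        self.name, self.up = name, up

    def healthy(self):
        return self.up

    def signed_download_url(self, object_path):
        return f"https://{self.name}.example.com{object_path}?signature=..."


store = BlobStore(
    providers={"gcs": FakeProvider("gcs", up=False), "s3": FakeProvider("s3")},
    locations={"/media/abc123": ["gcs", "s3"]},
)
print(store.download_url("/media/abc123"))   # fails over to the s3 provider
```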

Stripe

The next example comes from Stripe, where I currently work. Here's just a brief description of Stripe if you're not familiar with it. This is probably hard to read, but this sentence, I think, summarizes what Stripe does pretty well: "Stripe is the best software platform for running an internet business." We are not currently generally available in Brazil, but we have people actively working on that. If you browse the website, you'll see that you can sign up for updates so that when we become available, you'll be notified - hopefully, that'll happen pretty soon.

As I mentioned, Stripe deals with payments - we do a lot of payment volume. When you're dealing with card payments, there's something you have to adhere to called PCI compliance. I think one of the best things about using Stripe is that we take care of PCI compliance for our users; it's something you never have to worry about. It's kind of annoying, and it's something that we don't want to worry about much either. I'll get into that a little bit later, but PCI compliance basically has a set of overall guidelines. There's a more detailed description for each one, but most of these are pretty common-sense things that you would do as a good engineer anyway. They all revolve around protecting cardholder data. I'm just going to highlight a couple that are maybe things you would not normally do.

Number nine here says, "Restrict physical access to cardholder data." What this means is that if you have your own data center, what you probably have to do is have a locked section of the data center where the servers that hold card data exist and only certain people have physical access to those servers to ensure that nobody can go and steal a whole bunch of card information. Then every time somebody goes in and out of that section of the data center, the information needs to be logged so that you have an audit trail of everything that's going on.

In the cloud world, what this usually means is that - say you use AWS, for example - there may be certain services that AWS will not allow you to use because they are not PCI-compliant, there may be restrictions on which of their services you can use, or you may have to pay extra to have, for example, your own instance rather than a shared instance. That's not ideal.

There are also restrictions around who in your company can access the piece of the codebase that has access to cardholder data. You'll probably want some additional processes, authentication, and logging that you would otherwise not really need to have. These people must also have good business reasons for making any changes there. That's something you don't want broadly applied to all your code.

Stripe has an interesting solution to this. Imagine you have some online store and a customer of yours is making a purchase, so you need to send this card information to Stripe. Stripe has Elements embedded on your webpage that allow you to send us the information without your own systems ever actually having to touch that data. It goes directly to Stripe.

What we have is a thin layer around Stripe's main codebase called Apiori; everything inside Apiori is the PCI-compliant version of Stripe. This is a barrier between the API and the rest of the Stripe codebase. What Apiori does is take that credit card information, replace it with a token, and store the mapping from credit card to token in a secure, PCI-compliant storage system. It then passes that token along to Stripe's main codebase. Most of Stripe's code is written using tokens. We've taken the cardholder information out of the equation, so it's all protected in the PCI-compliant system. Now, we can move a lot faster on this section of the diagram than we could if we had to deal with the processes imposed by PCI compliance.
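
A toy version of that flow - nothing here is Stripe's actual implementation; the in-memory dict just stands in for the locked-down, audited PCI storage:

```python
# Toy tokenization sketch; not Stripe's actual implementation.
import secrets

_vault = {}   # token -> card number; stands in for the audited PCI-compliant store


def tokenize(card_number: str) -> str:
    token = "tok_" + secrets.token_hex(12)
    _vault[token] = card_number
    return token                     # everything downstream sees only this


def detokenize(token: str) -> str:
    # Only the narrow, audited edge that talks to card networks or partners
    # should ever be allowed to call this.
    return _vault[token]


# The main codebase only ever handles the token.
charge = {"amount": 1999, "currency": "usd", "source": tokenize("4242424242424242")}
print(charge["source"])              # e.g. tok_9f2c4b...
```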

We can also send the token back to the user, so the user doesn't have to worry about PCI compliance because they're only working with the token. That's a great advantage of using Stripe. Then if we need to, we can direct Apiori to send the card information back outward to a partner, if for some reason that's actually required. This allows us to move very quickly in this section. Also, by pushing all the PCI-compliance stuff into one area, we can make certain changes to the system - Apiori is written in Go because of the performance benefits that we get from that. It's quite a simple piece of code, whereas the rest of Stripe's codebase is written in Ruby, and maybe could be a little more complicated, but doesn't have the strict performance requirements that you might want in Apiori.

The pattern that I'm trying to describe is more than just technical solutions. It's a way of approaching problems with the belief that problems can and should be solved with a bottom-up, foundational approach, because of the benefits that provides to every additional layer above it.

Trust and Amplify: Develop One Another in the Short and Long Term

In the second section I'll mention another Stripe value, called Trust and Amplify. This is about developing one another in the short and long term. Basically, the point is that improved teamwork helps us achieve our goals better. Two patterns I saw at Facebook and Stripe were related to onboarding and rotational programs. Facebook has talked about their Bootcamp program a lot. Bootcamp is the, I believe, six-week training program that incoming engineers go through when they first join the company. One of the goals is to commit code on your first day, and they're generally taking low-urgency bug fixes or small features from various teams all across the company, and learning the Facebook codebase that way. They will generally have a mentor who will help them review code and integrate into the environment.

I like how this sets a very early foundation for the things that Facebook's culture values like prioritizing productivity. I'm sure you've heard of the whole move fast mentality and it starts right on day one. Also, shared ownership - right from day one, you are looking at code from across all the teams and you have the permission and the responsibility to improve that. This continues throughout your career. Rather than just expecting someone else to fix their bugs or to improve their own products, that is something that you are empowered and encouraged to do the whole time you're there.

This is also used for team selection, to make sure that you have chosen a team working on something you're very interested in. There's also Hackamonth. I'm sure you've heard of hackathons; Hackamonth is like a month-long version of that where you fully disengage from your regular team and switch to work on a new team for a month. I was part of Facebook's first Hackamonth experiment to see how this would go. I liked this a little bit better than the 20% time concept that was popularized by Google. A lot of the criticism I've heard about that was that it essentially became 120% time because you could never disengage from your old team - you were basically working on two projects - but Hackamonth allows you to fully step away.

An advantage of that is bi-directional knowledge sharing. Not only are you learning new stuff while you're working on this new project, but your team has to learn all the stuff that you knew, because you just stepped away from it. I'm not sure if you're familiar with the bus factor - it's a little bit of a morbid way of describing how the company could continue if you got hit by a bus on your way to work. This helps other people gain the knowledge that maybe only you have in your head. I believe this may also help with retention, because if you have an employee who is bored with what they're doing, rather than looking elsewhere outside the company, they can find these opportunities inside the company.

Stripe is pretty similar here. We're trying to gain efficiency by consolidating onboarding programs. The one at Stripe is called /dev/start - this comes from Stripe's original name being Dev Payments before they renamed to Stripe. I like Stripe a lot better. There are slight differences here: these are people working on impactful but lower-urgency, well-scoped group projects. You might have a team of four or something like that. You get a second set of peers - people coming from different places and then joining different teams. I now have this group of friends I was working on a project with when I first started at Stripe, and we still get together occasionally for lunch and stuff like that.

This also provides - and this is true with Bootcamp as well - mentorship opportunities for people who may be looking for them. This one could also be a bottleneck, because you have to make sure you value the fact that you are making an investment now. There are upfront costs to mentorship, but it pays dividends in the future when these new employees are able to integrate into the company better. We also have rotations; there's someone rotating on my team right now, and it's been very useful to have this person bring their knowledge from the infrastructure team they were working on onto the payments team that I'm on.

The same benefits apply here - also retention. This one is also useful if you're opening a new office in a new location. It allows you to transfer not just knowledge but also culture, by allowing people to rotate temporarily or permanently to a new office location. We have done this in each of our offices in Seattle, Dublin, and Singapore, and in our recently announced remote hub, where a group of engineers are fully remote. Hopefully, we'll have an office here soon, because then I can come back.

Our leadership team does rotations too. I believe they try to do them on a yearly or maybe every six-month basis. Patrick Collison, our founder, did a rotation on my team for a bit and David Singleton, our CTO, recently did a rotation and was tweeting about how he liked the improvements in dev tools. I think it exposes them to a lot of the things that engineers are dealing with on a day-to-day basis that they may not experience from where they typically are working.

We Haven’t Won Yet: Identifying and Resolving Unaddressed Risks

We Haven't Won Yet is another one of Stripe's values. The main purpose of this one is to explain that most of Stripe's opportunity and all the hard problems are still ahead of it, but some people, like me, tend to use it in a bit of a sarcastic way. When some code is not quite right we say, "Well, we haven't won yet." I like the philosophy that all of these companies have around what we call incidents - they go by different names at different companies. Basically, this is when something has gone wrong. The consequences could be bad, so we need to fix it right away.

We have this concept, this tool called the Big Red Button. The point of this is that you may be unsure, but you realize something doesn't look quite right and you say, "Uh oh, I need to hit the button." You hit this button, which is an internal tool, and it brings you to a form that is supposed to be a very frictionless way of alerting people that there's something going on.

This form autogenerates a name for you, which is really useful because rather than referring to this in discussion as incident one, two, three, or whatever number you have, these names are very easy to remember. I still remember all the incidents that I was involved in - there aren't that many. It's also useful for searching: I might put comments in code or GitHub pull requests that refer to past incidents, and it's very easy to search for the context of these. Basically, you fill out this form - it's meant to be very easy to fill out - with a brief summary of what's going on, a severity (we have guidelines on this), and whether or not this is affecting users. If it's "I just broke the master codebase," that's not affecting users, but maybe it's a breakage in the API, and in that case, it is affecting users. We use that to determine whether we need to make public communications - we might put a tweet out that says we're having problems.

There's an Incident PM. The Incident PM is a short rotation, 24 hours or something like that, where people will get paged if that checkbox is selected and they will jump in to help. They're not necessarily going to help fix the bug, but they're coordinating. You might say, "I know that this bug caused some charges to fail in the API, but I don't know how to see which merchants were affected." They will help find the person who can run that query and find the affected merchants, and then we can proactively communicate with them.

When this is submitted, we auto-generate a whole bunch of stuff. In the moment, it's very stressful - you don't want to have to think about all the people you need to notify and all the steps that you need to take. These are all things that we can automate with tooling and that should just happen. We automatically create a Slack channel and consolidate all the communication in there. We create a Google doc where we'll have a detailed description of the whole incident, the specific timeline of when everything happened, and any ideas we have for things we could've done to prevent it. Possibly, you will have a user communications channel where people make sure that we're communicating with users properly, and only once, and stuff like that.
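
A sketch of the kind of glue that form submission could trigger - the integrations here are placeholders for whatever chat, docs, paging, and status-page systems a company has, not real APIs and not Stripe's actual tooling:

```python
# Sketch of incident-kickoff automation; all integrations are stubbed.
class StubIntegrations:
    """Placeholders for the real chat/docs/paging/status integrations."""

    def create_slack_channel(self, name):
        print(f"created channel #{name}")
        return f"#{name}"

    def create_incident_doc(self, title, summary):
        print(f"created doc '{title}' with summary: {summary}")
        return title

    def page_incident_pm(self, name, severity):
        print(f"paged incident PM for {name} ({severity})")

    def post_status_update(self, name, summary):
        print(f"posted public status update for {name}")


def kick_off_incident(name, summary, severity, affects_users, integrations):
    # Everything the submitter shouldn't have to think about in the moment.
    channel = integrations.create_slack_channel(f"incident-{name}")
    doc = integrations.create_incident_doc(title=f"Incident {name}", summary=summary)
    if affects_users:
        integrations.page_incident_pm(name, severity)    # 24-hour coordination rotation
        integrations.post_status_update(name, summary)   # e.g. status page or tweet
    return {"channel": channel, "doc": doc}


kick_off_incident("quiet-lobster", "API charges failing", "sev1", True, StubIntegrations())
```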

Afterward, after the fire is out or at least dampened, we go through a review process. One key point of this review process is that it's blameless - we're not looking to say that you created a bug and you broke the API. We understand that human errors can occur, and if a human could possibly take down the API, then that is actually a problem with the system. The system should account for human errors and be able to prevent them from happening. We'll try to figure out what we can do - what are the remediations to prevent these from happening again? These are sometimes code changes, and sometimes they're not code changes, they're process improvements. My team has a new process that resulted from these discussions where we just communicate a little more, and more formally, about a specific topic to prevent things from falling through the cracks.

The people who attend these meetings will try to detect common issues that may be happening across teams. You might hear a pattern of problems like: we were able to fix the bug quickly, but it took too long to deploy. That's something that, if you're involved in one incident, you may not realize is a widespread problem. If it keeps coming up, then maybe we decide that's something we need to fix and we'll have people go fix it, so that for a future incident, we can fix and deploy as fast as possible.

You can also track these incidents over time. You want to know: are you getting better at preventing them, are you resolving them faster when they do happen, or are they increasing? If they keep happening at a high rate, then maybe we're moving a little too fast - maybe we have some foundational issues, and we should go pay down some technical debt for a bit and slow down product development. These are the kinds of decisions you can make from this higher vantage point that you couldn't if you were only involved in each incident individually.

These are also very useful for knowledge sharing. I attend these meetings pretty often just to listen in, even when I'm not involved in an incident. I find a lot of the information I hear useful, and I go back to my team and pass it along so that we can adopt some of these learnings proactively rather than reactively. That's about all I wanted to talk about. I believe these ideas are pretty baked into each of these companies and could be adopted at companies of various sizes. I have some references here - I believe these slide decks will be shared later, so you can click the links. There are some talks that my colleagues have done that cover some of these same topics in much more detail, and better than I can. You can check those out.

Questions and Answers

Participant 1: Could you tell us a little bit about how the development team is organized and structured at Stripe, and what the daily life of a developer at Stripe looks like as well?

Mercurio: Our engineering team is broken into different groups that are, I would say, pretty similar to what a lot of companies would have. We have an infrastructure team we call Foundation. There's a team that built the tool I was talking about earlier - it's called Observability - and they build metrics tools and stuff like that, a lot of internal developer-facing stuff. Then the product side is typically organized around the business requirements of the company. Stripe has half a dozen or so different products, and each of these will have their own engineering teams or engineering orgs. Then, depending on how big it is, there might be some sort of underlying middleware kind of infrastructure. I'm part of the Payments infrastructure group, but I work specifically on card payments, and there's a supporting team under us that supports both us and also non-card payments - some of the new payment methods that happen internationally, for example.

Day-to-day, we are generally working on projects that are scoped a little bit longer term - a quarter or half-year life cycle - and then broken up into smaller chunks, and we just try to iterate on those quickly. We generally, at least on my team, store our priorities in what we call an OSR, an ongoing stack rank. You're just constantly re-prioritizing because there's never enough time to do everything you need to do. You constantly reprioritize and move things around as you see necessary, as you gain new information.

I work mostly in our Ruby codebase. We also have, as I mentioned, a Go codebase, and there's a data codebase, which is mostly Scala. A lot of people are building user-facing tools like the Stripe Dashboard - the one you see if you're a merchant - and that's a lot of JavaScript. There's a big mix, and it all depends on exactly what problem you're solving.

Participant 2: You've mentioned some foundational cultural values like shared ownership, and I'd like you to give us some examples of how you encourage people to truly commit to this kind of attitude.

Mercurio: I think by example is the best way. When people see other people doing it and see that behavior rewarded, they will tend to do the same thing themselves. I think that the idea of doing hackathons and Hackamonths and rotational programs like that helps make this a norm, something that is just done. That's a way to initiate these, because people are busy and may forget that they should be doing this on a regular basis. It's a little bit of a spark to keep it going. If you have hackathons every few months, then you're kind of forcing people to go do something different and go work in someone else's codebase. Then, at the end of the day, all you have to do is put up the code review and have the other team look at the code. There's not a lot of coordination required beforehand for many of the things that people would do this for.

Participant 3: In a blameless environment, how do you fire people?

Mercurio: Luckily, I'm not an engineering manager, so I don't have to fire anyone. We do regular performance cycles and, generally, we encourage ongoing feedback from peers and between engineers and their manager. I think that if there is a performance problem and someone needs to either work on some skills to improve or exit the company, then it's not a surprise - people should be told the things they need to improve, know about them, and be able to act on them, and only in the rare scenarios where someone is really not able to fix a certain problem do they have to be terminated. Luckily, I don't have to deal with that on a regular basis.

 


 

Recorded at:

Oct 03, 2019
