An Insight Engineering Manager manages the group of engineers who make up the Insight Engineering Group at Netflix. Insight Engineering is a 10-person group right now: software developers who build real-time operational insight systems for Netflix.
Yeah. We're pretty aware of the fact that when we release something and we make some sort of repo on GitHub public, chances are people are going to look at it, and I think there's some degree of pressure to make sure that when we release something, it's good at least from a code perspective.
Internally, we've actually talked about open sourcing as a forcing function for doing code hygiene and for writing better code because frankly developers are sometimes highly motivated by trying to avoid shame.
I think the kindest way to describe it is highly organic. The people who make the decision are typically the people on the ground who need to figure out how to get something done. Just like every other decision at Netflix, generally speaking, we try to have the decision happen at the lowest possible level in the company.
So for example, three and a half years ago, when we decided to outsource notification for alerting, that was a decision that I, at the time just an individual contributor, made. I discussed it with my director, who actually wasn’t entirely supportive of it but didn’t get in the way. So quite often it happens at the lowest possible level.
It's very tempting to engage in hindsight bias, where you look at everything that doesn’t work out and say, “well, we should have known better at the time”. It's also very tempting to feel better about yourself by saying “we've never made any mistakes, even when we've changed our mind later”.
I think it's more complex than that. We've certainly made decisions sometimes to, for example, use another company's product and found that, especially at our scale, we've had cases where the external product couldn’t keep up with us and at some point we stopped and we started building our own product.
I would say even in those cases, in my experience, we got some benefit from seeing a polished product, even if it couldn’t keep up with us, as a way to inspire our own internal efforts. Perhaps it's the overly optimistic part of me, but I can't point to anything that feels like an obvious mistake, though certainly in many cases we've made a decision that later turned out to no longer be the right one for us.
How's that as a potentially political answer?
I don’t want to promote vendors so much but I'll make an exception here. Three and a half years ago when we were building our next generation alerting platform, we were clear that we wanted to keep telemetry in-house and maybe even alerting in-house. But when we looked at notification, we looked at PagerDuty as a notification as a service product. And when I talked to my manager about it, this was actually the decision of which we had some disagreement because he felt like, "Oh, we can probably just build it ourselves. How hard can it be?"
I argued that building a bad notification product is relatively easy and building a great one is not. And we ended up using PagerDuty. That was three and a half years ago and that relationship is still going strong. I'm nowhere near deciding that we should build our own product for that.
I think there's tremendous space for experimentation. It's not just something that's allowed but I would actually argue it's required. We don’t have a budget for this because I think budget essentially is some sort of arbitrary decision making from the top as to how much should be done.
Rather, we try to hire very smart engineers and let them figure out, for any given decision they need to make, is this something where I can follow the beaten path and implement it, or is this something I need to experiment with? If I need to experiment, how do I minimize the cost of that experimentation so that I know as quickly as possible whether or not this thing is going to work for us?
I think it might depend on who you ask. I was having a lot of conversations with our director at the time who felt strongly that we should build our own product. I would say that the reasoning that I heard was a certain degree of distrust in commercial products out there and other groups within Netflix who might have provided this product. So from his perspective, we had a bunch of really smart developers. If we are willing to invest the developer time to build it, clearly we could come up with something that's better for us.
Having said that, I will tell you that I thought we should use an open source product and got permission to essentially spend about maybe a month, six weeks, to investigate options in that space. And I actually came from that process believing we should actually build our own because I didn’t see anything that was going to be a great fit for us.
So at the time I think we made a decision that maybe felt like a relatively easy decision. We thought it was going to be a much smaller project than it ended up being. We might have engaged in more discussions if we had known how long it would take us to actually deliver this, but I don’t think the outcome would have been different. I think in the end it was the right decision for us.
We'll have a lot more load than we have today, that's for sure. For the first two years of building this platform, we moved a little slower than we expected, partially because we kept running into scaling concerns. The reason we kept running into scaling concerns is that it took us a while to notice we were increasing our volume of telemetry by about 100% every quarter. You can maybe even start predicting that and say, okay, so that means that essentially I'm going to increase by about what? 8x? Actually no, it's 2^4, so 16x every year, and that's just ridiculous.
It's ridiculous for two reasons. One, when you're increasing at that speed, you keep running into technical constraints and you spend most of your time not improving the product but solving scalability concerns, which is certainly very attractive work but not necessarily useful to our customers. And the other one is cost. Even at Netflix, which isn’t necessarily particularly cost sensitive, when you start running the biggest cloud ecosystem within Netflix and costing Netflix hundreds of thousands of dollars a week, people pay attention.
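To make that compounding concrete: 100% growth per quarter means doubling every quarter, which works out to 2^4 = 16x per year. Here's a minimal sketch of that math, with a purely hypothetical starting volume (only the growth rate comes from the discussion above):

```python
# Doubling every quarter compounds to 16x per year.
quarterly_growth = 2.0       # +100% per quarter, i.e. 2x
volume = 1_000_000           # hypothetical starting metrics volume

for quarter in range(1, 5):
    volume *= quarterly_growth
    print(f"after quarter {quarter}: {volume:,.0f} metrics")

# 2 ** 4 == 16, so a year of this growth means roughly 16x the volume
# (and roughly 16x the cost) you started with.
```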
So at some point, we actually sat down with the very prolific metrics producers within our environment and the good news is there are only about three different teams at Netflix who produce the vast majority of metrics. So the good news is if you're not one of those three teams, you can do anything you want. You don’t even have to be thoughtful about metrics because if you quadruple your metrics count, we're not even going to notice it. But those three teams need to be a lot more thoughtful.
And so we've had many more conversations with them. They are much more thoughtful about this. And we actually have a target for metrics growth: don't increase the metrics count from these systems faster than the business is growing. They've done a pretty decent job keeping to that target.
I think most of our stuff was built before 2012 when we started actually open sourcing our products. So obviously, anything that happened before 2012 we built originally without any thoughtfulness about OSS.
These days, generally speaking, most teams who start working on a new product probably have some idea about whether or not they're going to open source it. And generally speaking, frankly, the answer is yes. We're going to open source it because there's really no reason not to, other than the increased maintenance load. But given the fact that we don’t need permission to do this, and given that engineers generally seem to like supporting open source, it's generally going to happen.
Traditionally, I would say we've waited until it was really good and it had been working in production at Netflix for a while. That's certainly what we did with Atlas.
In some cases we haven’t waited. With some of the more recent development we've done, we've actually started off, from day one, with a completely non-functional product being open sourced; we did it much earlier than when it was actually ready.
So for example, with RxNetty, which is basically the reactive framework that some people at Netflix use, our developers would probably agree that RxNetty is incredibly powerful but not necessarily ready for wide scale adoption at Netflix. The vast majority of engineering teams have not adopted it. But it's open source. My team is one of the maybe two or three different teams at Netflix who use RxNetty in production. It's been a pretty good product for us.
I suspect most people at Netflix would argue that if you open source something you should not be thinking of this as some sort of read-only open source project. If we're open sourcing something, we're hoping that other people are going to use it and contribute to it. I think we're very happy when that happens.
As for the review process, this is where it gets kind of interesting. I think we definitely need to review code. In the end, it's our repository. We're the ones who own it and that means that we need to be responsible for every line of code that runs in this environment.
We can do a pretty decent job, I think, making sure, for example, that we don’t incorporate malicious code into our project. Where it gets interesting is if people want to incorporate features and capabilities into this product that we don’t have the ability to test, how do we do that?
If people wanted to figure out how to make Atlas run on the Google Compute Engine environment and proposed some code enhancements to make it happen, I think we'd be really interested in figuring out how we validate that those things actually work because we'd be interested in accepting them. But we don’t necessarily have our own way to test that. We haven’t yet figured out exactly how to work that balance.
We use a tremendous spectrum of open source products. I think frankly you'd be somewhat foolish if you're running a Linux environment and you're not using a whole lot of open source products. I think maybe the question is why not? We use Cassandra, we use Apache, we use Tomcat, we use Linux. That's sort of just the first few that came to the top of my head.
As for the review process, there's nothing particularly formal. When we have questions about a product, typically if it's not a very famous or very widely adopted product, I'm lucky enough to work with some really fantastic security engineers. The cloud security folks at Netflix are some of the very few security professionals out there I've actually enjoyed working with, and these guys are fantastic resources and fantastic consultants if we have questions around that.
I think from a microservices perspective, I suspect I may be a little too much of a hipster here in saying that we were doing microservices before they were cool. We didn’t call them microservices. We just basically had the natural reaction to running a monolithic stack, which was to say, "well, God, we should do this differently in the cloud." As a result, we've got about a thousand different services in the cloud, and there's no particular desire to try to change that trend.
Sometimes we find that people sort of notice “oh, hey, we built this thing that was originally a microservice, but now it's doing three distinct things. Well, maybe we should separate them.” They re-architect their platform to create three smaller microservices so to speak.
As for DevOps, I've got to tell you, we never used that phrase internally. When we moved from the model of IT supporting production to engineers both writing code and deploying it, running it in production, and waking up at two o'clock in the morning, it wasn’t in response to the DevOps movement. It was what we thought was a fundamental way to align the responsibilities and incentives of the release process with what developers should be thinking about.
I have a shameful admission which is that I never actually realized that DevSec was a term that anybody had used until I actually used it. The point that I was trying to make is we talk a lot about DevOps and when we talk about DevOps, largely what we're talking about is an alignment between operations people and developers. At Netflix we don’t really have a DevOps movement because we don’t really have operations people. So we've basically localized both the development and the operations in the same people.
But what we do have are developers and security people, and I think that DevOps is a good start, but really what you've got to look at is alignment between different teams. If you've solved the DevOps problem, congratulations. Figure out where else you have alignment issues and work those out. I mentioned that I love working with the security engineers at Netflix, and one of the reasons is because I would argue that what they consider success and what I consider success are the same thing. We work together for a common goal. In many other organizations, security people largely think of themselves as defending the organization from the bad decisions of developers.
At Netflix, security people help me get my job done better. If you solve DevOps and if you solve DevSec, then the next question is what's the next group you want to look at? One of the reasons I love working at Netflix is not just the engineers or the security people, but the facilities people and the purchasing people and the HR people. If you solve those two things, look at Dev-HR and Dev-Purchasing and Dev-Facilities. It's all about alignment. That was the point I was trying to make.
Manuel: Thank you very much, Roy.
Thanks.