1. We are here at QCon London 2015. I am sitting here with Caitie McCaffrey. Caitie, who are you?
Like you said, I am Caitie McCaffrey. I am a distributed systems engineer. I currently work at Twitter on their infrastructure and platform team, but prior to that I spent six years working in the entertainment industry, building services that powered a lot of the entertainment experiences that people have probably interacted with: Xbox games including Gears of War 2 and Gears of War 3. I spent a long time working on the Halo services and shipped Halo 4 in November of 2012, and then I also worked at HBO as well.
So, especially for the Xbox ecosystem – generally people buy a game disc, put it into their console and start playing. In the original games environment, all you had was the code that was running on the disc. Now, as we have services that start to power these entertainment experiences, we end up with a set of services provided by Xbox Live, which gives you things like your identity – that's your gamer tag – and authenticates you as a user, achievements – so you can build up your gamer score – matchmaking services, things like that.
Other games, like Gears of War and Halo, started to develop their own set of services to accentuate the experience or help provide a more long-term experience, keep the game going and give users a variety of things to do. So, for instance, in Halo, we built a set of services that did a ton of things, like our own custom statistics service that did really detailed analytics per player.
So, we knew all of your stats – kills, deaths, headshots, maps, accuracy per weapon, who you killed the most, who you got killed by the most – so that you could go and get revenge on your friends maybe and take them down. There were daily challenges that were published via the service down to your console every day to say “Go do x thing and we'll give you some more XP” – all kinds of stuff like that. You could make custom maps and upload them to the server and then all your friends could play them, and we ran tournaments online – so this is all stuff where we had a set of services talking to your console that gave you a variety of experiences that changed over the course of the game.
Yes. So one of the big challenges, especially around triple-A games, which are the equivalent of a blockbuster movie in that space, is that you have these worldwide release dates and a ton of marketing money put into them, so people all go and buy your game on the same day or within the same week. Especially when you are running services, you go from zero to everyone – you are at maximum load within the first week – and that is really hard to scale for, because in the case of most services you build something, then you get some users and you are trying to build your user base, and that is actually the hard part of the business, right? So, you get to see how you scale as you ramp up over time. But in games, you typically go from zero to tons of users and then it sort of trails off. So, being able to deal with that is actually fairly challenging.
I think it is a little bit of both, but we definitely did a ton of stress and perf testing, especially for Halo. We rebuilt the services for Halo 4 from the prior games, which had their own set of services running on SQL boxes and physical machines; those had been built over 10 years and were no longer meeting the needs of what we wanted to do for the future. So, we rebuilt a whole new set using Azure and a new technology called Orleans, and since this was all new, we definitely wanted to test it out in the wild, and there is nothing like production data. So, what we actually did was to rebuild one of the services from Halo: Reach, a game which came out before Halo 4, using that tech, and re-launch it into the wild a year before Halo 4’s launch date. This was the presence service, which got heartbeat data.
So, it told us what you were doing every so often – every one second to 30 seconds, depending on whether a second-screen device was connected. This actually gave us a lot of really good data because, a), it had to be up – although the penalty for being down was that we just fell back to the old service – and it gave us real production traffic and real load and real scale. During the course of that, we actually found a memory leak in the new tech we were using, and we fixed that before we went to Halo 4, which was an even larger load and more critical for some services where we had nothing to fall back on, because we rebuilt the whole thing. In addition, we also did stress and perf testing on all of the services that we shipped for Halo 4 before we went live. We also stress and perf tested some of our Azure components, because Azure was fairly new when we started working with it. So, for instance, in the course of stress and perf testing our system with Service Bus, we also found a bug in Azure Service Bus, where they were not throttling traffic correctly, and we actually took down our instances and some of the neighboring instances. So, we made everyone in the ecosystem better. That was great.
Yes. We spun up most of our instances in Azure. We worked with the Orleans team, which is a group from Microsoft Research; they had their own test cluster of physical machines that they ran on – those were super beefy machines, so they got really good numbers specifically for what they were doing. So, we used some of that as well. But mostly we just used Azure, and then we would either spin up and generate some load and slam it against our services, or you could do some of it in isolation on your machine if you were micro-benchmarking something. Stuff like that.
Werner: So it is not crowdsourced. You do not hire students all over the world and say “click here”.
No, we did not do that for services. When you do games development, you will have testers and you will have user research people who come in and play the game – but that is more feedback on design – or you will have testers paid to play the game, but that is also more feedback for the console and less about how the services performed, other than “Hey, my stats got here and I can see them”.
Werner: You mentioned a memory leak. Can you tell us more? I like memory leaks.
Sure. I am trying to remember the specific details because it was a while ago, but basically – Orleans is an actor framework built by Microsoft Research, and I do not remember exactly what was leaking memory in it, but what we found was that when we deployed it into production, it had about an eight-hour life span before we had to cycle the boxes, because we ran out of memory. So, we just couldn’t get the throughput we needed. It was actually a really super fun holiday around 2011 where I was on call and we were cycling the boxes every 8 hours – I remember I did duty on New Year’s Eve, so it was not a super fun New Year’s Eve. But we were just cycling it. What was really cool was that the Microsoft team worked incredibly closely with us. So, when we were cycling these machines we were also taking memory dumps of the old cluster so that we could go and dig through and find it. The Orleans guys were actually the ones responsible for finding and fixing that – it is their system, they knew it best. But the fact that they helped do that as well was just amazing.
Sure. The actor model is a framework for reasoning about and doing concurrent computation. The basis of it is actors, which are these core units of concurrent computation. You can think of them like little worker bees: they have their own state, and the only way that they can interact with one another is by passing asynchronous messages to each other. This way, you limit the opportunity that concurrent computation has to do bad things to shared state, because the only shared state you have are these messages, and ideally they are isolated messages, so if you modify the message you get, it does not modify the sender’s copy, and the same thing vice versa. This idea came from a paper in the 70s that actually came out of the artificial intelligence community, but it has been repurposed in distributed systems.
So, if you haven’t heard of Orleans: people may have heard of things like Akka as an actor framework, and the Erlang programming language is very heavily based on the actor model – things like that. Orleans is an actor framework, runtime and programming model. What they did is take this idea of actors – they call them grains sometimes. You spin up these actors and then you make strongly typed actor references, because it is C#, so it is object-oriented programming; you just write functions on them like you would and define the messages that they can take. Then, as a programmer, it hides the idea of where those actors live. This is why Orleans is different from other actor frameworks – you have this concept of virtual actors. The Orleans runtime will determine where these actors run and manage their lifecycles for you in your cluster, and you do not have to write any of that code as a developer, as you would in Akka or Erlang. So that is a really nice abstraction, and it also allows the runtime to do interesting things like failure and fault recovery for you.
So, if a silo – a machine in your cluster is sometimes called a silo – goes down, it will just reallocate the actors that were on it somewhere else; it will deal with moving actors off of hot machines if that happens and bring them up somewhere else, and as a programmer I don’t have to do anything about this. When you are on call, like in DevOps, you also do not get woken up, which is great.
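To make the virtual-actor idea concrete, a grain definition in Orleans looks roughly like the sketch below. This is a minimal illustration assuming a recent Orleans release (the Halo-era API differed in details), and the grain, method and key names are invented for the example; it is not code from the Halo services.

```csharp
// Minimal sketch of an Orleans grain (virtual actor). Names are illustrative.
using System.Threading.Tasks;
using Orleans;

// The strongly typed actor ("grain") interface: every method is asynchronous.
public interface IPlayerGrain : IGrainWithStringKey
{
    Task RecordKill(string victimGamertag);
    Task<long> GetTotalKills();
}

// The grain implementation. The runtime decides which silo it lives on,
// activates it on demand, and re-activates it elsewhere if that silo dies.
public class PlayerGrain : Grain, IPlayerGrain
{
    private long totalKills;

    public Task RecordKill(string victimGamertag)
    {
        totalKills++;                 // single-threaded within this activation
        return Task.CompletedTask;
    }

    public Task<long> GetTotalKills() => Task.FromResult(totalKills);
}

// A caller never knows (or cares) which machine the grain is running on:
//   var player = grainFactory.GetGrain<IPlayerGrain>("SomeGamertag");
//   await player.RecordKill("SomeOtherGamertag");
```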
Werner: That is always good.
Yes.
7. Is one actor a thread, or how does it map to the execution model?
The Orleans runtime actually manages a pool of threads for you, and generally there is one per core of the box, because they found that is just the most performant way to do it. You get to execute your code, which is defined via your classes, in chunks called turns. All the code that runs inside of a single actor – which is a unique ID and an object type – is executed single-threadedly, so you get this nice single-threaded concurrent programming model without having to deal with concurrency issues inside of your actor. Then, once you go and do an async call, the actor gives up its turn and the Orleans scheduler will go in and schedule some more work there. The compiler, when you import Orleans, prevents you from blocking threads, because it only allows you to use asynchronous methods on actors. So, as long as you are not doing something blocking in your code, you aren’t going to block, it’ll just schedule everything, and you get really good throughput throughout your system.
The trick there is that as long as you are just talking to actors it is impossible to block. You could go off box to talk to some other service and do something asynchronous, or you could talk to a service using an asynchronous interface, or talk to storage, once again using an asynchronous interface. As long as you are using asynchronous interfaces and yielding the thread, the Orleans scheduler is super happy. One thing that is incredibly cool about Orleans, unlike other single-threaded models, is that because there is a thread pool scheduling for you, and it is not just one single thread, you can do expensive computations and you don't run into the problem of starvation or long back-ups for the rest of the actors in the system, because there are other threads that do work.
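A rough sketch of what “giving up its turn” looks like in practice: when a grain method awaits an asynchronous call, no thread is blocked and the scheduler can run other actors in the meantime. The grain, service URL and method names below are hypothetical illustrations, not anything from the Halo services.

```csharp
// Sketch: a grain method that yields its turn at each await instead of blocking.
using System.Net.Http;
using System.Threading.Tasks;
using Orleans;

public interface IChallengeGrain : IGrainWithGuidKey
{
    Task<string> GetDailyChallenge();
}

public class ChallengeGrain : Grain, IChallengeGrain
{
    private static readonly HttpClient http = new HttpClient();

    public async Task<string> GetDailyChallenge()
    {
        // Asynchronous I/O: the grain gives up its turn here; no thread is blocked
        // while the request is in flight, so other actors can be scheduled.
        var challenge = await http.GetStringAsync("https://example.com/daily-challenge");

        // When the response arrives, the scheduler resumes this grain in a new turn,
        // still single-threaded with respect to this activation.
        return challenge;
    }
}
```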
I don’t think that is necessarily true. I think the CLR and .NET have done an amazing job, especially in C#, of enabling asynchronous programming. The async/await patterns that are in .NET 4.5 are really, really beautiful and are some of my favorite ways to do async programming. I have also programmed asynchronously in things like Node and Java and Scala and JavaScript. You can totally do that and those all work fine. They are great. The idea behind the asynchronous versus the synchronous model is that asynchronous is not necessarily faster – that is actually an implementation detail. The goal there is that you get more throughput if you are doing things correctly.
So, for instance, we wrote our own custom front-end dispatcher in Halo – sort of a really lightweight front end – because IIS was a little too slow and too heavy for what we wanted to do. When we were pulling things off of the socket, that was actually a synchronous API call, because we found – and we micro-benchmarked it – that if you used the async library it was just a little slower than doing the synchronous call, because of all the context switching and stuff like that. So we did not do anything async there; that piece just got its own thread and churned on that. I think it has actually gotten better in the .NET library – this was a couple of years ago – but it is just an example to highlight that async does not necessarily mean faster. It generally means faster if the implementation is done really well, and what it also gives you is throughput, right?
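The “throughput, not per-call speed” point can be shown with a small sketch: each awaited request is no faster than its synchronous equivalent, but because no thread is blocked, many requests can be in flight at once. The URLs here are placeholders, not a real endpoint.

```csharp
// Sketch: async gains come from concurrency (throughput), not from making a
// single call faster.
using System;
using System.Linq;
using System.Net.Http;
using System.Threading.Tasks;

public static class ThroughputDemo
{
    private static readonly HttpClient http = new HttpClient();

    public static async Task Main()
    {
        var urls = Enumerable.Range(0, 50)
                             .Select(i => $"https://example.com/stats/{i}");

        // 50 requests in flight concurrently; total time is roughly that of the
        // slowest single request, rather than the sum of all of them.
        var responses = await Task.WhenAll(urls.Select(u => http.GetStringAsync(u)));

        Console.WriteLine($"Fetched {responses.Length} responses");
    }
}
```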
Werner: I guess it gives you more control. You do not get stuck on a synchronous blocking call.
Yes. Asynchronous programming, especially when you use things like promises or turns or futures or whatever your vernacular of choice is, allows you to do concurrent computation without having to take locks, and you can sort of chain things and do stuff without having to use any crazy or expensive locks or monitors or things like that.
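As a small illustration of chaining futures instead of sharing state under a lock: each step consumes the previous step’s result, so no mutex is needed. The service interfaces and method names below are hypothetical, invented only to show the shape.

```csharp
// Sketch: composing two futures ("fork/join") without locks.
using System.Threading.Tasks;

public interface IStatsService   { Task<int> GetKills(string gamertag); }
public interface IRankingService { Task<int> GetRank(string gamertag); }

public class StatsPipeline
{
    public async Task<string> BuildPlayerSummary(string gamertag,
                                                 IStatsService stats,
                                                 IRankingService ranking)
    {
        // Kick off both requests; neither blocks a thread.
        Task<int> killsTask = stats.GetKills(gamertag);
        Task<int> rankTask  = ranking.GetRank(gamertag);

        // "Joining" the two futures is just awaiting them; there is no shared
        // mutable state for two threads to race on, so no locks are required.
        int kills = await killsTask;
        int rank  = await rankTask;

        return $"{gamertag}: {kills} kills, rank {rank}";
    }
}
```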
Werner: Yes, we all hate locks.
I mean, they are useful when you have to have them, but they should be avoided at all costs, right?
Yes, I like to call myself a distributed systems engineer because I like things like – if you are running on hundreds of cores and dealing with large chunks of data, that is where my interests all lie. Recently, I have been reading this book called “The Theory and Practice of Replication”, which is a summary of all of the replication theory from the 70’s all the way to 2007 – so it is missing things like CRDTs, which stands for – oh, I always forget what that acronym stands for – but essentially, they give you some trade-offs on consistency by modeling a lot of different operations in terms of sets, or things that can then be combined together, so you can have highly available data stores and keep writing through partitions in your system.
This is really cool because I think a lot of the early research in that area was about “How do you get consensus? How do you do linearizable reads and writes?” Those are really good things to have in a lot of circumstances where you just have to be correct, but I think, as we go forward in the industry, what we are noticing, especially in the business world, is that you want to be available and then you might be able to go and fix up some of your consistency problems later – we should skew toward being available, because having outages and downtime is no longer a thing that is tolerated. For instance, in Halo, we could not tolerate any downtime.
So, we made a lot of trade-offs for availability. One of the things I am really interested in exploring right now is: how do we build really available systems that do not give up a ton on consistency, but also do not go all the way to programming against consensus protocols like Raft or ZooKeeper and things like that? Because that gets expensive and then you have no control – if a ZooKeeper node goes down, then you are just out of luck and your service is going to be down until it comes back up. So, things like, for instance: Orleans uses a distributed hash table to determine where actors are placed, and a lot of distributed hash tables place things deterministically, which has issues in itself in that it causes hot spots – it is getting consistency by deterministically placing things.
But Orleans does not, and it also does not use consensus to say “Hey, something is always here”. What ends up happening is that you might have multiple activations of the same actor running at a time in failure modes, but then you can always merge and reconcile by writing code to do that. That is a trade-off so that you can always, always, always talk to an actor. I am starting to explore how we get there more: how do we get those consensus protocols out of the way, and how do we write to storage using immutability and idempotent operations, so that we can build incredibly available services that are also still correct.
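One minimal sketch of the “merge and reconcile” idea, in the spirit of the CRDT-style sets mentioned above: if a failure ever produces two divergent copies of the same actor’s state, state modeled as a set of immutable, idempotent facts (a grow-only set) can always be merged by union, with no coordination required. The type and property names are invented for illustration, and the example uses the System.Collections.Immutable package; it is not code from Orleans or Halo.

```csharp
// Sketch: state as a grow-only set of immutable facts, merged by set union.
using System.Collections.Immutable;

public sealed class MatchHistory
{
    // Each completed match is an immutable fact identified by its match id,
    // so recording it twice is harmless (idempotent).
    public ImmutableHashSet<string> CompletedMatchIds { get; }

    public MatchHistory(ImmutableHashSet<string> ids) => CompletedMatchIds = ids;

    public static MatchHistory Empty { get; } =
        new MatchHistory(ImmutableHashSet<string>.Empty);

    public MatchHistory Record(string matchId) =>
        new MatchHistory(CompletedMatchIds.Add(matchId));

    // Merging two divergent copies is just set union; merge order never matters,
    // which is what makes the reconciliation safe after duplicate activations.
    public MatchHistory Merge(MatchHistory other) =>
        new MatchHistory(CompletedMatchIds.Union(other.CompletedMatchIds));
}
```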
I think right now we are in a world where there aren’t a ton of great tools and frameworks to help us – people who just want to go write an application, and an application that will scale. A lot of the current databases out there have marketing departments because they have a product to sell, so they are also not necessarily being super honest a lot of the time; there are some that are doing a better job than others. So, I am hoping – and I made this analogy yesterday in my talk at QCon – I think distributed systems programming today is in the era that computer programming as a whole was in in the 60’s. Back then everyone was writing assembly, and that was hard and only sort of worked for my one bespoke machine that ran the code; today we are all writing these bespoke services that work for my one instance and scale. What would be great is having frameworks and tools that are easy to use and focus on developer productivity, but are also highly focused on performance and scalability.
One of the reasons I really love Orleans – even though I am no longer at Halo, I still speak about it and still tell people about it – is because it had those two goals in mind, and it makes building services that scale really, really easy, because the developer programming model and framework is really nice and it also just scales linearly. At least we did not hit a cap at Halo scaling linearly – obviously there is one, because of physics, and it is not magic – but you can scale linearly for a very, very long time, doing some really intense compute across hundreds of cores. So, it is really cool and I think we need more of those. When the team at IBM with Fran Allen and others went and wrote the compiler for Fortran, people sort of pushed back and said, “Oh no! I do not want to give up writing assembly because I can do that really performantly”.
I am hoping that we start seeing more tools like that, analogous to compilers for machines – tools and frameworks that make programming a distributed system not super hard, that make it easier and make sense for people who just want to go write really great applications and focus on that, so they can iterate on that, while maybe some of us are still solving these crazy problems behind the scenes. Obviously people will always want to write bare metal, because there will be people pushing the scaling boundaries and that is going to happen, but hopefully we can start generalizing some of these problems and help other people solve them, instead of “Oh, I just wrote this!” and you have to understand all this crazy distributed-systems literature and read all of these papers to even have a prayer of making it work in production.
Werner: It is a bit like garbage collection.
Yes. People understand garbage collection in theory, but how many people could write a great garbage collector right now? I would probably write a crappy one right now, to be honest.
Werner: Even basic understanding of garbage collection is somewhat rare sometimes.
Yes. It is really interesting, too, because, for instance, I was on the boundary where they started switching CS curricula in colleges and high schools, in the US at least, to no longer teach C++ or C as the first language, but Java. It was easier to get people to program by learning Java, because you do not have to teach them how to deal with pointers and all the crazy memory allocation; you just get to sort of hand-wave and say, “Look, you can write a for loop and some classes”, and things kind of work. If you want to write high-performance code, obviously you need to go and understand that, but for the most part, people can just start writing code, and the barrier to entry is much, much lower. So, I am hoping we get there with distributed systems as well.
12. To wrap up, where can people follow you, follow your travels into the research space?
I am @caitie on Twitter, and I have a blog at http://caitiem.com/ – so, check that out. I think those are pretty much my contacts.
Werner: I think earlier you mentioned one of your projects, something you write about, that I should bring up.
Oh, yes. Earlier this year I started a project that I call Tech WCW. #wcw is the hashtag for “Woman Crush Wednesday” on the internet, and I have sort of appropriated it: what I try to do is, once a month, write about a prominent female in computer science history. Mostly I started this because I thought it was cool and it was interesting and inspiring for me. I write a blog post and try to find cool, old-timey pictures, and I will tweet them and post them on my blog, so there is a collection of three or four up there. People seem to enjoy them, and if you like that kind of stuff, go check it out. Or if you know of women that you think I would be interested in, send them my way, because sometimes these women are actually hard to find, even though they did really great stuff. It is a personal pet project of mine; I really like it and I hope other people like and enjoy it too.
Werner: Definitely. Audience, your job is to give Caitie a lot of ideas, refer lots of interesting people to her. Thank you, Caitie.
Cool. Thanks.