1. We are here at QCon New York 2014 and I am sitting here with Yoni Goldberg. Yoni, who are you?
Good question, and thank you for inviting me. So far, QCon has been great. Who am I? Good question. I am first of all an Israeli; I served five years in the Army and then came to the US and went to school here, at MIT, where I did my undergrad and masters in computer science. I have been working at Gilt as their lead engineer for the last four years. Before that, I spent some time at Google and IBM. I love coding; I have been coding all my life, for as long as I can remember. I spent some time in product management and then decided: “It’s not for me. I want to go back to coding”.
Werner: You mentioned Gilt. Maybe a quick introduction for people who do not know what Gilt is.
Gilt is a very interesting e-commerce company. It started in 2007. What do we do? We run flash sales: sales that start at a specific time and last 36 hours. The promise is that the items are always discounted and the inventory is limited, which creates interesting problems for us. The most famous one is the spike we have every day at noon. Every day at noon, tens of thousands of people run to the store at once, similar to Black Friday, and we need to handle that and still deliver a good experience. What also makes us unique is that we really curate the items and try to give an amazing customer experience throughout, from buying to the moment the package arrives at their home. That is what it is all about.
We went through all the cycles. We started as a Ruby shop in 2007, when Ruby was pretty hot. We used to be on the cloud, but at the time it did not work that well. It was also kind of expensive, and some of our instances were sharing processors with other instances; the separation was not as good as it is today, so we had performance issues there. Then we moved to our own data center, which solved that aspect. We also had issues scaling. It was one monolithic app that worked pretty well with one database, but once we ran a couple of big sales it really hit us, and we had to find a better solution than that.
We just could not handle the load. So what we started to do was break it up into smaller services. It was still not the architecture that we use today, which is microservices; it was more like macroservices: a couple of core services running alongside some dedicated databases. Some of them were Voldemort, which is a key-value store, just to get caching and be faster on some core parts of the site. And we started to use the JVM because it was better with concurrency; it required fewer processes and fewer resources just to keep things up and running. This led to where we are today, with more than 350 services, most of them written in Scala, and a lot of small MongoDB databases to keep everything running.
The first service that Gilt broke out was the product service, just to pull all the product data out. Then we had the inventory service, the cart, and the page generation for the main pages: basically everything related to the core experience. As long as the three pages where you browse the site and the checkout worked fine, we were good. These were the core things. I can go deeper into the details, but they are the core and they are still there.
Werner: These are the things that had to be up, whereas the rest could go down, but it would not be a massive issue.
Yes. This is the thing the customers felt they were paying for. We still have a lot of legacy code that is really behind the scenes. It would be great to move it into Scala, it would be better, but that was never a priority; there was always something more important to do. But I am sure that eventually we will get there.
4. So the pain wasn’t big enough to move those services forward for now?
Yeah, that's the way it is. Because when you get to 350 services, a lot of them are just running fine. We monitor all of them, there is more or less ownership for most of the services, and you basically do not look at one unless there is a problem, because you keep moving on, unless it is a service that requires continuous development. Most of the stuff that I worked on got into a stable state, we are pretty happy with it, and then we move on to a new project, but we keep monitoring them. The only cases where you go back to an old service are when something is wrong or when you really need to build new functionality.
I think the best way to think about it is abstractly: take a topic and ask whether all of it can be scoped into one service. One example is the account service. Another one is brand preference: if you like a brand or have any preference, let’s put it there. Another one we use it for is loyalty: one service that manages all the points and everything the user would use them for. The same goes for products and orders; there is a concept you can abstract into one place. Some people say they count lines of code; I do not really use that as a parameter, because if you have something that can be cleanly separated, I think it is fine. Adrian from Netflix says he likes to use a verb: if you can define the microservice with a verb, then that should be one service.
Yes, and there is a good reason for that. Actually, two good reasons. One: if you have one shared database, you get to the point of overloading it pretty fast. Everyone calls into it, it gets complicated and hairy, and then you have to start monitoring which services and which users are making the calls that slow you down. Also, if you think about the data aspect, it is about ownership. Every time you do a schema change on a shared database, you need to make sure it is approved and that someone maintains it, and you basically create a lot of process on top of it.
On the other hand, you can create a small database, which we actually do with Postgres and also with Mongo, where it is pretty easy to just set up a small instance. At that point, the team that owns the service also owns the database. At the end of the day, the service is just a representation of the model that is stored in the database. You move the ownership to the small team that owns everything, so whenever they want to do a schema change or a migration, they own it. They own the full stack.
So, first of all, every service is responsible for some part. For example, the shipping address of a user should be stored in one service. If you change anything else on the user that is not related to the shipping address, that service does not care about it. Loyalty is actually a good example: let’s say we want to give you points whenever an order is made. At the end of the day, the only service that cares that the order was made, and needs to do something right now, is the order service. The transaction goes there; it puts it in a queue and processes it whenever needed, but that service is responsible for making sure all of it eventually happens.
But giving you points when the order is made does not need to be in the same transaction and happen immediately. It would be nice if it were immediate, and we try to get close to that, but we just fire an event into a queue, Kafka or Rabbit, and whenever the loyalty service can catch up on it, it processes the event and gives you the points. Usually it happens almost in real time. So you can create these queues, and every service that needs to act because another service, which owns the actual data, wants some other action taken, can do so; the queuing system is a pretty good solution for that.
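As a rough sketch of that pattern, assuming Kafka's standard Java producer API, with an invented topic name and event shape (this is not Gilt's actual code), the order service might publish an event like this:

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

// Event fired by the order service once the order itself is persisted.
// The loyalty service consumes it from the queue and awards points later;
// nothing here blocks the checkout path.
case class OrderPlaced(orderId: String, userId: String, totalCents: Long)

object OrderEvents {
  private val props = new Properties()
  props.put("bootstrap.servers", "kafka:9092") // assumed broker address
  props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

  private val producer = new KafkaProducer[String, String](props)

  // Fire the event and move on; the consumer catches up whenever it can,
  // which in practice is almost real time.
  def publish(event: OrderPlaced): Unit = {
    val json =
      s"""{"orderId":"${event.orderId}","userId":"${event.userId}","totalCents":${event.totalCents}}"""
    producer.send(new ProducerRecord[String, String]("order-placed", event.orderId, json))
  }
}
```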
Werner: OK. So, you solved it by queuing essentially.
For this use case, yes. Distributed transactions are, I would say, an anti-pattern, and they are very hard to code: if there are writes that need to happen transactionally in different places, in different databases, it is very difficult to make sure everything works well. The only way around it, if you really need it, is to make every call idempotent, so that if something fails you can make it easy on your code and simply start the whole process again. Say you need to make three calls and all of them need to happen: two succeeded and one failed. Instead of tracking exactly which ones failed, assuming they do not need to happen in a particular order, you just retry all of them. Because each call is idempotent, the ones that already succeeded will return the same response; they will not create a new thing, they will just say: “It has already happened. Here is the response for you.” So you build the client and the service such that even if a depending service makes a mistake and calls twice, the system stays the same no matter how many times you execute a call.
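A minimal sketch of what that idempotency could look like, with invented names and an in-memory map standing in for real persistence:

```scala
import scala.collection.concurrent.TrieMap

// Each call carries a client-chosen idempotency key; a replay returns the
// stored result instead of applying the side effect a second time.
object LoyaltyService {
  private val applied = TrieMap.empty[String, Int] // key -> points awarded

  def awardPoints(idempotencyKey: String, userId: String, points: Int): Int =
    applied.getOrElseUpdate(idempotencyKey, {
      // ... persist the points for userId here, exactly once ...
      points
    })
}

object RetryDemo extends App {
  // Retrying after a partial failure is now safe: the system ends up in the
  // same state no matter how many times the caller executes the call.
  val first  = LoyaltyService.awardPoints("order-123", "user-9", 50) // awards 50
  val replay = LoyaltyService.awardPoints("order-123", "user-9", 50) // no double award
  println(first == replay) // true
}
```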
Werner: OK. So that is something that has to go into the logic of the service essentially.
Yes. I think microservices really empower the team because, at the end of the day, each team owns its own code. It is really about making the decisions and making sure that whoever uses the API, the service will handle it the right way and fail safely. Things will fail, things will break, but the service is responsible for keeping the actual data model as accurate as we can.
We have more boxes than that, because every service deploys to at least three boxes; most run on 10-15, and some grab elastic capacity when we need it.
You mean cost network-wise or just hardware-wise?
Werner: Simply if one service calls another it has to serialize things and send data across.
It’s hairy, yes. I think this is our biggest problem: the biggest problem of microservices is dependencies. You got it right; our network load can be very hairy, especially since we use REST and JSON as the protocol, and yes, it is a problem. We used to have different services on one box, a lot of them even, and one service could start impacting the others because it created a lot of load. But now we are moving more to Docker on EC2, so every time we deploy we get an instance for only this service, for this one instance of the service, and that should solve it.
Also, dependencies between services are what cause most of the I/O that microservices have. One microservice can depend on ten or dozens of other services. This is where the Reactive Manifesto and reactive architecture really work well for us. We use futures, we use the Akka library for that, and one request to one web app can fan out into up to 15 asynchronous calls to other microservices. But because we make them all in parallel, the response time at the end of the day is the max of all of them, so it is working pretty well for us.
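A small illustration of that fan-out with plain Scala futures; the service calls are stubbed out and the names are invented:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object FanOutDemo extends App {
  // Stand-ins for calls to other microservices.
  def fetchUser(id: String): Future[String]    = Future { "user" }
  def fetchCart(id: String): Future[String]    = Future { "cart" }
  def fetchLoyalty(id: String): Future[String] = Future { "points" }

  // Start all the calls before sequencing them so they run in parallel:
  // total latency is roughly the max of the three, not their sum.
  val userF    = fetchUser("u1")
  val cartF    = fetchCart("u1")
  val loyaltyF = fetchLoyalty("u1")

  val page: Future[String] = for {
    user    <- userF
    cart    <- cartF
    loyalty <- loyaltyF
  } yield s"$user / $cart / $loyalty"

  println(Await.result(page, 5.seconds))
}
```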
10. You are using JSON for communication. Have you looked into protocol buffers and things like that?
We are looking a bit into Avro as we speak. I've personally used protocol buffers before, so I like them. But there are benefits to JSON, which is very developer friendly in that sense. Performance-wise it is nothing compared to binary, but the good thing is that the libraries that serialize JSON in Scala are getting much better; the Play and Spray ones are still a work in progress, and most of it sits on top of Jackson. Yes, there is a huge overhead: it creates a lot of traffic and CPU load, but so far this has not been our biggest problem. It’s been OK.
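For reference, this is roughly what (de)serialization looks like with Play's JSON library, one of the Scala libraries mentioned; the Product model is invented:

```scala
import play.api.libs.json._

case class Product(id: String, name: String, priceCents: Long)

object Product {
  // Macro-derived serializer; Jackson does the low-level parsing underneath.
  implicit val format: Format[Product] = Json.format[Product]
}

object JsonDemo extends App {
  val json   = Json.toJson(Product("p1", "Silk scarf", 4900))
  val parsed = Json.parse("""{"id":"p1","name":"Silk scarf","priceCents":4900}""").as[Product]
  println(json)   // {"id":"p1","name":"Silk scarf","priceCents":4900}
  println(parsed) // Product(p1,Silk scarf,4900)
}
```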
Werner: I guess that is the price you have to pay for all the benefits of microservices.
Yes. The price is complexity, right? You cannot run it all on your laptop anymore. You always have to tunnel, always need integration environments, or even tunnel to production if necessary. The big promise of microservices, as far as I can see, is independence and ownership, which can give big companies, enterprise companies, the feel of having startups inside them, because each team is really independent: they control when they deploy things, they control how they want to develop, each team has a lot of freedom. That is what makes it more fun.
Werner: I guess also this separation aspect where, for one thing, everything runs in its own process, so you have the stability aspect, i.e. if things fail they don't affect others. Also, I guess, on a team level, one team can screw up and they can replace it. Is that something...
Exactly. That is the beauty. If one service has issues during the day and it does not belong to my team, the other team can handle it, unless they need our help; usually you stay out of it. When you have a monolithic app and something is wrong, everyone has to stop and ask, “Maybe it is my problem, maybe it is not”. The separation of problems is really, really good, and because each service is small, when you have a problem it is much easier to find the responsible code and fix it. With a monolithic app, if you are not familiar with the code, it can take you days to figure out what is going on. With a small service, even one I have never seen before, there is a chance that within a couple of hours of seeing it for the first time I can actually debug it, understand what is going on, and fix the problem. We try to keep more or less the same patterns, so even if you move between different services, they all have the same structure of server and clients, which makes things very easy for us.
Werner: You already mentioned that you have different languages at the company, so at least Ruby, Scala and I guess Java.
Some Java from before. We are experimenting right now with Node. We have a lot of good JavaScript developers; a lot of them started with Scala, but some of them do not want Scala, so we empower them by giving them Node.js as another option. I am a big fan of that, especially the event loop it is built on. Play does a very similar job, which is why I am also a big fan of Play.
Werner: OK. So, what is your policy on adopting new languages? I mean, you say you like Node.js and so on.
That is a policy for which all the credit goes to our CTO, Mike Bryzek. He is bullish about it: “You want to do something cool? Experiment with it, try it out!” You need to have reasoning for why you want to try it and what value you are going to get out of it. If it works fine, that is great, but you need to remember that you have to maintain it, and if other people in the company start to use it, you need to support it and give them guidance. Personally, I think that with languages especially, there should be a very good reason to add another one. Trying languages, one static and one dynamic, just for fun is great, but having many languages is tough, and so is getting people to know all of them at a really good level. Adding other technologies, like a database, feels different: if you have a good use case, why not move ahead with it?
Can you repeat that?
Werner: How do you prevent weird emergent phenomena from happening? Weird things going wild. How do you handle microservices? How do you handle the number of microservices?
Creating a new microservice is team dependent. We do not try to control it; it is very decentralized, whatever the teams want to do. How do we keep track of it? We need to make sure that every microservice has an owner. For all the services I have been involved with, the only time I stopped looking at them was when I made sure there was a transition to a different team. As long as you have a point of contact, you know whether these services have any issues; how we monitor them is a separate question. The thing that matters, again, is ownership. There is ownership from the beginning, but you need to make sure it always exists.
12. So the monitoring aspect, how do you handle that?
It is really about tools. I think there are some good tools out there. New Relic just keeps improving; it is really a great tool for monitoring the different microservices, and they are doing a great job on that. We use OpenTSDB for graphing different metrics inside the apps, and we use Boundary to really understand what is going on within our network and where the bottlenecks are. But it is still tough; you still do not have full visibility into everything. I think the biggest thing New Relic did for us is showing percentiles, so you can really dig in and make sure your service always performs well. That has been great. It is connected, of course, to PagerDuty, so every time there is a problem, the team that owns the service gets an alert and tries to fix it.
It is a work in progress right now; there is a lot of focus on that. We started by basically having internal APIs that we always provided a client on top of, so consumers never dealt with the API directly. Now we are much more bullish on REST APIs. If you are familiar with Swagger: we are still evaluating whether we will go with Swagger or with something internal that we are building. Basically, you define a schema using JSON Schema, and this defines the models and also the API endpoints that should be available. Then the idea is that you can generate the clients on top of that.
So you create generators in different languages: if we need a Ruby gem, or a Java library, or Scala, we can use the Play library or Spray and generate clients which have zero dependencies, depending on what you want. At the end of the day, it is all about versioning: every time you release a new version of your server, whether it includes a client inside or you update the JSON schema, you need to make sure you version the client with it. Then, every time you have a breaking change, you need to make sure there are no clients out there that still expect the old server. We use OpenGrok, which is a great open-source tool, to search across all our Git repos and find where the client needs to be upgraded before we can change the server side.
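To make the idea concrete, here is a hedged sketch of what a schema-generated, versioned Scala client might look like; the names and version scheme are invented, not Gilt's actual generator output:

```scala
import scala.concurrent.{ExecutionContext, Future}

// Model defined by the JSON schema; regenerated whenever the schema changes.
case class Product(id: String, name: String, priceCents: Long)

// Client v2.1.0, generated against schema version 2.1.0 of the server.
// A breaking schema change forces a new major client version, which is
// what you grep for across repos before changing the server side.
class ProductClient(baseUrl: String)(implicit ec: ExecutionContext) {
  def getById(id: String): Future[Option[Product]] = Future {
    // ... issue GET s"$baseUrl/products/$id" and parse the JSON body ...
    None // stubbed out in this sketch
  }
}
```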
So, again, it is a lot of responsibility for the developers, and you cannot trust that 100%. So, on top of the developer being smart, you have the option to add functional tests: every time we deploy to production, we deploy to the integration environment first, and then we can run functional tests on top of it. The idea is that if you deploy service X and there is a service Y that depends on it, every time we deploy service X, service Y can make functional calls to service X to make sure that the API it is aware of still exists. It works pretty well, but it requires a lot of developer responsibility and motivation to write this kind of test, because they are tough.
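A sketch of the kind of cross-service functional test described, reusing the hypothetical ProductClient from above and assuming ScalaTest; the environment URL and test id are made up:

```scala
import org.scalatest.FunSuite
import scala.concurrent.Await
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Run by service Y against the integration environment after service X
// deploys, to verify that the endpoints Y consumes still exist and parse.
class ProductApiContractTest extends FunSuite {
  val client = new ProductClient("https://product.integration.example.com") // assumed URL

  test("GET /products/:id still exists and deserializes") {
    val result = Await.result(client.getById("known-test-id"), 10.seconds)
    // The contract here is shape, not content: after the new deploy the
    // call must not 404 or fail deserialization.
    assert(result.forall(_.id == "known-test-id"))
  }
}
```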
It was a pain point for Gilt when we moved from 2.9 to 2.10, which was kind of painful: for a while, every time we released something, we had to support both 2.9 and 2.10. I think that Typesafe and the Scala community in general understood this, and they are trying to reduce the breaking changes, which I think is the goal going forward. The move to 2.11 is supposed to be much easier. But I guess this is the bigger question: do you want to be on the edge, keep your team inspired, use the best practices available right now, and take the risk of spending more time later when things break, or do you want to play it stable and build with Java? We believe that with Java 8 things are going to be better, but comparing Scala 2.10 with Java 7, the Collections options you have there and the immutability that is really built in give you really nice concurrency.
Functional programming in general is great. For me, as someone who used to be a Java developer and moved to Scala, I cannot see myself going back to Java anymore. So I think it was worth it, but it is expensive: it required a lot of time and it can be annoying at times, though in the long run I think it has proven itself. I think the biggest thing is to keep doing the upgrades as soon as they come out, because our legacy code in Ruby is still on a very old version of Ruby, and if we had kept updating as soon as each version came out, it would have been much less painful. Now we have tons of legacy code and it is somewhat impossible. So we are probably going to rewrite all those parts in Scala eventually, but it is tough.
I think it is the way it handles monads in general. The Collections, using map over and over, and the immutability: making sure that everything is immutable. I cannot look at mutable code anymore; it kills me, it is really painful now. And I like the way traits are built. I am not that familiar with Java 8, so I can’t say how much of the stuff I am mentioning is already available there. But being able to mix in multiple traits, the way futures and Akka are integrated into Scala right now, and in general the way the Collections and the monads are integrated: that is the best of it.
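A tiny illustration of the features mentioned, all standard Scala and nothing Gilt-specific: mixing in multiple traits, immutable collections, and map:

```scala
trait Identified  { def id: String }
trait Timestamped { def createdAt: Long }

// A class can mix in as many traits as it needs.
case class Order(id: String, createdAt: Long, totalCents: Long)
  extends Identified with Timestamped

object ImmutabilityDemo extends App {
  val orders = List(Order("a", 1L, 1000), Order("b", 2L, 2500))

  // map never mutates: it returns a new list, the original stays untouched.
  val discounted = orders.map(o => o.copy(totalCents = o.totalCents * 90 / 100))

  println(orders.map(_.totalCents))     // List(1000, 2500)
  println(discounted.map(_.totalCents)) // List(900, 2250)
}
```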
Me, personally, I use futures a lot. Actors: I have a lot of peers who are huge fans of them. I still have not hit the classic use case myself, but I can totally see scenarios where they would be useful, especially when you can break work down into very small tasks and keep going. We use them for our notification system at Gilt, where you need to fire millions of notifications at the same time; that seems like a good case. Anywhere you need to process a lot of data in parallel, I would definitely go with actors. Futures have been really, really good for me when I want to make a bunch of asynchronous calls and do not care when each one finishes, I just want them done as fast as possible. Also, for any task the user should not have to wait on, for example sending some analytics data, where I want a guarantee that it is going to happen but do not want to slow down the user, I put it in a future and let that take care of it.
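That last use case is simple to show; here is a sketch with an invented sendAnalytics stand-in:

```scala
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

object Checkout {
  def sendAnalytics(event: String): Unit = {
    // ... post the event to the analytics service ...
  }

  // The analytics call runs on another thread inside a Future; the user's
  // response is returned immediately and is never slowed down by it.
  def complete(orderId: String): String = {
    Future { sendAnalytics(s"checkout:$orderId") }
    s"Order $orderId confirmed"
  }
}
```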
Werner: I guess, in microservices, asynchronous programming is very, very important, but also very painful and so if Scala provides monadic future support and things like that, that's useful.
Yes. I feel like Scala Play, specifically, is really built for that: you can handle web requests asynchronously, so basically just a few threads can handle thousands of requests while all the other futures run, and the thread is only needed again once everything has come back and the response is ready. I think that is really smart; it is built for exactly that.
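In Play 2.x terms, that looks roughly like this; the controller and the downstream calls are illustrative stand-ins:

```scala
import play.api.mvc._
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

object UserController extends Controller {
  // Action.async returns a Future[Result]: the request-handling thread is
  // freed while the downstream futures run, and the response is rendered
  // only once they complete. A few threads can serve thousands of requests.
  def profile(userId: String) = Action.async {
    val userF   = Future("user data")   // stand-ins for microservice calls
    val ordersF = Future("order data")
    for {
      user   <- userF
      orders <- ordersF
    } yield Ok(s"$user / $orders")
  }
}
```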
Akka, of course, but another Scala library that we use a lot at Gilt is Spray, which is a very nice REST API library; I think they are merging it into Play and it is going to be in the next release. If you use Mongo, Salat is by far the best Mongo library out there; I really enjoy working with it. ReactiveMongo is also up-and-coming and pretty good. And Scalaz, if you are into functional programming, you should pick up for sure. It has a steep learning curve, but once you get it, I think it is great. I think that is about it.