On this week’s podcast, Wes Reisz talks with Ben Kehoe of iRobot. Ben is a Cloud Robotics Research Scientist at iRobot, where he works on using the Internet to allow robots to do more and better things. AWS, and in particular Lambda, is a core part of iRobot’s cloud-enabled robots. The two discuss iRobot’s cloud architecture. Key lessons on the podcast include thoughts on logging, deploying, unit/integration testing, service discovery, minimizing the cost of service-to-service calls, and Conway’s Law.
Key Takeaways
- The AWS platform, including services such as Kinesis, Lambda, and the IoT Gateway, provided the key components that allowed iRobot to build out everything they needed for Internet-connected robots in 2015.
- Cloud-enabled Roombas talk to the cloud via the IoT Gateway (which uses MQTT) and are able to perform large file uploads using mutually authenticated certificates signed by an iRobot certificate authority. The entire system is event-driven, with Lambda being used to perform actions based on the events that occur.
- When you’re using serverless, you are using managed infrastructure rather than building your own. That means you have to accept the limitations of those services as they exist. For example, until recently Lambda didn’t have an SQS integration, so you had to find inventive ways to make things work as you wanted.
- Serverless is all about the total cost of ownership. It’s not just about development time, but about all the areas needed to support operating the environment.
- iRobot takes an approach of unit testing functions locally but does integration testing on a deployed set of functions. A library called Placebo helps engineers record events sent to the cloud and then replay them for local unit tests.
- For logging/tracing, iRobot packages up the information that a function uses into a structured record that is sent to CloudWatch. They then pipe that into Sumo Logic to be able to trace executions. Most of the difficulties tend to occur closer to the edge.
- iRobot uses red/black deployments to stand up a completely separate stack when deploying. In addition, they flatten (or inline) their function calls on deployment. Both techniques are cost optimizations that prevent Lambdas calling Lambdas when not needed.
- Looking towards the future of serverless, there is still work to be done to offer the same feature set that more traditional applications can use with service meshes.
Show Notes
What does a cloud robotics research scientist do?
- 1:15 It’s a very buzz-wordy title, and would only get more buzz-wordy if I were to add serverless to it.
- 1:25 It’s using the internet to allow robots to do more and better things.
- 1:30 Giving devices access to perform massive computing processing in the cloud.
- 1:35 Giving them access to large databases of information to make better sense of the environment that they are in.
- 1:40 When one robot encounters a new situation and learns from it, you want all the robots to get smarter.
- 1:50 Those are some of the techniques - also, it enables you to connect robots to humans when the robot can’t solve the problem on its own.
- 1:55 There are lots of things the cloud can bring to robotics to level up the capability of robots.
Why did you decide to move to a serverless infrastructure?
- 2:20 I joined iRobot in 2015 and as a cloud roboticist, I worked on how we can use the internet to enable robots to do bigger and better things.
- 2:30 iRobot was attractive to me because it has the most robots - as a cloud roboticist, scale is important.
- 2:40 That was in the middle of the launch of the Roomba 980 - the first connected Roomba from iRobot.
- 2:45 At launch we were using a turnkey IoT solutions provider.
- 2:50 They were dealing with the connectivity, back-end, firmware update delivery processes - all of those pieces.
- 3:00 We realized even before launch that, while the provider had been chosen a few years before, it didn’t have the scale or extensibility we needed.
- 3:15 We had decided that we wanted to own the back end on the cloud and own that capability as part of core technology of iRobot.
- 3:25 We wanted to choose a different connectivity provider that manages the connections back to the cloud.
- 3:40 It turned out that AWS IoT, which was launching at the time, was a good choice for the connectivity layer.
- 3:50 While iRobot is historically a device company, and while we have experience with networked robotics and some cloud-connected robotics, we hadn’t really done cloud-connected robots at the scale of Roomba before.
- 4:00 We really didn’t want to be in the business of building scalable cloud applications - we make robots, and we wanted to focus on that.
- 4:10 AWS Lambda had just come out, and AWS IoT is a serverless offering - there are no scale knobs to tune.
- 4:20 We looked at the services available from AWS and thought we could build this without having to run any servers or any containers.
- 4:35 We were able to go all in because switching off the turnkey provider didn’t require porting any existing applications.
- 4:50 It allowed us to take on a very complex project and deliver it in a timely fashion and support it all with a small team.
What did it look like afterwards?
- 5:20 We use around 30 different AWS services in production as part of that single line of business.
- 5:30 That is probably a large number of services to apply to a single business problem.
- 5:40 There are many organisations which likely use a lot of AWS services across many different business solutions.
- 5:45 We made a decision early on that we were not going to be afraid of those services, and we became comfortable with CloudFormation custom resources (see the sketch at the end of this list).
- 5:55 Not all of these services have CloudFormation support.
- 6:00 What it required to become serverless-native was to bend everything to your will.
- 6:10 Taking the services that are out there and figuring out how to use them for the problem, rather than trying to reach an ideal solution.
- 6:20 We chose the technology and figured out how to duct-tape it together to use AWS Lambda to solve the problem.
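To make this concrete, here is a minimal sketch of a Lambda-backed CloudFormation custom resource handler in Python. The resource being configured is hypothetical; the response contract (an HTTP PUT of a JSON status document to the pre-signed ResponseURL) is CloudFormation’s standard protocol for custom resources.

```python
import json
import urllib.request

def configure_unsupported_service(properties):
    # Placeholder for calls (e.g. via boto3) to a service that
    # lacks native CloudFormation support.
    return "example-endpoint"

def handler(event, context):
    """Lambda-backed CloudFormation custom resource handler."""
    status, data = "SUCCESS", {}
    try:
        if event["RequestType"] in ("Create", "Update"):
            data["Endpoint"] = configure_unsupported_service(
                event.get("ResourceProperties", {}))
        elif event["RequestType"] == "Delete":
            pass  # tear the resource down here
    except Exception:
        status = "FAILED"

    # CloudFormation waits for this response before proceeding.
    body = json.dumps({
        "Status": status,
        "Reason": "See CloudWatch logs for details",
        "PhysicalResourceId": event.get("PhysicalResourceId",
                                        context.log_stream_name),
        "StackId": event["StackId"],
        "RequestId": event["RequestId"],
        "LogicalResourceId": event["LogicalResourceId"],
        "Data": data,
    }).encode()
    request = urllib.request.Request(
        event["ResponseURL"], data=body, method="PUT",
        headers={"Content-Type": ""})
    urllib.request.urlopen(request)
```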
How do devices communicate back?
- 6:45 AWS IoT uses an MQTT-based solution - an IoT protocol.
- 6:50 Devices have a bidirectional persistent connection to the cloud - so we can push data down to the robot as well as receive events from it.
- 7:05 Events are mostly related to device life-cycle, so when it starts a cleaning mission, it starts transmitting some data about what it is doing.
- 7:15 At the end of the cleaning cycle, it sends a report that says what it has done.
- 7:20 It also sends a request to upload a map, which is too large to transmit via MQTT.
- 7:35 The robot doesn’t authenticate over MQTT itself; rather, it uses a certificate-authenticated TLS connection to the cloud.
- 7:45 Those certificates are signed by an iRobot certificate authority - which we have registered with AWS - so any robot presenting a certificate signed by that authority can connect to the cloud.
- 8:10 We don’t have to go through a step of batch-sending all of the robot identities to AWS; we send the authority that signs them once, and the manufacturing process is unaffected.
- 8:25 When we switched over, there were already robots in the field whose cryptographic identities we hadn’t recorded at the factory, and maintaining a chain of identities from the manufacturers in China to us-east-1 could have been problematic.
- 8:45 That was a key choice in selecting AWS IoT, since they allowed you to bring your own device identities, which since then has become a feature of more IoT cloud providers.
- 9:00 What they don’t have is standard AWS credentials, so you can’t upload to S3 from the robot directly.
- 9:10 This is where pre-signed URLs come in - temporary URLs which have the credentials baked in - so we can send those URLs to the robots and tell them to upload the file (see the sketch at the end of this list).
- 9:25 The robot does an HTTP PUT to the URL (which is opaque to it) and uploads the file.
- 9:30 Based on that, we can receive events from S3 to determine that we have received the file, and update the mission record so that the app knows there is a map to fetch.
- 9:50 All of that behaves in an event-driven way, and in IoT, many things are event driven.
- 9:55 It’s therefore a natural fit for an event driven architecture and therefore a serverless architecture.
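A minimal sketch of the pre-signed upload flow just described, using boto3. The bucket and key names are hypothetical; generate_presigned_url is the standard boto3 call, and the robot side is shown with a generic HTTP client since the URL is opaque to the device.

```python
import boto3
import requests  # stands in for whatever HTTP client the device uses

s3 = boto3.client("s3")

# Cloud side: create a temporary URL that permits an HTTP PUT
# to a specific key, with the credentials baked into the URL.
url = s3.generate_presigned_url(
    "put_object",
    Params={"Bucket": "example-robot-maps",           # hypothetical
            "Key": "robot-1234/mission-42/map.bin"},  # hypothetical
    ExpiresIn=3600,  # valid for one hour
)

# Robot side: the URL is opaque; the device simply PUTs the file.
with open("map.bin", "rb") as f:
    requests.put(url, data=f)

# An S3 event notification then drives a Lambda that updates the
# mission record, so the app knows a map is available.
```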
What are some of the gotchas in an event-driven infrastructure?
- 11:20 There are a couple of things to think about when you’re serverless.
- 11:25 You’re using managed services instead of running your own software whenever possible.
- 11:35 You need to be able to accept the limitations of these services as they exist.
- 11:40 A consequence is that (for example) until recently, AWS Lambda didn’t support SQS.
- 11:45 If you wanted to integrate them, you had to set up a recurring event to drive a Lambda that polled the queue and triggered operations on any messages found (see the sketch at the end of this list).
- 11:55 That worked, and was a pattern that you could build up into a CloudFormation custom resource.
- 12:05 It certainly wasn’t elegant - a lot of this is about going to war with the services that you have instead of the ones that you want.
- 12:10 You don’t end up with as much of that when you have more control over the software that you’re putting in the cloud.
- 12:20 We’re using DynamoDB as an operational store - even for data that is relational.
- 12:30 We then do the join client-side in the Lambda - because there isn’t any HTTP support for RDS, and maintaining database connections from a highly scaled-out Lambda is going to choke your database.
- 12:45 The cold start time is also going to be high because you have to set up a database connection.
- 12:50 Those are architectural choices which - narrowly scoped to just that one choice - can seem limiting.
- 13:05 Serverless is about looking at the total cost of ownership across all of the system; not just the development time and the bill, but your operations, your time to market, and so on.
- 13:35 If Amazon RDS ends up getting an HTTP interface, we will reconsider.
- 13:40 This happened with Athena - it originally started with just a JDBC connection, then it gained an HTTP connection.
- 13:50 It was much harder to use with Lambda before that happened.
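A minimal sketch of the polling pattern described above, from before the native SQS-to-Lambda integration existed: a Lambda fired on a schedule (e.g. a CloudWatch Events rule) drains the queue. The queue URL and the process function are hypothetical; the boto3 calls are standard.

```python
import boto3

sqs = boto3.client("sqs")
# Hypothetical queue URL.
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/example-queue"

def process(body):
    print("processing", body)  # hypothetical business logic

def handler(event, context):
    """Scheduled Lambda standing in for the (then missing) native
    SQS trigger. Mind the function timeout when the queue is deep."""
    while True:
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL,
            MaxNumberOfMessages=10,
            WaitTimeSeconds=5,  # long poll to reduce empty receives
        )
        messages = resp.get("Messages", [])
        if not messages:
            break  # drained; wait for the next scheduled run
        for msg in messages:
            process(msg["Body"])
            # Delete only after processing succeeds.
            sqs.delete_message(QueueUrl=QUEUE_URL,
                               ReceiptHandle=msg["ReceiptHandle"])
```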
Do your serverless functions need to be in the cloud to work?
- 14:30 In the ideal case, the amount of code you’re writing is minimal, so function development isn’t the primary focus when you’re creating a serverless architecture.
- 14:40 You need to focus on the services; how you bring it all together.
- 14:45 If you’re using Lambda to handle all of the computation but accessing databases and message queues and authentication that’s running on Kubernetes, that’s less serverless than a system that’s using managed services.
- 15:10 In the overall system, the code should ideally just be the business logic that you’re doing differently from everybody else.
- 15:30 The starting point of creating a service for us is CloudFormation, not Lambda - figuring out the building blocks that you need to wire together.
- 15:40 Once you know how it is structured, and where you’re using Lambda, you can then decide what code goes in there.
- 15:45 Once the code is in Lambda, we test locally, but integration testing occurs on the deployed system.
- 16:00 Wanting to be able to use any service that we need means that we can’t rely on everything being mockable locally - so we run integration tests in the cloud.
- 16:15 For unit testing, on the other hand, we take the approach of using Placebo with our Python code, which hooks into the AWS SDK (see the sketch at the end of this list).
- 16:35 This allows you to record and replay AWS calls made with your credentials, capturing the state of the database, S3, etc. at that time.
- 17:10 That means your unit tests can run locally without needing to mock out services, because it intercepts requests at the SDK level.
- 17:45 Both approaches work - it’s developer preference - but both involve local unit testing and remote integration testing.
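A minimal sketch of the Placebo record/replay flow. The attach/record/playback calls are Placebo’s documented API; the data path and the DynamoDB call are hypothetical examples.

```python
import boto3
import placebo

# Record once, against the real cloud, with real credentials.
session = boto3.Session()
pill = placebo.attach(session, data_path="tests/responses")  # hypothetical path
pill.record()
session.client("dynamodb").describe_table(
    TableName="example-missions")  # hypothetical table; response is saved

# Later, in a local unit test: replay the saved responses.
# No network access or credentials are needed.
session = boto3.Session()
pill = placebo.attach(session, data_path="tests/responses")
pill.playback()
state = session.client("dynamodb").describe_table(
    TableName="example-missions")  # served from the recording
```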
What was the experience of getting to the cloud?
- 18:10 When we started in 2015, there weren’t many options out there.
- 18:15 The Serverless Framework (then called JAWS) was available at the time, but the capabilities that it had didn’t meet our needs.
- 18:50 We preferred having a cloud-side deployment strategy using CloudFormation, where the developer says: here’s what I want in the cloud.
- 19:00 The developer can then close their laptop and the automation continues.
- 19:05 Terraform has this problem as well; there’s a local tfstate file that could be on a CI/CD server somewhere, but it needs to be managed by the client.
- 19:20 We liked the approach of a service that handles all the work for you.
- 19:30 The downside to this approach is that it ties you in to your provider and only allows what they make available.
- 19:35 We quickly became proficient in CloudFormation custom resources.
- 19:40 We’ve open-sourced some of the libraries that help us do this.
- 19:50 We built tooling that essentially does what CloudFormation’s macro functionality now provides.
- 19:55 We allow developers to write templates (with some syntactic sugar) that look like AWS SAM.
- 20:10 That gets expanded with defaults by the tool - and now that’s possible to do cloud-side.
- 20:20 The transforms could be installed at an organisational level.
- 20:30 The developers are then using local tools to perform the deployments.
- 21:05 Often you want your long-lived data resources to be separate from your short-lived resources, even in the same service.
- 21:20 You’d want your template that contains the function code to be separate from the template that defines the database, because the database doesn’t change frequently - or may be defined externally.
- 21:35 We allow developers to define - in a single template - all of the resources that they depend on, and when we deploy it, indicate that some resources may be defined elsewhere through a reference.
What are some of the issues with running services in production?
- 22:25 We rely on logging as a mechanism.
- 22:30 We package up the information that a service is getting into a structured message which gets sent to CloudWatch (see the sketch at the end of this list).
- 22:35 Then we can use a tool to dive in and trace executions.
- 22:50 A lot of the difficulties that we encounter tend to be towards the edge, where we have interaction between robot and cloud.
- 22:55 They tend to be less about complex execution traces on the cloud side, and more about what’s happening in the messages themselves.
- 23:00 Those are a little easier to pick off and recognise.
- 23:10 We don’t have as many of the extended trace situations.
- 23:15 We have found that things we expected to take on as architectural/operational concerns didn’t materialise - like circuit-breaking logic or handling provider service outages.
- 23:30 It turns out that when you run a hyper-scale cloud, you build your services in a way that copes with customers who don’t worry about how easy or hard it is to come back after an outage.
- 23:45 So we don’t have to do that - we let our Lambdas point at the services they want, the SDK does exponential back-off, and our queues back up.
- 24:00 When the service comes back on-line, the queues empty and it’s all fine.
- 24:05 You only need to have circuit breaking logic when you’re forced to be a good neighbour.
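A minimal sketch of the structured-logging approach, assuming one JSON record per invocation; the field names are hypothetical. In Lambda, anything printed to stdout lands in CloudWatch Logs, where a downstream tool (Sumo Logic in iRobot’s case) can parse the records and trace executions.

```python
import json
import time
import uuid

def log_record(event, context, status, **fields):
    """Emit one structured JSON record; stdout goes to CloudWatch."""
    record = {
        "timestamp": time.time(),
        "function": context.function_name,
        "request_id": context.aws_request_id,
        # Hypothetical: a correlation id carried in the event so a
        # log tool can stitch related invocations together.
        "correlation_id": event.get("correlation_id", str(uuid.uuid4())),
        "status": status,
        **fields,
    }
    print(json.dumps(record))

def handler(event, context):
    log_record(event, context, "received", payload_keys=list(event))
    # ... business logic ...
    log_record(event, context, "done")
```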
What about rate-limiting?
- 24:20 Our primary situations tend to be around robot-side concerns, where an individual robot is being too chatty.
- 24:30 Those concerns tend to be within the AWS IoT service, which provides unique capabilities for dealing with individual clients, because they are individually identifiable.
- 24:45 We can black-list a misbehaving robot’s certificate (see the sketch below), which may not be as available in more general web application scenarios.
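One way to express that with the standard boto3 IoT API; the certificate ID is hypothetical. Deactivating the certificate prevents the robot from authenticating again.

```python
import boto3

iot = boto3.client("iot")

# Deactivate a misbehaving robot's certificate so it can no
# longer connect; "REVOKED" is another valid status.
iot.update_certificate(
    certificateId="abc123exampleid",  # hypothetical
    newStatus="INACTIVE",
)
```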
What about service discovery?
- 25:00 That was one of the first questions we encountered - how do we do client-side service discovery.
- 25:20 We decided to go with a red/black deployment paradigm.
- 26:00 Red/black deployment means standing up two complete copies of the infrastructure, including the load balancer, and having the client switch over who they are talking to.
- 26:50 On deployment, we stitch all of them together to in-line calls to Lambdas, so that we don’t keep going back out through the API gateway.
- 27:10 The Lambda is then able to access DynamoDB directly.
- 27:20 If you look into the code repository, there is very clear isolation between the different functionality.
- 27:40 The functionality that we provide for Roomba today, we don’t charge for.
- 28:20 There was a very interesting transition from the turnkey provider - they have a fixed cost, per device, per year.
- 28:35 AWS is all pay-per-use - so we went to the cost folks with predictions of what it was likely to be.
- 28:45 They’re used to buying parts and hardware and knowing the costs years in advance - pay-as-you-go is a paradigm shift in cost forecasting for a hardware manufacturer.
How does the deployment flattening work?
- 29:45 Every service provides an SDK - we’re Python across the whole thing.
- 29:55 Normally the SDK would be a thin client that talks back to the HTTP interface.
- 30:00 We can make that a thick client that includes most of the logic behind the API, and at deployment time, the thick SDK gets pulled over into the calling function and deployed with it so that it can access the resources directly (see the sketch at the end of this list).
- 30:20 The separation of ownership is still present - that service still owns the data - but the logic is shifted into the function itself.
- 30:45 When the app makes a call to the cloud, it’s talking to an API which is then talking to a wrapper around the service in the cloud.
- 30:50 I wouldn’t recommend it as a normal pattern - we have unique requirements because of our business and our scale that make it a wise choice for us.
- 31:00 As we look at our next-generation functionality - where it’s going to be in the hot path - we still want to use this model where we avoid Lambdas calling other Lambdas.
- 31:20 While you don’t pay for a Lambda function if it’s not running, you do pay for it while it’s waiting for something else.
- 31:30 However, the monolith has an impact on our cadence - though it’s still a model that’s working for us and is mature.
- 31:50 As we want to implement and prototype more features, we need to be able to deploy experimental or testing features alongside our monolith.
- 32:05 We’re looking for how to deploy them independently, and where the rough edges are for authentication, service deployment and discovery - all those problems we will need to solve.
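A hypothetical sketch of the thin-versus-thick client idea described above. The same SDK surface either invokes the owning service (thin) or, when flattened at deployment time, runs the service’s logic in-process and reads the data store directly (thick), avoiding a Lambda waiting on another Lambda. All names here are illustrative, not iRobot’s actual code.

```python
import json
import os
import boto3

class MissionClient:
    """Hypothetical SDK for a 'mission' service."""

    def __init__(self):
        # Set by the deployment tooling when the service's thick
        # SDK is bundled into the calling function.
        self.flattened = os.environ.get("FLATTEN_MISSIONS") == "1"
        if self.flattened:
            self.table = boto3.resource("dynamodb").Table(
                "example-missions")  # hypothetical table
        else:
            self.lam = boto3.client("lambda")

    def get_mission(self, mission_id):
        if self.flattened:
            # Thick client: the owning service's read logic,
            # inlined, hitting DynamoDB directly.
            return self.table.get_item(
                Key={"mission_id": mission_id}).get("Item")
        # Thin client: pay for a second Lambda while this one waits.
        resp = self.lam.invoke(
            FunctionName="example-mission-service",  # hypothetical
            Payload=json.dumps({"mission_id": mission_id}))
        return json.loads(resp["Payload"].read())
```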
What is the reverse Conway’s law?
- 32:20 Conway’s law says that the architecture of the software you build reflects your organisation’s communication structure.
- 33:00 When you change from traditional architectures to serverless architectures, you want to change the communication patterns to be more event driven.
- 33:15 What that implies is that if you don’t think about how to make your organisation’s communication patterns match what you now want to do in your software, you won’t be successful in changing.
- 33:30 You’re going to end up trying to fit traditional architectural patterns into a serverless world, and there will be an impedance mismatch.
- 33:40 You can build synchronous request-driven serverless services, but it’s a more natural fit to think about event-driven services.
- 33:50 Serverless architecture is ephemeral, and events are also ephemeral.
- 34:00 It’s useful to think in those patterns of how to pass information between services in an event-driven fashion, rather than thinking of writing to a database and periodically polling that database.
- 34:20 So instead of thinking about it in those terms, think of changes to the database as a stream of events which can be used to notify someone that something has happened (see the sketch at the end of this list).
- 34:25 This broadens what you can consider an API - if you have a serverless infrastructure, there will be an API gateway somewhere.
- 34:35 There may also be a Kinesis stream of events that you can hook into.
- 34:50 I like Step Functions as a model for workflow orchestration - state as a service, which goes hand-in-hand with stateless computation.
- 35:00 I like the idea of federated state machines, where these state machines or workflows are part of the API between services, with events that can be worked on or fired out.
- 35:30 These state machines can then be connected to each other to compute in a distributed fashion, while still being surrounded by a bounded context.
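A minimal sketch of treating database changes as a stream of events: a Lambda subscribed to a DynamoDB stream fans each change out, instead of other services polling the table. The SNS topic is hypothetical; the record shape is the standard DynamoDB Streams event format.

```python
import json
import boto3

sns = boto3.client("sns")
# Hypothetical fan-out topic.
TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:example-mission-events"

def handler(event, context):
    """Lambda subscribed to a DynamoDB stream."""
    for record in event["Records"]:
        if record["eventName"] in ("INSERT", "MODIFY"):
            # NewImage arrives in DynamoDB's attribute-value format.
            new_image = record["dynamodb"].get("NewImage", {})
            sns.publish(TopicArn=TOPIC_ARN,
                        Message=json.dumps(new_image))
```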
What are still some of the unsolved questions for serverless?
- 36:05 One thing I’m envious of in Kubernetes is Envoy and Istio.
- 36:15 They get this model from being able to deploy a side-car alongside their code.
- 36:20 In serverless, you don’t get the option to run side-cars - functions are only running when they are invoked.
- 36:30 That means if you put something inside your function to be a side-car, it’s only going to run when your service is running.
- 36:35 That means it’s going to spend most of its time catching up to what happened while it was off.
- 36:40 All of these things that are being solved in Kubernetes around a service mesh - authentication, authorisation, service discovery - aren’t solved yet in serverless infrastructure.
- 36:55 People on AWS generally now have to think about multi-account environments - what does authentication and authorisation look like between accounts?
- 37:05 Who is defining authorisation policy - is it the caller, or the called service?
- 37:10 Doing discovery as a service using AWS Parameter Store - how does that work across accounts?
- 37:40 Right now, you can cobble it together with various services, but it’s not something that happens out of the box, and it’s not plug-and-play.
- 37:50 At the same time, exactly what it should look like is an open question.
- 38:00 We in the community need to do more work on what it should look like - that should be the next big step.