BT

Facilitating the Spread of Knowledge and Innovation in Professional Software Development

Write for InfoQ

Topics

Choose your language

InfoQ Homepage Presentations Internet of Tomatoes: Building a Scalable Cloud Architecture

Internet of Tomatoes: Building a Scalable Cloud Architecture

Bookmarks
49:25

Summary

Flavia Paganelli tells the story of 30MHz’s platform (developed for the agriculture sector) and how they ended up helping growers in 30 countries, deploying 3.5K sensors and process data at 4K events per second. She shares their architecture, how it grew, the challenges, and how they are continuing to transform it - for example - to learn how to grow the best tomatoes.

Bio

Flavia Paganelli is co-founder and CTO of 30MHz, a data platform for agriculture with a mission to grow food more sustainably and efficiently. Before 30MHz, she founded 9Apps, a cloud boutique which helped startups scaling their Amazon Web Services infrastructure. She also worked at TomTom, where she developed their web route planner, and at Layar to build their mobile Augmented Reality app.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Paganelli: I'd like to tell you about my wedding night. It was a beautiful day, 21st of April 2011: friends, family, lovely food, dancing. In those days, I had a company with this guy who is now my husband, which we had to keep cloud applications online, 24/7, all the time. These applications were on Amazon Web Services. It was that same day that the largest AWS outage happened. Of course, some of our customers' applications went offline. It didn't only affect us. Massive services were knocked offline or degraded. It took three days for everything to go back to normal. Some people talk about it as the day when it was apparent, what the consequences of the cloud were. My wedding night, I ended up fixing servers.

30MHz Ecosystem

Over the last 10 years, part of my responsibility has been to keep cloud applications running all the time. I know that one of the things that you need to make sure you succeed, not to have these awful alerts, awful nights, is you have to design robust and scalable applications. It sounds obvious. I would like to share some of the lessons I learned and how I apply them with my team, in 30MHz, to build what is now a data-driven horticulture platform. To not talk in abstract terms, I took this slide from the marketing team. This is how our users who are growers see our application with these nice visualizations. The data is coming from different sources including 5000 sensors that we provide ourselves. It can also be other inputs from different systems needed by the growers.

Just to give you an idea what the system is, or the platform. We have 400 customers. We have around 5000 sensors. We write 4000 events per minute of streaming data. We have 6 TBs of data viewable in real-time. All those graphs that you were seeing, they can be accessed immediately. What makes me most proud, our uptimes are normally 100%. The user doesn't see any problems. The system is always online. We do get notifications when something goes wrong, but the user doesn't. The data durability is also 100%. We don't lose data, normally.

I'd like to tell you a little bit of the story. I hope not to bore you. We didn't jump in to make a horticulture platform. It was a coincidence, because we started with the idea of making a web monitoring service. We had all these problems, alarms. We needed to be able to know exactly what was going on all the time with our customers' applications. There were existing web monitoring tools, if we wanted to use those and monitor everything, it will become very expensive. We said, "We can build it for ourselves," typical developer. Then we can make it into a product that hundreds of thousands of users were going to use, in our dreams.

Did anyone use a web monitoring tool ever? The concept is super simple. I have this URL. I want to know if it's up or down. I want to get notifications, and tell me the response time. Maybe you do it from different regions in the world. You would see things like these, where you see the response time and if there was downtime.

For us, the functionality was clear. We wanted it to have a series of characteristics that were very important for us. This monitoring platform had to be resilient. On top of waking up for our customers' applications, we didn't want to wake up for this platform that we were going to build. It had to be reliable. We didn't want false alarms. We didn't want to miss any alarms. It had to be performant. If something is down, I want to know immediately. It had to be scalable because we were planning to have 100,000 clients with 1 million events per minute. It had to be affordable because we had to be able to maintain it, and sell it as a service.

Lesson 1: Modularity

How did we go about designing it? I'm going to use these lessons too, that I learned to explain how we designed it. First, we thought, what are the main responsibilities that our system has to have? Make, for each of those, one component. That would give us the possibility of using the best programming language for each task, for example, be more maintainable. In the future, maybe have different teams working on different components. Yes, big dreams. I will explain the core of the architecture that we built. The responsibilities we had in this system were something that would probe the web applications, a ping. We needed something to schedule when to probe what? It's a cron, let's say. We need something to send notifications, SMS, emails. We need something to save the results because we want to make those graphs with the response time. We want to see when it was that something went wrong.

Lesson 2: Asynchronous Communication

Next to this we need to connect the components. The idea is that if you decouple your components using asynchronous communication instead of synchronous communication, you will not have the problem of components failing because other components are down. You can let each component be independent, and grow, and scale by itself. We used messaging software for this. For example, the scheduler would tell the probe when to probe certain web applications via a queue, so sending messages in a queue. In our case, we used Amazon SQS, the obvious choice at the moment. This was 2013.

Lesson 3: Self-Healing and Self-Scaling

Each component needed to be self-healing and self-scaling. Of course, we want the whole system to be resilient and we have individual components. We want them to be smart enough to make sure they are alive. For that we used autoscaling. We run everything on EC2 instances. With autoscaling, we made sure we had several instances per group. If one instance would fail, then there's always another. Then configure it to scale based on different parameters like load, CPU, size of the queue. For all these, you have to make sure that the instances are immutable or stateless. You have to always be able to replace the instances.

This is the core architecture with the core components where we have the scheduler, which is like the cron, is all the time filling these queues of the probes telling them what to do. The probes, which probe the websites or web applications, send the results to these two queues for the postman to send the notifications and for the historian to save the results. It's a constant flow of messages where all these queues have to be empty all the time. In that way, you have quick notifications. You have the data available immediately in your datastore, so streaming data.

Where do we store all these data? We also had some requirements for the datastore because we wanted to scale and we wanted to be big. We are handling time-series data. We need to write very fast to the database, because we need to see the results immediately and read immediately. We want something distributed and scalable, because otherwise we are going to be limited by the size of the database. We need to be able to scale horizontally by adding more nodes. If we have a flexible, not predefined schema, that was useful because we didn't know exactly how we were going to end. It had to be affordable. This is why we chose Elasticsearch. Did any of you have contact in any way with Elasticsearch? You probably know it from logging. Elasticsearch is an analytics and search engine based on Lucene. It's open-source. It's very fast, very scalable, designed for scale. It checks all the boxes. To be honest, it has been great, this choice, because we were able to scale. We were able to get very fast responses. All the data that we show immediately in dashboards is queried in Elasticsearch. The queries are super fast, if it's configured right, of course.

Lesson 4: Piggyback on Scalable Software and Services

This leads me to the next lesson. Don't build everything yourself like us. Don't build everything, piggyback on scalable software and services. There's a lot out there that you can use and you can grow with it. Without any work from your side, you will have a lot of advantages. For us, Elasticsearch, we started when it was I think version 1, and they are now in version 7. The improvements are immense. It's much more stable, much more performant. Those are some of the other services that we used. It can be software that you rely on. We're running Elasticsearch ourselves not as a service, but it can be software. It can be services. Choose something that you think is promising. Just take advantage of that.

Lesson 5: Monitor Everything

The next thing, very important: monitor everything. Because if you do then you know exactly what is going on with your whole living thing, which is your application. You should be able to monitor each and every component, the queues, everything that you use, the database, memory, CPU. Because even though maybe it's not something that you have to wake up for, you will know in advance what is starting to fail first. If you know what's going to fail first you can prepare and you can maybe re-implement whatever you need to re-implement, or change some component, or re-architect even. You need time for that. Of course, we relied a lot on CloudWatch. Sorry, I'm mentioning a lot of service names and so on, but I want to make it, in a way, a little bit generic, but also to have a reference of the services that exist on Amazon at least. We used also our own platform to monitor our platform, if that makes sense.

Of course not everything was perfect. One of the problems in this architecture was it was doable. It was not keeping us awake. The scheduler is a single point of failure. If you ever try to write a scalable cron software, then you know that it's difficult to do that, because you cannot just replicate and make many nodes because otherwise you duplicate the jobs. I think now there are more options for doing this, back then there weren't. I will get back to this because we have other solutions. We also had to rewrite components because the original language of choice was not good, like Python for multithreading and diverse reasons. We have hardware failures, but that was ok because we had autoscaling taking care of that. It wasn't usually a big problem. We had a lot of Elasticsearch load issues because it was a little bit less mature technology and we were learning how to work with it. Sometimes, basically the cluster would go red, or yellow, or red. Yellow, something needs to be changed, red, you're in trouble.

Because we had this queue in between, we were able to just stop the historian. Don't write anything anymore. We fix the database. Then we can continue, and no data was lost. In the end, we were pretty Zen because we were able to know exactly what was going on with all our customers, and we had a lot of control. In this slide you can see that we are monitoring our own components.

This is all not so interesting because later on, we had a friend who worked at a museum. Because we were probably there, we talked wonders about our system. He said, "If it's so good, your monitoring tool, why don't you monitor the projectors in the museum where I am? Because they are always failing and we don't know. We know too late when this happens." We took the challenge. We said, "We could do that." Because in the end, the probe is just something that is pinging a URL, so if the projectors are exposed with a URL, even if it's a local host URL, we can do it.

What we did was to run the probe in the museum on a Raspberry Pi. Actually, the architecture didn't change. What changed is where we were running. In what machine we were running one component. Then, all the communication to the rest of the platform was the same, via queues. We had this, that's a Raspberry Pi and that's a projector. It worked.

Given our successes, we had another friend. He was in the business of vertical farms, growing plants in a very small space in vertical rows. He wanted to add sensors to understand what was going on in there. I don't know whose idea it was. He said, "If you monitor projectors, you can also have sensors and you can help me out." We took the challenge again. Instead of connecting to projectors, we took sensors for temperature, humidity, CO2, and we connected them to the Raspberry Pi via a wireless protocol. Then the rest was the same. The probe only had to just send the data up like it always did only now it was other numbers, instead of response times, other values. It's the same. For free, we had notifications, and the dashboards we already had. That's how we became a sensor platform, very simplified. Then it looked like this. The sensors, and the probes were locally on the Edge, and the rest stayed in the cloud.

Of course, nothing is perfect. What are some of the problems of this architecture with this new functionality? The Raspberry Pi's need an internet connection continuously, because everything that it does is just determined by the cloud with queues. Ping this, ping that, give me the results. That is a problem that we are tackling and we'll get back. Of course, as we increase the number of devices, then we have to increase how much we pay for the queuing system. Worse if you want to connect your Raspberry Pi using a 3G or 4G connection, then you have to pay for a mobile data. SQS has a big overhead. Also, of course, we have different focus. Now the focus is on the data, before it was on the notifications.

The postman, which is the component which sends notifications, was involved in everything. It had to know everything that was going on. All the messages go through it. Now we don't need it anymore, only 5% maybe of the customers set alarms if the temperature is above a certain range, so 5% of the sensors have notification. This component was growing. The load was growing more, unnecessarily. We were able to change that with a relatively easy fix without changing the architecture by deciding if something needed to go to the postman or not. Then this was dragged to the next queue.

Actually, we became a sensor platform and we said we can do all sensors for you, whatever you want to sense we will do it, anything. We realized that the people who were more interested in our sensors were the farmers, more specifically growers, doing indoor farming, with greenhouse especially. We morphed again and we became a horticulture platform. That comes with a lot of new challenges.

I talked about the streaming data and how we handled it, how we could manage to make it very responsible, available very fast. Streaming data, sensor data is only part of it. We're starting to have more data sources. Another thing is that we are not anymore just a data platform. We do collect data and we do visualize data but we saw the users want to do more with it. Let's say, if you have all this data you can predict what's going to happen. What my yield is going to be, typical example. You can help me optimize. Can I use less water, or less CO2? We need to start using other tools and services to apply data science to the data and to get more insights. We also have some challenges for building applications on top of the platform itself. One example, we are building a climate profiler, which takes historical data and tries to advice what's the best climate profile for a day in your greenhouse for a given crop. That, of course, requires more than just collecting and showing data. We add intelligence.

About the data in particular, I wanted to mention a couple of the challenges. Still, all of the data is time-based. Elasticsearch is still a great tool for it. The fact that we can combine data from different sources, and then collect it in a generic way allows us to, for example, compare data in one graph. How was the day on Monday? For example, how was the humidity on Monday? How was the insect population? The humidity has an impact on the insect population. Then you can learn from that. This is an example of the data that we have, which is collected specifically by the users themselves. They walk the greenhouse and they look to find, how many insects are in each compartment, each week? Then they can learn, what were the causes of the outbursts of these pests, for example?

Manual input data has nothing to do with streaming data. It's sparse. It can be modified. It's not read-only. The sensor data is write once and then read many times. Maybe they want to add data in the future like weather forecast. We also have positives to these. We are allowing the users to communicate about what's going on. Then they can write comments and have conversations in the platform. Elasticsearch is ideal for that because everything you put in there you can search it very easily. For many of the new challenges, which I'm not covering, everything has its limits.

Lesson 6: Use Serverless Solutions

We use serverless solutions because it's the same idea as piggybacking on other services. You use something that is already taken care of, that you don't have to monitor. It just runs. It solves a problem. You concentrate more on your functionality and less in the operations. Of course, remove your single point of failures, which is something we are doing now by getting rid of the scheduler. What we are doing is moving this functionality, even to the sensors, because the sensors are all the time just sending their data, pushing their data to the gateway. We don't have to have someone saying you have to check this. It's already there. This is one of the things that is going to bring a lot of advantages, because the restriction of having internet connectivity all the time is not there anymore. Going back to the talk of the keynote, if you can do everything on the Edge, why not? You can collect the data. Then batch it whenever you need it, whenever you're done with it. Then you reduce your queuing costs down. Your mobiles costs go down, because you can use other things like MQTT to have very light messaging. We are using IoT Greengrass, and that allows us to do more. We can have functionality on the Edge servers and not just depend for everything we do on the cloud. We can have intelligence. We can have machine learning models, Lambda functions running on the Edge device.

Summary

What I say is that if you follow simple architecture principles like these, I say from my experience that you will be able to adapt to new functionality, because that's what we did. We went from web monitoring to a horticulture platform. If you monitor everything, if you have something that is robust and reliable, you will have peace of mind and you will be able to look ahead. You will know, what is the thing that is going to fail first and work on that in advance? Design it in advance. Of course, you will have better wedding nights.

This photo, which is very bad quality, I wanted to put it there because we are currently growing cherry tomatoes in a greenhouse, remotely. We get data from sensors to know what is going on. What's the climate? What are the climate conditions, how the plants are doing? Based on that, we apply certain algorithms with certain machine learning models, and we decide how to change the environment. Change the temperatures. Switch on the heating pipes. Open windows. Irrigate more or less. Throw more or less CO2. This is happening now. The photo is so bad because it's from a webcam. We are not allowed to go in the greenhouse. Last week we had the first harvest with quality A tomatoes, which is quite amazing. It's a competition. We are participating with four other teams. This talk didn't fit this topic.

Questions and Answers

Participant 1: Following you rely on a lot of AWS services, do you see that as a risk? If so, how do you deal with that risk?

Paganelli: I think you have to choose one cloud service and stick with it. I don't see a solution body having some servers in one cloud provider and then some in another. I don't see a risk in Amazon. It's just growing and growing. They're adding services. Functionality-wise, it's perfect. We get support from them as well. I don't see a lot of risk. I think it's also like there's no choice. You either go for one or the other, and then you stick with it and you go blindly. I don't know. Do you see other choices?

Participant 1: I can imagine that you stick with one but that you choose, for example, for layer of abstraction above these services so that you can move to another vendor later on as a mitigating mechanism. That's something I can think of.

Paganelli: I think some of these concepts help you with that. Because if you have modularity, you have queues in between. You can replace your queues. You can replace your components to put them somewhere else. It is possible. Then you don't just blindly use everything there is, because for example, Elasticsearch as a service, it exists in Amazon. What I read and heard about it, it's not very good. You can't do everything you want. You can't configure everything the way you want it. Then we go and we host it ourselves on Amazon, but we host it ourselves. Of course, you have to be intelligent to choose your tools. This is what we do.

Participant 2: How do you detect sensor failure and handle it?

Paganelli: The sensors are continuously sending data every minute. We immediately see if there is no data. Then we said, if it's important, we can set alarms. We can say, if a number of the sensors are offline, send the alarm. We'll send an alarm to the user itself. Also, the sensors send diagnostic information like, what's the battery level? Sometimes there are sensors with several super sensors, and then it sends information about how each of them is doing.

Participant 3: It so happens that I'm involved in something very similar for the average farmer, not in a greenhouse setting, in India. You mentioned Raspberry Pi as an example. Then we are looking for something that is far more cost effective and that can be implemented at each farmer's plot. Have you looked at cost rationalization of the sensors? You talked about disconnected systems, wherein you don't have always-on internet connection. Similarly, any savings in terms of sensor cost?

Paganelli: First, about the gateway, the Raspberry Pi. We actually don't use the Raspberry Pi anymore. Now we use BeagleBone Black, which is more scalable, more industrial ready. About cost, it hasn't been our focus. It means there's a lot of potential. We could reduce price a lot if we innovate there. We focus more on quality. Yes, there should be potential for improvement there. Also, depends on what are the customers you have and expectations.

Also, it's interesting that you mentioned working without an internet connection. Because if you look at that, the autonomous greenhouse, you're going to need really to work, I think in a real production environment. You're going to need to work without an internet connection and to make decisions on the device, because otherwise your tomatoes will die basically. If you depend on your internet connection or in the platform, which might not be 100% available, then maybe you lose hours of irrigation, and your plants dried up. It was interesting in the keynote because he was talking about the power of these devices. There's a lot more that can be done.

Participant 4: You talked about removing this postman component in the end. Who decides when to send alerts? The probes, or?

Paganelli: For the sensors itself, the user can decide if they want to receive a notification, say, if the temperature is above a certain threshold or some other condition. That's configured in a database, in this case, DynamoDB. Then the scheduler reads this, and the messages are written based on that.

Participant 5: We work on a project where we get about 16,000 readings a minute, at the moment. Only problems we've got is data retention. We don't necessarily need every single sensor ID. Do you have any strategy in place to minimize how much data you retain and at what point do you aggregate if at any point?

Paganelli: We are basically working on it. First, we are deleting all the old data that we don't need anymore, from old customers and things like that. Usually, you don't need to access all the data in real-time. It's very nice to have, but you should be able to decrease your performance for older data. One of the things we are looking into is to have perhaps different Elasticsearch clusters with different power, different capacity. Then we can have this distribution, old data, newer data. Yes, you could use others. One of the things that we are also considering is replacing SQS by something else like Kafka. Maybe you can do your recent data queries on that instead of on the Elasticsearch cluster. There are so many things. It's interesting.

 

See more presentations with transcripts

 

Recorded at:

May 12, 2020

BT