John DesJardins on Continuous Intelligence and In-Memory Computing

In this podcast, John DesJardins, chief technology officer at Hazelcast, met with InfoQ podcast co-host Thomas Betts to discuss the idea of continuous intelligence. This is a paradigm shift from traditional business intelligence, and relies on a corresponding move from batch-based ETL and reporting to continuous processing of streaming data. Although the languages being used, such as Python and SQL, will be familiar, developers must pay special attention to the characteristics of time-series data, especially in near-real-time scenarios. We cover the current state of the tools and technologies in use, why companies are adopting continuous intelligence to remain competitive, and we even get a bit into what the future of data processing and analysis will look like.

Key Takeaways

  • Streaming data analysis is a new paradigm, and companies will need to think differently as they move from traditional business intelligence to continuous intelligence.
  • Developers and data scientists can use languages they are familiar with, such as Python, C#, or SQL, but when analyzing time-series data they must always account for the time factor.
  • Continuous intelligence is quickly becoming a requirement for companies to remain competitive.
  • As the technology matures, more capabilities will be accessible by non-technical people, and the systems will take advantage of automated machine learning to provide new insights with less effort.

Introduction [00:15]

Thomas Betts: Hello, and thank you for tuning in to another episode of the InfoQ podcast. I'm Thomas Betts, co-host of the podcast, lead editor for architecture and design at InfoQ, and a senior principal software engineer at Blackbaud.

I had the chance to speak with John DesJardins of Hazelcast. John was previously on the podcast to talk about in-memory data grids. We brought him back to discuss the concept of continuous intelligence. This is a paradigm shift from traditional business intelligence, and relies on a corresponding move from batch-based ETL and reporting to continuous processing of streaming data. Although the languages being used, such as Python and SQL, will be familiar to developers and data scientists, you must pay special attention to the characteristics of time-series data, especially in near-real-time scenarios.

We cover the current state of the tools and technologies in use, why companies are adopting continuous intelligence to remain competitive, and we even get a bit into what the future of data processing and analysis will look like. I hope you enjoy our discussion.

John, thank you for joining me on the InfoQ podcast today.

John DesJardins: Thanks for having me. Excited to be here.

Thomas Betts: I wanted to start with the first line of your bio, the phrase "ultra-fast in-memory computing platform." Now, when I hear in-memory, the first thing that jumps to my mind is caching, but I get the sense that it's more than just that. Can you tell me a little bit about what the platform provides?

Overview of in-memory computing [01:33]

John DesJardins: Sure, absolutely. Caching was certainly one of the first applications of keeping data in memory to make things run faster. Hazelcast has actually long been known as a solution for distributed caching. Instead of just caching on your local server, you could have that distributed and scaled out over a cluster of servers, so you can put more data in and get other benefits.

But Hazelcast and in-memory computing have come a long way since then. The reason, really, is that memory became a lot cheaper over, say, the last 15 years. As it became cheaper, you were able to put more data into a server, more DRAM, and that opened up the ability to have higher density within these kinds of solutions. That also opened up the potential for more use cases.

Today, people are reading and writing to Hazelcast in order to get sub-millisecond response times from wherever their application code is. That's the first and foremost thing that in-memory computing was used for. It goes beyond just the cache, which is more about the reads, to also having fast writes. Then we can asynchronously write the data through to a system of record. Your changes get written to Hazelcast, and then we write them out asynchronously. That speeds up the writes as well as the reads.
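
As a minimal sketch of what that write-behind pattern looks like in Hazelcast's Java API, assuming a hypothetical OrderMapStore class that persists to the system of record (the map name and store class are invented; a write delay greater than zero enables asynchronous write-behind):

```java
import com.hazelcast.config.Config;
import com.hazelcast.config.MapStoreConfig;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.map.IMap;

public class WriteBehindSketch {
    public static void main(String[] args) {
        Config config = new Config();
        // Hypothetical MapStore implementation that writes to the system of record.
        MapStoreConfig mapStore = new MapStoreConfig()
                .setClassName("com.example.OrderMapStore")
                .setWriteDelaySeconds(5); // > 0 means asynchronous write-behind

        config.getMapConfig("orders").setMapStoreConfig(mapStore);
        HazelcastInstance hz = Hazelcast.newHazelcastInstance(config);

        IMap<String, String> orders = hz.getMap("orders");
        orders.put("order-42", "{\"total\": 129.99}"); // fast in-memory write;
        // Hazelcast flushes the change to OrderMapStore asynchronously.
    }
}
```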

But the next thing you start doing, once you have all this data in memory, is querying the data, grouping it, aggregating it, and filtering it. You start to use it like an in-memory database. That's the first phase of the evolution.
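
For illustration, a hedged sketch of that in-memory-database style of use with Hazelcast's query and aggregation APIs; the payments map and its amount field are invented for the example:

```java
import java.util.Collection;
import com.hazelcast.aggregation.Aggregators;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.map.IMap;
import com.hazelcast.query.Predicates;

public class QuerySketch {
    // hz is an existing HazelcastInstance; Payment is an illustrative value type.
    static void query(HazelcastInstance hz) {
        IMap<String, Payment> payments = hz.getMap("payments");

        // Filter: all payments over 1000, evaluated cluster-side.
        Collection<Payment> large =
                payments.values(Predicates.greaterThan("amount", 1000.0));

        // Aggregate: average payment amount computed across the whole cluster.
        Double avg = payments.aggregate(Aggregators.doubleAvg("amount"));
        System.out.println(large.size() + " large payments, average " + avg);
    }

    public static class Payment implements java.io.Serializable {
        public double amount;
    }
}
```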

The next phase of the evolution was realizing you could do a lot more than just aggregation or querying on that data. If you're running code on the cluster, you can take advantage of data locality, meaning that the code is running together with the data. Also, if you make the code data-aware, you can make sure that it's running in the exact partition and server where a particular subset of data is. That allows you to do super fast computations.
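
A minimal sketch of that data-aware compute using Hazelcast's EntryProcessor; the accounts map and the 2% interest rule are invented, but the pattern is the point: the code ships to the member that owns the key and runs next to the data:

```java
import java.util.Map;
import com.hazelcast.map.EntryProcessor;

// Runs on the cluster member that owns the key, not on the client.
public class ApplyInterest implements EntryProcessor<String, Double, Double> {
    @Override
    public Double process(Map.Entry<String, Double> entry) {
        double updated = entry.getValue() * 1.02; // illustrative 2% interest rule
        entry.setValue(updated);                  // updated in place, no round trip
        return updated;
    }
}

// Usage, assuming hz is an existing HazelcastInstance:
// IMap<String, Double> accounts = hz.getMap("accounts");
// Double balance = accounts.executeOnKey("acct-1", new ApplyInterest());
```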

Now, we're able to not only read and write the data or do basic queries, but we can even do relatively complex computations or machine learning on the data. So that's where the evolution of data grids moved to, but we've taken it a whole step further and said low-latency compute isn't just about atomic operations: what about the fact that data is getting continuously created every moment? You can think about things like the Internet of Things, or even just the wide adoption of mobile devices and the fact that people are doing more and more on their phones.

The amount of data that's getting created has exploded. The velocity of that data has exploded. It's more data, and it's being created all the time. Being able to sense that data and respond instantaneously is another advantage that in-memory computing can bring. We've taken this to the level of low-latency stream processing, or streaming analytics, as it's sometimes called. Nowadays, a lot of data is distributed through an event store like Kafka or Apache Pulsar. Being able to immediately react to these events as the data is being created is a unique capability that in-memory computing can deliver. That really opens up a whole new range of use cases.
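
As a hedged sketch of reacting to events as they arrive, here is a small Hazelcast (5.x) pipeline reading from a hypothetical Kafka payments topic; the filter rule is a toy stand-in for real fraud logic:

```java
import java.util.Properties;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.jet.kafka.KafkaSources;
import com.hazelcast.jet.pipeline.Pipeline;
import com.hazelcast.jet.pipeline.Sinks;

public class StreamingSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092");
        props.setProperty("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.setProperty("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        Pipeline pipeline = Pipeline.create();
        pipeline.readFrom(KafkaSources.<String, String>kafka(props, "payments"))
                .withIngestionTimestamps()
                .filter(e -> e.getValue().contains("\"flagged\":true")) // toy rule
                .writeTo(Sinks.logger()); // stand-in for a real alerting sink

        HazelcastInstance hz = Hazelcast.newHazelcastInstance();
        hz.getJet().newJob(pipeline); // processes each event as it arrives
    }
}
```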

Thomas Betts: That's quite the journey. I like the "here's where we started and here's how it moved along," because it is clearly more than caching. I like the idea of getting in there and doing that real-time analysis.

Real-time, continuous intelligence vs. batch-based ETL [05:24]

Thomas Betts: One of the trends we've seen over the last few years is companies moving away from traditional ETL (extract, transform, and load, to unpack the acronym): less batch processing, more data streaming. We also have, as you described, IoT and mobile devices. But even just within your network, you have a distributed system with lots of microservices sending messages around. Being able to watch that in real time, as opposed to waiting for a nightly job: what advantages are you seeing of being able to analyze that data in real time, versus having to wait for a traditional batch process?

John DesJardins: One of the advantages, now that we can actually respond in the moment data is created, is that we're able to start identifying patterns in real time. This was often applied to looking for things such as fraud within payments, or to identifying risks within securities trading, or maybe undesired trading behavior.

Those were some of the use cases, and then of course in IoT, you have sensor data. Being able to instantaneously respond to that sensor data was another big use case. But we've built out this capability to be even lower latency; Hazelcast is the only streaming analytics platform that was built on top of a data grid. You have technologies like Apache Spark, which we had at Cloudera when I was there. Spark is more optimized for batch, or even micro-batch and fast-batch processing, but it's not necessarily optimized for continuous intelligence.

When you get into a low latency, continuous processing paradigm, now you can start to actually think about changing the behavior of applications to make them more intelligent. For example, you could be looking at credit card data. Not only are you looking for fraud, but maybe we could even start to think about real-time offers. Somebody is at checkout, but what if you could push out an offer and say, by the way, you're at Best Buy and you're buying a TV, but did you know there's a sale on X, Y, and Z?

Well, at that moment of checkout, you're still in the store and you might be able to add something. In the same way, in an online retail store, it's even easier to identify what someone is clicking on and which products they're searching for. You're analyzing every single click on a phone, browser, or tablet, and you're able to say, this guy is looking at monitors, and he might need a monitor arm. It used to be that you would often get offers afterwards, or sometimes you'd even get those creepy ads that would follow you after you bought something: "Hey, do you need a monitor arm?" You're like, I just bought the monitor, and I bought a cheap monitor arm. If you'd told me earlier, you could have given me a deal on a fancy monitor arm, and I might have done that. But now you're making me an offer after the fact.

It's kind of like the old days at CVS. I don't even know if they still do this, which shows you how often I go out and shop in person these days. You'd get that long printout of coupons after you paid, and half of them were for stuff you just bought. You're like, I would have liked to have had the coupons before, and not had to save them up for weeks and come back.

So real-time offers are something where we're seeing a big uptick in banking, retail, and even travel, as we're getting back to traveling. Real-time personalization is a great application of this technology. Then of course, sensing data in real time and responding at low latency is of particular value for things like industrial monitoring. One of the things about Hazelcast is that our platform runs anywhere, which means we have customers running it inside industrial facilities, so that they're actually on that local area network and therefore able to respond within the milliseconds or microseconds that could prevent a catastrophic failure, or maybe dial things down to maximize the life of a component before it needs maintenance.

That whole low-latency, real-time use case just opens up so many more ways that you can make an application behave differently. Traditional data science is: you analyze a lot of the data, you create a model, and you can then apply it and use it to predict behavior. Then if you get the model really dialed in, maybe you hire a developer to take that model and create an optimized version of that algorithm. It used to be a very custom effort to take machine learning and predictive analytics and embed them into an application so that they could really make that application more intelligent.

With this in-memory computing and stream processing, you're able to do that more naturally and repeatedly. That's what has opened up. Some people think about it as that moment or window of opportunity: the narrower the window is to make an offer or take an action to prevent a problem, the more these technologies come into play and add value.

Examples from retail businesses [10:53]

Thomas Betts: I think you said something about the new paradigm, and that's one of the things we don't see too often: things have to shift, and it's going to take a little while for companies to adopt. We now have the technology that can help us do this; how can we change our business? What were we not able to do before? Going back to the retail scenario, a lot of people would read the receipts and say, "Oh, here's what we sold last week," or, "These aren't selling, we need to put these on sale to find ways to entice buyers," but it's always that slow feedback loop. And so you're talking about making that feedback much quicker.

John DesJardins: Retail has all kinds of nuances to how you can apply that. Oh, we can do a real-time offer, but hey, why don't we also analyze what inventory we maybe have too much of? That can vary from time to time. Making the real-time offer of the right product for the customer, but maybe also a product that we'd like to offer at a lower price to move that inventory.

Then even when you get into the fulfillment side, real-time analytics can mean that I'm going to make sure I deliver that product from a location that has a high stock level, or that is close to a distribution center. You can build algorithms to find the best place to source the order and ship from. There are so many different nuances to real time.

One of the biggest use cases in retail is actually just making sure the product is available to promise for the order. Think about a Black Friday, right? Lots of people are putting stuff into shopping carts. Some of it is clothing or other things that have size and color. You might have stock, but you may not have the particular color and size of a product, because someone else just put it in their cart.

You don't want to make the mistake of committing to an order and then having to back it out and contact the customer to tell them it's not available. That can be very costly, in terms of lost revenue as well as a bad experience. At the same time, when millions of people are on your site putting things in different shopping carts, there's a tremendous number of very simple calculations that have to be done to accurately keep track of what's available in your inventory.
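
As an illustrative sketch of the kind of simple, high-volume calculation involved, here is an atomic check-and-decrement using Hazelcast's EntryProcessor; the stock map and SKU key are invented. Because it executes on the member that owns the key, two carts can't both claim the last item:

```java
import java.util.Map;
import com.hazelcast.map.EntryProcessor;

// Atomic "available to promise" check: executes on the owning member.
public class ReserveOne implements EntryProcessor<String, Integer, Boolean> {
    @Override
    public Boolean process(Map.Entry<String, Integer> entry) {
        Integer available = entry.getValue();
        if (available == null || available <= 0) {
            return false;              // can't promise this item
        }
        entry.setValue(available - 1); // reserve one unit
        return true;
    }
}

// Usage, assuming hz is an existing HazelcastInstance:
// IMap<String, Integer> stock = hz.getMap("stock");
// boolean reserved = stock.executeOnKey("shirt-blue-M", new ReserveOne());
```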

Thomas Betts: And it's always happening 24/7 now. It used to be that your store was open for 8, 10, 12 hours a day, and you could predict, oh, it's going to be busy on these days. But now people are shopping around the clock. You have spikes during the day, but you don't have the lulls, so how does that change things, especially on major shopping days? That's opened up a lot of the business scenarios. Clearly there's a lot for retail, and I'm sure other examples we could talk about for a while. What are the obstacles that people are facing? What's getting in the way of everyone doing this right now? Is it a technology problem? Is it a business problem? What are you seeing there?

John DesJardins: That's a great question. I think one of the inhibitors is people just becoming aware of what's possible with the latest technology. Then I think another inhibitor is this expectation that it's going to be difficult. That's particularly true of traditional retailers, who may have a lot of investment in their existing infrastructure, and may have what some people would call technical debt, which prevents them from adopting technology as quickly and nimbly as some of the purely online players.

You could see an Amazon or a Wayfair being more likely to adopt these kinds of technologies quickly, but that's now shifting with COVID, and brick-and-mortar and traditional retailers are all realizing that they have to respond to these disruptions. You're starting to see that shift where people really want to learn about these technologies and how they can help them be more competitive.

The other barrier is sometimes just getting the right infrastructure in place. This is where cloud comes into play: the ability to quickly spin the technology up in the cloud, often leveraging Kubernetes, and the fact that you can leverage container-based deployment models not only in the cloud, but at the edge. If you want to do buy-online-pickup-in-store and have real-time offers, making that technology work both in store and in the cloud used to be challenging, but now it's possible to just deploy things using containers and modern CI/CD DevOps tooling. That allows you to not only deploy new technologies more quickly, but also make changes, because part of the value of this newer approach of microservices, real-time processing, and real-time machine learning is being able to fine-tune them. You deploy a machine learning model.

The real power is if you could deploy two versions of that model and do some A/B testing to see which one actually gets a better conversion rate. Or maybe you deploy a model, then try deploying another one every two weeks to see if you're improving the conversion rate, and you can roll them back. That kind of agility that's possible today is really helping to pick up the pace a bit.

Another thing that we're investing in is working more with data science vendors, particularly our partnership with IBM and their Watson suite, and also our partnerships with the cloud providers. We're trying to make it easier for you to bring your machine learning and your Python into real-time streaming and run it on a low-latency platform. That's another barrier we're working to help people overcome, because as you shorten the cycle from a data scientist creating a new machine learning algorithm or model to getting it into production and finding out how well it's working, you open up a lot of possibilities for doing a lot more.

Thomas Betts: Is there also the possibility of using some of those models that are being made public? You can download models for image recognition, and I think I've heard of some for fraud analysis, so you might be able to find something, just throw it at your data, and see if it works. That's another way to get faster.

John DesJardins: Yes. In fact, most of the machine learning platforms are starting to offer automated ML, which is really the ability to use predefined, proven techniques. You just tell it what type of data you have, and it will throw a bunch of different models at it and figure out which one works best. So there's more and more tooling to make this a lot easier to adopt.

How to get started with Hazelcast for data stream processing [17:32]

Thomas Betts: Yeah. I think you've now mentioned a few different ways you can get started. If someone wants to get started with this, what's the easiest way? Are they going to host it themselves, or are they going to look at a managed service offering from one of the cloud vendors? Let's say they already have Kafka set up, for example; what does it take to get started from there?

John DesJardins: Yeah. It's great if you already have Kafka, or if you're using Pulsar or even other types of messaging technologies, such as JMS-based systems. If you're already getting the data and you're set up, you can basically get Hazelcast as a managed service on all three clouds.

We're also in the Confluent Hub connector marketplace, and we're available in the marketplaces of the cloud providers. But a lot of people are just using Kubernetes, so it may be as easy as using either Helm or an operator. We have both, which will allow you to spin up Hazelcast very quickly.

Hazelcast is available as open source as well as enterprise, which offers a lot more features around zero downtime, operational excellence, security, and things like that. But you could get started and have Hazelcast up and running within minutes. We've got connectors to all of the different places where you might want to get real-time data, like Kafka, so we're able to work very quickly. We've got partnerships with Mongo, DataStax, and Confluent, as well as all those cloud providers, so we're situated within an ecosystem that will help you jump in quickly. There are a lot of examples on our website as well, and tutorials. We've even got prebuilt solution demos that help you solve some of the top use cases.

Thinking differently when working with continuous, time-series data [19:09]

Thomas Betts: Pulling on that: you said there are demos out there; what are you seeing as best practices? If someone wants to do reporting and they've got their data into your system, how do they get it out? How do you do real-time data analysis? Is it just Python? Is it something more advanced than what you've done before? Do you have to think about it differently than when it was a static batch dataset?

John DesJardins: Well, you do have to think differently when you think about event stream processing or streaming analytics, because the data you're trying to respond to is in motion. You need to start thinking about: what is the window of time that I'm trying to analyze over and respond within? If I'm analyzing data related to credit cards and trying to catch fraud, my window of time to respond might be milliseconds.

In fact, for a lot of the card processing companies out there, their window for authorization and fraud detection is something like 40 to 50 milliseconds, which does change how you think about the logic and how you express it. With regard to the languages, we can execute Python and other major languages like C# or C++ within the platform.

Then we also have our core pipeline API, which is written in Java, and we can also express that logic in SQL. For a lot of people, SQL is the easiest to adopt, being fairly familiar, but there are a lot of people out there who are more comfortable with Java, and so that works pretty well, too. You have a number of different ways to approach it from a language perspective, and we're constantly looking to enhance the APIs to provide more ways to adopt the technology.
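
For a flavor of the SQL route, a hedged sketch using Hazelcast's SQL service from Java; the trades map and the threshold are invented, and depending on the version, you may first need a CREATE MAPPING statement:

```java
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.sql.SqlResult;
import com.hazelcast.sql.SqlRow;

public class SqlSketch {
    public static void main(String[] args) {
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();

        // The query executes on the cluster; only the results come back.
        try (SqlResult result = hz.getSql().execute(
                "SELECT __key, this FROM trades WHERE this > 1000")) {
            for (SqlRow row : result) {
                System.out.println(row.getObject(0) + " -> " + row.getObject(1));
            }
        }
    }
}
```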

Thomas Betts: If you're accessing it via the APIs, it's still running close to the metal, so there isn't a whole lot of network transit, because, as you say, you can run a query in process. That seems like the closest comparison, but you're still keeping the processing close to the data.

John DesJardins: Yeah, absolutely. It's not the same as running SQL and retrieving the data back; you're running the SQL on the cluster. In the traditional database world, it would be kind of like a stored procedure or a user-defined function. But you're basically wanting to think about tiny windows of data, and enriching that with historical data as well. You've got what's happening now and what's been happening over a period of time: what the customer is doing right at this moment, and then what they have been doing over recent years that we know might be of interest. If they're looking at media and you're Comcast or Netflix, then you're thinking about what shows they like and what shows you should recommend to them. It can vary depending on the use case.

But you have the window of data that you want to respond within, and then you have broader windows of data where you're looking at related insights. For example, with stock trading, you might be looking at a moving average, and you might have several moving averages: a one-hour moving average will tell you if the stock market is crashing, but you also want to look at a one-month or six-month moving average, maybe to see the trend for a particular security and think about a longer-term bet. Those are examples of the windows you're looking at with data in a stream processing context.
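
A hedged sketch of such a moving average in the pipeline API: a one-hour sliding window whose average is recomputed every minute. The trades topic and the price parsing are invented for the example:

```java
import java.util.Properties;
import com.hazelcast.jet.aggregate.AggregateOperations;
import com.hazelcast.jet.kafka.KafkaSources;
import com.hazelcast.jet.pipeline.Pipeline;
import com.hazelcast.jet.pipeline.Sinks;
import com.hazelcast.jet.pipeline.WindowDefinition;

public class MovingAverageSketch {
    static Pipeline build(Properties kafkaProps) {
        Pipeline p = Pipeline.create();
        p.readFrom(KafkaSources.<String, String>kafka(kafkaProps, "trades"))
         .withIngestionTimestamps()
         // One-hour window, recomputed every 60 seconds.
         .window(WindowDefinition.sliding(60 * 60 * 1000L, 60 * 1000L))
         .aggregate(AggregateOperations.averagingDouble(
                 e -> Double.parseDouble(e.getValue()))) // assumed price payload
         .writeTo(Sinks.logger()); // emits one average per window slide
        return p;
    }
}
```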

Thomas Betts: I'm glad you made the comparison to stored procedures or user-defined functions, because in my head, that's what I was thinking, and I didn't want to incorrectly apply that metaphor: oh, I write a stored proc to run the thing on the server. I think the inclination when a developer adopts some new technology is to bring along how you've done processing in the past. You can do things wrong, applying tools where they're not suited, but there's still that idea that you can write in SQL, you can write in Java; what you need to change is how you think about the time-series data.

John DesJardins: Exactly. With event streaming, you're thinking a lot about time series and sequences. If this happens and then that happens, that means something different than if it's the reverse. And so there is this whole aspect of looking at events and understanding the patterns, the sequence, the timing, and what they mean. That's the different thinking that you need. Because you're running on the platform, and because it's both in memory and has data locality, you're going to get the performance, especially with Hazelcast; that's almost a given. It's really about thinking differently about how you write queries.

Future projections [23:48]

Thomas Betts: What's the current state of this continuous intelligence idea and where do you see it going in the next three to five years?

John DesJardins: Yeah, it's a good question. I think today you still need to think a lot about your data. You've got the technical code, you've got the data level, and then the understanding of the business context, such as: am I thinking about someone on a retail site, or am I thinking about sensors in a factory?

You have to bring all of these different things together to make this work well. What we're seeing is a move towards a combination of things like AutoML, combined with more frameworks and accelerators to help you solve use cases better and more easily. I think in the future, it's going to require fewer technical people. There may still be a little bit of technical work to get the thing up and running, and maybe someone technical to watch it and troubleshoot if it breaks, but a lot of the logic and the functionality, I think, will be easier to define. That's where things are headed.

I think the other thing we're seeing is that everybody's trying to make these platforms not only able to deliver automatic processing of data to create intelligent applications; they want the platforms themselves to be intelligent. Scaling up and down, dealing with failures and resilience, tuning performance, identifying bottleneck code: all of that is going to become more autonomous as well. The platforms themselves are moving towards self-tuning and self-management, as opposed to you having to understand what's going on under the covers.

Thomas Betts: Definitely sounds like a great roadmap. This has been a really good discussion, I learned a lot, but unfortunately we're out of time. So John, where can people go to find out more about you or Hazelcast?

John DesJardins: Just come to hazelcast.com, where you can find all kinds of information about customer use cases. You can click Get Started and get tutorials and access to the software very quickly. So yeah, come check us out; we're excited to help you. I think there's a lot you can do just by landing on our website and getting started on your own.

Thomas Betts: Great. Thanks again to John DesJardins for joining me today.

John DesJardins: Thank you for having me. It's been a pleasure. Great questions and great discussion, and look forward to the next time.

Thomas Betts: To our listeners, I hope you'll join us again soon for another InfoQ podcast.

About the Guest

John DesJardins is Chief Technology Officer at Hazelcast, where he promotes product innovation and champions adoption of our ultra-fast in-memory computing platform. His expertise in large-scale computing spans Machine Learning, Microservices, Big Data, Internet of Things, and Cloud. He is an active writer and speaker. John brings over 25 years of experience in architecting and implementing global scale computing solutions with top Global 2000 companies while at Hazelcast, Cloudera, Software AG, and webMethods. He holds a BS in Economics from George Mason University, where he first built predictive models, long before that was considered cool.
