Sure, I’m a Senior Software Architect at VMware with a specialty in the data management products that VMware now sells.
Sure, big data is really all about finding opportunities that you didn’t know you had buried somewhere in your historical data. Fast data, on the other hand, is all about being able to take advantage of opportunities before they get away.
Well, the real pattern that we see emerging is that if you’re able to dig around in that big data and find some patterns that yield good results, and then codify rules into your interaction with the customer so that you can steer the customer a little bit towards those good results, then oftentimes that combination working together really does yield much better overall throughput, sales, or whatever you are looking for.
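To make that concrete, here is a minimal sketch of what one of those codified rules might look like in the live transaction path. The rule itself, the product categories, and the seven-day threshold are all hypothetical stand-ins for whatever an offline analysis might actually surface:

```java
import java.util.Map;

// Hypothetical rule codified from offline analysis of historical data:
// if a customer bought a phone in the last 7 days and is now buying an
// accessory, steer them toward a bundle discount at checkout.
public class BundleOfferRule {

    // Threshold discovered by the analytics side (illustrative value only).
    private static final int RECENT_DAYS = 7;

    /**
     * @param currentCategory category of the item in the cart right now
     * @param daysSinceLastPurchaseByCategory summary pulled from the fast-data tier
     * @return a discount rate to offer, or 0.0 for no offer
     */
    public double evaluate(String currentCategory,
                           Map<String, Integer> daysSinceLastPurchaseByCategory) {
        Integer daysSincePhone = daysSinceLastPurchaseByCategory.get("phone");
        boolean recentPhone = daysSincePhone != null && daysSincePhone <= RECENT_DAYS;
        if (recentPhone && "accessory".equals(currentCategory)) {
            return 0.10;  // 10% bundle discount, nudging toward the good outcome
        }
        return 0.0;
    }
}
```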
4. Any other design patterns or best practices that are useful in this context?
Well, of course the other one that’s pretty critical has to do with the overall architecture of such a system: this has to be a tiered sort of data architecture. You would not want to keep everything in memory, and you would not move petabytes’ worth of stuff into memory; on the other hand, disk isn’t fast enough to do extreme transactions.
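A rough sketch of that tiering, assuming a hypothetical `TieredStore` where hot keys are served from RAM and everything else is pulled up lazily from a slower disk-backed tier (the class and method names are illustrative, not any product’s API):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Illustrative tiered read path: hot data lives in RAM, everything else
// stays on the slower disk-backed tier and is pulled up only on demand.
public class TieredStore<K, V> {

    private final Map<K, V> memoryTier = new ConcurrentHashMap<>();
    private final Function<K, V> diskTier;   // stand-in for a disk/warehouse lookup

    public TieredStore(Function<K, V> diskTier) {
        this.diskTier = diskTier;
    }

    public V get(K key) {
        // Serve extreme-transaction reads from memory when we can;
        // eviction policies (not shown) keep the memory tier from
        // growing to petabyte scale.
        return memoryTier.computeIfAbsent(key, diskTier);
    }
}
```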
I think the first thing to recognize is that since big data is about finding opportunities that you didn’t know you had, that means you don’t know what that opportunity is or where it is hidden. The mistake that people have been making for many, many years is that they have been filtering the data upfront, throwing away what ends up being the important stuff without knowing it.
So, the most important thing probably today about the whole big data movement is that if you save it all, you’ll be able to go ask questions of it later; if you throw it away at the beginning, it’s gone forever.
Sure, there are really, I think, four parts to that. The first, of course, is the place where you capture the transactions in the first place, and that’s obviously going to be where you use RAM, you know, real memory; and the strategy that we use there is actually to have that transaction captured into, and copied into, memory on two separate servers, so that’s the way we get the durability and the high availability right up front.
Getting it written to disk is something that we do lazily behind the scenes, outside the bounds of the transaction; and then we feed it into a data warehouse, a Greenplum appliance to be specific, and that then becomes the place where we do the analytic processing.
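A very simplified sketch of that capture pattern: acknowledge a write only once it sits in memory on two replicas, and drain it to disk asynchronously off the hot path. The classes here are illustrative stand-ins, not the actual GemFire implementation:

```java
import java.util.Map;
import java.util.concurrent.*;

// Illustrative capture path: a write is acknowledged only after it lands in
// memory on two replicas; the disk write happens later, off the hot path.
public class RedundantCapture {

    private final Map<String, String> primary = new ConcurrentHashMap<>();
    private final Map<String, String> secondary = new ConcurrentHashMap<>(); // stands in for the second server
    private final BlockingQueue<Map.Entry<String, String>> writeBehind = new LinkedBlockingQueue<>();
    private final ExecutorService diskWriter = Executors.newSingleThreadExecutor();

    public RedundantCapture() {
        // Lazily drain captured entries to the disk/warehouse tier.
        diskWriter.submit(() -> {
            try {
                while (true) {
                    Map.Entry<String, String> e = writeBehind.take();
                    writeToDisk(e.getKey(), e.getValue());
                }
            } catch (InterruptedException ignored) {
                Thread.currentThread().interrupt();
            }
        });
    }

    public void put(String key, String value) {
        primary.put(key, value);
        secondary.put(key, value);                  // second copy gives HA and durability up front
        writeBehind.offer(Map.entry(key, value));   // outside the transaction's critical path
    }

    private void writeToDisk(String key, String value) {
        // placeholder for the lazy persistence / warehouse feed
    }
}
```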
Now in that world, what happens is that you first put it in as row-based data so that you can query against it rapidly but still make changes to it easily. Then, once it’s aged out a little bit and you’re not going to be changing it so much anymore, we can convert it into column-based storage, which compresses it significantly and gives even faster queries for some classes of query. And then from there, the next stage is actually to really compress it in addition to having it be columnar, so you have kind of a four-tiered strategy there.
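As an illustration of those warehouse tiers, here is some Greenplum-style DDL issued over JDBC: a row-oriented table for recent, still-changing data, a column-oriented one for aged data, and a compressed columnar one for the oldest data (the in-memory tier described above is the fourth). Table names, connection details, and the exact storage options are assumptions and may vary by Greenplum release:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Sketch of the aging tiers inside the warehouse: row-oriented for hot data,
// columnar for aged data, compressed columnar for the oldest data.
public class TieredWarehouseTables {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:postgresql://greenplum-host:5432/analytics", "gpadmin", "secret");
             Statement stmt = conn.createStatement()) {

            // Tier 1 (in the warehouse): recent, still-changing data in row-oriented heap storage.
            stmt.execute("CREATE TABLE trades_recent (id bigint, symbol text, qty int, px numeric)");

            // Tier 2: aged-out data in column-oriented storage for faster scans.
            stmt.execute("CREATE TABLE trades_aged (LIKE trades_recent) " +
                         "WITH (appendonly=true, orientation=column)");

            // Tier 3: old data, columnar plus compression.
            stmt.execute("CREATE TABLE trades_archive (LIKE trades_recent) " +
                         "WITH (appendonly=true, orientation=column, compresstype=zlib, compresslevel=5)");
        }
    }
}
```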
In the financial world, it’s clearly all about risk management; that’s where we’ve been spending our money these days; you know, all of the regulations around data retention and being able to reproduce any report that was ever shown to anybody, that’s where the big data is in the financial services. But what we’re seeing is some really interesting use cases in things like mobile technology, so your cell phone is always telling your carrier where you are; they know to within five meters what your location is at any given moment.
So they are beginning to be able to do some interesting things with that knowledge. If they can farm that knowledge to understand your normal behaviors, “I go to work every day at 9; I take the same train every day,” that kind of thing, then they can notice when something is deviating from the norm, and at the point when something deviates, they may be able to offer you some help of some sort. For instance, if you have deviated and they detect that you’re now not on a train route but rather on an automobile route, maybe they can actually start up the navigator for you so that you get some help getting to where you’re going.
Yes, there’s another great mobile use case that’s going to show itself in the next couple of weeks. Because, again, they can detect where the cell phones are, they can also detect crowding as a result; and with the Olympics, there are worries about a lot of crowding at places like the tube stations and so forth.
Actually, there are a couple of different things about NoSQL that are interesting. First, a lot of the NoSQL databases, the ones that are derivatives of Dynamo, are really all about being always available for writes but hardly ever used for a read-modify-write kind of transaction.
There certainly are a lot of those, your shopping cart in your web browser is certainly one of them, but there’s a lot more traditional data use still in the world where a NoSQL database more like GemFire makes more sense, because it was designed from the ground up to be able to deal with heavy amounts of read-modify-write transactions.
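For example, a read-modify-write transaction against an in-memory region might look roughly like this, assuming the vFabric GemFire Java API of that era (class and package names changed in later releases, so treat this as a sketch rather than a recipe):

```java
import com.gemstone.gemfire.cache.Cache;
import com.gemstone.gemfire.cache.CacheFactory;
import com.gemstone.gemfire.cache.CacheTransactionManager;
import com.gemstone.gemfire.cache.Region;
import com.gemstone.gemfire.cache.RegionShortcut;

// Sketch of a read-modify-write transaction against an in-memory region.
public class DebitExample {
    public static void main(String[] args) {
        Cache cache = new CacheFactory().create();
        Region<String, Integer> accounts =
            cache.<String, Integer>createRegionFactory(RegionShortcut.REPLICATE).create("accounts");
        accounts.put("acct-42", 500);

        CacheTransactionManager txMgr = cache.getCacheTransactionManager();
        txMgr.begin();
        try {
            Integer balance = accounts.get("acct-42");   // read
            accounts.put("acct-42", balance - 100);      // modify and write
            txMgr.commit();
        } catch (RuntimeException e) {
            txMgr.rollback();                            // e.g. on a commit conflict
            throw e;
        }
        cache.close();
    }
}
```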
GemFire is the object-oriented version of the product, so it’s a true NoSQL kind of store. SQLFire is really a relational database with a heavy memory orientation; it’s basically built on top of the same GemFire architecture, but it has a SQL interface, standard SQL.
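A minimal sketch of what that looks like from the application side: plain JDBC and standard SQL. The driver class name and connection URL shown are assumptions based on the SQLFire thin client of that time, so check the product documentation for your release:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Talking to SQLFire over plain JDBC: standard SQL on the outside, the
// memory-oriented GemFire architecture underneath.
public class SqlFireExample {
    public static void main(String[] args) throws Exception {
        // Driver class and URL form are assumptions for this sketch.
        Class.forName("com.vmware.sqlfire.jdbc.ClientDriver");
        try (Connection conn = DriverManager.getConnection("jdbc:sqlfire://localhost:1527/");
             Statement stmt = conn.createStatement()) {

            stmt.execute("CREATE TABLE orders (id INT PRIMARY KEY, symbol VARCHAR(10), qty INT)");
            stmt.execute("INSERT INTO orders VALUES (1, 'VMW', 100)");

            try (ResultSet rs = stmt.executeQuery("SELECT symbol, qty FROM orders WHERE id = 1")) {
                while (rs.next()) {
                    System.out.println(rs.getString("symbol") + " x " + rs.getInt("qty"));
                }
            }
        }
    }
}
```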
11. What are some emerging trends in big data as well as in-memory databases? What’s coming up?
We have to think about that a little bit. I think that fast data is getting bigger and bigger; certainly we are being pushed to hold more and more data in memory. We’re starting to reach into the terabyte kind of scale, and we’re looking at probably a ten-terabyte kind of scale by next year. At the same time, the big data is getting up into the petabyte scale, so there doesn’t seem to be any end to how much data we’re going to have to mine.
Well, actually there’s a lot of work going on in the data management group within VMware. We’ve only talked a little bit about GemFire and SQLFire; we’ve not even mentioned our Postgres implementation, which gives you an implementation of Postgres that runs better on virtual than it does on physical. I’ve not talked at all about the Database-as-a-Service work that we’re doing with our Data Director product. And you know, these are some pretty exciting areas of change within VMware.
Yes, so I think the killer use case for Database as a Service is going to be that poor QA guy who has already got six databases because he is QA’ing six different versions of the software at different stages in their lifetimes, and now a patch fix comes through and he has to put up yet another database to be able to get the patch fix tested.
14. So it mainly helps with the continuous integration and quality assurance testing?
Yes, so he’ll be able to just go to a web-based console now and say, “I need another database; I need it to be a clone of that database over there. Oh, and by the way, don’t actually copy the data; make it a copy-on-write sort of thing using VMware’s linked clone capability. I’d like it to be backed up every night at such-and-such a time,” with all the policies around it. “Oh, and by the way, I’d like it to go away in 30 days.”
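Purely as an illustration of the parameters being described, and not Data Director’s actual API, such a request might carry something like the following:

```java
// Hypothetical shape of the QA engineer's provisioning request; the field
// names and values are illustrative only.
public class DatabaseCloneRequest {
    String sourceDatabase = "orders-qa-5";   // "a clone of that database over there"
    boolean copyOnWrite   = true;            // linked clone, don't actually copy the data
    String backupSchedule = "daily 02:00";   // backed up every night
    int expireAfterDays   = 30;              // go away in 30 days

    @Override
    public String toString() {
        return "clone of " + sourceDatabase
             + (copyOnWrite ? " (copy-on-write)" : "")
             + ", backup " + backupSchedule
             + ", expires in " + expireAfterDays + " days";
    }
}
```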
Srini: So it makes it easy for them to provision and deprovision the database instances?
Mike: Yes.
The VMware management business unit is actually a separate business unit in its own right at this point. We have tools for monitoring and managing everything, from creating a blueprint of your application, maybe it’s a three-tiered application where you’ve got a web tier, app tier, database tier or whatever, to being able to take that blueprint and deploy it into any cloud, whether it be your own internal cloud or some external cloud.
And then there’s the ability to use Hyperic and application performance monitoring (APM) tools, with visual statistics displays, to see what’s going on with your data at all times. At the end of the day, the important part of what’s going on is actually the data and how it’s being processed.
So what we decided to do is give you an implementation of some basic authentication and some basic authorization as a sort of demonstration of what the system is capable of, but we made it so that it is pluggable at every level. One of the things that we realized early on is that if we were to pick, for instance, JAAS as an implementation that we insisted upon, that might not make all of our customers happy; so being able to plug in whatever it is you’re used to, while having a bulletproof framework around security, seems like the right way to go, and our customers are telling us that, yup, we did make the right decision in that space.
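A rough sketch of that pluggability, using hypothetical interface names rather than the product’s actual security SPI: the product ships a basic implementation, and a customer can swap in a JAAS- or LDAP-backed one behind the same contract:

```java
import java.util.Properties;

// Illustrative pluggable security contracts; not the product's real SPI.
interface Authenticator {
    /** Returns the authenticated principal name, or throws if the credentials are bad. */
    String authenticate(Properties credentials) throws SecurityException;
}

interface AccessController {
    /** Says whether the given principal may perform the operation on the resource. */
    boolean isAllowed(String principal, String operation, String resource);
}

// The basic, out-of-the-box demonstration implementation...
class BasicAuthenticator implements Authenticator {
    @Override
    public String authenticate(Properties credentials) {
        String user = credentials.getProperty("username");
        String pass = credentials.getProperty("password");
        if (user == null || !"secret".equals(pass)) {   // toy check only
            throw new SecurityException("bad credentials for " + user);
        }
        return user;
    }
}
// ...which a customer could replace wholesale, for example with a JAAS-backed
// implementation, as long as it honors the same interface.
```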