Sure, I’m a Senior Software Architect at VMware with a specialty in the data management products that VMware now sells.
Sure, big data is really all about finding opportunities that you didn’t know you had buried somewhere in your historical data. Fast data, on the other hand, is all about being able to take advantage of opportunities before they get away.
Well, the real pattern that we see emerging is that if you’re able to dig around in that big data and find some patterns that yield good results, and then codify rules into your interaction with the customer so that you can steer the customer a little bit towards those good results, then oftentimes that combination working together really does yield much better overall throughput, sales, or whatever you are looking for.
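To make that concrete, here is a minimal sketch of what one of those codified rules might look like in the live transaction path. The rule itself, the product categories, and the seven-day threshold are all hypothetical stand-ins for whatever an offline analysis might actually surface:

```java
import java.util.Map;

// Hypothetical rule codified from offline analysis of historical data:
// if a customer bought a phone in the last 7 days and is now buying an
// accessory, steer them toward a bundle discount at checkout.
public class BundleOfferRule {

    // Threshold discovered by the analytics side (illustrative value only).
    private static final int RECENT_DAYS = 7;

    /**
     * @param currentCategory category of the item in the cart right now
     * @param daysSinceLastPurchaseByCategory summary pulled from the fast-data tier
     * @return a discount rate to offer, or 0.0 for no offer
     */
    public double evaluate(String currentCategory,
                           Map<String, Integer> daysSinceLastPurchaseByCategory) {
        Integer daysSincePhone = daysSinceLastPurchaseByCategory.get("phone");
        boolean recentPhone = daysSincePhone != null && daysSincePhone <= RECENT_DAYS;
        if (recentPhone && "accessory".equals(currentCategory)) {
            return 0.10;  // 10% bundle discount, nudging toward the good outcome
        }
        return 0.0;
    }
}
```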
4. Any other design patterns or best practices that are useful in this context?
Well, of course the other one that’s pretty critical has to do with the overall architecture of such a system: this has to be a tiered sort of data architecture. You would not want to keep everything in memory, and you would not move petabytes’ worth of stuff into memory; on the other hand, disk isn’t fast enough to do extreme transactions.
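A rough sketch of that tiering, assuming a hypothetical `TieredStore` where hot keys are served from RAM and everything else is pulled up lazily from a slower disk-backed tier (the class and method names are illustrative, not any product’s API):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Illustrative tiered read path: hot data lives in RAM, everything else
// stays on the slower disk-backed tier and is pulled up only on demand.
public class TieredStore<K, V> {

    private final Map<K, V> memoryTier = new ConcurrentHashMap<>();
    private final Function<K, V> diskTier;   // stand-in for a disk/warehouse lookup

    public TieredStore(Function<K, V> diskTier) {
        this.diskTier = diskTier;
    }

    public V get(K key) {
        // Serve extreme-transaction reads from memory when we can;
        // eviction policies (not shown) keep the memory tier from
        // growing to petabyte scale.
        return memoryTier.computeIfAbsent(key, diskTier);
    }
}
```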
I think the first thing to recognize is that since big data is about finding opportunities that you didn’t know you had, that means you don’t know what that opportunity is or where it is hidden. The mistake that people have been making for many, many years is that they have been filtering the data upfront, throwing away what ends up being the important stuff without knowing it.
So, the most important thing probably today about the whole big data movement is that if you save it all, you’ll be able to go ask questions of it later; if you throw it away at the beginning, it’s gone forever.
Sure, there are really, I think, four parts to that. The first, of course, is the place where you capture the transactions in the first place, and that’s obviously going to be where you use RAM, you know, real memory; and the strategy that we use there is actually to have that transaction captured into, and copied into, memory on two separate servers, so that’s the way we get the durability and the high availability right up front.
Getting it written to disk is something that we do lazily behind the scenes, outside the bounds of the transaction; and then we feed it into a data warehouse, a Greenplum appliance to be specific, and that then becomes the place where we do the analytic processing.
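A very simplified sketch of that capture pattern: acknowledge a write only once it sits in memory on two replicas, and drain it to disk asynchronously off the hot path. The classes here are illustrative stand-ins, not the actual GemFire implementation:

```java
import java.util.Map;
import java.util.concurrent.*;

// Illustrative capture path: a write is acknowledged only after it lands in
// memory on two replicas; the disk write happens later, off the hot path.
public class RedundantCapture {

    private final Map<String, String> primary = new ConcurrentHashMap<>();
    private final Map<String, String> secondary = new ConcurrentHashMap<>(); // stands in for the second server
    private final BlockingQueue<Map.Entry<String, String>> writeBehind = new LinkedBlockingQueue<>();
    private final ExecutorService diskWriter = Executors.newSingleThreadExecutor();

    public RedundantCapture() {
        // Lazily drain captured entries to the disk/warehouse tier.
        diskWriter.submit(() -> {
            try {
                while (true) {
                    Map.Entry<String, String> e = writeBehind.take();
                    writeToDisk(e.getKey(), e.getValue());
                }
            } catch (InterruptedException ignored) {
                Thread.currentThread().interrupt();
            }
        });
    }

    public void put(String key, String value) {
        primary.put(key, value);
        secondary.put(key, value);                  // second copy gives HA and durability up front
        writeBehind.offer(Map.entry(key, value));   // outside the transaction's critical path
    }

    private void writeToDisk(String key, String value) {
        // placeholder for the lazy persistence / warehouse feed
    }
}
```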
Now in that world, what happens is that you first put it in as row-based data so that you can query against it rapidly but still make changes to it easily. Then, once it’s aged out a little bit and you’re not going to be changing it so much anymore, we can convert it into column-based storage, which compresses it significantly and gives even faster queries for some classes of query. And then from there, the next stage is actually to really compress it in addition to having it be columnar, so you have kind of a four-tiered strategy there.
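As an illustration of those warehouse tiers, here is some Greenplum-style DDL issued over JDBC: a row-oriented table for recent, still-changing data, a column-oriented one for aged data, and a compressed columnar one for the oldest data (the in-memory tier described above is the fourth). Table names, connection details, and the exact storage options are assumptions and may vary by Greenplum release:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Sketch of the aging tiers inside the warehouse: row-oriented for hot data,
// columnar for aged data, compressed columnar for the oldest data.
public class TieredWarehouseTables {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:postgresql://greenplum-host:5432/analytics", "gpadmin", "secret");
             Statement stmt = conn.createStatement()) {

            // Tier 1 (in the warehouse): recent, still-changing data in row-oriented heap storage.
            stmt.execute("CREATE TABLE trades_recent (id bigint, symbol text, qty int, px numeric)");

            // Tier 2: aged-out data in column-oriented storage for faster scans.
            stmt.execute("CREATE TABLE trades_aged (LIKE trades_recent) " +
                         "WITH (appendonly=true, orientation=column)");

            // Tier 3: old data, columnar plus compression.
            stmt.execute("CREATE TABLE trades_archive (LIKE trades_recent) " +
                         "WITH (appendonly=true, orientation=column, compresstype=zlib, compresslevel=5)");
        }
    }
}
```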
In the financial world, it’s clearly all about risk management; that’s where we’ve been spending our money these days; you know, all of the regulations around data retention and being able to reproduce any report that was ever shown to anybody, that’s where the big data is in the financial services. But what we’re seeing is some really interesting use cases in things like mobile technology, so your cell phone is always telling your carrier where you are; they know to within five meters what your location is at any given moment.
So they are beginning to be able to do some interesting things with that knowledge. If they can farm that knowledge to understand your normal behaviors, “I go to work every day at 9; I take the same train every day,” that kind of thing, then they can notice when something is deviating from the norm, and at the point when something deviates, they may be able to offer you some help of some sort. For instance, if you have deviated and they detect that you’re now not on a train route but rather on an automobile route, maybe they can actually start up the navigator for you so that you get some help getting to where you’re going.
Yes, there’s another great mobile use case that’s going to show itself in the next couple of weeks. Because, again, they can detect where the cell phones are, they can also detect crowding as a result; and with the Olympics, there are worries about a lot of crowding at places like the tube stations and so forth.
Actually, there are a couple of different things about NoSQL that are interesting. First, a lot of the NoSQL databases, the ones that are derivatives of Dynamo, are really all about being always available for writes but hardly ever used for a read-modify-write kind of transaction.
There certainly are a lot of those, your shopping cart in your web browser is certainly one of them, but there’s a lot more traditional data use still in the world where a NoSQL database more like GemFire makes more sense, because it was designed from the ground up to be able to deal with heavy amounts of read-modify-write transactions.
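For example, a read-modify-write transaction against an in-memory region might look roughly like this, assuming the vFabric GemFire Java API of that era (class and package names changed in later releases, so treat this as a sketch rather than a recipe):

```java
import com.gemstone.gemfire.cache.Cache;
import com.gemstone.gemfire.cache.CacheFactory;
import com.gemstone.gemfire.cache.CacheTransactionManager;
import com.gemstone.gemfire.cache.Region;
import com.gemstone.gemfire.cache.RegionShortcut;

// Sketch of a read-modify-write transaction against an in-memory region.
public class DebitExample {
    public static void main(String[] args) {
        Cache cache = new CacheFactory().create();
        Region<String, Integer> accounts =
            cache.<String, Integer>createRegionFactory(RegionShortcut.REPLICATE).create("accounts");
        accounts.put("acct-42", 500);

        CacheTransactionManager txMgr = cache.getCacheTransactionManager();
        txMgr.begin();
        try {
            Integer balance = accounts.get("acct-42");   // read
            accounts.put("acct-42", balance - 100);      // modify and write
            txMgr.commit();
        } catch (RuntimeException e) {
            txMgr.rollback();                            // e.g. on a commit conflict
            throw e;
        }
        cache.close();
    }
}
```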
GemFire is the object-oriented version of the product, so it’s a true NoSQL kind of store. SQLFire is really a relational database with a heavy memory orientation; it’s basically built on top of the same GemFire architecture, but it has a SQL interface, standard SQL.
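A minimal sketch of what that looks like from the application side: plain JDBC and standard SQL. The driver class name and connection URL shown are assumptions based on the SQLFire thin client of that time, so check the product documentation for your release:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Talking to SQLFire over plain JDBC: standard SQL on the outside, the
// memory-oriented GemFire architecture underneath.
public class SqlFireExample {
    public static void main(String[] args) throws Exception {
        // Driver class and URL form are assumptions for this sketch.
        Class.forName("com.vmware.sqlfire.jdbc.ClientDriver");
        try (Connection conn = DriverManager.getConnection("jdbc:sqlfire://localhost:1527/");
             Statement stmt = conn.createStatement()) {

            stmt.execute("CREATE TABLE orders (id INT PRIMARY KEY, symbol VARCHAR(10), qty INT)");
            stmt.execute("INSERT INTO orders VALUES (1, 'VMW', 100)");

            try (ResultSet rs = stmt.executeQuery("SELECT symbol, qty FROM orders WHERE id = 1")) {
                while (rs.next()) {
                    System.out.println(rs.getString("symbol") + " x " + rs.getInt("qty"));
                }
            }
        }
    }
}
```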
11. What are some emerging trends in big data as well as in-memory databases? What’s coming up?
We have to think about that a little bit. I think that fast data is getting bigger and bigger; certainly we are being pushed to hold more and more data in memory. We’re starting to reach into the terabyte kind of scale, and we’re looking at probably a ten-terabyte kind of scale by next year. At the same time, the big data is getting up into the petabyte scale, so there doesn’t seem to be any end to how much data we’re going to have to mine.
Well, actually there’s a lot of work going on in the data management group within VMware. We’ve only talked a little bit about GemFire and SQLFire; we’ve not even mentioned our Postgres implementation, which gives you an implementation of Postgres that runs better on virtual than it does on physical. I’ve not talked at all about the Database-as-a-Service work that we’re doing with our Data Director product. And you know, these are some pretty exciting areas of change within VMware.
Yes, so I think the killer use case for Database as a Service is going to be that poor QA guy who has already got six databases because he is QA’ing six different versions of the software at different stages in their lifetimes, and now a patch fix comes through and he has to put up yet another database to be able to get the patch fix tested.
14. So it mainly helps with the continuous integration and quality assurance testing?
Yes, so he’ll be able to just go to a web-based console now and say, “I need another database; I need it to be a clone of that database over there. Oh, and by the way, don’t actually copy the data; make it a copy-on-write sort of thing using VMware’s linked clone capability. I’d like it to be backed up every night at such-and-such a time,” with all the policies around it. “Oh, and by the way, I’d like it to go away in 30 days.”
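Purely as an illustration of the parameters being described, and not Data Director’s actual API, such a request might carry something like the following:

```java
// Hypothetical shape of the QA engineer's provisioning request; the field
// names and values are illustrative only.
public class DatabaseCloneRequest {
    String sourceDatabase = "orders-qa-5";   // "a clone of that database over there"
    boolean copyOnWrite   = true;            // linked clone, don't actually copy the data
    String backupSchedule = "daily 02:00";   // backed up every night
    int expireAfterDays   = 30;              // go away in 30 days

    @Override
    public String toString() {
        return "clone of " + sourceDatabase
             + (copyOnWrite ? " (copy-on-write)" : "")
             + ", backup " + backupSchedule
             + ", expires in " + expireAfterDays + " days";
    }
}
```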
Srini: So it makes it easy for them to provision and deprovision the database instances?
Mike: Yes.
The VMware management business unit is actually a separate business unit in its own right at this point. We have tools for monitoring and managing everything, from creating a blueprint of your application, maybe it’s a three-tiered application where you’ve got a web tier, app tier, database tier or whatever, to being able to take that blueprint and deploy it into any cloud, whether it be your own internal cloud or some external cloud.
And then there’s the ability to use Hyperic and application performance monitoring (APM) tools, with visual statistics displays, to see what’s going on with your data at all times. At the end of the day, the important part of what’s going on is actually the data and how it’s being processed.
So what we decided to do is give you an implementation of some basic authentication and some basic authorization as a sort of demonstration of what the system is capable of, but we made it so that it is pluggable at every level. One of the things that we realized early on is that if we were to pick, for instance, JAAS as an implementation that we insisted upon, that might not make all of our customers happy; so being able to plug in whatever it is you’re used to, while having a bulletproof framework around security, seems like the right way to go, and our customers are telling us that, yup, we did make the right decision in that space.
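A rough sketch of that pluggability, using hypothetical interface names rather than the product’s actual security SPI: the product ships a basic implementation, and a customer can swap in a JAAS- or LDAP-backed one behind the same contract:

```java
import java.util.Properties;

// Illustrative pluggable security contracts; not the product's real SPI.
interface Authenticator {
    /** Returns the authenticated principal name, or throws if the credentials are bad. */
    String authenticate(Properties credentials) throws SecurityException;
}

interface AccessController {
    /** Says whether the given principal may perform the operation on the resource. */
    boolean isAllowed(String principal, String operation, String resource);
}

// The basic, out-of-the-box demonstration implementation...
class BasicAuthenticator implements Authenticator {
    @Override
    public String authenticate(Properties credentials) {
        String user = credentials.getProperty("username");
        String pass = credentials.getProperty("password");
        if (user == null || !"secret".equals(pass)) {   // toy check only
            throw new SecurityException("bad credentials for " + user);
        }
        return user;
    }
}
// ...which a customer could replace wholesale, for example with a JAAS-backed
// implementation, as long as it honors the same interface.
```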