Splunk’s user conference has drawn to a close. After three days and over 160 sessions ranging from security and operations to business intelligence and even the Internet of Things, one central theme kept appearing: the key to Big Data is machine learning.
Storage is no longer an issue. From specialized storage hardware running Hadoop-compatible nodes to commodity hard drives clustered across hundreds of machines, there is no doubt that we can handle any kind of storage problem. On the other side, analysis and visualization tools such as Splunk are well established. If you know what you are looking for, these tools can quickly get you the answers you need.
But what should you be looking for? For the vast majority of vendors on the floor, the answer to that question is machine learning. It doesn’t matter whether you are talking about network traffic, user behavior, or consumer trends; the way to gain real insight into what you are monitoring is to find the patterns and correlations in the data. And while a human operator can stumble across these by trial and error, the vendors believe that a computer can be trained to find them much faster and without bias.
Of course, that isn’t to say that humans are obsolete. Someone has to verify that the correlations are not mere coincidence and figure out a way to act on the information. And that’s where the aforementioned visualization tools come into play.
Primary Use Cases for Big Data and Machine Learning
While the potential of big data is nearly limitless, it is inevitable that one or two industries will lead the charge. Ask me again in a year and I may say something different, but for now my prediction is that either security or operations will be at the forefront.
Every company larger than a cash-only coffee stand needs to think about information security. Even if they have no intellectual property to speak of, they all deal with sensitive information such as credit card numbers. Having ways to reliably detect and stop a breach while it is happening is critical to the long-term success of a company. Security products based on machine learning promise to provide this capability, with ease of use approaching turnkey levels.
In a similar vein, operations analysis is going to be popular. Right now you can buy tools that monitor your network, decode the packets, and show you exactly how a given REST call flows through your middle-tier servers all the way to the database or file system, and then compare that to how it was behaving a week, month, or year ago. This isn’t a future concept; it is something you can buy off the shelf today and have running within a week.
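The compare-against-history idea behind these tools can be reduced to a small sketch: keep a baseline of a metric (say, response latency for one REST endpoint) and flag the current value when it sits too far from the historical mean. The numbers and the three-sigma threshold below are invented for illustration, not taken from any product:

```python
from statistics import mean, stdev


def is_anomalous(history, current, threshold=3.0):
    """Flag `current` if it lies more than `threshold` standard
    deviations from the historical mean (a simple z-score test)."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return current != mu
    return abs(current - mu) / sigma > threshold


# Invented latency samples (ms) for one endpoint over the past week
last_week = [102, 98, 105, 99, 101, 97, 103]

print(is_anomalous(last_week, 104))  # within normal variation
print(is_anomalous(last_week, 250))  # well outside the baseline
```

Commercial products layer seasonality, per-endpoint baselines, and learned thresholds on top of this, but the core comparison of "now" versus "then" is the same.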
Other areas of research will continue, but not at such a rapid rate. Fraud detection is incredibly important, but most companies are going to rely on their financial institutions to design and implement the necessary controls. I don’t expect to see many commercial, off-the-shelf products in this area.
Business intelligence is another area that will see a lot of money spent on research. But the algorithms Coca-Cola and Pepsi need to determine the next popular flavor will look nothing like the ones GM and Ford are using to predict how many vehicles of each size to build. So again, commercial products are probably going to be limited to basic analytics and visualization for the time being.
Other Conference Thoughts
All in all, Splunk put on a great conference. Everything was well organized, and there were sessions for everyone from the complete beginner to the most advanced data-mining engineer. My only complaint is that the sessions weren’t recorded; with so much content, one is bound to miss an important session or two due to conflicts.
Even if you are not interested in Splunk itself, this is an important conference for anyone interested in Big Data, machine learning, and related topics.