A Data Lake is a data store for collecting and processing large volumes of data. Data lakes are often used to gather raw data in its native format before datasets are used for analytics. Data Lake-as-a-Service solutions provide enterprise big data processing in the cloud, promising faster business outcomes in a cost-effective way.
InfoQ spoke with Lovan Chetty, Director of Product Management, and Hannah Smalltree, Director of Product Marketing at Cazena, a Big Data as a Service company, about Data Lake solutions for storing and processing Big Data and how these solutions work.
InfoQ: Can you define the term "Data Lake"?
Chetty & Smalltree: A data lake is a horizontally scalable data store that stores and processes large volumes and/or variety of data. Many data lakes are based on Apache Hadoop, an open-source framework and software ecosystem for storing and processing data. However, data lakes may use other types of software.
Data lakes are often used to collect raw data before datasets move into a production analytic environment, like a data warehouse. The main difference between a data lake and a data warehouse has to do with how data is structured (or not) and stored, which in turn affects load times, pre-processing requirements and analytic performance. A data warehouse is based on relational database technology, which can only store consistent, structured data. A data lake is based on technologies that allow you to store raw data and then incrementally apply structure, as defined by analytic requirements. Data lake characteristics often include fast ingest/write speeds and low-cost storage, as they are designed to manage high-volume, high-velocity raw data (think millions or billions of records per day). Analytic capabilities vary widely across data lakes.
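The schema-on-read idea described above can be sketched in a few lines of Python. This is a minimal illustration with hypothetical sensor records (not from the interview): raw, inconsistently shaped records land in the lake as-is, and structure is applied only at read time, per analytic need.

```python
import json

# Raw events land in the lake as-is; records need not share one schema
# (schema-on-read), unlike a warehouse table that enforces one up front.
raw_events = [
    '{"device": "sensor-1", "temp_c": 21.5, "ts": "2016-01-01T00:00:00"}',
    '{"device": "sensor-2", "ts": "2016-01-01T00:00:05", "error": "E42"}',
    '{"device": "sensor-1", "temp_c": 22.1, "ts": "2016-01-01T00:01:00"}',
]

def read_temperatures(lines):
    """Apply structure at read time: keep only the records that fit
    this particular analytic need (temperature readings)."""
    for line in lines:
        record = json.loads(line)
        if "temp_c" in record:
            yield record["device"], record["temp_c"]

print(list(read_temperatures(raw_events)))
```

A relational warehouse load would have forced all three records into one schema, or rejected the error record, before any query could run; here the error record simply waits in the lake until some other analysis wants it.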
InfoQ: How is it different from other concepts like Database, Data Mart, Data warehouse and Data Cloud?
Chetty & Smalltree: "Database" is a very generic term that refers to software that stores and organizes data, though in modern usage, it's often used as shorthand for a relational database. A data warehouse is a database that's designed to support business intelligence and advanced analytics. A data mart is generally a domain-specific data warehouse, used for a specific type of data, analytic or business process. For example, a marketing data mart might hold only data of interest to that department. "Data cloud" is an emerging term that companies use in different ways, but generally refers to a cloud deployment of any of these styles of databases, often with some kind of integration tooling for data movement as well. Some companies have private, secure data clouds for sharing data with customers, partners or other stakeholders.
InfoQ: How does "Data Lake-as-a-Service" work?
Chetty & Smalltree: "Data Lake-as-a-Service" is a data lake that leverages cloud resources, which are managed and maintained by a vendor "as a service." It's often advantageous to deploy data lakes in the cloud because of easy scalability for large data volumes, inexpensive storage and because raw big data is increasingly generated in the cloud from sources like sensors, mobile apps or social media. However, it can be challenging to learn, install and maintain the complex software used for data lakes in a cloud environment.
A "Data Lake-as-a-Service" provides a pre-built cloud service that abstracts the complexity of the underlying platform and infrastructure layers, so a company can use a data lake without having to install or maintain the technology themselves. This category is emerging, so the specific software and services provided by vendors varies greatly. Common capabilities generally include automated provisioning, scalable data storage, varying levels of analytic functions and a simplified interface for management. Beyond those, Data Lake-as-a-Service providers can differ greatly, with different features catered to different use cases. For example, Cazena's Data Lake-as-a-Service includes end-to-end security, with data movers that make it simple to encrypt, move and load data into the cloud.
InfoQ: What are some typical use cases that are good candidates for a Data Lake-as-a-Service solution?
Chetty & Smalltree: There are many, many ways to use a Data Lake-as-a-Service, but two use cases tend to get a lot of attention.
- Many companies use a Data Lake-as-a-Service to collect and process incoming raw data from cloud, mobile or external sources. For example, manufacturers can collect sensor data in a Data Lake-as-a-Service, so that research and development teams can collect specific information about product usage or common error patterns or operational problems. Some companies create "data pipelines" where they collect raw data in a Data Lake-as-a-Service, then cleanse, filter or query the data to create a valuable subset, which they move into another analytic environment like a data mart in the cloud or a data warehouse on-premises.
- Other companies use a Data Lake-as-a-Service to consolidate large volumes of data for data science or other analytic activities. It is often advantageous to have all data in one place, where it can be combined, queried and analyzed to discover new patterns or insights. Some refer to this as a "sandbox" environment, where analysts can explore data with no impact to other production processes or systems.
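The "data pipeline" pattern from the first use case can be sketched as a short Python function. The records and field names here are hypothetical, chosen only to show the cleanse-filter-subset steps that precede moving data into a downstream mart or warehouse.

```python
# Raw records collected in the lake; some are incomplete.
raw = [
    {"user": "a", "clicks": 12,   "country": "US"},
    {"user": "b", "clicks": None, "country": "DE"},  # incomplete record
    {"user": "c", "clicks": 7,    "country": "US"},
]

def build_subset(records, country):
    """Cleanse (drop incomplete records) and filter (keep one country)
    to produce the valuable subset that would move downstream."""
    return [
        r for r in records
        if r["clicks"] is not None and r["country"] == country
    ]

subset = build_subset(raw, "US")
print(subset)
```

In a real deployment the cleansing and filtering would typically run as queries inside the lake itself, with only the much smaller `subset` loaded into a cloud data mart or an on-premises warehouse.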
InfoQ: What are the advantages and limitations of running data analytics in the data center vs. in the cloud?
Chetty & Smalltree: There's not necessarily one right answer to whether it's "better" to run data analytics in the data center vs. the cloud. The cloud has many benefits, but other factors such as compliance or regulatory requirements may dictate storing and analyzing data on-premises in a data center. The cloud offers elastic scalability, so it's generally fast and easy to add more storage for incoming data or get more compute power for advanced analytics. The cloud also makes sense for the growing volume of big data now generated outside a company's firewall. It's expensive to drag raw big data back to an on-premises data center, then pay for the extensive storage and compute power required for analysis. That's why more analytic processing is moving to the cloud, closer to the big data sets and cloud platforms where storage and compute are significantly cheaper than on-premises.
For the near future, most companies will have hybrid architectures, with some data and analytic processing in the cloud (say, for large raw datasets) and some data and analytic processes kept on-premises (say, for highly regulated data). Given that this is the trend, data movement and integration technologies are critical considerations when choosing a data lake technology or designing hybrid architectures.
InfoQ: Performance and SLA implications of Data Lake-as-a-Service approach?
Chetty & Smalltree: The performance and SLA requirements of a data lake are highly dependent on its role within production data processes and the importance of analytics to a company's success. If the processes are important to the bottom line, it's critical to ensure reliable performance. One problem with data lakes to date has been highly complex software that's difficult to optimize for performance, a problem that can be compounded when installing software on unfamiliar cloud infrastructure. No one wants highly skilled, expensive data scientists and analysts spending time troubleshooting software, or waiting around, when they could be focused on analytics. This is another reason companies are using Data Lakes-as-a-Service, which can often offer more predictable performance and experts to help troubleshoot. Enterprises also report major challenges in working with new cloud vendors that lack an understanding or appreciation of Service Level Agreements (SLAs). SLA monitoring can be very hard to build yourself or to integrate with existing management systems.
InfoQ: What about security and privacy aspects in this area? How is data kept secure during its storage, transfer and usage?
Chetty & Smalltree: Data security in the cloud suffers from a major lack of understanding. In many cases, data is most secure when stored in a cloud provider's sophisticated data center. That said, security in the public cloud is different, with new terms, technologies and rules, which presents a challenge for enterprises mapping existing policies and systems to a new cloud environment. This is changing, with more enterprise-friendly cloud services and increasing emphasis on the importance of cloud security. It's very important to consider data in transit as well as at rest, which is obviously directly related to how data movement to the cloud is architected and managed. Encryption techniques can be used to keep data safe, though it can be challenging to manage end-to-end security. Companies are advised to work closely with their cloud providers' security experts.
They also talked about what to focus on when working on data lake projects.
Chetty & Smalltree: Data lakes have been a hot topic, fraught with differences of opinion, definitional disparity and skepticism. Some companies have found major value in lakes, while others have struggled with the software and processes required for success. In 2014, analyst firm Gartner published a note titled "The Data Lake Fallacy: All Water and Little Substance" and while they seemed to soften their stance by May 2015, those analysts still predicted: "Through 2018, 90% of deployed data lakes will become useless as they are overwhelmed with information assets captured for uncertain use cases." (Defining the Data Lake, May 2015)
Like any data analytics project, it's critical to plan for how data will be used in production to impact the bottom line. Enterprise workers report that this is happening now, as part of the trend of demanding measurable value from big data projects. While early data lake projects focused on hoarding data, with no clear plan on how to use it, today's projects often have very specific goals in mind.
About the Interviewees
Lovan Chetty is responsible for product management of Cazena's Data Mart as a Service and Data Lake as a Service, end-to-end solutions including big data movement and secure cloud processing. Lovan has been involved in data warehousing and big data for over 15 years. He began his career building analytic systems for the Global 2000, later applying this knowledge to build products that reduce the time it takes large enterprises to use data for analytic success.
Hannah Smalltree is a director focusing on product content and education for Cazena, which leads the Big Data as a Service market with secure cloud services designed for enterprises. Hannah held similar product roles for several other big data software companies, regularly authors articles, speaks at events and follows the industry closely. She spent over a decade as a technology journalist, interviewing hundreds of companies about their data and analytics projects.