Data Lake: what it is and why it’s important for modern companies

A Data Lake is a new way of working that simplifies and enhances the storage, management and analysis of Big Data, using data from diverse, heterogeneous sources, in their native format or in a near-exact copy of it.

In essence, a data lake is:

  • A place to store structured and unstructured data;
  • A tool for analysing Big Data;
  • A resource to access, share and correlate data for business activities.

This is a new way of working because the systems historically used to store, process and analyse data are defined and structured according to the intended end use, through a Data Warehouse type architecture.

In a Data Warehouse-type system, raw data is structured and processed through a so-called schema-on-write approach: first the structure of the database that will host the data is defined, then the data is written into that predefined structure and, when retrieved for analysis, it is returned in the predefined format.
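
As a minimal sketch of schema-on-write (using SQLite purely as a stand-in for a warehouse; the table and column names are illustrative), the structure is declared before any data is loaded, and every write and query must conform to it:

```python
import sqlite3

# Schema-on-write: the structure is defined up front, before any data arrives.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE sales (
        order_id   INTEGER PRIMARY KEY,
        customer   TEXT NOT NULL,
        amount_eur REAL NOT NULL,
        order_date TEXT NOT NULL
    )
""")

# Every record written must already fit the predefined structure.
conn.execute(
    "INSERT INTO sales (order_id, customer, amount_eur, order_date) VALUES (?, ?, ?, ?)",
    (1, "ACME", 199.90, "2024-05-01"),
)

# Analysis returns the data in the same predefined format.
for row in conn.execute("SELECT customer, amount_eur FROM sales"):
    print(row)
```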

A Data Lake-type system, on the other hand, adopts a so-called schema-on-read approach: data is acquired in its native format, according to policies that standardise, for each type of data, the methods, times and rules for entering it into the Data Lake. Each element is associated with an identifier and a set of qualifying metadata so that, when the data needs to be accessed in search of a specific result, the Data Lake can be queried and all relevant data returned.
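
A minimal sketch of schema-on-read, using local JSON files as a stand-in for a real lake store (file layout and field names are illustrative): records from different sources are stored as they arrive, together with an identifier and metadata, and a structure is imposed only when a specific question is asked:

```python
import json
import uuid
from pathlib import Path

lake = Path("lake/raw")
lake.mkdir(parents=True, exist_ok=True)

# Ingestion: each record is stored in its native (JSON) form,
# together with an identifier and a set of qualifying metadata.
raw_events = [
    {"source": "crm", "payload": {"customer": "ACME", "segment": "enterprise"}},
    {"source": "weblogs", "payload": {"customer": "ACME", "page": "/pricing", "ms": 412}},
]
for event in raw_events:
    record = {
        "id": str(uuid.uuid4()),
        "metadata": {"source": event["source"]},
        "data": event["payload"],
    }
    (lake / f"{record['id']}.json").write_text(json.dumps(record))

# Schema-on-read: the structure is imposed only at query time,
# driven by the question being asked.
def query(lake_dir, wanted_fields):
    for path in lake_dir.glob("*.json"):
        record = json.loads(path.read_text())
        if wanted_fields.issubset(record["data"]):
            yield {field: record["data"][field] for field in wanted_fields}

print(list(query(lake, {"customer", "page"})))  # only the web-log record matches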

It is the analysis question that determines the selection of data from which to draw information, and the search is not limited to a database set up for that type of analysis, but accesses all available information, regardless of the source that generated it.

What are the benefits of adopting a data lake?

  • Reduced storage costs and virtually unlimited storage space

Managing large volumes of data through database-type systems is expensive and inefficient. The same data set may be replicated several times if the database structure is different for each of the analysis applications used. Different business roles have different analysis needs and look for different insights. A schema-on-write approach forces one to predict in advance all the uses that might be made of the data, but as business goals and needs evolve, analysis requirements evolve with them. Increasing the volume of data collected in a database and constantly updating its structure is an expensive and time-consuming process. The data storage methods typical of a Data Lake-type system, based on distributed file systems (such as HDFS or cloud object storage), make the space available for data storage virtually unlimited.
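
As a minimal sketch of the storage layout this refers to (a local directory stands in for HDFS or cloud object storage; paths and names are illustrative), each event lands once in a date-partitioned raw area, and that single copy can later serve any number of analyses:

```python
from datetime import date
from pathlib import Path

# A local directory stands in for a distributed store such as
# hdfs:///lake/raw/... or s3://lake/raw/...
lake_root = Path("lake/raw/clickstream")

def land(event_bytes: bytes, event_day: date, sequence: int) -> Path:
    """Write one batch of raw events into a date-partitioned layout."""
    partition = (
        lake_root
        / f"year={event_day.year}"
        / f"month={event_day.month:02d}"
        / f"day={event_day.day:02d}"
    )
    partition.mkdir(parents=True, exist_ok=True)
    target = partition / f"events-{sequence:05d}.json"
    target.write_bytes(event_bytes)
    return target

# The same single copy can later feed marketing, finance or product analytics,
# instead of being replicated into one purpose-built database per application.
print(land(b'{"customer": "ACME", "page": "/pricing"}', date(2024, 5, 1), 1))
```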

  • Reduced data consolidation costs

Bringing together databases with different structures is complex and requires considerable data modelling effort. Furthermore, to stem the danger of the data model quickly becoming obsolete, it is necessary to foresee the new data sets that will presumably need to be integrated: an almost impossible task when the amount of data to be acquired is constantly growing.

  • Reduced time-to-market

Database expansion and consolidation projects can take a long time, which often prevents a timely response to the business question. By the time the data is ready to be analysed, it may be too late to derive value from it. Furthermore, the volume of unstructured data useful for analysis can far exceed that of structured data, and the ability to access the information contained in unstructured data in real time can be central to the success of a marketing or targeting activity.

  • Information sharing

Analyses performed on data can generate results that help to further qualify the data and increase its value. Suppose, for example, that we can associate a purchase propensity score with each user whose profile we hold. In a Data Warehouse-type structure, the score remains the exclusive preserve of the personnel using the application that generated it, unless the information is also copied into the databases used by other applications, after intervening in the structure of the receiving database and its data model. The Data Lake eliminates this duplication of information and allows the insights gained to be capitalised on, shared and made accessible to anyone with permission to consult them.
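
A minimal sketch of this write-back pattern, again using local JSON files as a stand-in for the lake (the producer, dataset and permission names are illustrative): a score produced by one team's analysis is stored as a curated dataset with its own metadata, so any application with the right permissions can read it from the same place:

```python
import json
from pathlib import Path

curated = Path("lake/curated/propensity_scores")
curated.mkdir(parents=True, exist_ok=True)

# The output of one analysis (here, a purchase propensity score) is written
# back to the lake as a curated dataset, instead of being copied into each
# application's own database.
dataset = {
    "metadata": {
        "producer": "marketing-analytics",
        "version": "2024-05-01",
        "readable_by": ["bi", "crm", "web-personalisation"],
    },
    "rows": [{"customer": "ACME", "purchase_propensity": 0.82}],
}
(curated / "2024-05-01.json").write_text(json.dumps(dataset))

# Any application with permission to consult the dataset reads the same copy.
shared = json.loads((curated / "2024-05-01.json").read_text())
print(shared["rows"][0]["purchase_propensity"])
```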

Is building a Data Lake the ideal solution for all companies?

No. Building a Data Lake is the ideal solution for companies that need to perform cross-functional analysis on Big Data, that have structured internal processes to ensure data governance, that have staff skilled both in the technologies used to build the platform and in data analysis, or that can rely on external consultants specialising in the areas where they are lacking.

While it is true that the great advantage of the Data Lake over a Data Warehouse-type model is that it allows huge amounts of data to be stored without having to structure it at acquisition time and regardless of the use to be made of it, a certain degree of organisation of the data is still necessary to make it accessible and to draw information from it. Precisely because a data lake can contain data almost without limits, access to it must be carefully regulated, both for obvious privacy reasons and because only expert, competent personnel (typically data scientists and data engineers) are able to interrogate it and extract relevant information from it.

Before the data contained in a data lake can be used to produce a BI report, for instance, or a customisation rule for content delivered on a site, complex steps are required that only experienced programmers and data scientists can perform while guaranteeing the quality of the output. In short, precisely because the universe of data at our disposal is immense, it takes experience to navigate it and derive useful information, and that experience cannot be improvised.

In most companies, 80 per cent of users are ‘operational’: they consult reports, check predefined KPIs or use Excel spreadsheets to examine relatively simple data sets. For these users, a Data Warehouse-type system is more than sufficient: it is structured, easy to use and built to answer specific questions.

About 10 to 15 per cent of users perform more in-depth analyses of data. They often access source systems to use data that are not available in the database, or they acquire other data from external sources. It is often these users who generate the reports that are then distributed within the company.

Only a very small percentage of users, on the other hand, perform truly advanced analyses of the data: they integrate new data sources, mix heterogeneous data and know how to read them. In most cases, these users do not even use data warehouses, because they work on the data at a different level, before it has been structured to answer a specific question. They formulate new questions and explore the data in search of possible answers, then select those that are relevant and discard unconfirmed hypotheses. These users know how to perform statistical analysis and to exploit techniques such as predictive modelling.

The Data Lake can be the data source that feeds the reports accessed by the first group or the databases accessed by the second, but it can only be queried and managed by expert users, which not all companies need or can have in-house.

How to build a Data Lake

A data lake is a solution assembled using advanced and complex data storage and data analysis technologies. To simplify, we could group the components of a Data Lake into four categories, representing the four phases of data management:

  • Data Ingestion and Storage, i.e. the ability to acquire data in real time or in batch; and the ability to store and access structured, semi-structured and unstructured data in the original format in which it is produced and through a configurable role system;
  • Data Processing, i.e. the ability to work on raw data so that it is ready to be analysed using standard procedures; and also the ability to engineer solutions for extracting value from data, through automated and periodic processes, which are the result of analysis operations;
  • Data Analysis, i.e. the ability to create models for the systematic extraction of information from data, which may be done in real time or through processes executed periodically;
  • Data Integration, i.e. the ability to connect applications to the platform so that the Data Lake can be queried and data extracted in formats usable for specific purposes (a minimal end-to-end sketch of these four phases follows this list).
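
The sketch below walks a toy data set through the four phases, with local folders standing in for the lake's storage layers (folder names, fields and the export format are all illustrative):

```python
import csv
import json
from collections import defaultdict
from pathlib import Path

raw, processed, exports = Path("lake/raw"), Path("lake/processed"), Path("lake/exports")
for directory in (raw, processed, exports):
    directory.mkdir(parents=True, exist_ok=True)

# 1. Data Ingestion and Storage: raw events are landed as they arrive.
events = [
    {"customer": "ACME", "amount": "199.90"},
    {"customer": "ACME", "amount": "50"},
    {"customer": "Globex", "amount": "12.5"},
]
(raw / "orders.json").write_text(json.dumps(events))

# 2. Data Processing: raw data is standardised into an analysis-ready form.
cleaned = [
    {"customer": e["customer"], "amount": float(e["amount"])}
    for e in json.loads((raw / "orders.json").read_text())
]
(processed / "orders.json").write_text(json.dumps(cleaned))

# 3. Data Analysis: a model or, as here, a simple aggregation extracts information.
revenue = defaultdict(float)
for row in json.loads((processed / "orders.json").read_text()):
    revenue[row["customer"]] += row["amount"]

# 4. Data Integration: results are exported in a format a downstream tool
#    (a BI report, for instance) can consume.
with (exports / "revenue_by_customer.csv").open("w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["customer", "revenue"])
    writer.writerows(sorted(revenue.items()))
```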

To build a Data Lake there is no universally valid recipe; it is necessary to rely on a technology supplier who knows how to design the architecture of the platform on the basis of the requirements shared by the customer, equipping it with the hardware and software components that allow it to be managed with maximum efficiency – i.e. providing the best result, in the best possible time, while saving costs – and with the best possible performance.

Neodata AI Team

As Neodata, we provide data, insight, articles, and news related to AI and Big Data.
