Applying artificial intelligence (AI) to data analytics for deeper, better insights and automation is a growing enterprise IT priority. But the data repository options that have been around for a while tend to fall short in their ability to serve as the foundation for big data analytics powered by AI.
Traditional data warehouses, for example, support datasets from multiple sources but require a consistent data structure. They’re comparatively expensive and can’t handle big data analytics. However, they do provide effective data management, organization, and integrity capabilities. As a result, users can easily find what they need, and organizations avoid the operational and cost burdens of storing unneeded or duplicate data copies.
Newer data lakes are highly scalable and can ingest structured and semi-structured data along with unstructured data like text, images, video, and audio. They conveniently store data in a flat architecture that can be queried in aggregate and offer the speed and lower cost required for big data analytics. On the other hand, they don’t support transactions or enforce data quality. If those in charge of managing the data lake don’t create precise processes and metadata for organizing data, the lake can quickly devolve into what’s come to be known as a “data swamp”—a data lake that makes it hard for users to locate data.
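To make the role of metadata concrete, here is a minimal Python sketch. It assumes a hypothetical in-memory flat store and catalog; the `Catalog` class, the object keys, and the tags are invented for illustration and do not correspond to any product. It shows how a catalog entry lets a user find an object by description, while an uncataloged object can only be found by scanning every key.

```python
from dataclasses import dataclass, field


@dataclass
class Catalog:
    """Hypothetical metadata catalog: maps object keys to descriptive tags."""
    entries: dict[str, set[str]] = field(default_factory=dict)

    def register(self, key: str, *tags: str) -> None:
        self.entries.setdefault(key, set()).update(tags)

    def find(self, *tags: str) -> list[str]:
        # Return the keys whose tags include every requested tag.
        wanted = set(tags)
        return sorted(k for k, t in self.entries.items() if wanted <= t)


# A flat store: every object lives at the same level, keyed by name only.
flat_store = {
    "obj-0001": b"...call transcript audio...",
    "obj-0002": b"order_id,amount\n1,9.99\n",
    "obj-0003": b"...product photo...",
}

catalog = Catalog()
catalog.register("obj-0002", "sales", "csv", "2023")
catalog.register("obj-0003", "product", "image")

sales_keys = catalog.find("sales", "csv")            # found through metadata
uncataloged = sorted(set(flat_store) - set(catalog.entries))  # "swamp" candidates
```

In this toy setup, `obj-0001` was never registered, so nothing short of opening every object would tell a user what it contains.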
If only there were a best-of-both-worlds compromise.
Warehouse, data lake convergence
Meet the data lakehouse. It’s a modern repository that stores all structured, semi-structured, and unstructured data as a data lake does. However, it also supports the quality, performance, security, and governance strengths of a data warehouse. As such, the lakehouse is emerging as the only data architecture that supports business intelligence (BI), SQL analytics, real-time data applications, data science, AI, and machine learning (ML) all in a single converged platform.
The open lakehouse architecture implements data structures and management features similar to those in a warehouse directly on top of low-cost cloud storage in open formats, providing:
Support for diverse data types, ranging from unstructured to structured data, big data workloads, analytics, and AI
Consistency as multiple parties concurrently read or write data
BI support directly on source data, reducing staleness, latency, and the operational cost of having two copies of data in both a data lake and a warehouse
Open storage formats with APIs that let a variety of tools and engines, including ML and Python/R libraries, access data directly
End-to-end streaming to enable real-time reporting and eliminate the need for separate systems dedicated to serving real-time data applications
Schema enforcement and evolution
Robust governance and auditing mechanisms
Decoupled storage and compute resources so that each can scale independently
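As an illustration of the "schema enforcement and evolution" item in the list above, here is a minimal Python sketch. It is a toy in-memory table, not the API of any lakehouse product; the `Table` and `SchemaError` names and the sample columns are assumptions made for the example. Writes that do not match the declared schema are rejected, and a new column can be added without rewriting existing rows by hand.

```python
class SchemaError(ValueError):
    """Raised when a write does not match the table schema."""


class Table:
    """Toy table that enforces a schema on write and allows additive evolution."""

    def __init__(self, schema: dict[str, type]):
        self.schema = dict(schema)
        self.rows: list[dict] = []

    def append(self, row: dict) -> None:
        # Enforcement: reject unknown columns and wrong types.
        unknown = set(row) - set(self.schema)
        if unknown:
            raise SchemaError(f"unknown columns: {sorted(unknown)}")
        for name, value in row.items():
            if not isinstance(value, self.schema[name]):
                raise SchemaError(f"{name!r} expects {self.schema[name].__name__}")
        # Columns missing from the row are stored as None.
        self.rows.append({name: row.get(name) for name in self.schema})

    def add_column(self, name: str, col_type: type) -> None:
        # Evolution: add a column; existing rows read as None for it.
        if name in self.schema:
            raise SchemaError(f"column {name!r} already exists")
        self.schema[name] = col_type
        for row in self.rows:
            row[name] = None


orders = Table({"order_id": int, "amount": float})
orders.append({"order_id": 1, "amount": 9.99})

try:
    orders.append({"order_id": "2", "amount": 5.0})  # wrong type for order_id
    rejected = False
except SchemaError:
    rejected = True

orders.add_column("currency", str)
orders.append({"order_id": 2, "amount": 5.0, "currency": "USD"})
```

The malformed write never reaches the table, which is the property a plain data lake lacks, and the older row simply reports no value for the column added later.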
Challenges of supporting multiple repository types
It’s common to compensate for the respective shortcomings of existing repositories by running multiple systems, for example, a data lake, several data warehouses, and other purpose-built systems. However, this approach frequently creates headaches. Most notably, data stored in one repository type is often excluded from analytics run on another, so the results are based on an incomplete picture of the data.
In addition, having multiple systems requires the creation of expensive and operationally burdensome processes to move data from lake to warehouse when required. To overcome the data lake’s quality issues, for example, many organizations use extract/transform/load (ETL) processes to copy a small subset of data from lake to warehouse for important decision support and BI applications. This dual-system architecture requires continuous engineering to ETL data between the two platforms. Each ETL step risks introducing failures or bugs that reduce data quality.
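The lake-to-warehouse copy described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the `lake_records` list stands in for raw files in a lake, an in-memory SQLite database stands in for the warehouse, and the `orders` table and its columns are hypothetical. Note how the transform step silently decides which rows survive, which is one place where bugs can reduce data quality.

```python
import sqlite3

# Hypothetical raw records as they might sit in a lake: untyped and partly invalid.
lake_records = [
    {"order_id": "1", "amount": "9.99"},
    {"order_id": "2", "amount": "not-a-number"},  # bad value
    {"order_id": "3"},                            # missing field
    {"order_id": "4", "amount": "20"},
]


def extract(records):
    """Extract: read raw records from the source."""
    yield from records


def transform(records):
    """Transform: coerce types and skip rows that fail validation."""
    for rec in records:
        try:
            yield int(rec["order_id"]), float(rec["amount"])
        except (KeyError, ValueError):
            continue  # a real pipeline would log or quarantine these rows


def load(rows, conn):
    """Load: write the cleaned rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders (order_id INTEGER PRIMARY KEY, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
    conn.commit()


warehouse = sqlite3.connect(":memory:")  # stands in for a data warehouse
load(transform(extract(lake_records)), warehouse)

loaded = warehouse.execute(
    "SELECT order_id, amount FROM orders ORDER BY order_id"
).fetchall()
total = warehouse.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
```

Only two of the four source records reach the warehouse here, so any BI query on `orders` sees a subset of what the lake holds, and the pipeline has to be rerun whenever the lake changes.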
Furthermore, leading ML systems, such as TensorFlow, PyTorch, and XGBoost, don’t work well on data warehouses. Data stored in warehouses, then, can’t be part of the multistructured, aggregate dataset that yields the most comprehensive results. Many of the recent advances in AI/ML have been in improving models for processing unstructured data, which warehouses can’t handle. Unlike BI, which extracts a small amount of data and for which warehouses are optimized, ML systems process huge datasets using complex, non-SQL code.
On the data lake side, lack of data consistency makes it almost impossible to mix appends and reads, and batch and streaming jobs. As a result, many of the hoped-for business outcomes of data lakes haven’t materialized.
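One common way to make appends and reads safe to mix is a versioned commit log, in which every write produces a new immutable snapshot. The Python sketch below is a toy in-memory illustration of that idea only; the `VersionedTable` and `CommitConflict` names are hypothetical, and it does not represent the implementation of any specific lakehouse product.

```python
import threading


class CommitConflict(RuntimeError):
    """Raised when a writer's base version is no longer the latest."""


class VersionedTable:
    """Toy append-only commit log: each commit produces a new immutable snapshot."""

    def __init__(self):
        self._lock = threading.Lock()
        self._snapshots = [()]  # version 0 is the empty table

    def latest_version(self) -> int:
        with self._lock:
            return len(self._snapshots) - 1

    def read(self, version: int) -> tuple:
        # Readers pin a version, so later appends never change what they see.
        with self._lock:
            return self._snapshots[version]

    def append(self, base_version: int, rows: list) -> int:
        # Optimistic concurrency: commit only if no one else committed first.
        with self._lock:
            if base_version != len(self._snapshots) - 1:
                raise CommitConflict("table changed since base_version; retry")
            self._snapshots.append(self._snapshots[-1] + tuple(rows))
            return len(self._snapshots) - 1


table = VersionedTable()
v1 = table.append(table.latest_version(), ["a", "b"])

pinned = table.latest_version()       # a reader starts at this version
v2 = table.append(pinned, ["c"])      # a batch or streaming append lands

try:
    table.append(pinned, ["d"])       # a writer with a stale base must retry
    conflict = False
except CommitConflict:
    conflict = True

reader_view = table.read(pinned)
current_view = table.read(table.latest_version())
```

The reader that pinned its version keeps seeing the same rows while new data arrives, and the stale writer is told to retry instead of silently overwriting another commit.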
Pulling it all together
Data lakehouses are enabled by a new, open system design with data structures and data management features of a warehouse but implemented directly on the modern, low-cost storage platforms used for data lakes. Merging the warehouse and the lake into a single system means that data teams can move faster, because they can get to data without accessing multiple systems. Data lakehouses also ensure that teams have the most complete and up-to-date data available for data science, AI/ML, and business analytics projects.