Data Warehouses, Data Lakes and Data 'LakeHouses'

Posted on October 25th, 2016 by Pedro González

The data warehouse (DW) is the cornerstone of the analytical framework in many analytical pipelines in organizations. The design of the data warehouse is a very studied activity during years and it is still very challenging. Although there are different methodologies and strategies to face the whole (or part) design of the DW, the two main inputs of the DW design process are the available data sources and the analytical requirements.

From a classical point of view, the data warehouse data model is based on the relational logical schema, extended in some cases with multidimensional constructors to enrich the multidimensional conceptual schema and ease the use of OLAP tools. In supply-driven strategies, the schema is derived departing from the data sources while in demand-driven approaches it is derived from the analytical requirements. Additionally, hybrid methods combine in different degrees the two strategies. Anyway, all these methods assume that data sources and/or analytical requirements (present and futures) are known at design time. Nevertheless, many organizations must deal with highly dynamic environments characterized by rapid clockspeed markets, shorter product lifecycles, uncertainly conditions and increasing competitive pressure in their markets and industries. In this conditions, the assumption of known and stable input data sources and analytical requirements is not longer valid. Evolving data sources and requirements are currently not the exception, but the norm in many organizations.

Despite the robustness and maturity of the classical DW systems, they do not provide enough flexibility to accomodate the analytical pipeline in such highly dynamic environments. The relational schema is too rigid to deal with continuous changes, while sometimes, the design must accomodate changes in the data sources or/and requirements that involve costly restructure and development processes.

Data Lakes (DL), follow a totally different approach. Sources are incorporated raw into schemaless data stores and are annotated into some metadata repository. The schema is provided on the fly by processing the data contained in the data lake according the current analytical requirements. This is called 'Schema on Read' approaches and contrasts with the 'Schema on Write' followed by classical DW systems when processing the data sources. In words of James Dixon from Pentaho "If you think of a datamart as a store of bottled water - cleansed and packaged and structured for easy consumption - the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples".

An approach in between the DW and DL aproaches are the Data 'LakeHouses' (DLW). A DLW is a "Set of pools of water with different shapes. Some of them are interconnected with a flexible structure of pipes, and some of them are connected to the main output so that the user can easile regulate the composition of the mixture of outgoing water".

The data store in a DLW is based on NOSQL databases (usually document stores, but not restricted to an only database model) and combines the 'Schema on Read' and 'Schema on Write' approaches from DW and DLW. In a DLW the schema is provided with three different levels of data definition: Source Layer, Global Schema Layer and Domain Layer, described by a global Ontology (OWL DL). In a DLW the DL TBOX component defines the properties and classes of each layer and the DL ABOX component represent the schema and the structure. While in the Domain Layer the definition is more descriptive, in the Source and Global layers the definition is more prescriptive (structure and schema).

The engagement point between Schema on Read and Schema on Write is achieved between the mappings between the Source Layer and the Global Schema Layer and between the Global Schema Layer and the Domain Layer. Mappings, have also a TBOX and ABOX component. The classes and properties of the mappings between Source and Global schemas define filters, fold and unfold operations, projections and referential operations, while the artifacts of the mappings between Global schema and Domain define filters, projections and set operations (unions, splits, etc).

Schema on Write processes are ontology-driven wrappers while Schema on Read processes are ontology-driven processes implementing analytical requirements (posed by the user on top the Domain component of the ontology) to automate the target schema (the DDL) and the ETL derivation needed to transport data from the Global Schema to the Target Schema.