Big Data Analytics Lab (BDA)
January, 2015 → December, 2015
Victor Herrero, Alberto Abelló, Besim Bilalli, Oscar Romero
This project is carried out at a company and aims to improve its data processing capabilities so that it can analyze the data its systems constantly produce. Given the market in which the company operates, data analysis is vital for its survival. The challenge is that its data analysis algorithms cannot be run as often as desired, because extraction and transformation from the data sources are too slow. Data are first extracted from the sources; data cleansing and several other transformations are then applied to shape the data according to the input requirements of the algorithms. The technological goal is therefore to reduce this overall time by means of Big Data technologies and, ultimately, to gain agility in data analysis across the whole company. A non-technological goal that follows from this is the transfer of Big Data knowledge from university to industry.
The solution finally deployed relies on technologies that are all part of the Hadoop ecosystem. Physically, this ecosystem runs on several commodity machines dedicated exclusively to the project. We set up a data lake as the main storage system, where data are stored in raw form. One of the main differences from the previous situation is that this data lake centralizes many departmental data sources in a single repository. This, in turn, helps departments share data and therefore enriches their data analysis processes. We also use a NoSQL database to store data in a semi-prepared form, so that they can be consumed by the analysis applications as soon as analysts define the last (personalized) data transformations at query time. To make this possible, an ontology is defined. Its core consists of business concepts, which let users work at the business level instead of with physical attributes from the sources, making the technical design transparent to them.
To achieve this, the ontology must also contain, for every business concept, the mapping back to the source and the mapping to the aforementioned NoSQL database, so that all data are traceable. In addition, predefined transformations are stored in a metadata repository, not only to help analysts query the data lake and the NoSQL database, but also to provide them with a certain level of automation. The main benefit of such an ontological representation is the sharing of domain knowledge among the company's analysis experts.
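As an illustration only, the role of the ontology can be sketched as a lookup structure that ties each business concept to its source attribute and to its location in the NoSQL database, together with an optional predefined transformation. This is a minimal sketch, not the implementation used in the project: the class names, the concept names, the source paths, and the column names below are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class ConceptMapping:
    """Links one business concept to its physical locations (hypothetical names)."""
    concept: str
    source_attribute: str  # where the raw value originates (mapping back to the source)
    nosql_column: str      # where the semi-prepared value is stored


@dataclass
class Ontology:
    """Toy metadata repository: concept mappings plus predefined transformations."""
    mappings: Dict[str, ConceptMapping] = field(default_factory=dict)
    transformations: Dict[str, Callable[[str], str]] = field(default_factory=dict)

    def register(self, mapping: ConceptMapping,
                 transformation: Optional[Callable[[str], str]] = None) -> None:
        # Store the mapping and, if given, the transformation for this concept.
        self.mappings[mapping.concept] = mapping
        if transformation is not None:
            self.transformations[mapping.concept] = transformation

    def trace(self, concept: str) -> ConceptMapping:
        # Traceability: return both physical locations of a business concept.
        return self.mappings[concept]

    def resolve_query(self, concepts: List[str]) -> List[str]:
        # Translate business concepts into the NoSQL columns to read.
        return [self.mappings[c].nosql_column for c in concepts]

    def apply(self, concept: str, raw_value: str) -> str:
        # Apply the predefined transformation, if one exists for this concept.
        transform = self.transformations.get(concept)
        return transform(raw_value) if transform else raw_value


# Hypothetical example data, invented for this sketch.
ontology = Ontology()
ontology.register(
    ConceptMapping("customer_name", "crm.clients.cl_nm", "customer:name"),
    transformation=str.strip,
)
ontology.register(
    ConceptMapping("order_total", "sales.orders.ord_amt", "order:total"),
)

columns = ontology.resolve_query(["customer_name", "order_total"])
origin = ontology.trace("customer_name").source_attribute
cleaned = ontology.apply("customer_name", "  Ada ")
```

In this sketch an analyst names only business concepts; the physical attribute and column names stay behind the ontology, which is the transparency the paragraph above describes.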
inLab FIB provides the team that supports the technical and data analysis processes described above. It consists of experts in Big Data and data mining who train the company's technical staff to ensure the continuity of the service beyond the project's duration.