Database Technologies and Information Management Group

Welcome! The DTIM research group at Universitat Politècnica de Catalunya (UPC) conducts research in many fields of data and knowledge management, with particular emphasis on big data management, NoSQL, data warehousing, ETL, OLAP tools, multidimensional modeling, conceptual modeling, ontologies, and services. DTIM is a subgroup of the Software and Service Engineering Group (GESSI) at UPC, and its members belong to the ESSI and EIO departments of the same university.

Nobody educates anybody, and nobody educates himself; men educate each other under the mediation of the world.
Paulo Freire. Pedagogía del oprimido. Montevideo: Tierra Nueva, 1970.

Latest News (see all)

Latest Blog Posts

In the age of machine learning and AI, companies are racing to deliver better services and innovative solutions for better customer experiences. Businesses realize the need to push their big data insights further than ever before in order to serve, retain, and win customers. 2017 has been a big year for big data analytics, with many companies recognizing the value of storing and analyzing the huge streams of data collected from different sources. Big data is in a constant state of evolution, and the big-data market is expected to be worth $46.34 billion by 2018.

Mining similarity between text documents using Apache Lucene

One of the main challenges in Big Data environments is to find all similar documents that share common information. To handle this challenge for free-text documents, a structured text-mining process is needed that executes two tasks: (1) profile each document to extract its descriptive metadata, and (2) compare the profiles of pairs of documents to detect their overall similarity. Both tasks can be handled with an open-source text-mining project such as Apache Lucene.
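As a minimal sketch of these two tasks (not Lucene itself, which would use an Analyzer and term vectors through its Java API), the following Python stand-in profiles each document as a term-frequency vector and then compares pairs of profiles with cosine similarity; the tokenizer, function names, and sample documents are all illustrative assumptions:

```python
import math
import re
from collections import Counter

def profile(text):
    """Task 1: profile a document as a term-frequency vector.
    (A plain stand-in for the metadata a Lucene Analyzer would extract:
    lowercase alphanumeric tokens with their counts.)"""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(tokens)

def cosine_similarity(p1, p2):
    """Task 2: compare two profiles; returns a score in [0, 1]."""
    common = set(p1) & set(p2)
    dot = sum(p1[t] * p2[t] for t in common)
    norm1 = math.sqrt(sum(v * v for v in p1.values()))
    norm2 = math.sqrt(sum(v * v for v in p2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0
    return dot / (norm1 * norm2)

# Illustrative documents: the first two share information, the third does not.
doc_a = "Big data systems store huge streams of data"
doc_b = "Huge streams of data flow into big data systems"
doc_c = "Paulo Freire wrote about pedagogy"

sim_ab = cosine_similarity(profile(doc_a), profile(doc_b))
sim_ac = cosine_similarity(profile(doc_a), profile(doc_c))
print(sim_ab > sim_ac)  # the two data-related documents score higher
```

In a real Lucene pipeline the profile would come from indexed term vectors (optionally with stop-word removal and stemming), but the overall structure — profile, then pairwise comparison — is the same.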

Use of Vertical Partitioning Algorithms to Create Column Families in HBase

HBase is a NoSQL database. It stores data as key-value pairs, where the key uniquely identifies a row and each value is further divided into multiple column families. To read a subset of columns, HBase loads into memory the entire column families that contain the referenced columns and then discards the unnecessary columns. To reduce the number of column families read by a query, it is therefore very important to define the HBase schema carefully: a schema created without considering the workload may result in an inefficient layout that increases query execution time. Many approaches to HBase schema design already exist, but in this post I will present the use of vertical partitioning algorithms to create HBase column families.

Our favourite tweets