One of the main challenges in Big Data environments is finding all similar documents that share common information. Finding similar free-text documents requires a structured text-mining process that performs two tasks: (1) profiling the documents to extract their descriptive metadata, and (2) comparing the profiles of pairs of documents to detect their overall similarity. Both tasks can be handled by an open-source text-mining project such as Apache Lucene.
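The two tasks above can be sketched in a few lines of plain Python. This is only an illustration and does not use Lucene: the profile here is assumed to be a simple term-frequency bag, and the comparison is assumed to be cosine similarity. The function names `profile` and `cosine_similarity` and the sample documents are hypothetical.

```python
import math
import re
from collections import Counter


def profile(text):
    """Task 1: build a simple profile (term -> frequency) for one document."""
    terms = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(terms)


def cosine_similarity(p, q):
    """Task 2: compare two profiles; values close to 1.0 mean very similar."""
    dot = sum(p[t] * q[t] for t in p.keys() & q.keys())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    if norm_p == 0 or norm_q == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_p * norm_q)


# Hypothetical sample documents, for illustration only.
docs = {
    "a": "HBase stores data in column families",
    "b": "Column families group data in HBase",
    "c": "RDF represents metadata as triples",
}
profiles = {name: profile(text) for name, text in docs.items()}
sim_ab = cosine_similarity(profiles["a"], profiles["b"])
sim_ac = cosine_similarity(profiles["a"], profiles["c"])
```

Documents "a" and "b" share most of their terms and score higher than "a" and "c", which share none. A real pipeline would add steps such as stop-word removal and term weighting, which this sketch leaves out.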
HBase is a NoSQL database. It stores data as key-value pairs, where the key uniquely identifies a row and each value is further divided into multiple column families. To read a subset of columns, HBase loads the entire column families that contain the referenced columns into memory and then discards the unnecessary columns. To reduce the number of column families read by a query, it is therefore important to define the HBase schema carefully. A schema created without considering the workload may produce an inefficient layout that increases query execution time. Many approaches to HBase schema design already exist, but in this post I present the use of vertical partitioning algorithms to create HBase column families.
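To make the effect of the layout concrete, here is a minimal sketch of a cost model in plain Python. It assumes only what is described above: a query loads every column family that holds at least one referenced column. The grouping function is a naive heuristic chosen for illustration, not one of the vertical partitioning algorithms from the post, and all names, columns, and queries are hypothetical.

```python
from collections import defaultdict


def families_read(layout, query):
    """Column families a query must load: every family holding a referenced column."""
    return [family for family in layout if family & query]


def wasted_columns(layout, workload):
    """Total number of columns loaded and then discarded across the workload."""
    wasted = 0
    for query in workload:
        loaded = set().union(*families_read(layout, query))
        wasted += len(loaded - query)
    return wasted


def group_by_access_pattern(columns, workload):
    """Naive heuristic: columns referenced by the same set of queries share a family."""
    groups = defaultdict(set)
    for column in columns:
        signature = frozenset(i for i, q in enumerate(workload) if column in q)
        groups[signature].add(column)
    return list(groups.values())


# Hypothetical table and workload, for illustration only.
columns = ["name", "email", "age", "city", "bio"]
workload = [{"name", "email"}, {"age", "city"}, {"name", "email"}]

single_family = [set(columns)]
workload_aware = group_by_access_pattern(columns, workload)

cost_single = wasted_columns(single_family, workload)
cost_aware = wasted_columns(workload_aware, workload)
```

With a single column family, every query drags in three columns it does not need. Grouping by access pattern yields three families and no wasted reads for this workload. Real workloads rarely partition this cleanly, which is where a proper vertical partitioning algorithm becomes useful.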
Big data introduces many challenges, and one of them is how to physically store data for better access time. For this purpose, researchers have proposed many data formats that store data in different layouts to give optimal performance for different workloads. However, it is challenging to decide which format is best for a particular workload. In this article, I present the latest research work on these formats, covering research papers, benchmarks, and videos about the data formats.
Apache Spark is fast becoming the most popular data analysis tool, so understanding its internals in order to write better-performing code is very important for a data scientist. GraphX is the Spark API for graph processing. It includes many improvements specific to graph processing. Along with some popular basic graph algorithms such as PageRank and Connected Components, it also provides a Pregel API for developing any vertex-centric algorithm. Understanding the internals of the Pregel function and other GraphX APIs is important for writing well-performing algorithms. With that in mind, in this blog post we elaborate on the internals of Pregel as implemented in GraphX.
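For readers new to the vertex-centric model, the following is a minimal single-process sketch in plain Python. It is not GraphX code and says nothing about how GraphX implements Pregel internally; it only imitates the general shape of a Pregel loop built from three user functions, named here `vprog`, `send_msg`, and `merge_msg` after their GraphX counterparts. The function signatures and the sample graph are assumptions made for this illustration.

```python
def pregel(vertices, edges, initial_msg, vprog, send_msg, merge_msg, max_iterations=20):
    """Run a simplified, single-process Pregel loop and return the final vertex state."""
    # Superstep 0: every vertex processes the initial message.
    state = {vid: vprog(vid, attr, initial_msg) for vid, attr in vertices.items()}
    active = set(state)
    for _ in range(max_iterations):
        inbox = {}
        for src, dst in edges:
            # Only edges touching a vertex that changed last round can send messages.
            if src not in active and dst not in active:
                continue
            for target, msg in send_msg(src, state[src], dst, state[dst]):
                inbox[target] = merge_msg(inbox[target], msg) if target in inbox else msg
        if not inbox:
            break  # no messages in flight: the computation has converged
        for vid, msg in inbox.items():
            state[vid] = vprog(vid, state[vid], msg)
        active = set(inbox)
    return state


def cc_send(src, src_attr, dst, dst_attr):
    """Connected components: push the smaller component id across the edge."""
    if src_attr < dst_attr:
        return [(dst, src_attr)]
    if dst_attr < src_attr:
        return [(src, dst_attr)]
    return []


# Hypothetical graph: two components, {1, 2, 3} and {4, 5}.
vertices = {1: 1, 2: 2, 3: 3, 4: 4, 5: 5}
edges = [(1, 2), (2, 3), (4, 5)]

components = pregel(
    vertices,
    edges,
    initial_msg=float("inf"),
    vprog=lambda vid, attr, msg: min(attr, msg),
    send_msg=cc_send,
    merge_msg=min,
)
```

Each vertex ends up labelled with the smallest vertex id in its component. The loop stops as soon as a superstep produces no messages, which is the usual termination condition in the vertex-centric model.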
NUMA (Non-Uniform Memory Access) is a memory access paradigm used to build multiprocessor systems with greater computing power, where each CPU has its own local memory but can also access the memory of other CPUs. The trade-off is that accessing local memory is faster than accessing remote memory. As a consequence, in-memory database performance appears to depend on data placement.
Data warehouses and data lakes sit at opposite ends of the spectrum of data store approaches for combining and integrating heterogeneous data sources in support of an organization's business analytics pipeline. These data stores are the cornerstone of the analytical framework. Both approaches have advantages and drawbacks, so is there room in this spectrum for approaches that combine both strategies?
The semantic web of data and the realm of Linked Open Data (LOD) are growing every day … and at their core is the Resource Description Framework (or RDF for short). RDF is a W3C-recommended standard for representing data and metadata.
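As a rough illustration of how RDF represents data and metadata, the sketch below models a tiny graph in plain Python as a set of (subject, predicate, object) triples and matches simple patterns against it. It does not use an RDF library; the `http://example.org/` namespace, the resources, and the `match` helper are all hypothetical.

```python
EX = "http://example.org/"  # hypothetical namespace, for illustration only

# Each statement is a (subject, predicate, object) triple.
triples = {
    (EX + "post1", EX + "title", "An example post"),
    (EX + "post1", EX + "author", EX + "alice"),
    (EX + "alice", EX + "name", "Alice"),
}


def match(graph, s=None, p=None, o=None):
    """Return all triples matching the pattern; None acts as a wildcard."""
    return sorted(
        t for t in graph
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


about_post1 = match(triples, s=EX + "post1")
authorship = match(triples, p=EX + "author")
```

Because the object of one triple can be the subject of another (here, the author resource), statements link together into a graph, which is the property that Linked Open Data builds on.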
TPCDI-PDI: overcoming the shortcoming of the unformatted TPC-DI ETL descriptions by generating a repository of ETL processes that comply with the TPC-DI specification and are developed using open-source technologies.
Before we talk about research and publish blog posts about the work we do, it might be nice to share a bit of the atmosphere and a few words about us. This post brings some insights from the student perspective.
Welcome to the DTIM research group website, and specifically to the research group blog!