Data Stream Management

Alberto Abelló, Rohit Kumar, Sergi Nadal, Oscar Romero

  • Description

    With the spread of Internet of Things (IoT) devices and other data-producing technology, ever more data are generated every day. This volume of data is becoming very difficult to store and process within the required time. Even with current advancements in big data technology, such as MapReduce and distributed storage and processing, storing and processing these data takes too long. It is therefore increasingly important to process data while they are still arriving, before they reach a storage system. This calls for online data stream mining systems and for new algorithms that work in a single pass over the data and still provide meaningful results. The focus of this research project is to develop such systems and algorithms for data stream mining and processing.
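    The single-pass requirement can be illustrated with reservoir sampling, a classic one-pass algorithm that maintains a uniform random sample of a stream using memory independent of the stream's length (a minimal sketch for illustration, not part of the project itself):

```python
import random

def reservoir_sample(stream, k, rng=random):
    """One-pass uniform sample of k items from a stream of unknown
    length (classic Algorithm R). Memory is O(k) regardless of how
    many items the stream produces."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep item i with probability k / (i + 1), replacing a
            # uniformly chosen current occupant.
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir
```

    Because each item is inspected exactly once and then either kept or discarded, the algorithm never revisits the stream, which is precisely the single-pass property the project targets.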

    Research line: Self-Optimizing Data Stream Processing

    Major Big Data players, such as Google and Amazon, have built large Big Data systems that align their business goals with complex data management and analysis. These companies exemplify an emerging paradigm shift towards data-driven organizations, where data are turned into valuable knowledge and become a key asset for the business. Big Data is a natural evolution of decision support systems (DSS), and inherits from them the ultimate goal of transforming raw data into valuable knowledge. Indeed, the famous characterization of Big Data in terms of the three V's (Volume, Velocity and Variety) highlights the inability of data warehousing (DW) architectures, the most popular DSS architecture, to deal with such new challenges. The Lambda architecture is currently the most widespread framework for scalable and fault-tolerant Big Data processing. Its goal is to enable efficient batch and real-time data management and analysis with a three-tier architecture composed of: a) the batch layer, managing master data and pre-computing batch views; b) the serving layer, indexing batch views for low-latency querying; and c) the speed layer, dealing with data streams and near-real-time processing. However, massive and fast ingestion entails storing raw data as they arrive without further formatting, closely tied to the load-first, model-later principle supported by schema-less databases; the resulting lack of semantics complicates managing Variety.
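    As a rough illustration of how the three layers interact, the following toy sketch (hypothetical names, a word-count style analysis; real systems distribute each layer over a cluster) shows a batch view recomputed from master data while a speed view absorbs new arrivals, with queries served by merging the two:

```python
from collections import Counter

class LambdaSketch:
    """Toy single-process sketch of the three Lambda-architecture
    layers for a count-per-key analysis (illustrative only)."""

    def __init__(self):
        self.master = []             # batch layer: immutable master dataset
        self.batch_view = Counter()  # serving layer: precomputed batch view
        self.speed_view = Counter()  # speed layer: incremental real-time view

    def ingest(self, event):
        # New events go both to the master dataset and to the speed layer.
        self.master.append(event)
        self.speed_view[event] += 1

    def run_batch(self):
        # The batch layer recomputes the view from scratch over all master
        # data; the speed view is reset because its events are now covered.
        self.batch_view = Counter(self.master)
        self.speed_view.clear()

    def query(self, key):
        # The serving layer merges the batch view with the real-time delta.
        return self.batch_view[key] + self.speed_view[key]
```

    The design choice to recompute batch views from immutable master data is what gives the architecture its fault tolerance: a bug in the speed layer can at worst corrupt results until the next batch run.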
    Currently, organizations strive for faster, usually near-real-time, decision-making processes, including complex event processing and data stream analysis techniques. However, the inclusion of continuous data streams in such processes entails new challenges not present in traditional data management systems. First, data stream management relies on a sliding-window buffering model to smooth arrival irregularities. Second, data stream processing relies on linear or sublinear algorithms to provide near-real-time analysis. Recently, much effort has been devoted to adapting Data Stream Management Systems (DSMS) to the multiple, continuous, rapid, time-varying nature of data streams. Nonetheless, these works lack a holistic view of the architecture they are part of, missing potential optimizations. Even the works that consider Linked Data streams, thereby overcoming Variety issues, do not exploit such semantics for optimization. To this end, in this doctoral project we propose to extend the Lambda architecture with semantics-aware, self-optimizing features for optimal real-time processing. Several data stream characteristics must be considered, e.g., arrival rate, data distribution or schema evolution. Combining these with system requirements (e.g., window buffer size) and user requirements (e.g., time constraints) will provide a complete knowledge base to enable run-time optimization.
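    A minimal sketch of the sliding-window buffering model mentioned above, assuming a time-based window and a simple running-average aggregate (class and parameter names are illustrative):

```python
from collections import deque

class SlidingWindow:
    """Time-based sliding window over a stream: keeps only items whose
    timestamp falls within the last `width` time units, maintaining a
    running sum so the aggregate costs O(1) amortized per arrival."""

    def __init__(self, width):
        self.width = width
        self.buffer = deque()  # (timestamp, value) pairs in arrival order
        self.total = 0.0

    def insert(self, timestamp, value):
        self.buffer.append((timestamp, value))
        self.total += value
        # Expire items that have slid out of the window.
        while self.buffer and self.buffer[0][0] <= timestamp - self.width:
            _, old = self.buffer.popleft()
            self.total -= old

    def average(self):
        return self.total / len(self.buffer) if self.buffer else 0.0
```

    Keeping the running sum incrementally, rather than rescanning the buffer per query, is exactly the kind of linear-time processing the stream setting demands.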

    Research line: Graph Stream Mining Using Approximate Sketches and Distributed Parallel Processing

    Graphs are among the most widely used data structures in computer science. Massive graphs are generated by applications that maintain relationships between data entities, such as social networks, publications and authors in DBLP, web pages and hyperlinks, or sensor data. Analyzing such graphs to compute properties like connectivity, shortest paths or node distances is a well-known problem. Traditional methods require multiple passes over these huge graphs, so analyzing them under the streaming model is becoming increasingly popular. Unlike static graph mining on large graphs, graph stream mining poses a more complex challenge: the graph constantly updates and evolves over time through the continuous addition of new edges and the deletion of old ones. Computing properties such as those mentioned above on an evolving graph is a very interesting problem in graph data stream mining. For example, finding the shortest path between all pairs of nodes is interesting for large graphs such as social networks, as it can be used to solve numerous graph analysis problems, including centrality computation, community detection and node separation. In this research we use the data streaming model, where the input (an edge of the graph) arrives as a stream and must be processed on arrival. We also assume that the data sets are so large that they cannot be stored in memory, and that storing them on secondary disk and accessing them again would be very costly. Hence the algorithms processing the stream must create and maintain a summary in memory. These summaries should be logarithmic in the size of the data (the number of edges, in the case of graphs); only then is keeping them in memory feasible. Because of the extra structural complexity of graphs, most recent work in graph stream mining focuses on the semi-streaming model, in which a space budget of O(n polylog n) is permitted in memory, where n is the number of nodes in the input graph.
    We will develop new synopses of large graphs that can evolve efficiently under updates. For example, we will create a synopsis graph holding neighborhood profile information for each node, together with an efficient method to update it, and we will extend this profile to study information flow in dynamic graphs. We will analyze existing graph stream mining algorithms and propose extensions to the sliding-window model where applicable. We will also develop a new framework supporting dynamic distributed graph processing, providing a platform for creating synopses of large dynamic graphs very efficiently. As part of our work we will implement both existing algorithms and our own in this framework, enabling the graph mining community to easily extend their work to a distributed platform for better performance.
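    As one concrete example of a logarithmic-size synopsis, a Flajolet-Martin style estimator can approximate each node's distinct-neighbor count (its 1-hop neighborhood profile) while storing only a few bits per node. The toy single-hash version below is a sketch of the general idea, not the algorithm developed in this project; real estimators average over many hash functions to control variance:

```python
import hashlib

def _rho(x):
    """Position (>= 1) of the least-significant 1-bit of a hash of x."""
    h = int.from_bytes(hashlib.sha256(str(x).encode()).digest()[:8], "big")
    h |= 1 << 63                    # guard bit so rho is always defined
    return (h & -h).bit_length()

class NeighborhoodSynopsis:
    """Flajolet-Martin style synopsis: per node, keep only the maximum
    rho value seen over its neighbors, giving an O(log n)-bit estimate
    of the node's distinct-neighbor count. Duplicate edges leave the
    synopsis unchanged, which is what makes it stream-friendly."""

    def __init__(self):
        self.max_rho = {}

    def add_edge(self, u, v):
        # Update the profile of both endpoints of the arriving edge.
        for a, b in ((u, v), (v, u)):
            self.max_rho[a] = max(self.max_rho.get(a, 0), _rho(b))

    def degree_estimate(self, node):
        # 2^R approximates the distinct-neighbor count (up to the
        # well-known constant bias of the single-hash FM estimator).
        return 2 ** self.max_rho.get(node, 0)
```

    The synopsis costs one small integer per node, far below the O(n polylog n) budget of the semi-streaming model, and each edge arrival is processed in O(1) time.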

    Related publications
    Rohit Kumar, Toon Calders, Aristides Gionis, Nikolaj Tatti: Maintaining Sliding-Window Neighborhood Profiles in Interaction Networks. ECML/PKDD (2) 2015.