Swiss Informatics Society
Special Interest Group on Information Systems


DBTA Workshop on Stream Processing

Registration and Information
10:00-17:00 Wednesday 3 December 2014
Sky Lounge
Wankdorf Stadium, Berne

Organisers: Dr. Martin Wunderli (Trivadis); Prof. Heiko Schuldt (University of Basel)

The proliferation of software and hardware sensors has significantly changed the requirements for data processing and data management. Rather than forming static collections, data are continuously generated. This, in turn, strongly affects the way data are queried. Essentially, queries on data streams invert the basic model of a traditional database system, where data are static and queries are transient: in data stream systems, queries are static (continuous queries) and data are created dynamically. In the context of Big Data, aside from the sheer size of the data, the frequency at which new data arrive (also known as 'Data Velocity') is an important challenge that has to be dealt with, especially when applications have (near) real-time requirements. High-frequency data streams and/or streams of data from multiple sources require novel paradigms for query processing, especially when complex semantic events need to be detected online.
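The inversion can be illustrated with a minimal sketch in plain Python (all names are illustrative, not from any particular stream engine): the query is registered once as a standing predicate and is then evaluated against every tuple that arrives.

```python
# Minimal sketch of a continuous query: the query is static and standing,
# while the data are transient -- the inverse of a traditional database.
def continuous_query(stream, predicate):
    """Yield every tuple of a (conceptually unbounded) stream that matches a standing predicate."""
    for item in stream:
        if predicate(item):
            yield item

# Simulated sensor stream; in a real system this would be unbounded.
readings = [{"sensor": "t1", "temp": 18.5},
            {"sensor": "t2", "temp": 31.2},
            {"sensor": "t1", "temp": 33.0}]

# Standing query: alert whenever a temperature exceeds 30 degrees.
alerts = list(continuous_query(iter(readings), lambda r: r["temp"] > 30))
```

In a database the data would sit still and each query would scan them once; here the predicate sits still and each datum flows past it once.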

This workshop aims at bringing together researchers and practitioners actively working on aspects of Data Streams in the context of Big Data to foster discussions on novel scientific trends and recent developments from leading-edge industry and academic institutions.

After two workshops on Big Data, one entitled Big Data, Cloud Data Management, and NoSQL (with focus on 'Volume of Big Data') which was held in October 2012, and one entitled Semantic Data Processing (with focus on 'Variety of Big Data') which was held in February 2014, this workshop will be dedicated to 'Velocity in Big Data'.

Please note that the workshop is for SI/DBTA members only. There is no fee to attend the workshop but registration is necessary. Further details and registration are available here.

Agenda and Abstracts
  • Big Data and Fast Data combined – is it possible?
    Ulises Fasoli, Consultant Trivadis Lausanne

    Big Data (volume) and real-time information processing (velocity) are two important aspects of Big Data systems. At first sight, these two aspects seem to be incompatible. Are traditional software architectures still the right choice? Do we need new, revolutionary architectures to tackle the requirements of Big Data?
    This presentation discusses the idea of the so-called lambda architecture for Big Data, which is based on splitting data processing into two paths: in a batch phase, a temporally bounded, large dataset is processed either through traditional ETL or MapReduce. In parallel, a real-time, online stream-processing layer constantly computes values for the new data coming in during the batch phase. Combining the two results, batch and online processing, yields a constantly up-to-date view.
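The two-path split described above can be sketched in a few lines of Python (a toy illustration of the lambda architecture, not the speaker's implementation; all names are illustrative):

```python
# Toy sketch of the lambda architecture: a batch layer recomputes a complete
# view over the full (bounded) dataset, while a speed layer incrementally
# folds in events that arrive during the batch run. A serving layer merges both.
from collections import Counter

def batch_layer(all_events):
    """Recompute the complete view from scratch (stands in for ETL or MapReduce)."""
    return Counter(e["key"] for e in all_events)

def speed_layer(view, new_event):
    """Incrementally update the real-time view with one new event."""
    view[new_event["key"]] += 1
    return view

def merged_view(batch_view, realtime_view):
    """Serving layer: combine both views into the constantly up-to-date result."""
    return batch_view + realtime_view

batch = batch_layer([{"key": "a"}, {"key": "b"}, {"key": "a"}])
rt = speed_layer(Counter(), {"key": "a"})   # event that arrived mid-batch
combined = merged_view(batch, rt)
```

The real-time view is discarded and rebuilt each time the batch layer finishes, which keeps the incremental path small and correctable.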

  • Privacy-Preserving Event Stream Processing in the Cloud
    Prof. Dr. Pascal Felber, Université de Neuchâtel

    Stream processing provides an appealing paradigm for building large-scale distributed applications. Such applications are often deployed over multiple administrative domains, some of which may not be trusted. Recent attacks in public clouds indicate that a major concern in untrusted domains is the enforcement of privacy. In this talk we will primarily focus on the problem of content-based routing (CBR), which is at the core of many event stream processing systems. By routing data based on subscriptions evaluated against the content of publications, CBR systems can expose critical information to unauthorized parties. Information leakage can be avoided by means of privacy-preserving filtering, which is supported by several mechanisms for encrypted matching. Unfortunately, existing approaches share a high performance overhead and make it difficult to apply classical optimizations. We will present and discuss mechanisms that greatly reduce the cost of supporting privacy-preserving filtering based on encrypted matching operators.
    Pascal Felber received his M.Sc. and Ph.D. degrees in Computer Science from the Swiss Federal Institute of Technology. Between 1998 and 2004 he worked at Oracle Corporation and Bell Labs (Lucent Technologies) in the USA, and at Institut EURECOM in France. Since 2004 he has been a Professor of Computer Science at the University of Neuchâtel, Switzerland, working in the field of concurrent, dependable, and distributed systems. He also leads the research center for Complex Systems and Big Data. He has published over 100 research papers in various journals and conferences.
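To give a flavour of encrypted matching (a deliberately simplistic illustration, not one of the mechanisms presented in the talk): publishers and subscribers can share a key and hand the broker only keyed hashes, so the broker can test equality between a publication and a subscription without ever seeing plaintext values.

```python
# Toy illustration of privacy-preserving equality matching for content-based
# routing: the untrusted broker sees only keyed hashes (HMACs), never the
# plaintext attribute values. Real encrypted-matching schemes also support
# range predicates and resist offline guessing; this sketch does not.
import hmac
import hashlib

KEY = b"shared-secret"  # distributed out of band to publishers and subscribers

def blind(value: str) -> str:
    """Keyed hash of an attribute value; the broker cannot invert it."""
    return hmac.new(KEY, value.encode(), hashlib.sha256).hexdigest()

# A subscriber registers a blinded subscription: topic == "alerts".
subscription = blind("alerts")

def broker_matches(blinded_publication: str) -> bool:
    """The broker routes by comparing blinded values only."""
    return hmac.compare_digest(blinded_publication, subscription)

matched = broker_matches(blind("alerts"))      # True: same plaintext, same hash
unmatched = broker_matches(blind("news"))      # False: different plaintext
```

The performance overhead the abstract mentions comes precisely from replacing cheap plaintext comparisons and index lookups with cryptographic operations like these.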

  • Imputing Missing Values in Time Series
    Prof. Dr. Michael Böhlen, University of Zürich

    Time series data are frequently incomplete, and before any data analysis can take place, missing values must be imputed. This talk describes different solutions for imputing blocks of missing values in irregular time series. Specifically, we compare the effectiveness, efficiency and applicability of simple and more complex statistical approaches that take advantage of the correlation between time series.
    Michael Böhlen is a professor of computer science at the University of Zürich. His research interests include various aspects of data management, and have focused on time-varying information, data warehousing, data analysis, and similarity search. He received his M.Sc. and Ph.D. degrees from ETH Zürich in 1990 and 1994, respectively. Before joining the University of Zürich he visited the University of Arizona for one year, and was a faculty member at Aalborg University for eight years and the Free University of Bozen-Bolzano for six years.
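The simplest of the approaches the talk compares against is linear interpolation between the nearest observed neighbours (a baseline sketch; it assumes gaps are interior, i.e. the series starts and ends with observed values, and it ignores the cross-series correlation that the more advanced methods exploit):

```python
# Baseline imputation for a block of missing values (None) in a time series:
# linearly interpolate between the last and next observed values.
def interpolate_gap(series):
    """Fill interior runs of None by linear interpolation between known endpoints."""
    filled = list(series)
    i = 0
    while i < len(filled):
        if filled[i] is None:
            start = i - 1                       # index of last observed value
            j = i
            while j < len(filled) and filled[j] is None:
                j += 1                          # j: index of next observed value
            step = (filled[j] - filled[start]) / (j - start)
            for k in range(i, j):
                filled[k] = filled[start] + step * (k - start)
            i = j
        else:
            i += 1
    return filled

result = interpolate_gap([1.0, None, None, 4.0])  # [1.0, 2.0, 3.0, 4.0]
```

Correlation-based methods instead borrow the shape of a correlated series to fill the gap, which matters when the missing block spans a peak or dip that a straight line would flatten out.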

  • Challenges in Stream Processing for the Web of Data
    Dr. Jean-Paul Calbimonte, EPF Lausanne

    Stream processing is ubiquitous nowadays. In different research areas of Computer Science, a lot of work has been dedicated to conceiving ways of representing, transmitting, processing and understanding infinite sequences of data. Nevertheless, in the era of the Web of Data, new challenges need to be addressed. In particular, heterogeneity in stream data management and event processing is both a challenging topic and a key enabler for the rising Web of Things, where smart devices continuously sense properties of the surrounding world. Different proposals on RDF and Linked Data streams have shown promising results for managing this type of data, while keeping explicit semantics on the data streams and linking them to other datasets in a web-friendly way. With time, these efforts led to the emergence of initiatives that aim at specifying a base RDF stream model and query language. Although these works have produced interesting results in defining overarching models, multiple orthogonal challenges still need to be addressed. In this work we identify some of these challenges and link them to the characteristics of what are nowadays called reactive systems. This paradigm includes native support for event-driven asynchronous message passing, non-blocking data communication and processing through all layers, and on-demand flexible scalability. We argue that RDF stream systems, combined with reactive techniques, can lead to powerful, resilient and interoperable systems at Web scale.
    Jean-Paul Calbimonte is a postdoctoral research fellow in the Distributed Information Systems Laboratory at EPFL, Switzerland. His work focuses on Web data integration and streaming data sources, and addresses some of the emerging challenges of streaming data processing for the Internet of Things and the Semantic Sensor Web. He has worked on ontology-based access for streaming data, the SPARQLStream language and the Morph-streams evaluator. He also helps coordinate the W3C Community Group on RDF Stream Processing (RSP). He currently works on the EU OpenIoT project, the Swiss-Experiment/OSPER project, and the OpenSense2 and D1namo projects funded by Nano-Tera.ch. He has previously worked in the EU FP7 projects SemsorGrid4Env and PlanetData. He holds a PhD in Artificial Intelligence from Universidad Politécnica de Madrid and an MSc degree in Computer Science from EPFL, and has been a research visitor at the University of Manchester and EPFL. He has also worked in the software industry, in the areas of data integration and database systems for medical and radiology information systems.
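To give a concrete flavour of an RDF stream (an illustrative Python model, not SPARQLStream or the RSP group's syntax): each stream element is a timestamped subject-predicate-object triple, and a continuous query filters triples inside a time window.

```python
# Illustrative model of an RDF stream: timestamped (subject, predicate, object)
# triples, queried through a time window -- a plain-Python stand-in for what
# RDF stream query languages express declaratively.
stream = [
    (1, ("sensor1", "hasTemp", 21.0)),
    (2, ("sensor2", "hasTemp", 35.5)),
    (5, ("sensor1", "hasTemp", 36.1)),
]

def window_filter(triples, start, end, wanted_predicate):
    """Return (subject, object) pairs of matching triples with timestamp in [start, end)."""
    return [(s, o) for t, (s, p, o) in triples
            if start <= t < end and p == wanted_predicate]

# Continuous query over the window [0, 6): which sensors report temp > 30?
hot = [x for x in window_filter(stream, 0, 6, "hasTemp") if x[1] > 30]
```

Because the triples keep explicit predicates (here `hasTemp`), the same stream elements can be joined against static Linked Data vocabularies, which is the web-friendly linking the abstract refers to.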

  • Streamdrill: Real-Time Data Analysis Patterns
    Dr. Mikio Braun, blog.mikiobraun.de

    Processing high-volume event streams in real time in a robust and efficient fashion poses quite a few challenges. Throwing raw processing power at the problem is one way to solve them, but there are more efficient ways, in particular when the specific analysis task focuses on interesting points or can tolerate approximate results. In this talk we'll cover what we call real-time data analysis patterns, covering all aspects from data acquisition and processing to the storage of historic data, always making sure that the resulting system can provide constant performance. The resulting architecture uses approximate algorithms at its core and a combination of in-memory and disk-based storage. We address questions such as: How do we make sure we can ingest several tens of thousands of events per second? How do we keep track of millions of objects with bounded resources? How do we integrate with existing infrastructure? Finally, we will discuss several use cases, including social media data, real-time user profiling and recommendation, and real-time analytics.
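One classic way to track millions of objects with bounded resources, in the spirit of the approximate algorithms mentioned above (a generic textbook sketch with illustrative parameters, not necessarily what the speaker's system uses), is a count-min sketch: per-key frequencies are estimated in fixed memory, regardless of how many distinct keys flow past.

```python
# Tiny count-min sketch: fixed-size counter table, one hashed column per row.
# Memory is width * depth counters, independent of the number of distinct keys;
# the price is that hash collisions can only inflate (never deflate) estimates.
import hashlib

class CountMinSketch:
    def __init__(self, width=256, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _indices(self, key):
        # One independent-ish hash per row, derived by salting with the row index.
        for row in range(self.depth):
            digest = hashlib.md5(f"{row}:{key}".encode()).hexdigest()
            yield row, int(digest, 16) % self.width

    def add(self, key):
        for row, col in self._indices(key):
            self.table[row][col] += 1

    def estimate(self, key):
        # Taking the minimum across rows limits the damage from collisions.
        return min(self.table[row][col] for row, col in self._indices(key))

cms = CountMinSketch()
for event in ["click", "click", "view", "click"]:
    cms.add(event)
# cms.estimate("click") is at least 3; it never underestimates.
```

The same bounded-memory idea underlies other sketches (e.g. HyperLogLog for distinct counts), which is what makes "constant performance" achievable at high ingest rates.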

  • Streaming across the Web of Things
    Prof. Dr. Cesare Pautasso, Università della Svizzera italiana, Lugano

    The Web offers a uniform technology platform that provides access to more and more Web-enabled smart devices, including sensors, microcontrollers and mobile phones. Emerging Web protocols such as WebSockets and WebRTC are also starting to enable new kinds of real-time Web applications that were very challenging to build over plain request-response HTTP. In this presentation we discuss the design of a new stream processing framework for the Web, whereby operators are written in JavaScript and their execution can be transparently shifted between Web browsers and Web servers, running on all kinds of things.
    Cesare Pautasso is an associate professor at the Faculty of Informatics of the University of Lugano, Switzerland. Previously he was a researcher at the IBM Zurich Research Lab (2007) and a senior researcher at ETH Zurich (2004-2007). He completed his graduate studies with a Ph.D. from ETH Zurich in 2004. His research group focuses on building experimental systems to explore the intersection of model-driven software composition techniques, business process modelling languages, autonomic/Cloud computing, Web 2.0 mashups, and self-organizing, liquid service-oriented architectures. His teaching, training, and consulting activities in both academia and industry cover advanced topics related to emerging Web technologies, Business Process Management and Enterprise Integration Architectures. His book "SOA with REST" was published in 2012. He served as program co-chair of ICSOC 2013, ECOWS 2010 and Software Composition 2008, and started the series of International Workshops on RESTful Design (WS-REST) at the WWW conference. He regularly referees for Swiss, EU and US funding agencies. Since 2010 he has been an advisory board member of EnterpriseWeb.

  • Apache Storm vs. Spark Streaming – two Stream Processing Platforms compared
    Guido Schmutz, Technology Manager & Partner Trivadis, Oracle ACE director

    Both Storm and Spark Streaming are open-source frameworks supporting distributed stream processing. Storm, developed at Twitter, is a free and open-source distributed real-time computation system that can be used with any programming language; it is written primarily in Clojure and supports Java by default. Spark is a fast and general engine for large-scale data processing, designed as a more efficient alternative to Hadoop MapReduce. Spark Streaming brings Spark's language-integrated API to stream processing, letting you write streaming applications the same way you write batch jobs; it supports both Java and Scala. This presentation shows how stream processing solutions can be implemented with the two frameworks, discusses how they compare, and highlights their differences and similarities.

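The core execution-model difference between the two frameworks can be sketched in plain Python (an illustration of the two models, not the frameworks' actual APIs): Storm hands each tuple to an operator as it arrives, while Spark Streaming cuts the stream into micro-batches and runs a small batch job on each.

```python
# Record-at-a-time vs. micro-batch processing, the key contrast between
# Storm-style and Spark-Streaming-style execution (plain-Python illustration).
def record_at_a_time(stream, process):
    """Storm-style: invoke the operator once per individual tuple (low latency)."""
    return [process(record) for record in stream]

def micro_batches(stream, batch_size, process_batch):
    """Spark-Streaming-style: group records into small batches, then run a
    batch computation on each (higher throughput, batch-interval latency)."""
    out, batch = [], []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            out.append(process_batch(batch))
            batch = []
    if batch:                      # flush the final partial batch
        out.append(process_batch(batch))
    return out

events = [1, 2, 3, 4, 5]
per_record = record_at_a_time(events, lambda x: x * 2)   # one result per tuple
per_batch = micro_batches(events, 2, sum)                # one result per batch
```

The micro-batch model is what lets Spark Streaming reuse the same API for streaming and batch jobs, at the cost of latency no lower than the batch interval.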
DBTA Mailing List

If you would like to be notified about news of this and other DBTA events by email, you can join the DBTA mailing list.