Computer Science - Databases Publications (50)


Computer Science - Databases Publications

The growth in variety and volume of OLTP (Online Transaction Processing) applications poses a challenge to OLTP systems to meet performance and cost demands in the existing hardware landscape. These applications are highly interactive (latency sensitive) and require update consistency. They target commodity hardware for deployment and demand scalability in throughput with increasing clients and data. Read More

We propose hMDAP, a hybrid framework for large-scale data analytical processing on Spark, to support multi-paradigm process (incl. OLAP, machine learning, and graph analysis etc.) in distributed environments. Read More

In this paper, we propose a plugin-based framework for RDF stream processing named PRSP. Within this framework, we can employ SPARQL query engines to process C-SPARQL queries with maintaining the high performance of those engines in a simple way. Taking advantage of PRSP, we can process large-scale RDF streams in a distributed context via distributed SPARQL engines. Read More

The ability of the RDF data model to link data from heterogeneous domains has led to an explosive growth of RDF data. So, evaluating SPARQL queries over large RDF data has been crucial for the semantic web community. However, due to the graph nature of RDF data, evaluating SPARQL queries in relational databases and common data-parallel systems needs a lot of joins and is inefficient. Read More

Many practical scenarios make it necessary to evaluate top-k queries over data items with partially unknown values. This paper considers a setting where the values are taken from a numerical domain, and where some partial order constraints are given over known and unknown values: under these constraints, we assume that all possible worlds are equally likely. Our work is the first to propose a principled scheme to derive the value distributions and expected values of unknown items in this setting, with the goal of computing estimated top-k results by interpolating the unknown values from the known ones. Read More

This paper studies dynamic complexity under definable change operations in the DynFO framework by Patnaik and Immerman. It is shown that for changes definable by parameter-free first-order formulas, all (uniform) $AC^1$ queries can be maintained by first-order dynamic programs. Furthermore, many maintenance results for single-tuple changes are extended to more powerful change operations: (1) The reachability query for undirected graphs is first-order maintainable under single tuple changes and first-order defined insertions, likewise the reachability query for directed acyclic graphs under quantifier-free insertions. Read More

Context: Information Technology consumes up to 10\% of the world's electricity generation, contributing to CO2 emissions and high energy costs. Data centers, particularly databases, use up to 23% of this energy. Therefore, building an energy-efficient (green) database engine could reduce energy consumption and CO2 emissions. Read More

Despite the fact that JSON is currently one of the most popular formats for exchanging data on the Web, there are very few studies on this topic and there are no agreement upon theoretical framework for dealing with JSON. There- fore in this paper we propose a formal data model for JSON documents and, based on the common features present in available systems using JSON, we define a lightweight query language allowing us to navigate through JSON documents. We also introduce a logic capturing the schema proposal for JSON and study the complexity of basic computational tasks associated with these two formalisms. Read More

XML data sources are more and more gaining popularity in the context of a wide family of Business Intelligence (BI) and On-Line Analytical Processing (OLAP) applications, due to the amenities of XML in representing and managing semi-structured and complex multidimensional data. As a consequence, many XML data warehouse models have been proposed during past years in order to handle hetero-geneity and complexity of multidimensional data in a way traditional relational data warehouse approaches fail to achieve. However, XML-native database systems currently suffer from limited performance, both in terms of volumes of manageable data and query response time. Read More

Privacy-preserving record linkage (PPRL) aims at integrating sensitive information from multiple disparate databases of different organizations. PPRL approaches are increasingly required in real-world application areas such as healthcare, national security, and business. Previous approaches have mostly focused on linking only two databases as well as the use of a dedicated linkage unit. Read More

Aggregate analysis, such as comparing country-wise sales versus global market share across product categories, is often complicated by the unavailability of common join attributes, e.g., category, across diverse datasets from different geographies or retail chains, even after disparate data is technically ingested into a common data lake. Read More

We study rewritability of monadic disjunctive Datalog programs, (the complements of) MMSNP sentences, and ontology-mediated queries (OMQs) based on expressive description logics of the ALC family and on conjunctive queries. We show that rewritability into FO and into monadic Datalog (MDLog) are decidable, and that rewritability into Datalog is decidable when the original query satisfies a certain condition related to equality. We establish 2NExpTime-completeness for all studied problems except rewritability into MDLog for which there remains a gap between 2NExpTime and 3ExpTime. Read More

We present an extension of the declarative programming language Curry to support the access to data stored in relational databases via SQL. Since Curry is statically typed, our emphasis on this SQL integration is on type safety. Our extension respects the type system of Curry so that run-time errors due to ill-typed data are avoided. Read More

OpenRuleBench is a large benchmark suite for rule engines, which includes deductive databases. We previously proposed a translation of Datalog to C++ based on a method that "pushes" derived tuples immediately to places where they are used. In this paper, we report performance results of various implementation variants of this method compared to XSB, YAP and DLV. Read More

The methods to access large relational databases in a distributed system are well established: the relational query language SQL often serves as a language for data access and manipulation, and in addition public interfaces are exposed using communication protocols like REST. Similarly to REST, GraphQL is the query protocol of an application layer developed by Facebook. It provides a unified interface between the client and the server for data fetching and manipulation. Read More

Bottom-up evaluation of Datalog has been studied for a long time, and is standard material in textbooks. However, if one actually wants to develop a deductive database system, it turns out that there are many implementation options. For instance, the sequence in which rule instances are applied is not given. Read More

Modern knowledge base systems frequently need to combine a collection of databases in different formats: e.g., relational databases, XML databases, rule bases, ontologies, etc. Read More

This paper explores the effect that changing access patterns has on the performance of database management systems. Changes in access patterns play an important role in determining the efficiency of key performance optimization techniques, such as dynamic clustering, prefetching, and buffer replacement. However, all existing benchmarks or evaluation frameworks produce static access patterns in which objects are always accessed in the same order repeatedly. Read More


Data warehouse architectural choices and optimization techniques are critical to decision support query performance. To facilitate these choices, the performance of the designed data warehouse must be assessed, usually with benchmarks. These tools can either help system users comparing the performances of different systems, or help system engineers testing the effect of various design choices. Read More

The data warehousing and OLAP technologies are now moving onto handling complex data that mostly originate from the Web. However, intagrating such data into a decision-support process requires their representation under a form processable by OLAP and/or data mining techniques. We present in this paper a complex data warehousing methodology that exploits XML as a pivot language. Read More

Process-Aware Information System (PAIS) are IT systems that manages, supports business processes and generate large event logs from execution of business processes. An event log is represented as a tuple of the form CaseID, TimeStamp, Activity and Actor. Process Mining is an emerging area of research that deals with the study and analysis of business processes based on event logs. Read More

We propose a new formalism for specifying and reasoning about problems that involve heterogeneous "pieces of information" -- large collections of data, decision procedures of any kind and complexity and connections between them. The essence of our proposal is to lift Codd's relational algebra from operations on relational tables to operations on classes of structures (with recursion), and to add a direction of information propagation. We observe the presence of information propagation in several formalisms for efficient reasoning and use it to express unary negation and operations used in graph databases. Read More

Graph similarity search has received considerable attention in many applications, such as bioinformatics, data mining, pattern recognition, and social networks. Existing methods for this problem have limited scalability because of the huge amount of memory they consume when handling very large graph databases with millions or billions of graphs. In this paper, we study the problem of graph similarity search under the graph edit distance constraint. Read More

Graph data model and graph databases are very popular in various areas such as bioinformatics, semantic web, and social networks. One specific problem in the area is a path querying with constraints formulated in terms of formal grammars. The query in this approach is written as grammar, and paths querying is graph parsing with respect to given grammar. Read More

Privacy-preserving record linkage (PPRL), the problem of identifying records that correspond to the same real-world entity across several data sources held by different parties without revealing any sensitive information about these records, is increasingly being required in many real-world application areas. Examples range from public health surveillance to crime and fraud detection, and national security. Various techniques have been developed to tackle the problem of PPRL, with the majority of them considering linking data from only two sources. Read More

Big data trend has enforced the data-centric systems to have continuous fast data streams. In recent years, real-time analytics on stream data has formed into a new research field, which aims to answer queries about what-is-happening-now with a negligible delay. The real challenge with real-time stream data processing is that it is impossible to store instances of data, and therefore online analytical algorithms are utilized. Read More

A traditional database systems is organized around a single data model that determines how data can be organized, stored and manipulated. But the vision of this paper is to develop new principles and techniques to manage multiple data models against a single, integrated backend. For example, semi-structured, graph and relational models are examples of data models that may be supported by a new system. Read More

This paper provides a technical introduction to the PATSTAT Register database, which contains bibliographical, procedural and legal status data on patent applications handled by the European Patent Office. It presents eight MySQL queries that cover some of the most relevant aspects of the database for research purposes. It targets academic researchers and practitioners who are familiar with the PATSTAT database and the MySQL language. Read More

Providing machine learning (ML) over relational data is a mainstream requirement for data analytics systems. While almost all the ML tools require the input data to be presented as a single table, many datasets are multi-table, which forces data scientists to join those tables first, leading to data redundancy and runtime waste. Recent works on "factorized" ML mitigate this issue for a few specific ML algorithms by pushing ML through joins. Read More

We study distributed graph algorithms that adopt an iterative vertex-centric framework for graph processing, popularized by the Google's Pregel system. Since then, there are several attempts to implement many graph algorithms in a vertex-centric framework, as well as efforts to design optimization techniques for improving the efficiency. However, to the best of our knowledge, there has not been any systematic study to compare these vertex-centric implementations with their sequential counterparts. Read More

Analyzing textual data is a very challenging task because of the huge volume of data generated daily. Fundamental issues in text analysis include the lack of structure in document datasets, the need for various preprocessing steps %(e.g. Read More

Knowledge bases such as Wikidata, DBpedia, or YAGO contain millions of entities and facts. In some knowledge bases, the correctness of these facts has been evaluated. However, much less is known about their completeness, i. Read More

In this paper we introduce an interface for supporting ordered maps that are augmented to support quick "sums" of values over ranges of the keys. We have implemented this interface as part of a C++ library called PAM (Parallel and Persistent Augmented Map library). This library supports a wide variety of functions on maps ranging from basic insertion and deletion to more interesting functions such as union, intersection, difference, filtering, extracting ranges, splitting, and range-sums. Read More

Abstraction without regret refers to the vision of using high-level programming languages for systems development without experiencing a negative impact on performance. A database system designed according to this vision offers both increased productivity and high performance, instead of sacrificing the former for the latter as is the case with existing, monolithic implementations that are hard to maintain and extend. In this article, we realize this vision in the domain of analytical query processing. Read More

Complex Event Processing (CEP) is an emerging field with important applications in many areas. CEP systems collect events arriving from input data streams and use them to infer more complex events according to predefined patterns. The Non-deterministic Finite Automaton (NFA) is one of the most popular mechanisms on which such systems are based. Read More

Record linkage is the process of identifying records that refer to the same entities from several databases. This process is challenging because commonly no unique entity identifiers are available. Linkage therefore has to rely on partially identifying attributes, such as names and addresses of people. Read More

We investigate parameterizations of both database instances and queries that make query evaluation fixed-parameter tractable in combined complexity. We introduce a new Datalog fragment with stratified negation, intensional-clique-guarded Datalog (ICG-Datalog), with linear-time evaluation on structures of bounded treewidth for programs of bounded rule size. Such programs capture in particular conjunctive queries with simplicial decompositions of bounded width, guarded negation fragment queries of bounded CQ-rank, or two-way regular path queries. Read More

We consider the problem of representing multidimensional data where the domain of each dimension is organized hierarchically, and the queries require summary information at a different node in the hierarchy of each dimension. This is the typical case of OLAP databases. A basic approach is to represent each hierarchy as a one-dimensional line and recast the queries as multidimensional range queries. Read More

Most density based stream clustering algorithms separate the clustering process into an online and offline component. Exact summarized statistics are being employed for defining micro-clusters or grid cells during the online stage followed by macro-clustering during the offline stage. This paper proposes a novel alternative to the traditional two phase stream clustering scheme, introducing sketch-based data structures for assessing both stream density and cluster membership with probabilistic accuracy guarantees. Read More

Huge amount of data with both space and text information, e.g., geo-tagged tweets, is flooding on the Internet. Read More

Recent works on bounding the output size of a conjunctive query with functional dependencies and degree bounds have shown a deep connection between fundamental questions in information theory and database theory. We prove analogous output bounds for disjunctive datalog queries, and answer several open questions regarding the tightness and looseness of these bounds along the way. The bounds are intimately related to Shannon-type information inequalities. Read More

Scientific discoveries are increasingly driven by analyzing large volumes of image data. Many new libraries and specialized database management systems (DBMSs) have emerged to support such tasks. It is unclear, however, how well these systems support real-world image analysis use cases, and how performant are the image analytics tasks implemented on top of such systems. Read More

Predictive business process monitoring methods exploit logs of completed cases of a process in order to make predictions about running cases thereof. Existing methods in this space are tailor-made for specific prediction tasks. Moreover, their relative accuracy is highly sensitive to the dataset at hand, thus requiring users to engage in trial-and-error and tuning when applying them in a specific setting. Read More

Privacy-preserving Near-neighbor search (PP-NNS) is a well-studied problem in the literature. The overwhelming growth in the size of current datasets and the lack of any truly secure server in the online world render the existing solutions impractical either due to their high computational requirements or the non-realistic assumptions which potentially compromise privacy. PP-NNS with multiple (semi-honest) data owners having query time sub-linear in the number of users has been proposed as an open research direction. Read More

Affiliations: 1Hasso Plattner Institute at the University of Potsdam, 2Hasso Plattner Institute at the University of Potsdam

Nowadays, graph databases are employed when relationships between entities are in the scope of database queries to avoid performance-critical join operations of relational databases. Graph queries are used to query and modify graphs stored in graph databases. Graph queries employ graph pattern matching that is NP-complete for subgraph isomorphism. Read More

Our main goal is to discover the main factors influencing students' academic trajectory and students' academic evolution within such environment. Our results indicate strong correlation in this virtual learning environment between student average and some factors like: student's English level (despite the fact that Arabic language is the teaching language), student's age, student's gender, student's over-stay and student's place of residence (inside or outside Syria). Our results indicate also a need to modify the academic trajectory of students by changing the prerequisites of few courses delivered as a part of BIT diploma like Advanced DBA II, Data Security. Read More

Recent tools for interactive data exploration significantly increase the chance that users make false discoveries. The crux is that these tools implicitly allow the user to test a large body of different hypotheses with just a few clicks thus incurring in the issue commonly known in statistics as the multiple hypothesis testing error. In this paper, we propose solutions to integrate multiple hypothesis testing control into interactive data exploration tools. Read More

The traditional algorithms do not meet the latest multiple requirements simultaneously for objects. Density-based method is one of the methodologies, which can detect arbitrary shaped clusters where clusters are defined as dense regions separated by low density regions. In this paper, we present a new clustering algorithm to enhance the density-based algorithm DBSCAN. Read More

Incentivized social advertising, an emerging marketing model, provides monetization opportunities not only to the owners of the social networking platforms but also to their influential users by offering a "cut" on the advertising revenue. We consider a social network (the host) that sells ad-engagements to advertisers by inserting their ads, in the form of promoted posts, into the feeds of carefully selected "initial endorsers" or seed users: these users receive monetary incentives in exchange for their endorsements. The endorsements help propagate the ads to the feeds of their followers. Read More