Computer Science - Databases Publications (50)


Computer Science - Databases Publications

Nowadays, a hot challenge for supermarket chains is to offer personalized services for their customers. Next basket prediction, i.e. Read More

In this paper we study the use of coding techniques to accelerate machine learning (ML). Coding techniques, such as prefix codes, have been extensively studied and used to accelerate low-level data processing primitives such as scans in a relational database system. However, there is little work on how to exploit them to accelerate ML algorithms. Read More

Complex Event Recognition applications exhibit various types of uncertainty, ranging from incomplete and erroneous data streams to imperfect complex event patterns. We review Complex Event Recognition techniques that handle, to some extent, uncertainty. We examine techniques based on automata, probabilistic graphical models and first-order logic, which are the most common ones, and approaches based on Petri Nets and Grammars, which are less frequently used. Read More

We consider the task of enumerating and counting answers to $k$-ary conjunctive queries against relational databases that may be updated by inserting or deleting tuples. We exhibit a new notion of q-hierarchical conjunctive queries and show that these can be maintained efficiently in the following sense. During a linear time preprocessing phase, we can build a data structure that enables constant delay enumeration of the query results; and when the database is updated, we can update the data structure and restart the enumeration phase within constant time. Read More

Understanding the influence of a product is crucially important for making informed business decisions. This paper introduces a new type of skyline queries, called uncertain reverse skyline, for measuring the influence of a probabilistic product in uncertain data settings. More specifically, given a dataset of probabilistic products P and a set of customers C, an uncertain reverse skyline of a probabilistic product q retrieves all customers c in C which include q as one of their preferred products. Read More

The Apriori algorithm that mines frequent itemsets is one of the most popular and widely used data mining algorithms. Now days many algorithms have been proposed on parallel and distributed platforms to enhance the performance of Apriori algorithm. They differ from each other on the basis of load balancing technique, memory system, data decomposition technique and data layout used to implement them. Read More

Differentially-private histograms have emerged as a key tool for location privacy. While past mechanisms have included theoretical & experimental analysis, it has recently been observed that much of the existing literature does not fully provide differential privacy. The missing component, private parameter tuning, is necessary for rigorous evaluation of these mechanisms. Read More

Nowadays, various sensors are collecting, storing and transmitting tremendous trajectory data, and it is known that raw trajectory data seriously wastes the storage, network band and computing resource. Line simplification (LS) algorithms are an effective approach to attacking this issue by compressing data points in a trajectory to a set of continuous line segments, and are commonly used in practice. However, existing LS algorithms are not sufficient for the needs of sensors in mobile devices. Read More

The time complexity of data clustering has been viewed as fundamentally quadratic, slowing with the number of data items, as each item is compared for similarity to preceding items. Clustering of large data sets has been infeasible without resorting to probabilistic methods or to capping the number of clusters. Here we introduce MIMOSA, a novel class of algorithms which achieve linear time computational complexity on clustering tasks. Read More

Due to the growth of geo-tagged images, recent web and mobile applications provide search capabilities for images that are similar to a given query image and simultaneously within a given geographical area. In this paper, we focus on designing index structures to expedite these spatial-visual searches. We start by baseline indexes that are straightforward extensions of the current popular spatial (R*-tree) and visual (LSH) index structures. Read More

The sparsest cut problem consists of identifying a small set of edges that breaks the graph into balanced sets of vertices. The normalized cut problem balances the total degree, instead of the size, of the resulting sets. Applications of graph cuts include community detection and computer vision. Read More

Data analytics (such as association rule mining and decision tree mining) can discover useful statistical knowledge from a big data set. But protecting the privacy of the data provider and the data user in the process of analytics is a serious issue. Usually, the privacy of both parties cannot be fully protected simultaneously by a classical algorithm. Read More

Bizur is a consensus algorithm exposing a key-value interface. It is used by a distributed file-system that scales to 100s of servers, delivering millions of IOPS, both data and metadata, with consistent low-latency. Bizur is aimed for services that require strongly consistent state, but do not require a distributed log; for example, a distributed lock manager or a distributed service locator. Read More

Entity extraction is fundamental to many text mining tasks such as organisation name recognition. A popular approach to entity extraction is based on matching sub-string candidates in a document against a dictionary of entities. To handle spelling errors and name variations of entities, usually the matching is approximate and edit or Jaccard distance is used to measure dissimilarity between sub-string candidates and the entities. Read More

In this paper, we present a MapReduce-based framework for evaluating SPARQL queries on GPU (named MapSQ) to large-scale RDF datesets efficiently by applying both high performance. Firstly, we develop a MapReduce-based Join algorithm to handle SPARQL queries in a parallel way. Secondly, we present a coprocessing strategy to manage the process of evaluating queries where CPU is used to assigns subqueries and GPU is used to compute the join of subqueries. Read More

In this appendix we provide additional supplementary material to "A Collective, Probabilistic Approach to Schema Mapping." We include an additional extended example, supplementary experiment details, and proof for the complexity result stated in the main paper. Read More

Skyline queries enable multi-criteria optimization by filtering objects that are worse in all the attributes of interest than another object. To handle the large answer set of skyline queries in high-dimensional datasets, the concept of k-dominance was proposed where an object is said to dominate another object if it is better (or equal) in at least k attributes. This relaxes the full domination criterion of normal skyline queries and, therefore, produces lesser number of skyline objects. Read More

Our concern is the overhead of answering OWL 2 QL ontology-mediated queries (OMQs) in ontology-based data access compared to evaluating their underlying tree-shaped and bounded treewidth conjunctive queries (CQs). We show that OMQs with bounded-depth ontologies have nonrecursive datalog (NDL) rewritings that can be constructed and evaluated in LOGCFL for combined complexity, even in NL if their CQs are tree-shaped with a bounded number of leaves, and so incur no overhead in complexity-theoretic terms. For OMQs with arbitrary ontologies and bounded-leaf CQs, NDL-rewritings are constructed and evaluated in LOGCFL. Read More

The value proposition of a dataset often resides in the implicit interconnections or explicit relationships (patterns) among individual entities, and is often modeled as a graph. Effective visualization of such graphs can lead to key insights uncovering such value. In this article we propose a visualization method to explore graphs with numerical attributes associated with nodes (or edges) -- referred to as scalar graphs. Read More

Optimal location queries identify the best locations to set up new facilities for providing service to the users. For several businesses such as gas stations, cellphone base-stations, etc., placement queries require taking into account the mobility patterns (or trajectories) of the users. Read More

Today's storage systems expose abstractions which are either too low-level (e.g., key-value store, raw-block store) that they require developers to re-invent the wheels, or too high-level (e. Read More

In the field of exploratory data mining, local structure in data can be described by patterns and discovered by mining algorithms. Although many solutions have been proposed to address the redundancy problems in pattern mining, most of them either provide succinct pattern sets or take the interests of the user into account-but not both. Consequently, the analyst has to invest substantial effort in identifying those patterns that are relevant to her specific interests and goals. Read More

In this paper we propose a novel approach to manage the throughput vs latency tradeoff that emerges when managing updates in geo-replicated systems. Our approach consists in allowing full concurrency when processing local updates and using a deferred local serialisation procedure before shipping updates to remote datacenters. This strategy allows to implement inexpensive mechanisms to ensure system consistency requirements while avoiding intrusive effects on update operations, a major performance limitation of previous systems. Read More

The management and usage of state are issues of paramount importance in big data processing systems (BDPS) today. They are increasingly gaining attention due to their utility in supporting complex operations in various applications. In this paper, we survey state management in BDPS and introduce a taxonomy to classify current research in this field by key characteristics. Read More

A regret minimizing set Q is a small size representation of a much larger database P so that user queries executed on Q return answers whose scores are not much worse than those on the full dataset. In particular, a k-regret minimizing set has the property that the regret ratio between the score of the top-1 item in Q and the score of the top-k item in P is minimized, where the score of an item is the inner product of the item's attributes with a user's weight (preference) vector. The problem is challenging because we want to find a single representative set Q whose regret ratio is small with respect to all possible user weight vectors. Read More

The purpose of this document is to create a data model and its serialization for expressing generic time series data. Already existing IVOA data models are reused as much as possible. The model is also made as generic as possible to be open to new extensions but at the same time closed for modifications. Read More

Entity resolution (ER) is the task of identifying all records in a database that refer to the same underlying entity, and are therefore duplicates of each other. Due to inherent ambiguity of data representation and poor data quality, ER is a challenging task for any automated process. As a remedy, human-powered ER via crowdsourcing has become popular in recent years. Read More

This paper presents a new technique for automatically synthesizing SQL queries from natural language. Our technique is fully automated, works for any database without requiring additional customization, and does not require users to know the underlying database schema. Our method achieves these goals by combining natural language processing, program synthesis, and automated program repair. Read More

Web archives are a valuable resource for researchers of various disciplines. However, to use them as a scholarly source, researchers require a tool that provides efficient access to Web archive data for extraction and derivation of smaller datasets. Besides efficient access we identify five other objectives based on practical researcher needs such as ease of use, extensibility and reusability. Read More

The digital revolution has brought most of the world on the world wide web. The data available on WWW has increased many folds in the past decade. Social networks, online clubs and organisations have come into existence. Read More

We introduce HoloClean, a framework for holistic data repairing driven by probabilistic inference. HoloClean unifies existing qualitative data repairing approaches, which rely on integrity constraints or external data sources, with quantitative data repairing methods, which leverage statistical properties of the input data. Given an inconsistent dataset as input, HoloClean automatically generates a probabilistic program that performs data repairing. Read More

Data fusion has played an important role in data mining because high-quality data is required in a lot of applications. As on-line data may be out-of-date and errors in the data may propagate with copying and referring between sources, it is hard to achieve satisfying results with merely applying existing data fusion methods to fuse Web data. In this paper, we make use of the crowd to achieve high-quality data fusion result. Read More

Many scenarios require computing the join of databases held by two or more parties that do not trust one another. Private record linkage is a cryptographic tool that allows such a join to be computed without leaking any information about records that do not participate in the join output. However, such strong security comes with a cost: except for exact equi-joins, these techniques have a high computational cost. Read More

In-situ processing has been proposed as a novel data exploration solution in many domains generating massive amounts of raw data, e.g., astronomy, since it provides immediate SQL querying over raw files. Read More

We study the problem of edit similarity joins, where given a set of strings and a threshold value $K$, we need to output all the pairs of strings whose edit distances are at most $K$. Edit similarity join is a fundamental operation in numerous applications such as data cleaning/integration, bioinformatics, natural language processing, and has been identified as a primitive operator for database systems. This problem has been studied extensively in the literature. Read More

Incremental data mining algorithms process frequent updates to dynamic datasets efficiently by avoiding redundant computation. Existing incremental extension to shared nearest neighbor density based clustering (SNND) algorithm cannot handle deletions to dataset and handles insertions only one point at a time. We present an incremental algorithm to overcome both these bottlenecks by efficiently identifying affected parts of clusters while processing updates to dataset in batch mode. Read More

In April 2016, a community of researchers working in the area of Principles of Data Management (PDM) joined in a workshop at the Dagstuhl Castle in Germany. The workshop was organized jointly by the Executive Committee of the ACM Symposium on Principles of Database Systems (PODS) and the Council of the International Conference on Database Theory (ICDT). The mission of this workshop was to identify and explore some of the most important research directions that have high relevance to society and to Computer Science today, and where the PDM community has the potential to make significant contributions. Read More

Linked data portals need to be able to advertise and describe the structure of their content. A sufficiently expressive and intuitive schema language will allow portals to communicate these structures. Validation tools will aid in the publication and maintenance of linked data and increase their quality. Read More

Research in data warehousing and OLAP has produced important technologies for the design, management and use of information systems for decision support. With the development of Internet, the availability of various types of data has increased. Thus, users require applications to help them obtaining knowledge from the Web. Read More


The aim of this article is to present an overview of the major families of state-of-the-art data processing benchmarks, namely transaction processing benchmarks and decision support benchmarks. We also address the newer trends in cloud benchmarking. Finally, we discuss the issues, tradeoffs and future trends for data processing benchmarks. Read More


The aim of this article is to present an overview of the major XML warehousing approaches from the literature, as well as the existing approaches for performing OLAP analyses over XML data (which is termed XML-OLAP or XOLAP; Wang et al., 2005). We also discuss the issues and future trends in this area and illustrate this topic by presenting the design of a unified, XML data warehouse architecture and a set of XOLAP operators expressed in an XML algebra. Read More

Frequent itemset mining is a popular data mining technique. Apriori, Eclat, and FP-Growth are among the most common algorithms for frequent itemset mining. Considerable research has been performed to compare the relative performance between these three algorithms, by evaluating the scalability of each algorithm as the dataset size increases. Read More

Nearest neighbor search is known as a challenging issue that has been studied for several decades. Recently, this issue becomes more and more imminent in viewing that the big data problem arises from various fields. In this paper, a scalable solution based on hill-climbing strategy with the support of k-nearest neighbor graph (kNN) is presented. Read More

During genomics life science research, the data volume of whole genomics and life science algorithm is going bigger and bigger, which is calculated as TB, PB or EB etc. The key problem will be how to store and analyze the data with optimized way. This paper demonstrates how Intel Big Data Technology and Architecture help to facilitate and accelerate the genomics life science research in data store and utilization. Read More

Crowdsourcing is becoming increasingly important in entity resolution tasks due to their inherent complexity such as clustering of images and natural language processing. Humans can provide more insightful information for these difficult problems compared to machine-based automatic techniques. Nevertheless, human workers can make mistakes due to lack of domain expertise or seriousness, ambiguity, or even due to malicious intents. Read More

Maintenance of association rules is an interesting problem. Several incremental maintenance algorithms were proposed since the work of (Cheung et al, 1996). The majority of these algorithms maintain rule bases assuming that support threshold doesn't change. Read More

Since formulation of Inductive Database (IDB) problem, several Data Mining (DM) languages have been proposed, confirming that KDD process could be supported via inductive queries (IQ) answering. This paper reviews the existing DM languages. We are presenting important primitives of the DM language and classifying our languages according to primitives' satisfaction. Read More

Discovering the key structure of a database is one of the main goals of data mining. In pattern set mining we do so by discovering a small set of patterns that together describe the data well. The richer the class of patterns we consider, and the more powerful our description language, the better we will be able to summarise the data. Read More

XML data warehouses form an interesting basis for decision-support applications that exploit complex data. However, native-XML database management systems (DBMSs) currently bear limited performances and it is necessary to research for ways to optimize them. In this chapter, we present two such techniques. Read More

Erasure coding has been widely adopted in distributed storage systems for fault-tolerant storage with low redundancy. We present MemEC, an erasure-coding-based in-memory key-value (KV) store. MemEC is specifically designed for workloads dominated by small objects. Read More